
Avoid Network Outages with Automation and AI Tools to Make Core Networks More Stable

Networks are becoming more complicated and need better tools to keep them operating correctly. As 5G evolves, operators must manage more complicated service guarantees through network slices, a proliferation of frequency bands with different characteristics, new requirements arising from the deployment of network APIs, and rising complexity from the shift to software-based virtualized infrastructure, with cloud-based cores becoming the norm. AI and automation tools offer new ways to avoid network outages.

Additionally, commercial customers are increasingly important to operators to grow revenues. These customers have much higher expectations of network reliability than do smartphone users. With so many new service requirements, automation is now essential. 5G-Advanced’s arrival in 2024 further widens the potential uses to which a 5G network can be put, which should improve operator revenues, but only if operators can deliver the stable experience that these new customers expect.

Numerous mobile operators have suffered massive network outages, and this history now threatens to undermine their ability to add these new commercial revenue streams. Core network outages are extremely high-impact, unlike the low-level brushfires of individual cell sites dropping out. If one cell goes down, it affects only a few people, who can usually find a nearby cell that does work. But a network-wide outage affects everyone, everywhere, all at once, and the resulting flood of customer contacts hits every operational team at the operator involved.

Over the past few years, operators in every region have suffered such catastrophic network-wide incidents (summarized in Table 1), which have led to regulatory interest, direct compensation payments for users and public apologies from CEOs. For operators seeking to enter new markets, for example with hybrid private networks where the macro network extends dedicated infrastructure, these outages make selling these new uses much harder.

Table 1. Major network outages between 2018 and 2024

Year	Country	Operator	Customers affected
2018	Japan	SoftBank	40 million
2018	Vietnam	MobiFone	N/A
2018	UK	O2	32 million
2021	Japan	NTT Docomo	13 million
2021	Switzerland	Swisscom	N/A
2022	Japan	KDDI	40 million
2022	Canada	Rogers	12 million
2022	Switzerland	Swisscom	N/A
2023	Australia	Optus	10 million consumers and 400,000 businesses
2023	UK	O2’s MVNOs: giffgaff, Sky Mobile, Tesco Mobile	N/A
2023	UK	Three	N/A
2024	US	AT&T	N/A
2024	Vietnam	All operators	N/A
2024	UK	Three	N/A

Rogers, Canada

Perhaps the biggest and highest-impact outage of recent years was Rogers’ in July 2022, which affected its entire mobile and fixed networks at the same time. Canada isn’t a country most would associate with network instability, and indeed networks there had been stable previously. Yet the outage still occurred, showing that such a high-impact event can happen anywhere. Rogers customers faced enormous disruption because they couldn’t use Wi-Fi as a backup on the operator’s fixed services. More than 12 million subscriptions were affected across both fixed and mobile. The only solution was to use Rogers’ direct competitor networks — including for Rogers’ own operations staff!

The problem was caused by errors made in multiple network configuration changes on the fateful day. With better tools, such as the use of a digital twin to simulate the impact of the new configurations with a clear visualization of the network traffic topology, this kind of outage could have been avoided. CRTC, the Canadian regulator, took an interest too. Its detailed report on the outage recommended multiple changes, including greater network redundancy and the use of different core networks for fixed and mobile connectivity. This highlights the importance of core network stability to avoid complete network outages.
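The idea of testing a configuration change against a copy of the network before touching the live one can be illustrated with a toy sketch. This is a minimal illustration under stated assumptions, not how Rogers or any vendor's digital twin works: the topology, the node names and the helper functions (`reachable`, `apply_change`, `stranded_nodes`) are all hypothetical, and a real twin would model routing policy and traffic load, not just link connectivity.

```python
from collections import deque
from copy import deepcopy

# Hypothetical topology: each node maps to the set of nodes it links to.
TOPOLOGY = {
    "core": {"agg-east", "agg-west"},
    "agg-east": {"core", "agg-west", "site-1"},
    "agg-west": {"core", "agg-east", "site-2"},
    "site-1": {"agg-east"},
    "site-2": {"agg-west"},
}


def reachable(topology, start):
    """Return the set of nodes reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, set()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def apply_change(topology, removed_links):
    """Apply a proposed change to a copy (the 'twin'), never to the original."""
    twin = deepcopy(topology)
    for a, b in removed_links:
        twin.get(a, set()).discard(b)
        twin.get(b, set()).discard(a)
    return twin


def stranded_nodes(topology, removed_links, core="core"):
    """List nodes that reach the core today but would not after the change."""
    before = reachable(topology, core)
    after = reachable(apply_change(topology, removed_links), core)
    return sorted(before - after)
```

In this sketch, removing only the link between `core` and `agg-east` strands nothing, because traffic can still flow through `agg-west`. Removing that link together with the `agg-east` to `agg-west` link strands `agg-east` and `site-1`. That is the kind of combined-change effect that is easy to miss when each change is reviewed in isolation.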

For operators seeking to merge — which is now a common approach as they look to improve their economics and network scale, especially in markets with only four players — outages like Rogers’ highlight to regulators a redundancy risk in having a small number of network operators. Regulators may choose to block mergers to promote greater network diversity in response to outages.

KDDI, Japan

Canada is far from the only market where network-wide outages have led to regulatory and government involvement in the aftermath. In Japan, which like Canada is a market associated with high-quality networks, the deputy chief cabinet secretary told reporters he expected KDDI to compensate users for a network-wide outage that also occurred in 2022. About 40 million people were affected by the problem, which resulted from core maintenance work on voice calling.

NTT Docomo, Japan

Outages can also be caused by network signalling storms. NTT Docomo, also in Japan, suffered a network-wide outage in October 2021 when a configuration change went wrong in the phone location database. In a 5G service-based architecture this database would be the Unified Data Management (UDM) function; in 4G the equivalent is the Home Subscriber Server. The failed change resulted in a huge volume of traffic flooding the network and disrupting phone calls and Internet data services. Around 4.6 million users had problems with voice communications, and 8.3 million suffered a data outage.

Signalling storms in the core network often arise as operations teams seek to restore service levels after an outage triggered by some other cause. In a 5G network, alongside the UDM, signalling storms can cause problems with the Access and Mobility Management Function (AMF), where authentication can become overwhelmed, and with the Session Management Function (SMF). Network-based prediction tools can help to avoid such incidents by enabling teams to practise their response to possible outages through simulations.
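A toy simulation can show why the recovery phase is so risky. The sketch below is a deliberately simplified model, not a representation of any 3GPP procedure or vendor tool: the function name `simulate_reattach`, the one-tick retry rule and all the numbers are assumptions chosen for illustration. Every device tries to re-register once service returns; a core function can serve only a fixed number of requests per tick, and rejected devices retry on the next tick.

```python
import random


def simulate_reattach(devices, capacity_per_tick, jitter_ticks, seed=0):
    """Toy model of devices re-registering after an outage.

    Each device picks a first-attempt tick at random within jitter_ticks
    (or tick 0 if jitter_ticks is 0 or 1). Requests above capacity are
    rejected and retried on the next tick.
    Returns (peak offered load, ticks until all served, rejected attempts).
    """
    rng = random.Random(seed)
    pending = {}  # tick -> number of requests arriving in that tick
    for _ in range(devices):
        first = rng.randrange(jitter_ticks) if jitter_ticks > 1 else 0
        pending[first] = pending.get(first, 0) + 1

    tick = 0
    peak = 0
    rejected_total = 0
    remaining = devices
    while remaining > 0:
        offered = pending.pop(tick, 0)
        peak = max(peak, offered)
        served = min(offered, capacity_per_tick)
        remaining -= served
        rejected = offered - served
        if rejected:
            # Assumed retry rule: rejected devices try again next tick.
            rejected_total += rejected
            pending[tick + 1] = pending.get(tick + 1, 0) + rejected
        tick += 1
    return peak, tick, rejected_total
```

With 1,000 devices, a capacity of 100 requests per tick and no jitter, `simulate_reattach(1000, 100, 0)` returns `(1000, 10, 4500)`: the peak load is ten times capacity and 4,500 attempts are wasted on rejections. Spreading first attempts over 20 ticks, as in `simulate_reattach(1000, 100, 20)`, sharply lowers the peak in this model. This is the kind of what-if comparison that operations teams could rehearse in simulation before a real recovery.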

Optus, Australia

Australia’s Optus also suffered a 12-hour outage in November 2023 because of a fault in its network safety systems. Again, this affected both fixed and mobile services, including those of its mobile virtual network operator (MVNO) customers. More than 10 million people and over 400,000 businesses were affected. Like other network-wide outages, this needed C-level involvement to placate customers, the press and the regulator: Optus’s then-CEO Kelly Bayer Rosmarin took the lead in communicating externally during the outage, with government ministers asking questions. The impact of the incident was felt by emergency responders and transportation networks.

AT&T, US

In the US, AT&T suffered a 12-hour outage across its cellular network in February 2024, which caused New York’s attorney general to open an investigation. AT&T’s initial review found the cause to be the application and execution of an incorrect process while working to expand the network. Just four months later, the carrier faced another major outage, this time affecting calls between carriers. Such incidents should be avoidable using modern core network operations and management tools and practices, such as AI and digital twins, to plan network updates and better simulate changes.

Europe: Three, giffgaff and Tesco Mobile

Operators in all regions have had similar system failures. Based on the evidence of the past few years, major outages can happen to any operator — even the most apparently reliable — in any market at any time. There are many examples in Europe: despite its positioning in Switzerland as the largest and most dependable operator, Swisscom suffered outages that triggered national press interest in August 2021 and May 2022. In the UK, O2-owned MVNO giffgaff, Tesco Mobile and Sky Mobile all suffered an outage affecting calls and texts. Three UK experienced an outage in December 2023, and again in February 2024 with three outages in just four days.

Vietnam

Not all outages are caused by problems within an operator’s control. But the operations and management systems must be prepared to cope regardless. In Vietnam there have been several problems with international undersea cables in recent years. For example, in June 2024 three of the country’s five major submarine cables had problems requiring management steps by all of Vietnam’s operators to maintain service and prevent adverse knock-on impacts on their networks.

In other markets, such critical submarine cables are used to connect different parts of a country, for example in the Philippines or between Indonesia’s islands. A modern core makes these kinds of problem easier to manage and can help avoid collateral damage from signalling storms. Network operations teams can use digital twins to practise how to respond to failures in mission-critical systems. AI-based tools help teams quickly combine multiple configuration changes to respond to a potential problem and then simulate how well different AI-suggested responses might perform.

Eleven Countries: O2 UK, MobiFone in Vietnam, SoftBank in Japan

Another massive impact outside the control of a single mobile network operator was Ericsson’s global software problem in December 2018. This caused outages in networks in 11 countries including O2 UK, MobiFone in Vietnam and SoftBank in Japan, which lasted for most of a day. In the UK alone, over 32 million people were affected. As with many of the other outages above, this triggered a C-level response from the operators affected. Rather than wait for CEOs and chief financial officers to sign off on the needed tools after a damaging public outage, network teams should proactively make the case to prevent such problems from ever happening by investing in modernization.

The Role of a Modern Core Network in Avoiding Problems and Speeding Recovery

The core network must be even more stable now that we have a wider range of network applications. More and more devices now connect to mobile networks, going beyond smartphones to include a vast range of Internet of Things devices: modules embedded into cars, sensors for smart agriculture or cities, connected security cameras, and consumer products such as smartwatches. If the core network is unstable, it’s no longer just smartphone users that will suffer, but potentially every sector of the economy that now uses 5G, especially with 5G-Advanced’s new capabilities.

Even where the core network has not been the cause of an outage, a modern AI-capable core should offer better ways to avoid problems, mitigate outage impacts, and more safely return a network to normal operation. Visualization of network topology and traffic flows helps teams predict issues and resolve problems if they occur. When network outages are being resolved, there’s an especially high risk of a sudden surge in traffic (a signalling storm) that can cause the network to fail again. Reliable network tools that take advantage of AI are becoming essential in an age of more types of customer, greater use of the cloud and the arrival of 5G-Advanced.

Otherwise, operators risk seeing more network-wide outages. The full cost of these incidents is hard to calculate, as it includes direct compensation to customers, lost working time for operator teams, overtime payments and brand damage that will take years to repair. Ongoing regulatory actions are also likely and may include licensing changes or even the blocking of operator mergers that would have improved scale economies. The full impact of just one such outage will take years to see and resolve. Such outages are best prevented through the deployment of modern autonomous network tools.

Written by: Ian Fogg

Posted on 6 August 2024