Editorial

AI-Powered Anomaly Detection

Next-Level Response and Customer Satisfaction: AI-powered anomaly detection reduces response times and increases quality

Mikko Puuskari - Head of Automation, Elisa Polystar

November 28, 2022

In almost all mobile network markets, device penetration has already reached saturation point – pretty much everybody already has a cellphone. In many countries, the same is also true for smartphones. This means that retaining customers is vital to maintain or increase profits, and the best way to retain customers is to provide them with the best possible experience.

Therefore, it is essential that mobile operators can respond rapidly to issues, ideally resolving them before they degrade the customer’s experience. This is increasingly difficult to do in today’s mobile environments, which are becoming ever-more complex and are required to support ever-growing numbers of devices and applications.

However, recent advances in artificial intelligence (AI) and machine learning (ML) are making it possible to continuously monitor networks for performance-related issues and detect problems before they affect the customer experience. ML-enhanced anomaly detection in mobile networks is a vital tool for mobile operators in their fight to keep customers and win new ones.

Performance matters for satisfied customers

Operators have already identified some thresholds beyond which customer satisfaction becomes significantly affected. For example, throughput rates below 5 Mbit/s tend to result in increased dissatisfaction among subscribers.

Unhappy customers complain, but many do not complain straight away, and some never complain at all. Degraded performance nevertheless affects the satisfaction of all of these customers and may even lead them to switch operators.

Receiving a complaint from one customer likely means that many other customers are experiencing the same problem, which may have been irritating them for some time. This is obviously far from ideal.

Increased complexity and load impact response times

Modern mobile networks are extensive, increasingly complex, and, most importantly, dynamic. The demands placed on the network constantly vary as devices join and leave, and as devices move between cells. On top of this, equipment is changed and updated, and of course, there are inevitable faults and breakdowns.

These kinds of changes in the network, planned and unplanned, result in an increased need for operators to monitor network behavior and performance. With 5G and the Internet of Things, the number of devices accessing mobile networks is also growing rapidly, along with the range of applications and content. This increases the load on networks, makes performance harder to monitor, and raises the risk of performance degradation that affects customer satisfaction.

Operators have network operation centers (NOCs) that receive alarms directly from network equipment when something is not working as it should be. These alarms are usually defined by the equipment vendor, and they are a useful tool for dealing with hardware and software faults.

However, standard NOCs (the kind with banks of monitors and human staff) have focused on alarms that come from equipment and on a selected set of network-level performance KPIs. Because networks are dynamic, performance can degrade for many reasons that are not directly related to equipment faults. Standard NOCs may not be able to identify these kinds of issues before they result in degraded performance and dissatisfied customers.

Continuously monitoring specific KPIs is a potential solution

Many technical support actions deal with situations that may be related to, for example, a misconfiguration somewhere in the network. One way to recognize these issues before they become a problem is to continuously monitor a set of defined KPIs, such as success rate, drop rate, packet loss, and throughput. Done manually, however, this kind of monitoring is labor-intensive and impractical, and some changes are too small to be detected even by a trained human eye.

In addition, with modern networks, operators may need to monitor dozens of KPIs simultaneously for every cell in their network, and the networks can be very extensive, totaling hundreds of thousands of cells.

With dozens of KPIs for each of hundreds of thousands of cells, this can result in the need to continuously monitor potentially millions of data sets in an attempt to spot anomalies before they become issues. It is usually impossible, and always impractical and financially non-viable, to scale human effort up to the level needed to monitor all the KPIs.

Traditionally, fixed thresholds have been used to monitor both KPIs that directly impact the customer experience, such as throughput rates, and other KPIs, such as signal strength, that can indicate other issues in the network. Users may not be able to detect these issues directly, but they may eventually result in a performance hit that users will notice.
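To make the limitation of a fixed threshold concrete, here is a minimal Python sketch. It reuses the 5 Mbit/s throughput figure mentioned earlier; the function name and the sample data are hypothetical and purely illustrative, not taken from any operator's system.

```python
THROUGHPUT_FLOOR_MBITS = 5.0  # the dissatisfaction threshold cited earlier in the article


def breaches_fixed_threshold(samples, floor=THROUGHPUT_FLOOR_MBITS):
    """Return the indices of samples that fall below a fixed floor."""
    return [i for i, value in enumerate(samples) if value < floor]


# Hypothetical cell whose throughput declines steadily from 20 to 8 Mbit/s.
declining = [20.0 - 0.5 * step for step in range(25)]

print(breaches_fixed_threshold(declining))  # [] : no alarm despite a 60% drop
```

The cell has lost more than half of its throughput, yet a static rule stays silent because no single value has crossed the floor.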

Modern systems that incorporate advanced AI and ML offer us the opportunity for a better solution that utilizes adaptive monitoring rather than fixed thresholds. This will lead to improved responsiveness and increased quality, resulting in improved customer satisfaction and retention.

Utilizing AI and ML to continuously monitor KPIs

Elisa Polystar provides a fully automated first-line network operations center, Virtual NOC. One feature of the Virtual NOC is an AI-powered anomaly detection algorithm that issues “performance alarms”. These are in contrast to equipment-level alarms, which are mostly triggered by hardware or software faults. Equipment alarms can also be triggered by breaches of some performance thresholds, but the criteria are static. Performance alarms are generated by the ML Engine anomaly detection routine, which continuously monitors a set of defined KPIs. This provides a more flexible and granular way of monitoring performance, based on detecting the point at which something changes and begins to degrade rather than on performance values simply crossing a fixed threshold.

The goal is to spot gradual increases or decreases in trends that may otherwise go unnoticed, as well as sudden and sustained step changes. This is achieved by training an ML algorithm with real-world historical network data presented as time series. The algorithm uses an anomaly detection method based on statistical change point detection, a proven method for identifying these kinds of anomalies.

The result is a general method for detecting anomalies in a variety of different KPIs that network operators want to supervise – a method that is capable of identifying clear changes in trends, whether they are gradual or sudden.
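As an illustration of what statistical change point detection can look like, the following Python sketch uses a two-sided CUSUM test, one well-known technique in this family. It is not the Virtual NOC algorithm, whose details are not given here; the function name, the parameters (`baseline_len`, `slack`, `decision`), and the sample data are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def cusum_change_point(series, baseline_len=20, slack=0.5, decision=5.0):
    """Return the index at which a sustained shift is first flagged, or None.

    The first `baseline_len` samples stand in for historical data and define
    normal behavior. Later samples are standardized against that baseline and
    their deviations accumulated; `slack` absorbs ordinary noise, and an alarm
    is raised once either cumulative sum exceeds `decision`.
    """
    baseline = series[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline) or 1e-9  # guard against a perfectly flat baseline
    up = down = 0.0
    for i in range(baseline_len, len(series)):
        z = (series[i] - mu) / sigma
        up = max(0.0, up + z - slack)      # evidence of an upward shift
        down = max(0.0, down - z - slack)  # evidence of a downward shift
        if up > decision or down > decision:
            return i
    return None


# Hypothetical throughput samples in Mbit/s.
stable = [19.0, 21.0] * 20
step = [19.0, 21.0] * 10 + [16.0, 18.0] * 10  # sudden drop at index 20
drift = [19.0, 21.0] * 10 + [20.0 - 0.3 * k for k in range(1, 21)]  # slow decline

print(cusum_change_point(stable))  # None
print(cusum_change_point(step))    # 22
print(cusum_change_point(drift))   # flagged a few samples into the decline
```

In both the step and the drift case the alarm is raised while throughput is still far above 5 Mbit/s, which is the kind of early signal a fixed threshold cannot give.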

Benefits of ML-powered performance alarms

So, what are the benefits that artificial intelligence and machine learning bring? They can provide early warning of issues that will result in degraded performance levels that could lead to users experiencing poor service. They can detect issues that are not directly related to hardware and software faults. And they can help identify the root causes of issues.

Once the ML algorithm is properly configured and trained and its parameters have been tuned, it can continuously monitor a selected set of relevant KPIs network-wide and look for degraded performance based on observed changes rather than fixed thresholds. This is based on statistical modelling, which allows the system to see the point at which things start to change for the worse rather than waiting until it becomes obvious to the users of the network.

There is also the potential for using ML algorithms to make forecasts, predicting issues before they start to occur.

Added value for network operators

Utilizing a system like this means that we can detect anomalous performance immediately – within minutes or even seconds – and potentially react straight away, rather than waiting for a threshold to be passed or for a customer to submit a complaint. Operators can identify issues and proactively resolve them before they impact the customer experience – before the customers start complaining or think about switching operators.

Early identification of issues that stem from problems such as suboptimal configuration rather than faults means that operators can optimize the performance of their network without their users experiencing degraded performance.

ML assists human experts by helping them identify the issues that are the most relevant and prioritizing them. It also allows them to focus on what is essential rather than routine monitoring.

Ultimately, such a system lets network operators reduce the time it takes to respond to anomalies in their networks and increase the quality that their customers experience. This reduces the number of unhappy users, helps limit churn, and creates better experiences that can, in turn, be used to recruit new customers.