Conference Paper · PDF available

Data-Driven Analytics for Automated Cell Outage Detection in Self-Organizing Networks

Abstract

In this paper, we address the challenge of autonomous cell outage detection (COD) in Self-Organizing Networks (SON). COD is a prerequisite for triggering fully automated self-healing recovery actions following cell outages or network failures. A special case of cell outage, referred to as a Sleeping Cell (SC), remains particularly challenging to detect in state-of-the-art SON, since it triggers no alarms for the Operation and Maintenance (O&M) entity. Consequently, no SON compensation function can be launched unless site visits or drive tests are performed, or complaints are received from affected customers. To address this issue, we present and evaluate a COD framework based on Minimization of Drive Tests (MDT) reports, a functionality recently specified by the Third Generation Partnership Project (3GPP) in Release 10 for LTE networks. Our proposed framework aims to detect cell outages in an autonomous fashion by first pre-processing the MDT measurements using a multidimensional scaling method, and then employing the resulting embedding together with machine learning algorithms to detect and localize anomalous network behaviour. We validate and demonstrate the effectiveness of our proposed solution using data obtained from simulating the network under various operational settings.
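As a rough illustration of the pipeline described in the abstract, the sketch below embeds synthetic MDT-style measurements with multidimensional scaling and then scores each embedded report with a simple k-nearest-neighbour distance detector. The feature layout, detector choice, and threshold are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): embed synthetic MDT-style
# measurements with multidimensional scaling, then flag anomalous reports
# with a k-NN distance score. Feature names and thresholds are assumptions.
import numpy as np
from sklearn.manifold import MDS
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic MDT reports: [serving RSRP (dBm), RSRQ (dB), strongest neighbour RSRP (dBm)]
normal = rng.normal(loc=[-85.0, -10.0, -95.0], scale=[4.0, 1.5, 5.0], size=(500, 3))
outage = rng.normal(loc=[-115.0, -19.0, -90.0], scale=[3.0, 1.0, 5.0], size=(20, 3))
reports = np.vstack([normal, outage])

# Pre-process with MDS to obtain a low-dimensional embedding of the reports.
embedding = MDS(n_components=2, random_state=0).fit_transform(reports)

# Score each embedded report by its mean distance to its k nearest neighbours.
k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
dist, _ = nn.kneighbors(embedding)
score = dist[:, 1:].mean(axis=1)          # drop the self-distance in column 0

threshold = np.percentile(score, 95)       # assumed operating point
flagged = np.where(score > threshold)[0]
print(f"{len(flagged)} reports flagged as anomalous")
```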
... Several works have applied machine learning to analyze Minimization of Drive Tests (MDT) measurement data. Zoha et al. [39] proposed a local-outlier-factor-based detector and a support vector machine based detector to analyze MDT data for autonomous cell outage detection. Onireti et al. [22] applied a k-nearest neighbor (k-NN) and a local outlier factor anomaly detector to analyze MDT data to detect control- and data-plane problems in cells. ...
... The above related work on applications of machine learning to mobile network management can be classified along two dimensions: analysis method and the data used in the analysis. In terms of analysis method, k-NN [7], [22], local-outlier-factor-based detectors [39], support vector machines [39], diffusion mapping [6], deep learning [13], [19], semi-supervised learning [14], and k-means [23] have been used. In terms of data used in the analysis, call data records (CDR) [6], [13], [14], [23], high-dimensional radio signal power and quality data obtained from the network [6], Minimization of Drive Tests (MDT) data [7], [19], [22], [39], and eNodeB data [9] are used. ...
Article
Full-text available
With increasing network topology complexity and the continuous evolution of new wireless technology, it is challenging to address network service outages with traditional methods. In long-term evolution (LTE) networks, a large number of base stations, called eNodeBs, are deployed to cover entire service areas spanning various kinds of geographical regions. Each eNodeB generates a large number of key performance indicators (KPIs). Hundreds of thousands of eNodeBs are typically deployed to cover a nation-wide service area, so operators need to handle hundreds of millions of KPIs. It is impractical to handle such a huge amount of KPI data manually, and automation of data processing is therefore desired. To improve network operation efficiency, a suitable machine learning technique is used to learn and classify individual eNodeBs into different states based on multiple performance metrics during a specific time window. However, an issue with supervised learning is that it requires a large amount of labeled data, and annotating data is costly in human labor and time. To mitigate the cost and time issues, we propose a method based on few-shot learning that uses the Prototypical Networks algorithm to complement the eNodeB state analysis. Using a dataset from a live LTE network consisting of thousands of eNodeBs, our experimental results show that the proposed technique provides high performance while using a small number of labeled data.
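The prototypical-network idea referenced in this abstract can be sketched in a few lines: class prototypes are the means of a handful of labelled support examples, and unlabelled eNodeB windows are assigned to the nearest prototype. The sketch below omits the learned embedding network and uses raw synthetic KPI vectors; all KPI shapes and state names are assumptions.

```python
# Minimal sketch of the prototypical-network idea for eNodeB state
# classification (not the paper's implementation). The learned embedding
# network is omitted; raw synthetic KPI vectors, states, and sizes are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_kpi = 8                                  # KPIs per eNodeB per time window

def fake_kpis(center, n):                  # synthetic KPI vectors around a state "center"
    return rng.normal(loc=center, scale=0.3, size=(n, n_kpi))

states = ["normal", "degraded", "outage"]
centers = [np.zeros(n_kpi), np.full(n_kpi, 1.0), np.full(n_kpi, 2.5)]

# Few-shot setting: only 5 labelled examples (the "support set") per state.
support = {s: fake_kpis(c, 5) for s, c in zip(states, centers)}
prototypes = np.stack([support[s].mean(axis=0) for s in states])   # one prototype per state

# Classify unlabelled query windows by nearest prototype (Euclidean distance).
queries = fake_kpis(centers[2], 10)        # e.g. windows from an eNodeB in outage
d = np.linalg.norm(queries[:, None, :] - prototypes[None, :, :], axis=2)
predicted = [states[i] for i in d.argmin(axis=1)]
print(predicted)
```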
... The boundary is then used to classify live data as normal or anomalous. In [65] and [68], this is done by using a one-class Support Vector Machine (SVM), which works generally as described in Section VI above. In the specific case of the one-class SVM, the budget is used to allow a small number of outliers in the normal data to be misclassified as faults [58], in order to achieve optimum anomaly detection performance on live data. ...
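A minimal sketch of the one-class SVM detector with a small outlier budget, as described in the excerpt above, might look as follows; the data, features, and nu value are assumptions for illustration only.

```python
# Hedged sketch of a one-class SVM detector: a small "nu" budget tolerates a
# few outliers in the nominally normal training data. Data, features, and the
# nu value are assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
train = rng.normal(loc=[-85.0, -10.0], scale=[4.0, 1.5], size=(400, 2))   # mostly normal KPI pairs

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)  # allow ~5% of training points outside the boundary
clf.fit(train)

live = np.vstack([
    rng.normal(loc=[-85.0, -10.0], scale=[4.0, 1.5], size=(50, 2)),   # normal live samples
    rng.normal(loc=[-115.0, -19.0], scale=[3.0, 1.0], size=(5, 2)),   # outage-like samples
])
labels = clf.predict(live)      # +1 = normal, -1 = anomalous
print(int((labels == -1).sum()), "live samples flagged as anomalous")
```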
... (Table excerpt, references grouped by fault-management building block.) Pre-Processing: [29], [33], [42], [53], [64], [65], [66], [83]. Detection: [27], [33], [42], [59], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82]. Diagnosis 1 (Action Determination): [27], [70], [82], [83], [84], [85], [86], [87], [60]. Compensation: [33], [92], [89], [62], [90], [91], [88]. Diagnosis 2 (Root Cause Analysis): [96], [47], [61], [97]. ... measure; a random detector would score 0.5 and an ideal detector would score 1.0. Typical examples of the best results currently available are shown in Table 12. ...
Article
Full-text available
This paper surveys the literature relating to the application of machine learning to fault management in cellular networks from an operational perspective. We summarise the main issues as 5G networks evolve, and their implications for fault management. We describe the relevant machine learning techniques through to deep learning, and survey the progress which has been made in their application, based on the building blocks of a typical fault management system. We review recent work to develop the abilities of deep learning systems to explain and justify their recommendations to network operators. We discuss forthcoming changes in network architecture which are likely to impact fault management and offer a vision of how fault management systems can exploit deep learning in the future. We identify a series of research topics for further study in order to achieve this.
... Advanced ML models, similar to those used for traffic prediction, have been adapted to detect anomalies by learning from vast datasets of network activity. These models are trained to recognize patterns indicative of potential issues, enabling network operators to preemptively address problems before they escalate [41]-[44]. c) Cell rate and UE spectral efficiency predictions: Gijon et al. assessed cell throughput prediction accuracy using real network key performance indicator (KPI) counters, exploring a variety of ML methodologies such as support vector regression, k-nearest neighbors, decision trees, and artificial neural network (ANN)-based ensemble approaches [45]. ...
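The kind of KPI-based throughput regression comparison attributed to Gijon et al. above can be sketched on synthetic data as follows; the features, model settings, and KPI-to-throughput relationship are assumptions, not results from the cited work.

```python
# Sketch of a KPI-based cell-throughput regression comparison on synthetic
# data; the KPI features and the toy throughput model are assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 600
X = np.column_stack([
    rng.uniform(0.0, 1.0, n),     # assumed KPI: PRB utilisation
    rng.uniform(0.0, 30.0, n),    # assumed KPI: average channel-quality indicator
    rng.uniform(1, 200, n),       # assumed KPI: connected users
])
y = 50.0 * X[:, 1] / 30.0 * (1.0 - 0.5 * X[:, 0]) + rng.normal(0.0, 2.0, n)   # toy throughput in Mbps

models = [
    ("SVR", make_pipeline(StandardScaler(), SVR(C=10.0))),
    ("k-NN", make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=7))),
    ("Decision tree", DecisionTreeRegressor(max_depth=6, random_state=0)),
]
for name, model in models:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.2f}")
```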
Article
Full-text available
The energy consumption of mobile networks poses a critical challenge. Mitigating this concern necessitates the deployment and optimization of network energy-saving solutions, such as carrier shutdown, to dynamically manage network resources. Traditional optimization approaches encounter complexity due to factors like the large number of cells, stochastic traffic, channel variations, and intricate trade-offs. This paper introduces the simulated reality of communication networks (SRCON) framework, a novel, data-driven modeling paradigm that harnesses live network data and employs a blend of machine learning (ML)- and expert-based models. This mix of models accurately characterizes the functioning of network components and predicts network energy efficiency and user equipment (UE) quality of service for any energy carrier shutdown configuration in a specific network. Distinguishing itself from existing methods, SRCON eliminates the reliance on expensive expert knowledge, drive testing, or incomplete maps for predicting network performance. This paper details the pipeline employed by SRCON to decompose the large network energy efficiency modeling problem into ML- and expert-based submodels. It demonstrates how, by embracing stochasticity and carefully crafting the relationships between such submodels, the overall computational complexity can be reduced and prediction accuracy enhanced. Results derived from real network data underscore the paradigm shift introduced by SRCON, showcasing significant gains over a state-of-the-art method used by an operator for network energy efficiency modeling. The reliability of this local, data-driven modeling of the network proves to be a key asset for network energy-saving optimization.
... Furthermore, stochastic geometry-based models are unable to capture network dynamics such as mobility management and transmission latency. Therefore, several machine learning (ML) based techniques have been proposed in the current literature that leverage training and tuning of ML-based models to determine the behavior of different configuration and optimization parameters (COPs), such as antenna tilt, transmit power, and cell load, in relation to different key performance indicators (KPIs), such as coverage, capacity, or energy efficiency [10]-[12]. These COP-KPI relationships can then be used for COP-KPI optimization. ...
Article
Full-text available
The future of cellular networks is contingent on artificial intelligence (AI) based automation, particularly for radio access network (RAN) operation, optimization, and troubleshooting. To achieve such zero-touch automation, a myriad of AI-based solutions are being proposed in the literature to leverage AI for modeling and optimizing network behavior. However, to work reliably, AI-based automation requires a deluge of training data. Consequently, the success of the proposed AI solutions is limited by a fundamental challenge faced by the cellular network research community: scarcity of training data. In this paper, we present an extensive review of classic and emerging techniques to address this challenge. We first identify the common data types in RAN and their known use-cases. We then present a taxonomized survey of techniques used in the literature to address training data scarcity for various data types. This is followed by a framework to address the training data scarcity. The proposed framework builds on available information and a combination of techniques including interpolation, domain-knowledge-based methods, generative adversarial neural networks, transfer learning, autoencoders, few-shot learning, simulators and testbeds. Potential new techniques to enrich scarce data in cellular networks are also proposed, such as matrix completion theory and domain-knowledge-based techniques leveraging different types of network geometries and network parameters. In addition, an overview of state-of-the-art simulators and testbeds is presented to make readers aware of current and emerging platforms to access real data in order to overcome the data scarcity challenge. The extensive survey of techniques for addressing training data scarcity, combined with the proposed framework for selecting a suitable technique for a given type of data, can assist researchers and network operators in choosing appropriate methods to overcome the data scarcity challenge in leveraging AI for radio access network automation.
Preprint
Full-text available
The future of cellular networks is contingent on artificial intelligence (AI) based automation, particularly for radio access network (RAN) operation, optimization, and troubleshooting. To achieve such zero-touch automation, a myriad of AI-based solutions are being proposed in the literature for modeling and optimizing network behavior. However, to work reliably, AI-based automation requires a deluge of training data. Consequently, the success of AI solutions is limited by a fundamental challenge faced by the cellular network research community: scarcity of training data. We present an extensive review of classic and emerging techniques to address this challenge. We first identify the common data types in RAN and their known use-cases. We then present a taxonomized survey of techniques to address training data scarcity for various data types. This is followed by a framework to address the training data scarcity. The framework builds on available information and a combination of techniques including interpolation, domain-knowledge-based methods, generative adversarial neural networks, transfer learning, autoencoders, few-shot learning, simulators, and testbeds. Potential new techniques to enrich scarce data in cellular networks are also proposed, such as matrix completion theory and domain-knowledge-based techniques leveraging different network parameters and geometries. An overview of state-of-the-art simulators and testbeds is also presented to make readers aware of current and emerging platforms for real data access. The extensive survey of techniques for addressing training data scarcity, combined with the proposed framework for selecting a suitable technique for a given type of data, can assist researchers and network operators in choosing appropriate methods to overcome the data scarcity challenge in leveraging AI for radio access network automation.
... It implements the SON paradigm, as this is one of the most promising areas for an operator to save Capital Expenditure (CAPEX), Implementation Expenditure (IMPEX), and Operational Expenditure (OPEX), and can simplify network management through self-directed functions (self-planning, self-deployment, self-configuration, self-optimization, and self-healing) [13]. A clear example of SON applications related to resilient mobile networks is autonomous Cell Outage Detection (COD), which is a prerequisite for triggering fully automated self-healing recovery actions after cell outages or network failures [14]. ...
Article
Full-text available
Fault tolerance and the availability of applications, computing infrastructure, and communications systems during unexpected events are critical in cloud environments. The microservices architecture, and the technologies that it uses, should be able to maintain acceptable service levels in the face of adverse circumstances. In this paper, we discuss the challenges faced by cloud infrastructure in relation to providing resilience to applications. Based on this analysis, we present our approach for a software platform based on a microservices architecture, as well as the resilience mechanisms to mitigate the impact of infrastructure failures on the availability of applications. We demonstrate the capacity of our platform to provide resilience to analytics applications, minimizing service interruptions and keeping acceptable response times.
... Since our framework is designed to detect anomalies within minutes, whereas the conventional techniques involving subscriber complaints and drive tests consume hours and sometimes days to detect an anomaly (cell outage) [34], it potentially improves QoS and trims OPEX, as timely identification of an anomalous cell means quicker problem resolution. Detection of surged traffic activity in a region can also act as an early warning of potential congestion that might choke the network. ...
Article
Full-text available
Escalating cell outages and congestion, treated here as anomalies, cause substantial revenue loss to cellular operators and severely affect subscriber quality of experience. State-of-the-art literature applies a feed-forward deep neural network at the core network (CN) for the detection of the above problems in a single cell; however, the solution is impractical, as it would overload the CN, which monitors thousands of cells at a time. Inspired by mobile edge computing and the breakthroughs of deep convolutional neural networks (CNNs) in computer vision research, we split the network into several 100-cell regions, each monitored by an edge server, and propose a framework that pre-processes raw call detail records containing user activities to create an image-like volume, which is fed to a CNN model. The framework outputs a multi-labeled vector identifying the anomalous cell(s). Our results suggest that our solution can detect anomalies with up to 96% accuracy, and is scalable and expandable for industrial Internet of Things environments.
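A hedged sketch of the regional CNN detector described in this abstract: a 10x10 grid of cells with a few CDR-derived channels is treated as an image-like volume and mapped to one anomaly logit per cell. The architecture, channels, and shapes below are assumptions, not the authors' exact model.

```python
# Hedged sketch of a regional CNN anomaly detector on an image-like CDR
# volume; architecture, channel choice, and shapes are assumptions.
import torch
import torch.nn as nn

class RegionCNN(nn.Module):
    def __init__(self, in_channels=3, grid=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * grid * grid, grid * grid)   # one logit per cell

    def forward(self, x):                       # x: (batch, channels, grid, grid)
        z = self.features(x).flatten(1)
        return self.head(z)                     # multi-label logits

model = RegionCNN()
cdr_volume = torch.randn(8, 3, 10, 10)          # batch of 8 regions, 3 CDR-derived channels
labels = (torch.rand(8, 100) < 0.05).float()    # synthetic per-cell anomaly labels

loss = nn.BCEWithLogitsLoss()(model(cdr_volume), labels)
loss.backward()                                 # one illustrative training step (no optimizer shown)
print(float(loss))
```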
Chapter
In order to meet the challenges of ambitious capacity, user experience, and resource efficiency gains, next-generation cellular networks need to leverage end-to-end user and network behavior intelligence. This intelligence can be gathered from mobile network big data, which includes massive telemetric data about network health and status as well as data about user whereabouts, preferences, context, and mobility patterns. As a result, exploitation of big data in wireless cellular networks is emerging as an indispensable approach for harnessing intelligence in future wireless communication networks. In this article, we first identify and classify the big data that can be gathered from different layers and ends of a wireless cellular network. We then discuss several new utilities of big data that can bridge the existing gaps to meet 5G requirements. After that, we summarize the existing literature on data analytics for cellular network performance. We present different platforms and two different frameworks to implement big data analytics-based solutions in 5G and beyond and compare their pros and cons. We then discuss how key performance indicator (KPI)-based data collection may not suffice in 5G. Through an exemplary study, we show how to unleash the full potential hidden within big data and the granularity of low-level performance indicators, and how context is essential. Finally, we highlight the opportunities that big data offers in cellular networks and the challenges therein.
Article
Full-text available
5G is anticipated to embed artificial intelligence (AI) empowerment to adroitly plan, optimize, and manage the highly complex network by leveraging data generated at different positions of the network architecture. Outages and situations leading to congestion in a cell pose severe hazards for the network. High false-alarm rates and inadequate accuracy are the major limitations of modern approaches to detecting anomalies, namely outages and sudden surges in traffic activity that may result in congestion, in mobile cellular networks. This means wasting limited resources, which ultimately leads to elevated operational expenditure (OPEX) and interrupts quality of service (QoS) and quality of experience (QoE). Motivated by the outstanding success of deep learning (DL) technology, our study applies it to the detection of the above-mentioned anomalies and also supports the mobile edge computing (MEC) paradigm, in which the core network (CN)'s computations are divided across the cellular infrastructure among different MEC servers (co-located with base stations) to relieve the CN. Each server monitors the user activities of multiple cells and utilizes an L-layer feedforward deep neural network (DNN) fueled by a real call detail record (CDR) dataset for anomaly detection. Our framework achieved 98.8% accuracy with a 0.44% false positive rate (FPR), notable improvements that surmount the deficiencies of earlier studies. The numerical results demonstrate the usefulness and superiority of our proposed detector.
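A minimal stand-in for the L-layer feed-forward detector described above, using scikit-learn's MLPClassifier on synthetic per-cell CDR features; the features, layer sizes, and injected anomalies are assumptions.

```python
# Minimal sketch of a feed-forward detector on per-cell CDR-style features;
# MLPClassifier stands in for the paper's DNN. Features, layer sizes, and
# labels are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([rng.poisson(50, n), rng.poisson(30, n), rng.poisson(80, n)]).astype(float)  # SMS, call, data activity
y = np.zeros(n, dtype=int)
y[rng.choice(n, 100, replace=False)] = 1           # synthetic anomalous windows
X[y == 1] *= rng.uniform(3.0, 5.0, size=(100, 1))  # anomalies modelled as a sudden activity surge

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
clf = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=500, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```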
Article
Full-text available
This article surveys the literature over the last decade on the emerging field of self organisation as applied to wireless cellular communication networks. Self organisation has been extensively studied and applied in ad hoc networks, wireless sensor networks and autonomic computer networks; however, in the context of wireless cellular networks, this is the first attempt to put the various efforts in perspective in the form of a tutorial/survey. We provide a comprehensive survey of the existing literature, projects and standards in self organising cellular networks. Additionally, we also aim to present a clear understanding of this active research area, identifying a clear taxonomy and guidelines for the design of self organising mechanisms. We compare the strengths and weaknesses of existing solutions and highlight the key research areas for further development. This paper serves as a guide and a starting point for anyone willing to delve into research on self organisation in wireless cellular communication networks.
Article
Full-text available
We address the problem of detecting “anomalies” in the network traffic produced by a large population of end-users, following a distribution-based change detection approach. In the considered scenario, different traffic variables are monitored at different levels of temporal aggregation (timescales), resulting in a grid of variable/timescale nodes. For every node, a set of per-user traffic counters is maintained and then summarized into histograms for every time bin, obtaining a time series of empirical (discrete) distributions for every variable/timescale node. Within this framework, we tackle the problem of designing a formal Distribution-based Change Detector (DCD) able to identify statistically significant deviations from the past behavior of each individual time series. For the detection task we propose a novel methodology based on a Maximum Entropy (ME) modeling approach. Each empirical distribution (sample observation) is mapped to a set of ME model parameters, called the “characteristic vector”, via closed-form Maximum Likelihood (ML) estimation. This allows us to derive a detection rule based on a formal hypothesis test (Generalized Likelihood Ratio Test, GLRT) to measure the coherence of the current observation, i.e., its characteristic vector, with a given reference. The latter is dynamically identified, taking into account the typical non-stationarity displayed by real network traffic. Numerical results on synthetic data demonstrate the robustness of our detector, while the evaluation on a labeled dataset from an operational 3G cellular network confirms the capability of the proposed method to identify real traffic anomalies.
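A much-simplified stand-in for the distribution-based change detector described above: per-time-bin histograms of a per-user traffic counter are compared against a reference distribution using a KL-divergence score rather than the paper's Maximum-Entropy/GLRT statistic. All data and the threshold are assumptions.

```python
# Simplified stand-in for a distribution-based change detector: KL divergence
# of hourly traffic histograms from a reference distribution; not the paper's
# ME/GLRT method. Data and threshold are assumptions.
import numpy as np

rng = np.random.default_rng(5)
bins = np.linspace(0, 100, 21)

def histogram(samples):
    h, _ = np.histogram(samples, bins=bins)
    p = h.astype(float) + 1e-6             # smoothing to avoid log(0)
    return p / p.sum()

reference = histogram(rng.gamma(shape=2.0, scale=10.0, size=5000))   # "typical" per-user traffic

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

for t in range(24):                         # 24 hourly time bins
    shape = 2.0 if t != 17 else 6.0         # inject a distribution change at hour 17
    obs = histogram(rng.gamma(shape=shape, scale=10.0, size=1000))
    score = kl(obs, reference)
    if score > 0.2:                         # assumed detection threshold
        print(f"hour {t:02d}: change detected (KL = {score:.2f})")
```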
Conference Paper
Full-text available
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but cannot otherwise be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
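The LOF idea can be exercised directly with scikit-learn's implementation; the synthetic data and contamination value below are assumptions, not the paper's experiments.

```python
# Sketch of the local outlier factor (LOF) on synthetic 2-D data; dataset and
# contamination value are assumptions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
inliers = rng.normal(0.0, 1.0, size=(300, 2))
outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
labels = lof.fit_predict(X)                 # -1 = outlier, +1 = inlier
scores = -lof.negative_outlier_factor_      # larger score = more isolated from its neighbourhood
print(int((labels == -1).sum()), "points flagged; top LOF score:", round(scores.max(), 2))
```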
Book
Covering the key functional areas of LTE Self-Organising Networks (SON), this book introduces the topic at an advanced level before examining the state-of-the-art concepts. The required background on LTE network scenarios, technologies and general SON concepts is first given to allow readers with basic knowledge of mobile networks to understand the detailed discussion of key SON functional areas (self-configuration, -optimisation, -healing). Later, the book provides details and references for advanced readers familiar with LTE and SON, including the latest status of 3GPP standardisation. Based on the defined next generation mobile networks (NGMN) and 3GPP SON use cases, the book elaborates to give the full picture of a SON-enabled system including its enabling technologies, architecture and operation. "Heterogeneous networks" including different cell hierarchy levels and multiple radio access technologies as a new driver for SON are also discussed. Introduces the functional areas of LTE SON (self-optimisation, -configuration and -healing) and its standardisation, also giving NGMN and 3GPP use cases. Explains the drivers, requirements, challenges, enabling technologies and architectures for a SON-enabled system. Covers multi-technology (2G/3G) aspects as well as core network and end-to-end operational aspects. Written by experts who have been contributing to the development and standardisation of the LTE self-organising networks concept since its inception. Examines the impact of new network architectures ("Heterogeneous Networks") on network operation, for example multiple cell layers and radio access technologies.
Conference Paper
With the rapid development of mobile wireless systems, operators are experiencing unprecedented challenges in service maintenance and operational expenditure, which drives the demand for realizing automation in current networks. Cell outage detection is considered an effective way to automatically detect network faults. Our work presents an automated cell outage detection mechanism in which a clustering technique, the Dynamic Affinity Propagation (DAP) clustering algorithm, is introduced. Performance metrics are collected from the network during its regular operation and then fed into the algorithm to produce optimal clusters for further anomaly detection. The proposed mechanism has been implemented in an LTE-Advanced simulation environment, through which we have successfully detected the configured cell outages and located their specific outage areas.
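A hedged sketch of clustering-based outage detection in the spirit of this abstract, using scikit-learn's standard AffinityPropagation as a stand-in for the dynamic (DAP) variant; the KPIs and the degradation rule are assumptions.

```python
# Hedged sketch of clustering-based outage detection: standard
# AffinityPropagation stands in for the dynamic variant (DAP); cells whose
# KPI vectors fall into a degraded-looking cluster are flagged. KPIs and
# thresholds are assumptions.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(7)
healthy = rng.normal(loc=[0.95, 30.0], scale=[0.02, 3.0], size=(57, 2))   # [RRC success rate, throughput]
degraded = rng.normal(loc=[0.40, 3.0], scale=[0.05, 1.0], size=(3, 2))    # outage-like cells
kpis = np.vstack([healthy, degraded])

labels = AffinityPropagation(random_state=0).fit_predict(kpis)

# Flag clusters whose mean KPIs look degraded (assumed rule of thumb).
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    mean_kpi = kpis[members].mean(axis=0)
    if mean_kpi[0] < 0.7:
        print("suspected outage cells:", members.tolist(), "mean KPIs:", np.round(mean_kpi, 2))
```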
Article
Nonmetric multidimensional scaling (MDS) is adapted to give configurations of points that lie on the surface of a sphere. There are data sets where it can be argued that spherical MDS is more relevant than the usual planar MDS. The theory behind the adaption of planar MDS to spherical MDS is outlined and then its use is illustrated on three data sets.
Chapter
Suppose dissimilarity data have been collected on a set of n objects or individuals, where there is a value of dissimilarity measured for each pair. The dissimilarity measure used might be a subjective judgement made by a judge, where for example a teacher subjectively scores the strength of friendship between pairs of pupils in her class; or, as an alternative and more objective measure, she might count the number of contacts made in a day between each pair of pupils. In other situations the dissimilarity measure might be based on a data matrix. The general aim of multidimensional scaling is to find a configuration of points in a space, usually Euclidean, where each point represents one of the objects or individuals, and the distances between pairs of points in the configuration match as well as possible the original dissimilarities between the pairs of objects or individuals. Such configurations can be found using metric and non-metric scaling, which are covered in Sects. 2 and 3. A number of other techniques are covered by the umbrella title of multidimensional scaling (MDS), and here the techniques of Procrustes analysis, unidimensional scaling, individual differences scaling, correspondence analysis and reciprocal averaging are briefly introduced and illustrated with pertinent data sets.
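The core MDS task described here, recovering a low-dimensional configuration from a pairwise dissimilarity matrix, can be sketched as follows on synthetic data.

```python
# Small sketch of metric MDS: given a pairwise dissimilarity matrix, find a
# 2-D configuration whose Euclidean distances approximate the dissimilarities.
# Data are synthetic.
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import squareform, pdist

rng = np.random.default_rng(8)
true_points = rng.normal(size=(12, 5))               # 12 objects in a hidden 5-D space
dissimilarity = squareform(pdist(true_points))       # observed pairwise dissimilarities

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
config = mds.fit_transform(dissimilarity)             # 2-D configuration of the 12 objects

print("embedding shape:", config.shape, "stress:", round(mds.stress_, 2))
```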
Conference Paper
Base stations experiencing hardware or software failures have a negative impact on network performance and customer satisfaction. The timely detection of such so-called outage or sleeping cells can be a difficult and costly task, depending on the type of error. As a first step towards self-healing capabilities of mobile communication networks, operators have formulated a need for automated cell outage detection. This paper presents and evaluates a novel cell outage detection algorithm, which is based on the neighbor cell list reporting of mobile terminals. Using statistical classification techniques as well as a manually designed heuristic, the algorithm is able to detect most of the outage situations in our simulations.
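An illustrative heuristic in the spirit of this approach (not the paper's algorithm): a cell is flagged as a possible sleeping cell when it largely disappears from the neighbour-cell-list reports of surrounding terminals. Counts and the threshold are assumptions.

```python
# Illustrative heuristic (not the paper's algorithm): flag a cell whose
# mentions in neighbour-cell-list (NCL) reports drop far below its baseline.
# Report counts and the threshold are assumptions.
import numpy as np

rng = np.random.default_rng(9)
cells = list(range(10))
baseline = {c: 200 + int(rng.integers(-20, 20)) for c in cells}   # typical NCL mentions per hour

# Current hour: cell 7 has gone silent.
current = {c: baseline[c] + int(rng.integers(-30, 30)) for c in cells}
current[7] = 0

for c in cells:
    ratio = current[c] / baseline[c]
    if ratio < 0.1:                         # assumed drop threshold
        print(f"cell {c}: NCL mentions dropped to {ratio:.0%} of baseline -> possible outage")
```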