Proactive Failure Management in Smart Grids for Improved Resilience: A Methodology for Failure Prediction and Mitigation

December 2015

DOI:10.1109/GLOCOMW.2015.7414155

Conference: 2015 IEEE Globecom Workshops (GC Wkshps)

Authors:

Igor Kaitovic

University of Lugano

Slobodan Lukovic

University of Lugano

Miroslaw Malek

University of Lugano

Proactive Failure Management in Smart Grids for

Improved Resilience

A Methodology for Failure Prediction and Mitigation

Igor Kaitovic, Slobodan Lukovic, Miroslaw Malek

Advanced Learning and Research Institute (ALaRI), Faculty of Informatics

Università della Svizzera italiana (University of Lugano)

Lugano, Switzerland

{igor.kaitovic, slobodan.lukovic, miroslaw.malek}@usi.ch

Abstract—A gradual move in the electric power industry

towards Smart Grids brings several challenges to the system

operation such as preserving its resilience and ensuring security.

As the system complexity grows and the number of failures

increases, the need for a grid management paradigm shift from

reactive to proactive is apparent and can be realized by

employing advanced monitoring instruments, data analytics and

prediction methods. In order to improve resilience of the Smart

Grid and to contribute to efficient system operation, we present a

blueprint of a comprehensive methodology for proactive failure

management that may also be applied to manage other types of

disturbances and undesirable changes. The methodology is

composed of three main steps: continuous monitoring of the most

indicative features; prediction of failures; their mitigation. The

approach is complementary to the existing ones that are mainly

based on fast detection and localization of grid disturbances, and

reactive corrective actions.

Keywords—Smart Grid, Proactive Management, Failure

Prediction, Resilience, Security, Synchrophasor

I. INTRODUCTION

Current approaches to electric power grid operation are

essentially reactive and are based on state estimation, using

run-time measurements that are typically obtained from

SCADA and, in the recent period more frequently, from

synchrophasors (i.e. Phasor Measurement Units - PMUs)

[1][2]. If a disturbing event is detected (e.g. a voltage sag, or a

frequency deviation), corrective actions are taken (e.g. load

shedding, or altering a transmission line load) to bring the

system back to the stable state and to prevent disturbance

propagation. The reactive approach becomes insufficient with

increasing complexity of the grid, introduction of a vast

number of distributed (renewable) energy resources (due to

their intermittent nature); new types of loads like electric

vehicles; as well as demand response programs [3]. Moreover,

considering Smart Grid as a cyber-physical system, new types

of disturbances may occur due to interdependencies between

cyber and electric infrastructures and cyber attacks [4][5][6].

Thus, innovative approaches are needed to address these

challenges and make Smart Grids more resilient. In the scope

of this paper, we think of resilience as defined in [7], namely

the ability of a system to provide and maintain an acceptable

level of service in the face of various faults and cyber attacks.

In this sense, resilience may be seen as a system property that

unifies dependability, security and performability.

Failures impact both dependability and performability. A

failure in power systems occurs when the quality of the electric

power delivered to the customer deviates from the one defined

by the service agreement [6]. The minimum quality level also

depends on the type of the customer (e.g. a household or an

industrial consumer). The most severe failure, a power outage,

is a complete interruption of power delivery. A blackout

represents a total interruption of the electric power delivery

service that affects all users in the area [8]. Numerous

blackouts in the recent history (e.g. the ones reported in [9] and

[10]) that affected hundreds of millions of customers are a

strong motivation to further improve the grid’s resilience.

Considering the importance of the Smart Grid as a critical

infrastructure and the impact that failures have on its overall

resilience, we propose a proactive approach to management of

failures which relies on improved availability of grid status

information (e.g. obtained from PMUs) and advanced data

analytics. Our methodology is based on: (i) continuous

monitoring of the most indicative features and time series, (ii)

prediction algorithms trained on features’ logs to predict

failures and (iii) mitigation techniques to prevent impending

failures or minimize their effects. The main difference between our

approach and the state of the art is that our focus is not on a

specific type or aspect of failures (e.g. cascading failures) but

on defining a comprehensive approach to address different

types of failures. The approach may also be applied to address

other types of disturbances or undesirable changes (e.g. cyber

attacks).

The rest of the paper is organized as follows. A change of

paradigm from reactive to proactive Smart Grid management is

outlined in Section II. State of the art is briefly presented in

Section III and covers both industrial and academic trends. A

blueprint of a methodology for proactive management of

failures is described in detail in Section IV and the problem of

obtaining the data for training and evaluating the failure

predictor is addressed in Section V. The anticipated effect on

resilience is discussed in Section VI. Section VII concludes the

paper.

This work has been supported in part by grants from the Hasler

Foundation and Swiss Commission for Technology and Innovation (CTI)

II. FROM REACTIVE TO PROACTIVE SMART GRID

MANAGEMENT

With the ability to collect and analyze an increasing amount of

data in real time, we observe a rapid paradigm shift (see

Fig. 1, from Gartner1) from analyzing the past and monitoring

the present to predicting and prescribing the future. For

systems such as Smart Grids and others, we used to ask

questions like what happened and why it happened, but now,

with the ability to collect more data and process it rapidly,

we are or will be able to ask questions about the future: what

will happen and what can we do about it? Obviously, the level

of difficulty increases as questions about the future are usually

more difficult than the ones about the past but, on the other

hand, only by careful analysis of the past and the present are

we able to reasonably speculate about the future. At the same

time, an increase in resilience and a decrease in maintenance cost

are expected. This is the essence of our contribution in this

paper: we advocate the development of failure prediction and

mitigation methods to significantly improve the Smart Grid’s

resilience.

III. STATE OF THE ART

Failure prediction for proactive fault management in

computer systems has been extensively researched [11] with

some methods finding their way into industrial practice [12][13].

Efficient grid management of emerging Smart Grids attracts

considerable attention in both industrial and academic

communities (see, for example, [1] and [14]-[18]), especially

since the massive insertion of renewables followed by improved

interaction between utilities and end consumers brought

additional uncertainties in power distribution systems. On the

other hand, modern technologies such as synchrophasors and

advanced Energy Management System (EMS) applications

greatly facilitated grid observability and management.

Therefore, utilities and grid operators have recently started

looking into novel concepts for proactive grid management as

presented in [1]. In this sense “proactive operation” means

taking preventive actions in order to avoid anticipated

problems in the grid. This concept considers two main aspects

– failure prevention and asset management (i.e. predictive

maintenance). In this regard the following areas of proactive

grid management solutions are presented in [1]: 1) decision

support systems (DSSs); 2) synchrophasor solutions; 3)

symbiotic integration of synchrophasors with fast-acting

controls. The proposed solutions rely on the analysis of a

combination of PMU and SCADA/EMS collected and

processed grid-related information. The usage of historical (i.e.

pre-event) data is also foreseen.

1 Gartner IT Glossary, Predictive Analytics: http://www.gartner.com/it-glossary/predictive-analytics

Considerable research efforts have been invested in

improvement of grid reliability from the perspective of

efficient maintenance coupled with maximization of assets

utilization. In particular this concerns a shift from scheduled to

predictive grid maintenance [14]. The predictive maintenance

instruments include a group of programs named Reliability-

Centered Maintenance (RCM). In an RCM approach, various

alternative maintenance policies can be compared to select the

most cost-effective one for sustaining equipment reliability. A

relevant work on the case study of the city of New York has

investigated knowledge discovery methods and statistical data

analysis for preventive maintenance [15]. It introduced a

general process for transforming historical electrical grid data

into models that aim to predict the risk of failures for

components and systems. These models can be used directly by

power companies to assist with prioritization of maintenance

and repair work. The study has proved that prediction results

are accurate enough to support decision making.

Major industrial players such as Alstom and PG&E (Pacific

Gas & Electric) announced plans to advance Synchrophasor Grid

Monitoring into Proactive Grid Stability Management. In 2013,

Alstom Grid collaborated with the PG&E project team to deliver

enhanced e-terra integrated real-time synchrophasor and

Energy Management Systems (EMS) applications as the first

stage of the Production Grade Synchrophasor Project. This

enables PG&E to monitor power system behavior from a new

class of GPS time-synchronized, high resolution PMUs. These

devices take grid measurements at a rate of up to 120

times per second, versus the traditional four-to-six-second data

rate of unsynchronized SCADA sensors. The

increased observability allows PG&E to identify and analyze

system vulnerabilities in real-time, assess available transfer

margins across transmission corridors and provide corrective

actions to prevent potential blackouts. In the future, Alstom’s

Grid Stability Package will help to integrate existing

measurement-based PMU analytics, model-based EMS and

dynamic stability analytics to enable proactive management of

grid stability [16]. Tollgrade, in partnership with DTE Energy,

implemented a pilot to prove the concept of “predictive grid”

across DTE Energy’s service territory by installing advanced

monitoring equipment and predictive grid analytics software to

find tell-tale signs of faults and asset health symptoms before

outages occur [17]. The results have shown that almost 86% of

all outages have been preceded by line disturbances, which

leaves significant room for further work in grid failure

prediction.

In [18], an approach has been developed to predict power grid

weak points, and specifically to efficiently identify the most

probable failure modes in static load distribution for a given

power network. The approach is applied to two examples. The

algorithm is a power network adaptation of a heuristic

originally developed to study low probability events in physics.

One finding is that, if the normal operational mode of the grid

is sufficiently healthy, failures are relatively sparse, i.e., the

failures are caused by load fluctuations at only a few buses.

[Figure 1: four analytics stages plotted against the axes Difficulty and Value: Descriptive Analytics (What happened?), Diagnostic Analytics (Why did it happen?), Predictive Analytics (What will happen?), Prescriptive Analytics (How can we make it happen?).]

Fig. 1. Proactive management roadmap

IV. SMART GRID PROACTIVE FAILURE MANAGEMENT

In computer systems, a failure is an event that occurs when

the delivered service deviates from the correct one [19]. In

power systems, the term failure is traditionally used to describe a

complete outage, when no power is delivered to the user. We

address Smart Grid faults, errors and failures in [6] and define

power grid failures as events that occur not only when the

power delivery is interrupted but also when the quality of the

electric power delivered to the customer deviates from the one

defined by the service agreement. The minimum level of

quality depends on the type of the customer. This margin is

used to differentiate between errors and failures.

Power grid failures may affect a large number of

consumers causing customers’ dissatisfaction. Moreover, they

have a highly negative impact on economy, especially in

industrial environments. As the power grid is a safety-critical

infrastructure, failures may also compromise safety and pose

life risks. This calls for a comprehensive approach to managing

Smart Grid failures including a need for failure prediction

methods. Failure prediction triggers preventive actions to

mitigate anticipated failures by reducing their effects and,

whenever possible, fully preventing impending failures. When

prevention is not possible and a failure is unavoidable, failure

prediction may activate preparation for repair actions.

An overview of the proposed methodology for proactive

management of Smart Grid failures is depicted in Fig. 2. The

main stages, namely monitoring, failure prediction and failure

mitigation, are discussed in detail in the following subsections.

The methodology may be implemented as an additional

application of EMS. It may also be adapted for handling other

disturbances and undesirable changes, as well as for early

warning on cyber attacks that are also identified as a threat to

Smart Grid’s resilience [5].

A. Grid Monitoring

Monitoring infrastructure provides data to a failure

prediction algorithm. The set of data, its quality and monitoring

frequency have a high impact on the quality of prediction.

Until recently, the main set of measurements was collected

through SCADA, typically every 2 to 4 seconds [1]. These

data are usually scarce, noisy, inaccurate and not synchronized.

As such they cannot be used to derive an accurate grid model

that is a basis for the prediction of future states and anticipation

of failures.

The introduction of a large number of PMUs, smart meters

and advanced metering devices, changes the picture

significantly. These new sensors and measuring devices

generate a vast amount of data that is more accurate and, more

importantly, synchronized. With time tags, measurements from

different locations may be related and the current state of the

system may be estimated with sufficient accuracy. PMUs

provide information on voltages, currents and phasors on all

three phases separately at a rate of up to 60 [1] or even 120

data points per second in some industrial solutions [16]. Today,

this data is mainly used for the state estimation, detection of

disturbances and their localization; it may also be a valuable

input for failure prediction. Besides real-time measurements,

monitoring infrastructure provides static network data that

corresponds to the parameters and substation configurations

[2]. Additional devices that provide ambient measurements

(e.g. air and wire temperature, wind speed and direction) may

also be a valuable input for failure prediction. Finally, weather

prediction and load forecasts are an important input as extreme

weather conditions and high demands are frequently identified

as root causes of grid failures [6][9].

Due to a large amount of data and still limited computing

resources, not all the data may be processed at runtime. Thus,

monitoring must be adaptive in the sense that the monitoring

frequency as well as the number of features sent to the

predictor may be adjusted based on the current state of the grid

and the estimated probability of near-future failures. For example,

if the probability of failure is estimated to be high, it may be

necessary to acquire data at a higher rate and from additional

sources (e.g. from a PMU in a different part of the network) in

order to obtain a more accurate prediction.
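
The adaptive monitoring idea described above can be illustrated with a minimal Python sketch. It maps the estimated failure probability to a sampling rate and a set of data sources. The tier thresholds, the rates and the source names (`pmu_local`, `pmu_neighbor`, `weather`) are hypothetical values chosen for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringPlan:
    rate_hz: float   # samples per second requested from each source
    sources: tuple   # identifiers of the measurement sources to poll


# Hypothetical tiers: (minimum failure probability, sampling rate, sources).
# Thresholds, rates and source names are illustrative assumptions only.
TIERS = (
    (0.5, 60.0, ("pmu_local", "pmu_neighbor", "weather")),
    (0.1, 10.0, ("pmu_local", "weather")),
    (0.0, 1.0, ("pmu_local",)),
)


def plan_monitoring(failure_probability: float) -> MonitoringPlan:
    """Return the plan of the first tier whose threshold is reached."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must lie in [0, 1]")
    for threshold, rate_hz, sources in TIERS:
        if failure_probability >= threshold:
            return MonitoringPlan(rate_hz, sources)
    raise AssertionError("unreachable: the last tier has threshold 0.0")
```

With these assumed tiers, a low estimated probability keeps a single source at a low rate, while a high estimate raises the rate and adds a PMU from another part of the network, as the text suggests.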

B. Failure Prediction

The goal of online failure prediction is to identify, at

runtime, whether a failure will occur in the near future based

on an assessment of the monitored current system state and the

analysis of past events. The output of a failure predictor is the

probability that a failure will occur in the near future. A failure

predictor should predict as many failures as possible while

minimizing the number of false alarms. Numerous prediction

methods are already successfully used in enterprise

computer systems for online, short-term prediction of failures

[11]. The prediction quality is identified as a critical part of the

entire predict-mitigate approach [20].

The design of a failure predictor should be conducted in three

phases depicted in Fig. 3. In the first phase a model of the

system should be conceived from the topology of the grid (or

its part). The model should clearly relate system parts to PMU

measurements, identify parts of the system where, based on

historical records, disturbances are the most frequent, and

establish a relation between system parts in terms of

disturbance propagation. The main usage of the model is in

preliminary selection of the most indicative features. As

different types of failures may occur in the grid (e.g. line trips,

generator failures, failures of the cyber infrastructure,

interdependent failures), they must be identified, classified and

properly described with related ranges of features’ values.

Classification of Smart Grid failures and fault taxonomy are

presented in [4] and [6]. Historical data that are necessary to

train and evaluate a predictor may be obtained from

measurement logs. Alternatively, simulating the Smart Grid and

grid failures may be more beneficial for the data generation,

especially in the first phase. The problem of collecting the data

for training and evaluating the predictor will be addressed in

Section V.

[Figure 2: blocks Smart Grid, Monitoring, Failure Prediction (Ensemble of failure predictors) and Failure Mitigation (Diagnosis, Decision on Countermeasures, Implementation of Countermeasures).]

Fig. 2. Methodology overview

In the second phase, the obtained dataset should be

analyzed. Data conditioning includes extraction of the features

(also called events, variables or parameters by different

research communities) and structuring the data in a form that

may be used as input for the prediction algorithm. In particular,

each data set in the stream that describes one system state

should be associated with a failure type or marked as failure-free.

A preliminary feature selection should be conducted while

taking into account the system model. Feature selection is the

process of selecting the most relevant features (and examples

for algorithm training) and combining them in order to

maximize predictors’ performance; discard redundant and

noisy data; obtain faster and more cost-effective algorithm

training and online prediction; and better interpret the data

relations (data simplification for better human understanding).

Feature selection methods may be classified as filters, wrappers

and embedded methods. A widely used filter method is

Principal Component Analysis (PCA). PCA converts a set of

correlated features into a set of linearly uncorrelated features

(principal components) using orthogonal transformation. The

procedure is independent of the type of the

prediction algorithm that will be used and thus very appropriate

for preliminary selection of features. Numerous packages are

available for feature selection, including those that are a part of

popular tools for statistical analysis (e.g. Matlab/Octave,

Python and R). A good overview of feature selection methods

is given in [21].
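
As a small illustration of a filter-style preliminary selection, the Python sketch below ranks features by the absolute Pearson correlation between each feature and the failure label. This is a deliberately simpler filter than the PCA mentioned above, swapped in so that the example stays dependency-free. The feature names and the toy values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def rank_features(samples, labels):
    """Sort feature names by |correlation| with the failure label.

    samples: dict mapping feature name -> list of values, one per state
    labels:  list with 0 (failure-free) or 1 (associated with a failure)
    """
    scores = {name: abs(pearson(vals, labels)) for name, vals in samples.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration only.
samples = {
    "voltage_pu":   [1.00, 0.99, 0.93, 1.01, 0.92, 0.91],
    "frequency_hz": [50.0, 50.0, 50.0, 50.0, 50.0, 50.0],
    "line_temp_c":  [40.0, 42.0, 45.0, 43.0, 44.0, 41.0],
}
labels = [0, 0, 1, 0, 1, 1]
ranking = rank_features(samples, labels)
```

Like PCA, such a ranking does not depend on the prediction algorithm chosen later, which is what makes filter methods convenient for a preliminary pass. Unlike PCA, it keeps the original features and does not remove redundancy between them.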

In the final stage, the predictor is adapted and evaluated. In

fact, an ensemble of predictors may be used to improve quality

of prediction. Having in mind a large number of existing

failure prediction algorithms, the most viable solution is to

select and to adopt one of them. A comprehensive survey of

failure prediction algorithms is given in [11]. Three main

approaches used for prediction are: failure tracking, symptom

monitoring and detected error reporting. Failure tracking draws

conclusions about upcoming failures from the occurrence of

the previous ones. These methods either aim at predicting the

time of the next occurrence of a failure or at estimating the

probability of failure co-occurrence. Symptoms are defined as

side effects of looming faults that do not necessarily manifest

themselves as errors. Symptom-monitoring based predictions

analyze the system features in order to identify those that

indicate an upcoming failure. Several methodologies for the

estimation were proposed in the past, including function

approximation, machine-learning techniques, system models,

graph models, and time series analysis. Finally, the methods

based on detected error reporting, such as the rule-based, the

co-occurrence-based and the pattern recognition methods,

analyze the error reports to predict if a new failure is about to

happen. Current trends in predicting cascading failures in

power grids are mainly based on the application of machine-

learning approaches, such as neural networks, support vector

machines and anomaly detection (see, for example, [15]).

In order to speed up the prediction, the set of selected

features should be refined by extracting the most indicative

ones. To facilitate the process, heuristics (such as the ones

presented in [22]) may be employed. After each iteration, the

quality of prediction has to be evaluated. The process

terminates when a sufficient quality of prediction is reached so

that resilience is improved. A prediction quality metric and the

expected effect of our approach on resilience are addressed in

Section VI.

C. Failure Mitigation

Once the prediction mechanisms anticipate a failure,

corrective actions that will mitigate it should be scheduled and

activated. The mitigation is composed of three phases as

depicted in Fig. 2. In the diagnosis phase, the output of the

prediction is analyzed. Additional algorithms may be employed

to better identify the location of the anticipated failure. In the

second phase of mitigation, a decision on a countermeasure is

taken. This decision should take into account the probability of

a failure (provided by the predictor), the cost of the measure

(e.g. maintenance cost or the cost in terms of the number of

customers affected), the probability of a successful mitigation

and the overall effect on resilience (for example the effect on

steady-state availability). Finally, the implementation of the

countermeasure has to be performed.
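
The decision phase can be sketched as an expected-cost comparison that combines the quantities listed above: the failure probability reported by the predictor, the cost of each measure and its probability of success. The Python sketch below is one possible formulation under stated assumptions, not the authors' decision procedure. The class `Countermeasure`, the cost unit and the example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Countermeasure:
    name: str
    cost: float       # cost of applying the measure (assumed common unit)
    p_success: float  # probability that it prevents the predicted failure


def expected_cost(p_failure, failure_cost, measure=None):
    """Expected cost of doing nothing (measure=None) or applying a measure."""
    if measure is None:
        return p_failure * failure_cost
    residual = p_failure * (1.0 - measure.p_success)
    return measure.cost + residual * failure_cost


def choose_countermeasure(p_failure, failure_cost, measures):
    """Return (best measure or None, its expected cost)."""
    best, best_cost = None, expected_cost(p_failure, failure_cost)
    for measure in measures:
        cost = expected_cost(p_failure, failure_cost, measure)
        if cost < best_cost:
            best, best_cost = measure, cost
    return best, best_cost


# Hypothetical options and costs, for illustration only.
options = [
    Countermeasure("load_shedding", cost=500.0, p_success=0.9),
    Countermeasure("grid_reconfiguration", cost=100.0, p_success=0.6),
]
high_risk_choice, _ = choose_countermeasure(0.8, 10_000.0, options)
low_risk_choice, _ = choose_countermeasure(0.005, 10_000.0, options)
```

With these assumed numbers, a high failure probability favors the more expensive but more reliable measure, while a very low probability makes doing nothing the cheapest option. A fuller treatment would also weigh the effect on resilience, which this sketch folds into a single cost figure.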

The ultimate goal is to fully prevent the failure (e.g. by grid

reconfiguration or by decreasing the load of a specific line). If

that is not possible, then the effect of a failure should be

minimized or confined (e.g. by preventive load shedding) or a

preparation of repair actions may be triggered to minimize the

repair time (e.g. by directing a maintenance team to the

location where a failure is expected). Some of these techniques

can lead to a system performance degradation. For example, if

a failure predictor was wrong, unnecessary preventive load

shedding may be conducted, affecting a subset of customers. The

entire process may be implemented as fully automated or it

may require the involvement of an operator for decision-

making. This may depend on the type of mitigation and its

cost. When more than one failure is anticipated, a coordinated

management of mitigation is required.

[Figure 3: blocks Historical Data, Network Topology, Failure Classification, System Model, Data Conditioning, Preliminary Feature Selection, Algorithm Selection, Evaluation and Feature Selection Refinement; labels Phase 1, Phase 2, Phase 3, Data Analysis, Design and Evaluation.]

Fig. 3. Failure predictor design stages

A great opportunity for mitigation of failures in Smart Grid

lies in the employment of Flexible Alternating Current

Transmission Systems (FACTS) that provide a sub-second

response [1]. FACTS incorporate electronic-based and other

static controllers that provide control of one or more AC

transmission system parameters to enhance controllability and

increase power transfer capability. Standard definitions of

FACTS-related terminology are given in [23]. As demonstrated

in [24], these devices can alter power flow in transmission lines

in a fashion that can prevent failures from occurring in the

system. An overview of available FACTS and their capability

for handling specific problems is given in [1]. More details on

mitigation techniques in power systems may be found in [25].

Finally, considering Smart Grid as a cyber-physical system, and

having in mind that failures may propagate from the computing

infrastructure [4], techniques for computer systems recovery

should also be included.

V. DATA PROVISION FOR FAILURE PREDICTION

For efficient training and evaluation of a predictor, a

vast amount of relevant data is needed. This data must contain

a sufficient number of training examples recorded before and during grid

failures. The problem of obtaining such data is that advanced

measuring devices like PMUs are still not widely deployed and

that electric power distribution companies mostly do not make

collected data publicly available. Several academic projects

exist where the data from PMUs and FDRs (Frequency

Disturbance Recorders) are collected for the purpose of state

estimation, visualization, disturbances detection and prediction.

These projects include FNET/GridEye2, EPFL Smart Grid3 and

openPDC4. Even though they are valuable sources, the problem of

an insufficient number of training examples that contain data

collected during a failure still remains. This is mainly due to

the fact that power grid failures are relatively rare events and

that, in some of these projects, relatively small university

campus grids are monitored with a limited number of PMUs.

An alternative way to train and evaluate prediction

algorithms is to use fault injection methods in power systems

simulators. A number of commercial and open-source grid

simulators are available, including Power System Toolbox

(PST), Power Analysis Toolbox (PAT), Power System

Analysis Toolbox (PSAT), and MatDyn [26]. Still, one should

keep in mind that these and other simulators do not necessarily

take into account all the aspects of the future grid, like a strong

impact of cyber (computing) infrastructure on grid operation

[4]. This is particularly important, having in mind the relative

frequency of software failures and their potential propagation

from cyber to electric infrastructure [6].

Fault injection, in a simulation environment, may be

employed to emulate failures and obtain a sufficient number of

failure-related training examples. In computer systems, fault

injection is mainly used for testing fault tolerance, identifying

root causes and evaluating dependability [27] but it may be

also used for improvement and validation of failure prediction

by accelerating the process of data collection through data

generation [28]. Many power system simulation tools also

implement fault injection mechanisms to evaluate the reliability of

the grid. For example, MatDyn [26] allows time-domain

simulation of power systems with injection of faults.

2 FNET/GridEye: http://fnetpublic.utk.edu/

3 EPFL Smart Grid: http://smartgrid.epfl.ch/

4 openPDC: http://openpdc.codeplex.com/

VI. EXPECTED EFFECT ON RESILIENCE

Proactive management of failures relies on failure

prediction and mitigation techniques to fully prevent failures or

to minimize their effects. With sufficiently good prediction and

proper corrective actions a highly positive effect on resilience

is expected. For example, in the case of computer systems,

preliminary studies show that steady-state availability of a

system may be improved by an order of magnitude if the

quality of failure prediction is high and if failures are

anticipated sufficiently in advance [20][29].
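
To make the link between prediction quality and availability concrete, the following Python sketch uses a deliberately simplified model. It is not the model of [20] or [29]. It assumes the textbook steady-state availability MTTF / (MTTF + MTTR), assumes that a fraction recall × p_mitigation of failures is fully prevented, and ignores false alarms and any downtime caused by the preventive actions themselves. All numbers are illustrative.

```python
def availability(mttf_hours, mttr_hours):
    """Textbook steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def availability_with_prediction(mttf_hours, mttr_hours, recall, p_mitigation):
    """Availability when a share recall * p_mitigation of failures is prevented.

    Simplifying assumptions: prevented failures cause no downtime, and
    false alarms and the cost of preventive actions are ignored.
    """
    prevented = recall * p_mitigation
    if prevented >= 1.0:
        return 1.0  # every failure is prevented in this idealized case
    # Fewer failures per unit of time stretch the effective MTTF.
    effective_mttf = mttf_hours / (1.0 - prevented)
    return availability(effective_mttf, mttr_hours)


# Illustrative numbers: one failure per 1000 h, 10 h to repair.
baseline = availability(1000.0, 10.0)
improved = availability_with_prediction(1000.0, 10.0, recall=0.9, p_mitigation=1.0)
unavailability_ratio = (1.0 - baseline) / (1.0 - improved)
```

Under these assumptions, predicting and preventing 90% of failures reduces unavailability by roughly a factor of ten, which is the kind of improvement the cited studies report. The gain shrinks quickly once recall or mitigation success drops.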

Precision and recall are widely used as metrics for the

failure prediction quality. Precision is the ratio of the number

of correctly predicted failures to the total number of predictions

(issued alarms). Recall is the ratio of the total number of

correctly predicted failures to the total number of failures. High

precision means a high percentage of correct predictions and

high recall means that a high percentage of failures are

predicted. Low precision indicates that too many false alarms

are issued. Low recall indicates that a large number of failures

are not predicted. A good quality predictor has high precision

and high recall. Still, it is frequently necessary to compromise

on precision and recall by increasing one at the expense of

the other [11].
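
The two metrics defined above can be computed as in the following Python sketch, which also shows the trade-off by evaluating the same predictor output at two alarm thresholds. The probabilities, the outcomes and the thresholds are toy values invented for illustration.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    alarms = true_positives + false_positives
    failures = true_positives + false_negatives
    precision = true_positives / alarms if alarms else 0.0
    recall = true_positives / failures if failures else 0.0
    return precision, recall


def evaluate(probabilities, failed, threshold):
    """Raise an alarm when the predicted probability reaches the threshold."""
    pairs = list(zip(probabilities, failed))
    tp = sum(1 for p, f in pairs if p >= threshold and f)
    fp = sum(1 for p, f in pairs if p >= threshold and not f)
    fn = sum(1 for p, f in pairs if p < threshold and f)
    return precision_recall(tp, fp, fn)


# Toy predictor outputs and actual outcomes, for illustration only.
probabilities = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
failed = [True, True, False, True, False, False]

strict = evaluate(probabilities, failed, threshold=0.7)    # few alarms
lenient = evaluate(probabilities, failed, threshold=0.35)  # many alarms
```

On this toy data the strict threshold yields no false alarms but misses one failure, while the lenient threshold catches every failure at the price of one false alarm.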

The correctness of prediction has a decisive impact on

Smart Grid’s resilience and maintenance. If a prediction is

correct, a failure may be avoided. This will improve resilience

of a system and may also decrease maintenance cost. For

example, if a transmission line failure is correctly predicted,

preventive load shedding could be triggered. This will affect a

subset of customers but will prevent the effect of cascading

failures that would finally affect a much larger set of

customers. The overall effect on resilience will be positive. On

the other hand, if prediction is not correct (false alarm),

unnecessary corrective actions may be triggered, resulting in

additional costs and compromising resilience. For example, a

failure of a transmission line may be incorrectly predicted although

there is no real danger of a line failure. As a consequence,

preventive load shedding may be triggered as a corrective

action. This will cut off a subset of customers and create a

partial outage even though there was no danger of a

failure.

In conclusion, the overall effect that proactive management

may have on Smart Grid’s resilience strongly depends on the

quality of prediction and the cost of related corrective actions

(in terms of both financial cost and time). While dependability

and performability may be improved with good prediction and

effective mitigation, system’s dependability may be

compromised due to bad prediction that will also affect the

efficiency of the overall operation of the grid. If the

methodology is adopted for managing cyber attacks through

their early detection, a similar effect on security may be

expected.

VII. CONCLUSIONS

Growing complexity of the grid, increasing number of

intermittent renewable energy resources, changes in interaction

with end consumers, rising needs for electric power and a

demand for more flexible and efficient grid management pose a

challenge to maintaining high resilience standards. On the

other hand, modern Energy Management Systems

complemented by advanced monitoring instruments enable

improved grid observability, state estimation and grid/outage

management which can be effectively used, as advocated in

this paper, in proactive failure management.

We propose a comprehensive methodology for proactive

management of failures in Smart Grid based on data analytics

and inspired by well-established solutions employed in

computer engineering. The concept relies on statistical analysis

of historic pre-event data coupled with online monitoring of the

most indicative features for prediction of near-future failures

and their mitigation. The proposed strategies aim at efficient

prevention of failures, which at the same time improves

resilience and decreases maintenance cost. Since our approach

is holistic, it may also be applied to address other types of

disturbances or undesirable changes (e.g. cyber attacks) that

pose a challenge to assuring a high level of resilience. Each

step of the methodology is described including a discussion on

challenges, implementation strategies and opportunities for

improving Smart Grid resilience. Particular care is given to

a description of a failure predictor design whose quality has a

decisive impact on resilience, cost and customer satisfaction.

