A Reinforcement Learning Framework for Optimal
Operation and Maintenance of Power Grids
R. Rocchetta^a, L. Bellani^b, M. Compare^{b,c}, E. Zio^{b,c,d,e}, E. Patelli^{a,*}
a Institute for Risk and Uncertainty, Liverpool University, United Kingdom
b Aramis s.r.l., Milano, Italy
c Energy Department, Politecnico di Milano, Italy
d MINES ParisTech, PSL Research University, CRC, Sophia Antipolis, France
e Eminent Scholar, Department of Nuclear Engineering, College of Engineering, Kyung Hee University, Republic of Korea
* Corresponding author: edoardo.patelli@liverpool.ac.uk
Abstract
We develop a Reinforcement Learning framework for the optimal management of the operation and maintenance of power grids equipped with prognostics and health management capabilities. Reinforcement Learning exploits the information about the health state of the grid components. Optimal actions are identified by maximizing the expected profit, considering the aleatory uncertainties in the environment. To extend the applicability of the proposed approach to realistic problems with large and continuous state spaces, we use Artificial Neural Network (ANN) tools to replace the tabular representation of the state-action value function. The non-tabular Reinforcement Learning algorithm adopting an ANN ensemble is designed and tested on a scaled-down power grid case study, which includes renewable energy sources, controllable generators, maintenance delays and prognostics and health management devices. The method's strengths and weaknesses are identified by comparison to the reference Bellman's optimality. Results show the good approximation capability of Q-learning with ANN, and that the proposed framework outperforms expert-based solutions to grid operation and maintenance management.
Keywords: Reinforcement Learning, Artificial Neural Networks, Prognostic
and Health Management, Operation and Maintenance, Power Grid,
Uncertainty
1. Introduction
Power Grids are critical infrastructures designed to satisfy the electric power
needs of industrial and residential customers. Power Grids are complex systems
including many components and subsystems, which are intertwined with each other and affected by degradation and aging due to a variety of processes (e.g. creepage discharge [1], loading and unloading cycles [2], weather-induced fatigue [3], etc.). Maximizing the Power Grid profitability by a safe and reliable delivery of power is of primary importance for grid operators. This requires developing sound decision-making frameworks, which account for both the complexity of the asset and the uncertainties on its operational conditions, components degradation processes, failure behaviors, external environment, etc.
Nowadays, Power Grid Operation and Maintenance (O&M) management is enhanced by the possibility of equipping the Power Grid components with Prognostics and Health Management (PHM) capabilities, for tracking and managing the evolution of their health states so as to maintain their functionality [4]. This capability can be exploited by Power Grid operators to further increase the profitability of their assets, e.g. with a smarter control of traffic lights [5]-[6], exploiting wide area control of wind farms [7] or with a better microgrid control [8] and management [9]. However, embedding PHM in the existing Power Grid O&M policies requires addressing a number of challenges [10]. In this paper, we present a framework based on Reinforcement Learning [11]-[12] for setting the generator power outputs and scheduling preventive maintenance actions so as to maximize the Power Grid load balance and the expected profit over
an infinite time horizon, while considering the uncertainty of power production
from Renewable Energy Sources, power loads and components failure behaviors.
Reinforcement Learning has been used to solve a variety of realistic control and
decision-making issues in the presence of uncertainty, but with a few applica-
tions to Power Grid management. For instance, Reinforcement Learning has
been applied to address the generators load frequency control problem [13], the
unit commitment problem [14], to enhance the power system transient stabil-
ity [15] and to address customers’ private preferences in the electricity market
[16]. Furthermore, the economic dispatch [17] and the auction based pricing
issues [18] have also been tackled using Reinforcement Learning. In [19], a Q-
learning approach has been proposed to solve constrained load flow and reactive
power control problems in Power Grids. In [9], a Reinforcement Learning-based
optimization scheme has been designed for microgrid consumers actions man-
agement, and accounting for renewable volatility and environmental uncertainty.
In [20], a comparison between Reinforcement Learning and a predictive control
model has been presented for a Power Grid damping problem. In [21] a review
of the application of reinforcement learning for demand response is proposed,
whereas in [8], the authors have reviewed recent advancements in intelligent
control of microgrids, which include a few Reinforcement Learning methods.
However, none of the reviewed works employs Reinforcement Learning to find
optimal O&M policies for Power Grids with degrading elements and equipped
with PHM capabilities. Moreover, these works mainly apply basic Reinforce-
ment Learning algorithms (e.g., the SARSA(λ) and Q-learning methods [12]),
which rely on a memory intensive tabular representation of the state-action
value function Q. The main drawback of these tabular methods lies in their
limited applicability to realistic, large-scale problems, characterized by highly-
dimensional state-action spaces. In those situations, the memory usage becomes
burdensome and the computational times are intractable. To extend the appli-
cability of Reinforcement Learning methods to problems with arbitrarily large
state spaces, regression tools can be adopted to replace the tabular representation of Q (refer to [12] for a general overview of algorithms for RL and to [22] for an introduction to deep RL).
In [23], a deep Q-learning strategy for optimal energy management of hybrid
electric buses is proposed. In [24], a Reinforcement Learning method is used to find the optimal incentive rates for a demand-response problem in smart grids; real-time performance is augmented with the aid of deep neural networks.
Two RL techniques based on Deep Q-learning and Gibbs deep policy gradient
are applied to physical models for smart grids in [25]. In [26], a RL method
for dynamic load shedding is investigated for short-term voltage control; the
southern China Power Grid model is used as a test system. In [27], RL for
residential demand response control is investigated; however, only tabular Q-learning methods are considered. To the best of the authors' knowledge, none of the reviewed works proposes a non-tabular solution to the operation and maintenance scheduling of power grids equipped with PHM devices.
In this paper, to extend the applicability of the proposed Reinforcement Learn-
ing method, we use Artificial Neural Networks (ANNs), due to their approximation power and good scalability properties. The resulting Reinforcement Learning algorithm enables tackling high-dimensional optimization problems and its effectiveness is investigated on a scaled-down test system. This example allows
showing that Reinforcement Learning can really exploit the information pro-
vided by PHM to increase the Power Grid profitability.
The rest of this work is organized as follows: Section 2 presents the Reinforce-
ment Learning framework for optimal O&M of Power Grids in the presence
of uncertainty; a scaled-down power grid application is proposed in Section 3, whereas the results of Reinforcement Learning for Power Grid O&M are discussed in Section 4; limitations and future directions are discussed in Section 5 and Section 6 concludes the paper.
2. Modeling framework for optimal decision making under uncertainty
In the Reinforcement Learning paradigm, an agent (i.e. the controller and
decision maker) learns from the interaction with the environment (e.g. the grid)
by observing states, collecting gains and losses (i.e. rewards) and selecting ac-
tions to maximize the future revenues, considering the aleatory uncertainties
in the environment behavior. On-line Reinforcement Learning methods can
tackle realistic control problems through direct interaction with the environ-
ment. However, off-line (model-based) Reinforcement Learning methods are
generally adopted for safety-critical systems such as power grids [28], due to the
unacceptable risks associated with exploratory actions [28].
Developing an off-line Reinforcement Learning framework for Power Grid O&M
management requires defining the environment and its stochastic behavior, the
actions that the agent can take in every state of the environment and their
effects on the grid and the reward generated. These are formalized below.
2.1. Environment State
Consider a Power Grid made up of elements $C = \{1, \ldots, N\}$, physically and/or functionally interconnected according to the given grid structure. Similarly to [10], the features of the grid elements defining the environment are the $n_d$ degradation mechanisms affecting the degrading components $d \in D \subseteq C$ and the $n_p$ setting variables of the power sources $p \in P \subseteq C$. For simplicity, we assume $D = \{1, \ldots, |D|\}$, $P = \{|D|+1, \ldots, |D|+|P|\}$ and $|D|+|P| \leq N$. The extension of the model to more complex settings can be found in [10].
Every degradation mechanism evolves independently from the others, obeying a Markov process that models the stochastic transitions from state $s^d_i(t)$ at time $t$ to the next state $s^d_i(t+1)$, where $s^d_i(t) \in \{1, \ldots, S^d_i\}$, $\forall t$, $d \in D$, $i = 1, \ldots, n_d$. These degradation states are estimated by the PHM systems (e.g., [29]).
Similarly, a Markov process defines the stochastic transitions of the $p$-th power setting variable from $s^p_j(t)$ at time $t$ to the next state $s^p_j(t+1)$, where $s^p_j(t) \in \{1, \ldots, S^p_j\}$, $\forall t$, $p \in P$, $j = 1, \ldots, n_p$. Generally, these transitions depend on exogenous factors such as the weather conditions.
Then, the system state vector $\mathbf{S} \in \mathcal{S}$ at time $t$ reads:

$$\mathbf{S}_t = \left[ s^1_1(t), s^1_2(t), \ldots, s^{|P|+|D|}_{n_{|P|+|D|}}(t) \right] \in \mathcal{S} \qquad (1)$$

where $\mathcal{S} = \times_{c = 1, \ldots, |P|+|D|; \, f = 1, \ldots, n_c} \{1, \ldots, S^c_f\}$.
2.2. Actions
Actions can be performed on the grid components $g \in G \subseteq C$ at each $t$. The system action vector $\mathbf{a} \in \mathcal{A}$ at time $t$ is:

$$\mathbf{a}_t = \left[ a_{g_1}(t), \ldots, a_{g_\varrho}(t), \ldots, a_{g_{|G|}}(t) \right] \in \mathcal{A} \qquad (2)$$

where action $a_{g_\varrho}$ is selected for component $g_\varrho \in G$ among a set of mutually exclusive actions $a_{g_\varrho} \in A_{g_\varrho}$, $\varrho = 1, \ldots, |G|$, and $\mathcal{A} = \times_{\varrho = 1, \ldots, |G|} A_{g_\varrho}$. The action set $A_{g_\varrho}$ includes both operational actions (e.g. closure of a valve, generator power ramp-up, etc.) and maintenance actions. Specifically, Corrective Maintenance (CM) and Preventive Maintenance (PM) are the maintenance actions considered in this paper. A CM action is performed to fix a faulty component, which is brought from an out-of-service condition to an in-service, As-Good-As-New (AGAN) condition. Differently, PM can be performed on an in-service, non-faulty (but degraded) component, to improve its degradation state.
Constraints can be defined to reduce $A_{g_\varrho}$ to a subset $\hat{A}_{g_\varrho}(\mathbf{S}) \subseteq A_{g_\varrho}$, taking into account that some actions are not allowed in particular states. For example, CM cannot be taken on AGAN components and, similarly, it is the only possible action for failed components. In an opportunistic view [10], both PM and CM actions are assumed to restore the AGAN state of the maintained component. An example of the Markov process for a component with 4 degradation states is presented in Figure 1.
Figure 1: The Markov Decision Process associated with the health state of a degrading component (states AGAN, Deg1, Deg2, Fail); circle markers indicate maintenance actions (PM, CM), whereas square markers indicate operational actions.
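For illustration, the state-dependent action sets $\hat{A}_{g_\varrho}(\mathbf{S})$ described above can be encoded as in the following minimal sketch; the action labels and the helper name are illustrative assumptions, not part of the original implementation:

```python
# Minimal sketch of state-dependent action constraints for one degrading component
# with states 1..n_states (1 = AGAN, n_states = failed), as in Figure 1.
OPERATIONAL = ["op_low", "op_mid", "op_high"]   # operational actions (e.g. power levels)
PM, CM = "PM", "CM"                             # preventive / corrective maintenance

def allowed_actions(state: int, n_states: int) -> list[str]:
    """Return the subset of actions allowed in the given degradation state."""
    if state == n_states:
        return [CM]                             # failed component: CM is the only option
    return list(OPERATIONAL) + [PM]             # in-service component: operate or do PM, no CM

if __name__ == "__main__":
    for s in range(1, 5):                       # the 4-state component of Figure 1
        print(s, allowed_actions(s, n_states=4))
```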
2.3. Stochastic behavior of the environment state
As mentioned before, the development of a Reinforcement Learning framework for optimal O&M of Power Grids has to necessarily rely on a model of the stochastic behavior of the environment. We assume that this is completely defined by the transition probability matrices associated to each feature $f = 1, \ldots, n_c$ of each component $c = 1, \ldots, |P|+|D|$ and to each action $\mathbf{a} \in \mathcal{A}$:

$$P^{a}_{c,f} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,S^c_f} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,S^c_f} \\ \vdots & \vdots & \ddots & \vdots \\ p_{S^c_f,1} & p_{S^c_f,2} & \cdots & p_{S^c_f,S^c_f} \end{bmatrix} \qquad (3)$$

where $p_{i,j}$ represents the probability $P^{a}_{c,f}(s_j \,|\, a, s_i)$ of a transition of component $c$ from state $i$ to state $j$ of feature $f$, conditional on the action $a$, with $\sum_{j=1}^{S^c_f} p_{i,j} = 1$.
This matrix-based representation of the environment behavior is not mandatory for developing a Reinforcement Learning framework. However, it allows applying dynamic programming algorithms that can provide the Bellman's optimal O&M policy with a pre-fixed, arbitrarily small error ([11]). This reference (true) solution is necessary to meet the objective of this study, i.e., the investigation of the benefits achievable from the application of Reinforcement Learning methods to optimal Power Grid O&M, bearing in mind that these methods must be non-tabular to be applicable to realistic Power Grid settings.
The algorithm used to find the reference solution is reported in the Appendix (Algorithm 2).
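For illustration, the snippet below (a sketch, not the authors' code) samples the next state of a single feature from its transition probability matrix; the numerical values are those of the degradation matrix $P^{a_d=1}_{d}$ reported in the Appendix, with an absorbing placeholder row added for the failed state, in which action 1 is actually not allowed:

```python
import numpy as np

# Degradation matrix of a generator under operational action a = 1 (see the Appendix);
# rows = current state, columns = next state. The last row is a placeholder (assumption):
# in the failed state only CM can be taken, so action 1 never uses it.
P_a1 = np.array([
    [0.98, 0.02, 0.00, 0.00],
    [0.00, 0.95, 0.05, 0.00],
    [0.00, 0.00, 0.90, 0.10],
    [0.00, 0.00, 0.00, 1.00],
])

rng = np.random.default_rng(0)

def step_feature(state: int, P: np.ndarray) -> int:
    """Sample the next state (1-based) of one feature from the row of P for the current state."""
    return int(rng.choice(len(P), p=P[state - 1])) + 1

state, trajectory = 1, [1]
for _ in range(10):
    state = step_feature(state, P_a1)
    trajectory.append(state)
print(trajectory)   # a non-decreasing degradation path, e.g. [1, 1, 1, 2, 2, 3, ...]
```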
2.4. Rewards
Rewards are case-specific and obtained by developing a cost-benefit model, which evaluates how good the transition from one state to another is, given that $\mathbf{a}$ is taken:

$$R_t = R(\mathbf{S}_t, \mathbf{a}_t, \mathbf{S}_{t+1}) \in \mathbb{R}$$

Generally speaking, there are no restrictions on the definition of a reward function. However, a well-suited reward function will indeed help the agent converge faster to an optimal solution [30]. Further specifications depend strongly on the specific RL problem at hand and, thus, are provided in Section 3.3.
2.5. A non-tabular Reinforcement Learning algorithm
Generally speaking, the goal of Reinforcement Learning for strategy optimization is to maximize the action-value function $Q^{\pi^*}(\mathbf{S}, \mathbf{a})$, which provides an estimate of the cumulative discounted future revenues when action $\mathbf{a}$ is taken in state $\mathbf{S}$ and the optimal policy $\pi^*$ is followed thereafter:

$$Q^{\pi^*}(\mathbf{S}, \mathbf{a}) = \mathbb{E}_{\pi^*}\left[ \sum_{t=0}^{\infty} \gamma^t R(t) \,\middle|\, \mathbf{S}, \mathbf{a} \right] \qquad (4)$$

We develop a Reinforcement Learning algorithm which uses an ensemble of ANNs to interpolate between state-action pairs, which helps to reduce the number of episodes needed to approximate the $Q$ function.
Figure 2 graphically displays an episode run within the algorithm. In detail, we estimate the value of $Q^{\pi^*}(\mathbf{S}_t, \mathbf{a}_t)$ using a different ANN for each action, with network weights $\mu_1, \ldots, \mu_{|\mathcal{A}|}$, respectively. Network $N_l$, $l = 1, \ldots, |\mathcal{A}|$, receives in input the state vector $\mathbf{S}_t$ and returns the approximated value $\hat{q}_l(\mathbf{S}_t \,|\, \mu_l)$ of $Q^{\pi^*}(\mathbf{S}_t, \mathbf{a}_t = \mathbf{a}_l)$.
To speed up the training of the ANNs ([31]), we initially apply a standard supervised training over a batch of relatively large size $n_{ei}$, to set the weights $\mu_1, \ldots, \mu_{|\mathcal{A}|}$. To collect this batch, we randomly sample the first state $\mathbf{S}_1$ and, then, move $n_{ei} + \Phi$ steps forward by uniformly sampling from the set of applicable actions and collecting the transitions $\mathbf{S}_t, \mathbf{a}_t \rightarrow \mathbf{S}_{t+1}, \mathbf{a}_{t+1}$ with the corresponding rewards $R_t$, $t = 1, \ldots, n_{ei} + \Phi - 1$. These transitions are provided by a model of the grid behavior.
Every network $N_l$, $l \in \{1, \ldots, |\mathcal{A}|\}$, is trained on the set of states $\{\mathbf{S}_t \,|\, t = 1, \ldots, n_{ei}, \, \mathbf{a}_t = \mathbf{a}_l\}$ in which the $l$-th action is taken, whereas the target that the ANN learns is the Monte Carlo estimate $Y_t$ of $Q^{\pi^*}(\mathbf{S}_t, \mathbf{a}_t)$:

$$Y_t = \sum_{t'=t}^{t+\Phi} \gamma^{t'-t} \cdot R_{t'} \qquad (5)$$
After this initial training, we apply Q-learning (e.g., [30],[12]) to find the ANN approximation of the optimal $Q^{\pi^*}(\mathbf{S}_t, \mathbf{a}_t)$. Namely, every time the state $\mathbf{S}_t$ is visited, the action $\mathbf{a}_t$ is selected among all available actions according to the $\epsilon$-greedy policy $\pi$: the learning agent selects the exploitative action (i.e., the action with the largest estimated value, maximizing the expected future rewards) with probability $1 - \epsilon$, or an exploratory action, randomly sampled from the other feasible actions, with probability $\epsilon$.
The immediate reward and the next state are then observed, and the weights $\mu_{\mathbf{a}_t}$ of network $N_{\mathbf{a}_t}$ are updated: a single run of the back-propagation algorithm is performed ([32],[33]) using $R_t + \gamma \cdot \max_{l \in \{1,\ldots,|\mathcal{A}|\}} \hat{q}_l(\mathbf{S}_{t+1} \,|\, \mu_l)$ as target value (Equation 6). This yields the following update:

$$\mu_{\mathbf{a}_t} \leftarrow \mu_{\mathbf{a}_t} + \alpha_{\mathbf{a}_t} \cdot \left[ R_t + \gamma \cdot \max_{l \in \{1,\ldots,|\mathcal{A}|\}} \hat{q}_l(\mathbf{S}_{t+1} \,|\, \mu_l) - \hat{q}_{\mathbf{a}_t}(\mathbf{S}_t \,|\, \mu_{\mathbf{a}_t}) \right] \cdot \nabla_{\mu_{\mathbf{a}_t}} \hat{q}_{\mathbf{a}_t}(\mathbf{S}_t \,|\, \mu_{\mathbf{a}_t}) \qquad (6)$$

where $\alpha_{\mathbf{a}_t} > 0$ is the learning rate associated to $N_{\mathbf{a}_t}$ ([30]).
Notice that the accuracy of the estimates provided by the proposed algorithm strongly depends on the frequency at which the actions are taken in every state: the larger the frequency, the larger the amount of information from which the network can learn the state-action value [30]. In real industrial applications, where systems spend most of the time in states of normal operation ([34]), this may entail a bias or a large variance in the ANN estimates of $Q^{\pi^*}(\mathbf{S}_t, \mathbf{a}_t)$ for rarely visited states. To overcome this issue, we increase the exploration by dividing the simulation of the system, and of its interactions with the environment and O&M decisions, into episodes of fixed length $T$. Thus, we run $N_{ei}$ episodes, each one entailing $T$ decisions; at the beginning of each episode, we sample the first state uniformly over all states. This procedure increases the frequency of visits to highly degraded states and reduces the estimation error. At each episode $ei \in \{1, \ldots, N_{ei}\}$, we decrease the exploration rate according to $\epsilon = \epsilon_0 \cdot \tau^{ei}$, and the learning rate according to $\alpha_l = \alpha_0 \cdot \left( \frac{1}{1 + K_\alpha \cdot t_l} \right)$, where $\alpha_0$ is the initial value, $K_\alpha$ is the decay coefficient and $t_l$ counts the number of times the network $N_l$ has been trained ([30]).
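To make the procedure concrete, the following is a minimal, self-contained sketch of the semi-gradient update of Equation 6 with one small approximator per action and an $\epsilon$-greedy action choice. It is an illustrative reconstruction, not the authors' code: the single-hidden-layer network, the `env_step` interface and the hyperparameter defaults are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

class TinyQNet:
    """One-hidden-layer approximator q(S | mu) with a manual semi-gradient update."""

    def __init__(self, n_in, n_hidden=10, alpha0=0.02, k_alpha=100.0):
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.alpha0, self.k_alpha, self.n_updates = alpha0, k_alpha, 0

    def value(self, s):
        """Return q(s) and the hidden activations (needed for the gradient)."""
        h = np.tanh(self.W1 @ s + self.b1)
        return float(self.w2 @ h + self.b2), h

    def update(self, s, td_error):
        """mu <- mu + alpha * td_error * grad_mu q(s | mu), cf. Equation 6."""
        alpha = self.alpha0 / (1.0 + self.k_alpha * self.n_updates)   # learning rate decay
        _, h = self.value(s)
        dh = (1.0 - h ** 2) * self.w2           # back-propagation through tanh
        self.w2 += alpha * td_error * h
        self.b2 += alpha * td_error
        self.W1 += alpha * td_error * np.outer(dh, s)
        self.b1 += alpha * td_error * dh
        self.n_updates += 1

def run_episode(nets, env_step, s, eps, gamma=0.95, T=20):
    """One truncated episode of Q-learning with one network per action."""
    for _ in range(T):
        q_vals = [net.value(s)[0] for net in nets]
        if rng.random() < 1.0 - eps:
            a = int(np.argmax(q_vals))          # exploit the current value estimates
        else:
            a = int(rng.integers(len(nets)))    # explore a random action
        s_next, reward = env_step(s, a)         # stochastic grid model (assumed given)
        target = reward + gamma * max(net.value(s_next)[0] for net in nets)
        nets[a].update(s, target - q_vals[a])   # semi-gradient TD update of network a only
        s = s_next
    return s
```

In the paper, one such network per action (25 in the case study) is first pre-trained on the Monte Carlo targets of Equation 5 (Phase 1 of Algorithm 1), and the exploration rate $\epsilon$ is decayed across episodes as described above.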
3. Case study
A scaled-down Power Grid case study is considered to apply the Reinforce-
ment Learning decision making framework. The Power Grid includes: 2 con-
trollable generators, 5 cables for power transmission, and 2 renewable energy
sources which provide electric power to 2 connected loads depending on the
(random) weather conditions (Figure 3). Then, |C|=11. The two traditional
generators are operated to minimize power unbalances on the grid (Figure 3).
We assume that these generators, and links 3 and 4, are affected by degrada-
tion and are equipped with PHM capabilities to inform the decision-maker on
their degradation states. Then, D={1,2,3,4}. The two loads and the two
renewable generators define the grid power setting, P={5,6,7,8}
Figure 2: The flow chart displays an episode run and how the learning agent interacts with the environment (i.e. the power grid equipped with PHM devices) in the developed Reinforcement Learning framework; dashed-line arrows indicate when the learning agent takes part in the episode run.
Figure 3: The power grid structure and the position of the 4 PHM-equipped systems, 2 Renewable Energy Sources, 2 loads and 2 controllable generators.
3.1. States and actions
We consider $n_d = 1$ degradation feature for each degrading component $d = 1, \ldots, 4$, and $n_p = 1$ power feature for each power component $p = 5, \ldots, 8$. We consider 4 degradation states for the generators, $s^d_1 \in \{1, \ldots, S^d_1 = 4\}$, $d = 1, 2$, whereas the 2 degrading power lines, $d = 3, 4$, have three states: $s^d_1 \in \{1, \ldots, S^d_1 = 3\}$. State 1 refers to the AGAN condition, state $S^d_1$ to the failure state, and the states $1 < s^d_1 < S^d_1$ to degraded states in ascending order. For each load, we consider 3 states of increasing power demand, $s^p_1 \in \{1, \ldots, S^p_1 = 3\}$, $p = 5, 6$. Three states of increasing power production are associated to the Renewable Energy Sources, $s^p_1 \in \{1, \ldots, S^p_1 = 3\}$, $p = 7, 8$.
Then, the state vector at time $t$ reads:

$$\mathbf{S}(t) = \left[ s^1_1, s^2_1, s^3_1, s^4_1, s^5_1, s^6_1, s^7_1, s^8_1 \right]$$

Space $\mathcal{S}$ is made up of 11664 points.
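As a quick check of the state-space size quoted above, the following sketch (illustrative, not the authors' code) enumerates the Cartesian product of the component state sets of the case study:

```python
from itertools import product

# States per component: 2 generators (4 each), 2 degrading lines (3 each),
# 2 loads (3 each) and 2 renewable sources (3 each), as listed in Section 3.1.
state_cardinalities = [4, 4, 3, 3, 3, 3, 3, 3]

state_space = list(product(*[range(1, n + 1) for n in state_cardinalities]))
print(len(state_space))   # 11664 states
print(state_space[0])     # (1, 1, 1, 1, 1, 1, 1, 1): all components AGAN / lowest level
```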
The agent can operate both generators to maximize the system revenues by minimizing the unbalance between demand and production, while preserving the structural and functional integrity of the system. Then, $g \in G = \{1, 2\}$ and $\varrho = 1, \ldots, |G| = 2$. Being in this case the subscript $\varrho = g$, it can be omitted.
Notice that other actions can be performed by other agents on other components (e.g. transmission lines). These are assumed not to be under the agent's control and, thus, are included in the environment. Then, the action vector reads $\mathbf{a} = [a_1, a_2]$, whereas $A_g = \{1, \ldots, 5\}$, $g \in \{1, 2\}$, and $|\mathcal{A}| = 25$. This gives rise to 291600 state-action pairs. For each generator, the first 3 (operational) actions concern the power output, which can be set to one out of the three allowed levels. The
last 2 actions are preventive and corrective maintenance actions, respectively.
CM is mandatory for failed generators.
Highly degraded generators (i.e. $s^d_1 = 3$, $d = 1, 2$) can be operated at the lowest power output level only (action $a_g = 1$).
Tables 1-3 display, respectively, the costs for each action and the corresponding
power output of the generators, the line electric parameters and the relation
between states $s^p_1$ and the power variable settings.
Table 1: The power output of the 2 generators in [MW] associated to the 5 available actions
and action costs in monetary unit [m.u.].
Action: 1 2 3 4 5
Pg=1 [MW] 40 50 100 0 0
Pg=2 [MW] 50 60 120 0 0
Ca,g [m.u.] 0 0 0 10 500
Table 2: The transmission lines ampacity and reactance proprieties.
From To Ampacity [A] X [Ω]
Gen 1 Load 1 125 0.0845
Gen 1 Load 2 135 0.0719
Gen 1 Gen 2 135 0.0507
Load 1 Gen 2 115 0.2260
Load 2 Gen 2 115 0.2260
Table 3: The physical values of the power settings in [MW] associated to each state $s^p_1$ of component $p \in P$.
State index $s^p_1$ 1 2 3
p= 5 Demanded [MW] 60 100 140
p= 6 Demanded [MW] 20 50 110
p= 7 Produced [MW] 0 20 30
p= 8 Produced [MW] 0 20 60
3.2. Probabilistic model
We assume that the two loads have identical transition probability matrices
and also the degradation of the transmission cables and generators are described
by the same Markov process. Thus, for ease of notation, the components sub-
scripts have been dropped.
Each action $\mathbf{a} \in \mathcal{A}$ is associated to a specific transition probability matrix $P^a_g$, describing the evolution of the generator health state conditioned on its operative state or maintenance action.
The transition matrices for the considered features are reported in the Appendix. Notice that the probabilities associated to the operational actions, namely $a_g = 1, 2, 3$, affect the degradation of the component differently. Moreover, for those actions, the bottom row, corresponding to the failed state, has only empty entries, indicating that operational actions cannot be taken on failed generators, as only CM is allowed.
3.3. Reward model
The reward is made up of four different contributions: (1) the revenue from
selling electric power, (2) the cost of producing electric power by traditional
generators, (3) the cost associated to the performed actions and (4) the cost of
not serving energy to the customers. Mathematically, the reward reads:
$$R(\mathbf{S}_t, \mathbf{a}_t, \mathbf{S}_{t+1}) = \sum_{p=5}^{6} \left( L_p(t) - ENS_p(t) \right) \cdot \Delta t \cdot C_{el} - \sum_{g=1}^{2} P_g \cdot C_g - \sum_{g=1}^{2} C_{a,g} - \sum_{p=5}^{6} ENS_p(t) \cdot C_{ENS}$$

where $L_p$ is the power demanded by element $p$, $C_{el}$ is the price paid by the loads for buying a unit of electric power, $P_g$ is the power produced by generator $g$, $C_g$ is the cost of producing a unit of power, $C_{a,g}$ is the cost of action $a_g$ on generator $g$, $\Delta t = 1\,$h is the time difference between the present and the next system state, and $ENS_p$ is the energy not supplied to load $p$; the latter is a function of the grid state $\mathbf{S}$, of the grid electrical properties and of the availability $M$, i.e. $ENS(t) = G(\mathbf{S}, M)$, where $G$ denotes the constrained DC power flow solver ([35], see Figure 2). $C_{ENS}$ is the cost of the energy not supplied.
Costs $C_{ENS}$, $C_g$ and $C_{el}$ are set to 5, 4 and 0.145 monetary units (m.u.) per unit of energy or power, respectively. These values are for illustration only.
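A minimal sketch of this reward model is reported below (illustrative, not the authors' code): the energy not supplied is assumed to be returned by an external DC power-flow routine and is passed in as an argument; prices, action costs and generator outputs are those quoted above and in Table 1.

```python
# Cost constants of Section 3.3 (monetary units per unit of energy or power)
C_ENS, C_G, C_EL = 5.0, 4.0, 0.145
DT = 1.0                                    # 1 hour between consecutive decisions

# Action costs and generator power outputs from Table 1 (actions 1..5)
ACTION_COST = [0.0, 0.0, 0.0, 10.0, 500.0]
P_GEN = {1: [40.0, 50.0, 100.0, 0.0, 0.0],  # generator 1 [MW]
         2: [50.0, 60.0, 120.0, 0.0, 0.0]}  # generator 2 [MW]

def reward(loads_mw, ens_mwh, actions):
    """Reward of Section 3.3: sales revenue - production cost - action cost - ENS cost.

    loads_mw : power demanded by the two loads [MW]
    ens_mwh  : energy not supplied to each load [MWh], from the DC power-flow solver
    actions  : action index (1..5) selected for each generator
    """
    revenue = sum((load - ens) * DT * C_EL for load, ens in zip(loads_mw, ens_mwh))
    production_cost = sum(P_GEN[g][a - 1] * C_G for g, a in zip((1, 2), actions))
    action_cost = sum(ACTION_COST[a - 1] for a in actions)
    ens_cost = sum(ens * C_ENS for ens in ens_mwh)
    return revenue - production_cost - action_cost - ens_cost

# Example: both loads at the medium level of Table 3, no curtailment, both generators at level 1
print(reward(loads_mw=(100.0, 50.0), ens_mwh=(0.0, 0.0), actions=(1, 1)))
```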
4. Results and discussions
The developed algorithm (pseudo-code 1 in the Appendix) provides a non-tabular solution to the stochastic control problem, which is compared to the reference Bellman's optimality (pseudo-code 2 in the Appendix). The algorithm runs for $N_{ei} = 10^4$ episodes with truncation window $T = 20$, initial learning rate $\alpha_0 = 0.02$, initial exploration rate $\epsilon_0 = 0.9$ and decay coefficient $K_\alpha = 10^2$. The learning agent is composed of 25 fully-connected ANNs with architecture defined by $N_{layers} = [8, 10, 5, 1]$, that is: 1 input layer with 8 neurons, 1 output layer with 1 neuron and 2 hidden layers with 10 and 5 neurons, respectively.
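For concreteness, a sketch of this configuration (assuming a plain dense-layer weight initialization; not the authors' code) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hyperparameters reported above
CONFIG = dict(n_episodes=10_000, horizon=20, alpha0=0.02, eps0=0.9, k_alpha=100)

# One fully connected network per action: 8 inputs -> 10 -> 5 -> 1 output
N_LAYERS = [8, 10, 5, 1]
N_ACTIONS = 25

def init_weights(layers):
    """Weight/bias pairs for a fully connected network with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(layers[:-1], layers[1:])]

ensemble = [init_weights(N_LAYERS) for _ in range(N_ACTIONS)]
print(len(ensemble), [w.shape for w, _ in ensemble[0]])   # 25 [(10, 8), (5, 10), (1, 5)]
```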
The results of the analysis are summarized in the top panel in Figure 4, where
the curves provide a compact visualization of the distribution of Qπ(S,a) over
the states, for the available 25 combinations of actions. For comparison, the
reference optimal action-value function is displayed in the bottom panel. The
results of the two algorithms are in good agreement, although minor inevitable
approximation errors can be observed for some of the state-action pairs. Three
clusters can be identified: on the far left, we find the set of states for which
CM on both generators is performed; being CM a costly action, this leads to
a negative expectation of the discounted reward. The second cluster (C 2 )
corresponds to the 8 possible combinations of one CM and any other action on
the operating generator. The final cluster (C 1 ) of 16 combinations of actions
includes only PM and operational actions. If corrective maintenance is not
performed, higher rewards are expected.
Figure 4: The Q(s, a) values displayed using ECDFs and the 3 clusters (no corrective maintenance; corrective maintenance on one generator; corrective maintenance on both generators). Comparison between the reference Bellman's solution (bottom plot) and the QL+ANN solution (top plot).
Figure 5: The maximum expected reward, $\hat{q}_a(\mathbf{S}|\mu_a)$, for increasing total load and different degradation conditions of the generators.
In Figure 5, each sub-plot shows the maximum expected discounted reward given by the policy found by Algorithm 1, conditional to a specific degradation state of the generators and for increasing electric load demands. It can be noticed that when the generators are both healthy or slightly degraded (i.e. $\sum_{d=1}^{2} s^d_1 = 2, 3, 4$), an increment in the overall power demand entails an increment in the expected reward, due to the larger revenues from selling more electric energy to the customers (dashed lines display the average trends). On the other hand, if the generators are highly degraded or failed (i.e. $\sum_{d=1}^{2} s^d_1 = 7, 8$), an increment in the load demand leads to a drop in the expected revenue. This is due to the increasing risk of load curtailments and the associated cost (i.e. cost of energy not supplied), and to the impacting PM and CM action costs. Similar results can be obtained by solving the Bellman's optimality (e.g. see [36]).
To compare the $Q$ values obtained from Algorithm 1 to the Bellman's reference, a convergence plot for 3 states is provided in Figure 6. Every state is representative of one of the 3 clusters C 1, C 2 and C 3 (see Figure 4): $\mathbf{S}_1 = [1,1,1,1,1,1,1,1]$ has both generators in the AGAN state, $\mathbf{S}_2 = [4,1,1,1,1,1,1,1]$ has one generator out of service, while $\mathbf{S}_3 = [4,4,3,3,3,3,3,3]$ has both generators failed. Figure 6 also reports the corresponding reference Bellman's solutions (dashed lines): their closeness indicates that the Reinforcement Learning algorithm converges to the true optimal policy.
4.1. Policies comparison
Figure 6: The convergence of $\max_{a \in \{1,\ldots,|\mathcal{A}|\}} \hat{q}_a(\mathbf{S}|\mu_a)$ over the training episodes for 3 representative system states (i.e. generators fully operative, partially failed/degraded and fully failed); dashed lines report the corresponding reference Bellman's (MDP) values.

Table 4 compares the results obtained from the developed Reinforcement Learning algorithm with the Bellman's optimality and two synthetic policies. The first suboptimal policy is named Qrnd, in which actions are randomly selected. This comparison is used as a reference worst case, as it is the policy that a non-expert decision maker would implement on the Power Grid. The second
synthetic policy, named Qexp , is based on experience: the agent tries to keep the
balance between loads and production by optimally setting the power output
of the generators. However, he/she will never take PM actions. This reference
policy is that which an agent not informed on the health state of the compo-
nents would apply on the Power Grid elements.
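The rule can be sketched as follows (an illustrative reconstruction of the description above, not the authors' code; the even split of the residual demand between the two generators is an assumption): each generator picks the operational power level that best matches the demand not covered by the renewables, takes CM when failed, is forced to the lowest level when highly degraded, and never takes PM.

```python
# Generator power levels for operational actions 1..3 (Table 1) [MW]
P_LEVELS = {1: (40.0, 50.0, 100.0), 2: (50.0, 60.0, 120.0)}
CM_ACTION, FAILED_STATE, HIGHLY_DEGRADED = 5, 4, 3

def expert_policy(gen_states, total_demand_mw, renewable_mw):
    """Qexp: balance load and production with operational actions only; never choose PM."""
    residual = total_demand_mw - renewable_mw          # power the generators should cover
    actions = []
    for g in (1, 2):
        state = gen_states[g - 1]
        if state == FAILED_STATE:
            actions.append(CM_ACTION)                  # corrective maintenance is mandatory
        elif state == HIGHLY_DEGRADED:
            actions.append(1)                          # only the lowest output level is allowed
        else:
            # operational level minimizing the absolute unbalance (each generator covers half)
            best = min(range(3), key=lambda i: abs(P_LEVELS[g][i] - residual / 2.0))
            actions.append(best + 1)
    return tuple(actions)

print(expert_policy(gen_states=(1, 1), total_demand_mw=160.0, renewable_mw=20.0))  # (2, 2)
```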
Table 4 shows on the first row the Normalized Root Mean Squared Error
(NRMSE, i.e., the error averaged over all the state action pairs and normalized
over the min-max range of the Bellman’s Q) between the considered policies
and the Bellman’s reference Q.
In the next rows, Table 4 shows the average non-discounted return $\overline{R(t)} = \frac{1}{T}\sum_{t=1}^{T} R(t)$, independent of the initial state of the system, its standard deviation $\sigma[R(t)]$, the average value of the energy not supplied, $\overline{ENS}$, and the corresponding standard deviation $\sigma[ENS]$. These values have been obtained by Monte Carlo, simulating the system operated according to the four considered policies.
We can see that the Reinforcement Learning policy yields negative values of the average energy-not-supplied (about -45.2 MW), which are smaller than those of the reference Bellman's solution (-6.7 MW). This indicates that both the Bellman's and the Reinforcement Learning policies yield an overproduction of electric power. However, the reward of the Bellman's solution is larger, due to the closer balance between load and production and, thus, the lower costs associated to the overproduction.
Concerning the expert-based policy Qexp, it behaves quite well in terms of average ENS, with results comparable to the Bellman's optimality. On the other hand, the resulting Q and R(t) are both smaller than those of the Bellman's policy and the Reinforcement Learning policy. This is due to the increased occurrence frequency of CM actions and the associated costs. The random policy produces sensibly worse results for both ENS and rewards.
To further explain these results, we can look at Table 5. For the four considered
policies, the panels report the frequency of selection of the 5 actions available
for the generators, conditional to their degradation state: the Bellman’s policy
in the top panel, left-hand side, the Reinforcement Learning policy in the top
panel, right-hand side, the suboptimal random policy in the bottom panel, left-
hand side, and the expert-based policy in the bottom panel, right-hand side. In
each panel, the first 4 rows refer to the possible degradation states of Gen 1,
whilst the last 4 rows show the results for Gen 2.
With respect to the Bellman's solution, it can be observed that when Gen 1 is nearly failed ($s^1_1 = 3$), it undergoes PM for the vast majority of the scenarios (80.9 % of the states). Conversely, when Gen 2 is nearly failed ($s^2_1 = 3$), the optimal policy is more inclined to keep it operating (54.3 % of the scenarios) rather than perform a PM (45.7 %). This means that in the states for which $s^2_1 = 3$, the agent is ready to: (1) take the risk of facing a failure and (2) have the generator forcefully operated at the minimum power regime. This difference in the operation of the two generators can be explained by considering the specific topology of the system, the inherent asymmetry in the load, renewable and controllable generator capacities, and the PHM devices, which are not uniformly allocated on the grid.
In terms of action preferences, the Reinforcement Learning policy presents some similarities and differences when compared to the Bellman's one. In particular,
given a nearly failed state for Gen 1, this is more likely to undergo PM (20.4
% of the times) if compared to Gen 2 (only 14.1 %). This is in line with the
results of the Bellman’s policy. However, a main difference can be pointed out:
following the Reinforcement Learning policy, PM actions are taken less fre-
quently, with a tendency to keep operating the generators. This is reflected in
the rewards, which are slightly smaller. Nonetheless, the Reinforcement Learn-
ing policy tends to optimality and greatly outperforms the random policy, as
expected, and also presents an improvement with respect to the expert-based
solution to the optimization problem. This gives evidence of the benefit of PHM
on the Power Grid.
As expected, the action selection frequencies of the randomized policy do not depend on the states of the generators, and PM actions are never selected by the expert-based policy, as imposed when it was artificially generated.
One main drawback of the developed algorithm is that it is computationally quite intensive (approximately 14 hours of calculations on a standard machine, last row in Table 4). This is due to the many ANN trainings, which have to be repeated for each reward observation. However, its strength lies in its applicability to high-dimensional problems with continuous states. Furthermore, its effectiveness has been demonstrated by showing that the derived optimal policy greatly outperforms an alternative non-optimal strategy, with expected rewards comparable to the true optimality. Further work will be dedicated to reducing the computational time needed for the analysis, possibly introducing time-saving training algorithms and robust and efficient regression tools.
Table 4: Comparison between the policy derived from the QL+ANN Algorithm 1, a synthetic
non-optimal random policy, an expert-based policy and the reference Bellman’s optimality
Policy πBellman’s QL+ANN Qrnd Qexp
NRMSE 0 0.083 0.35 0.11
R(t) 532.9 439.1 260.3 405.2
σ[R(t)] 347.5 409.3 461.6 412.2
EN S -6.71 -45.22 15.16 -8.1
σ[ENS] 71.2 75.8 80.9 66.2
Comp. time [s] 17.3e4 5e4 - -
5. Discussion, limitations and future directions
The proposed approach has been implemented in the open source computational framework OpenCossan [] and has been tested on a scaled-down power grid case study with discrete states and a relatively small number of actions. This was a first necessary step to prove the effectiveness of the method by comparison with a true optimal solution (i.e., the Bellman's optimal solution). It is worth remarking that RL cannot learn from direct interaction with the environment, as this would require unprofitably operating a large number of systems. Then, a realistic simulator of the state evolution depending on the actions taken is required. This does not seem a limiting point in the Industry 4.0 era, when digital twins are more and more common and refined. Future research efforts will be devoted to testing the proposed framework on numerical models of complex systems (for which the reference Bellman's solution is not obtainable); an application to empirical data collected from real-world systems is also expected.
6. Conclusion
A Reinforcement Learning framework for the optimal O&M of power grid systems under uncertainty is proposed. A method combining the Q-learning algorithm with an ensemble of Artificial Neural Networks is developed, which is applicable to large systems with high-dimensional state-action spaces. An analytic (Bellman's) solution is provided for a scaled-down power grid, which includes Prognostics and Health Management devices, renewable generators and degrading components, giving evidence that Reinforcement Learning can really exploit the information gathered from Prognostics and Health Management devices, which helps to select optimal O&M actions on the system components. The proposed strategy provides accurate solutions, comparable to the true optimum. Although inevitable approximation errors have been observed and the computational time is an open issue, it provides useful directions for the system operator. In fact, he/she can now discern whether a costly repair action is likely to lead to a long-term economic gain or whether it is more convenient to delay the maintenance.
Table 5: Decision-maker action preferences. Percentage of actions taken on the generators conditional to their degradation state (following the Bellman's policy, the Reinforcement Learning policy, the sub-optimal random policy and the expert-based policy).

Bellman's policy | Reinforcement Learning policy
a1 = 1 2 3 4 5 | 1 2 3 4 5
s^1_1 = 1: 24.3 7.4 58 10.2 0 | 7.5 20.5 71.5 0.65 0
s^1_1 = 2: 28.2 6.4 65.4 0 0 | 0.6 29.4 69.4 0.6 0
s^1_1 = 3: 19.1 0 0 80.9 0 | 79.6 0 0 20.4 0
s^1_1 = 4: 0 0 0 0 100 | 0 0 0 0 100
a2 = 1 2 3 4 5 | 1 2 3 4 5
s^2_1 = 1: 38.9 8.6 45 7.4 0 | 2.7 27.6 69.6 0 0
s^2_1 = 2: 36.1 11.4 52.5 0 0 | 2.4 24.3 72.9 0.3 0
s^2_1 = 3: 54.3 0 0 45.7 0 | 85.9 0 0 14.1 0
s^2_1 = 4: 0 0 0 0 100 | 0 0 0 0 100

Randomized policy | Expert-based policy
a1 = 1 2 3 4 5 | 1 2 3 4 5
s^1_1 = 1: 25.6 25.2 24.6 24.3 0 | 0 37 63 0 0
s^1_1 = 2: 23.8 25.3 25 25.9 0 | 0 37 63 0 0
s^1_1 = 3: 52.1 0 0 47.9 0 | 100 0 0 0 0
s^1_1 = 4: 0 0 0 0 100 | 0 0 0 0 100
a2 = 1 2 3 4 5 | 1 2 3 4 5
s^2_1 = 1: 24.6 24.9 25.6 24.7 0 | 76 2.4 21.6 0 0
s^2_1 = 2: 24.5 25.1 24.9 25.4 0 | 76.6 1.8 21.6 0 0
s^2_1 = 3: 50.4 0 0 49.6 0 | 100 0 0 0 0
s^2_1 = 4: 0 0 0 0 100 | 0 0 0 0 100
References
[1] J. Dai, Z. D. Wang, P. Jarman, Creepage discharge on insulation barriers in
aged power transformers, IEEE Transactions on Dielectrics and Electrical
Insulation 17 (4) (2010) 1327–1335. doi:10.1109/TDEI.2010.5539705.
[2] R. Goyal, B. K. Gandhi, Review of hydrodynamics instabilities in francis
turbine during off-design and transient operations, Renewable Energy 116
(2018) 697 – 709. doi:https://doi.org/10.1016/j.renene.2017.10.
012.
URL http://www.sciencedirect.com/science/article/pii/
S0960148117309734
[3] H. Aboshosha, A. Elawady, A. E. Ansary, A. E. Damatty, Review on
dynamic and quasi-static buffeting response of transmission lines under
synoptic and non-synoptic winds, Engineering Structures 112 (2016) 23 –
46. doi:https://doi.org/10.1016/j.engstruct.2016.01.003.
URL http://www.sciencedirect.com/science/article/pii/
S0141029616000055
[4] M. Compare, L. Bellani, E. Zio, Optimal allocation of prognostics
and health management capabilities to improve the reliability of
a power transmission network, Reliability Engineering & System
Safety. doi:https://doi.org/10.1016/j.ress.2018.04.025.
URL http://www.sciencedirect.com/science/article/pii/
S0951832017306816
[5] M. Papageorgiou, C. Diakaki, V. Dinopoulou, A. Kotsialos, Y. Wang, Re-
view of road traffic control strategies, Proceedings of the IEEE 91 (12)
(2003) 2043–2067. doi:10.1109/JPROC.2003.819610.
[6] J. Jin, X. Ma, Hierarchical multi-agent control of traffic lights based on col-
lective learning, Engineering Applications of Artificial Intelligence 68 (2018)
236 – 248. doi:https://doi.org/10.1016/j.engappai.2017.10.013.
URL http://www.sciencedirect.com/science/article/pii/
S0952197617302658
[7] R. Yousefian, R. Bhattarai, S. Kamalasadan, Transient stability enhance-
ment of power grid with integrated wide area control of wind farms and
synchronous generators, IEEE Transactions on Power Systems 32 (6) (2017)
4818–4831. doi:10.1109/TPWRS.2017.2676138.
[8] M. S. Mahmoud, N. M. Alyazidi, M. I. Abouheaf, Adaptive intelli-
gent techniques for microgrid control systems: A survey, International
Journal of Electrical Power & Energy Systems 90 (2017) 292 – 305.
doi:https://doi.org/10.1016/j.ijepes.2017.02.008.
URL http://www.sciencedirect.com/science/article/pii/
S0142061516325042
[9] E. Kuznetsova, Y.-F. Li, C. Ruiz, E. Zio, G. Ault, K. Bell, Reinforcement
learning for microgrid energy management, Energy 59 (2013) 133 – 146.
doi:https://doi.org/10.1016/j.energy.2013.05.060.
URL http://www.sciencedirect.com/science/article/pii/
S0360544213004817
[10] M. Compare, P. Marelli, P. Baraldi, E. Zio, A markov decision process
framework for optimal operation of monitored multi-state systems, Pro-
ceedings of the Institution of Mechanical Engineers Part O Journal of Risk
and Reliability.
[11] R. S. Sutton, D. Precup, S. Singh, Between mdps and semi-mdps:
A framework for temporal abstraction in reinforcement learn-
ing, Artificial Intelligence 112 (1) (1999) 181 – 211. doi:https:
//doi.org/10.1016/S0004-3702(99)00052-1.
URL http://www.sciencedirect.com/science/article/pii/
S0004370299000521
[12] C. Szepesvari, Algorithms for Reinforcement Learning, Morgan and Clay-
pool Publishers, 2010.
[13] T. I. Ahamed, P. N. Rao, P. Sastry, A reinforcement learning approach
to automatic generation control, Electric Power Systems Research 63 (1)
(2002) 9 – 26. doi:https://doi.org/10.1016/S0378-7796(02)00088- 3.
URL http://www.sciencedirect.com/science/article/pii/
S0378779602000883
[14] E. A. Jasmin, T. P. I. Ahamed, V. P. Jagathy Raj, Reinforcement learning solution for unit commitment problem through pursuit method, (2009) 324–327.
[15] M. Glavic, D. Ernst, L. Wehenkel, A reinforcement learning based discrete
supplementary control for power system transient stability enhancement,
in: Engineering Intelligent Systems for Electrical Engineering and Commu-
nications, 2005, pp. 1–7.
[16] R. Lu, S. H. Hong, X. Zhang, A dynamic pricing demand re-
sponse algorithm for smart grid: Reinforcement learning ap-
proach, Applied Energy 220 (2018) 220 – 230. doi:https:
//doi.org/10.1016/j.apenergy.2018.03.072.
URL http://www.sciencedirect.com/science/article/pii/
S0306261918304112
[17] E. Jasmin, T. I. Ahamed, V. J. Raj, Reinforcement learning ap-
proaches to economic dispatch problem, International Journal of
Electrical Power & Energy Systems 33 (4) (2011) 836 – 845.
doi:https://doi.org/10.1016/j.ijepes.2010.12.008.
URL http://www.sciencedirect.com/science/article/pii/
S014206151000222X
[18] V. Nanduri, T. K. Das, A reinforcement learning model to assess market
power under auction-based energy pricing, IEEE Transactions on Power
Systems 22 (1) (2007) 85–95. doi:10.1109/TPWRS.2006.888977.
[19] J. G. Vlachogiannis, N. D. Hatziargyriou, Reinforcement learning for re-
active power control, IEEE Transactions on Power Systems 19 (3) (2004)
1317–1325. doi:10.1109/TPWRS.2004.831259.
[20] D. Ernst, M. Glavic, F. Capitanescu, L. Wehenkel, Reinforcement learning
versus model predictive control: A comparison on a power system problem,
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernet-
ics) 39 (2) (2009) 517–529. doi:10.1109/TSMCB.2008.2007630.
[21] J. R. Vázquez-Canteli, Z. Nagy, Reinforcement learning for de-
mand response: A review of algorithms and modeling tech-
niques, Applied Energy 235 (2019) 1072 – 1089. doi:https:
//doi.org/10.1016/j.apenergy.2018.11.002.
URL http://www.sciencedirect.com/science/article/pii/
S0306261918317082
[22] H. Li, T. Wei, A. Ren, Q. Zhu, Y. Wang, Deep reinforcement learning:
Framework, applications, and embedded implementations: Invited paper,
in: 2017 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), 2017, pp. 847–854. doi:10.1109/ICCAD.2017.8203866.
[23] J. Wu, H. He, J. Peng, Y. Li, Z. Li, Continuous reinforcement
learning of energy management with deep q network for a power
split hybrid electric bus, Applied Energy 222 (2018) 799 – 811.
doi:https://doi.org/10.1016/j.apenergy.2018.03.104.
URL http://www.sciencedirect.com/science/article/pii/
S0306261918304422
[24] R. Lu, S. H. Hong, Incentive-based demand response for smart grid with re-
inforcement learning and deep neural network, Applied Energy 236 (2019)
937 – 949. doi:https://doi.org/10.1016/j.apenergy.2018.12.061.
URL http://www.sciencedirect.com/science/article/pii/
S0306261918318798
[25] T. Sogabe, D. B. Malla, S. Takayama, S. Shin, K. Sakamoto, K. Yamaguchi,
T. P. Singh, M. Sogabe, T. Hirata, Y. Okada, Smart grid optimization
by deep reinforcement learning over discrete and continuous action space,
in: 2018 IEEE 7th World Conference on Photovoltaic Energy Conversion
(WCPEC) (A Joint Conference of 45th IEEE PVSC, 28th PVSEC 34th
EU PVSEC), 2018, pp. 3794–3796. doi:10.1109/PVSC.2018.8547862.
[26] J. Zhang, C. Lu, J. Si, J. Song, Y. Su, Deep reinforcement learning for short-term voltage control by dynamic load shedding in China Southern Power Grid, in: 2018 International Joint Conference on Neural Networks
(IJCNN), 2018, pp. 1–8. doi:10.1109/IJCNN.2018.8489041.
[27] D. O’Neill, M. Levorato, A. Goldsmith, U. Mitra, Residential demand re-
sponse using reinforcement learning, in: 2010 First IEEE International
Conference on Smart Grid Communications, 2010, pp. 409–414. doi:
10.1109/SMARTGRID.2010.5622078.
[28] M. Glavic, R. Fonteneau, D. Ernst, Reinforcement learning for electric
power system decision and control: Past considerations and perspectives,
IFAC-PapersOnLine 50 (1) (2017) 6918 – 6927, 20th IFAC World Congress.
doi:https://doi.org/10.1016/j.ifacol.2017.08.1217.
URL http://www.sciencedirect.com/science/article/pii/
S2405896317317238
[29] F. Cannarile, M. Compare, P. Baraldi, F. Di Maio, E. Zio, Homogeneous
continuous-time, finite-state hidden semi-markov modeling for enhancing
empirical classification system diagnostics of industrial components, Ma-
chines 6. doi:10.3390/machines6030034.
[30] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2017.
[31] M. Riedmiller, Neural fitted q iteration–first experiences with a data effi-
cient neural reinforcement learning method, in: European Conference on
Machine Learning, Springer, 2005, pp. 317–328.
[32] B. D. Ripley, Pattern recognition and neural networks, Cambridge univer-
sity press, 2007.
[33] S. S. Haykin, Neural networks and learning machines, Vol. 3, Pearson, Upper Saddle River, NJ, USA, 2009.
[34] J. Frank, S. Mannor, D. Precup, Reinforcement learning in the presence of
rare events, in: Proceedings of the 25th international conference on Machine
learning, ACM, 2008, pp. 336–343.
[35] R. Rocchetta, E. Patelli, Assessment of power grid vulnerabilities
accounting for stochastic loads and model imprecision, International
Journal of Electrical Power & Energy Systems 98 (2018) 219 – 232.
doi:https://doi.org/10.1016/j.ijepes.2017.11.047.
URL http://www.sciencedirect.com/science/article/pii/
S0142061517313571
[36] R. Rocchetta, M. Compare, E. Patelli, E. Zio, A reinforcement learning
framework for optimisation of power grid operations and maintenance, in:
Reliable engineering computing, REC 2018, 2018.
[37] E. Gross, On the Bellman's principle of optimality, Physica A:
Statistical Mechanics and its Applications 462 (2016) 217 – 221.
doi:https://doi.org/10.1016/j.physa.2016.06.083.
URL http://www.sciencedirect.com/science/article/pii/
S037843711630351X
Appendix 1
Formally, an MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, R, P \rangle$, where $\mathcal{S}$ is a finite state set, $\mathcal{A}(s)$ is a finite action set for $s \in \mathcal{S}$, $R$ is a reward function such that $R(s, a) \in \mathbb{R}$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$, and $P$ is a probability function mapping the state-action space:

$$P_{s,a,s'} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$$

A specific policy $\pi$ is defined as a map from the state space to the action space, $\pi : \mathcal{S} \mapsto \mathcal{A}$ with $\pi(s) \in \mathcal{A}(s)$, $\forall s \in \mathcal{S}$, and it belongs to the set of possible policies $\Pi$. The action-value function $Q^{\pi}(s, a)$ is mathematically defined as [30]:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \,\middle|\, S_0 = s, A_0 = a \right], \quad \forall s \in \mathcal{S}$$

where $\gamma \in [0, 1]$ is the discount factor; $\gamma < 1$ is generally employed to avoid divergence of the cumulative rewards, as well as to reflect the fact that in some cases earlier rewards are more valuable than future rewards. The Bellman's optimality equation provides an analytical expression for $Q^{\pi^*}(s, a)$, the action-value function of the optimal policy $\pi^*$. The Bellman's optimality is defined by the following recursive equation [37]-[30]:

$$Q^{\pi^*}(s_t, a_t) = \sum_{s_{t+1}} P(s_{t+1} \,|\, s_t, a_t) \left[ R(s_{t+1}, a_t, s_t) + \gamma \max_{a_{t+1}} Q^{\pi^*}(s_{t+1}, a_{t+1}) \right] \qquad (7)$$

Equation 7 can be solved by Dynamic Programming algorithms, such as policy iteration or value iteration [30].

The QL+ANN Algorithm 1 consists of two phases: (1) an initialization phase of the ANN ensemble and (2) the learning phase, where the Q-learning algorithm is used in combination with the ANNs to learn an optimal decision-making policy. In phase (1), an ANN is associated with each action vector $\mathbf{a}$ and its architecture, i.e. the number of layers and of nodes per layer, is defined by the $N_{layers}$ vector. Each network is first trained using the Levenberg-Marquardt algorithm, providing as input the state vectors and as output the estimator of $Q$ obtained from the future rewards. In phase (2), the Reinforcement Learning algorithm runs: the Artificial Neural Networks select the actions and the ensemble is incrementally trained to improve its predictive performance. Notice that, whilst tabular Reinforcement Learning methods are guaranteed to converge to an optimal action-value function for a Robbins-Monro sequence of step sizes $\alpha_t$, a generalized convergence guarantee for non-tabular methods has not been provided yet, and an inadequate setup can lead to suboptimal, oscillating or even diverging solutions. Thus, an empirical convergence test has been designed to assess the reliability of the results. For further details, please refer to [30].
Appendix 2
Algorithm 1 The QL+ANN Algorithm.

Set ei = 1 and the parameters n_ei, N_ei, K_α, ε_0, α_0, γ, N_layers;

Phase 1: Off-line training
  Initialize networks N_l with architecture N_layers and set t_l = 1, l = 1, ..., |A|;
  Sample transitions S_t, a_t → S_{t+1}, a_{t+1} and observe the rewards R_t, t = 1, ..., n_ei;
  Approximate Q by the MC estimate Y_t = Σ_{t'=t}^{t+Φ} γ^(t'-t) · R_{t'};
  Train each N_l using {S_t | t = 1, ..., n_ei, a_t = l} as input and the estimated Y_t as output;

Phase 2: Learning
  while ei < N_ei do                                  (episodic loop)
    Set t = 1 and initialize the state S_t randomly
    ε = ε_0 · τ^ei
    while t < T do                                    (episode run)
      if rand() < 1 − ε then                          (exploit)
        a_t = argmax_{l ∈ Â_{g_ϱ}(S_t)} q̂_l(S_t | µ_l)
      else                                            (explore)
        select a_t randomly such that a_t ∈ Â_{g_ϱ}(S_t)
      end if
      Take action a_t, observe S_{t+1} and the reward R_t
      Update the weights of network N_{a_t} and its learning rate:
        µ_{a_t} ← µ_{a_t} + α_{a_t} · [R_t + γ · max_{l ∈ {1,...,|A|}} q̂_l(S_{t+1} | µ_l) − q̂_{a_t}(S_t | µ_{a_t})] · ∇ q̂_{a_t}(S_t | µ_{a_t})
        α_{a_t} = α_0 · (1 / (1 + K_α · t_{a_t}))
      Set t = t + 1 and t_{a_t} = t_{a_t} + 1
    end while
    Go to the next episode: ei = ei + 1
  end while
The transition probability matrices of the case study are reported below; dashes denote state-action combinations that are not allowed.

$$P^{a_d=1}_{d} = \begin{bmatrix} 0.98 & 0.02 & 0 & 0 \\ 0 & 0.95 & 0.05 & 0 \\ 0 & 0 & 0.9 & 0.1 \\ - & - & - & - \end{bmatrix}, \quad P^{a_d=2}_{d} = \begin{bmatrix} 0.97 & 0.03 & 0 & 0 \\ 0 & 0.95 & 0.05 & 0 \\ - & - & - & - \\ - & - & - & - \end{bmatrix}, \quad d = 1, 2$$

$$P^{a_d=3}_{d} = \begin{bmatrix} 0.95 & 0.04 & 0.01 & 0 \\ 0 & 0.95 & 0.04 & 0.01 \\ - & - & - & - \\ - & - & - & - \end{bmatrix}, \quad P^{a_d=4}_{d} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 \\ 0.5 & 0 & 0 & 0.5 \\ - & - & - & - \end{bmatrix}, \quad d = 1, 2$$

$$P^{a_d=5}_{d} = \begin{bmatrix} - & - & - & - \\ - & - & - & - \\ - & - & - & - \\ 0.15 & 0 & 0 & 0.85 \end{bmatrix}, \quad d = 1, 2$$

$$P^{a}_{d} = \begin{bmatrix} 0.9 & 0.08 & 0.02 \\ 0 & 0.97 & 0.03 \\ 0.1 & 0 & 0.9 \end{bmatrix}, \; \forall a, \; d = 3, 4 \qquad P^{a}_{p} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.3 & 0.3 & 0.4 \\ 0.2 & 0.4 & 0.4 \end{bmatrix}, \; \forall a, \; p = 5, 6$$

$$P^{a}_{7} = \begin{bmatrix} 0.5 & 0.1 & 0.4 \\ 0.3 & 0.3 & 0.4 \\ 0.1 & 0.4 & 0.5 \end{bmatrix}, \; \forall a \qquad P^{a}_{8} = \begin{bmatrix} 0.5 & 0.2 & 0.3 \\ 0.4 & 0.4 & 0.2 \\ 0 & 0.5 & 0.5 \end{bmatrix}, \; \forall a$$
Algorithm 2 The value iteration algorithm (Bellman's optimality)

Initialize Q arbitrarily (e.g. Q(s, a) = 0, ∀s ∈ S, a ∈ A)
Define the tolerance error θ ∈ R+
repeat
  Set ∆ = 0
  for each s ∈ S do
    get the constrained action set A_s in s
    for each a ∈ A_s do
      q = Q(s, a)
      Q(s, a) = Σ_{s'} P(s' | s, a) [ R(s', a, s) + γ max_{a'} Q(s', a') ]
      ∆ = max(∆, |q − Q(s, a)|)
    end for
  end for
until ∆ < θ
Output the deterministic policy π ≈ π*:
  π(s) = argmax_{a ∈ A_s} Q(s, a), ∀s ∈ S
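A compact runnable sketch of Algorithm 2 for a generic finite MDP is given below (illustrative; the dense transition/reward arrays, the tolerance and the toy numbers are assumptions, and the constrained action sets are omitted):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, theta=1e-6):
    """Q-value iteration for a finite MDP.

    P[a, s, s']: transition probabilities; R[a, s, s']: rewards.
    Returns the optimal Q table and the greedy (deterministic) policy.
    """
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                          # max_a' Q(s', a')
        # Q(s, a) = sum_s' P(s'|s, a) * [R(s', a, s) + gamma * V(s')]
        Q_new = np.einsum("asn,asn->sa", P, R + gamma * V[None, None, :])
        delta = np.abs(Q_new - Q).max()
        Q = Q_new
        if delta < theta:                          # stop when a full sweep changes Q very little
            return Q, Q.argmax(axis=1)

# Tiny 2-state, 2-action smoke test with hypothetical numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],            # action 0
              [[0.5, 0.5], [0.0, 1.0]]])           # action 1
R = np.ones_like(P)                                # unit reward on every transition
Q, policy = value_iteration(P, R)
print(Q.shape, policy)                             # (2, 2) and one greedy action per state
```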
... According to the indications of AM, network deployment, and system operation, the optimal level of dependability, asset lifecycle, and cost management have typically been found [7,8]. According to the perception of system engineering, AM is mainly utilized to increase marketing strategy, good earnings, strong credibility, and reduced costs. ...
... The RL is distinct from supervised learning in that it does not call for labelled input/output pairings; instead, the agent is awarded or docked based on how it behaves in the environment. Hence, RL [7,48] enables the agent to autonomously determine behaviours that are impossible with supervised or unsupervised learning. Compared to a single ML algorithm, an ensemble of methods is more effective, using multiple ML algorithms to enhance the prediction performance. ...
Article
Full-text available
Power system protection and asset management present persistent technical challenges, particularly in the context of the smart grid and renewable energy sectors. This paper aims to address these challenges by providing a comprehensive assessment of machine learning applications for effective asset management in power systems. The study focuses on the increasing demand for energy production while maintaining environmental sustainability and efficiency. By harnessing the power of modern technologies such as artificial intelligence (AI), machine learning (ML), and deep learning (DL), this research explores how ML techniques can be leveraged as powerful tools for the power industry. By showcasing practical applications and success stories, this paper demonstrates the growing acceptance of machine learning as a significant technology for current and future business needs in the power sector. Additionally, the study examines the barriers and difficulties of large‐scale ML deployment in practical settings while exploring potential opportunities for these tactics. Through this overview, insights into the transformative potential of ML in shaping the future of power system asset management are provided.
... According to the indications of AM, network deployment, and operation of the system, the optimal level of dependability, asset lifecycle, and cost management have typically been found [6,7]. According to the perception of system engineering, AM is mostly utilized to increase marketing strategy, good earnings, strong credibility, and reduced costs. ...
... The RL is distinct from supervised learning in that it does not call for labelled input/output pairings; instead, the agent is awarded or docked based on how it behaves in the environment. Hence, RL [6,44] enables the agent to autonomously determine behaviors that are not possible with supervised or unsupervised learning. Compared to a single ML algorithm, an ensemble of methods is more effective, using multiple ML algorithms to enhance the prediction performance. ...
Preprint
Full-text available
Power system protection and asset management present persistent technical challenges, particularly in the context of the smart grid and renewable energy sectors. This paper aims to address these challenges by providing a comprehensive assessment of machine learning applications for effective asset management in power systems. The study focuses on the increasing demand for energy production while maintaining environmental sustainability and efficiency. By harnessing the power of modern technologies such as Artificial Intelligence (AI), machine learning (ML), and Deep Learning (DL), this research explores how ML techniques can be leveraged as powerful tools for the power industry. By showcasing practical applications and success stories, this paper demonstrates the growing acceptance of machine learning as a significant technology for current and future business needs in the power sector. Additionally, the study examines the barriers and difficulties of large-scale ML deployment in practical settings while exploring potential opportunities for these tactics. Through this overview, we provide insights into the transformative potential of ML in shaping the future of power system asset management.
... This realm of analytics tackles complex questions, ranging from determining the most effective pricing strategies for sales maximisation to designing resilient supply chains for timely deliveries, facilitating a proactive approach to sculpting an organisation's future. The application of prescriptive analytics is grounded in using sophisticated mathematical models and optimisation algorithms (Rocchetta et al., 2019). These models meticulously encapsulate the intricacies of business scenarios by accounting for objectives, constraints, and the multitude of variables at play, providing a structured framework for decisionmaking. ...
... In tandem, simulation techniques enable examining diverse decision outcomes under varied conditions, offering a sandbox for "what-if" analyses. This predictive exploration empowers businesses to weigh different strategies and comprehend the implications of their choices prior to executing real-world decisions (Lin & Tseng, 2004; McIver et al., 2018; Rocchetta et al., 2019). ...
Article
This study investigates the integration of Machine Learning (ML) with Prescriptive Analytics, showcasing how this combination enhances decision-making processes in business. By analyzing contemporary methodologies and practical applications, it delves into how ML algorithms significantly improve the precision, efficiency, and forecasting capabilities of prescriptive analytics. Highlighting case studies across a variety of sectors, the research underscores the competitive edge businesses can gain by adopting these sophisticated analytical tools. Moreover, it addresses the array of technical and organizational hurdles that arise with the implementation of ML-enhanced prescriptive analytics, such as challenges in data handling, system integration, and the demand for specialized skills. Leveraging the latest advancements and insights from experts, the paper offers a compilation of best practices and strategic methodologies to effectively overcome these obstacles. In conclusion, it emphasizes the critical role of continuous innovation in ML and prescriptive analytics, encouraging firms to adopt these cutting-edge technologies to maintain a competitive stance in the fast-evolving, data-centric business landscape.
... Modern technology has allowed automotive manufacturers to make significant strides in improving the safety of both pedestrians and passengers [1]. Power Grids (PGs), essential to modern society since the 19th century, continue to develop amid concerns about fossil fuel depletion, climate change and global warming [2]. Smart grid development incorporates renewable energy sources like solar and wind power, enhancing the economic, sustainable and efficient operation of cutting-edge electric energy systems [3]. ...
Article
A Power Grid (PG) that spans multiple areas through interlinked transmission lines, and can thereby reallocate energy resources on a large scale, is an ever-more crucial architecture for both present and future electrical systems. Preserving system equilibrium and increasing operating earnings largely depend on how the PG dispatches power using a variety of resources. The optimization techniques used to solve this dispatch issue today are not capable of making decisions or optimizing online; instead, they require repeating the entire optimization computation at every dispatch instant. Herein, a novel Mutable Galaxy-based Search-tuned Flexible Deep Convolutional Neural Network (MGS-FDCNN) is presented as an online solution to targeted coordinated dispatch challenges in future PGs. System optimization can be achieved with this strategy using only historical operational data. First, a numerical model of the targeted coordinated dispatch issue is created. Next, the MGS optimization approach is constructed to solve the optimization challenges. The effectiveness and accessibility of the proposed MGS-FDCNN approach are validated by experimental results on the IEEE test bus network.
Article
The objective of this work was to develop a bibliometric analysis of academic production related to sustainable projects based on clean, green and renewable alternative energies and their interaction with the health sector and health management on a global scale. To do this, a documentary research process based on bibliometric tools was carried out. The research is limited to the period between 2003 and 2023, using the following search equation: (TITLE-ABS-KEY("clean energy") OR TITLE-ABS-KEY("alternative energy") OR TITLE-ABS-KEY("solar energy") OR TITLE-ABS-KEY("green energy") OR TITLE-ABS-KEY("renewable energy") AND TITLE-ABS-KEY("health sector") OR TITLE-ABS-KEY("healthcare sector") OR TITLE-ABS-KEY("public health service") OR TITLE-ABS-KEY("health management")). The years with the highest production were 2020 (21), 2021 (23), 2022 (30) and 2023 (34), while the countries that have contributed the most to the topic are China (111), the USA (97) and the United Kingdom (71), which together contributed 57.76% of all publications. Of the 145 sources found that published on the topic, the 18 main sources concentrate 48% of the publications. The three most cited articles were EMADIAN SM, 2017, WASTE MANAGE; WATTS N, 2018, LANCET; and FULLER R, 2022, LANCET PLANET HEALTH, with a total of 1567 citations. Among the most relevant institutions on the topic, UNIVERSITY COLLEGE LONDON stands out with 19 publications, followed by TSINGHUA UNIVERSITY (9) and WUHAN UNIVERSITY (8); together these three institutions account for 20.57% of the publications found. This analysis provided critical insights for academics and professionals and underscored the relevance and dynamism of this research area in addressing clean energies and sustainable projects related to health management and the health sector worldwide.
Article
Hybrid Prognostics and Health Management (PHM) frameworks for light-emitting diodes (LEDs) seek accurate remaining useful life (RUL) predictions by merging information from physics-of-failure laws with data-driven models and tools for online monitoring and data collection. Uncertainty quantification (UQ) and uncertainty reduction are essential to achieve accurate predictions and assess the effect of heterogeneous operational-environmental conditions, lack of data, and noise on LED durability. Aleatory uncertainty is considered in hybrid frameworks, and probabilistic models and predictions are applied to account for inherent variability and randomness in the LED lifetime. On the other hand, hybrid frameworks often neglect epistemic uncertainty, lacking formal characterization and reduction methods. In this survey, we propose an overview of accelerated data collection methods and modeling options for LEDs. In contrast to other works, this review focuses on uncertainty quantification and the fusion of hybrid PHM models with optimal design of experiment methods for epistemic uncertainty reduction. In particular, optimizing the data collection with a combination of statistical optimality criteria and accelerated degradation test schemes can substantially reduce the epistemic uncertainty and enhance the performance of hybrid prognostic models.
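As a small illustration of combining design-of-experiment optimality criteria with accelerated test planning, the Python sketch below ranks two hypothetical accelerated degradation test plans by the D-optimality criterion det(X'X) of a simple linear stress-time model; the model and the candidate plans are invented for illustration and are not those of the survey.

import numpy as np

def design_matrix(stress_levels, measurement_times):
    # One row per (stress, time) observation; columns: intercept, stress, time.
    rows = [[1.0, s, t] for s in stress_levels for t in measurement_times]
    return np.array(rows)

candidate_plans = {
    "two stress levels, three inspections": ([0.5, 1.0], [10.0, 50.0, 100.0]),
    "three stress levels, two inspections": ([0.3, 0.6, 1.0], [25.0, 100.0]),
}

for name, (stresses, times) in candidate_plans.items():
    X = design_matrix(stresses, times)
    d_criterion = np.linalg.det(X.T @ X)   # larger det(X'X) -> tighter parameter estimates
    print(f"{name}: D-criterion = {d_criterion:.1f}")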
Article
This work presents a method to improve the diagnostic performance of an empirical classification system (ECS), which is used to estimate the degradation state of components based on measured signals. The ECS is embedded in a homogeneous continuous-time, finite-state semi-Markov model (HCTFSSMM), which adjusts diagnoses based on the past history of components. The combination gives rise to a homogeneous continuous-time, finite-state hidden semi-Markov model (HCTFSHSMM). In an application involving the degradation of bearings in automotive machines, the proposed method is shown to be superior in classification performance to the single-stage ECS.
Conference Paper
In this work, we investigate Reinforcement Learning (RL) for managing Operation and Maintenance (O&M) of power grids equipped with Prognostic and Health Management (PHM) capabilities, which allow tracking the health state of the grid components. RL exploits this information to select optimal O&M actions on the grid components giving rise to state-action-reward trajectories maximising the expected profit. A scaled-down case study is solved for a power grid, and strengths and weaknesses of the framework are discussed.
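As a toy illustration of the learning mechanism described in this abstract (the states, transition probabilities and rewards below are invented and do not reproduce the paper's case study), the following Python sketch runs tabular Q-learning for a single degrading component with "continue operating" and "maintain" actions:

import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2          # degradation levels 0 (new) .. 3 (failed); actions: 0 = operate, 1 = maintain
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.15

def step(state, action):
    if action == 1:                                   # maintenance restores the component to new
        cost = -30.0 if state == 3 else -3.0          # corrective repair costs far more than preventive
        return 0, cost
    if state == 3:                                    # operating a failed component: lost production
        return 3, -50.0
    next_state = state + (rng.random() < 0.4)         # stochastic degradation
    return int(next_state), 10.0                      # operating profit while the component works

state = 0
for _ in range(50_000):
    action = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
    state = next_state

print(np.argmax(Q, axis=1))          # greedy action (0 = operate, 1 = maintain) per degradation level

With these illustrative numbers, the greedy policy printed at the end typically learns to keep operating while the component is healthy and to maintain preventively once degradation becomes severe, which is the kind of condition-based rule that PHM information enables.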
Article
We introduce a new perspective to improve the reliability of a network, which aims at finding cost-effective portfolios of Prognostics and Health Management (PHM) systems to be installed throughout the network. To do this, we estimate the reliability of the single network element equipped with a PHM system, whose prognostic performance is measured in terms of the α−λ performance, false positive and false negative metrics. Then, we apply genetic algorithms for finding the portfolios of PHM systems to be installed on the network elements, which are optimal with respect to cost and a global reliability efficiency index of the network. The workbench case study of the IEEE 14 bus network is considered as application.
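To give a concrete, if simplified, picture of the genetic-algorithm search described above, a short Python sketch follows; the binary encoding, truncation selection, one-point crossover, bit-flip mutation and the cost/gain numbers are generic stand-ins, not the cited reliability efficiency index or α−λ metrics.

import numpy as np

rng = np.random.default_rng(2)
n_elements = 14                                    # one bit per network element (e.g. a small bus network)
cost = rng.uniform(1.0, 3.0, n_elements)           # hypothetical PHM installation costs
gain = rng.uniform(0.5, 4.0, n_elements)           # hypothetical reliability-gain scores

def fitness(portfolio):
    # Reward the notional reliability gain, penalise installation cost (weight is arbitrary).
    return gain @ portfolio - 0.8 * (cost @ portfolio)

pop = rng.integers(0, 2, size=(40, n_elements))    # random initial population of portfolios
for _ in range(100):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-20:]]        # truncation selection: keep the best half
    children = []
    for _ in range(len(pop)):
        p1, p2 = parents[rng.integers(20)], parents[rng.integers(20)]
        cut = rng.integers(1, n_elements)          # one-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_elements) < 0.05       # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(best)                                        # 1 = equip that element with a PHM system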
Conference Paper
Many complex phenomena, as well as the behaviour of large and complex systems and networks, can only be studied by adopting advanced computational methods. In addition, in many engineering fields virtual prototypes are used to support and drive the design of new components, structures and systems. Uncertainty quantification is a key requirement and challenge for realistic and reliable numerical modelling and prediction, spanning various disciplines and industries. The treatment of uncertainty requires efficient algorithms and computational techniques able to reduce the computational cost of non-deterministic analysis and to interface with open-source and commercial models (e.g. FE/CFD) and libraries. To satisfy these requirements and allow the inclusion of non-deterministic analyses as a standard practice in scientific computing, a general-purpose software for uncertainty quantification and risk assessment, named COSSAN, is under continuous development. This paper presents an overview of the main capabilities of the recent release of the Matlab open-source toolboxes OpenCossan. The new release includes interfaces with third-party libraries, allowing OpenCossan to be coupled with state-of-the-art tools in machine learning and meta-modelling. In addition, new toolboxes for reliability and resilience analysis of systems and networks are also presented. OpenCossan is released under the Lesser GNU licence and is therefore freely available. It can also be packaged as a Python or Java library for distribution to end users who do not need MATLAB.
Article
Balancing electricity generation and consumption is essential for the smooth operation of power grids. Any mismatch between energy supply and demand increases costs for both the service provider and customers and may even cripple the entire grid. This paper proposes a novel real-time incentive-based demand response algorithm for smart grid systems with reinforcement learning and deep neural networks, aiming to help the service provider purchase energy resources from its subscribed customers to balance energy fluctuations and enhance grid reliability. In particular, to overcome future uncertainties, a deep neural network is used to predict the unknown prices and energy demands. After that, reinforcement learning is adopted to obtain the optimal incentive rates for different customers, considering the profits of both service provider and customers. Simulation results show that the proposed incentive-based demand response algorithm induces demand-side participation, improves the profitability of both the service provider and the customers, and improves system reliability by balancing energy resources, which can be regarded as a win-win strategy for both service provider and customers.
Article
Buildings account for about 40% of the global energy consumption. Renewable energy resources are one possibility to mitigate the dependence of residential buildings on the electrical grid. However, their integration into the existing grid infrastructure must be done carefully to avoid instability, and guarantee availability and security of supply. Demand response, or demand-side management, improves grid stability by increasing demand flexibility, and shifts peak demand towards periods of peak renewable energy generation by providing consumers with economic incentives. This paper reviews the use of reinforcement learning, a machine learning algorithm, for demand response applications in the smart grid. Reinforcement learning has been utilized to control diverse energy systems such as electric vehicles, heating ventilation and air conditioning (HVAC) systems, smart appliances, or batteries. The future of demand response greatly depends on its ability to prevent consumer discomfort and integrate human feedback into the control loop. Reinforcement learning is a potentially model-free algorithm that can adapt to its environment, as well as to human preferences by directly integrating user feedback into its control logic. Our review shows that, although many papers consider human comfort and satisfaction, most of them focus on single-agent systems with demand-independent electricity prices and a stationary environment. However, when electricity prices are modelled as demand-dependent variables, there is a risk of shifting the peak demand rather than shaving it. We identify a need to further explore reinforcement learning to coordinate multi-agent systems that can participate in demand response programs under demand-dependent electricity prices. Finally, we discuss directions for future research, e.g., quantifying how RL could adapt to changing urban conditions such as building refurbishment and urban or population growth.
Article
With the modern advanced information and communication technologies in smart grid systems, demand response (DR) has become an effective method for improving grid reliability and reducing energy costs due to the ability to react quickly to supply-demand mismatches by adjusting flexible loads on the demand side. This paper proposes a dynamic pricing DR algorithm for energy management in a hierarchical electricity market that considers both service provider's (SP) profit and customers’ (CUs) costs. Reinforcement learning (RL) is used to illustrate the hierarchical decision-making framework, in which the dynamic pricing problem is formulated as a discrete finite Markov decision process (MDP), and Q-learning is adopted to solve this decision-making problem. Using RL, the SP can adaptively decide the retail electricity price during the on-line learning process where the uncertainty of CUs’ load demand profiles and the flexibility of wholesale electricity prices are addressed. Simulation results show that this proposed DR algorithm, can promote SP profitability, reduce energy costs for CUs, balance energy supply and demand in the electricity market, and improve the reliability of electric power systems, which can be regarded as a win-win strategy for both SP and CUs.