Transmission Interface Power Flow Adjustment:
A Deep Reinforcement Learning Approach based
on Multi-task Attribution Map
Shunyu Liu, Wei Luo, Yanzhen Zhou, Kaixuan Chen, Quan Zhang, Huating Xu, Qinglai Guo, Mingli Song
Abstract—Transmission interface power flow adjustment is a critical measure to ensure the secure and economical operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties that occur in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coupling relationship and even leading to conflicting decisions. In this paper, we introduce a novel data-driven deep reinforcement learning (DRL) approach to handle multiple power flow adjustment tasks jointly instead of learning each task from scratch. At the heart of the proposed method is a multi-task attribution map (MAM), which enables the DRL agent to explicitly attribute each transmission interface task to different power system nodes with task-adaptive attention weights. Based on this MAM, the agent can further provide effective strategies to solve the multi-task adjustment problem with a near-optimal operation cost. Simulation results on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European system with 9241 buses demonstrate that the proposed method significantly improves the performance compared with several baseline methods, and exhibits high interpretability with the learnable MAM.
Index Terms—Attribution map, deep reinforcement learning,
multi-task learning, power flow adjustment, transmission interface.
I. INTRODUCTION
POWER systems are complex nonlinear physical systems with high uncertainty [1]. With the rapid expansion of power system scale and the increasing imbalance between power demand and generation, the problems of power system operation, such as security and economy, become much more challenging [2].
This article has been accepted for publication by IEEE Transactions on Power Systems. The published version is available at https://doi.org/10.1109/TPWRS.2023.3298007. This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0101503, and in part by the Science and Technology Project of SGCC (State Grid Corporation of China): Fundamental Theory of Human-in-the-Loop Hybrid-Augmented Intelligence for Power Grid Dispatch and Control. (Shunyu Liu and Wei Luo contributed equally to this work.) (Corresponding author: Mingli Song.)
Shunyu Liu, Wei Luo, Kaixuan Chen, and Mingli Song are with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: liushunyu@zju.edu.cn; davidluo@zju.edu.cn; chenkx@zju.edu.cn; brooksong@zju.edu.cn).
Yanzhen Zhou and Qinglai Guo are with the Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (e-mail: zhouyzh@mail.tsinghua.edu.cn; guoqinglai@tsinghua.edu.cn).
Quan Zhang and Huating Xu are with the College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: quanzzhang@zju.edu.cn; xu_huating@zju.edu.cn).
Digital Object Identifier 10.1109/TPWRS.2023.3298007
To monitor the operation state of power systems with massive variables, operators tend to consider transmission interfaces rather than single elements. A specific transmission interface is composed of a set of transmission lines with the same direction of active power flow and close electrical distance [3, 4]. Operators can analyze and control the operation state of the power system by monitoring the power flow of different transmission interfaces. The total transfer capability of critical transmission interfaces is widely used in practice to provide power system security margins [5, 6, 7]. Once the power flow of a critical transmission interface is overloaded, it poses a great threat to the power system and may even lead to cascading blackouts. Accordingly, the power flow through a transmission interface is typically regulated within a pre-scheduled range, thereby ensuring the stability and reliability of power system operation [8, 9].
Transmission interface power flow adjustment serves as an important defensive means to satisfy this pre-scheduled security constraint. To realize power flow adjustment for different transmission interfaces, a direct approach is generation dispatch. Conventional dispatch methods are highly dependent on the system model and can be mainly categorized into two classes: optimal power flow-based methods [10] and sensitivity analysis-based ones [11, 12, 13]. Optimal power flow-based methods mainly rely on numerical optimization that transforms the dispatch problem into a constrained programming problem, while sensitivity analysis-based methods iteratively calculate sensitivity indexes to determine the candidate generators. These conventional methods are built on mathematical models of the power system that are subject to certain assumptions and simplifications. As a result, they may not capture all the complex dynamics that occur in a real power system. Moreover, this model-based mechanism suffers from a heavy computational burden as the scale of the power system grows [14]. The power flow constraints of different transmission interfaces are closely intertwined and may even conflict with each other in extreme scenarios, which significantly limits the feasible solution space. Thus, due to the combinatorial explosion and complicated constraints in the nonlinear, nonconvex search space, conventional methods may easily fail to find an optimal solution and become unacceptable in practice.
To alleviate these issues, a data-driven framework based on
deep reinforcement learning (DRL) has been proposed as a
competent alternative [15]. As a solution to control problems,
model-free DRL exploits the large capacity of deep neural net-
works to extract effective features from input states and deliver
response actions in an end-to-end fashion [16]. Recently, many
studies have shown the remarkable potential of this learning
paradigm in many power system control problems, including
voltage control [17, 18, 19], economic dispatch [20, 21, 22]
and emergency control [23, 24]. DRL can directly learn from high-dimensional power system data without being constrained by a fixed model, providing a more robust and adaptive control strategy under various scenarios. Moreover, DRL can realize efficient inference based on neural networks, making it more suitable for real-world applications with high demands for rapid response [25]. For the transmission interface power flow adjustment problem, reference [26] proposes to use Proximal Policy Optimization (PPO) to train the DRL policy. To avoid
the conflict problem of training scenarios with distinct patterns,
this method first clusters the training scenarios and then
employs multiple DRL agents for each cluster. Despite the
promising results achieved, existing works are still limited by
individually training policies for each specific transmission
interface task, which requires a large amount of interaction
data and ignores the coupling relationship across various tasks.
In practice, it is inevitable to monitor multiple transmission interfaces simultaneously, and each adjustment task has its own objective function. However, directly training a policy network for multiple tasks in a simple parameter-sharing manner is not feasible: despite the existence of commonalities, the differences between tasks may require network parameters to be updated in opposite directions, resulting in the optimization issue of gradient conflict [27]. To remedy this conflict and simplify the exploration space, disentangling the relationship between multiple tasks is especially critical. On the other hand, different adjustment tasks under the same power system topology share the same operation state space and dispatch action space. It is therefore beneficial to learn multiple tasks jointly instead of learning them from scratch separately, with the aim of leveraging shareable representation ability and decision-making patterns. A single policy network trained jointly on multiple tasks can generalize its knowledge, which further leads to improved efficiency and performance.
In this paper, we introduce a novel multi-task reinforcement learning method to learn multiple power flow adjustment tasks jointly. Unlike traditional methods that merely share network parameters across tasks in an intertwined manner, the proposed method performs selective feature extraction for each transmission interface task in an attribution manner. Specifically, we design a multi-task attribution map (MAM) to explicitly distinguish the impact of different power system nodes on each transmission interface task. To construct the MAM, we first learn task representations with a common task encoder to capture the relationship among tasks. Meanwhile, a graph convolution network is used to learn the node features of the power system operation state. Each task representation then serves as a query to calculate node-level attention weights, resulting in a learnable MAM. The MAM enables the model to mitigate the gradient conflict issue: it selectively integrates the underlying node features, achieving a compact state representation. Finally, we use a Dueling Deep Q Network to derive the final reinforcement learning policy, realizing effective power flow adjustment.
Our main contributions are summarized as follows:
• This work is therefore the first dedicated attempt towards learning multiple transmission interface power flow adjustment tasks jointly, a highly practical problem yet largely overlooked by the existing literature in the field of power systems.
• We design a novel deep reinforcement learning (DRL) method based on a multi-task attribution map (MAM) to handle multiple adjustment tasks jointly, where the MAM enables the DRL agent to selectively integrate the node features into a compact task-adaptive representation for the final adjustment policy.
• Simulations are conducted on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European 9241-bus system, demonstrating that the proposed method brings remarkable improvements over existing methods. Moreover, we verify the interpretability of the learnable MAM in different operation scenarios.
II. PROBLEM FORMULATION
For the transmission interface power flow adjustment problem, the objective is to bring the power flow of the transmission interface to a pre-scheduled range by generation dispatch. Specifically, we assume that there are $N_\Phi$ transmission interfaces $\{\Phi_1, \Phi_2, \cdots, \Phi_{N_\Phi}\}$ in a given power system. Each transmission interface $\Phi$ is a set of several transmission lines with the same direction of active power flow and close electrical distance. The power flow of the transmission interface $\Phi$ is defined as $P_\Phi = \sum_{\ell \in \Phi} P_\ell$, where $P_\ell$ is the active power of the transmission line $\ell$. Generation dispatch is utilized to adjust the power flow of the transmission interface. The power flow of the power system after generation dispatch is solved by the equations:
$$P^G_i - P^L_i = |V_i| \sum_{j=1}^{N_B} |V_j| (G_{ij}\cos\omega_{ij} + B_{ij}\sin\omega_{ij}), \quad (1)$$
$$Q^G_i - Q^L_i = |V_i| \sum_{j=1}^{N_B} |V_j| (G_{ij}\sin\omega_{ij} - B_{ij}\cos\omega_{ij}), \quad (2)$$
where $P^L_i$ and $Q^L_i$ are the active and reactive power consumption of the load at bus $i$, respectively; $P^G_i$ and $Q^G_i$ are the active and reactive power production of the generator at bus $i$, respectively; $V_i$ is the voltage magnitude of bus $i$; $\omega_{ij} = \omega_i - \omega_j$ is the phase difference between buses $i$ and $j$; $G_{ij}$ and $B_{ij}$ are the conductance and susceptance between buses $i$ and $j$; $N_B$ is the total number of buses; $N_L$ is the number of loads; and $N_G$ is the number of generators. To satisfy safety margin constraints under N-1 contingency conditions, the pre-scheduled range of each transmission interface $\Phi$ is given as $[\sigma^-_\Phi, \sigma^+_\Phi]$, where $\sigma^-_\Phi$ and $\sigma^+_\Phi$ represent the lower and upper limits of the transmission interface total transfer capability [5, 6, 7], respectively. We need to adjust the power flow of the transmission interface to the pre-scheduled range via generation dispatch to ensure the safe operation of the entire power system.
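As a concrete illustration of the interface flow $P_\Phi$ and the AC power flow solution in Eqs. (1)-(2), the following minimal Python sketch solves the power flow of the IEEE 118-bus case with pandapower and sums the active power over a hypothetical set of interface lines; the line indices are placeholders rather than the interfaces actually used in this paper, and the range check simply reuses the $\Phi_1$ bounds from Table I.

```python
# Minimal sketch: AC power flow and interface power flow with pandapower.
import pandapower as pp
import pandapower.networks as pn

net = pn.case118()                     # IEEE 118-bus test system
pp.runpp(net)                          # solve the AC power flow, Eqs. (1)-(2)

interface_lines = [10, 11, 12]         # hypothetical interface Phi: indices into net.line
p_phi = net.res_line.loc[interface_lines, "p_from_mw"].sum()   # P_Phi = sum of line active power

sigma_lo, sigma_hi = 90.0, 640.0       # pre-scheduled range of Phi_1 in Table I (MW)
print(f"P_Phi = {p_phi:.1f} MW, within range: {sigma_lo <= p_phi <= sigma_hi}")
```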
A. Task Setting
In this paper, we focus on training one multi-task policy to
learn multiple tasks jointly instead of training different single-
task policies for each task separately. Moreover, we consider
two types of tasks in the transmission interface power flow
adjustment problem:
• Single-Interface Task. Each task $\mathcal{T}$ consists of the adjustment problem of one transmission interface $\Phi$, e.g., $\mathcal{T}(\Phi)$.
• Multi-Interface Task. Each task $\mathcal{T}$ consists of the adjustment problems of $M$ transmission interfaces, e.g., $\mathcal{T}(\{\Phi_m\}_{m=1}^M)$, where $M$ satisfies $1 < M < N_\Phi$.
In our simulation, each test scenario has only one specified task. In a given scenario with a single-interface task, the multi-task policy only needs to adjust the power flow of one transmission interface. Moreover, the same multi-task policy can also deal with other scenarios with different single-interface tasks. In contrast, for a scenario with a multi-interface task, the multi-task policy needs to adjust the power flows of multiple transmission interfaces to their pre-scheduled ranges at the same time.
B. Markov Decision Process
The transmission interface power flow adjustment task $\mathcal{T}$ can be formalized as a Markov decision process (MDP). An MDP is represented by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the set of continuous states, $\mathcal{A}$ is the set of discrete actions, $\mathcal{P}$ is the state transition function, $\mathcal{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount rate. The multi-step element in the power flow adjustment problem is the dispatch action sequence, where the maximum sequence length is set to 50. At each control step $t$, the agent with a given state $s \in \mathcal{S}$ selects an action $a \in \mathcal{A}$ through its policy $\pi(a|s)$. The agent then receives a reward $r$ and the next state $s'$ of the environment according to the reward function $\mathcal{R}(s, a)$ and the transition function $\mathcal{P}(s'|s, a)$. The goal of the agent is to find the optimal policy $\pi_\theta$ parameterized by $\theta$ that maximizes the discounted return:
$$J^\pi(\theta) = \mathbb{E}_{s \sim \rho(\cdot)}\left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \,\Big|\, s_0 = s \right], \quad (3)$$
where $\rho(\cdot)$ is the initial state distribution and $a_t \sim \pi_\theta(a_t|s_t)$. Moreover, the goal of multi-task reinforcement learning is to learn a compact policy $\pi$ for various tasks drawn from a distribution of tasks $p(\mathcal{T})$. We define a task-conditioned policy $\pi(a|s, z)$, where $z$ denotes a task representation for each task $\mathcal{T}$. The task representation $z$ can be provided as a one-hot vector or in any other form. Each task $\mathcal{T}$ is a standard MDP. Considering the average expected return across all tasks sampled from $p(\mathcal{T})$, the multi-task policy $\pi$ parameterized by $\theta$ is optimized to maximize the objective:
$$J^\pi(\theta) = \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\left[ J^\pi_\mathcal{T}(\theta) \right], \quad (4)$$
where $J^\pi_\mathcal{T}(\theta)$ is obtained directly from Eq. (3) with task $\mathcal{T}$. The detailed MDP formulation of the power system is defined as:
State. We define the power system operation state as a graph $s = (A, F)$, where $A$ is the adjacency matrix and $F$ is the node feature matrix. The graph-based state is modeled from the power system by considering the bus nodes as vertices and the transmission lines as edges. For each bus node $B$, its feature vector is described as $F_B = [P_B, Q_B, V_B, \omega_B]^T$, where $P_B$ and $Q_B$ are the active and reactive power of the bus node $B$, respectively, and $V_B$ and $\omega_B$ are the voltage magnitude and phase of the bus node $B$, respectively. We use power flow calculation to obtain the next state from the current power system operation state and the current dispatch action.
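The graph-based state above can be assembled directly from a solved power flow case. The sketch below is one possible construction, assuming a pandapower network object `net` whose power flow has already been solved; it is illustrative only and uses only the transmission lines as edges, as in the state definition.

```python
# Minimal sketch of the graph-based state s = (A, F): the adjacency matrix A
# comes from the line topology and the node features F = [P, Q, V, omega] come
# from the AC power flow results of a solved pandapower network `net`.
import numpy as np

def build_state(net):
    n = len(net.bus)
    A = np.zeros((n, n), dtype=np.float32)
    # map bus ids to consecutive row indices (bus ids need not be 0..n-1)
    idx = {b: i for i, b in enumerate(net.bus.index)}
    for _, line in net.line.iterrows():
        i, j = idx[line.from_bus], idx[line.to_bus]
        A[i, j] = A[j, i] = 1.0
    # per-bus injections and voltages from the power flow solution
    F = np.stack([
        net.res_bus.p_mw.to_numpy(),        # active power P_B
        net.res_bus.q_mvar.to_numpy(),      # reactive power Q_B
        net.res_bus.vm_pu.to_numpy(),       # voltage magnitude V_B
        net.res_bus.va_degree.to_numpy(),   # voltage phase omega_B
    ], axis=1).astype(np.float32)
    return A, F
```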
Action. The action adopted in our setting is generation dispatch, which enables the agent to change the active power of the controllable generators. The action set is defined as $\mathcal{A} = \mathcal{G} \times \{\sigma^-_G, \sigma^+_G\}$, where $\mathcal{G}$ is the set of controllable generators, and $\sigma^-_G$ and $\sigma^+_G$ represent the decreasing and increasing rates of the active power, respectively. We set $\sigma^-_G$ to 90% and $\sigma^+_G$ to 110% based on the limits of the ramp rates. At each discrete time step, the agent can reduce the active power of a selected generator to the 90% level or increase it to the 110% level. If the current generation is already at the upper or lower limit of the generator capacity, the corresponding invalid dispatch action is masked. It is worth mentioning that slack generators are not included in the controllable generators.
Reward. Considering both the power flow adjustment and the economic dispatch cost, the total reward function is defined as
$$R_{tot}(s, a) = R_{pf}(s, a) + R_{ed}(s, a). \quad (5)$$
To satisfy the limits of transmission interface power flow, the reward function of the single-interface task with the transmission interface $\Phi$ is defined as
$$R_{pf}(s, a; \Phi) = -\left| P_\Phi - \frac{\sigma^-_\Phi + \sigma^+_\Phi}{2} \right|. \quad (6)$$
For the multi-interface task with $\{\Phi_m\}_{m=1}^M$, we focus on the worst case and define the reward function as
$$R_{pf}\big(s, a; \{\Phi_m\}_{m=1}^M\big) = \min_{m \in \{1, 2, \cdots, M\}} \{ R_{pf}(s, a; \Phi_m) \}. \quad (7)$$
The most commonly used reward assumption for economic dispatch is quadratic:
$$R_{ed}(s, a) = -\sum_{i=1}^{N_G} \left( \alpha_i (P^G_i)^2 + \beta_i P^G_i + \lambda_i \right), \quad (8)$$
where $\alpha_i$, $\beta_i$, and $\lambda_i$ are the generation cost coefficients of generator $i$, and $P^G_i$ is the active power production of generator $i$. Moreover, the reward is directly set to $R_{tot}(s, a) = -100$ to penalize the divergence of power flow and the overload of slack generators, and to $R_{tot}(s, a) = 100$ to reward a successful adjustment in which the power flow of all adjusted transmission interfaces is within the pre-scheduled range.
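For clarity, the following sketch illustrates how a single dispatch action and the reward terms of Eqs. (5)-(8) could be computed, assuming the forms written above (negative absolute deviation from the range midpoint and negative quadratic generation cost) and a pandapower network `net`; the cost coefficients and generator index are hypothetical placeholders.

```python
# Minimal sketch of one dispatch action and the reward terms of Eqs. (5)-(8).
import pandapower as pp

def apply_action(net, gen_idx, direction):
    """direction = 0.9 (decrease) or 1.1 (increase) the selected generator's output."""
    p_new = net.gen.at[gen_idx, "p_mw"] * direction
    # clip to capacity limits (assumes the case provides min_p_mw / max_p_mw)
    lo, hi = net.gen.at[gen_idx, "min_p_mw"], net.gen.at[gen_idx, "max_p_mw"]
    net.gen.at[gen_idx, "p_mw"] = min(max(p_new, lo), hi)
    pp.runpp(net)                                   # next state via power flow calculation

def reward_pf(p_phi, sigma_lo, sigma_hi):
    return -abs(p_phi - 0.5 * (sigma_lo + sigma_hi))            # Eq. (6), assumed form

def reward_pf_multi(p_list, ranges):
    return min(reward_pf(p, lo, hi) for p, (lo, hi) in zip(p_list, ranges))   # Eq. (7)

def reward_ed(net, cost):
    # cost: list of hypothetical (alpha_i, beta_i, lambda_i) tuples per generator
    return -sum(a * p ** 2 + b * p + c
                for p, (a, b, c) in zip(net.res_gen.p_mw, cost))              # Eq. (8)
```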
III. METHODOLOGY
In what follows, we detail the proposed method for multi-
task transmission interface power flow adjustment. As shown in Figure 1, we first use a graph convolution network to learn the node features of the power system operation state. Meanwhile, the task representations are obtained from a common task encoder to capture the relationship among tasks. Each task representation then acts as a query to calculate node-level attention weights based on the node features, resulting in the task-adaptive attribution map. The attribution map enables the model to selectively integrate the underlying node features, achieving a compact state representation. Finally, we use a Dueling Deep Q Network to derive the final reinforcement learning policy.

Fig. 1. Illustration of the proposed method. The power system operation state is encoded by two GCN branches; the transmission interface multi-hot vector is mapped by the task encoder to a task representation; a softmax attention over the node embeddings yields the multi-task attribution map, which drives a weighted pooling into the state representation fed to the Dueling Deep Q Network that outputs the dispatch action.
A. Graph Convolution Network
A power system operation state $s \in \mathcal{S}$ is represented as $(A, F)$, where $A \in \{0,1\}^{N \times N}$ is the adjacency matrix with $N$ nodes, $A_{ij} = 1$ indicates that there is a transmission line between nodes $i$ and $j$, and $F \in \mathbb{R}^{N \times d_s}$ is the node feature matrix, assuming each node has $d_s$ features. To further extract information from the graph-based data, our framework uses a Graph Convolution Network (GCN) [28] to obtain node embeddings layer by layer. The GCN enables the nodes to exchange information with their neighbors in each convolutional layer, followed by learnable filters and a non-linear transformation. As depicted in the left column of Figure 1, the framework first uses a GCN $f(\cdot; \psi) : \mathbb{R}^{d_s} \to \mathbb{R}^{d_x}$ parameterized by $\psi$ to extract the node-level embeddings with dimension $d_x$, following the general message passing mechanism
$$H^{(k)} = \mathrm{ReLU}\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(k-1)} W^{(k-1)}_\psi\right), \quad (9)$$
where $\hat{A} = A + I$ is the adjacency matrix with added self-connections, $I$ is the identity matrix, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$ is the degree matrix, $W^{(k)}_\psi \in \mathbb{R}^{d^{(k)} \times d^{(k+1)}}$ is a trainable weight matrix with parameters $\psi$, $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ denotes the activation function, and $H^{(k)} \in \mathbb{R}^{N \times d^{(k)}}$ is the embedding matrix with dimension $d^{(k)}$ computed after $k$ steps of graph convolution. The input node embedding $H^{(0)}$ is initialized with the node feature matrix $F$. After running $K$ iterations of Eq. (9), the graph convolution network generates the final node-level embedding matrix $X = H^{(K)} \in \mathbb{R}^{N \times d_x}$. Here we adopt two GCN branches to obtain two node-level embedding matrices, as shown in Figure 1:
$$X_\rho = f(A, F; \psi_\rho) \in \mathbb{R}^{N \times d_x}, \quad (10)$$
$$X_\upsilon = f(A, F; \psi_\upsilon) \in \mathbb{R}^{N \times d_x}, \quad (11)$$
where $X_\rho$ is used to generate the multi-task attribution map, and $X_\upsilon$ is weighted-pooled by this attribution map to give a compact state representation.
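As a reference point for Eqs. (9)-(11), the following PyTorch sketch implements one GCN propagation step and a two-layer branch with 64 hidden units, matching the dimensions reported in Section IV; it is a simplified, single-graph (non-batched) illustration rather than the authors' released implementation.

```python
# Minimal sketch of one GCN propagation step, Eq. (9), and a two-layer branch.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = nn.Linear(d_in, d_out, bias=False)           # trainable W_psi

    def forward(self, A, H):
        # A: dense (N, N) float adjacency matrix, H: (N, d_in) node embeddings
        A_hat = A + torch.eye(A.size(0), device=A.device)     # add self-connections
        d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))   # D^{-1/2}
        return torch.relu(d_inv_sqrt @ A_hat @ d_inv_sqrt @ self.w(H))

class GCNBranch(nn.Module):
    """Two-layer branch f(.; psi) producing node embeddings X in R^{N x d_x}."""
    def __init__(self, d_s, d_x=64):
        super().__init__()
        self.layers = nn.ModuleList([GCNLayer(d_s, d_x), GCNLayer(d_x, d_x)])

    def forward(self, A, F):
        H = F
        for layer in self.layers:
            H = layer(A, H)
        return H                                              # X = H^(K), Eqs. (10)-(11)
```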
B. Multi-Task Attribution Map
Our main idea for achieving knowledge transfer among multiple tasks is to learn a multi-task attribution map that realizes task-adaptive node attribution. To generate the multi-task attribution map, we first need to capture and exploit both the common structure and the unique characteristics of tasks by learning task representations. We associate each task $\mathcal{T}$ with a representation $z_\mathcal{T} \in \mathbb{R}^{d_x}$ and expect it to reflect different properties of tasks. To model task similarity, all tasks share a common representation encoder, which takes the task vector as input and outputs the task representation. The task encoder is trained on all tasks.
Concretely, the task encoder is implemented by a Multi-layer Perceptron (MLP) network $g(\cdot; \xi)$ parameterized by $\xi$. Considering the single-interface task $\mathcal{T}(\Phi)$, the task representation $z_\mathcal{T}$ is calculated by
$$z_\mathcal{T} = g(o(\Phi); \xi) \in \mathbb{R}^{d_x}, \quad (12)$$
where $o(\Phi) = [o(\Phi)_1, o(\Phi)_2, \cdots, o(\Phi)_N]^T \in \{0, 1\}^N$ is the transmission interface multi-hot vector indicating whether each transmission line belongs to the transmission interface $\Phi$, with $N$ the total number of transmission lines: $o(\Phi)_i = 1$ if $i \in \Phi$ and $o(\Phi)_i = 0$ if $i \notin \Phi$. Moreover, the task representation $z_\mathcal{T}$ for the multi-interface task $\mathcal{T}(\{\Phi_m\}_{m=1}^M)$ is calculated by
$$z_\mathcal{T} = \frac{1}{M} \sum_{m=1}^{M} g(o(\Phi_m); \xi) \in \mathbb{R}^{d_x}. \quad (13)$$
When there is no task encoder and only the original vector is used to represent the tasks, each task is treated as orthogonal and independent. The task encoder allows us to express the relationship between tasks in the representation space, which enables the agent to use the similarity between tasks to generalize its knowledge.
Furthermore, to obtain the attribution map $\rho$, we use the task representation $z_\mathcal{T}$ as a query to calculate the node-level attention weights:
$$\rho_\mathcal{T} = \mathrm{softmax}(X_\rho z_\mathcal{T}) \in \mathbb{R}^N. \quad (14)$$
This attribution map $\rho$ is task-adaptive based on the task representation $z_\mathcal{T}$. With the attribution map $\rho$, the agent can explicitly distinguish the impact of different power system nodes on each adjustment task. The compact state representation $\upsilon_\mathcal{T}$ is then obtained via a weighted pooling operation:
$$\upsilon_\mathcal{T} = X_\upsilon^{\top} \rho_\mathcal{T} \in \mathbb{R}^{d_x}. \quad (15)$$
The state representation $\upsilon_\mathcal{T}$ selectively integrates the underlying node-level embeddings, which enables the agent to further derive a policy to handle the task $\mathcal{T}$.
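The attribution mechanism of Eqs. (12)-(15) can be summarized in a few lines of PyTorch. The sketch below is a simplified, non-batched illustration: the encoder dimensions follow the settings reported in Section IV, and averaging over interfaces corresponds to Eq. (13).

```python
# Minimal sketch of the multi-task attribution map, Eqs. (12)-(15).
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """MLP g(.; xi) mapping interface multi-hot vectors to a task representation z_T."""
    def __init__(self, n_lines, d_x=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_lines, 128), nn.ReLU(),
                                 nn.Linear(128, d_x))

    def forward(self, multi_hot):                 # shape (M, n_lines); M = 1 for single-interface
        return self.mlp(multi_hot).mean(dim=0)    # Eq. (13): average over interfaces

def attribution_map(x_rho, x_ups, z_t):
    rho_t = torch.softmax(x_rho @ z_t, dim=0)     # Eq. (14): node-level attention, shape (N,)
    ups_t = x_ups.t() @ rho_t                     # Eq. (15): weighted pooling, shape (d_x,)
    return rho_t, ups_t
```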
C. Dueling Deep Q Network
We use a value-based method to calculate the final reinforcement learning policy. Value-based methods assess the quality of a policy $\pi$ by the action-value function $Q^\pi$, defined as
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a \right], \quad (16)$$
which denotes the expected discounted return after the agent executes action $a$ at state $s$. A policy $\pi^*$ is optimal if
$$Q^{\pi^*}(s, a) \ge Q^\pi(s, a), \quad \forall \pi, \; s \in \mathcal{S}, \; a \in \mathcal{A}. \quad (17)$$
There is always at least one policy that is better than or equal to all other policies [16]. All optimal policies share the same optimal action-value function, denoted $Q^*$. It is easy to show that $Q^*$ satisfies the Bellman optimality equation:
$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{P}(\cdot|s, a)}\left[ \mathcal{R}(s, a) + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right]. \quad (18)$$
To estimate the optimal action-value function $Q^*$, Deep Q-Networks (DQN) [29] use a neural network $Q(s, a; \theta)$ with parameters $\theta$ as an approximator. Here we take the state representation $\upsilon_\mathcal{T}$ as input and adopt a dueling network architecture [30], which explicitly separates the estimation of state values and state-dependent action advantages. This corresponds to the following factorization of action values:
$$Q(\upsilon_\mathcal{T}, a; \theta) = V(\upsilon_\mathcal{T}; \theta_V) + A(\upsilon_\mathcal{T}, a; \theta_A) - \frac{\sum_{a' \in \mathcal{A}} A(\upsilon_\mathcal{T}, a'; \theta_A)}{|\mathcal{A}|}, \quad (19)$$
where $V(\cdot; \theta_V) : \mathbb{R}^{d_x} \to \mathbb{R}$ is the state value network and $A(\cdot; \theta_A) : \mathbb{R}^{d_x} \to \mathbb{R}^{|\mathcal{A}|}$ is the action advantage network.
Algorithm 1 Learning Algorithm for MAM
Initialize: Two GCN Branches f(·; ψρ) and f(·; ψυ), Task Encoder g(·; ξ), DQN Q(·; θ), Replay Buffer D
Return: Optimized Network Parameters {ψ, ξ, θ}
1: repeat
2: Sample a test scenario with the task from p(T)
3: Obtain the initial state s0= (A, F0)
4: while not terminal do
5: Construct the multi-task attribution map
6: Calculate the node embedding matrices Xρ, Xυ
7: Calculate the task representation zT
8: Construct the attribution map ρT
9: Deliver the final policy
10: Calculate the state representation υT
11: Calculate the action values Q(υT,·;θ)
12: Execute the action at= arg maxaQ(υT, a;θ)
13: Receive the reward rt+1 and the next state st+1
14: Store the sample to the buffer D
15: Train the networks
16: Sample the random minibatch from the buffer D
17: Compute the TD loss L(ψ, ξ, θ)
18: Perform gradient descent on the TD loss L(ψ, ξ, θ)
19: end while
20: until convergence
For transmission interface power flow adjustment, most dispatch actions have little impact on the power system. The dueling architecture enables the agent to learn the value of the state itself without having to learn the effect of each action on that state, which is especially useful in states where the actions have little effect on the environment [30]. Finally, the dispatch action is obtained by a greedy policy $\pi_\mathcal{T}(s) = \arg\max_{a \in \mathcal{A}} Q(\upsilon_\mathcal{T}, a; \theta)$.
We optimize the networks by minimizing the following temporal-difference (TD) loss based on the Bellman equation:
$$\mathcal{L}(\psi, \xi, \theta) = \mathbb{E}_{(\mathcal{T}, s, a, r, s') \sim \mathcal{D}}\left[ \left( y - Q(\upsilon_\mathcal{T}, a; \theta) \right)^2 \right], \quad (20)$$
where $\mathcal{D}$ is the replay buffer of transitions, $y = r + \gamma Q\big(\upsilon'_\mathcal{T}, \arg\max_{a'} Q(\upsilon'_\mathcal{T}, a'; \theta); \theta^-\big)$, and $\theta^-$ represents the parameters of the target network [31]. An algorithmic description of the training procedure is given in Algorithm 1.
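The TD target above combines the online and target networks in a double-Q fashion: the online network selects the greedy next action and the target network evaluates it. The following PyTorch sketch illustrates this computation for a minibatch; the terminal-state masking with `done` is an implementation detail assumed here rather than stated in the paper.

```python
# Minimal sketch of the TD loss of Eq. (20) with a target network.
import torch

def td_loss(q_online, q_target, ups, a, r, ups_next, done, gamma=0.9):
    """q_online/q_target map a batch of state representations to Q-values over actions.
    ups, ups_next: (B, d_x); a: (B,) long; r, done: (B,) float."""
    q_sa = q_online(ups).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(ups_T, a; theta)
    with torch.no_grad():
        a_star = q_online(ups_next).argmax(dim=1, keepdim=True)     # argmax_a' Q(.; theta)
        q_next = q_target(ups_next).gather(1, a_star).squeeze(1)    # evaluated with theta^-
        y = r + gamma * (1.0 - done) * q_next                       # TD target
    return ((y - q_sa) ** 2).mean()
```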
IV. CASE STUDY
To demonstrate the effectiveness of the proposed Multi-task
Attribution Map (MAM) method, case studies are conducted
based on two small-scale systems and one large-scale system
using the PandaPower simulation [32]. In this section, we first
provide the details for the scenario generation process. Then
the compared methods and parameter settings are introduced.
The comparison results of different baselines are reported
to evaluate the performance of MAM. Moreover, several
visualization examples of MAM in different scenarios are
given to study the internal mechanism of MAM.
Fig. 2. Illustration of the IEEE 118-bus system. Different transmission
interfaces are represented by different colors. Please zoom for better view.
Fig. 3. Illustration of the realistic 300-bus system in China. Different
transmission interfaces are represented by different colors.
Fig. 4. Illustration of the European 9241-bus system from the PEGASE
project. Different transmission interfaces are represented by different colors.
TABLE I
THE PRE-SCHEDULED POWER RANGE (MW) OF EACH TRANSMISSION INTERFACE IN THE IEEE 118-BUS SYSTEM, THE REALISTIC 300-BUS SYSTEM AND THE EUROPEAN 9241-BUS SYSTEM. "#" DENOTES THE TOTAL NUMBERS OF THE INSECURE SCENARIOS.

Transmission Interface | 118-bus System Range | #   | 300-bus System Range | #   | 9241-bus System Range | #
Φ1                     | [90, 640]            | 150 | [140, 1000]          | 182 | [15, 110]             | 202
Φ2                     | [50, 360]            | 185 | [280, 1960]          | 162 | [840, 5880]           | 210
Φ3                     | [40, 290]            | 211 | [170, 1200]          | 174 | [145, 1000]           | 204
Φ4                     | [90, 640]            | 238 | [240, 1680]          | 188 | [55, 390]             | 204
Φ5                     | [70, 480]            | 161 | [460, 3200]          | 198 | [560, 3920]           | 200
Φ6                     | [45, 300]            | 185 | [200, 1400]          | 204 | [360, 2520]           | 203
Φ7                     | [130, 880]           | 160 | [200, 1400]          | 98  | [345, 2410]           | 208
Φ8                     | [55, 390]            | 184 | [480, 3360]          | 171 | [760, 5320]           | 205
Φ9                     | [130, 880]           | 158 | [170, 1200]          | 220 | [130, 910]            | 200
Φ10                    | [90, 615]            | 197 | [80, 590]            | 220 | [180, 1260]           | 201
A. Scenario Generation
To obtain adequate scenarios for training and testing, we
first use two small-scale systems for simulations, including the
IEEE 118-bus system and a realistic regional 300-bus system
in China with 182 loads, 23 generators and 313 AC lines.
Then we adopt a very large European 9241-bus system from
the PEGASE project [33, 34]. The detailed scenario generation
process is summarized as follows:
• To simulate various scenarios in a given power system, we randomly select 25% of the total loads and generators in each scenario, and then randomly perturb the load and generator power outputs from 10% to 200% of the original power flow with an interval of 10% (see the sketch after this list).
• For the different power flow adjustment tasks shown in Figure 2, Figure 3 and Figure 4, 10 transmission interfaces are selected for each power system, respectively.
• The pre-scheduled power range $[\sigma^-_\Phi, \sigma^+_\Phi]$ of each transmission interface $\Phi$ is provided for power flow adjustment, as shown in Table I. For each power flow adjustment task, only the insecure scenarios, in which the power flow of the corresponding transmission interfaces falls outside the pre-scheduled power range, are considered for the case studies. We form a control horizon based on the dispatch action sequence. The total numbers of insecure scenarios for each transmission interface in the different systems are also listed in Table I.
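The perturbation rule in the first item above can be illustrated with pandapower as follows; the sketch only mirrors the described 25% selection and 10%-200% scaling and is not the exact scenario generator used for the case studies.

```python
# Minimal sketch of the scenario perturbation rule: pick 25% of loads and
# generators and scale them to a factor drawn from {0.1, 0.2, ..., 2.0}.
import numpy as np
import pandapower as pp
import pandapower.networks as pn

def perturb_scenario(net, rng):
    factors = np.arange(0.1, 2.01, 0.1)                       # 10% to 200%, step 10%
    for table in ("load", "gen"):
        df = getattr(net, table)
        picked = rng.choice(df.index, size=max(1, len(df) // 4), replace=False)
        df.loc[picked, "p_mw"] *= rng.choice(factors, size=len(picked))
    pp.runpp(net)   # power flow may diverge for extreme perturbations; such cases would be discarded
    return net

rng = np.random.default_rng(0)
scenario = perturb_scenario(pn.case118(), rng)
```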
Finally, a total number of 1829 scenarios for the IEEE 118-
bus system are obtained, including 1656 scenarios for training
and 173 scenarios for testing. For the realistic 300-bus system
in China, a total number of 1817 scenarios are obtained,
including 1637 scenarios for training and 180 scenarios for
testing. For the European 9241-bus system, a total number
of 2037 scenarios are obtained, including 1848 scenarios for
training and 189 scenarios for testing.
B. Compared Methods and Parameter Settings
The proposed Multi-task Attribution Map method, termed
as MAM, is compared with the existing state-of-the-art DRL
methods in discrete action space, including Deep Q Net-
work (DQN) [29], Double DQN [31], Dueling DQN [30],
Advantage Actor Critic (A2C) [35] and Proximal Policy Op-
timization (PPO) [36]. Two basic multi-task architectures are
considered for all DRL baselines: (1) Concat-based architecture, using one MLP network that takes the concatenation of the task vector and the node feature vector as input; (2) Soft-based architecture, using the soft modularization network [37] that enables the agent to select different MLP network modules based on the task representation. Moreover, we also adopt an Optimal Power Flow (OPF) method as a conventional model-based baseline based on PYPOWER [38]. The internal solver of OPF uses the interior point method, where the optimization is initialized with a valid power flow solution. All reported results of the different methods are obtained on the unseen test scenarios. Three evaluation metrics are used: test success rate, test economic cost and inference speed.
We adopt the Tianshou RL framework [39] to run all ex-
periments, which uses the PyTorch library for implementation.
Moreover, power flow calculation is performed based on the
PandaPower simulation [32]. The detailed hyperparameters are
given as follows, where the common hyperparameters across different baselines are kept consistent to ensure comparability, and their method-specific hyperparameters can be found in the source code. In the proposed MAM, each GCN branch contains a 2-
layer GCN with the dimension of [64,64]. The task encoder is
implemented by a 2-layer MLP network with the dimension
of [128,64]. For Dueling architecture, a 2-layer value network
with the dimension of 128 and a 1-layer advantage network
with the dimension of 128 are applied. All network modules
use ReLU activation function. Batches of 64 episodes are
sampled from the replay buffer with the size of 20K for every
training iteration. The target update interval is set to 100, and
the discount factor is set to 0.9. We use the Adam optimizer
with a learning rate of 0.001. For exploration, ϵ-greedy is
used with ϵannealed linearly from 0.1 to 0.01 over 500K
training steps and kept constant for the rest of the training.
Case studies are carried out on a computing platform with
Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz, 128 GB
RAM, and Quadro P5000 GPU.
C. Single-Interface Task
To validate the effectiveness of our method in single-
interface tasks under the multi-task setting, we conduct ex-
periments on both small-scale and large-scale power sys-
tems. For each power system, 10 selected transmission in-
terfaces are considered as 10 single-interface tasks. We consider the multi-task setting with all single-interface tasks $\{\mathcal{T}(\Phi_1), \mathcal{T}(\Phi_2), \cdots, \mathcal{T}(\Phi_{10})\}$. Each single-interface task is randomly sampled at each episode. The learning curves of the different DRL methods on the different power systems are shown in Figures 5a-5c. The final performance of all DRL and OPF methods is shown in Table II. The average inference speed of our method and the OPF baseline is shown in Table III.
In the IEEE 118-bus system and the realistic 300-bus
system, our proposed MAM successfully improves the final
performance and offers a very high inference speed. (1) From
the perspective of different multi-task architectures, the soft-
based architecture consistently outperforms the concat-based
TABLE II
THE PERFORMANCE OF OUR METHOD AND BASELINES IN SINGLE-INTERFACE TASKS UNDER THE MULTI-TASK SETTING. THE PERFORMANCE IS OBTAINED UNDER THE AVERAGE EVALUATION OVER 5 TRIALS. THE PERFORMANCE DIFFERENCES BETWEEN OUR METHOD AND BASELINES ARE GIVEN IN PARENTHESES. NOTE THAT THE HIGHER THE TEST SUCCESS RATE AND THE LOWER THE TEST ECONOMIC COST, THE BETTER THE PERFORMANCE OF THE METHOD.

Test Success Rate (%)
Method               | 118-bus System | 300-bus System | 9241-bus System
DQN (Concat)         | 61.23 (-35.38) | 56.74 (-38.06) | 13.67 (-29.19)
DQN (Soft)           | 78.53 (-18.08) | 81.88 (-12.92) | 23.28 (-19.58)
Double DQN (Concat)  | 53.28 (-43.33) | 61.24 (-33.56) | 16.75 (-26.11)
Double DQN (Soft)    | 82.35 (-14.26) | 83.11 (-11.69) | 16.67 (-26.17)
Dueling DQN (Concat) | 68.44 (-28.17) | 57.66 (-37.14) | 37.57 (-5.29)
Dueling DQN (Soft)   | 85.23 (-11.38) | 85.30 (-9.50)  | 31.13 (-11.73)
A2C (Concat)         | 80.14 (-16.47) | 75.61 (-19.19) | 12.70 (-30.16)
A2C (Soft)           | 80.10 (-16.51) | 80.14 (-14.66) | 17.64 (-25.22)
PPO (Concat)         | 24.28 (-72.33) | 48.61 (-46.19) | 19.40 (-23.46)
PPO (Soft)           | 24.28 (-72.33) | 52.78 (-42.02) | 22.22 (-20.64)
OPF                  | 72.25 (-24.36) | 50.56 (-44.24) | 51.32 (+8.46)
MAM (Ours)           | 96.61          | 94.80          | 42.86

Test Economic Cost ($)
Method               | 118-bus System    | 300-bus System    | 9241-bus System
Original             | 632,354 (+8,901)  | 999,927 (+29,669) | 325,606 (+1,112)
DQN (Concat)         | 626,808 (+3,355)  | 991,762 (+21,504) | 324,658 (+164)
DQN (Soft)           | 625,598 (+2,145)  | 986,337 (+16,079) | 325,554 (+1,060)
Double DQN (Concat)  | 638,079 (+14,626) | 990,126 (+19,868) | 324,792 (+298)
Double DQN (Soft)    | 625,279 (+1,826)  | 982,362 (+12,104) | 325,790 (+1,296)
Dueling DQN (Concat) | 626,215 (+2,762)  | 992,048 (+21,790) | 324,485 (-9)
Dueling DQN (Soft)   | 625,128 (+1,675)  | 972,573 (+2,315)  | 325,152 (+658)
A2C (Concat)         | 626,703 (+3,250)  | 976,467 (+6,209)  | 325,000 (+506)
A2C (Soft)           | 625,381 (+1,928)  | 976,803 (+6,545)  | 324,808 (+314)
PPO (Concat)         | 596,315 (-27,138) | 930,581 (-39,677) | 325,549 (+1,055)
PPO (Soft)           | 596,577 (-26,876) | 936,181 (-34,077) | 325,303 (+809)
OPF                  | 606,259 (-17,194) | 953,349 (-16,909) | 321,586 (-2,908)
MAM (Ours)           | 623,453           | 970,258           | 324,494
TABLE III
THE AVERAGE INFERENCE SPEED OF OUR METHOD AND OPF BASELINES IN SINGLE-INTERFACE TASKS. ± CORRESPONDS TO ONE STANDARD DEVIATION OF THE AVERAGE EVALUATION OVER ALL TEST SCENARIOS.

Inference Time (s)
Method     | 118-bus System  | 300-bus System  | 9241-bus System
OPF        | 44.822 ± 38.459 | 47.668 ± 28.230 | 1912.608 ± 1636.746
MAM (Ours) | 0.006 ± 0.009   | 0.010 ± 0.011   | 0.051 ± 0.025
one, while our proposed method still achieves the best performance. The idea of the soft-based architecture is similar to ours: it focuses on learning the relationship between network modules and tasks, but ignores the relationship between input nodes and tasks. Different node features are still coupled together, leading to poor generalization at test time due to over-fitting on noisy nodes during training. Thus, the proposed MAM takes advantage of the attribution map to selectively integrate the node features into a compact state representation for the final policy. (2) Comparing different DRL methods, despite DQN and its variants showing high learning efficiency, these methods inevitably fall into suboptimal policies with low test success rates. The performance of the A2C algorithm is worse than Dueling DQN, while PPO cannot learn any effective policy and performs poorly.
(a) IEEE 118-bus System (Single-Interface Tasks); (b) Realistic 300-bus System (Single-Interface Tasks); (c) European 9241-bus System (Single-Interface Tasks); (d) IEEE 118-bus System (Multi-Interface Tasks); (e) Realistic 300-bus System (Multi-Interface Tasks); (f) European 9241-bus System (Multi-Interface Tasks).
Fig. 5. Learning curves of our method and baselines in both single-interface tasks and multi-interface tasks under the multi-task setting. All experimental
results are illustrated with the median performance and one standard deviation (shaded region) over 5 random seeds for a fair comparison.
In contrast, our proposed MAM works quite well on the test
success rate and outperforms all state-of-the-art DRL methods
during training. Furthermore, the proposed MAM not only
achieves the best test success rate, but also obtains the lowest
test economic cost except for PPO. Despite the lower test
economic cost obtained by PPO, it is worth noting that the
success rate of PPO is extraordinarily low. PPO only arbitrarily
reduces the power of the generators without considering the
power flow target of the transmission interface, which is
unacceptable. DQN and its variants are off-policy algorithms, while A2C and PPO are on-policy algorithms. An off-policy algorithm enables the agent to update its policy based on interaction samples from historical policies, whereas an on-policy algorithm only uses interaction samples from the current policy. Therefore, the sample efficiency of off-policy algorithms is higher than that of on-policy algorithms, which is beneficial in the power system environment where the control sequence is short. On the other hand, PPO imposes a policy constraint over the A2C algorithm, which may lead to a locally optimal solution due to the short control sequence.
It is notable that our MAM also achieves superior performance compared to the conventional OPF method in the small-scale systems. We only focus on the insecure scenarios for each specific transmission interface, where the OPF method requires solving complex nonlinear optimization problems. Although the OPF method can easily obtain a lower test economic cost, it often fails to satisfy the power flow constraints and performs poorly on the test success rate. In contrast, our MAM can learn to generalize across different scenarios and make near-optimal decisions, making it more robust and flexible. The results suggest that the attribution map can be utilized to explore the diverse control mechanisms for different tasks, which helps the agent construct a more useful policy and achieve non-trivial performance.
In the 9241-bus system, the power flow adjustment problems become much more challenging due to the large exploration space. We can observe that all DRL baselines fail to perform well, while only OPF and MAM can still offer reasonable results. However, the conventional OPF method becomes computationally expensive as the system size increases, requiring a long time to find a solution in the nonlinear, nonconvex search space. On the contrary, the model-free MAM method can directly use neural network forward propagation to deliver dispatch actions in a more efficient manner. Thus, MAM can provide inference speed guarantees competent for practical deployment, while the time consumption of OPF is unacceptable. Moreover, MAM also exhibits its advantage in filtering out noisy information and yields performance on par with OPF in the large-scale system.
D. Multi-Interface Task
To further evaluate the proposed method in multi-interface
tasks under the multi-task setting, we also conduct experiments
on two power systems. In the IEEE 118-bus system, we consider the multi-task setting with different 5-interface tasks $\{\mathcal{T}(\{\Phi_m\}_{m \in \mathcal{M}})\}$, where $\mathcal{M}$ is a set of 5 transmission interfaces randomly selected from all transmission interfaces. In the more complex realistic 300-bus system and the European 9241-bus system, we consider the multi-task setting with
TABLE IV
THE PERFORMANCE OF OUR METHOD AND BASELINES IN MULTI-INTERFACE TASKS UNDER THE MULTI-TASK SETTING.

Test Success Rate (%)
Method               | 118-bus System | 300-bus System | 9241-bus System
DQN (Concat)         | 38.66 (-30.79) | 34.12 (-39.06) | 19.71 (-13.27)
DQN (Soft)           | 56.92 (-12.53) | 46.26 (-26.92) | 19.66 (-13.32)
Double DQN (Concat)  | 43.52 (-25.93) | 36.26 (-36.92) | 24.51 (-8.47)
Double DQN (Soft)    | 57.44 (-12.01) | 38.26 (-34.92) | 18.69 (-14.29)
Dueling DQN (Concat) | 42.65 (-26.80) | 42.15 (-31.03) | 29.10 (-3.88)
Dueling DQN (Soft)   | 61.54 (-7.91)  | 56.67 (-16.51) | 26.98 (-6.00)
A2C (Concat)         | 28.19 (-41.26) | 60.98 (-12.20) | 17.20 (-15.78)
A2C (Soft)           | 21.69 (-47.76) | 64.57 (-8.61)  | 17.99 (-14.99)
PPO (Concat)         | 38.25 (-31.20) | 25.68 (-47.50) | 18.78 (-14.20)
PPO (Soft)           | 54.37 (-15.08) | 33.41 (-39.77) | 23.19 (-9.79)
OPF                  | 47.98 (-21.47) | 13.89 (-59.29) | 37.57 (+4.59)
MAM (Ours)           | 69.45          | 73.18          | 32.98

Test Economic Cost ($)
Method               | 118-bus System    | 300-bus System    | 9241-bus System
Original             | 632,354 (+57,392) | 999,927 (+87,212) | 325,606 (+1,180)
DQN (Concat)         | 619,440 (+44,478) | 945,438 (+32,723) | 325,316 (+890)
DQN (Soft)           | 577,665 (+2,703)  | 941,011 (+28,296) | 325,662 (+1,236)
Double DQN (Concat)  | 597,228 (+22,266) | 943,314 (+30,599) | 327,074 (+2,648)
Double DQN (Soft)    | 580,558 (+5,596)  | 955,407 (+42,692) | 326,951 (+2,525)
Dueling DQN (Concat) | 582,386 (+7,424)  | 924,819 (+12,104) | 324,533 (+107)
Dueling DQN (Soft)   | 576,127 (+1,165)  | 920,950 (+8,235)  | 324,855 (+429)
A2C (Concat)         | 606,632 (+31,670) | 921,492 (+8,777)  | 325,626 (+1,200)
A2C (Soft)           | 625,292 (+50,330) | 916,958 (+4,243)  | 325,659 (+1,233)
PPO (Concat)         | 599,601 (+24,639) | 938,474 (+25,759) | 325,456 (+1,030)
PPO (Soft)           | 605,275 (+30,313) | 936,014 (+23,299) | 325,547 (+1,121)
OPF                  | 554,691 (-20,271) | 956,624 (+43,909) | 321,694 (-2,732)
MAM (Ours)           | 574,962           | 912,715           | 324,426
TABLE V
THE AVERAGE INFERENCE SPEED OF OUR METHOD AND OPF BASELINES IN MULTI-INTERFACE TASKS.

Inference Time (s)
Method     | 118-bus System  | 300-bus System  | 9241-bus System
OPF        | 67.958 ± 30.218 | 57.729 ± 23.794 | 2416.349 ± 2016.855
MAM (Ours) | 0.011 ± 0.003   | 0.030 ± 0.032   | 0.066 ± 0.016
different 3-interface tasks. Figures 5d-5f present the learning curves of the compared methods on the different power systems, while the final performance is shown in Table IV and the average inference speed is shown in Table V.
In the different power systems, the simulation results are similar to those of the single-interface tasks. The soft-based architecture still outperforms the concat-based one from the perspective of different multi-task architectures, while our MAM approach consistently exceeds all baselines by a large margin not only in test success rate but also in test economic cost. Compared with the conventional OPF-based method, MAM significantly reduces the computational cost. The proposed MAM is particularly beneficial in these multi-interface tasks, as it enables us to explore the diverse common critical nodes and generalize across various tasks. Notably, despite the encouraging results achieved, our proposed MAM cannot achieve the same high test success rate in the multi-interface tasks as in the single-interface tasks. The multi-interface tasks are more complicated than single-interface tasks because of the coupling relationship
Fig. 6. Learning curves of our proposed MAM and its ablations on the IEEE
118-bus system with multi-interface tasks.
TABLE VI
THE PERFORMANCE OF OUR PROPOSED MAM AND ITS ABLATIONS ON THE IEEE 118-BUS SYSTEM WITH MULTI-INTERFACE TASKS.

Method | Test Success Rate (%) | Test Economic Cost ($) | Inference Time (s)
MAM-O  | 64.16 (-5.29)         | 581,607 (+6,645)       | 0.010 (-0.001)
MAM-M  | 58.07 (-11.38)        | 591,698 (+16,736)      | 0.007 (-0.004)
MAM-W  | 71.19 (+1.74)         | 602,992 (+28,030)      | 0.012 (+0.001)
MAM    | 69.45                 | 574,962                | 0.011
between different transmission interfaces. In some extreme op-
erating condition scenarios, there may even be no adjustment
policy that can successfully satisfy all power flow constraints.
E. Ablation Study
To understand the superior performance of the proposed
MAM, we carry out ablation studies to test the contribution
of its different components as follows. The comparison results of the various ablations on the IEEE 118-bus system with multi-interface tasks are shown in Figure 6 and Table VI.
• MAM-O. We use one shared GCN instead of two GCN branches to obtain the two node-level embedding matrices. The results suggest that sharing parameters reduces the computation cost while downgrading the final performance. This is because the two embedding matrices from the GCN play different roles in our attribution mechanism: one matrix is used to generate the multi-task attribution map, which only distinguishes the impact of different power system nodes on each adjustment task, while the other matrix is used to give a compact state representation according to the multi-task attribution map. If we only use one shared GCN, the two embedding matrices are forced to be the same, which severely damages the model expressiveness.
• MAM-M. We directly take the multi-task attribution map instead of the state representation as the input of the Dueling Deep Q Network. By comparing MAM with MAM-M, we can conclude that the multi-task attribution map alone is not enough to train the subsequent Dueling Deep Q Network. Although the multi-task attribution map enables the agent to pay attention to the important nodes, the agent still fails to make proper decisions and performs poorly due to the lack of state information.
• MAM-W. We use a weighted average operation with learnable parameters instead of the plain average operation to obtain the task representation for the multi-interface task.
Fig. 7. A visualization example of the attribution maps in different single-interface scenarios under the IEEE 118-bus system. The attention magnitudes of
the node attribution maps are represented by circles with different red shades. Please zoom for better view.
[Figure panel: power flow (MW) of Transmission Interfaces 1, 3, 6, 7 and 10 over dispatch steps 0-7, with upper/lower bounds of 640/90, 290/40, 300/45, 880/130 and 615/90 MW, respectively.]
Fig. 8. A visualization example of the attribution maps in one 5-interface scenario under the IEEE 118-bus system. Please zoom for better view.
The weighted average operation allows the agent to weigh different transmission interfaces rather than regard them as equally important. Thus, it is reasonable that MAM-W achieves results on par with, and sometimes slightly superior to, MAM on the test success rate. However, MAM obtains a lower test economic cost than MAM-W.
F. Visualization Analysis
To further explain the attribution map learned by MAM, we conduct a qualitative analysis. The attribution maps in different single-interface scenarios under the multi-task setting are visualized in Figure 7. The simulation results show that the proposed method successfully learns different sparse attribution maps for different transmission interfaces, which significantly reduces the state space. Specifically, we find that the high-attention nodes are close to the corresponding transmission interface in electrical distance and usually have a direct impact on the power flow of the transmission interface. Moreover, the final dispatch generators are also located at the nodes with high attention. This example clearly demonstrates the interpretability and effectiveness of the proposed MAM. MAM also demonstrates promising results in the multi-interface scenario, as shown in Figure 8. The attribution map learned in the multi-interface scenario differs from that in the single-interface scenario, as it provides the important nodes for multiple transmission interfaces at once. MAM then selects the generators to dispatch based on the attribution maps. In general, the visual analysis suggests that the proposed method masters the ability to explicitly distinguish the impact of different power system nodes on each transmission interface. Moreover, MAM also learns the relationship between different transmission interfaces, which enables the generalizable policy to handle the multi-task adjustment problem.
V. CONCLUSION
In this work, we propose a novel approach, termed MAM, that takes advantage of a multi-task attribution map for transmission interface power flow adjustment. MAM follows a restructuring-by-disentangling scheme, where several distinguishable node attentions are generated and then selectively reassembled for the final focused policy. We validate MAM on the IEEE 118-bus system, a realistic regional system in China, and a very large European 9241-bus system, and show that it yields results significantly superior to state-of-the-art techniques in both single-interface and multi-interface tasks under the multi-task setting. To the best of our knowledge, this paper is the first attempt towards learning multiple transmission interface power flow adjustment tasks jointly.
Limitations. Currently, we only focus on generalization over different power flow adjustment tasks under the given transmission interfaces. We observe that the proposed MAM successfully generalizes to unseen scenarios and exhibits superior performance. However, MAM cannot directly handle a new transmission interface that the agent has not been trained on. The premise of generalization in deep learning is sufficient samples. For example, there may be hundreds of transmission interfaces in the IEEE 118-bus system, while we only consider 10 critical transmission interfaces in this work; it is therefore hard to learn the relationship between the trained transmission interfaces and a new one. Studying this few-shot generalization is an interesting direction for our future work.
REFERENCES
[1] N. Hatziargyriou, J. Milanovic, C. Rahmann, V. Ajjarapu,
C. Canizares, I. Erlich, D. Hill, I. Hiskens, I. Kamwa,
B. Pal et al., “Definition and classification of power sys-
tem stability–revisited & extended, IEEE Transactions
on Power Systems, vol. 36, no. 4, pp. 3271–3281, 2020.
[2] X. Chen, G. Qu, Y. Tang, S. Low, and N. Li, “Reinforce-
ment learning for selective key applications in power
systems: Recent advances and future challenges, IEEE
Transactions on Smart Grid, vol. 13, no. 4, pp. 2935–
2958, 2022.
[3] B.-h. Zhang, F. Yao, D.-c. Zhou, L.-y. Wang, and B.-
g. Zou, “Study on security protection of transmission
section and its key technologies, Proceedings of the
CSEE, vol. 26, no. 21, pp. 1–7, 2006.
[4] P. Kaymaz, J. Valenzuela, and C. S. Park, “Transmission
congestion and competition on power generation expan-
sion,” IEEE Transactions on Power Systems, vol. 22,
no. 1, pp. 156–163, 2007.
[5] L. Min and A. Abur, “Total transfer capability compu-
tation for multi-area power systems, IEEE Transactions
on power systems, vol. 21, no. 3, pp. 1141–1147, 2006.
[6] H. Sun, F. Zhao, H. Wang, K. Wang, W. Jiang, Q. Guo,
B. Zhang, and L. Wehenkel, “Automatic learning of fine
operating rules for online power system security control,
IEEE transactions on neural networks and learning
systems, vol. 27, no. 8, pp. 1708–1719, 2015.
[7] J.-H. Liu and C.-C. Chu, “Iterative distributed algorithms
for real-time available transfer capability assessment of
multiarea power systems, IEEE Transactions on Smart
Grid, vol. 6, no. 5, pp. 2569–2578, 2015.
[8] W. Lin, Z. Yang, J. Yu, L. Jin, and W. Li, “Tie-line power
transmission region in a hybrid grid: Fast characterization
and expansion strategy,” IEEE Transactions on Power
Systems, vol. 35, no. 3, pp. 2222–2231, 2019.
[9] S. Mudaliyar, B. Duggal, and S. Mishra, “Distributed tie-
line power flow control of autonomous dc microgrid clus-
ters,” IEEE Transactions on Power Electronics, vol. 35,
no. 10, pp. 11 250–11 266, 2020.
[10] F. Capitanescu, J. M. Ramos, P. Panciatici, D. Kirschen, A. M. Marcolini, L. Platbrood, and L. Wehenkel, “State-of-the-art, challenges, and future trends in security constrained optimal power flow,” Electric Power Systems Research, vol. 81, no. 8, pp. 1731–1741, 2011.
[11] T. B. Nguyen and M. Pai, “Dynamic security-constrained rescheduling of power systems using trajectory sensitivities,” IEEE Transactions on Power Systems, vol. 18, no. 2, pp. 848–854, 2003.
[12] I. A. Hiskens and J. Alseddiqui, “Sensitivity, approximation, and uncertainty in power system dynamic simulation,” IEEE Transactions on Power Systems, vol. 21, no. 4, pp. 1808–1820, 2006.
[13] S. Dutta and S. Singh, “Optimal rescheduling of generators for congestion management based on particle swarm optimization,” IEEE Transactions on Power Systems, vol. 23, no. 4, pp. 1560–1569, 2008.
[14] D. K. Molzahn, F. Dörfler, H. Sandberg, S. H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei, “A survey of distributed optimization and control algorithms for electric power systems,” IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2941–2962, 2017.
[15] Z. Zhang, D. Zhang, and R. C. Qiu, “Deep reinforcement learning for power system applications: An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225, 2019.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[17] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi, “Deep-reinforcement-learning-based autonomous voltage control for power grid operations,” IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814–817, 2020.
[18] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, “Two-timescale voltage control in distribution grids using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2313–2323, 2019.
[19] D. Cao, J. Zhao, W. Hu, N. Yu, F. Ding, Q. Huang, and Z. Chen, “Deep reinforcement learning enabled physical-model-free two-timescale voltage control method for active distribution systems,” IEEE Transactions on Smart Grid, vol. 13, no. 1, pp. 149–165, 2021.
[20] W. Liu, P. Zhuang, H. Liang, J. Peng, and Z. Huang, “Distributed economic dispatch in microgrids based on cooperative reinforcement learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2192–2203, 2018.
[21] P. Dai, W. Yu, G. Wen, and S. Baldi, “Distributed reinforcement learning algorithm for dynamic economic dispatch with unknown generation cost functions,” IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2258–2267, 2019.
[22] D. Li, L. Yu, N. Li, and F. Lewis, “Virtual-action-based
coordinated reinforcement learning for distributed eco-
nomic dispatch,” IEEE Transactions on Power Systems,
vol. 36, no. 6, pp. 5143–5152, 2021.
[23] Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan, and Z. Huang, “Adaptive power system emergency control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1171–1182, 2019.
[24] C. Chen, M. Cui, F. Li, S. Yin, and X. Wang, “Model-
free emergency frequency control based on reinforcement
learning,” IEEE Transactions on Industrial Informatics,
vol. 17, no. 4, pp. 2336–2346, 2020.
[25] D. Cao, W. Hu, J. Zhao, G. Zhang, B. Zhang, Z. Liu,
Z. Chen, and F. Blaabjerg, “Reinforcement learning and
its applications in modern power and energy systems: A
review,” Journal of Modern Power Systems and Clean
Energy, vol. 8, no. 6, pp. 1029–1042, 2020.
[26] Q. Gao, Y. Liu, J. Zhao, J. Liu, and C. Y. Chung, “Hy-
brid deep learning for dynamic total transfer capability
control,” IEEE Transactions on Power Systems, vol. 36,
no. 3, pp. 2733–2736, 2021.
[27] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” in Annual Conference on Neural Information Processing Systems, 2020.
[28] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations, 2017.
[29] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level
control through deep reinforcement learning,” Nature,
vol. 518, no. 7540, pp. 529–533, 2015.
[30] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanc-
tot, and N. de Freitas, “Dueling network architectures
for deep reinforcement learning,” in International Con-
ference on Machine Learning, 2016.
[31] H. van Hasselt, A. Guez, and D. Silver, “Deep rein-
forcement learning with double q-learning,” in AAAI
Conference on Artificial Intelligence, 2016.
[32] L. Thurner, A. Scheidler, F. Schäfer, J. Menke, J. Dollichon, F. Meier, S. Meinecke, and M. Braun, “Pandapower – An open-source Python tool for convenient modeling, analysis, and optimization of electric power systems,” IEEE Transactions on Power Systems, vol. 33, no. 6, pp. 6510–6521, 2018.
[33] S. Fliscounakis, P. Panciatici, F. Capitanescu, and L. Wehenkel, “Contingency ranking with respect to overloads in very large power systems taking into account uncertainty, preventive, and corrective actions,” IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 4909–4917, 2013.
[34] C. Josz, S. Fliscounakis, J. Maeght, and P. Panciatici, “AC power flow data in MATPOWER and QCQP format: iTesla, RTE snapshots, and PEGASE,” arXiv preprint arXiv:1603.01533, 2016.
[35] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016.
[36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[37] R. Yang, H. Xu, Y. Wu, and X. Wang, “Multi-task rein-
forcement learning with soft modularization,” in Annual
Conference on Neural Information Processing Systems,
2020.
[38] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, 2011.
[39] J. Weng, H. Chen, D. Yan, K. You, A. Duburcq, M. Zhang, H. Su, and J. Zhu, “Tianshou: A highly modularized deep reinforcement learning library,” arXiv preprint arXiv:2107.14171, 2021.
The identification of a tie-line power transmission region is crucial for the secure operation of interconnected power grids. The characterization of a tie-line power transmission region with regional operational constraints is inherently a multi-parametric programming problem, which greatly suffers from a heavy computational burden. In this paper, we propose a fast boundary search approach to efficiently characterize the tie-line power transmission region in AC/DC hybrid grids. The multi-parametric programming method is improved in the following three aspects: (1) the search direction is fixed along the boundary of a region, which avoids the search inside a region; (2) the abundant search of subregions that share a boundary facet is avoided; and (3) a fast guess of the Karush-Kuhn-Tucker (KKT) condition is presented, which reduces the number of optimization calculations. Furthermore, an expansion strategy is presented to enlarge the tie-line power transmission region via quantifying the impact of system parameters on the transmission region. Simulations based on the IEEE 118-bus test system and a 661-bus utility system show that (1) the proposed method can be over 30 times faster than existing methods and (2) the proposed method successfully expands the tie-line power transmission region by identifying the key parameters.