Transmission Interface Power Flow Adjustment:
A Deep Reinforcement Learning Approach based
on Multi-task Attribution Map
Shunyu Liu, Wei Luo, Yanzhen Zhou, Kaixuan Chen, Quan Zhang, Huating Xu, Qinglai Guo, Mingli Song
Abstract—Transmission interface power flow adjustment is a critical measure to ensure the secure and economical operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties that occur in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coupling relationship and even leading to conflicting decisions. In this paper, we introduce a novel data-driven deep reinforcement learning (DRL) approach to handle multiple power flow adjustment tasks jointly instead of learning each task from scratch. At the heart of the proposed method is a multi-task attribution map (MAM), which enables the DRL agent to explicitly attribute each transmission interface task to different power system nodes with task-adaptive attention weights. Based on this MAM, the agent can further provide effective strategies to solve the multi-task adjustment problem with a near-optimal operation cost. Simulation results on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European system with 9241 buses demonstrate that the proposed method significantly improves the performance compared with several baseline methods, and exhibits high interpretability with the learnable MAM.
Index Terms—Attribution map, deep reinforcement learning,
multi-task learning, power flow adjustment, transmission interface.
I. INTRODUCTION
POWER systems are complex nonlinear physical systems with high uncertainty [1]. With the rapid expansion of power system scale and the increasing imbalance between power demand and generation, the problems of power system operation, such as security and economy, become much more challenging [2].
This article has been accepted for publication by IEEE Transactions on Power Systems. The published version is available at https://doi.org/10.1109/TPWRS.2023.3298007. This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0101503, and in part by the Science and Technology Project of SGCC (State Grid Corporation of China): Fundamental Theory of Human-in-the-Loop Hybrid-Augmented Intelligence for Power Grid Dispatch and Control. (Shunyu Liu and Wei Luo contributed equally to this work.) (Corresponding author: Mingli Song.)
Shunyu Liu, Wei Luo, Kaixuan Chen, and Mingli Song are with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: liushunyu@zju.edu.cn; davidluo@zju.edu.cn; chenkx@zju.edu.cn; brooksong@zju.edu.cn).
Yanzhen Zhou and Qinglai Guo are with the Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (e-mail: zhouyzh@mail.tsinghua.edu.cn; guoqinglai@tsinghua.edu.cn).
Quan Zhang and Huating Xu are with the College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: quanzzhang@zju.edu.cn; xu_huating@zju.edu.cn).
Digital Object Identifier 10.1109/TPWRS.2023.3298007
To monitor the operation state of power systems with massive variables, operators tend to consider transmission interfaces rather than single elements. A specific transmission interface is composed of a set of transmission lines with the same direction of active power flow and close electrical distance [3, 4]. Operators can analyze and control the operation state of the power system by monitoring the power flow of different transmission interfaces. The total transfer capability of critical transmission interfaces is widely used in practice to provide power system security margins [5, 6, 7]. Once the power flow of a critical transmission interface is overloaded, it poses a great threat to the power system and may even lead to cascading blackouts. Accordingly, the power flow through a transmission interface is typically regulated within a pre-scheduled range, thereby ensuring the stability and reliability of power system operation [8, 9].
Transmission interface power flow adjustment serves as an important defensive means to satisfy this pre-scheduled security constraint. To realize power flow adjustment for different transmission interfaces, a direct approach is generation dispatch. Conventional dispatch methods are highly dependent on the system model and can be mainly categorized into two classes: optimal power flow-based methods [10] and sensitivity analysis-based ones [11, 12, 13]. Optimal power flow-based methods mainly rely on numerical optimization that transforms the dispatch problem into a constrained programming problem, while sensitivity analysis-based methods iteratively calculate sensitivity indexes to determine the candidate generators. These conventional methods are built on mathematical models of the power system that are subject to certain assumptions and simplifications. As a result, they may not capture all the complex dynamics that occur in a real power system. Moreover, this model-based mechanism suffers from a heavy computational burden as the scale of the power system grows [14]. The power flow constraints of different transmission interfaces are closely intertwined and may even conflict with each other in extreme scenarios, which significantly limits the feasible solution space. Thus, due to the combinatorial explosion and complicated constraints in the nonlinear, nonconvex search space, conventional methods may easily fail to find an optimal solution and become unacceptable in practice.
To alleviate these issues, a data-driven framework based on
deep reinforcement learning (DRL) has been proposed as a
competent alternative [15]. As a solution to control problems,
model-free DRL exploits the large capacity of deep neural net-
works to extract effective features from input states and deliver
response actions in an end-to-end fashion [16]. Recently, many
studies have shown the remarkable potential of this learning
paradigm in many power system control problems, including
voltage control [17, 18, 19], economic dispatch [20, 21, 22]
and emergency control [23, 24]. DRL can directly learn from high-dimensional power system data without being constrained by a fixed model, providing a more robust and adaptive control strategy under various scenarios. Moreover, DRL can realize efficient inference based on neural networks, making it more suitable for real-world applications with high demands for rapid response [25]. For the transmission interface power flow adjustment problem, reference [26] proposes to use Proximal Policy Optimization (PPO) to train the DRL policy. To avoid
the conflict problem of training scenarios with distinct patterns,
this method first clusters the training scenarios and then
employs multiple DRL agents for each cluster. Despite the
promising results achieved, existing works are still limited by
individually training policies for each specific transmission
interface task, which requires a large amount of interaction
data and ignores the coupling relationship across various tasks.
In practice, it is inevitable to monitor multiple transmission interfaces simultaneously, and each adjustment task has its own objective function. However, directly training a policy network for multiple tasks in a simple parameter-sharing manner is not feasible: despite the existence of commonalities, the differences between tasks may require network parameters to be updated in opposite directions, resulting in the optimization issue of gradient conflict [27]. To remedy this conflict and simplify the exploration space, disentangling the relationship between multiple tasks is especially critical. On the other hand, different adjustment tasks under the same power system topology share the same operation state space and dispatch action space. It is therefore beneficial to learn multiple tasks jointly instead of learning them from scratch separately, with the aim of leveraging shareable representation ability and decision-making patterns. A single policy network trained jointly on multiple tasks can generalize its knowledge, which further leads to improved efficiency and performance.
In this paper, we introduce a novel multi-task reinforcement learning method to learn multiple power flow adjustment tasks jointly. Unlike traditional methods that merely share network parameters across tasks in an intertwined manner, the proposed method performs selective feature extraction for each transmission interface task in an attribution manner. Specifically, we design a multi-task attribution map (MAM) to explicitly distinguish the impact of different power system nodes on each transmission interface task. To construct the MAM, we first learn task representations with a common task encoder to capture the relationship among tasks. Meanwhile, a graph convolution network is used to learn the node features of the power system operation state. Each task representation then serves as a query to calculate node-level attention weights, resulting in a learnable MAM. The MAM enables the model to mitigate the gradient conflict issue: it selectively integrates the underlying node features, achieving a compact state representation. Finally, we use a Dueling Deep Q Network to derive the final reinforcement learning policy, realizing effective power flow adjustment.
Our main contributions are summarized as follows:
• This work is therefore the first dedicated attempt towards learning multiple transmission interface power flow adjustment tasks jointly, a highly practical problem yet largely overlooked by the existing literature in the field of power systems.
• We design a novel deep reinforcement learning (DRL) method based on a multi-task attribution map (MAM) to handle multiple adjustment tasks jointly, where the MAM enables the DRL agent to selectively integrate the node features into a compact task-adaptive representation for the final adjustment policy.
• Simulations are conducted on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European 9241-bus system, demonstrating that the proposed method brings remarkable improvements over existing methods. Moreover, we verify the interpretability of the learnable MAM in different operation scenarios.
II. PROBLEM FORMULATION
For the transmission interface power flow adjustment problem, the objective is to bring the power flow of the transmission interface to a pre-scheduled range by generation dispatch. Specifically, we assume that there are $N_\Phi$ transmission interfaces $\{\Phi_1, \Phi_2, \cdots, \Phi_{N_\Phi}\}$ in a given power system. Each transmission interface $\Phi$ is a set of several transmission lines with the same direction of active power flow and close electrical distance. The power flow of the transmission interface $\Phi$ is defined as $P_\Phi = \sum_{\ell \in \Phi} P_\ell$, where $P_\ell$ is the active power of the transmission line $\ell$. Generation dispatch is utilized to adjust the power flow of the transmission interface. The power flow of the power system after generation dispatch is solved by the equations:
$$P^G_i - P^L_i = |V_i| \sum_{j=1}^{N_B} |V_j| (G_{ij}\cos\omega_{ij} + B_{ij}\sin\omega_{ij}), \quad (1)$$
$$Q^G_i - Q^L_i = |V_i| \sum_{j=1}^{N_B} |V_j| (G_{ij}\sin\omega_{ij} - B_{ij}\cos\omega_{ij}), \quad (2)$$
where $P^L_i$ and $Q^L_i$ are the active and reactive power consumption of the load at bus $i$, respectively; $P^G_i$ and $Q^G_i$ are the active and reactive power production of the generator at bus $i$, respectively; $V_i$ is the voltage magnitude of bus $i$; $\omega_{ij} = \omega_i - \omega_j$ is the phase difference between buses $i$ and $j$; $G_{ij}$ and $B_{ij}$ are the conductance and susceptance between buses $i$ and $j$; $N_B$ is the total number of buses; $N_L$ is the number of loads; and $N_G$ is the number of generators. To satisfy safety margin constraints under N-1 contingency conditions, the pre-scheduled range of each transmission interface $\Phi$ is given as $[\sigma^-_\Phi, \sigma^+_\Phi]$, where $\sigma^-_\Phi$ and $\sigma^+_\Phi$ represent the lower and upper limits of the transmission interface total transfer capability [5, 6, 7], respectively. We need to adjust the power flow of the transmission interface to the pre-scheduled range via generation dispatch to ensure the safe operation of the entire power system.
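As a concrete illustration of the interface flow $P_\Phi$ and the AC power flow solution in Eqs. (1)-(2), the following minimal Python sketch solves the power flow of the IEEE 118-bus case with pandapower and sums the active power over a hypothetical set of interface lines; the line indices are placeholders rather than the interfaces actually used in this paper, and the range check simply reuses the $\Phi_1$ bounds from Table I.

```python
# Minimal sketch: AC power flow and interface power flow with pandapower.
import pandapower as pp
import pandapower.networks as pn

net = pn.case118()                     # IEEE 118-bus test system
pp.runpp(net)                          # solve the AC power flow, Eqs. (1)-(2)

interface_lines = [10, 11, 12]         # hypothetical interface Phi: indices into net.line
p_phi = net.res_line.loc[interface_lines, "p_from_mw"].sum()   # P_Phi = sum of line active power

sigma_lo, sigma_hi = 90.0, 640.0       # pre-scheduled range of Phi_1 in Table I (MW)
print(f"P_Phi = {p_phi:.1f} MW, within range: {sigma_lo <= p_phi <= sigma_hi}")
```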
A. Task Setting
In this paper, we focus on training one multi-task policy to
learn multiple tasks jointly instead of training different single-
task policies for each task separately. Moreover, we consider
two types of tasks in the transmission interface power flow
adjustment problem:
• Single-Interface Task. Each task $\mathcal{T}$ consists of the adjustment problem of one transmission interface $\Phi$, e.g., $\mathcal{T}(\Phi)$.
• Multi-Interface Task. Each task $\mathcal{T}$ consists of the adjustment problems of $M$ transmission interfaces, e.g., $\mathcal{T}(\{\Phi_m\}_{m=1}^M)$, where $M$ satisfies $1 < M < N_\Phi$.
In our simulation, each test scenario has only one specified task. In a given scenario with a single-interface task, the multi-task policy only needs to adjust the power flow of one transmission interface. Moreover, the same multi-task policy can also deal with other scenarios with different single-interface tasks. In contrast, for a scenario with a multi-interface task, the multi-task policy needs to adjust the power flows of multiple transmission interfaces to their pre-scheduled ranges at the same time.
B. Markov Decision Process
The transmission interface power flow adjustment task $\mathcal{T}$ can be formalized as a Markov decision process (MDP). An MDP is represented by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the set of continuous states, $\mathcal{A}$ is the set of discrete actions, $\mathcal{P}$ is the state transition function, $\mathcal{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount rate. The multi-step element in the power flow adjustment problem is the dispatch action sequence, where the maximum sequence length is set to 50. At each control step $t$, the agent with a given state $s \in \mathcal{S}$ selects an action $a \in \mathcal{A}$ through its policy $\pi(a|s)$. The agent then receives a reward $r$ and the next state $s'$ of the environment according to the reward function $\mathcal{R}(s, a)$ and the transition function $\mathcal{P}(s'|s, a)$. The goal of the agent is to find the optimal policy $\pi_\theta$ parameterized by $\theta$ that maximizes the discounted return:
$$J^\pi(\theta) = \mathbb{E}_{s \sim \rho(\cdot)}\left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \,\Big|\, s_0 = s \right], \quad (3)$$
where $\rho(\cdot)$ is the initial state distribution and $a_t \sim \pi_\theta(a_t|s_t)$. Moreover, the goal of multi-task reinforcement learning is to learn a compact policy $\pi$ for various tasks drawn from a distribution of tasks $p(\mathcal{T})$. We define a task-conditioned policy $\pi(a|s, z)$, where $z$ denotes a task representation for each task $\mathcal{T}$. The task representation $z$ can be provided as a one-hot vector or in any other form. Each task $\mathcal{T}$ is a standard MDP. Considering the average expected return across all tasks sampled from $p(\mathcal{T})$, the multi-task policy $\pi$ parameterized by $\theta$ is optimized to maximize the objective:
$$J^\pi(\theta) = \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\left[ J^\pi_\mathcal{T}(\theta) \right], \quad (4)$$
where $J^\pi_\mathcal{T}(\theta)$ is obtained directly from Eq. (3) with task $\mathcal{T}$. The detailed MDP formulation of the power system is defined as:
State. We define the power system operation state as a graph $s = (A, F)$, where $A$ is the adjacency matrix and $F$ is the node feature matrix. The graph-based state is modeled from the power system by considering the bus nodes as vertices and the transmission lines as edges. For each bus node $B$, its feature vector is described as $F_B = [P_B, Q_B, V_B, \omega_B]^T$, where $P_B$ and $Q_B$ are the active and reactive power of the bus node $B$, respectively, and $V_B$ and $\omega_B$ are the voltage magnitude and phase of the bus node $B$, respectively. We use power flow calculation to obtain the next state from the current power system operation state and the current dispatch action.
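The graph-based state above can be assembled directly from a solved power flow case. The sketch below is one possible construction, assuming a pandapower network object `net` whose power flow has already been solved; it is illustrative only and uses only the transmission lines as edges, as in the state definition.

```python
# Minimal sketch of the graph-based state s = (A, F): the adjacency matrix A
# comes from the line topology and the node features F = [P, Q, V, omega] come
# from the AC power flow results of a solved pandapower network `net`.
import numpy as np

def build_state(net):
    n = len(net.bus)
    A = np.zeros((n, n), dtype=np.float32)
    # map bus ids to consecutive row indices (bus ids need not be 0..n-1)
    idx = {b: i for i, b in enumerate(net.bus.index)}
    for _, line in net.line.iterrows():
        i, j = idx[line.from_bus], idx[line.to_bus]
        A[i, j] = A[j, i] = 1.0
    # per-bus injections and voltages from the power flow solution
    F = np.stack([
        net.res_bus.p_mw.to_numpy(),        # active power P_B
        net.res_bus.q_mvar.to_numpy(),      # reactive power Q_B
        net.res_bus.vm_pu.to_numpy(),       # voltage magnitude V_B
        net.res_bus.va_degree.to_numpy(),   # voltage phase omega_B
    ], axis=1).astype(np.float32)
    return A, F
```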
Action. The action adopted in our setting is generation dispatch, which enables the agent to change the active power of the controllable generators. The action set is defined as $\mathcal{A} = \mathcal{G} \times \{\sigma^-_G, \sigma^+_G\}$, where $\mathcal{G}$ is the set of controllable generators, and $\sigma^-_G$ and $\sigma^+_G$ represent the decreasing and increasing rates of the active power, respectively. We set $\sigma^-_G$ to 90% and $\sigma^+_G$ to 110% based on the limits of the ramp rates. At each discrete time step, the agent can reduce the active power of a selected generator to the 90% level or increase it to the 110% level. If the current generation is already at the upper or lower limit of the generator capacity, the corresponding invalid dispatch action is masked. It is worth mentioning that slack generators are not included in the controllable generators.
Reward. Considering both the power flow adjustment and the economic dispatch cost, the total reward function is defined as
$$R_{tot}(s, a) = R_{pf}(s, a) + R_{ed}(s, a). \quad (5)$$
To satisfy the limits of transmission interface power flow, the reward function of the single-interface task with the transmission interface $\Phi$ is defined as
$$R_{pf}(s, a; \Phi) = -\left| P_\Phi - \frac{\sigma^-_\Phi + \sigma^+_\Phi}{2} \right|. \quad (6)$$
For the multi-interface task with $\{\Phi_m\}_{m=1}^M$, we focus on the worst case and define the reward function as
$$R_{pf}\big(s, a; \{\Phi_m\}_{m=1}^M\big) = \min_{m \in \{1, 2, \cdots, M\}} \{ R_{pf}(s, a; \Phi_m) \}. \quad (7)$$
The most commonly used reward assumption for economic dispatch is quadratic:
$$R_{ed}(s, a) = -\sum_{i=1}^{N_G} \left( \alpha_i (P^G_i)^2 + \beta_i P^G_i + \lambda_i \right), \quad (8)$$
where $\alpha_i$, $\beta_i$, and $\lambda_i$ are the generation cost coefficients of generator $i$, and $P^G_i$ is the active power production of generator $i$. Moreover, the reward is directly set to $R_{tot}(s, a) = -100$ to penalize the divergence of power flow and the overload of slack generators, and to $R_{tot}(s, a) = 100$ to reward a successful adjustment in which the power flow of all adjusted transmission interfaces is within the pre-scheduled range.
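For clarity, the following sketch illustrates how a single dispatch action and the reward terms of Eqs. (5)-(8) could be computed, assuming the forms written above (negative absolute deviation from the range midpoint and negative quadratic generation cost) and a pandapower network `net`; the cost coefficients and generator index are hypothetical placeholders.

```python
# Minimal sketch of one dispatch action and the reward terms of Eqs. (5)-(8).
import pandapower as pp

def apply_action(net, gen_idx, direction):
    """direction = 0.9 (decrease) or 1.1 (increase) the selected generator's output."""
    p_new = net.gen.at[gen_idx, "p_mw"] * direction
    # clip to capacity limits (assumes the case provides min_p_mw / max_p_mw)
    lo, hi = net.gen.at[gen_idx, "min_p_mw"], net.gen.at[gen_idx, "max_p_mw"]
    net.gen.at[gen_idx, "p_mw"] = min(max(p_new, lo), hi)
    pp.runpp(net)                                   # next state via power flow calculation

def reward_pf(p_phi, sigma_lo, sigma_hi):
    return -abs(p_phi - 0.5 * (sigma_lo + sigma_hi))            # Eq. (6), assumed form

def reward_pf_multi(p_list, ranges):
    return min(reward_pf(p, lo, hi) for p, (lo, hi) in zip(p_list, ranges))   # Eq. (7)

def reward_ed(net, cost):
    # cost: list of hypothetical (alpha_i, beta_i, lambda_i) tuples per generator
    return -sum(a * p ** 2 + b * p + c
                for p, (a, b, c) in zip(net.res_gen.p_mw, cost))              # Eq. (8)
```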
III. METHODOLOGY
In what follows, we detail the proposed method for multi-
task transmission interface power flow adjustment. As shown in Figure 1, we first use a graph convolution network to learn the node features of the power system operation state. Meanwhile, the task representations are obtained from a common task encoder to capture the relationship among tasks. Each task representation then acts as a query to calculate node-level attention weights based on the node features, resulting in the task-adaptive attribution map. The attribution map enables the model to selectively integrate the underlying node features, achieving a compact state representation. Finally, we use a Dueling Deep Q Network to derive the final reinforcement learning policy.

Fig. 1. Illustration of the proposed method. The power system operation state is encoded by two GCN branches; the transmission interface multi-hot vector is mapped by the task encoder to a task representation; a softmax attention over the node embeddings yields the multi-task attribution map, which drives a weighted pooling into the state representation fed to the Dueling Deep Q Network that outputs the dispatch action.
A. Graph Convolution Network
A power system operation state $s \in \mathcal{S}$ is represented as $(A, F)$, where $A \in \{0,1\}^{N \times N}$ is the adjacency matrix with $N$ nodes, $A_{ij} = 1$ indicates that there is a transmission line between nodes $i$ and $j$, and $F \in \mathbb{R}^{N \times d_s}$ is the node feature matrix, assuming each node has $d_s$ features. To further extract information from the graph-based data, our framework uses a Graph Convolution Network (GCN) [28] to obtain node embeddings layer by layer. The GCN enables the nodes to exchange information with their neighbors in each convolutional layer, followed by learnable filters and a non-linear transformation. As depicted in the left column of Figure 1, the framework first uses a GCN $f(\cdot; \psi) : \mathbb{R}^{d_s} \to \mathbb{R}^{d_x}$ parameterized by $\psi$ to extract the node-level embeddings with dimension $d_x$, following the general message passing mechanism
$$H^{(k)} = \mathrm{ReLU}\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(k-1)} W^{(k-1)}_\psi\right), \quad (9)$$
where $\hat{A} = A + I$ is the adjacency matrix with added self-connections, $I$ is the identity matrix, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$ is the degree matrix, $W^{(k)}_\psi \in \mathbb{R}^{d^{(k)} \times d^{(k+1)}}$ is a trainable weight matrix with parameters $\psi$, $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ denotes the activation function, and $H^{(k)} \in \mathbb{R}^{N \times d^{(k)}}$ is the embedding matrix with dimension $d^{(k)}$ computed after $k$ steps of graph convolution. The input node embedding $H^{(0)}$ is initialized with the node feature matrix $F$. After running $K$ iterations of Eq. (9), the graph convolution network generates the final node-level embedding matrix $X = H^{(K)} \in \mathbb{R}^{N \times d_x}$. Here we adopt two GCN branches to obtain two node-level embedding matrices, as shown in Figure 1:
$$X_\rho = f(A, F; \psi_\rho) \in \mathbb{R}^{N \times d_x}, \quad (10)$$
$$X_\upsilon = f(A, F; \psi_\upsilon) \in \mathbb{R}^{N \times d_x}, \quad (11)$$
where $X_\rho$ is used to generate the multi-task attribution map, and $X_\upsilon$ is weighted-pooled by this attribution map to give a compact state representation.
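As a reference point for Eqs. (9)-(11), the following PyTorch sketch implements one GCN propagation step and a two-layer branch with 64 hidden units, matching the dimensions reported in Section IV; it is a simplified, single-graph (non-batched) illustration rather than the authors' released implementation.

```python
# Minimal sketch of one GCN propagation step, Eq. (9), and a two-layer branch.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = nn.Linear(d_in, d_out, bias=False)           # trainable W_psi

    def forward(self, A, H):
        # A: dense (N, N) float adjacency matrix, H: (N, d_in) node embeddings
        A_hat = A + torch.eye(A.size(0), device=A.device)     # add self-connections
        d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))   # D^{-1/2}
        return torch.relu(d_inv_sqrt @ A_hat @ d_inv_sqrt @ self.w(H))

class GCNBranch(nn.Module):
    """Two-layer branch f(.; psi) producing node embeddings X in R^{N x d_x}."""
    def __init__(self, d_s, d_x=64):
        super().__init__()
        self.layers = nn.ModuleList([GCNLayer(d_s, d_x), GCNLayer(d_x, d_x)])

    def forward(self, A, F):
        H = F
        for layer in self.layers:
            H = layer(A, H)
        return H                                              # X = H^(K), Eqs. (10)-(11)
```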
B. Multi-Task Attribution Map
Our main idea for achieving knowledge transfer among multiple tasks is to learn a multi-task attribution map that realizes task-adaptive node attribution. To generate the multi-task attribution map, we first need to capture and exploit both the common structure and the unique characteristics of tasks by learning task representations. We associate each task $\mathcal{T}$ with a representation $z_\mathcal{T} \in \mathbb{R}^{d_x}$ and expect it to reflect different properties of tasks. To model task similarity, all tasks share a common representation encoder, which takes the task vector as input and outputs the task representation. The task encoder is trained on all tasks.
Concretely, the task encoder is implemented by a Multi-layer Perceptron (MLP) network $g(\cdot; \xi)$ parameterized by $\xi$. Considering the single-interface task $\mathcal{T}(\Phi)$, the task representation $z_\mathcal{T}$ is calculated by
$$z_\mathcal{T} = g(o(\Phi); \xi) \in \mathbb{R}^{d_x}, \quad (12)$$
where $o(\Phi) = [o(\Phi)_1, o(\Phi)_2, \cdots, o(\Phi)_N]^T \in \{0, 1\}^N$ is the transmission interface multi-hot vector indicating whether each transmission line belongs to the transmission interface $\Phi$, with $N$ the total number of transmission lines: $o(\Phi)_i = 1$ if $i \in \Phi$ and $o(\Phi)_i = 0$ if $i \notin \Phi$. Moreover, the task representation $z_\mathcal{T}$ for the multi-interface task $\mathcal{T}(\{\Phi_m\}_{m=1}^M)$ is calculated by
$$z_\mathcal{T} = \frac{1}{M} \sum_{m=1}^{M} g(o(\Phi_m); \xi) \in \mathbb{R}^{d_x}. \quad (13)$$
When there is no task encoder and only the original vector is used to represent the tasks, each task is treated as orthogonal and independent. The task encoder allows us to express the relationship between tasks in the representation space, which enables the agent to use the similarity between tasks to generalize its knowledge.
Furthermore, to obtain the attribution map $\rho$, we use the task representation $z_\mathcal{T}$ as a query to calculate the node-level attention weights:
$$\rho_\mathcal{T} = \mathrm{softmax}(X_\rho z_\mathcal{T}) \in \mathbb{R}^N. \quad (14)$$
This attribution map $\rho$ is task-adaptive based on the task representation $z_\mathcal{T}$. With the attribution map $\rho$, the agent can explicitly distinguish the impact of different power system nodes on each adjustment task. The compact state representation $\upsilon_\mathcal{T}$ is then obtained via a weighted pooling operation:
$$\upsilon_\mathcal{T} = X_\upsilon^{\top} \rho_\mathcal{T} \in \mathbb{R}^{d_x}. \quad (15)$$
The state representation $\upsilon_\mathcal{T}$ selectively integrates the underlying node-level embeddings, which enables the agent to further derive a policy to handle the task $\mathcal{T}$.
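The attribution mechanism of Eqs. (12)-(15) can be summarized in a few lines of PyTorch. The sketch below is a simplified, non-batched illustration: the encoder dimensions follow the settings reported in Section IV, and averaging over interfaces corresponds to Eq. (13).

```python
# Minimal sketch of the multi-task attribution map, Eqs. (12)-(15).
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """MLP g(.; xi) mapping interface multi-hot vectors to a task representation z_T."""
    def __init__(self, n_lines, d_x=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_lines, 128), nn.ReLU(),
                                 nn.Linear(128, d_x))

    def forward(self, multi_hot):                 # shape (M, n_lines); M = 1 for single-interface
        return self.mlp(multi_hot).mean(dim=0)    # Eq. (13): average over interfaces

def attribution_map(x_rho, x_ups, z_t):
    rho_t = torch.softmax(x_rho @ z_t, dim=0)     # Eq. (14): node-level attention, shape (N,)
    ups_t = x_ups.t() @ rho_t                     # Eq. (15): weighted pooling, shape (d_x,)
    return rho_t, ups_t
```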
C. Dueling Deep Q Network
We use a value-based method to calculate the final reinforcement learning policy. Value-based methods assess the quality of a policy $\pi$ by the action-value function $Q^\pi$, defined as
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a \right], \quad (16)$$
which denotes the expected discounted return after the agent executes action $a$ at state $s$. A policy $\pi^*$ is optimal if
$$Q^{\pi^*}(s, a) \ge Q^\pi(s, a), \quad \forall \pi, \; s \in \mathcal{S}, \; a \in \mathcal{A}. \quad (17)$$
There is always at least one policy that is better than or equal to all other policies [16]. All optimal policies share the same optimal action-value function, denoted $Q^*$. It is easy to show that $Q^*$ satisfies the Bellman optimality equation:
$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{P}(\cdot|s, a)}\left[ \mathcal{R}(s, a) + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right]. \quad (18)$$
To estimate the optimal action-value function $Q^*$, Deep Q-Networks (DQN) [29] use a neural network $Q(s, a; \theta)$ with parameters $\theta$ as an approximator. Here we take the state representation $\upsilon_\mathcal{T}$ as input and adopt a dueling network architecture [30], which explicitly separates the estimation of state values and state-dependent action advantages. This corresponds to the following factorization of action values:
$$Q(\upsilon_\mathcal{T}, a; \theta) = V(\upsilon_\mathcal{T}; \theta_V) + A(\upsilon_\mathcal{T}, a; \theta_A) - \frac{\sum_{a' \in \mathcal{A}} A(\upsilon_\mathcal{T}, a'; \theta_A)}{|\mathcal{A}|}, \quad (19)$$
where $V(\cdot; \theta_V) : \mathbb{R}^{d_x} \to \mathbb{R}$ is the state value network and $A(\cdot; \theta_A) : \mathbb{R}^{d_x} \to \mathbb{R}^{|\mathcal{A}|}$ is the action advantage network.
Algorithm 1 Learning Algorithm for MAM
Initialize: Two GCN Branches f(·; ψρ) and f(·; ψυ), Task Encoder g(·; ξ), DQN Q(·; θ), Replay Buffer D
Return: Optimized Network Parameters {ψ, ξ, θ}
1: repeat
2: Sample a test scenario with the task from p(T)
3: Obtain the initial state s0= (A, F0)
4: while not terminal do
5: Construct the multi-task attribution map
6: Calculate the node embedding matrices Xρ, Xυ
7: Calculate the task representation zT
8: Construct the attribution map ρT
9: Deliver the final policy
10: Calculate the state representation υT
11: Calculate the action values Q(υT,·;θ)
12: Execute the action at= arg maxaQ(υT, a;θ)
13: Receive the reward rt+1 and the next state st+1
14: Store the sample to the buffer D
15: Train the networks
16: Sample the random minibatch from the buffer D
17: Compute the TD loss L(ψ, ξ, θ)
18: Perform gradient descent on the TD loss L(ψ, ξ, θ)
19: end while
20: until convergence
For transmission interface power flow adjustment, most dispatch actions have little impact on the power system. The dueling architecture enables the agent to learn the value of the state itself without having to learn the effect of each action on that state, which is especially useful in states where the actions have little effect on the environment [30]. Finally, the dispatch action is obtained by a greedy policy $\pi_\mathcal{T}(s) = \arg\max_{a \in \mathcal{A}} Q(\upsilon_\mathcal{T}, a; \theta)$.
We optimize the networks by minimizing the following temporal-difference (TD) loss based on the Bellman equation:
$$\mathcal{L}(\psi, \xi, \theta) = \mathbb{E}_{(\mathcal{T}, s, a, r, s') \sim \mathcal{D}}\left[ \left( y - Q(\upsilon_\mathcal{T}, a; \theta) \right)^2 \right], \quad (20)$$
where $\mathcal{D}$ is the replay buffer of transitions, $y = r + \gamma Q\big(\upsilon'_\mathcal{T}, \arg\max_{a'} Q(\upsilon'_\mathcal{T}, a'; \theta); \theta^-\big)$, and $\theta^-$ represents the parameters of the target network [31]. An algorithmic description of the training procedure is given in Algorithm 1.
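The TD target above combines the online and target networks in a double-Q fashion: the online network selects the greedy next action and the target network evaluates it. The following PyTorch sketch illustrates this computation for a minibatch; the terminal-state masking with `done` is an implementation detail assumed here rather than stated in the paper.

```python
# Minimal sketch of the TD loss of Eq. (20) with a target network.
import torch

def td_loss(q_online, q_target, ups, a, r, ups_next, done, gamma=0.9):
    """q_online/q_target map a batch of state representations to Q-values over actions.
    ups, ups_next: (B, d_x); a: (B,) long; r, done: (B,) float."""
    q_sa = q_online(ups).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(ups_T, a; theta)
    with torch.no_grad():
        a_star = q_online(ups_next).argmax(dim=1, keepdim=True)     # argmax_a' Q(.; theta)
        q_next = q_target(ups_next).gather(1, a_star).squeeze(1)    # evaluated with theta^-
        y = r + gamma * (1.0 - done) * q_next                       # TD target
    return ((y - q_sa) ** 2).mean()
```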
IV. CASE STUDY
To demonstrate the effectiveness of the proposed Multi-task
Attribution Map (MAM) method, case studies are conducted
based on two small-scale systems and one large-scale system
using the PandaPower simulation [32]. In this section, we first
provide the details for the scenario generation process. Then
the compared methods and parameter settings are introduced.
The comparison results of different baselines are reported
to evaluate the performance of MAM. Moreover, several
visualization examples of MAM in different scenarios are
given to study the internal mechanism of MAM.
Fig. 2. Illustration of the IEEE 118-bus system. Different transmission
interfaces are represented by different colors. Please zoom for better view.
Fig. 3. Illustration of the realistic 300-bus system in China. Different
transmission interfaces are represented by different colors.
Fig. 4. Illustration of the European 9241-bus system from the PEGASE
project. Different transmission interfaces are represented by different colors.
TABLE I
THE PRE-SCHEDULED POWER RANGE (MW) OF EACH TRANSMISSION INTERFACE IN THE IEEE 118-BUS SYSTEM, THE REALISTIC 300-BUS SYSTEM AND THE EUROPEAN 9241-BUS SYSTEM. "#" DENOTES THE TOTAL NUMBERS OF THE INSECURE SCENARIOS.

Transmission Interface | 118-bus System Range | #   | 300-bus System Range | #   | 9241-bus System Range | #
Φ1                     | [90, 640]            | 150 | [140, 1000]          | 182 | [15, 110]             | 202
Φ2                     | [50, 360]            | 185 | [280, 1960]          | 162 | [840, 5880]           | 210
Φ3                     | [40, 290]            | 211 | [170, 1200]          | 174 | [145, 1000]           | 204
Φ4                     | [90, 640]            | 238 | [240, 1680]          | 188 | [55, 390]             | 204
Φ5                     | [70, 480]            | 161 | [460, 3200]          | 198 | [560, 3920]           | 200
Φ6                     | [45, 300]            | 185 | [200, 1400]          | 204 | [360, 2520]           | 203
Φ7                     | [130, 880]           | 160 | [200, 1400]          | 98  | [345, 2410]           | 208
Φ8                     | [55, 390]            | 184 | [480, 3360]          | 171 | [760, 5320]           | 205
Φ9                     | [130, 880]           | 158 | [170, 1200]          | 220 | [130, 910]            | 200
Φ10                    | [90, 615]            | 197 | [80, 590]            | 220 | [180, 1260]           | 201
A. Scenario Generation
To obtain adequate scenarios for training and testing, we
first use two small-scale systems for simulations, including the
IEEE 118-bus system and a realistic regional 300-bus system
in China with 182 loads, 23 generators and 313 AC lines.
Then we adopt a very large European 9241-bus system from
the PEGASE project [33, 34]. The detailed scenario generation
process is summarized as follows:
• To simulate various scenarios in a given power system, we randomly select 25% of the total loads and generators in each scenario, and then randomly perturb the load and generator power outputs from 10% to 200% of the original power flow with an interval of 10% (see the sketch after this list).
• For the different power flow adjustment tasks shown in Figure 2, Figure 3 and Figure 4, 10 transmission interfaces are selected for each power system, respectively.
• The pre-scheduled power range $[\sigma^-_\Phi, \sigma^+_\Phi]$ of each transmission interface $\Phi$ is provided for power flow adjustment, as shown in Table I. For each power flow adjustment task, only the insecure scenarios, in which the power flow of the corresponding transmission interfaces falls outside the pre-scheduled power range, are considered for the case studies. We form a control horizon based on the dispatch action sequence. The total numbers of insecure scenarios for each transmission interface in the different systems are also listed in Table I.
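The perturbation rule in the first item above can be illustrated with pandapower as follows; the sketch only mirrors the described 25% selection and 10%-200% scaling and is not the exact scenario generator used for the case studies.

```python
# Minimal sketch of the scenario perturbation rule: pick 25% of loads and
# generators and scale them to a factor drawn from {0.1, 0.2, ..., 2.0}.
import numpy as np
import pandapower as pp
import pandapower.networks as pn

def perturb_scenario(net, rng):
    factors = np.arange(0.1, 2.01, 0.1)                       # 10% to 200%, step 10%
    for table in ("load", "gen"):
        df = getattr(net, table)
        picked = rng.choice(df.index, size=max(1, len(df) // 4), replace=False)
        df.loc[picked, "p_mw"] *= rng.choice(factors, size=len(picked))
    pp.runpp(net)   # power flow may diverge for extreme perturbations; such cases would be discarded
    return net

rng = np.random.default_rng(0)
scenario = perturb_scenario(pn.case118(), rng)
```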
Finally, a total number of 1829 scenarios for the IEEE 118-
bus system are obtained, including 1656 scenarios for training
and 173 scenarios for testing. For the realistic 300-bus system
in China, a total number of 1817 scenarios are obtained,
including 1637 scenarios for training and 180 scenarios for
testing. For the European 9241-bus system, a total number
of 2037 scenarios are obtained, including 1848 scenarios for
training and 189 scenarios for testing.
B. Compared Methods and Parameter Settings
The proposed Multi-task Attribution Map method, termed
as MAM, is compared with the existing state-of-the-art DRL
methods in discrete action space, including Deep Q Net-
work (DQN) [29], Double DQN [31], Dueling DQN [30],
Advantage Actor Critic (A2C) [35] and Proximal Policy Op-
timization (PPO) [36]. Two basic multi-task architectures are
considered for all DRL baselines: (1) Concat-based architecture, using one MLP network that takes the concatenation of the task vector and the node feature vector as input; (2) Soft-based architecture, using the soft modularization network [37] that enables the agent to select different MLP network modules based on the task representation. Moreover, we also adopt an Optimal Power Flow (OPF) method as a conventional model-based baseline based on PYPOWER [38]. The internal solver of OPF uses the interior point method, where the optimization is initialized with a valid power flow solution. All reported results of the different methods are obtained on the unseen test scenarios. Three evaluation metrics are used: test success rate, test economic cost and inference speed.
We adopt the Tianshou RL framework [39] to run all ex-
periments, which uses the PyTorch library for implementation.
Moreover, power flow calculation is performed based on the
PandaPower simulation [32]. The detailed hyperparameters are
given as follows, where the common hyperparameters across different baselines are kept consistent to ensure comparability, and their method-specific hyperparameters can be found in the source code. In the proposed MAM, each GCN branch contains a 2-
layer GCN with the dimension of [64,64]. The task encoder is
implemented by a 2-layer MLP network with the dimension
of [128,64]. For Dueling architecture, a 2-layer value network
with the dimension of 128 and a 1-layer advantage network
with the dimension of 128 are applied. All network modules
use ReLU activation function. Batches of 64 episodes are
sampled from the replay buffer with the size of 20K for every
training iteration. The target update interval is set to 100, and
the discount factor is set to 0.9. We use the Adam optimizer
with a learning rate of 0.001. For exploration, ϵ-greedy is
used with ϵannealed linearly from 0.1 to 0.01 over 500K
training steps and kept constant for the rest of the training.
Case studies are carried out on a computing platform with
Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz, 128 GB
RAM, and Quadro P5000 GPU.
C. Single-Interface Task
To validate the effectiveness of our method in single-
interface tasks under the multi-task setting, we conduct ex-
periments on both small-scale and large-scale power sys-
tems. For each power system, 10 selected transmission in-
terfaces are considered as 10 single-interface tasks. We consider the multi-task setting with all single-interface tasks $\{\mathcal{T}(\Phi_1), \mathcal{T}(\Phi_2), \cdots, \mathcal{T}(\Phi_{10})\}$. Each single-interface task is randomly sampled at each episode. The learning curves of the different DRL methods on the different power systems are shown in Figures 5a-5c. The final performance of all DRL and OPF methods is shown in Table II. The average inference speed of our method and the OPF baseline is shown in Table III.
In the IEEE 118-bus system and the realistic 300-bus
system, our proposed MAM successfully improves the final
performance and offers a very high inference speed. (1) From
the perspective of different multi-task architectures, the soft-
based architecture consistently outperforms the concat-based
TABLE II
THE PERFORMANCE OF OUR METHOD AND BASELINES IN SINGLE-INTERFACE TASKS UNDER THE MULTI-TASK SETTING. THE PERFORMANCE IS OBTAINED UNDER THE AVERAGE EVALUATION OVER 5 TRIALS. THE PERFORMANCE DIFFERENCES BETWEEN OUR METHOD AND BASELINES ARE GIVEN IN PARENTHESES. NOTE THAT THE HIGHER THE TEST SUCCESS RATE AND THE LOWER THE TEST ECONOMIC COST, THE BETTER THE PERFORMANCE OF THE METHOD.

Test Success Rate (%)
Method               | 118-bus System | 300-bus System | 9241-bus System
DQN (Concat)         | 61.23 (-35.38) | 56.74 (-38.06) | 13.67 (-29.19)
DQN (Soft)           | 78.53 (-18.08) | 81.88 (-12.92) | 23.28 (-19.58)
Double DQN (Concat)  | 53.28 (-43.33) | 61.24 (-33.56) | 16.75 (-26.11)
Double DQN (Soft)    | 82.35 (-14.26) | 83.11 (-11.69) | 16.67 (-26.17)
Dueling DQN (Concat) | 68.44 (-28.17) | 57.66 (-37.14) | 37.57 (-5.29)
Dueling DQN (Soft)   | 85.23 (-11.38) | 85.30 (-9.50)  | 31.13 (-11.73)
A2C (Concat)         | 80.14 (-16.47) | 75.61 (-19.19) | 12.70 (-30.16)
A2C (Soft)           | 80.10 (-16.51) | 80.14 (-14.66) | 17.64 (-25.22)
PPO (Concat)         | 24.28 (-72.33) | 48.61 (-46.19) | 19.40 (-23.46)
PPO (Soft)           | 24.28 (-72.33) | 52.78 (-42.02) | 22.22 (-20.64)
OPF                  | 72.25 (-24.36) | 50.56 (-44.24) | 51.32 (+8.46)
MAM (Ours)           | 96.61          | 94.80          | 42.86

Test Economic Cost ($)
Method               | 118-bus System    | 300-bus System    | 9241-bus System
Original             | 632,354 (+8,901)  | 999,927 (+29,669) | 325,606 (+1,112)
DQN (Concat)         | 626,808 (+3,355)  | 991,762 (+21,504) | 324,658 (+164)
DQN (Soft)           | 625,598 (+2,145)  | 986,337 (+16,079) | 325,554 (+1,060)
Double DQN (Concat)  | 638,079 (+14,626) | 990,126 (+19,868) | 324,792 (+298)
Double DQN (Soft)    | 625,279 (+1,826)  | 982,362 (+12,104) | 325,790 (+1,296)
Dueling DQN (Concat) | 626,215 (+2,762)  | 992,048 (+21,790) | 324,485 (-9)
Dueling DQN (Soft)   | 625,128 (+1,675)  | 972,573 (+2,315)  | 325,152 (+658)
A2C (Concat)         | 626,703 (+3,250)  | 976,467 (+6,209)  | 325,000 (+506)
A2C (Soft)           | 625,381 (+1,928)  | 976,803 (+6,545)  | 324,808 (+314)
PPO (Concat)         | 596,315 (-27,138) | 930,581 (-39,677) | 325,549 (+1,055)
PPO (Soft)           | 596,577 (-26,876) | 936,181 (-34,077) | 325,303 (+809)
OPF                  | 606,259 (-17,194) | 953,349 (-16,909) | 321,586 (-2,908)
MAM (Ours)           | 623,453           | 970,258           | 324,494
TABLE III
THE AVERAGE INFERENCE SPEED OF OUR METHOD AND OPF BASELINES IN SINGLE-INTERFACE TASKS. ± CORRESPONDS TO ONE STANDARD DEVIATION OF THE AVERAGE EVALUATION OVER ALL TEST SCENARIOS.

Inference Time (s)
Method     | 118-bus System  | 300-bus System  | 9241-bus System
OPF        | 44.822 ± 38.459 | 47.668 ± 28.230 | 1912.608 ± 1636.746
MAM (Ours) | 0.006 ± 0.009   | 0.010 ± 0.011   | 0.051 ± 0.025
one, while our proposed method still achieves the best performance. The idea of the soft-based architecture is similar to ours: it focuses on learning the relationship between network modules and tasks, but ignores the relationship between input nodes and tasks. Different node features are still coupled together, leading to poor generalization at test time due to over-fitting on noisy nodes during training. Thus, the proposed MAM takes advantage of the attribution map to selectively integrate the node features into a compact state representation for the final policy. (2) Comparing different DRL methods, despite DQN and its variants showing high learning efficiency, these methods inevitably fall into suboptimal policies with low test success rates. The performance of the A2C algorithm is worse than Dueling DQN, while PPO cannot learn any effective policy and performs poorly.
(a) IEEE 118-bus System (Single-Interface Tasks); (b) Realistic 300-bus System (Single-Interface Tasks); (c) European 9241-bus System (Single-Interface Tasks); (d) IEEE 118-bus System (Multi-Interface Tasks); (e) Realistic 300-bus System (Multi-Interface Tasks); (f) European 9241-bus System (Multi-Interface Tasks).
Fig. 5. Learning curves of our method and baselines in both single-interface tasks and multi-interface tasks under the multi-task setting. All experimental
results are illustrated with the median performance and one standard deviation (shaded region) over 5 random seeds for a fair comparison.
In contrast, our proposed MAM works quite well on the test
success rate and outperforms all state-of-the-art DRL methods
during training. Furthermore, the proposed MAM not only
achieves the best test success rate, but also obtains the lowest
test economic cost except for PPO. Despite the lower test
economic cost obtained by PPO, it is worth noting that the
success rate of PPO is extraordinarily low. PPO only arbitrarily
reduces the power of the generators without considering the
power flow target of the transmission interface, which is
unacceptable. DQN and its variants are off-policy algorithms, while A2C and PPO are on-policy algorithms. An off-policy algorithm enables the agent to update its policy based on interaction samples from historical policies, whereas an on-policy algorithm only uses interaction samples from the current policy. Therefore, the sample efficiency of off-policy algorithms is higher than that of on-policy algorithms, which is beneficial in the power system environment where the control sequence is short. On the other hand, PPO imposes a policy constraint over the A2C algorithm, which may lead to a locally optimal solution due to the short control sequence.
It is notable that our MAM also achieves superior performance compared to the conventional OPF method in the small-scale systems. We only focus on the insecure scenarios for each specific transmission interface, where the OPF method requires solving complex nonlinear optimization problems. Although the OPF method can easily obtain a lower test economic cost, it often fails to satisfy the power flow constraints and performs poorly on the test success rate. In contrast, our MAM can learn to generalize across different scenarios and make near-optimal decisions, making it more robust and flexible. The results suggest that the attribution map can be utilized to explore the diverse control mechanisms for different tasks, which helps the agent construct a more useful policy and achieve non-trivial performance.
In the 9241-bus system, the power flow adjustment problems become much more challenging due to the large exploration space. We can observe that all DRL baselines fail to perform well, while only OPF and MAM can still offer reasonable results. However, the conventional OPF method becomes computationally expensive as the system size increases, requiring a long time to find a solution in the nonlinear, nonconvex search space. On the contrary, the model-free MAM method can directly use neural network forward propagation to deliver dispatch actions in a more efficient manner. Thus, MAM can provide inference speed guarantees competent for practical deployment, while the time consumption of OPF is unacceptable. Moreover, MAM also exhibits its advantage in filtering out noisy information and yields performance on par with OPF in the large-scale system.
D. Multi-Interface Task
To further evaluate the proposed method in multi-interface
tasks under the multi-task setting, we also conduct experiments
on two power systems. In the IEEE 118-bus system, we consider the multi-task setting with different 5-interface tasks $\{\mathcal{T}(\{\Phi_m\}_{m \in \mathcal{M}})\}$, where $\mathcal{M}$ is a set of 5 transmission interfaces randomly selected from all transmission interfaces. In the more complex realistic 300-bus system and the European 9241-bus system, we consider the multi-task setting with
TABLE IV
THE PERFORMANCE OF OUR METHOD AND BASELINES IN MULTI-INTERFACE TASKS UNDER THE MULTI-TASK SETTING.

Test Success Rate (%)
Method               | 118-bus System | 300-bus System | 9241-bus System
DQN (Concat)         | 38.66 (-30.79) | 34.12 (-39.06) | 19.71 (-13.27)
DQN (Soft)           | 56.92 (-12.53) | 46.26 (-26.92) | 19.66 (-13.32)
Double DQN (Concat)  | 43.52 (-25.93) | 36.26 (-36.92) | 24.51 (-8.47)
Double DQN (Soft)    | 57.44 (-12.01) | 38.26 (-34.92) | 18.69 (-14.29)
Dueling DQN (Concat) | 42.65 (-26.80) | 42.15 (-31.03) | 29.10 (-3.88)
Dueling DQN (Soft)   | 61.54 (-7.91)  | 56.67 (-16.51) | 26.98 (-6.00)
A2C (Concat)         | 28.19 (-41.26) | 60.98 (-12.20) | 17.20 (-15.78)
A2C (Soft)           | 21.69 (-47.76) | 64.57 (-8.61)  | 17.99 (-14.99)
PPO (Concat)         | 38.25 (-31.20) | 25.68 (-47.50) | 18.78 (-14.20)
PPO (Soft)           | 54.37 (-15.08) | 33.41 (-39.77) | 23.19 (-9.79)
OPF                  | 47.98 (-21.47) | 13.89 (-59.29) | 37.57 (+4.59)
MAM (Ours)           | 69.45          | 73.18          | 32.98

Test Economic Cost ($)
Method               | 118-bus System    | 300-bus System    | 9241-bus System
Original             | 632,354 (+57,392) | 999,927 (+87,212) | 325,606 (+1,180)
DQN (Concat)         | 619,440 (+44,478) | 945,438 (+32,723) | 325,316 (+890)
DQN (Soft)           | 577,665 (+2,703)  | 941,011 (+28,296) | 325,662 (+1,236)
Double DQN (Concat)  | 597,228 (+22,266) | 943,314 (+30,599) | 327,074 (+2,648)
Double DQN (Soft)    | 580,558 (+5,596)  | 955,407 (+42,692) | 326,951 (+2,525)
Dueling DQN (Concat) | 582,386 (+7,424)  | 924,819 (+12,104) | 324,533 (+107)
Dueling DQN (Soft)   | 576,127 (+1,165)  | 920,950 (+8,235)  | 324,855 (+429)
A2C (Concat)         | 606,632 (+31,670) | 921,492 (+8,777)  | 325,626 (+1,200)
A2C (Soft)           | 625,292 (+50,330) | 916,958 (+4,243)  | 325,659 (+1,233)
PPO (Concat)         | 599,601 (+24,639) | 938,474 (+25,759) | 325,456 (+1,030)
PPO (Soft)           | 605,275 (+30,313) | 936,014 (+23,299) | 325,547 (+1,121)
OPF                  | 554,691 (-20,271) | 956,624 (+43,909) | 321,694 (-2,732)
MAM (Ours)           | 574,962           | 912,715           | 324,426
TABLE V
THE AVERAGE INFERENCE SPEED OF OUR METHOD AND OPF BASELINES IN MULTI-INTERFACE TASKS.

Inference Time (s)
Method     | 118-bus System  | 300-bus System  | 9241-bus System
OPF        | 67.958 ± 30.218 | 57.729 ± 23.794 | 2416.349 ± 2016.855
MAM (Ours) | 0.011 ± 0.003   | 0.030 ± 0.032   | 0.066 ± 0.016
different 3-interface tasks. Figures 5d-5f present the learning curves of the compared methods on the different power systems, while the final performance is shown in Table IV and the average inference speed is shown in Table V.
In the different power systems, the simulation results are similar to those of the single-interface tasks. The soft-based architecture still outperforms the concat-based one from the perspective of different multi-task architectures, while our MAM approach consistently exceeds all baselines by a large margin not only in test success rate but also in test economic cost. Compared with the conventional OPF-based method, MAM significantly reduces the computational cost. The proposed MAM is particularly beneficial in these multi-interface tasks, as it enables us to explore the diverse common critical nodes and generalize across various tasks. Notably, despite the encouraging results achieved, our proposed MAM cannot achieve the same high test success rate in the multi-interface tasks as in the single-interface tasks. The multi-interface tasks are more complicated than single-interface tasks because of the coupling relationship
Fig. 6. Learning curves of our proposed MAM and its ablations on the IEEE
118-bus system with multi-interface tasks.
TABLE VI
THE PERFORMANCE OF OUR PROPOSED MAM AND ITS ABLATIONS ON THE IEEE 118-BUS SYSTEM WITH MULTI-INTERFACE TASKS.

Method | Test Success Rate (%) | Test Economic Cost ($) | Inference Time (s)
MAM-O  | 64.16 (-5.29)         | 581,607 (+6,645)       | 0.010 (-0.001)
MAM-M  | 58.07 (-11.38)        | 591,698 (+16,736)      | 0.007 (-0.004)
MAM-W  | 71.19 (+1.74)         | 602,992 (+28,030)      | 0.012 (+0.001)
MAM    | 69.45                 | 574,962                | 0.011
between different transmission interfaces. In some extreme op-
erating condition scenarios, there may even be no adjustment
policy that can successfully satisfy all power flow constraints.
E. Ablation Study
To understand the superior performance of the proposed
MAM, we carry out ablation studies to test the contribution
of its different components as follows. The comparison results of the various ablations on the IEEE 118-bus system with multi-interface tasks are shown in Figure 6 and Table VI.
• MAM-O. We use one shared GCN instead of two GCN branches to obtain the two node-level embedding matrices. The results suggest that sharing parameters reduces the computation cost while downgrading the final performance. This is because the two embedding matrices from the GCN play different roles in our attribution mechanism: one matrix is used to generate the multi-task attribution map, which only distinguishes the impact of different power system nodes on each adjustment task, while the other matrix is used to give a compact state representation according to the multi-task attribution map. If we only use one shared GCN, the two embedding matrices are forced to be the same, which severely damages the model expressiveness.
• MAM-M. We directly take the multi-task attribution map instead of the state representation as the input of the Dueling Deep Q Network. By comparing MAM with MAM-M, we can conclude that the multi-task attribution map alone is not enough to train the subsequent Dueling Deep Q Network. Although the multi-task attribution map enables the agent to pay attention to the important nodes, the agent still fails to make proper decisions and performs poorly due to the lack of state information.
• MAM-W. We use a weighted average operation with learnable parameters instead of the plain average operation to obtain the task representation for the multi-interface task.
Fig. 7. A visualization example of the attribution maps in different single-interface scenarios under the IEEE 118-bus system. The attention magnitudes of
the node attribution maps are represented by circles with different red shades. Please zoom for better view.
[Figure panel: power flow (MW) of Transmission Interfaces 1, 3, 6, 7 and 10 over dispatch steps 0-7, with upper/lower bounds of 640/90, 290/40, 300/45, 880/130 and 615/90 MW, respectively.]
Fig. 8. A visualization example of the attribution maps in one 5-interface scenario under the IEEE 118-bus system. Please zoom for better view.
The weighted average operation allows the agent to weigh different transmission interfaces rather than regard them as equally important. Thus, it is reasonable that MAM-W achieves results on par with, and sometimes slightly superior to, MAM on the test success rate. However, MAM obtains a lower test economic cost than MAM-W.
F. Visualization Analysis
To further explain the attribution map learned by MAM, we conduct a qualitative analysis. The attribution maps in different single-interface scenarios under the multi-task setting are visualized in Figure 7. The simulation results show that the proposed method successfully learns different sparse attribution maps for different transmission interfaces, which significantly reduces the state space. Specifically, we find that the high-attention nodes are close to the corresponding transmission interface in electrical distance and usually have a direct impact on the power flow of the transmission interface. Moreover, the final dispatch generators are also located at the nodes with high attention. This example clearly demonstrates the interpretability and effectiveness of the proposed MAM. MAM also demonstrates promising results in the multi-interface scenario, as shown in Figure 8. The attribution map learned in the multi-interface scenario differs from that in the single-interface scenario, as it provides the important nodes for multiple transmission interfaces at once. MAM then selects the generators to dispatch based on the attribution maps. In general, the visual analysis suggests that the proposed method masters the ability to explicitly distinguish the impact of different power system nodes on each transmission interface. Moreover, MAM also learns the relationship between different transmission interfaces, which enables the generalizable policy to handle the multi-task adjustment problem.
V. CONCLUSION
In this work, we propose a novel approach, termed MAM, that takes advantage of a multi-task attribution map for transmission interface power flow adjustment. MAM follows a restructuring-by-disentangling scheme, where several distinguishable node attentions are generated and then selectively reassembled for the final focused policy. We validate MAM on the IEEE 118-bus system, a realistic regional system in China, and a very large European 9241-bus system, and show that it yields results significantly superior to state-of-the-art techniques in both single-interface and multi-interface tasks under the multi-task setting. To the best of our knowledge, this paper is the first attempt towards learning multiple transmission interface power flow adjustment tasks jointly.
Limitations. Currently, we only focus on generalization over different power flow adjustment tasks under the given transmission interfaces. We observe that the proposed MAM successfully generalizes to unseen scenarios and exhibits superior performance. However, MAM cannot directly handle a new transmission interface that the agent has not been trained on. The premise of generalization in deep learning is sufficient samples. For example, there may be hundreds of transmission interfaces in the IEEE 118-bus system, while we only consider 10 critical transmission interfaces in this work; it is therefore hard to learn the relationship between the trained transmission interfaces and a new one. Studying this few-shot generalization is an interesting direction for our future work.
REFERENCES
[1] N. Hatziargyriou, J. Milanovic, C. Rahmann, V. Ajjarapu,
C. Canizares, I. Erlich, D. Hill, I. Hiskens, I. Kamwa,
B. Pal et al., “Definition and classification of power sys-
tem stability–revisited & extended, IEEE Transactions
on Power Systems, vol. 36, no. 4, pp. 3271–3281, 2020.
[2] X. Chen, G. Qu, Y. Tang, S. Low, and N. Li, “Reinforce-
ment learning for selective key applications in power
systems: Recent advances and future challenges, IEEE
Transactions on Smart Grid, vol. 13, no. 4, pp. 2935–
2958, 2022.
[3] B.-h. Zhang, F. Yao, D.-c. Zhou, L.-y. Wang, and B.-
g. Zou, “Study on security protection of transmission
section and its key technologies, Proceedings of the
CSEE, vol. 26, no. 21, pp. 1–7, 2006.
[4] P. Kaymaz, J. Valenzuela, and C. S. Park, “Transmission
congestion and competition on power generation expan-
sion,” IEEE Transactions on Power Systems, vol. 22,
no. 1, pp. 156–163, 2007.
[5] L. Min and A. Abur, “Total transfer capability compu-
tation for multi-area power systems, IEEE Transactions
on power systems, vol. 21, no. 3, pp. 1141–1147, 2006.
[6] H. Sun, F. Zhao, H. Wang, K. Wang, W. Jiang, Q. Guo,
B. Zhang, and L. Wehenkel, “Automatic learning of fine
operating rules for online power system security control,
IEEE transactions on neural networks and learning
systems, vol. 27, no. 8, pp. 1708–1719, 2015.
[7] J.-H. Liu and C.-C. Chu, “Iterative distributed algorithms
for real-time available transfer capability assessment of
multiarea power systems, IEEE Transactions on Smart
Grid, vol. 6, no. 5, pp. 2569–2578, 2015.
[8] W. Lin, Z. Yang, J. Yu, L. Jin, and W. Li, “Tie-line power
transmission region in a hybrid grid: Fast characterization
and expansion strategy,” IEEE Transactions on Power
Systems, vol. 35, no. 3, pp. 2222–2231, 2019.
[9] S. Mudaliyar, B. Duggal, and S. Mishra, “Distributed tie-
line power flow control of autonomous dc microgrid clus-
ters,” IEEE Transactions on Power Electronics, vol. 35,
no. 10, pp. 11 250–11 266, 2020.
[10] F. Capitanescu, J. M. Ramos, P. Panciatici, D. Kirschen, A. M. Marcolini, L. Platbrood, and L. Wehenkel, “State-of-the-art, challenges, and future trends in security constrained optimal power flow,” Electric Power Systems Research, vol. 81, no. 8, pp. 1731–1741, 2011.
[11] T. B. Nguyen and M. Pai, “Dynamic security-constrained rescheduling of power systems using trajectory sensitivities,” IEEE Transactions on Power Systems, vol. 18, no. 2, pp. 848–854, 2003.
[12] I. A. Hiskens and J. Alseddiqui, “Sensitivity, approximation, and uncertainty in power system dynamic simulation,” IEEE Transactions on Power Systems, vol. 21, no. 4, pp. 1808–1820, 2006.
[13] S. Dutta and S. Singh, “Optimal rescheduling of generators for congestion management based on particle swarm optimization,” IEEE Transactions on Power Systems, vol. 23, no. 4, pp. 1560–1569, 2008.
[14] D. K. Molzahn, F. Dörfler, H. Sandberg, S. H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei, “A survey of distributed optimization and control algorithms for electric power systems,” IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2941–2962, 2017.
[15] Z. Zhang, D. Zhang, and R. C. Qiu, “Deep reinforcement learning for power system applications: An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225, 2019.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[17] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi, “Deep-reinforcement-learning-based autonomous voltage control for power grid operations,” IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814–817, 2020.
[18] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, “Two-timescale voltage control in distribution grids using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2313–2323, 2019.
[19] D. Cao, J. Zhao, W. Hu, N. Yu, F. Ding, Q. Huang, and Z. Chen, “Deep reinforcement learning enabled physical-model-free two-timescale voltage control method for active distribution systems,” IEEE Transactions on Smart Grid, vol. 13, no. 1, pp. 149–165, 2021.
[20] W. Liu, P. Zhuang, H. Liang, J. Peng, and Z. Huang, “Distributed economic dispatch in microgrids based on cooperative reinforcement learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2192–2203, 2018.
[21] P. Dai, W. Yu, G. Wen, and S. Baldi, “Distributed reinforcement learning algorithm for dynamic economic dispatch with unknown generation cost functions,” IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2258–2267, 2019.
[22] D. Li, L. Yu, N. Li, and F. Lewis, “Virtual-action-based
coordinated reinforcement learning for distributed eco-
nomic dispatch,” IEEE Transactions on Power Systems,
vol. 36, no. 6, pp. 5143–5152, 2021.
[23] Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan, and Z. Huang, “Adaptive power system emergency control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1171–1182, 2019.
[24] C. Chen, M. Cui, F. Li, S. Yin, and X. Wang, “Model-
free emergency frequency control based on reinforcement
learning,” IEEE Transactions on Industrial Informatics,
vol. 17, no. 4, pp. 2336–2346, 2020.
[25] D. Cao, W. Hu, J. Zhao, G. Zhang, B. Zhang, Z. Liu,
Z. Chen, and F. Blaabjerg, “Reinforcement learning and
its applications in modern power and energy systems: A
review,” Journal of Modern Power Systems and Clean
Energy, vol. 8, no. 6, pp. 1029–1042, 2020.
[26] Q. Gao, Y. Liu, J. Zhao, J. Liu, and C. Y. Chung, “Hy-
brid deep learning for dynamic total transfer capability
control,” IEEE Transactions on Power Systems, vol. 36,
no. 3, pp. 2733–2736, 2021.
[27] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” in Annual Conference on Neural Information Processing Systems, 2020.
[28] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations, 2017.
[29] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level
control through deep reinforcement learning,” Nature,
vol. 518, no. 7540, pp. 529–533, 2015.
[30] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanc-
tot, and N. de Freitas, “Dueling network architectures
for deep reinforcement learning,” in International Con-
ference on Machine Learning, 2016.
[31] H. van Hasselt, A. Guez, and D. Silver, “Deep rein-
forcement learning with double q-learning,” in AAAI
Conference on Artificial Intelligence, 2016.
[32] L. Thurner, A. Scheidler, F. Schäfer, J. Menke, J. Dollichon, F. Meier, S. Meinecke, and M. Braun, “Pandapower – An open-source Python tool for convenient modeling, analysis, and optimization of electric power systems,” IEEE Transactions on Power Systems, vol. 33, no. 6, pp. 6510–6521, 2018.
[33] S. Fliscounakis, P. Panciatici, F. Capitanescu, and L. Wehenkel, “Contingency ranking with respect to overloads in very large power systems taking into account uncertainty, preventive, and corrective actions,” IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 4909–4917, 2013.
[34] C. Josz, S. Fliscounakis, J. Maeght, and P. Panciatici, “AC power flow data in MATPOWER and QCQP format: iTesla, RTE snapshots, and PEGASE,” arXiv preprint arXiv:1603.01533, 2016.
[35] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016.
[36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[37] R. Yang, H. Xu, Y. Wu, and X. Wang, “Multi-task rein-
forcement learning with soft modularization,” in Annual
Conference on Neural Information Processing Systems,
2020.
[38] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, 2011.
[39] J. Weng, H. Chen, D. Yan, K. You, A. Duburcq, M. Zhang, H. Su, and J. Zhu, “Tianshou: A highly modularized deep reinforcement learning library,” arXiv preprint arXiv:2107.14171, 2021.
The identification of a tie-line power transmission region is crucial for the secure operation of interconnected power grids. The characterization of a tie-line power transmission region with regional operational constraints is inherently a multi-parametric programming problem, which greatly suffers from a heavy computational burden. In this paper, we propose a fast boundary search approach to efficiently characterize the tie-line power transmission region in AC/DC hybrid grids. The multi-parametric programming method is improved in the following three aspects: (1) the search direction is fixed along the boundary of a region, which avoids the search inside a region; (2) the abundant search of subregions that share a boundary facet is avoided; and (3) a fast guess of the Karush-Kuhn-Tucker (KKT) condition is presented, which reduces the number of optimization calculations. Furthermore, an expansion strategy is presented to enlarge the tie-line power transmission region via quantifying the impact of system parameters on the transmission region. Simulations based on the IEEE 118-bus test system and a 661-bus utility system show that (1) the proposed method can be over 30 times faster than existing methods and (2) the proposed method successfully expands the tie-line power transmission region by identifying the key parameters.