Smart Grid Optimization by Deep Reinforcement Learning over
Discrete and Continuous Action Space
Tomohiro Hirata1, Dinesh Bahadur Malla2, Katsuyoshi Sakamoto3, Koichi Yamaguchi3, Yoshitaka Okada1, Tomah Sogabe1,2,3
1Research Center for Advanced Science and Technology, The University of Tokyo, 153-8904, Japan
2Technology Solution Group, Grid Inc., Kita Aoyama, Minato-ku, Tokyo, 107-0061, Japan
3Info-Powered Energy System Research Center, Department of Engineering Science,
The University of Electro-Communications, Chofu, Tokyo, 182-8585, Japan
Abstract- In this work, we have applied two deep reinforcement learning (DRL) algorithms designed for discrete and continuous action spaces, respectively. These algorithms were embedded in a rigorous physical model using Simscape Power Systems™ (MATLAB/Simulink™ environment) for smart grid optimization. Benchmark tests were conducted by comparing the results from MILP (mixed-integer linear programming) and DRL. The results show that the agent successfully captured the energy demand and supply features in the training data and learned to choose behavior that maximizes its profit.
Keywords—deep reinforcement learning, smart grid,
optimization
I INTRODUCTION
Energy grid systems containing renewable energy resources (RES) such as photovoltaic energy, wind power and hydropower have been considered as an alternative power supply configuration. They are renovating conventional grid systems, aiming at reducing CO2 emissions while mitigating global warming. A decentralized energy system is also more robust and resilient against unexpected natural disasters, which frequently occur in countries such as Japan. However, due to the intermittent nature of RES, a mismatch between electricity supply and demand is often encountered, causing instability and limiting power output. As an effective approach to these challenges, the smart grid has been proposed and has shown great technological innovation towards an intelligent, robust and functional power grid [1][2].
Smart grids involve energy transmission among different sub-smart-grid utilities, which ultimately contributes to an efficient energy management ecosystem of energy storage, energy supply and balanced load demand over a large-scale grid configuration. The construction of an efficient smart grid system is in principle a mathematical control optimization problem. A wide range of methods has been proposed to tackle this challenge, including linear and dynamic programming as well as heuristic methods such as PSO, GA, and game or fuzzy theory [3]. In recent years, studies on energy optimization in smart grids have gradually shifted to agent-based machine learning methods represented by state-of-the-art deep learning and deep reinforcement learning. In particular, deep neural network based reinforcement learning methods are emerging and gaining popularity for smart grid applications [4][5].
In this work, we focus on the following issues and tasks:
(1) Different from previous reports, we have developed our deep reinforcement learning algorithms embedded in a rigorous physical model using Simscape Power Systems™ for smart power grid optimization. All the parameters used in the smart grid represent realistic electric circuits, so detailed fluctuations of voltage, frequency and phase can be fully revealed; such information is not available in previous reports, where the constructed smart grid systems could not output it.
(2) For RL, a model-free off-policy deep Q-learning algorithm was developed using MATLAB™. Actor-critic and DQN are suited to addressing a continuous state space and a discrete action space, respectively. Here we focus on the discrete action control designed for switching the grid power supply/sell and battery charge/discharge.
(3) For continuous state and continuous action spaces, we have developed our own H-DDPG (hybrid deep deterministic policy gradient) algorithm, which hybridizes the latest deep deterministic policy gradient with the deep actor-critic stochastic policy gradient.
II ALGORITHM AND MODEL
Fig. 1: Sketch of smart grid optimization using a deep neural network based reinforcement learning algorithm.
(i) Deep Q-Learning (DQN): A general model describing the main framework is given in Fig. 1. In this sketch, we adopt the deep Q-learning algorithm as an example to illustrate the learning principle: a physical model of the smart grid simulation environment was constructed based on Simscape Power Systems™. The state space is always continuous, while the action space is set to be discrete for off-policy Q-learning and continuous for the deep policy gradient algorithm. A detailed operation flow is given below in the form of pseudo-simulation code.
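The following is a minimal Python sketch of this operation flow, not the authors' actual code: the implementation in the paper is written in MATLAB/Simulink, and the `SmartGridEnv`-style `env` wrapper, its `reset`/`step` interface and all hyperparameters here are illustrative assumptions.

```python
# Sketch of the off-policy DQN loop over the discrete actions described in
# the text (grid buy/sell, battery charge/discharge). `env` is a hypothetical
# gym-style wrapper around the Simscape Power Systems model.
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State-action value approximator: four tanh hidden layers (Sec. II-iii)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        layers, dim = [], state_dim
        for _ in range(4):
            layers += [nn.Linear(dim, hidden), nn.Tanh()]
            dim = hidden
        layers.append(nn.Linear(dim, n_actions))
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return self.net(s)

def train_dqn(env, episodes=500, gamma=0.99, eps=1.0, eps_decay=0.995, batch=32):
    q = QNetwork(env.state_dim, env.n_actions)
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)
    buffer = deque(maxlen=10_000)                      # experience replay memory
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the discrete action set
            if random.random() < eps:
                a = random.randrange(env.n_actions)
            else:
                a = int(q(torch.as_tensor(s, dtype=torch.float32)).argmax())
            s2, r, done = env.step(a)                  # one step of the physical model
            buffer.append((s, a, r, s2, float(done)))
            s = s2
            if len(buffer) >= batch:                   # off-policy TD update
                S, A, R, S2, D = map(np.array, zip(*random.sample(buffer, batch)))
                S = torch.as_tensor(S, dtype=torch.float32)
                S2 = torch.as_tensor(S2, dtype=torch.float32)
                target = torch.as_tensor(R, dtype=torch.float32) + \
                         gamma * (1 - torch.as_tensor(D, dtype=torch.float32)) * \
                         q(S2).max(1).values.detach()
                pred = q(S).gather(1, torch.as_tensor(A).unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(pred, target)
                opt.zero_grad(); loss.backward(); opt.step()
        eps *= eps_decay                               # anneal exploration over episodes
    return q
```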
(ii) Hybrid Deep Deterministic Policy Gradient (H-DDPG): A deterministic policy is in theory efficient at the late stage of simulation, because by then the policy distribution has low variance and is nearly deterministic. The policy gradient is usually formulated as follows, where $J$ is the policy objective function, $\theta$ is the function approximation parameter (in a neural network, the weights $w$), $s$ and $a$ correspond to the state and action, $Q^{\pi}(s,a)$ is the state-action value function under the policy, and $\pi(a|s;\theta)$ is the policy distribution function:

$$\nabla_{\theta} J(\theta) = \mathbb{E}\left[ Q^{\pi}(s,a)\,\nabla_{\theta} \log \pi(a|s;\theta) \right] \qquad (1)$$
Silver et al. [6] have shown that if the policy is treated as deterministic, the above equation can be reformed as:

$$\nabla_{\theta} J = \mathbb{E}\left[ \nabla_{\theta}\, Q(s,a|\theta^{Q}) \right] \qquad (2)$$

If the action $a$ is given by the deterministic policy function

$$a = \mu(s|\theta^{\mu}) \qquad (3)$$

then, using the chain rule, $\nabla_{\theta} Q(s,a)$ can be further expanded as

$$\nabla_{a} Q(s,a|\theta^{Q})\, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu}) \qquad (4)$$

and the policy parameter $\theta^{\mu}$ is updated by the usual gradient step (ascent on $J$):

$$\theta^{\mu} \leftarrow \theta^{\mu} + \alpha\, \nabla_{a} Q(s,a|\theta^{Q})\, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu}) \qquad (5)$$
However, implementing the deterministic policy at the early simulation stage inevitably causes high variance and slow convergence, because the policy is still far from the optimal one, so the policy distribution is fairly stochastic rather than deterministic and carries high bias. The hybridized algorithm is designed so that the advantages of both the deterministic and the stochastic policy are assimilated; a stable learning profile with fast convergence can thus be achieved.
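The paper does not specify the exact hybridization rule, so the sketch below is one plausible reading: a single actor loss whose gradient interpolates between the stochastic policy gradient of Eq. (1) and the deterministic gradient of Eqs. (2)-(5), with a blend weight `beta` annealed from 1 toward 0 as training proceeds. The `actor`/`critic` call signatures are assumptions.

```python
# Hedged sketch of a hybrid stochastic/deterministic actor update.
import torch

def hybrid_actor_loss(actor, critic, states, beta, noise_std=0.1):
    """Loss whose gradient mixes the stochastic (Eq. 1) and deterministic
    (Eqs. 2-5) policy gradients; anneal beta from 1 to 0 during training."""
    mu = actor(states)                                 # a = mu(s | theta_mu), Eq. (3)

    # Deterministic term: maximize Q(s, mu(s)); autograd applies the chain
    # rule grad_a Q * grad_theta mu of Eq. (4) automatically.
    det_loss = -critic(states, mu).mean()

    # Stochastic term: REINFORCE-style surrogate for Eq. (1) with a Gaussian
    # policy centered on mu; Q acts as the (detached) weight.
    dist = torch.distributions.Normal(mu, noise_std)
    a = dist.sample()
    sto_loss = -(dist.log_prob(a).sum(-1) * critic(states, a).detach().squeeze(-1)).mean()

    return beta * sto_loss + (1.0 - beta) * det_loss
```

Minimizing this loss with a standard optimizer performs the gradient-ascent update of Eq. (5) for the deterministic component.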
(iii) Neural Network Model: In this work, we use a multilayer neural network with four hidden layers to approximate the state-action value function. The activation function is fixed to the hyperbolic tangent. To enhance exploration, an epsilon-greedy algorithm is utilized in the case of DQN with discrete actions, while the re-parameterization trick is used with H-DDPG for the continuous action space.
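For the continuous case, the re-parameterization mentioned above can be sketched as follows; the Gaussian form and the layer width are assumptions for illustration, since the text fixes only the depth (four hidden layers) and the tanh activation.

```python
# Re-parameterized Gaussian actor: a = mu(s) + sigma * eps with eps ~ N(0, 1),
# so the sampled action stays differentiable w.r.t. the network parameters.
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        blocks, dim = [], state_dim
        for _ in range(4):                    # four tanh hidden layers (as in the text)
            blocks += [nn.Linear(dim, hidden), nn.Tanh()]
            dim = hidden
        self.body = nn.Sequential(*blocks)
        self.mu = nn.Linear(hidden, action_dim)
        self.log_sigma = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s):
        mu = self.mu(self.body(s))
        eps = torch.randn_like(mu)            # noise is sampled outside the graph
        return mu + self.log_sigma.exp() * eps
```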
(iv) MILP and DRL Model: In this work we performed a benchmark test and compared the results from the MILP and DRL algorithms. Both MILP and DRL were run on the same input data, including the solar power generation and electric consumption profiles as well as the purchase/sell prices for electricity. The soft constraints for battery charge/discharge were also set identically for both methods. The inter-conversion principle between the two methods is given in Fig. 2. For MILP we divide the constraints into two parts: soft constraints and hard constraints. The soft constraints are turned into rewards for the neural network learning process, whereas the hard constraints are difficult to learn, so we use them to terminate and restart the learning process, as sketched below.
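A minimal sketch of this conversion, with hypothetical names and limits (battery state of charge `soc`, a preferred soft band, and hard physical bounds); the actual constraint sets and penalty values used in the paper are not specified here.

```python
# Soft constraint -> reward penalty; hard constraint -> episode termination,
# after which the learning process is restarted (Fig. 2).
def shape_reward_and_done(profit, soc, soc_min=0.0, soc_max=1.0,
                          soft_low=0.2, soft_high=0.9, penalty=10.0):
    """Combine the trading profit with constraint handling for one step."""
    reward = profit
    if soc < soft_low or soc > soft_high:      # soft constraint violated
        reward -= penalty
    done = soc <= soc_min or soc >= soc_max    # hard constraint violated
    return reward, done
```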
III RESULTS
Here we present one representative simulation result, obtained by employing the DQN algorithm to optimize over a discrete action space. We mainly deploy the DRL (DQN) agent to maximize earnings, for comparison with MILP optimization, and we also use DRL to maintain the balance among different power sources.
There are many optimization methods for different types of problems; among them, MILP is a popular tool in the MATLAB environment, so we compared our results against the MATLAB-based energy hub optimization to assess their reliability. In Fig. 3, the upper graphs show the buying and selling schedule optimized by the MATLAB-based MILP. The optimized one-day profit, combining revenue from selling PV-produced power and the reduced cost of electricity bought from the power producer, is 74 yen. The lower two graphs show the results of our DRL agent; the optimal result obtained from DRL is 78 yen per day on the same basis. Comparing these results, the DQN agent is good enough to solve this power system optimization problem. Moreover, with a reinforcement learning agent we obtain not only an optimized result but also different options for solving the problem, options that are not available from other optimization tools.
The optimization tools in MATLAB deliver quite accurate results but fail when applied to large-scale systems: the MILP results calculated over 10 days of input data deviated greatly from the theoretical solution. In contrast, this is where DRL has its greatest advantage over MILP. The machine learning based DRL method learns the features of the system from big data and generalizes them using a neural network. The agent successfully learned to discharge its battery during the daytime instead of buying electricity from the grid, and also learned to purchase at low-price periods. We also compared the DRL agent and MILP optimization results on different PV production patterns and different selling/buying rates; we used the same DRL reward system for all comparison runs, and most of the time DRL obtained better results than the optimization tools.
Fig. 3. Buying and selling schedules as optimized by the MILP tool (upper graphs) and as learned by the agent trained with the DQN algorithm (lower graphs).
Fig. 2. Conversion between the MILP and DRL implementations.
The DRL agent is also able to maintain the balance among power sources for demand stability. In such problems there are many candidate power sources to fulfill the power demand, and a reinforcement learning agent can help maintain the supply-demand balance. In the MATLAB environment we created virtual power resources as well as power demand; with the help of the power source and demand data, the DRL agent was able to keep power demand and supply in balance.
Building on the above work and results, we are constructing a large-scale virtual power network comprising demand, grid power supply and power purchasers, together with many other electricity producers (PV, turbines, wind farms, CHP, etc.) and heat producers (gas boilers, heat tanks, etc.). This plan requires a good DRL algorithm as well as multiple agents. A simple diagram of the planned concept is shown in Fig. 5; more detailed design principles and preliminary results from trial experiments will be given at the conference.
IV CONCLUSION
We present here a deep reinforcement learning method applied to smart grid optimization. From the preliminary simulation results, the agent was able to capture the features involved in the balance of load demand, PV power surplus, battery charge/discharge and grid integration. The agent successfully learned how to tune its action profile to maximize the reward function during training. More detailed results regarding the comparison between DQN and H-DDPG, and the key role played by the reward function, will be given at the conference.
REFERENCES
[1] R. H. Khan and J. Y. Khan, "A comprehensive review of the application characteristics and traffic requirements of a smart grid communications network," Computer Networks, vol. 57, no. 3, pp. 825-845, 2013.
[2] H. E. Brown, S. Suryanarayanan, and G. T. Heydt, "Some characteristics of emerging distribution systems considering the smart grid initiative," The Electricity Journal, vol. 23, no. 5, pp. 64-75, 2010.
[3] M. R. Alam, M. St-Hilaire, and T. Kunz, "Computational methods for residential energy cost optimization in smart grids: A survey," ACM Comput. Surv., vol. 49, pp. 22-34, Apr. 2016.
[4] E. Mocanu, P. H. Nguyen, M. Gibescu, and W. L. Kling, "Deep learning for estimating building energy consumption," Sustainable Energy, Grids and Networks, vol. 6, pp. 91-99, 2016.
[5] V. François-Lavet, Q. Gemine, D. Ernst, and R. Fonteneau, "Towards the minimization of the levelized energy costs of microgrids using both long-term and short-term storage devices," Smart Grid: Networking, Data Management, and Business Models, pp. 295-319, 2016.
[6] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," ICML, 2014.
Fig. 4. The model for power balance in MATLAB (components: battery, PV, demand, grid; actions: sell, buy).
Fig. 5. Sketch of the large-scale virtual power plant using DRL. A market hub agent trades with several technology agents (heat pump, gas boiler, CHP units, PV, solar thermal, building, batteries, hot water tanks) through an energy utility hub handling energy conversion and storage. Each agent carries its own actor and critic neural networks; the market hub agent issues electricity/heat request quantities and prices (ERQ, ERP, HRQ, HRP), the technology agents answer with offer quantities and prices (EOQ, EOP, HOQ, HOP), and each is rewarded on Profit = Income − Expenditure, with Income = Eproduced × Eoffer price + Hproduced × Hoffer price and Expenditure = Eproduced × Ecost + Hproduced × Hcost.