© 2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Appeared as: D. Schwung, T. Kempe, A. Schwung, S. X. Ding: "Self-optimization of energy consumption in complex bulk good processes using reinforcement learning"; Proc. of the 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), 2017
DOI: 10.1109/INDIN.2017.8104776
Self-optimization of energy consumption in complex bulk good
processes using reinforcement learning
Dorothea Schwung∗, Tim Kempe∗, Andreas Schwung∗, Steven X. Ding†
∗South Westfalia University of Applied Sciences, Soest, Germany
{schwung.dorothea, kempe.tim, schwung.andreas}@fh-swf.de
†University of Duisburg-Essen, Duisburg, Germany
steven.ding@uni-due.de
Abstract—This paper presents a novel approach to the optimization of energy consumption in large-scale industrial bulk good processes. The approach relies on a model-free self-learning algorithm that uses only the available process data and builds on ideas from the well-known reinforcement learning framework. To this end, the energy consumers of the plant are integrated into the optimization framework such that each consumer learns its own optimal energy profile for a given production task. The approach is implemented on a laboratory-size testbed where the task is the supply of bulk good to a subsequent dosing section. The capability of the approach is underlined by the results obtained at the testbed.
I. INTRODUCTION
Environmental requirements as well as increasing energy costs demand a more energy-efficient operation of industrial production facilities. Although the energy consumption per processed product has been reduced in recent years, further efforts are required, in particular to reduce the emission of detrimental greenhouse gases. In parallel, the energy costs for production facilities are continuously increasing, forcing industry to reduce energy consumption in order to stay competitive.
This problem becomes even more severe in industrial sectors
like steel making industry, process industry and basic materials
industry, to name a few, where the energy consumption is
extremely high. Here, even minor percentage savings of energy in the production process can lead to both significantly reduced emissions and considerable cost reductions [1], [2]. Hence, an
effective and sustainable reduction of energy consumption
especially in these industrial sectors is necessary. To this end
the energy consumption of such plants needs to be analyzed
in detail and possible savings need to be detected [3]. These
ideas are supported by recent improvements in automation technology, where energy consumption can be continuously monitored with high resolution using intelligent PLC modules, allowing a comprehensive overview of the plant status to be gained. Such energy data can subsequently be used for further analysis, e.g. by cloud-based services, to optimize the overall efficiency.
However, industrial production plants in the basic materials
industry are in general large-scale plants with a very high
number of energy consumers [4], [5], [6]. Furthermore, such
plants are normally highly distributed which poses additional
challenges to an energy optimization approach. Particularly,
connections between different production areas as well as
constraints have to be considered as the reduction of energy
consumption in one part of the plant can cause increased
consumption in other parts. In the literature, different ap-
proaches for energy management in the manufacturing and
process industry have been introduced. Load demand schedul-
ing has been presented in [7], [8] while [9] presents a layered
approach for energy optimization based on service oriented
architectures. Approaches based on dynamic programming are
introduced in [10], [11]. In [12] a data-driven system is
developed for energy management.
In this paper we present a novel approach for energy manage-
ment and optimization which is based on the ideas of rein-
forcement learning (RL) [13]. RL is essentially a learning-based control scheme which learns an optimal policy by direct interaction with the environment. RL has already been applied
to energy management systems with special emphasis on
distributed smart grids and microgrids [14], [15] and on
electric vehicles [16], [17]. In contrast, the application of RL to
energy management in the manufacturing and process industry
domain with its inherent challenges has not been presented to
date.
In our work we assume a production process with a fully
distributed basic control system architecture, i.e. each pro-
duction process is controlled by its own control system.
However, we assume that the energy and production data
are communicated to a centralized PC-system where the RL-
algorithms are implemented. The control actions are then
communicated backwards to the PLCs. For the definition of
the RL-problem we need to consider both the operational requirements given by the production process and its energy consumption. In general, these are conflicting requirements, such that the learning process needs to find a trade-off between production time and quality on the one hand and energy consumption on the other. In our approach, production
requirements and goals are incorporated in two different ways.
First, reward functions representing the production quality and
time are developed and combined with the reward functions
for the energy consumption ending up in a multiobjective
RL-problem. Second, necessary constraints for the production
process, particularly safety constraints, are implemented in
the basic control system. Hence, safety constraints cannot be violated by the inherent trial-and-error of reinforcement learning; instead, the learning process is forced to learn the process constraints. The proposed approach is finally applied to a
laboratory scale test environment. The results obtained show
the effectiveness of the proposed approach.
The paper is organized as follows. In Section II we state the
learning problem for energy management in process plants.
Section III gives an introduction to reinforcement learning and presents the application of RL to energy optimization. In Section IV
we present the results of the proposed RL framework to a
laboratory bulk good system. Section V gives the conclusions
and an outlook to future work.
II. PROBLEM STATEMENT
In this section we state the energy optimization problem in
large-scale production environments and define its goals and
challenges. The considered general structure of the production
process is illustrated in Fig. 1. As can be seen, we consider distributed production processes where different subsystems interact with each other to form the overall system.

Fig. 1: Structure of the considered distributed production process.

Each of the subsystems is assumed to have its own control system and suitable sensing and monitoring devices to measure its production performance, e.g. required mass flows. To
execute its task each subsystem contains a certain number of
energy consumers like electrical drives, valves, compressors
etc. The considered energy consumers are assumed to have either discrete behavior (e.g. DOL motors or on-off valves), continuous behavior (e.g. VSD drives), or hybrid behavior (e.g. the considered vacuum pumps). Note that we generally
consider not only the consumption of electrical energy but
also other energy sources like e.g. instrument air. The energy
consumption of the consumers is assumed to be continuously
measured by suitable energy metering devices in the PLC
and then communicated to a centralized control system for
monitoring and analysis.
After describing the considered system setup, we will now
state the considered problem:
Given the distributed system $S$ with $l$ subsystems $S_i$ as illustrated in Fig. 1, where each subsystem contains $n$ energy consumers $E_{i,j}$ and corresponding energy measurements $e_{i,j}(t)$, $i = 1,\ldots,l$, $j = 1,\ldots,n$. Then, find the optimal energy consumption

$$e^*(t) = \sum_{t=0}^{T} \sum_{i,j} e_{i,j}(t) \qquad (1)$$

over a given production episode $t = 0,\ldots,T$ while simultaneously meeting given quality parameters $c_i(t)$ within this episode.
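As a minimal illustration of the objective in Eq. (1), the following Python sketch sums the sampled energy measurements of all consumers over one production episode; the nested-list data layout (indexed by sample, subsystem and consumer) is an assumption made for illustration, not the plant's actual data interface.

```python
def episode_energy(e):
    """Total energy of Eq. (1): e[t][i][j] is the measurement e_ij(t) at sample t."""
    return sum(sum(sum(consumers) for consumers in sample) for sample in e)

# usage: episode_energy(e_samples) for one episode t = 0..T
```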
Some remarks on the previously defined problem are in order. The quality parameters $c_i(t)$ can be arbitrarily defined based on the given process objectives and available process measurements. Examples include product concentrations in
chemical plants, mass flows in bulk good plants or processing
times in manufacturing plants. The length of the considered
production episode is closely related to these requirements as
quality parameters can only be accessed after some processing
time. Hence, the episode should be at least as long as the
processing times. The number of samples per episode should
be determined such that the important dynamics of the en-
ergy consumption and the process parameters are represented.
Particularly, the processing times and operation points of
actuators strongly influence the energy consumption of the
overall system and need to be carefully examined.
Note that the problem description mainly focuses on discrete
and hybrid processes where operations and system behavior
are not solely continuous. Such hybrid systems containing
discrete and continuous dynamics and actuation are quite
common in the process and basic materials industries due to
discontinuous and delaying components like buffers, reactors
or conveyors as well as on-off actuators and actuators with
discontinuous actions. We will introduce an example of such
a process in Sec. IV.
III. ENERGY OPTIMIZATION USING
REINFORCEMENT LEARNING
After stating the problem we will now introduce our rein-
forcement learning framework for the energy optimization of
process plants. We start with a short introduction to RL and then apply RL to the energy management system.
A. Introduction to reinforcement learning
RL is basically a learning technique where an agent acts on its environment and learns from rewards gained from these interactions. To this end, the environment is formulated as a Markov decision process (MDP). When the agent takes an action, it receives feedback from the environment in the form of a reward, which is the outcome of the action. If the interaction is positive, the reward will be positive and vice versa. The main objective of the agent is to maximize the expected cumulative reward over a given time horizon called an episode. Formally, the MDP is described by the tuple $(S, A, P, R, p_0)$ with:
• the set of states $S$,
• the set of actions $A$,
• the transition model $P : S \times A \times S \to [0, 1]$, such that $P(s_t, a_t, s_{t+1}) = p(s_{t+1} \mid s_t, a_t)$ is the transition probability from state $s_t$ to the following state $s_{t+1}$ by applying action $a_t$,
• the reward function $R : S \times A \times S \to \mathbb{R}$, which assigns a reward $r(s_t, a_t, s_{t+1})$ to each state transition $s_t \to s_{t+1}$,
• an initialization probability $p_0$ of the states.
Obviously, the RL-agent interacts with the environment in discrete time steps. At each time step, an action $a_t$ is taken, forcing the system to enter state $s_{t+1}$. At the same time the reward $r_t$ associated with the transition $(s_t, a_t, s_{t+1})$ is observed by the agent. The way the agent chooses its actions based on the current state is called the agent's policy $\pi : S \to A$. The policy can be chosen based on the agent's past experiences or even randomly. The goal of the reinforcement learning problem is to find the optimal policy by interacting with the environment, i.e. the policy which results in the highest possible cumulated reward. Hence, we want to maximize the return

$$\rho^\pi = E[R \mid \pi] \qquad (2)$$

with the discounted reward

$$R = \sum_t \gamma^t r_t, \qquad (3)$$

where $0 \le \gamma \le 1$ is the discount factor.
For solving the above RL-problem, the well-known value function approach can be applied. In particular, Q-learning is well suited, which is based on the state-action value function defined as

$$Q^\pi(a, s) = E\Big[\sum_t \gamma^t r_t \,\Big|\, s_0, a_0, \pi\Big]. \qquad (4)$$

The state-action value function determines the expected reward to be gained when starting in state $s$, taking action $a$ and then following policy $\pi$. As we want to find the policy with the optimal reward, we are interested in finding the optimal action in the current state, which can be determined using the Bellman equation

$$Q^*(a_t, s_t) = r_t(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_t, a_t, s_{t+1}) \max_{a_{t+1}} Q^*(a_{t+1}, s_{t+1}) \qquad (5)$$

resulting in the optimal policy

$$\pi^*(s_t) = \arg\max_a Q^*(a_t, s_t). \qquad (6)$$
However, in most real-world applications, the transition probabilities as well as the reward allocation are hardly predictable or even unknown. Hence, the state-action value functions cannot be determined directly either. To address this issue, model-based approaches, where the transition probabilities and rewards are learned by interacting with the environment, and model-free approaches like temporal difference (TD) learning can be distinguished [13]. In TD learning, the value of the state or state-action pair is continuously updated based on the temporal difference between the Q-values of the current state $s_t$ and the subsequent state $s_{t+1}$ and the reward $r_t(a_t, s_t)$ gained in this transition step as follows:

$$Q_{t+1}(a_t, s_t) = Q_t(a_t, s_t) + \alpha_t \Big( r_t + \gamma \max_{a_{t+1}} Q_t(a_{t+1}, s_{t+1}) - Q_t(a_t, s_t) \Big) \qquad (7)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. The subsequent action $a_{t+1}$ is normally chosen based on an $\epsilon$-greedy procedure [13], i.e. the agent chooses the optimal policy using (6) with probability $1-\epsilon$ and a random policy with probability $\epsilon$ to sufficiently balance between exploitation and exploration.
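To make the above concrete, the following is a minimal Python sketch of tabular Q-learning with the TD update of Eq. (7) and $\epsilon$-greedy action selection. The environment interface (reset/step), the state and action encodings and the parameter values are assumptions for illustration; they do not represent the PLC implementation described in Sec. IV.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one (Eq. 6)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, eps=0.2):
    """Run one production episode and update Q with the TD rule (Eq. 7)."""
    state = env.reset()                                  # e.g. joint actuator state at episode start
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions, eps)
        next_state, reward, done = env.step(action)      # hypothetical environment interface
        td_target = reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state = next_state
    return Q

# usage: Q = defaultdict(float); q_learning_episode(my_env, Q, my_actions)
```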
B. Application to energy optimization
After the introduction to RL, we will now formulate our RL framework for energy optimization for the presented
system setup illustrated in Fig. 1. To this end, we need to
define the MDP for the energy optimization problem, i.e. we
need to define the system states, the set of possible actions
and the rewards to be gained.
As we are interested in energy optimization, the states of the MDP mainly need to represent the energy flows in the system. To this end, we assign a set of states $S_{i,j}$ to each energy consumer $E_{i,j}$ of the $i$th subsystem. A typical example of such a state set is $S_{i,j} = \{\text{off}, \text{on}\}$ for simple actuators. However, more than two states can be defined, e.g. for continuously operated actuators. The allocation of discrete states for continuously operated actuators is best chosen by considering the characteristic points of the energy profile like minimum and maximum values. This will be further evaluated for the example process below. Note that this definition of the states directly impacts the energy flows in the system, as typically different energy consumptions are associated with each state. Furthermore, it indirectly affects the quality parameters of the process, as the actuators typically control the production process. The resulting set of states of subsystem $i$ then yields $S_i = S_{i,1} \times \ldots \times S_{i,n}$.
The definition of the action space $A_{i,j}$ is done likewise. In this case just two or three actions are assigned to each energy consumer. For simple on-off actuators, two actions representing turn-on and turn-off are defined. Continuously operating actuators are represented by the actions turn up, turn down and no change.
Note that in terms of the RL framework, continuously operating actuators can also be modelled by a continuous state and action space, e.g. by using Gaussian basis functions [18]. However, for the sake of simplicity and a PLC-ready implementation we restrict ourselves to a discrete set of states and actions.
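The following sketch illustrates, under assumed speed levels and action names, how a continuously operated actuator could be discretized into states at characteristic points of its energy profile and driven by turn-up/turn-down/no-change actions.

```python
# Illustrative discretization of one continuously operated actuator (assumed levels)
# and of one simple on-off consumer; names and levels are placeholders.
CONVEYOR_STATES = ["off", "low", "medium", "max"]   # characteristic energy-profile points
CONVEYOR_ACTIONS = ["down", "no_change", "up"]
PUMP_STATES = ["off", "on"]
PUMP_ACTIONS = ["switch_off", "switch_on"]

def apply_conveyor_action(state, action):
    """Move one step along the discrete speed levels; saturate at the ends."""
    idx = CONVEYOR_STATES.index(state)
    if action == "up":
        idx = min(idx + 1, len(CONVEYOR_STATES) - 1)
    elif action == "down":
        idx = max(idx - 1, 0)
    return CONVEYOR_STATES[idx]
```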
In addition to the state and action space, the rewards for each
state transition have to be defined. The appropriate definition
of rewards is of major importance as the energy optimization
problem has to fulfill different partly counteracting objectives.
More precisely, the energy flows need to be optimized under
the constraints that certain quality parameters like processing
times or product quality as well as requirements for process
safety are met. Consequently, the reward has to reflect and
balance these different requirements and needs to be carefully
chosen.
To cover the energy consumption in the reward function, we could simply use the measured energy consumption during the sample period $e_{i,j}(t)$. However, this would result in poor production performance. To avoid this, we normalize the energy consumption with respect to the actual quality parameters $c_i(t)$ within the given subsystem and define the reward function for the $i$th subsystem as

$$r_i(s, a) = \frac{\sum_j e_{i,j}(t)}{c_i(t)}. \qquad (8)$$

Note that the term $c_i(t)$ can include different quality parameters.
Additionally, hard process constraints, which are normally implemented on the basic control level, need to be considered. Their integration into the rewards is essential to the learning process, as can be seen by considering a simple overflow sensor which locks a valve in the closed position. Obviously, the energy consumption of the actuator will be zero whether it is commanded open or closed, due to the safety function locking it closed. Hence, a reward not considering this would indicate no difference although the open command is incorrect. To avoid this, we penalize each violation of such hard process constraints by a corresponding penalty value $p_i(t)$. Hence, the final reward for each subsystem yields

$$r_i(s, a) = \frac{\sum_j e_{i,j}(t)}{c_i(t)} + p_i(t). \qquad (9)$$

The penalties need to be balanced carefully to allow for sufficient steering of the learning process away from inappropriate actions on the one hand, while not underweighting the normalized energy consumption on the other hand.
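A minimal sketch of how the reward of Eq. (9) could be computed per subsystem is given below; the variable names, the penalty magnitude and the constraint check are illustrative assumptions, and the balancing of the penalty against the normalized energy term follows the discussion above.

```python
def subsystem_reward(energy_readings, quality, constraint_violated,
                     penalty=-10.0, eps=1e-6):
    """Reward of Eq. (9): energy normalized by the quality parameter plus a penalty term.

    energy_readings     -- list of e_ij(t) for the consumers of this subsystem
    quality             -- quality parameter c_i(t), e.g. conveyed mass in this sample period
    constraint_violated -- True if a hard process constraint (e.g. overflow interlock) was hit
    penalty             -- assumed penalty value p_i(t); must be balanced against the energy term
    """
    normalized_energy = sum(energy_readings) / max(quality, eps)  # avoid division by zero
    p = penalty if constraint_violated else 0.0
    return normalized_energy + p
```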
IV. APPLICATION EXAMPLE
After introducing the general approach to the energy optimiza-
tion we apply the presented approach to a laboratory scale
test facility. The task of the plant is to process bulk good as
a supply for a subsequent dosing unit. After presenting the
testbed we will formulate the RL-problem.
A. Laboratory bulk good system
We start with a short explanation of the laboratory scale testbed
which is illustrated in Fig. 2. As shown the testbed consists
of four modules for processing bulk good material. Modules
1 and 2 represent typical supply, buffer and transportation
stations. Module 1 consists of a container and a continuously
controlled belt conveyor from which the bulk good is carried
to a mini hopper which is the interface to module 2. Module 2
consists of a vacuum pump and a buffer container. The vacuum
pump itself transports the material from module 1 into an
internal container. The material is then released to the buffer
container by a pneumatically actuated flap and then charged
into a mini hopper which is the interface to the next module.
Module 3 is the dosing station, consisting of a second vacuum
pump with similar setup as in module 2 and the dosing unit.
This unit consists of a buffer container with a weighing system
and a rotary feeder. The dosed material is finally transported to
module 4 using a third vacuum pump and filled into transport
containers.

Fig. 2: Schematic of the laboratory scale testbed.

Each of the four modules is equipped with its own PLC-based control system using a Siemens ET200SP. The four
control systems communicate with each other via Profinet. In
addition, each module has a set of sensors to monitor the
modules state. In particular, each container is equipped with
min/max level sensors and each mini hopper with overflow
sensors. To additionally monitor the energy consumption of
each module, Siemens energy metering modules are installed
at each PLC. Hence, the energy consumption of the actua-
tors, namely the conveyors and pumps, can be continuously
monitored. As the energy consumption of the vacuum pump
is especially influenced by the consumption of instrument air,
we take this additionally into account in the PLC.
Note that the testbed mimics to some extent typical large-scale systems which are modularized into smaller subsystems with their own control systems and suitable communication interfaces. Furthermore, due to the system structure with different buffer containers as well as the inherent discontinuous behavior of the vacuum pump, this process constitutes a typical hybrid system with a mixture of discrete and continuous behavior. Due to this structure, the RL-based energy management system is especially beneficial, as it allows the energy-optimal operation strategy to be mined.
In the following experiments for testing our RL-approach we
will concentrate on the first two modules which are the supply
units for the subsequent dosing station. In particular, we want
to supply the dosing unit with the required amount of material
continuously processed by the dosing station while keeping the
energy consumption of all the actuators in the supply stations
as low as possible. Note that there exists no pre-programmed
sequence of actuator operations in the PLC when starting with
the learning process. However, to assure the safe operation
of the process, some interlocks to avoid buffer overflows are
implemented at the basic PLC level using the available sensor
information described above.
Note that we do not require a constant mass flow into the
buffer at the dosing unit but just a certain mass inflow within
a given time period. However, the inflow should be adjusted to
assure that the bulk good level in the buffer does not fall below
a given threshold to allow for a continuous dosing operation.
This constraint will be represented by a high negative reward
in the RL-problem if the level falls below this threshold.
B. Energy management based on RL
We now apply the general RL-framework for energy optimiza-
tion presented in Sec. III to the bulk good system. To this end,
the states and actions as well as the reward function have to
be defined.
According to Sec. III-B, the states are defined by the actuator
states, i.e. in our testbed, the states of the conveyor and the two vacuum pumps. We define the two states on and off for both vacuum pumps and four states (off, low, medium and maximum speed) for the continuous conveyor. These states are chosen in accordance with the specific energy profile of the conveyor systems. Furthermore, the actions for the vacuum pumps are defined as switch off and switch on, while the actions of the conveyor are speed down, speed up and no change.
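Assuming that the actuator states and actions are combined jointly per time step (the paper does not state the exact encoding), the resulting discrete spaces for the two supply modules can be enumerated as follows, which also fixes the size of the Q-table.

```python
from itertools import product

# Assumed joint state/action spaces for the conveyor and the two vacuum pumps.
conveyor_states = ["off", "low", "medium", "max"]
pump_states = ["off", "on"]
conveyor_actions = ["speed_down", "no_change", "speed_up"]
pump_actions = ["switch_off", "switch_on"]

states = list(product(conveyor_states, pump_states, pump_states))      # 4*2*2 = 16
actions = list(product(conveyor_actions, pump_actions, pump_actions))  # 3*2*2 = 12
print(len(states), len(actions))  # 16 12 -> Q-table with 192 entries
```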
The reward function is defined using Eq. (9), i.e. it consists of the energy consumption normalized by the mass flow and the penalties for violating process constraints. While the actual energy consumption can be
directly obtained from the energy meters, the mass flow is
calculated by using a simple physical model of the mass flow
dynamics in the plant due to the lack of appropriate sensors.
It is worth noting that the vacuum pumps exhibit a specific behavior. After switching on, there is first an evacuation period during which conveying of product is not possible. Then the conveyed product follows a polynomial function until the buffer in the vacuum pump is full, which results in a sudden drop of the mass flow. The RL-algorithm should be able to cope with this specific system behavior.
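Since the mass flow is estimated from a simple physical model rather than measured, the following piecewise sketch illustrates what such a model of one vacuum pump could look like (evacuation phase, polynomial ramp-up, saturation when the internal buffer is full); the functional form and all parameter values are assumptions, not the authors' model.

```python
def pump_mass_flow(t_on, t_evac=2.0, k=0.8, buffer_capacity=5.0):
    """Illustrative piecewise model of the conveyed mass flow of one vacuum pump.

    t_on            -- time since the pump was switched on [s]
    t_evac          -- assumed evacuation period with no conveying [s]
    k               -- assumed conveying coefficient [kg/s^2]
    buffer_capacity -- assumed internal buffer capacity [kg]
    Returns (mass_flow, conveyed_mass).
    """
    if t_on < t_evac:
        return 0.0, 0.0                                    # evacuation phase: no product conveyed
    tau = t_on - t_evac
    conveyed = min(0.5 * k * tau**2, buffer_capacity)      # polynomial ramp-up of conveyed mass
    if conveyed >= buffer_capacity:
        return 0.0, buffer_capacity                        # internal buffer full: flow drops to zero
    return k * tau, conveyed                               # flow is the derivative of conveyed mass
```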
Additionally, we have to account for different process constraints and shutdown conditions. These mainly include overflow sensors at the mini hopper and the different buffer stations. Similarly, after a full surge cycle of the vacuum pump, the material has to be released to the buffer, which is interlocked by an additional overflow sensor. Following Sec. III, these constraints are incorporated using additional penalty factors.
C. Results
In this section we present results of the RL-framework at the
testbed. To this end, the RL-algorithm based on Q-learning is implemented on a Siemens S7 PLC system. The realization
on the PLC is done by means of different function blocks.
The core of the realization is the Q-learning algorithm which
is implemented in SCL in its own function block using an
array-based structure to store the look-up table of the state-
action values during the learning process.
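As an illustration of the array-based look-up table mentioned above, the following Python sketch shows how a flat array indexed by (state, action) can carry the Q-values and the update of Eq. (7); it is not the authors' SCL function block, and the table dimensions are the assumed joint spaces from Sec. IV-B.

```python
# Illustrative flat-array Q-table as it could be laid out in a PLC data block;
# indices and sizes are assumptions (16 joint states x 12 joint actions, see above).
N_STATES, N_ACTIONS = 16, 12
q_table = [0.0] * (N_STATES * N_ACTIONS)

def q_index(state_id, action_id):
    """Map a (state, action) pair to a position in the flat array."""
    return state_id * N_ACTIONS + action_id

def q_update(state_id, action_id, reward, next_state_id, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update (Eq. 7) on the flat array."""
    best_next = max(q_table[q_index(next_state_id, a)] for a in range(N_ACTIONS))
    i = q_index(state_id, action_id)
    q_table[i] += alpha * (reward + gamma * best_next - q_table[i])
```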
During our experiments, the discount factor is chosen as $\gamma = 0.9$ and the learning rate is first fixed at $\alpha = 0.1$ and vanishes for a high number of episodes. The actions are chosen based on an $\epsilon$-greedy strategy such that the first learning steps are solely based on random exploration, turning later to more exploitative learning behavior. The length of a production episode is fixed to 30 s.
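The decaying exploration and learning-rate behavior described above could, for example, be realized with schedules of the following form; the concrete decay constants are assumptions for illustration only.

```python
def schedules(episode, eps_start=1.0, eps_min=0.05, eps_decay=0.99,
              alpha_start=0.1, alpha_decay_after=200):
    """Assumed exploration and learning-rate schedules over episodes."""
    eps = max(eps_min, eps_start * (eps_decay ** episode))      # exploration fades out over episodes
    alpha = alpha_start if episode < alpha_decay_after \
        else alpha_start * alpha_decay_after / episode          # learning rate vanishes for many episodes
    return eps, alpha
```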
In our test scenario we consider a continuous supply of
mass flow through the first and second station to the buffer
at the dosing unit. The goal is to learn an energy optimal
behavior of this part of the system while assuring a sufficient
supply of material for the dosing operation. Fig. 3 shows the
learning results for this scenario in terms of normalized energy
consumption. Note that due to the continuous operation, the
results are obtained from a moving average over the episodes
with a step size of 6 to account for varying conditions in the
buffers. As can be seen, the learning process first enters the
exploration phase with strongly varying energy consumptions.
As the learning continues, the energy efficiency in terms of the reward increases over time and results in performance comparable to strategies implemented by hand. As already stated, the vacuum pumps exhibit a specific dynamic behavior. From the energy-optimal operation point of view, the pumps need to remain switched on until the internal buffer is full and then be switched off. For the machines at hand, this means operation periods of 2-3 steps before the pump is switched off again. Too short operation times waste energy due to the evacuation time, while too long operation times waste energy due to the full buffer. Hence, a corresponding operation strategy has to be learned by the RL-algorithm, which is illustrated in Fig. 4. As can be seen, during learning the duration of the vacuum pump operation enters a region of 2-3 operation steps and hence the RL is able to learn the optimal operation of the vacuum pumps.
Fig. 3: Normalized energy efficiency (reward) over learning
episodes.
Fig. 4: Operation steps of the vacuum pump over learning
episodes.
V. CONCLUSIONS
In this paper we presented a novel approach for energy opti-
mization in large scale distributed production environments.
The approach is based on a formulation of the optimization problem in the form of a reinforcement learning problem.
Hence, the energy consumption of the production process is
minimized via learning from subsequent operation sequences
while maintaining the production quality represented by some
quality functions. The approach is applied to a laboratory
testbed with very promising results. In future research, the
developed RL-approach can be generalized to also account for continuous state and action spaces to better represent the continuous behavior of the production process. Furthermore, as the current approach requires a centralized learning process, a more distributed approach to the RL-problem, potentially incorporating ideas from game theory, will be a topic for future research.
REFERENCES
[1] S. Karnouskos and A. Walter Colombo and J. L. Martinez Lastra and C.
Popescu, “Towards the energy efficient future factory,” 2009 7th IEEE
International Conference on Industrial Informatics, 2009
[2] A. Cannata and S. Karnouskos and M. Taisch, “Energy efficiency driven
process analysis and optimization in discrete manufacturing,” 2009 35th
Annual Conference of IEEE Industrial Electronics, 2009
[3] J. R. Duflou and J. W. Sutherland and D. Dornfeld and C. Herrmann and
J. Jeswiet and S. Kara and M. Hauschild and K. Kellens, “Towards energy
and resource efficient manufacturing: A processes and systems approach,”
CIRP Annals - Manufacturing Technology, 61(2), pp. 587-609, 2012
[4] K. Bunse and M. Vodicka and P. Schönsleben and M. Brülhart and
F. O. Ernst, “Integrating energy efficiency performance in production
management – gap analysis between industrial needs and scientific
literature,” Journal of Cleaner Production, 19(6-7), pp. 667-679, 2011
[5] E. Oh and S.-Y. Son, “Toward dynamic energy management for green
manufacturing systems,” IEEE Communications Magazine, 54(10), pp.
74-79, 2016
[6] K. Tanaka, “Review of policies and measures for energy efficiency in
industry sector,” Energy Policy, 39(10), pp. 6532-6550, 2011
[7] M. Choobineh and S. Mohagheghi, “Optimal Energy Management in
an Industrial Plant Using On-Site Generation and Demand Scheduling,”
IEEE Transactions on Industry Applications, 52(3), pp. 1945-1952, 2016
[8] Y.-C. Li and S. H. Hong, “Real-Time Demand Bidding for Energy
Management in Discrete Manufacturing Facilities,” IEEE Transactions
on Industrial Electronics, 64(1), pp. 739-749, 2017
[9] A. Florea and J. A. Garcia Izaguirre Montemayor and C. Postelnicu and
J. L. Martinez Lastra, “A cross-layer approach to energy management
in manufacturing,” IEEE 10th International Conference on Industrial
Informatics, 2012
[10] C. K. Pang and C. V. Le, “Optimization of Total Energy Consumption
in Flexible Manufacturing Systems Using Weighted P-Timed Petri Nets
and Dynamic Programming,” IEEE Transactions on Automation Science
and Engineering, 11(4), pp. 1083-1096, 2014
[11] C. V. Le and C. K. Pang, “Robust Total Energy Optimization of Flexible
Manufacturing Systems Based on Renyi Mean-Entropy Criterion,” IEEE
Transactions on Automation Science and Engineering, 13(1), pp. 355-367,
2016
[12] Z. Sun and D. Wei and L. Wang and L. Li, “Data driven production run-
time energy control of manufacturing systems,” 2015 IEEE International
Conference on Automation Science and Engineering (CASE), 2015
[13] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduc-
tion,” Cambridge, MA, USA: MIT Press, 1998.
[14] E. Kuznetsova and Y.-F. Li and C. Ruiz and E. Zio and G. Ault and
K. Bell, “Reinforcement learning for microgrid energy management,”
Energy, 59, pp. 133-146, 2013
[15] L. Raju and S. Sankar and R.S. Milton, “Distributed Optimization of
Solar Micro-grid Using Multi Agent Reinforcement Learning,” Procedia
Computer Science, 46, pp. 231-239, 2015
[16] C. Liu and Y. L. Murphey, “Power management for Plug-in Hybrid
Electric Vehicles using Reinforcement Learning with trip information,”
2014 IEEE Transportation Electrification Conference and Expo (ITEC),
2014
[17] T. Liu and Y. Zou and D. Liu and F. Sun, “Reinforcement Learn-
ing–Based Energy Management Strategy for a Hybrid Electric Tracked
Vehicle,” Energies, 8(7), pp. 7243-7260, 2015
[18] R.M. Kretchmar and C.W. Anderson, “Comparison of CMACs and
radial basis functions for local function approximators in reinforcement
learning,” Proceedings of International Conference on Neural Networks,
1997