© 2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Appeared as: D. Schwung, T. Kempe, A. Schwung, S. X. Ding: "Self-optimization of energy consumption in complex bulk good processes using reinforcement learning"; Proc. of the 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), 2017.
DOI: 10.1109/INDIN.2017.8104776
Self-optimization of energy consumption in complex bulk good
processes using reinforcement learning
Dorothea Schwung, Tim Kempe, Andreas Schwung, Steven X. Ding
South Westfalia University of Applied Sciences, Soest, Germany
{schwung.dorothea, kempe.tim, schwung.andreas}@fh-swf.de
University of Duisburg-Essen, Duisburg, Germany
steven.ding@uni-due.de
Abstract—This paper presents a novel approach to the optimization of energy consumption in large-scale industrial bulk good processes. The approach is based on a model-free self-learning algorithm that relies solely on available process data and uses ideas from the well-known reinforcement learning framework. To this end, the energy consumers of the plant are integrated into the optimization framework such that each consumer learns its own optimal energy profile for a given production task. The approach is implemented on a laboratory-size testbed where the task is the supply of bulk good to a subsequent dosing section. The capability of the approach is underlined by the results obtained at the testbed.
I. INTRODUCTION
Environmental requirements as well as increasing energy costs demand a more energy-efficient operation of industrial production facilities. Although the energy consumption per processed product has been reduced in recent years, more efforts are required to reduce in particular the emission of detrimental greenhouse gases. In parallel, the energy costs of production facilities are continuously increasing, forcing industry to reduce energy consumption in order to stay competitive. This problem becomes even more severe in industrial sectors such as steel making, the process industry and the basic materials industry, to name a few, where the energy consumption is extremely high. Here, even minor percentage savings in the production process can lead to both significantly lower emissions and significant cost reductions [1], [2]. Hence, an effective and sustainable reduction of energy consumption, especially in these industrial sectors, is necessary. To this end, the energy consumption of such plants needs to be analyzed in detail and possible savings need to be detected [3]. These ideas are supported by recent improvements in automation technology, where energy consumption can be continuously monitored with high resolution using intelligent PLC modules, allowing a comprehensive overview of the plant status to be gained. Such energy data can subsequently be used for further analysis, e.g. by cloud-based services, to optimize the overall efficiency.
However, industrial production plants in the basic materials industry are in general large-scale plants with a very high number of energy consumers [4], [5], [6]. Furthermore, such plants are normally highly distributed, which poses additional challenges for an energy optimization approach. In particular, connections between different production areas as well as constraints have to be considered, as the reduction of energy consumption in one part of the plant can cause increased consumption in other parts. In the literature, different approaches for energy management in the manufacturing and process industry have been introduced. Load demand scheduling has been presented in [7], [8], while [9] presents a layered approach to energy optimization based on service-oriented architectures. Approaches based on dynamic programming are introduced in [10], [11]. In [12] a data-driven system for energy management is developed.
In this paper we present a novel approach for energy management and optimization which is based on the ideas of reinforcement learning (RL) [13]. RL is essentially a learning-based control scheme which learns an optimal policy by direct interaction with the environment. RL has already been applied to energy management systems, with special emphasis on distributed smart grids and microgrids [14], [15] and on electric vehicles [16], [17]. In contrast, the application of RL to energy management in the manufacturing and process industry domain, with its inherent challenges, has not been presented to date.
In our work we assume a production process with a fully distributed basic control system architecture, i.e. each part of the production process is controlled by its own control system. However, we assume that the energy and production data are communicated to a centralized PC system where the RL algorithms are implemented. The control actions are then communicated back to the PLCs. For the definition of the RL problem we need to consider both the operational requirements given by the production process and its energy consumption. In general, these are conflicting requirements, such that the learning process needs to find a trade-off between production time and quality on the one hand and energy consumption on the other. In our approach, production requirements and goals are incorporated in two different ways. First, reward functions representing the production quality and time are developed and combined with the reward functions for the energy consumption, resulting in a multi-objective RL problem. Second, necessary constraints for the production process, particularly safety constraints, are implemented in the basic control system. Hence, safety constraints cannot be violated by the trial-and-error inherent in reinforcement learning; instead, the learning process is forced to learn the process constraints. The proposed approach is finally applied to a laboratory-scale test environment. The results obtained show the effectiveness of the proposed approach.
The paper is organized as follows. In Section II we state the learning problem for energy management in process plants. Section III introduces reinforcement learning and presents its application to energy optimization. In Section IV we present the results of applying the proposed RL framework to a laboratory bulk good system. Section V gives the conclusions and an outlook on future work.
II. PROBLEM STATEMENT
In this section we state the energy optimization problem in large-scale production environments and define its goals and challenges. The considered general structure of the production process is illustrated in Fig. 1.
Fig. 1: Structure of the considered distributed production process.
As can be seen, we consider distributed production processes where different subsystems interact with each other to form the overall system. Each subsystem is assumed to have its own control system and suitable sensing and monitoring devices to measure its production performance, e.g. required mass flows. To execute its task, each subsystem contains a certain number of energy consumers such as electrical drives, valves, compressors etc. The considered energy consumers are assumed to have either discrete behavior, like DOL motors or on-off valves, continuous behavior, like VSD drives, or hybrid behavior, like the considered vacuum pumps. Note that we generally consider not only the consumption of electrical energy but also other energy sources such as instrument air. The energy consumption of the consumers is assumed to be continuously measured by suitable energy metering devices in the PLC and then communicated to a centralized control system for monitoring and analysis.
After describing the considered system setup, we will now state the considered problem:
Given the distributed system $S$ with $l$ subsystems $S_i$ as illustrated in Fig. 1, where each subsystem contains $n$ energy consumers $E_{i,j}$ and corresponding energy measurements $e_{i,j}(t)$, $i = 1, \ldots, l$, $j = 1, \ldots, n$. Then, find the optimal energy consumption

$e = \sum_{t=0}^{T} \sum_{i,j} e_{i,j}(t)$ (1)

over a given production episode $t = 0, \ldots, T$ while simultaneously meeting given quality parameters $c_i(t)$ within this episode.
Some remarks on the previously defined problem are in order. The quality parameters $c_i(t)$ can be arbitrarily defined based on the given process objectives and available process measurements. Examples include product concentrations in chemical plants, mass flows in bulk good plants or processing times in manufacturing plants. The length of the considered production episode is closely related to these requirements, as quality parameters can only be assessed after some processing time. Hence, the episode should be at least as long as the processing times. The number of samples per episode should be determined such that the important dynamics of the energy consumption and the process parameters are represented. In particular, the processing times and operating points of the actuators strongly influence the energy consumption of the overall system and need to be carefully examined.
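As a minimal illustration of the objective (1) and the quality check, the following Python sketch evaluates both from logged episode data; the array shapes and the threshold-style quality criterion are purely hypothetical assumptions, not part of the original formulation.

```python
import numpy as np

def episode_energy_and_quality(e, c, c_min):
    """Evaluate Eq. (1) and a simple quality check for one production episode.

    e     : array of shape (T+1, l, n) with energy measurements e_ij(t)
    c     : array of shape (T+1, l) with quality parameters c_i(t)
    c_min : array of shape (l,) with assumed minimum admissible quality per subsystem
    """
    total_energy = e.sum()                    # sum over t, i and j, cf. Eq. (1)
    quality_met = bool((c >= c_min).all())    # quality parameters met throughout the episode
    return total_energy, quality_met
```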
Note that the problem description mainly focuses on discrete
and hybrid processes where operations and system behavior
are not solely continuous. Such hybrid systems containing
discrete and continuous dynamics and actuation are quite
common in the process and basic materials industries due to
discontinuous and delaying components like buffers, reactors
or conveyors as well as on-off actuators and actuators with
discontinuous actions. We will introduce an example of such
a process in Sec. IV.
III. ENERGY OPTIMIZATION USING
REINFORCEMENT LEARNING
After stating the problem, we will now introduce our reinforcement learning framework for the energy optimization of process plants. We start with a short introduction to RL and then apply RL to the energy management system.
A. Introduction to reinforcement learning
RL is basically a learning technique where an agent acts on its environment and learns from the rewards gained from these interactions. To this end, the environment is formulated as a Markov decision process (MDP). When the agent takes an action, it receives feedback from the environment in the form of a reward, which is the outcome of the action. If the interaction is positive, the reward will be positive, and vice versa. The main objective of the agent is to maximize the expected cumulative reward over a given time horizon called an episode. Formally, the MDP is described by the tuple $(\mathcal{S}, \mathcal{A}, P, R, p_0)$ with:
- the set of states $\mathcal{S}$,
- the set of actions $\mathcal{A}$,
- the transition model $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, such that $P(s_t, a_t, s_{t+1}) = p(s_{t+1} \mid s_t, a_t)$ is the transition probability from state $s_t$ to the following state $s_{t+1}$ when applying action $a_t$,
- the reward function $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$, which assigns a reward $r(s_t, a_t, s_{t+1})$ to each state transition $s_t \to s_{t+1}$,
- an initialization probability $p_0$ of the states.
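For illustration only, these components could be collected in a simple container as sketched below in Python; the type choices are our assumptions, and in the model-free setting used later $P$ and $R$ are never constructed explicitly but only sampled through interaction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Tuple[str, ...]    # e.g. one discrete entry per energy consumer
Action = Tuple[str, ...]   # e.g. one switching command per energy consumer

@dataclass
class TabularMDP:
    """Illustrative container mirroring the tuple (S, A, P, R, p0)."""
    states: List[State]
    actions: List[Action]
    p: Dict[Tuple[State, Action, State], float] = field(default_factory=dict)  # P(s, a, s')
    r: Dict[Tuple[State, Action, State], float] = field(default_factory=dict)  # R(s, a, s')
    p0: Dict[State, float] = field(default_factory=dict)                       # initial-state distribution
```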
Obviously, the RL agent interacts with the environment in discrete time steps. At each time step, an action $a_t$ is taken, forcing the system to enter state $s_{t+1}$. At the same time, the reward $r_t$ associated with the transition $(s_t, a_t, s_{t+1})$ is observed by the agent. The way the agent chooses its actions based on the current state is called the agent's policy $\pi: \mathcal{S} \to \mathcal{A}$. The policy can be chosen based on the agent's past experiences or even randomly. The goal of the reinforcement learning problem is to find the optimal policy by interacting with the environment, i.e. the policy which results in the highest possible cumulative reward. Hence, we want to maximize the return

$\rho^{\pi} = E[R \mid \pi]$ (2)

with the discounted reward

$R = \sum_t \gamma^t r_t,$ (3)

where $0 \le \gamma \le 1$ is the discount factor.
For solving the above RL problem, the well-known value function approach can be applied. In particular, Q-learning is well suited, which is based on the state-action value function defined as

$Q^{\pi}(a, s) = E\left[\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a\right].$ (4)

The state-action value function determines the expected reward to be gained when starting in state $s$, taking action $a$ and then following policy $\pi$. As we want to find the policy with the optimal reward, we are interested in finding the optimal action in the current state, which can be determined using the Bellman equation

$Q(a_t, s_t) = r_t(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_t, a_t, s_{t+1}) \max_{a_{t+1}} Q(a_{t+1}, s_{t+1})$ (5)

resulting in the optimal policy

$\pi(s_t) = \arg\max_{a_t} Q(a_t, s_t).$ (6)
However, in most real-world applications, the transition probabilities as well as the reward allocation are hardly predictable or even unknown. Hence, the state-action value functions cannot be determined directly either. To address this issue, model-based approaches, where the transition probabilities and rewards are learned by interacting with the environment, can be distinguished from model-free approaches like temporal difference (TD) learning [13]. In TD learning, the value of the state or state-action pair is continuously updated based on the temporal difference between the Q-values of the current state $s_t$ and the subsequent state $s_{t+1}$ and the reward $r_t(a_t, s_t)$ gained in this transition step as follows:

$Q_{t+1}(a_t, s_t) = Q_t(a_t, s_t) + \alpha_t \left( r_t + \gamma \max_{a_{t+1}} Q_t(a_{t+1}, s_{t+1}) - Q_t(a_t, s_t) \right)$ (7)

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. The subsequent action $a_{t+1}$ is normally chosen based on an $\epsilon$-greedy procedure [13], i.e. the agent chooses the optimal action using (6) with probability $1 - \epsilon$ and a random action with probability $\epsilon$ to sufficiently balance exploitation and exploration.
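To make the update rule (7) and the $\epsilon$-greedy selection concrete, the following Python sketch shows one tabular Q-learning step over discrete state and action sets. It is an illustrative sketch only, not the authors' PLC implementation; the dictionary-based Q-table and the default parameter values (matching the $\gamma = 0.9$ and initial $\alpha = 0.1$ reported in Sec. IV-C) are our assumptions.

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)], implicitly initialized to zero

def choose_action(state, actions, epsilon):
    """Epsilon-greedy action selection, cf. Eq. (6)."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference update, cf. Eq. (7)."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```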
B. Application to energy optimization
After the introduction to RL, we will now formulate our RL framework for energy optimization for the system setup illustrated in Fig. 1. To this end, we need to define the MDP for the energy optimization problem, i.e. we need to define the system states, the set of possible actions and the rewards to be gained.
As we are interested in energy optimization, the states of the MDP mainly need to represent the energy flows in the system. To this end, we assign a set of states $S_{i,j}$ to each energy consumer $E_{i,j}$ of the $i$th subsystem. A typical example of such a state set is $S_{i,j} = \{\text{off}, \text{on}\}$ for simple actuators. However, more than two states can be defined, e.g. for continuously operated actuators. The allocation of discrete states for continuously operated actuators is best chosen by considering the characteristic points of the energy profile, such as minimum and maximum values. This will be further evaluated for the example process below. Note that this definition of the states directly impacts the energy flows in the system, as typically different energy consumptions are associated with each state. Furthermore, it indirectly affects the quality parameters in the process, as the actuators typically control the production process. The resulting set of states of subsystem $i$ then yields $S_i = S_{i,1} \times \ldots \times S_{i,n}$.
The definition of the action space $A_{i,j}$ is done likewise. In this case, just two or three actions are assigned to each energy consumer. For simple on-off actuators, two actions representing turn on and turn off are defined. Continuously operating actuators are represented by the actions turn up, turn down and no change.
Note that in terms of the RL framework, continuously operating actuators can also be modelled by a continuous state and action space, e.g. by using Gaussian basis functions [18]. However, for the sake of simplicity and a PLC-ready implementation, we restrict ourselves to a discrete set of states and actions.
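As an illustration of this discrete encoding, the sketch below builds a subsystem's joint state and action sets as the Cartesian product of the per-consumer sets; the consumer names and their particular discretizations are hypothetical examples.

```python
from itertools import product

# Illustrative per-consumer state and action sets of one subsystem
consumer_states = {
    "drive_1": ["off", "on"],                        # simple on-off actuator
    "drive_2": ["off", "low", "high"],               # continuously operated, discretized
}
consumer_actions = {
    "drive_1": ["switch_off", "switch_on"],
    "drive_2": ["turn_down", "no_change", "turn_up"],
}

# S_i = S_i,1 x ... x S_i,n; the joint action set A_i is built analogously
subsystem_states = list(product(*consumer_states.values()))
subsystem_actions = list(product(*consumer_actions.values()))
```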
In addition to the state and action space, the rewards for each
state transition have to be defined. The appropriate definition
of rewards is of major importance as the energy optimization
problem has to fulfill different partly counteracting objectives.
More precisely, the energy flows need to be optimized under
the constraints that certain quality parameters like processing
times or product quality as well as requirements for process
safety are met. Consequently, the reward has to reflect and
balance these different requirements and needs to be carefully
chosen.
To cover the energy consumption in the reward function, we could simply use the measured energy consumption during the sample period $e_{i,j}(t)$. However, this would result in poor production performance. To avoid this, we normalize the energy consumption with respect to the actual quality parameters $c_i(t)$ within the given subsystem and define the reward function for the $i$th subsystem as

$r_i(s, a) = \frac{\sum_j e_{i,j}(t)}{c_i(t)}.$ (8)

Note that the term $c_i(t)$ can include different quality parameters.
Additionally, hard process constraints, which are normally implemented on the basic control level, need to be considered. Their integration into the rewards is essential to the learning process, as can be seen by considering a simple overflow sensor which locks a valve in the closed position. Obviously, the energy consumption of the actuator will be zero whether it is commanded open or closed, because the safety function keeps it locked closed. Hence, a reward not considering this would indicate no difference although the open command is incorrect. To avoid this, we penalize each violation of such hard process constraints by a corresponding penalty value $p_i(t)$. Hence, the final reward for each subsystem yields:

$r_i(s, a) = \frac{\sum_j e_{i,j}(t)}{c_i(t)} + p_i(t).$ (9)

The penalties need to be balanced carefully to allow for sufficient steering of the learning process away from inappropriate actions on the one hand, while not underweighting the normalized energy consumption on the other hand.
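A minimal sketch of the resulting reward computation for one subsystem is given below. It follows the structure of Eqs. (8) and (9) as printed; the constraint-check flag and the penalty magnitude are hypothetical placeholders, since the paper does not specify concrete values.

```python
def subsystem_reward(energy_samples, quality, constraint_violated, penalty=-100.0):
    """Reward of subsystem i for one sample period, cf. Eq. (9).

    energy_samples      : iterable of e_ij(t) for all consumers j of the subsystem
    quality             : quality parameter c_i(t), e.g. the delivered mass flow
    constraint_violated : True if a hard process constraint (e.g. an overflow interlock) fired
    penalty             : assumed penalty value p_i(t) applied on constraint violation
    """
    r = sum(energy_samples) / quality   # normalized energy consumption, cf. Eq. (8)
    if constraint_violated:
        r += penalty                    # steer the learning process away from inappropriate actions
    return r
```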
IV. APPLICATION EXAMPLE
After introducing the general approach to energy optimization, we now apply it to a laboratory-scale test facility. The task of the plant is to process bulk good as a supply for a subsequent dosing unit. After presenting the testbed, we will formulate the RL problem.
A. Laboratory bulk good system
We start with a short explanation of the laboratory-scale testbed, which is illustrated in Fig. 2. As shown, the testbed consists of four modules for processing bulk good material. Modules 1 and 2 represent typical supply, buffer and transportation stations. Module 1 consists of a container and a continuously controlled belt conveyor from which the bulk good is carried to a mini hopper which is the interface to module 2. Module 2 consists of a vacuum pump and a buffer container. The vacuum pump itself transports the material from module 1 into an internal container. The material is then released to the buffer container by a pneumatically actuated flap and then charged into a mini hopper which is the interface to the next module. Module 3 is the dosing station, consisting of a second vacuum pump with a similar setup as in module 2 and the dosing unit. This unit consists of a buffer container with a weighing system and a rotary feeder. The dosed material is finally transported to module 4 using a third vacuum pump and filled into transport containers.
Fig. 2: Schematic of the laboratory scale testbed.
Each of the four modules is equipped with its own PLC-based control system using a Siemens ET200SP. The four control systems communicate with each other via Profinet. In addition, each module has a set of sensors to monitor the module's state. In particular, each container is equipped with min/max level sensors and each mini hopper with overflow sensors. To additionally monitor the energy consumption of each module, Siemens energy metering modules are installed at each PLC. Hence, the energy consumption of the actuators, namely the conveyors and pumps, can be continuously monitored. As the energy consumption of the vacuum pump is especially influenced by the consumption of instrument air, we additionally take this into account in the PLC.
Note that the testbed mimics to some extent typical large-scale systems, which are modularized into smaller subsystems with their own control systems and suitable communication interfaces. Furthermore, due to the system structure with different buffer containers as well as the inherent discontinuous behavior of the vacuum pumps, this process constitutes a typical hybrid system with a mixture of discrete and continuous behavior. Due to this structure, the RL-based energy management system is especially beneficial, allowing the energy-optimal operation strategy to be mined from the process.
In the following experiments for testing our RL-approach we
will concentrate on the first two modules which are the supply
units for the subsequent dosing station. In particular, we want
to supply the dosing unit with the required amount of material
continuously processed by the dosing station while keeping the
energy consumption of all the actuators in the supply stations
as low as possible. Note that there exists no pre-programmed
sequence of actuator operations in the PLC when starting with
the learning process. However, to assure the safe operation
of the process, some interlocks to avoid buffer overflows are
implemented at the basic PLC level using the available sensor
information described above.
Note that we do not require a constant mass flow into the buffer at the dosing unit, but just a certain mass inflow within a given time period. However, the inflow should be adjusted to ensure that the bulk good level in the buffer does not fall below a given threshold, to allow for a continuous dosing operation. This constraint is represented by a high negative reward in the RL problem if the level falls below this threshold.
B. Energy management based on RL
We now apply the general RL framework for energy optimization presented in Sec. III to the bulk good system. To this end, the states and actions as well as the reward function have to be defined.
According to Sec. III-B, the states are defined by the actuator states, i.e. in our testbed, the states of the conveyor and the two vacuum pumps. We define two states, on and off, for both vacuum pumps and four states for the continuous conveyor: off, low, medium and maximum speed. These states are chosen in accordance with the specific energy profile of the conveyor systems. Furthermore, the actions for the vacuum pumps are defined as switch off and switch on, while the actions of the conveyor are speed down, speed up and no change.
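For the testbed, this discretization could be encoded as sketched below; the state names and the deterministic effect of each action on the actuator state are our illustrative assumptions about how the encoding could be realized.

```python
CONVEYOR_SPEEDS = ["off", "low", "medium", "max"]

def apply_conveyor_action(speed, action):
    """Map a conveyor action to the next discrete speed state."""
    idx = CONVEYOR_SPEEDS.index(speed)
    if action == "speed_up":
        idx = min(idx + 1, len(CONVEYOR_SPEEDS) - 1)
    elif action == "speed_down":
        idx = max(idx - 1, 0)
    return CONVEYOR_SPEEDS[idx]              # "no_change" keeps the current speed

def apply_pump_action(pump_state, action):
    """Map a vacuum-pump action to its next on/off state."""
    return "on" if action == "switch_on" else "off"

# Joint state of the supply section: (conveyor speed, pump 1, pump 2)
state = ("low", "off", "on")
action = ("speed_up", "switch_on", "switch_off")
next_state = (apply_conveyor_action(state[0], action[0]),
              apply_pump_action(state[1], action[1]),
              apply_pump_action(state[2], action[2]))
```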
The definition of the reward function is done using Eq. (9), i.e. it consists of the energy consumption normalized by the mass flow plus the penalties for violating process constraints. While the actual energy consumption can be directly obtained from the energy meters, the mass flow is calculated using a simple physical model of the mass flow dynamics in the plant, due to the lack of appropriate sensors. It is worth noting that the vacuum pumps exhibit a specific behavior. After switching on, an evacuation period occurs first, during which no product can be conveyed. Then the conveyed product follows a polynomial function until the buffer in the vacuum pump is full, which results in a sudden drop of the mass flow. The RL algorithm should be able to cope with this specific system behavior.
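The paper does not state the mass-flow model itself; the sketch below only illustrates the kind of simple physical model that could reproduce the described pump behavior, with entirely assumed parameters (evacuation time, polynomial coefficients, buffer capacity).

```python
def pump_conveyed_mass(t_on, evac_time=2.0, coeffs=(0.8, -0.05), buffer_capacity=1.5):
    """Illustrative cumulative conveyed mass [kg] of a vacuum pump.

    t_on : time in seconds since the pump was switched on; all parameters are assumptions.
    """
    if t_on <= evac_time:
        return 0.0                                   # evacuation period: no conveying possible
    dt = t_on - evac_time
    mass = coeffs[0] * dt + coeffs[1] * dt ** 2      # polynomial conveying phase
    return min(mass, buffer_capacity)                # internal buffer full: mass flow drops
```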
Additionally, we have to account for different process constraints and shut-down conditions. These mainly include the overflow sensors at the mini hoppers and the different buffer stations. Similarly, after a full surge cycle of the vacuum pump, the material has to be released to the buffer. This is interlocked by an additional overflow sensor. Following Sec. III, these constraints are incorporated using additional penalty factors.
C. Results
In this section we present the results of the RL framework at the testbed. To this end, the RL algorithm based on Q-learning is implemented on a Siemens S7 PLC system. The realization on the PLC is done by means of different function blocks. The core of the realization is the Q-learning algorithm, which is implemented in SCL in its own function block using an array-based structure to store the look-up table of the state-action values during the learning process.
During our experiments, the discount factor is chosen as $\gamma = 0.9$ and the learning rate is first fixed at $\alpha = 0.1$ and vanishes for a high number of episodes. The actions are chosen based on an $\epsilon$-greedy strategy such that the first learning steps are solely based on random exploration, later turning to a more exploitative learning behavior. The length of a production episode is fixed to 30 s.
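A sketch of how such an episodic training loop with a decaying learning rate and an $\epsilon$-schedule could look is given below, in Python rather than SCL and reusing the `choose_action` and `q_update` helpers sketched in Sec. III-A; the environment interface, the decay laws and the step timing are assumptions made for illustration.

```python
def train(env, actions, episodes=500, steps_per_episode=30):
    """Episodic Q-learning with decaying epsilon and learning rate (illustrative)."""
    for ep in range(episodes):
        epsilon = max(0.05, 1.0 - ep / episodes)   # from pure exploration towards exploitation
        alpha = 0.1 / (1.0 + 0.01 * ep)            # learning rate vanishing over the episodes
        state = env.reset()
        for _ in range(steps_per_episode):         # e.g. one step per second in a 30 s episode
            action = choose_action(state, actions, epsilon)
            next_state, reward = env.step(action)  # assumed interface to the plant/PLC
            q_update(state, action, reward, next_state, actions, alpha=alpha, gamma=0.9)
            state = next_state
```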
In our test scenario we consider a continuous supply of mass flow through the first and second station to the buffer at the dosing unit. The goal is to learn an energy-optimal behavior of this part of the system while assuring a sufficient supply of material for the dosing operation. Fig. 3 shows the learning results for this scenario in terms of normalized energy consumption. Note that due to the continuous operation, the results are obtained from a moving average over the episodes with a step size of 6 to account for varying conditions in the buffers. As can be seen, the learning process first enters the exploration phase with strongly varying energy consumptions. As the learning continues, the energy efficiency in terms of the reward increases over time and results in performance comparable to strategies implemented by hand. As already stated, the vacuum pumps exhibit a specific dynamic behavior. From the energy-optimal operation point of view, the pumps need to be switched on until the buffer is full and then switched off. For the machines at hand, this means operation periods of 2-3 steps before the pump is switched off again. Too short operation times waste energy due to the evacuation time, while too long operation times waste energy due to the full buffer. Hence, a corresponding operation strategy has to be learned by the RL algorithm, which is illustrated in Fig. 4. As can be seen, during learning the duration of the vacuum pump operation converges to a region of 2-3 operation steps; hence, the RL algorithm is able to learn the optimal operation of the vacuum pumps.
Fig. 3: Normalized energy efficiency (reward) over learning
episodes.
Fig. 4: Operation steps of the vacuum pump over learning
episodes.
V. CONCLUSIONS
In this paper we presented a novel approach for energy optimization in large-scale distributed production environments. The approach is based on a formulation of the optimization problem as a reinforcement learning problem. Hence, the energy consumption of the production process is minimized by learning from subsequent operation sequences while maintaining the production quality represented by suitable quality functions. The approach is applied to a laboratory testbed with very promising results. In future research, the developed RL approach can be generalized to also account for continuous state and action spaces to better represent the continuous behavior of the production process. Furthermore, as the current approach requires a centralized learning process, a more distributed approach to the RL problem, potentially incorporating ideas from game theory, will be a topic of future research.
REFERENCES
[1] S. Karnouskos and A. Walter Colombo and J. L. Martinez Lastra and C.
Popescu, “Towards the energy efficient future factory,” 2009 7th IEEE
International Conference on Industrial Informatics, 2009
[2] A. Cannata and S. Karnouskos and M. Taisch, “Energy efficiency driven
process analysis and optimization in discrete manufacturing,” 2009 35th
Annual Conference of IEEE Industrial Electronics, 2009
[3] J. R. Duflou and J. W. Sutherland and D. Dornfeld and C. Herrmann and J. Jeswiet and S. Kara and M. Hauschild and K. Kellens, “Towards energy and resource efficient manufacturing: A processes and systems approach,” CIRP Annals - Manufacturing Technology, 61(2), pp. 587-609, 2012
[4] K. Bunse and M. Vodicka and P. Schönsleben and M. Brülhart and F. O. Ernst, “Integrating energy efficiency performance in production management – gap analysis between industrial needs and scientific literature,” Journal of Cleaner Production, 19(6-7), pp. 667-679, 2011
[5] E. Oh and S.-Y. Son, “Toward dynamic energy management for green
manufacturing systems,” IEEE Communications Magazine, 54(10), pp.
74-79, 2016
[6] K. Tanaka, “Review of policies and measures for energy efficiency in industry sector,” Energy Policy, 39(10), pp. 6532-6550, 2011
[7] M. Choobineh and S. Mohagheghi, “Optimal Energy Management in
an Industrial Plant Using On-Site Generation and Demand Scheduling,”
IEEE Transactions on Industry Applications, 52(3), pp. 1945-1952, 2016
[8] Y.-C. Li and S. H. Hong, “Real-Time Demand Bidding for Energy Management in Discrete Manufacturing Facilities,” IEEE Transactions on Industrial Electronics, 64(1), pp. 739-749, 2017
[9] A. Florea and J. A. Garcia Izaguirre Montemayor and C. Postelnicu and
J. L. Martinez Lastra, “A cross-layer approach to energy management
in manufacturing,” IEEE 10th International Conference on Industrial
Informatics, 2012
[10] C. K. Pang and C. V. Le, “Optimization of Total Energy Consumption
in Flexible Manufacturing Systems Using Weighted P-Timed Petri Nets
and Dynamic Programming,” IEEE Transactions on Automation Science
and Engineering, 11(4), pp. 1083-1096, 2014
[11] C. V. Le and C. K. Pang, “Robust Total Energy Optimization of Flexible Manufacturing Systems Based on Renyi Mean-Entropy Criterion,” IEEE Transactions on Automation Science and Engineering, 13(1), pp. 355-367, 2016
[12] Z. Sun and D. Wei and L. Wang and L. Li, “Data driven production run-time energy control of manufacturing systems,” 2015 IEEE International Conference on Automation Science and Engineering (CASE), 2015
[13] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” Cambridge, MA, USA: MIT Press, 1998
[14] E. Kuznetsova and Y.-F. Li and C. Ruiz and E. Zio and G. Ault and
K. Bell, “Reinforcement learning for microgrid energy management,”
Energy, 59, pp. 133-146, 2013
[15] L. Raju and S. Sankar and R.S. Milton, “Distributed Optimization of
Solar Micro-grid Using Multi Agent Reinforcement Learning,” Procedia
Computer Science, 46, pp. 231-239, 2015
[16] C. Liu and Y. L. Murphey, “Power management for Plug-in Hybrid
Electric Vehicles using Reinforcement Learning with trip information,”
2014 IEEE Transportation Electrification Conference and Expo (ITEC),
2014
[17] T. Liu and Y. Zou and D. Liu and F. Sun, “Reinforcement Learning-Based Energy Management Strategy for a Hybrid Electric Tracked Vehicle,” Energies, 8(7), pp. 7243-7260, 2015
[18] R.M. Kretchmar and C.W. Anderson, “Comparison of CMACs and
radial basis functions for local function approximators in reinforcement
learning,” Proceedings of International Conference on Neural Networks,
1997