© 2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Appeared as: D. Schwung, T. Kempe, A. Schwung, S. X. Ding: "Self-optimization of energy consumption in complex bulk good processes using reinforcement learning"; Proc. of the 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), 2017
DOI: 10.1109/INDIN.2017.8104776
Self-optimization of energy consumption in complex bulk good
processes using reinforcement learning
Dorothea Schwung∗, Tim Kempe∗, Andreas Schwung∗, Steven X. Ding†
∗South Westfalia University of Applied Sciences, Soest, Germany
{schwung.dorothea, kempe.tim, schwung.andreas}@fh-swf.de
†University of Duisburg-Essen, Duisburg, Germany
steven.ding@uni-due.de
Abstract—This paper presents a novel approach to the optimization of energy consumption in large-scale industrial bulk good processes. The approach relies on a model-free self-learning algorithm that uses only the available process data and builds on ideas from the well-known reinforcement learning framework. To this end, the energy consumers of the plant are integrated into the optimization framework such that each consumer learns its own optimal energy profile for a given production task. The approach is implemented on a laboratory-size testbed where the task is the supply of bulk good to a subsequent dosing section. The capability of the approach is underlined by the results obtained at the testbed.
I. INTRODUCTION
Environmental requirements as well as increasing energy costs demand a more energy-efficient operation of industrial production facilities. Although the energy consumption per processed product has been reduced in recent years, further efforts are required, in particular to reduce the emission of detrimental greenhouse gases. In parallel, the energy costs for production facilities are continuously increasing, forcing industry to reduce energy consumption in order to stay competitive.
This problem becomes even more severe in industrial sectors
like steel making industry, process industry and basic materials
industry, to name a few, where the energy consumption is
extremely high. Here, even minor percentage savings of energy in the production process can lead to both significantly reduced emissions and considerable cost reductions [1], [2]. Hence, an
effective and sustainable reduction of energy consumption
especially in these industrial sectors is necessary. To this end
the energy consumption of such plants needs to be analyzed
in detail and possible savings need to be detected [3]. These
ideas are supported by recent improvements in automation technology, where energy consumption can be continuously monitored with high resolution using intelligent PLC modules, allowing a comprehensive overview of the plant status to be gained. Such energy data can subsequently be used for further analysis, e.g. by cloud-based services, to optimize the overall efficiency.
However, industrial production plants in the basic materials
industry are in general large-scale plants with a very high
number of energy consumers [4], [5], [6]. Furthermore, such
plants are normally highly distributed which poses additional
challenges to an energy optimization approach. Particularly,
connections between different production areas as well as
constraints have to be considered as the reduction of energy
consumption in one part of the plant can cause increased
consumption in other parts. In the literature, different ap-
proaches for energy management in the manufacturing and
process industry have been introduced. Load demand schedul-
ing has been presented in [7], [8] while [9] presents a layered
approach for energy optimization based on service oriented
architectures. Approaches based on dynamic programming are
introduced in [10], [11]. In [12] a data-driven system is
developed for energy management.
In this paper we present a novel approach for energy manage-
ment and optimization which is based on the ideas of rein-
forcement learning (RL) [13]. RL is essentially a learning-based control scheme which learns an optimal policy by direct interaction with the environment. RL has already been applied
to energy management systems with special emphasis on
distributed smart grids and microgrids [14], [15] and on
electric vehicles [16], [17]. In contrast, the application of RL to
energy management in the manufacturing and process industry
domain with its inherent challenges has not been presented to
date.
In our work we assume a production process with a fully
distributed basic control system architecture, i.e. each pro-
duction process is controlled by its own control system.
However, we assume that the energy and production data
are communicated to a centralized PC-system where the RL-
algorithms are implemented. The control actions are then
communicated backwards to the PLCs. For the definition of
the RL-problem we need to consider both the operational requirements given by the production process and its energy consumption. In general, these are conflicting requirements, such that the learning process needs to find a trade-off between production time and quality on the one hand and energy consumption on the other. In our approach, production
requirements and goals are incorporated in two different ways.
First, reward functions representing the production quality and
time are developed and combined with the reward functions
for the energy consumption ending up in a multiobjective
RL-problem. Second, necessary constraints for the production
process, particularly safety constraints, are implemented in
the basic control system. Hence, safety constraints cannot be violated by the inherent trial-and-error of reinforcement learning; instead, the learning process is forced to learn the process constraints. The proposed approach is finally applied to a
laboratory scale test environment. The results obtained show
the effectiveness of the proposed approach.
The paper is organized as follows. In Section II we state the
learning problem for energy management in process plants.
Section III gives an introduction to reinforcement learning and presents the application of RL to energy optimization. In Section IV
we present the results of the proposed RL framework to a
laboratory bulk good system. Section V gives the conclusions
and an outlook to future work.
II. PROBLEM STATEMENT
In this section we state the energy optimization problem in
large-scale production environments and define its goals and
challenges. The considered general structure of the production
process is illustrated in Fig. 1. As can be seen, we consider distributed production processes where different subsystems interact with each other to form the overall system.

Fig. 1: Structure of the considered distributed production process.

Each of the subsystems is assumed to have its own control system and suitable sensing and monitoring devices to measure its production performance, e.g. required mass flows. To
execute its task each subsystem contains a certain number of
energy consumers like electrical drives, valves, compressors
etc. The considered energy consumers are assumed to have either discrete behavior (e.g. DOL motors or on-off valves), continuous behavior (e.g. VSD drives), or hybrid behavior (e.g. the considered vacuum pumps). Note that we generally
consider not only the consumption of electrical energy but
also other energy sources like e.g. instrument air. The energy
consumption of the consumers is assumed to be continuously
measured by suitable energy metering devices in the PLC
and then communicated to a centralized control system for
monitoring and analysis.
After describing the considered system setup, we will now
state the considered problem:
Given the distributed system $S$ with $l$ subsystems $S_i$ as illustrated in Fig. 1, where each subsystem contains $n$ energy consumers $E_{i,j}$ and corresponding energy measurements $e_{i,j}(t)$, $i = 1,\ldots,l$, $j = 1,\ldots,n$. Then, find the optimal energy consumption

$$e^*(t) = \sum_{t=0}^{T} \sum_{i,j} e_{i,j}(t) \qquad (1)$$

over a given production episode $t = 0,\ldots,T$ while simultaneously meeting given quality parameters $c_i(t)$ within this episode.
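As a minimal illustration of the objective in Eq. (1), the following Python sketch sums the sampled energy measurements of all consumers over one production episode; the nested-list data layout (indexed by sample, subsystem and consumer) is an assumption made for illustration, not the plant's actual data interface.

```python
def episode_energy(e):
    """Total energy of Eq. (1): e[t][i][j] is the measurement e_ij(t) at sample t."""
    return sum(sum(sum(consumers) for consumers in sample) for sample in e)

# usage: episode_energy(e_samples) for one episode t = 0..T
```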
Some remarks on the previously defined problem are in order. The quality parameters $c_i(t)$ can be arbitrarily defined based on the given process objectives and available process measurements. Examples include product concentrations in
chemical plants, mass flows in bulk good plants or processing
times in manufacturing plants. The length of the considered
production episode is closely related to these requirements as
quality parameters can only be accessed after some processing
time. Hence, the episode should be at least as long as the
processing times. The number of samples per episode should
be determined such that the important dynamics of the en-
ergy consumption and the process parameters are represented.
Particularly, the processing times and operation points of
actuators strongly influence the energy consumption of the
overall system and need to be carefully examined.
Note that the problem description mainly focuses on discrete
and hybrid processes where operations and system behavior
are not solely continuous. Such hybrid systems containing
discrete and continuous dynamics and actuation are quite
common in the process and basic materials industries due to
discontinuous and delaying components like buffers, reactors
or conveyors as well as on-off actuators and actuators with
discontinuous actions. We will introduce an example of such
a process in Sec. IV.
III. ENERGY OPTIMIZATION USING
REINFORCEMENT LEARNING
After stating the problem we will now introduce our rein-
forcement learning framework for the energy optimization of
process plants. We start with a short introduction to RL and then apply RL to the energy management system.
A. Introduction to reinforcement learning
RL is basically a learning technique where an agent acts on its environment and learns from rewards gained from these interactions. To this end, the environment is formulated as a Markov decision process (MDP). When the agent takes an action, it receives feedback from the environment in the form of a reward, which is the outcome of the action. If the interaction is positive, the reward will be positive and vice versa. The main objective of the agent is to maximize the expected cumulative reward over a given time horizon called an episode. Formally, the MDP is described by the tuple $(S, A, P, R, p_0)$ with:
• the set of states $S$,
• the set of actions $A$,
• the transition model $P : S \times A \times S \to [0, 1]$, such that $P(s_t, a_t, s_{t+1}) = p(s_{t+1} \mid s_t, a_t)$ is the transition probability from state $s_t$ to the following state $s_{t+1}$ by applying action $a_t$,
• the reward function $R : S \times A \times S \to \mathbb{R}$, which assigns a reward $r(s_t, a_t, s_{t+1})$ to each state transition $s_t \to s_{t+1}$,
• an initialization probability $p_0$ of the states.
Obviously, the RL-agent interacts with the environment in discrete time steps. At each time step, an action $a_t$ is taken, forcing the system to enter state $s_{t+1}$. At the same time the reward $r_t$ associated with the transition $(s_t, a_t, s_{t+1})$ is observed by the agent. The way the agent chooses its actions based on the current state is called the agent's policy $\pi : S \to A$. The policy can be chosen based on the agent's past experiences or even randomly. The goal of the reinforcement learning problem is to find the optimal policy by interacting with the environment, i.e. the policy which results in the highest possible cumulated reward. Hence, we want to maximize the return

$$\rho^\pi = E[R \mid \pi] \qquad (2)$$

with the discounted reward

$$R = \sum_t \gamma^t r_t, \qquad (3)$$

where $0 \le \gamma \le 1$ is the discount factor.
For solving the above RL-problem, the well-known value function approach can be applied. In particular, Q-learning is well suited, which is based on the state-action value function defined as

$$Q^\pi(a, s) = E\Big[\sum_t \gamma^t r_t \,\Big|\, s_0, a_0, \pi\Big]. \qquad (4)$$

The state-action value function determines the expected reward to be gained when starting in state $s$, taking action $a$ and then following policy $\pi$. As we want to find the policy with the optimal reward, we are interested in finding the optimal action in the current state, which can be determined using the Bellman equation

$$Q^*(a_t, s_t) = r_t(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_t, a_t, s_{t+1}) \max_{a_{t+1}} Q^*(a_{t+1}, s_{t+1}) \qquad (5)$$

resulting in the optimal policy

$$\pi^*(s_t) = \arg\max_a Q^*(a_t, s_t). \qquad (6)$$
However, in most real-world applications, the transition probabilities as well as the reward allocation are hardly predictable or even unknown. Hence, the state-action value functions cannot be determined directly either. To address this issue, model-based approaches, where the transition probabilities and rewards are learned by interacting with the environment, and model-free approaches like temporal difference (TD) learning can be distinguished [13]. In TD learning, the value of the state or state-action pair is continuously updated based on the temporal difference between the Q-values of the current state $s_t$ and the subsequent state $s_{t+1}$ and the reward $r_t(a_t, s_t)$ gained in this transition step as follows:

$$Q_{t+1}(a_t, s_t) = Q_t(a_t, s_t) + \alpha_t \Big( r_t + \gamma \max_{a_{t+1}} Q_t(a_{t+1}, s_{t+1}) - Q_t(a_t, s_t) \Big) \qquad (7)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. The subsequent action $a_{t+1}$ is normally chosen based on an $\epsilon$-greedy procedure [13], i.e. the agent chooses the optimal policy using (6) with probability $1-\epsilon$ and a random policy with probability $\epsilon$ to sufficiently balance between exploitation and exploration.
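To make the above concrete, the following is a minimal Python sketch of tabular Q-learning with the TD update of Eq. (7) and $\epsilon$-greedy action selection. The environment interface (reset/step), the state and action encodings and the parameter values are assumptions for illustration; they do not represent the PLC implementation described in Sec. IV.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one (Eq. 6)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, eps=0.2):
    """Run one production episode and update Q with the TD rule (Eq. 7)."""
    state = env.reset()                                  # e.g. joint actuator state at episode start
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions, eps)
        next_state, reward, done = env.step(action)      # hypothetical environment interface
        td_target = reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state = next_state
    return Q

# usage: Q = defaultdict(float); q_learning_episode(my_env, Q, my_actions)
```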
B. Application to energy optimization
After the introduction to RL, we will now formulate our RL framework for energy optimization for the presented
system setup illustrated in Fig. 1. To this end, we need to
define the MDP for the energy optimization problem, i.e. we
need to define the system states, the set of possible actions
and the rewards to be gained.
As we are interested in energy optimization, the states of the MDP mainly need to represent the energy flows in the system. To this end, we assign a set of states $S_{i,j}$ to each energy consumer $E_{i,j}$ of the $i$th subsystem. A typical example of such a state set is $S_{i,j} = \{\text{off}, \text{on}\}$ for simple actuators. However, more than two states can be defined, e.g. for continuously operated actuators. The allocation of discrete states for continuously operated actuators is best chosen by considering the characteristic points of the energy profile like minimum and maximum values. This will be further evaluated for the example process below. Note that this definition of the states directly impacts the energy flows in the system, as typically different energy consumptions are associated with each state. Furthermore, it indirectly affects the quality parameters of the process, as the actuators typically control the production process. The resulting set of states of subsystem $i$ then yields $S_i = S_{i,1} \times \ldots \times S_{i,n}$.
The definition of the action space $A_{i,j}$ is done likewise. In this case just two or three actions are assigned to each energy consumer. For simple on-off actuators, two actions representing turn-on and turn-off are defined. Continuously operating actuators are represented by the actions turn up, turn down and no change.
Note that in terms of the RL framework, continuously operating actuators can also be modelled by a continuous state and action space, e.g. by using Gaussian basis functions [18]. However, for the sake of simplicity and a PLC-ready implementation we restrict ourselves to a discrete set of states and actions.
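The following sketch illustrates, under assumed speed levels and action names, how a continuously operated actuator could be discretized into states at characteristic points of its energy profile and driven by turn-up/turn-down/no-change actions.

```python
# Illustrative discretization of one continuously operated actuator (assumed levels)
# and of one simple on-off consumer; names and levels are placeholders.
CONVEYOR_STATES = ["off", "low", "medium", "max"]   # characteristic energy-profile points
CONVEYOR_ACTIONS = ["down", "no_change", "up"]
PUMP_STATES = ["off", "on"]
PUMP_ACTIONS = ["switch_off", "switch_on"]

def apply_conveyor_action(state, action):
    """Move one step along the discrete speed levels; saturate at the ends."""
    idx = CONVEYOR_STATES.index(state)
    if action == "up":
        idx = min(idx + 1, len(CONVEYOR_STATES) - 1)
    elif action == "down":
        idx = max(idx - 1, 0)
    return CONVEYOR_STATES[idx]
```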
In addition to the state and action space, the rewards for each
state transition have to be defined. The appropriate definition
of rewards is of major importance as the energy optimization
problem has to fulfill different partly counteracting objectives.
More precisely, the energy flows need to be optimized under
the constraints that certain quality parameters like processing
times or product quality as well as requirements for process
safety are met. Consequently, the reward has to reflect and
balance these different requirements and needs to be carefully
chosen.
To cover the energy consumption in the reward function, we could simply use the measured energy consumption during the sample period $e_{i,j}(t)$. However, this would result in poor production performance. To avoid this, we normalize the energy consumption with respect to the actual quality parameters $c_i(t)$ within the given subsystem and define the reward function for the $i$th subsystem as

$$r_i(s, a) = \frac{\sum_j e_{i,j}(t)}{c_i(t)}. \qquad (8)$$

Note that the term $c_i(t)$ can include different quality parameters.
Additionally, hard process constraints, which are normally implemented on the basic control level, need to be considered. Their integration into the rewards is essential to the learning process, as can be seen by considering a simple overflow sensor which locks a valve in the closed position. Obviously, the energy consumption of the actuator will be zero whether it is commanded open or closed, due to the safety function locking it closed. Hence, a reward not considering this would indicate no difference although the open command is incorrect. To avoid this, we penalize each violation of such hard process constraints by a corresponding penalty value $p_i(t)$. Hence, the final reward for each subsystem yields

$$r_i(s, a) = \frac{\sum_j e_{i,j}(t)}{c_i(t)} + p_i(t). \qquad (9)$$

The penalties need to be balanced carefully to allow for sufficient steering of the learning process away from inappropriate actions on the one hand, while not underweighting the normalized energy consumption on the other hand.
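A minimal sketch of how the reward of Eq. (9) could be computed per subsystem is given below; the variable names, the penalty magnitude and the constraint check are illustrative assumptions, and the balancing of the penalty against the normalized energy term follows the discussion above.

```python
def subsystem_reward(energy_readings, quality, constraint_violated,
                     penalty=-10.0, eps=1e-6):
    """Reward of Eq. (9): energy normalized by the quality parameter plus a penalty term.

    energy_readings     -- list of e_ij(t) for the consumers of this subsystem
    quality             -- quality parameter c_i(t), e.g. conveyed mass in this sample period
    constraint_violated -- True if a hard process constraint (e.g. overflow interlock) was hit
    penalty             -- assumed penalty value p_i(t); must be balanced against the energy term
    """
    normalized_energy = sum(energy_readings) / max(quality, eps)  # avoid division by zero
    p = penalty if constraint_violated else 0.0
    return normalized_energy + p
```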
IV. APPLICATION EXAMPLE
After introducing the general approach to the energy optimiza-
tion we apply the presented approach to a laboratory scale
test facility. The task of the plant is to process bulk good as
a supply for a subsequent dosing unit. After presenting the
testbed we will formulate the RL-problem.
A. Laboratory bulk good system
We start with a short explanation of the laboratory scale testbed
which is illustrated in Fig. 2. As shown the testbed consists
of four modules for processing bulk good material. Modules
1 and 2 represent typical supply, buffer and transportation
stations. Module 1 consists of a container and a continuously
controlled belt conveyor from which the bulk good is carried
to a mini hopper which is the interface to module 2. Module 2
consists of a vacuum pump and a buffer container. The vacuum
pump itself transports the material from module 1 into an
internal container. The material is then released to the buffer
container by a pneumatically actuated flap and then charged
into a mini hopper which is the interface to the next module.
Module 3 is the dosing station, consisting of a second vacuum
pump with similar setup as in module 2 and the dosing unit.
This unit consists of a buffer container with a weighing system
and a rotary feeder. The dosed material is finally transported to
module 4 using a third vacuum pump and filled into transport
containers.

Fig. 2: Schematic of the laboratory scale testbed.

Each of the four modules is equipped with its own PLC-based control system using a Siemens ET200SP. The four
control systems communicate with each other via Profinet. In
addition, each module has a set of sensors to monitor the
modules state. In particular, each container is equipped with
min/max level sensors and each mini hopper with overflow
sensors. To additionally monitor the energy consumption of
each module, Siemens energy metering modules are installed
at each PLC. Hence, the energy consumption of the actua-
tors, namely the conveyors and pumps, can be continuously
monitored. As the energy consumption of the vacuum pump
is especially influenced by the consumption of instrument air,
we take this additionally into account in the PLC.
Note that the testbed mimics to some extent typical large-scale systems which are modularized into smaller subsystems with their own control systems and suitable communication interfaces. Furthermore, due to the system structure with different buffer containers as well as the inherent discontinuous behavior of the vacuum pump, this process constitutes a typical hybrid system with a mixture of discrete and continuous behavior. Due to this structure, the RL-based energy management system is especially beneficial, as it allows the energy-optimal operation strategy to be mined.
In the following experiments for testing our RL-approach we
will concentrate on the first two modules which are the supply
units for the subsequent dosing station. In particular, we want
to supply the dosing unit with the required amount of material
continuously processed by the dosing station while keeping the
energy consumption of all the actuators in the supply stations
as low as possible. Note that there exists no pre-programmed
sequence of actuator operations in the PLC when starting with
the learning process. However, to assure the safe operation
of the process, some interlocks to avoid buffer overflows are
implemented at the basic PLC level using the available sensor
information described above.
Note that we do not require a constant mass flow into the
buffer at the dosing unit but just a certain mass inflow within
a given time period. However, the inflow should be adjusted to
assure that the bulk good level in the buffer does not fall below
a given threshold to allow for a continuous dosing operation.
This constraint will be represented by a high negative reward
in the RL-problem if the level falls below this threshold.
B. Energy management based on RL
We now apply the general RL-framework for energy optimiza-
tion presented in Sec. III to the bulk good system. To this end,
the states and actions as well as the reward function have to
be defined.
According to Sec. III-B, the states are defined by the actuator
states, i.e. in our testbed, the states of the conveyor and the two vacuum pumps. We define the two states on and off for both vacuum pumps and four states (off, low, medium and maximum speed) for the continuous conveyor. These states are chosen in accordance with the specific energy profile of the conveyor systems. Furthermore, the actions for the vacuum pumps are defined as switch off and switch on, while the actions of the conveyor are speed down, speed up and no change.
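Assuming that the actuator states and actions are combined jointly per time step (the paper does not state the exact encoding), the resulting discrete spaces for the two supply modules can be enumerated as follows, which also fixes the size of the Q-table.

```python
from itertools import product

# Assumed joint state/action spaces for the conveyor and the two vacuum pumps.
conveyor_states = ["off", "low", "medium", "max"]
pump_states = ["off", "on"]
conveyor_actions = ["speed_down", "no_change", "speed_up"]
pump_actions = ["switch_off", "switch_on"]

states = list(product(conveyor_states, pump_states, pump_states))      # 4*2*2 = 16
actions = list(product(conveyor_actions, pump_actions, pump_actions))  # 3*2*2 = 12
print(len(states), len(actions))  # 16 12 -> Q-table with 192 entries
```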
The reward function is defined using Eq. (9), i.e. it consists of the energy consumption normalized by the mass flow and the penalties for violating process constraints. While the actual energy consumption can be
directly obtained from the energy meters, the mass flow is
calculated by using a simple physical model of the mass flow
dynamics in the plant due to the lack of appropriate sensors.
It is worth noting that the vacuum pumps exhibit a specific behavior. After switching on, there is first an evacuation period during which conveying of product is not possible. Then the conveyed product follows a polynomial function until the buffer in the vacuum pump is full, which results in a sudden drop of the mass flow. The RL-algorithm should be able to cope with this specific system behavior.
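Since the mass flow is estimated from a simple physical model rather than measured, the following piecewise sketch illustrates what such a model of one vacuum pump could look like (evacuation phase, polynomial ramp-up, saturation when the internal buffer is full); the functional form and all parameter values are assumptions, not the authors' model.

```python
def pump_mass_flow(t_on, t_evac=2.0, k=0.8, buffer_capacity=5.0):
    """Illustrative piecewise model of the conveyed mass flow of one vacuum pump.

    t_on            -- time since the pump was switched on [s]
    t_evac          -- assumed evacuation period with no conveying [s]
    k               -- assumed conveying coefficient [kg/s^2]
    buffer_capacity -- assumed internal buffer capacity [kg]
    Returns (mass_flow, conveyed_mass).
    """
    if t_on < t_evac:
        return 0.0, 0.0                                    # evacuation phase: no product conveyed
    tau = t_on - t_evac
    conveyed = min(0.5 * k * tau**2, buffer_capacity)      # polynomial ramp-up of conveyed mass
    if conveyed >= buffer_capacity:
        return 0.0, buffer_capacity                        # internal buffer full: flow drops to zero
    return k * tau, conveyed                               # flow is the derivative of conveyed mass
```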
Additionally, we have to account for different process constraints and shutdown conditions. These mainly include overflow sensors at the mini hopper and the different buffer stations. Similarly, after a full surge cycle of the vacuum pump, the material has to be released to the buffer, which is interlocked by an additional overflow sensor. Following Sec. III, these constraints are incorporated using additional penalty factors.
C. Results
In this section we present results of the RL-framework at the
testbed. To this end, the RL-algorithm based on Q-learning is implemented on a Siemens S7 PLC system. The realization
on the PLC is done by means of different function blocks.
The core of the realization is the Q-learning algorithm which
is implemented in SCL in its own function block using an
array-based structure to store the look-up table of the state-
action values during the learning process.
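As an illustration of the array-based look-up table mentioned above, the following Python sketch shows how a flat array indexed by (state, action) can carry the Q-values and the update of Eq. (7); it is not the authors' SCL function block, and the table dimensions are the assumed joint spaces from Sec. IV-B.

```python
# Illustrative flat-array Q-table as it could be laid out in a PLC data block;
# indices and sizes are assumptions (16 joint states x 12 joint actions, see above).
N_STATES, N_ACTIONS = 16, 12
q_table = [0.0] * (N_STATES * N_ACTIONS)

def q_index(state_id, action_id):
    """Map a (state, action) pair to a position in the flat array."""
    return state_id * N_ACTIONS + action_id

def q_update(state_id, action_id, reward, next_state_id, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update (Eq. 7) on the flat array."""
    best_next = max(q_table[q_index(next_state_id, a)] for a in range(N_ACTIONS))
    i = q_index(state_id, action_id)
    q_table[i] += alpha * (reward + gamma * best_next - q_table[i])
```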
During our experiments, the discount factor is chosen as $\gamma = 0.9$ and the learning rate is first fixed at $\alpha = 0.1$ and vanishes for a high number of episodes. The actions are chosen based on an $\epsilon$-greedy strategy such that the first learning steps are solely based on random exploration, turning later to more exploitative learning behavior. The length of a production episode is fixed to 30 s.
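The decaying exploration and learning-rate behavior described above could, for example, be realized with schedules of the following form; the concrete decay constants are assumptions for illustration only.

```python
def schedules(episode, eps_start=1.0, eps_min=0.05, eps_decay=0.99,
              alpha_start=0.1, alpha_decay_after=200):
    """Assumed exploration and learning-rate schedules over episodes."""
    eps = max(eps_min, eps_start * (eps_decay ** episode))      # exploration fades out over episodes
    alpha = alpha_start if episode < alpha_decay_after \
        else alpha_start * alpha_decay_after / episode          # learning rate vanishes for many episodes
    return eps, alpha
```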
In our test scenario we consider a continuous supply of
mass flow through the first and second station to the buffer
at the dosing unit. The goal is to learn an energy optimal
behavior of this part of the system while assuring a sufficient
supply of material for the dosing operation. Fig. 3 shows the
learning results for this scenario in terms of normalized energy
consumption. Note that due to the continuous operation, the
results are obtained from a moving average over the episodes
with a step size of 6 to account for varying conditions in the
buffers. As can be seen, the learning process first enters the
exploration phase with strongly varying energy consumptions.
As the learning continues, the energy efficiency in terms of the reward increases over time and results in performance comparable to strategies implemented by hand. As already stated, the vacuum pumps exhibit a specific dynamic behavior. From the energy-optimal operation point of view, the pumps need to remain switched on until the internal buffer is full and then be switched off. For the machines at hand, this means operation periods of 2-3 steps before the pump is switched off again. Too short operation times waste energy due to the evacuation time, while too long operation times waste energy due to the full buffer. Hence, a corresponding operation strategy has to be learned by the RL-algorithm, which is illustrated in Fig. 4. As can be seen, during learning the duration of the vacuum pump operation enters a region of 2-3 operation steps and hence the RL is able to learn the optimal operation of the vacuum pumps.
Fig. 3: Normalized energy efficiency (reward) over learning
episodes.
Fig. 4: Operation steps of the vacuum pump over learning
episodes.
V. CONCLUSIONS
In this paper we presented a novel approach for energy opti-
mization in large scale distributed production environments.
The approach is based on a formulation of the optimization problem in the form of a reinforcement learning problem.
Hence, the energy consumption of the production process is
minimized via learning from subsequent operation sequences
while maintaining the production quality represented by some
quality functions. The approach is applied to a laboratory
testbed with very promising results. In future research, the
developed RL-approach can be generalized to also account for continuous state and action spaces to better represent the continuous behavior of the production process. Furthermore, as the current approach requires a centralized learning process, a more distributed approach to the RL-problem, potentially incorporating ideas from game theory, will be a topic for future research.
REFERENCES
[1] S. Karnouskos and A. Walter Colombo and J. L. Martinez Lastra and C.
Popescu, “Towards the energy efficient future factory,” 2009 7th IEEE
International Conference on Industrial Informatics, 2009
[2] A. Cannata and S. Karnouskos and M. Taisch, “Energy efficiency driven
process analysis and optimization in discrete manufacturing,” 2009 35th
Annual Conference of IEEE Industrial Electronics, 2009
[3] J. R. Duflou and J. W. Sutherland and D. Dornfeld and C. Herrmann and
J. Jeswiet and S. Kara and M. Hauschild and K. Kellens, “Towards energy
and resource efficient manufacturing: A processes and systems approach,”
CIRP Annals - Manufacturing Technology, 61(2), pp. 587-609, 2012
[4] K. Bunse and M. Vodicka and P. Schönsleben and M. Brülhart and
F. O. Ernst, “Integrating energy efficiency performance in production
management – gap analysis between industrial needs and scientific
literature,” Journal of Cleaner Production, 19(6-7), pp. 667-679, 2011
[5] E. Oh and S.-Y. Son, “Toward dynamic energy management for green
manufacturing systems,” IEEE Communications Magazine, 54(10), pp.
74-79, 2016
[6] K. Tanaka, “Review of policies and measures for energy efficiency in
industry sector,” Energy Policy, 39(10), pp. 6532-6550, 2011
[7] M. Choobineh and S. Mohagheghi, “Optimal Energy Management in
an Industrial Plant Using On-Site Generation and Demand Scheduling,”
IEEE Transactions on Industry Applications, 52(3), pp. 1945-1952, 2016
[8] Y.-C. Li and S. H. Hong, “Real-Time Demand Bidding for Energy
Management in Discrete Manufacturing Facilities,” IEEE Transactions
on Industrial Electronics, 64(1), pp. 739-749, 2017
[9] A. Florea and J. A. Garcia Izaguirre Montemayor and C. Postelnicu and
J. L. Martinez Lastra, “A cross-layer approach to energy management
in manufacturing,” IEEE 10th International Conference on Industrial
Informatics, 2012
[10] C. K. Pang and C. V. Le, “Optimization of Total Energy Consumption
in Flexible Manufacturing Systems Using Weighted P-Timed Petri Nets
and Dynamic Programming,” IEEE Transactions on Automation Science
and Engineering, 11(4), pp. 1083-1096, 2014
[11] C. V. Le and C. K. Pang, “Robust Total Energy Optimization of Flexible
Manufacturing Systems Based on Renyi Mean-Entropy Criterion,” IEEE
Transactions on Automation Science and Engineering, 13(1), pp. 355-367,
2016
[12] Z. Sun and D. Wei and L. Wang and L. Li, “Data driven production run-
time energy control of manufacturing systems,” 2015 IEEE International
Conference on Automation Science and Engineering (CASE), 2015
[13] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduc-
tion,” Cambridge, MA, USA: MIT Press, 1998.
[14] E. Kuznetsova and Y.-F. Li and C. Ruiz and E. Zio and G. Ault and
K. Bell, “Reinforcement learning for microgrid energy management,”
Energy, 59, pp. 133-146, 2013
[15] L. Raju and S. Sankar and R.S. Milton, “Distributed Optimization of
Solar Micro-grid Using Multi Agent Reinforcement Learning,” Procedia
Computer Science, 46, pp. 231-239, 2015
[16] C. Liu and Y. L. Murphey, “Power management for Plug-in Hybrid
Electric Vehicles using Reinforcement Learning with trip information,”
2014 IEEE Transportation Electrification Conference and Expo (ITEC),
2014
[17] T. Liu and Y. Zou and D. Liu and F. Sun, “Reinforcement Learn-
ing–Based Energy Management Strategy for a Hybrid Electric Tracked
Vehicle,” Energies, 8(7), pp. 7243-7260, 2015
[18] R.M. Kretchmar and C.W. Anderson, “Comparison of CMACs and
radial basis functions for local function approximators in reinforcement
learning,” Proceedings of International Conference on Neural Networks,
1997