29th International Conference on Flexible Automation and Intelligent Manufacturing
(FAIM2019), June 24-28, 2019, Limerick, Ireland.
Transferring Human Manipulation Knowledge to Industrial Robots
Using Reinforcement Learning
N. Arana-Arexolaleibaa,*, N. Urrestilla-Anguiozarb, D. Chrysostomoub, S. Bøghb
aRobotics and Automation Research Group, Mondragon University, Spain
b Robotics & Automation Group, Dept. of Materials and Production, Aalborg University, Fibigerstræde 16, Aalborg Øst, DK-9220, Denmark
Abstract
Nowadays, in the context of Industry 4.0, manufacturing companies are faced with increasing global competition and challenges,
which require them to become more flexible and able to adapt quickly to rapid market changes. Advanced robot systems are an
enabler for achieving greater flexibility and adaptability; however, programming such systems also becomes increasingly complex.
Thus, new methods for programming robot systems and enabling self-learning capabilities to accommodate the natural variation
exhibited in real-world tasks are needed. In this paper, we propose a Reinforcement Learning (RL) enabled robot system, which
learns task trajectories from human workers. The presented work demonstrates that with minimal human effort, we can transfer
manual manipulation tasks in certain domains to a robot system without the requirement for a complicated hardware system model
or tedious and complex programming. Furthermore, the robot is able to build upon the concepts learned from the human expert and
improve its performance over time. Initially, Q-learning is applied, which has shown very promising results. Preliminary
experiments from a use case in slaughterhouses demonstrate the viability of the proposed approach. We conclude that the
feasibility and applicability of RL for industrial robots and industrial processes holds an unseen potential, especially for tasks
where natural variation is exhibited in either the product or the process.
Keywords: Reinforcement Learning; Q-learning; Robot Control; Self-Learning Capabilities
* Corresponding author. Tel.: +34 647504340. E-mail address: narana@mondragon.edu
1. Introduction
The current trend and challenge in the emerging era of Industry 4.0 is the absence of the human worker during the
digital transformation of the smart factories. This is especially true when it comes to introducing new robot systems.
Additionally, when building advanced robotic systems, the process knowledge possessed by the human worker is
often paramount for the efficiency of the process and the product’s quality. However, the transfer of this, often tacit,
knowledge to the robot remains a difficult problem.
Industrial robots are commonly programmed in an inflexible way. Setting up the motions for simple pick-and-place
operations often requires many adjustments and does not generalize well to new situations. Small variations in the
process setup will often result in task failure. The current state of the art in flexible robot programming entails either
using robot skill concepts [1,2] or kinesthetic teaching combined with robot skills [2]. However, these approaches
still lack the ability to take new, small variations in the process into account. The general problem of current
robot programming points to the question: can a robot more easily and naturally learn from the (near-) optimal
trajectories carried out by a human expert worker and be able to adapt to future small variations?
In this paper, an industrial robot setup with self-learning capabilities using Reinforcement Learning (RL) is
proposed in order to demonstrate how human workers can transfer their task behavior to the robot without any
programming requirements. The use case takes place in meat processing in slaughterhouses, which are often
characterized by heavy pick and place manipulations. The research is a three-step development consisting of:
1. Acquire human motions and behaviors through recordings in virtual reality (VR)
2. Train the RL model from the recorded trajectories
3. Evaluate the final policy on the real robot system
2. Related work
Reinforcement Learning has previously proved to be successful in different environments such as board games [3]
and computer games such as Atari [4,5]. The sensory input varies across different applications of RL, and in recent years
many demonstrations in game environments have used visual sensory input, such as raw pixel data, with a Deep
Convolutional Neural Network (CNN) as part of the function approximator [4].
Exploration in environments with a sparse reward has been a persistent problem in Reinforcement Learning, which
is often true in many robot applications as well [6]. Usually, the reward is triggered at the very end of an episode when
the agent reaches its goal. This means that most of the experiences do not yield any reward, which makes it challenging
and time-consuming to train the robot agent.
There are studies where training data is obtained from a Virtual Reality (VR) system simulating the real
environment [7]. This can in many cases help accelerate the training due to the availability of more data in the
simulation environment vs. the real-world environment.
In other work such as [8], an RL agent learns how to control a robot to complete a wire loop game
(https://en.wikipedia.org/wiki/Wire_loop_game). Our presented
work builds on top of some of the same ideas and takes the human-factor and training data into account, by recording
trajectories from human workers in VR, training the RL agent, and finally evaluating the optimal policy on a real robot
system.
2.1. Background
Reinforcement Learning
Reinforcement Learning (RL) [9] is a category of machine learning where the actions an agent must take are based on
a reward function, and the agent must explore an environment in order to learn how to behave and maximize its
performance. The main difference from other machine learning techniques, such as unsupervised and supervised
learning, is that no externally labelled data is introduced, and the agent is responsible for exploring the unknown
environment.
As shown in Fig. 1, the agent learns from direct interaction with its environment. It selects an action $a_t$, e.g. move
forward, and this action affects the environment so that a state change takes place. After the action is taken, the agent
observes the new state $s_{t+1}$ and the reward $r_{t+1}$ from the environment. The goal of the learning process is to find the
optimal policy, which maximizes the cumulative reward $R_t$ as seen in Eq. 1:

$R_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T-t} r_T$  (1)

where $\gamma \in (0, 1)$ is the discount factor.
While learning, the agent needs to explore the environment e.g. by taking random actions, but also exploit the
partially learned policy to move to good states that yield a high reward. One of the challenges is to keep a balance
between exploration and exploitation. A common choice is an ϵ-greedy policy, which defines the
exploration-exploitation rate and how it changes over time, going from a highly exploratory behavior (e.g. ϵ = 0.9) to a
mainly exploitative one (e.g. ϵ = 0.1).
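As an illustration, ϵ-greedy action selection over a tabular Q-function can be sketched as follows; this is a minimal Python sketch, and the function and variable names are ours rather than taken from the paper's implementation:

```python
import random

def epsilon_greedy_action(q_table, state, actions, epsilon):
    """With probability epsilon take a random action (exploration);
    otherwise take the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Unseen (state, action) pairs default to a Q-value of 0.0.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```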
Q-learning
Q-learning is an off-policy RL method that enables the agent to select the optimal action, by following the optimal
policy for a given Markov Decision Process (MDP) [10]. A stochastic process has the Markov property if the
conditional probability distribution of the future states of the process only depends on the present state. A set of
actions, states and rewards can be described as an MDP episode $e_{MDP}$ as seen in Eq. 2:

$e_{MDP} = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1})$  (2)
Q-learning is based on the idea that for each state-action tuple there is an associated Q-value $Q(s_t, a_t)$, the maximum
expected return obtainable after taking action $a_t$ in state $s_t$. Assuming that the optimal policy is known, the best action
for each state, $\pi(s) = \arg\max_a Q(s, a)$, can be selected. As the policy is unknown, one needs to compute the $Q(s_t, a_t)$
values iteratively using the Bellman equation [10].
With the Bellman equation, defined in Eq. 3, the Q-value of the current state is updated by considering the reward $r_{t+1}$
received when transitioning from the current state to the next state $s_{t+1}$, and the maximum discounted Q-value of the
next state, $\gamma \max_a Q(s_{t+1}, a)$:

$Q(s_t, a_t) = r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$  (3)
In noisy environments, the robustness of the learning process can be increased by incorporating only a fraction of the
newly acquired knowledge, thereby attenuating the part that corresponds to noise. This is done by introducing the
learning rate $\alpha \in (0, 1)$ into the Bellman equation, as shown in Eq. 4. Low $\alpha$ values are commonly used in noisy
environments.

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$  (4)
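A direct tabular implementation of the update in Eq. 4 might look like the sketch below, using a Python dictionary as the Q-table and the paper's values α = 0.5 and γ = 0.9 as defaults; the data structure and function name are our assumptions:

```python
def q_learning_update(q_table, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Eq. 4: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q_sa = q_table.get((s, a), 0.0)
    max_q_next = max(q_table.get((s_next, a_next), 0.0) for a_next in actions)
    q_table[(s, a)] = q_sa + alpha * (r + gamma * max_q_next - q_sa)
```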
Fig. 1. The Reinforcement Learning environment.
3. Use case
Meat processing industry and slaughterhouses, in particular, are often identified as one of the harshest work
environments for humans. Handling of unique items, flexibility, and food safety are only some of the reasons why
human hands and minds are still required for competitive production lines in the food industry. Even though novel
automation solutions and robotic technologies have slowly been integrated into production lines, a native human-robot
collaborative environment is still out of reach. The natural variation of the products, the limited adaptability of the
machinery, and the current high cost of robot solutions lead to systems that usually underperform, are inflexible to
quick changes, and are unintuitive for the operators to use.
Today, one of the many repetitive tasks found in slaughterhouses concerns hanging heavy pieces of meat onto
hooks on what is commonly known in the industry as a Christmas tree, see Fig. 2 (left). A Christmas tree consists of
several hooks, typically 20, is standardized across slaughterhouses in Europe, and is used e.g. for transportation.
Currently, 1-2 workers handle large pieces of meat, which generally weigh up to 20 kilos each. The workers have to
lift and position the objects accurately onto the hooks on the Christmas tree. Such a sequence of movements creates
many challenges on the production line. First, the high number of heavy, repetitive lifts causes severe stress to the back
and arms of the workers. Consequently, they have to stop working every 45 minutes in order to regain their strength.
A Christmas tree has to be filled within a specific time frame, so the time allowed for detecting the correct pose of the
object and choosing the appropriate placement point on the hook is very short. Such a small error margin creates an
additional mental load on inexperienced workers, who often make mistakes that lead to delays in production and
reduced quality.
The vision is to automate the processing of those large pieces of meat as it involves heavy and repetitive lifting.
This way, the worker avoids such a stressful task (both physically and mentally). The first step is to have the robot
system learn how to handle the pieces of meat and transfer them from a conveyor or table onto a designated hook on
the Christmas tree. In the following section, the method and RL architecture applied in this research are explained.
4. Method
The RL architecture used in this research is a two-step training-testing algorithm described in [8]. This algorithm
is a modified version of Q-learning with ϵ-greedy exploration and considers a safe exploration space where the robot
can learn without any risk of colliding with the physical environment.
There are six discrete actions corresponding to the three Cartesian axes: (x-, x+), (y-, y+) and (z-, z+). The step value used
in each direction is 1 mm. To obtain a more accurate path, this value can be decreased, but the smaller the step value
is, the larger the Q-table becomes and the longer the learning process takes.
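For illustration, the six discrete actions can be encoded as signed 1 mm steps along the Cartesian axes; this is only a sketch, as the actual encoding used in the implementation is not given in the paper:

```python
STEP = 0.001  # 1 mm step per action, in metres

# Six discrete actions: a positive and a negative step along x, y and z.
ACTIONS = {
    0: (+STEP, 0.0, 0.0),  # x+
    1: (-STEP, 0.0, 0.0),  # x-
    2: (0.0, +STEP, 0.0),  # y+
    3: (0.0, -STEP, 0.0),  # y-
    4: (0.0, 0.0, +STEP),  # z+
    5: (0.0, 0.0, -STEP),  # z-
}

def apply_action(position, action_id):
    """Return the new end-effector Cartesian position after one discrete step."""
    dx, dy, dz = ACTIONS[action_id]
    x, y, z = position
    return (x + dx, y + dy, z + dz)
```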
Fig. 2. (a) Christmas tree commonly seen in slaughterhouses around Europe. The Christmas tree contains 20 hooks. (b) Meat is hung onto the
hooks for curing and transportation.
The state-space is defined by the current end-effector Cartesian position $EP = (x, y, z)$ and the final goal Cartesian
position $GP = (x_G, y_G, z_G)$. The tuple composed of those two positions, $S = (EP, GP)$, is transformed into a unique
identifier $id = f(S)$ and recorded in the Q-table. More information, e.g. rotations, could be included, but the Q-table then
easily becomes unmanageable. In this particular case, the Q-table can easily contain 154,000 states. It is well known that
the Q-learning approach suffers from large state-space dimensions. To overcome this problem, neural network-based
approaches, such as DQN, have been suggested [5].
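One simple way to realize such a unique identifier is to quantize both positions to the step resolution and combine them into a hashable key; the mapping f is not specified in the paper, so the sketch below is only one possible choice:

```python
def state_id(ep, gp, resolution=0.001):
    """Map the tuple S = (EP, GP) to a unique, hashable Q-table key by
    rounding both Cartesian positions to the step resolution (1 mm)."""
    def quantize(p):
        return tuple(int(round(c / resolution)) for c in p)
    return (quantize(ep), quantize(gp))
```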
The reward function is a linear combination of three weights and distances, as seen in Eq. 5. It considers the change in
Euclidean distance between the actual position EP and the goal GP (d1) and a reward for achieving the goal (d2).
Moreover, an additional reward term has been tested, where the third parameter (d3) considers the distance from EP to
the table. We hypothesize that trajectories that carry the meat by sliding the object across the table (hence minimizing
the overall payload) are ergonomically more suitable for the human. For this experiment, the weights of the reward
function have been adjusted iteratively. In the example shown in this paper, the following weights are used: w1 = 2.0,
w2 = 1.0 and w3 = 0.1.
$r = w_1 d_1 + w_2 d_2 + w_3 d_3$  (5)
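The reward of Eq. 5 could be computed roughly as in the sketch below; the exact interpretation of d1, d2 and d3, the goal tolerance and the table-plane model are our assumptions based on the description above, not the paper's implementation:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def reward(ep_prev, ep, gp, table_z, goal_tol=0.005, w1=2.0, w2=1.0, w3=0.1):
    """Sketch of Eq. 5, r = w1*d1 + w2*d2 + w3*d3.
    d1: reduction of the Euclidean distance to the goal GP in this step,
    d2: bonus when the goal is reached (within goal_tol),
    d3: negative height above the table plane, favouring trajectories that
        slide the object across the table."""
    d1 = euclidean(ep_prev, gp) - euclidean(ep, gp)
    d2 = 1.0 if euclidean(ep, gp) < goal_tol else 0.0
    d3 = -(ep[2] - table_z)
    return w1 * d1 + w2 * d2 + w3 * d3
```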
In the presented experiment, the following hyperparameters are applied: the discount factor was set to γ = 0.9 and the
learning rate to α = 0.5. The exploration rate ϵ was initialized to 0.9, with a minimum of ϵ = 0.1. Those values are
the outcome of several simulations. The distance between the initial and final position is around 0.9 m. The value of ϵ
decreases following ϵ ← ϵ − step (Algorithm 1). Boundary conditions are added for safety purposes, such that
the robot is not allowed to explore further than a predefined space.
5. System implementation
The system architecture used in this work is shown in Fig. 3. It contains three nodes: (i) VR data acquisition, (ii)
RL learning and (iii) robot control.
The first node (i) tracks and records the user movements by means of the VR system. It is composed of two satellites
and a controller. The controller trigger is used to define the starting and ending points of the path, allowing the user
to create as many paths as needed. Each path is afterwards transformed into the robot coordinate system.
Algorithm 1. Two-phase RL algorithm based on [8].
The second node (ii) receives the path and launches the RL algorithm described in the previous section. This node
learns the policy to track the human behavior. The position information is sent to the robot by means of a shared file or
a socket communication. The coding structure of the simulation environment (blue box) uses the same set of
functions (reset, step) and values (observation, reward and done) as the OpenAI-Gym framework [11]. The RL
node sends the reset and step calls (which are in charge of setting the robot to the initial position and sending
the actions to take) to the simulation environment and receives variables that provide information about the agent,
such as the current state, the reward obtained, and whether the goal has been reached. The third node (iii) controls the robot.
The MoveIt planner included in the ROS distribution is in charge of receiving the Cartesian points and applying
inverse kinematics to translate them into joint movements.
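The simulation-environment block can thus be pictured as a minimal OpenAI-Gym-style wrapper exposing reset and step. The sketch below, which re-uses the apply_action and reward helpers sketched in Section 4, only illustrates the interface the paper describes; the actual Gazebo and robot communication is abstracted away and the class name is ours:

```python
import math

class RobotTrajectoryEnv:
    """Minimal Gym-style interface (reset/step) around the simulated robot."""

    def __init__(self, start_position, goal_position, goal_tol=0.005):
        self.start = start_position
        self.goal = goal_position
        self.goal_tol = goal_tol
        self.position = start_position

    def reset(self):
        # Set the robot back to the initial position TP and return the state.
        self.position = self.start
        return (self.position, self.goal)

    def step(self, action_id):
        # Apply one discrete Cartesian action; return observation, reward, done, info.
        prev = self.position
        self.position = apply_action(self.position, action_id)
        r = reward(prev, self.position, self.goal, table_z=self.start[2])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(self.position, self.goal)))
        done = dist < self.goal_tol
        return (self.position, self.goal), r, done, {}
```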
Table 1 lists the primary hardware elements utilized in the experiments. The standard deviation of the static pose
obtained by the VR system at 1-2 m distance is 0.417 mm [12], with a sampling rate of 60 Hz. The software
dependencies used are also indicated; the computer runs Ubuntu 16.04 LTS with ROS Kinetic installed.
Table 1: Hardware and software overview.
Robot manipulator            Computer                Software
UR5                          Quad Core i7            Ubuntu 16.04
Payload: 5 kg                16 GB RAM               ROS Kinetic
Working radius of 850 mm     Nvidia RTX 2048 8 GB    Python 2.7
Repeatability: +/-0.1 mm                             Gazebo 7
6. Experimental results
6.1. Training
The robot learns within a zone around the initial position TP. The learning zone is a 3D space with a radius defined in
the hyperparameters. In Fig. 4(a), a trajectory (in green) obtained by the VR system is shown in the robot's end-effector
coordinate system. Starting at the initial point (in black), the robot tries to learn how to follow the trajectory. The robot
observes the VR trajectory from what is referred to as a safe TCP coordinate system. A TCP is considered to be safe
when the robot has learned to move forward to a new position belonging to the VR trajectory. In Fig. 4(b), the
trajectory is shown from the initial TCP position and from a second TCP position after moving forward. Finally, in
Fig. 4(c), all the intermediate safe positions until the final goal GP are shown.
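Our reading of this training procedure is summarized in the hedged sketch below: Q-learning episodes are run inside a bounded zone around the current safe TCP, and once the next VR waypoint has been learned it becomes the new safe TCP. The episode budget, step limit, ϵ decay step and the helper names (from the earlier sketches) are assumptions on our side and are not taken from Algorithm 1:

```python
def train_along_trajectory(env_factory, vr_waypoints, q_table, actions,
                           episodes_per_zone=200, max_steps=500,
                           epsilon=0.9, epsilon_min=0.1, epsilon_decay=0.01):
    """Advance along the recorded VR trajectory one safe TCP at a time."""
    safe_tcp = vr_waypoints[0]
    for next_waypoint in vr_waypoints[1:]:
        # Learn to reach the next waypoint inside a bounded zone around safe_tcp.
        env = env_factory(start_position=safe_tcp, goal_position=next_waypoint)
        eps = epsilon
        for _ in range(episodes_per_zone):
            state = env.reset()
            for _ in range(max_steps):
                a = epsilon_greedy_action(q_table, state, actions, eps)
                next_state, r, done, _ = env.step(a)
                q_learning_update(q_table, state, a, r, next_state, actions)
                state = next_state
                if done:
                    break
            eps = max(epsilon_min, eps - epsilon_decay)
        safe_tcp = next_waypoint  # the learned waypoint becomes the new safe TCP
    return q_table
```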
Fig. 3. Software and hardware architecture. Left: (i) Acquisition of VR data. Middle: (ii) Reinforcement Learning block. Right: (iii) Robot control
either for Gazebo or the real robot.
6.2. Evaluation
The work is tested in the experimental scenario shown in Fig. 5. In this scenario, an object is picked up from a table
and hung on a hook on the Christmas tree, while the VR system tracks the human behavior. Since real meat is presently
not considered in the process, a lightweight soft object was used instead.
The VR system tracks the human movement with high precision. It was observed that even the trajectory of the
hooking action is clearly tracked. However, it was recognized that the VR system does not behave correctly if any
shiny surfaces are present in the scene, which may introduce tracking errors. The VR controller used is moved by hand.
This is not practical for real scenarios, but other wearable trackers can be used instead in order to track the human
worker while manipulating real meat in the slaughterhouse.
After the learning process using the VR information, the robot can go from the initial position TP to the goal GP.
The learning time changes significantly, from 86.39 minutes for a 1 mm robot step and a 50 mm learning-zone radius to
1.15 minutes for a 5 mm robot step and the same 50 mm radius. In order to simplify the problem, picking and placing
actions are not considered in this work.
Fig. 4. (a) First learning step (first sub-trajectory). (b) Trajectory seen from the initial position and the second position in the training sequence.
(c) All learning points across the trajectory.
Fig. 5. (a) VR system sensing the working environment. (b) Initial coordinate TP (table) and goal coordinate GP (Christmas tree hook). (c) Robot
moving towards the final position GP along the trained trajectory.
7. Discussion
It was observed that a simple task such as placing a piece of meat on a hook is composed of several skills: (i)
picking up a piece of meat, (ii) its displacement, and finally, (iii) the placement of the meat on a hook. The present
work focuses on the second skill, i.e. the displacement of a piece of meat. By use of Reinforcement Learning, the robot
was enabled to imitate the trajectory made by the human, captured by a VR system, through a trial-and-error strategy.
Training time varied and depended strongly on the search radius and the differential movement distance of the robot.
In our tests, the training time was between a few minutes and an hour. The reason is that the number of states generated
in each configuration varies significantly, ranging from approximately 30,000 to 154,000.
Regardless, this should be considered a fast learning process compared with other alternatives that use a Gazebo
simulator or real robots, where the robot is able to perform only 2-4 movements per second. The Q-learning technique
is characterized by its simplicity, but also by its limitations when the number of variables that define the state-space
of the system increases. Currently, we are working with more recent developments in the Reinforcement Learning
domain, such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), given that these
techniques allow for increasing the number of input features and actions.
8. Conclusion
In this paper, a concept for an industrial self-learning robot agent was presented, which strives for an optimal policy
when moving heavy non-rigid objects in the food industry by use of Reinforcement Learning and Q-learning. The
work aims to study the use of machine learning techniques to facilitate the programming of robots by non-experts in
real manufacturing use case scenarios. The selected use case consists of transferring manipulation knowledge from a
human worker to the robot, i.e. the behavior of moving and hooking meat onto hooks in a slaughterhouse. The presented
method enabled the agent to learn to control a six-axis industrial robot purely from recordings of human behavior. The
training time achieved in this work is practical for industrial use, considering that the learning is performed without
the need for human supervision.
The presented approach can be generalized to other use cases that share similarities in movement and execution,
e.g. painting, welding, and polishing. This work also has the potential to extend beyond the manufacturing industry
into e.g. health care. For all these kinds of use cases, time is the crucial factor when programming advanced robot
systems. With stronger techniques and innovative applications of Reinforcement Learning, this work is a first step
towards new, previously unseen possibilities in real-world robotics, for the benefit of both companies and human workers.
Additional future work includes learning the picking and hooking skills. We consider that force (payload) information
should also be part of the state representation, as it can drastically affect the system dynamics.
Acknowledgements
This work was partially supported by the Danish national funded research project MADE Digital, the Basque
Government Mobility program (MV_2018_1_0018) and the Malgurob (KK-2018/00114) research project.
References
[1] Pedersen MR, Nalpantidis L, Andersen RS, Schou C, Bøgh S, Krüger V, et al. Robot skills for manufacturing: From concept to industrial
deployment. Robot Comput Integr Manuf 2016;37:282–91. doi:10.1016/j.rcim.2015.04.002.
[2] Schou C, Andersen RS, Chrysostomou D, Bøgh S, Madsen O. Skill-based instruction of collaborative robots in industrial settings. Robot
Comput Integr Manuf 2018. doi:10.1016/j.rcim.2018.03.008.
[3] Tesauro G. TD-Gammon: A Self-Teaching Backgammon Program. Appl. Neural Networks, Boston, MA: Springer US; 1995, p. 267–85.
doi:10.1007/978-1-4757-2379-3_11.
[4] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning.
Nature 2015;518:529–33. doi:10.1038/nature14236.
[5] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. Thirtieth AAAI Conf. Artif. Intell., 2016.
[6] Nair A, McGrew B, Andrychowicz M, Zaremba W, Abbeel P. Overcoming exploration in reinforcement learning with demonstrations. 2018
IEEE Int. Conf. Robot. Autom., 2018, p. 6292–9.
[7] Dyrstad JS, Mathiassen JR. Grasping virtual fish: A step towards robotic deep learning from demonstration in virtual reality. 2017 IEEE Int.
Conf. Robot. Biomimetics, 2017, p. 1181–7.
[8] Meyes R, Tercan H, Roggendorf S, Thiele T, Büscher C, Obdenbusch M, et al. Motion planning for industrial robots using reinforcement
learning. Procedia CIRP 2017;63:107–12.
[9] Wiering M, van Otterlo M. Reinforcement Learning: State-of-the-Art. vol. 12. Springer Science & Business Media; 2012.
[10] Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT Press; 2018.
[11] Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, et al. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[12] Borges M, Symington A, Coltin B, Smith T, Ventura R. HTC Vive: Analysis and Accuracy Improvement. 2018 IEEE/RSJ Int. Conf. Intell.
Robot. Syst., 2018, p. 2610–5.