Conference PaperPDF Available

Transferring Human Manipulation Knowledge to Industrial Robots Using Reinforcement Learning

June 2019

June 2019

Conference: 29th International Conference on Flexible Automation and Intelligent Manufacturing (FAIM2019)
At: Limerick, Ireland

Authors:

Nestor Arana-Arexolaleiba

Mondragon Unibertsitatea

Simon Bøgh

Aalborg University

Dimitrios Chrysostomou

Aalborg University

Nowadays in the context of Industry 4.0, manufacturing companies are faced by increasing global competition and challenges, which requires them to become more flexible and be able to adapt fast to rapid market changes. Advanced robot system is an enabler for achieving greater flexibility and adaptability, however, programming such systems also become increasingly more complex. Thus, new methods for programming robot system and enabling self-learning capabilities to accommodate the natural variation exhibited in real-world tasks are needed. In this paper, we propose a Reinforcement Learning (RL) enabled robot system, which learns task trajectories from human workers. The presented work demonstrates that with minimal human effort, we can transfer manual manipulation tasks in certain domains to a robot system without the requirement for a complicated hardware system model or tedious and complex programming. Furthermore, the robot is able to build upon the learned concepts from the human expert and improve its performance over time. Initially, Q-learning is applied, which has shown very promising results. Preliminary experiments, from a use case in slaughterhouses, demonstrate the viability of the proposed approach. We conclude that the feasibility and applicability of RL for industrial robots and industrial processes, holds and unseen potential, especially for tasks where natural variation is exhibited in either the product or process.

The Reinforcement Learning environment.

…

(a) Christmas tree commonly seen in slaughter houses around Europe. The Christmas tree contains 20 hooks. (b) Meat is hanged onto the hooks for curing and transportation.

…

(a) First learning step (first sub trajectory). (b) Trajectory seen from initial position and second position in the training sequence. (c) All learning points across the trajectory.

…

Hardware and software overview.

…

Figures - uploaded by Simon Bøgh

Content may be subject to copyright.

Content uploaded by Simon Bøgh

Content may be subject to copyright.

29th International Conference on Flexible Automation and Intelligent Manufacturing

(FAIM2019), June 24-28, 2019, Limerick, Ireland.

Transferring Human Manipulation Knowledge to Industrial Robots

Using Reinforcement Learning

N. Arana-Arexolaleibaa,

, N. Urrestilla-Anguiozarb, D. Chrysostomoub, S. Bøghb

aRobotics and Automation Research Group, Mondragon University, Spain

b Robotics & Automation Group, Dept. of Materials and Production, Aalborg University, Fibigerstræde 16, Aalborg Øst, DK-9220, Denmark

Abstract

Nowadays in the context of Industry 4.0, manufacturing companies are faced by increasing global competition and challenges,

which requires them to become more flexible and able to adapt fast to rapid market changes. Advanced robot system is an enabler

for achieving greater flexibility and adaptability, however, programming such systems also become increasingly more complex.

Thus, new methods for programming robot systems and enabling self-learning capabilities to accommodate the natural variation

exhibited in real-world tasks are needed. In this paper, we propose a Reinforcement Learning (RL) enabled robot system, which

learns task trajectories from human workers. The presented work demonstrates that with minimal human effort, we can transfer

manual manipulation tasks in certain domains to a robot system without the requirement for a complicated hardware system model

or tedious and complex programming. Furthermore, the robot is able to build upon the learned concepts from the human expert and

improve its performance over time. Initially, Q-learning is applied, which has shown very promising results. Preliminary

experiments, from a use case in slaughterhouses, demonstrate the viability of the proposed approach. We conclude that the

feasibility and applicability of RL for industrial robots and industrial processes, holds and unseen potential, especially for tasks

where natural variation is exhibited in either the product or process.

Keywords: Reinforcement Learning; Q-learning; Robot Control; Self-Learning Capabilities

Corresponding author. Tel.: +34 647504340

E-mail address: narana@mondragon.edu

1. Introduction

The current trend and challenge in the emerging era of Industry 4.0 is the absence of the human worker during the

digital transformation of the smart factories. This is especially true when it comes to introducing new robot systems.

Additionally, when building advanced robotic systems, the process knowledge possessed by the human worker is

often paramount for the efficiency of the process and the product’s quality. However, the transfer of this, often tacit,

knowledge to the robot remains a difficult problem.

Industrial robots are commonly programmed in an inflexible way. Setting up the motions for simple pick-and-place

operations often require many adjustments and does not generalize well to new situations. Small variations in the

process setup will often result in task failure. The current state-of-the-art in flexible robot programming either entails

using robot skills concepts [1,2] or kinesthetic teaching combined with robot skills [2]. However, these approaches

still suffer from the ability to take new and small variations in the process into account. The general problem of current

robot programming points to the question: can a robot more easily and naturally learn from the (near-) optimal

trajectories carried out by a human expert worker and be able to adapt to future small variations?

In this paper, an industrial robot setup with self-learning capabilities using Reinforcement Learning (RL) is

proposed in order to demonstrate how human workers can transfer their task behavior to the robot without any

programming requirements. The use case takes place in meat processing in slaughterhouses, which are often

characterized by heavy pick and place manipulations. The research is a three-step development consisting of:

1. Acquire human motions and behaviors through recordings in virtual reality (VR)

2. Train the RL model from the recorded trajectories

3. Evaluate the final policy on the real robot system

2. Related work

Reinforcement Learning has previously proved to be successful in different environments such as board games [3]

and computer games such as Atari [4,5]. The sensory input varies in different applications of RL, and in recent years

many demonstrations in game environments have applied visual sensory input such as raw pixel data applying a Deep

Convolutional Neural Network (CNN) as part of the function approximator [4].

Exploration in environments with a sparse reward has been a persistent problem in Reinforcement Learning, which

is often true in many robot applications as well [6]. Usually, the reward is triggered at the very end of an episode when

the agent reaches its goal. This means that most of the experiences do not yield any reward, which makes it challenging

and time-consuming to train the robot agent.

There are studies where training data is obtained from a Virtual Reality (VR) system simulating the real

environment [7]. This can in many cases help accelerate the training due to the availability of more data in the

simulation environment vs. the real-world environment.

In other work such as [8], an RL agent learns how to control a robot to complete a wire loop game

†

. Our presented

work builds on top of some of the same ideas and takes the human-factor and training data into account, by recording

trajectories from human workers in VR, training the RL agent, and finally evaluating the optimal policy on a real robot

system.

2.1. Background

Reinforcement Learning

Reinforcement Learning (RL) [9] is a category of machine learning where the actions an agent must take are based on

a reward function, and the agent must explore an environment in order to learn how to behave and maximize the

performance. The main difference with other machine learning techniques, such as unsupervised and supervised

†

https://en.wikipedia.org/wiki/Wire_loop_game

learning, is that no external labelled data is introduced, and the agent is responsible for exploring the unknown

environment.

As shown in Fig. 1, the agent learns from a direct interaction with its environment. It selects an action at, e.g. move

forward, and this action affects the environment and a state change takes place. From the environment, the agent

observes the new state st+1, and the reward rt+1, after the action is taken. The goal of the learning process is to find the

optimal policy, which maximizes the cumulative reward Rt as seen in Eq. 1:

!"# $"% &'$"(' % ) % &*+" $

(1)

where γ in (0,1) is the discount factor.

While learning, the agent needs to explore the environment e.g. by taking random actions, but also exploit the

partially learned policy to move to good states that yield a high reward. One of the challenges is to keep a balance

between exploration and exploitation. A common policy is an ϵ-greedy policy. The ϵ-greedy policy defines the

exploration-exploitation rate and how it changes over time, going from a high exploration (ϵ = 0.9) behavior to an

exploitation one (ϵ = 0.1).

Q-learning

Q-learning is an off-policy RL method that enables the agent to select the optimal action, by following the optimal

policy for a given Markov Decision Process (MDP) [10]. A stochastic process has the Markov property if the

conditional probability distribution of the future states of the process only depends on the present state. A set of

actions, states and rewards can be described as an MDP episode with eMDP as seen in Eq. 2:

,-./ #

123 4253 $23 163 463 $63 7 3 1*+63 4*+63 $*+65

8 (2)

Q-learning is based on the idea that for each state-action tuple there is an associated Q-value:

9:;3<;#=4>

$"(6

Assuming that the optimal policy π(s) is known, the best action for each state.

#4$D=4><9

13 4

can be

selected. As the policy is unknown, one need to compute the

9:;3<;

values iteratively using the Bellman equation [10].

In each of the states, the agent is located with the Bellman equation, defined in Eq. 3, the Q-value of the current state

is retrieved, considering the reward rt+1 received from transitioning from the current state to the next state st+1, and the

max discounted Q-value of the next state

&=4><9:;EF3<;EF

9:;3<;# $"(6 % 5&=4><9:;EF3<;EF

(3)

In noisy environments, the robustness of the learning process can be increased by considering a part of the new

knowledge, avoiding the portion corresponding to the noise. This is done by introducing the learning rate α in the

Bellman equation, Eq. 4, where α (0,1). Low α values are commonly used in noisy environments.

9:;3<;# 9:;3<;% G

$"(6 % 5&=4><9:;EF3<;EF I 9:;3<;

J(4)

Fig. 1. The Reinforcement Learning environment.

3. Use case

Meat processing industry and slaughterhouses, in particular, are often identified as one of the harshest work

environments for humans. Handling of unique items, flexibility, and food safety are only some of the reasons why

human hands and minds are still required for competitive production lines in the food industry. Even though novel

automation solutions and robotic technologies have slowly been integrated into production lines, a native human-robot

collaborative environment is still non-applicable. The natural variation of the products, the machines' inability to be

altered and the current high cost of the robot solutions lead to solutions, which usually underperform, are inflexible to

quick changes, and unintuitive to use by the operators.

Today, one of the many repetitive tasks found in slaughterhouses, concern hanging heavy pieces of meat onto

hooks on what is commonly known in the industry as a Christmas tree, see Fig. 2 left. A Christmas tree consists of

several hooks e.g. 20 hooks. Christmas trees are e.g. used for transportation and are standardized for slaughterhouses

around Europe. Currently, 1-2 workers handle large pieces of meat, which generally weigh up to 20 kilos each. The

workers have to lift and position the objects accurately onto the hooks on the Christmas tree. Such a sequence of

movements creates many challenges on the production line. Initially, the high number of heavy, repetitive lifts cause

severe stress to the back and arms of the workers. Consequently, they have to stop working every 45 min. in order to

regain their strength. A Christmas tree has to be filled within a specific time frame, so the allowed time for detecting

the correct pose of the object and choosing the appropriate placement point at the hook is significantly small. Such a

small error margin creates an additional mental load to inexperienced workers who often make mistakes that lead to

delays in the production and reduced quality.

The vision is to automate the processing of those large pieces of meat as it involves heavy and repetitive lifting.

This way, the worker avoids such a stressful task (both physically and mentally). The first step is to have the robot

system learn how to handle the pieces of meat and transfer them from a conveyor or table onto a designated hook on

the Christmas tree. In the following section, the method and RL architecture applied in this research are explained.

4. Method

The RL architecture used in this research is a two-step training-testing algorithm described in [8]. This algorithm

is a modified version of the Q-learning ϵ-greedy exploration and considers a safe exploration space where the robot

can learn without any risk of colliding with the physical environment.

There are six discrete actions corresponding to three Cartesian axes (x-, x+), (y-, y+) and (z-, z+). The step value used

in each direction is 1mm. To obtain a more accurate path this value can be decreased, but the smaller the step value

is, the larger the Q-table becomes, and the time needed for the learning process.

The state-space is defined by the current end-effector Cartesian position

K/# >3 L3 M

and the final goal Cartesian

position

N/# >O3 LO3 MO

. The tuple composed by those two positions

P # BK/3 N/C

is transformed in a unique identifier

QR # SB1C

and recorded in the Q-table. More information can be included, e.g. rotations, but the Q-table easily become

Fig. 2. (a) Christmas tree commonly seen in slaughter houses around

Europe. The Christmas tree contains 20 hooks. (b) Meat is hanged onto the

hooks for curing and transportation.

unmanageable. In this particular case, the Q-table can easily contain 154,000 states. It is well-known that the Q-

learning approach suffers from large state-space dimensions. To overcome this problem, neural network-based

approaches, such as DQN, are suggested [5].

The reward function is a linear combination of three weights and distances as seen in Eq. 5. It considers the

Euclidean distance change between the actual position EP and the goal GP (d1) and a reward r for achieving the goal

(d2). Moreover, an additional reward function has been tested, where the third parameter (d3) considers the distance

from EP to the table. We hypothesize that those trajectories that carry the meat by sliding the objects across the table

(hence minimizing overall payload), are ergonomically more adequate for the human. For this experiment, the weights

of the reward function have iteratively been adjusted. In the example shown in this paper, the following weights are

used: w1=2, w2=1.0 and w3=0.1.

$ # T6R6%5T'R'%5 TURU

(5)

In the presented experiment, the following hyperparameters are applied; The discount factor gamma was set to γ =

0.9 and the learning rate α = 0.5. The ϵ-epsilon was initialized to ϵ = 0.9 and the minimum ϵ = 0.1. Those values are

the outcome of several simulations. The distance between the initial and final position is around 0.9m. The ϵ decreases

following the next equation

V # V I WXYZ

(Algorithm 1). Boundary conditions for safety purpose are added where

the robot is not allowed to explore further than a predefined space.

5. System implementation

The system architecture used in this work is shown in Fig. 3. It contains three nodes: (i) VR data acquisition, (ii)

RL learning and (iii) robot control.

The first node (i) tracks and records the user movements by means of the VR system. It is composed of two satellites

and a controller. The controller trigger is used to define the starting and ending points of the path, allowing the user

to create as many paths as needed. Each path is afterwards transformed into the robot coordinate system.

Algorithm 1. Two-phase RL algorithm based on [8].

…

The second node (ii) receives the path and launches the RL algorithm described in the previous section. This node

learns the policy to track the human behavior. The position information is sent by means of a shared file or a socket

communication to the robot. The coding structure of the simulation environment (blue box) uses the same set of

functions (reset, step) and values observation, reward and done as does the OpenAI-Gym framework [11]. The RL

node sends the reset and step functions (which are in charge of setting up the robot to the initial position and sending

the actions to take) to the simulation environment and receive some variables that provide information about the agent

such as; the current state, the reward obtained, and if the goal is reached or not. The third node (iii) controls the robot.

The MoveIt planner included in ROS distribution is in charge of receiving the Cartesian points and applying the

inverse kinematics to translate them into joint movements.

Table 1 lists the primary hardware elements utilized in the experiments. The standard deviation for static pose value

obtained by the VR system at 1-2 m distance is 0.417 mm [12] and a sampling speed of 60 Hz. The software

dependencies used are also indicated, where the computer system is in Ubuntu 16.04 LTS with ROS Kinetic installed.

Table 1: Hardware and software overview.

Robot manipulator

VR-system

Computer

Software

UR5

HTC VIVE

Quad Core i7

Ubuntu 16.04

Payload: 5 kg

2 trackers

16 GB RAM

ROS Kinetic

Working radius of 850 mm

σposition: 0.417 mm

Nvidia RTX 2048 8 GB

Python 2.7

Repeatability +/-0.1 mm

Sampling speed: 60 Hz

Gazebo 7

6. Experimental results

6.1. Training

The robot is learning around a zone starting from the initial position Tp. The learning zone is a 3D space with a

radius size defined in the hyperparameters. In Fig. 4(a), a trajectory (in green) is obtained by the VR system in the

robot’s end-effector coordinate system. At the initial point (in black), the robot tries to learn how to go through the

trajectory. The robot observes the VR trajectory from what is referred to as a safe TCP coordinate system. A TCP is

considered to be safe when the robot learns to move forward to a new position belonging to the VR trajectory. In Fig.

4(b) the trajectory from the initial TCP position and from a second TCP position is shown after moving forward.

Finally, in Fig. 4(c), all the intermediate safe positions until the final goal GP are shown.

Fig. 3. Software and hardware architecture. Left: (i) Acquisition of VR data. Middle: (ii) Reinforcement Learning block. Right: (iii) Robot control

either for Gazebo or the real robot.

6.2. Evaluation

The work is tested in an experimental scenario shown in Fig. 5. In this scenario, an object is picked up from a table

and hanged on the hook on the Christmas tree, while the VR system is tracking the human behavior. Since the meat

on the process is presently not considered, a lightweight soft object was used.

The VR system tracks the human movement with high precision. It was observed that even the trajectory of the

hooking action is clearly tracked. However, it was recognized that the VR system does not behave correctly if any

shiny surfaces are present in the scene, which may impose tracking errors. The VR controller used is moved by hand.

This is not practical for real scenarios, but other wearable trackers can be used instead in order to track the human

worker while manipulating real meat in the slaughterhouse.

After the learning process using the VR information, the robot can go from the initial position TP to the goal GP.

The learning time can significantly change from 86.39 minutes for a 1 mm robot step and 50 mm radius size of the

learning zone to 1.15 minutes for a 5 mm robot step and 50 mm radius size of the learning zone. In order to simplify

the problem, picking and placing action are not considered in this work.

7. Discussion

It was observed that a simple task such as placing a piece of meat on a hook is composed of several skills: (i)

picking up a piece of meat, (ii) its displacement, and finally, (iii) the placement of the meat on a hook. The present

work takes its outset from the second skill, i.e. the displacement of a piece of meat. By use of Reinforcement Learning,

the robot was enabled to imitate the trajectory made by the human, captured by a VR system, through a trial and error

strategy. Training time varied and was very dependent on the radius of the search and differential movements distance

of the robot. In our tests, the training time was between a few minutes to an hour. The reason behind this is due to the

Fig. 4. (a) First learning step (first sub trajectory). (b) Trajectory seen from initial position and second position in the training sequence. (c) All

learning points across the trajectory.

(a)

(c)

(b)

Fig. 5. (a) VR-system sensing the working environment. (b) Initial coordinate TP (table) and goal coordinate GP (Christmas tree hook). (c) Robot

moving towards the final position GP along the trained trajectory.

HOOKING

MOVEMENT

PICKING

Initial State

2nd Safe State

number of states generated in each configuration varies significantly, ranging from approximately 30,000 to 154,000.

Regardless, this should be considered a fast learning process compared with other alternatives that use a Gazebo

simulator or real robots, where the robot is able to perform only 2-4 movements per second. The Q-learning technique

is characterized by its simplicity, but also by its limitation when it comes to increasing the number of variables, which

defines the state-space of the system. Currently, we are working with strategies within more recent developments in

the Reinforcement Learning domain such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient

(DDPG), given that these techniques allow for increasing the number of input features and actions.

8. Conclusion

In this paper, a concept for an industrial self-learning robot agent was presented to strive for an optimal policy

when moving heavy non-rigid objects in the food industry by use of Reinforcement Learning and Q-learning. The

work aims to study the use of machine learning techniques to facilitate the programming of robots by non-experts in

real manufacturing use case scenarios. The use case selected consists of transferring manipulation knowledge from a

human worker to the robot i.e. the behavior of moving and hooking meat onto hooks in a slaughterhouse. The presented

method was applied, which enabled the agent to learn and control a six-axis industrial robot purely from recordings

of human behaviors. The training time achieved in this work is practical for industrial use, considering that the learning

is performed without any human need.

The presented approach can be generalized to other use cases that share similarities in movement and execution

e.g. painting, welding, polishing. This work has the potential to extend beyond the manufacturing industry into e.g.

health care. For all these kind of use cases, time is the crucial factor when programming advanced robot systems. With

stronger techniques and innovative applications of Reinforcement Learning, it is the first step, which allows us to gain

new unseen possibilities in real-world robotics in the future for the greater good of companies and human workers.

Additional future work includes learning picking and hooking skills. We consider that the information of the force

(payload) should also be part of the state representation as this can potentially affect the system dynamics drastically.

Acknowledgements

This work was partially supported by the Danish national funded research project MADE Digital, the Basque

Government Mobility program (MV _2018_1_0018) and Malgurob (KK--2018/00114) research project.

References

[1] Pedersen MR, Nalpantidis L, Andersen RS, Schou C, Bøgh S, Krüger V, et al. Robot skills for manufacturing: From concept to industrial

deployment. Robot Comput Integr Manuf 2016;37:282–91. doi:10.1016/j.rcim.2015.04.002.

[2] Schou C, Andersen RS, Chrysostomou D, Bøgh S, Madsen O. Skill-based instruction of collaborative robots in industrial settings. Robot

Comput Integr Manuf 2018. doi:10.1016/j.rcim.2018.03.008.

[3] Tesauro G. TD-Gammon: A Self-Teaching Backgammon Program. Appl. Neural Networks, Boston, MA: Springer US; 1995, p. 267–85.

doi:10.1007/978-1-4757-2379-3_11.

[4] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning.

Nature 2015;518:529–33. doi:10.1038/nature14236.

[5] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double q-learning. Thirtieth AAAI Conf. Artif. Intell., 2016.

[6] Nair A, McGrew B, Andrychowicz M, Zaremba W, Abbeel P. Overcoming exploration in reinforcement learning with demonstrations. 2018

IEEE Int. Conf. Robot. Autom., 2018, p. 6292–9.

[7] Dyrstad JS, Mathiassen JR. Grasping virtual fish: A step towards robotic deep learning from demonstration in virtual reality. 2017 IEEE Int.

Conf. Robot. Biomimetics, 2017, p. 1181–7.

[8] Meyes R, Tercan H, Roggendorf S, Thiele T, Büscher C, Obdenbusch M, et al. Motion planning for industrial robots using reinforcement

learning. Procedia CIRP 2017;63:107–12.

[9] Wiering M, van Otterlo M. Reinforcement Learning: State-of-the-Art. vol. 12. Springer Science & Business Media; 2012.

[10] Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018.

[11] Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, et al. Openai gym. ArXiv Prepr ArXiv160601540 2016.

[12] Borges M, Symington A, Coltin B, Smith T, Ventura R. HTC Vive: Analysis and Accuracy Improvement. 2018 IEEE/RSJ Int. Conf. Intell.

Robot. Syst., 2018, p. 2610–5.

Transferring Human Manipulation Knowledge to Robots with Inverse Reinforcement Learning *

Conference Paper

Jan 2020

Overcoming Exploration in Reinforcement Learning with Demonstrations

Article

Full-text available

Sep 2017

Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order of magnitude of speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.

Motion Planning for Industrial Robots using Reinforcement Learning

Article

Full-text available

Dec 2017

A major challenge of today's production systems in the context of Industry 4.0 and Cyber-Physical Production Systems is to be flexible and adaptive whilst being robust and economically efficient. Specifically, the implementation of motion planning processes for industrial robots need to be refined concerning their variability of the motion task and the ability to adaptively deal with variations in the environment. In this paper, we propose a reinforcement learning (RL) based, cognition-enhanced six-axis industrial robot for complex motion planning along continuous trajectories as e.g. needed for welding, gluing or cutting processes in production. Our prototype demonstrator is inspired by the classic wire loop game which involves guiding a metal loop along the path of a curved wire from start to finish while avoiding any contact between the wire and the loop. Our work shows that the RL-agent is capable of learning how to control the robot to successfully play the wire loop game without the need of modeling the wire or programming the robot motion beforehand. Furthermore, the extension of the system by a visual sensor (a camera) allows the agent to sufficiently generalize the learning problem so that it can solve new or reshaped wires without the need of additional learning. We conclude that the applicability of RL for industrial robots and production systems in general provides vast and unexplored potential for processes that feature variability to some extent and thus require a general and robust approach for process automation.

Deep Reinforcement Learning with Double Q-Learning

Article

Full-text available

Sep 2015

The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether this harms performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

Robot skills for manufacturing: From concept to industrial deployment

Article

Full-text available

Feb 2016
ROBOT CIM-INT MANUF

Due to a general shift in manufacturing paradigm from mass production towards mass customization, reconfigurable automation technologies, such as robots, are required. However, current industrial robot solutions are notoriously difficult to program, leading to high changeover times when new products are introduced by manufacturers. In order to compete on global markets, the factories of tomorrow need complete production lines, including automation technologies that can effortlessly be reconfigured or repurposed, when the need arises. In this paper we present the concept of general, self-asserting robot skills for manufacturing. We show how a relatively small set of skills are derived from current factory worker instructions, and how these can be transferred to industrial mobile manipulators. General robot skills can not only be implemented on these robots, but also be intuitively concatenated to program the robots to perform a variety of tasks, through the use of simple task-level programming methods. We demonstrate various approaches to this, extensively tested with several people inexperienced in robotics. We validate our findings through several deployments of the complete robot system in running production facilities at an industrial partner. It follows from these experiments that the use of robot skills, and associated task-level programming framework, is a viable solution to introducing robots that can intuitively and on the fly be programmed to perform new tasks by factory workers.

Skill-based instruction of collaborative robots in industrial settings

Article

Mar 2018
ROBOT CIM-INT MANUF

During the past decades, the increasing need for more flexible and agile manufacturing equipment has spawned a growing interest in collaborative robots. Contrary to traditional industrial robots, collaborative robots are intended for operating alongside the production personnel in dynamic or semi-structured human environments. To cope with the environment and workflow of humans, new programming and control methods are needed compared to those of traditional industrial robots. This paper presents a task-level programming software tool allowing robotic novices to program industrial tasks on a collaborative robot. The tool called Skill Based System (SBS) is founded on the concept of robot skills, which are parameterizable and task-related actions of the robot. Task programming is conducted by first sequencing skills followed by an online parameterization performed using kinesthetic teaching. Through several user studies, SBS is found to enable robotic novices to program industrial tasks. SBS has further been deployed and tested in two manufacturing settings demonstrating its applicability in real industrial scenarios.

TD-Gammon: A Self-Teaching Backgammon Program

Chapter

Jan 1995

Gerald Tesauro

This chapter describes TD-Gammon, a neural network that is able to teach itself to play backgammon solely by playing against itself and learning from the results. TD-Gammon uses a recently proposed reinforcement learning algorithm called TD(λ) (Sutton, 1988), and is apparently the first application of this algorithm to a complex nontrivial task. Despite starting from random initial weights (and hence random initial strategy), TD-Gammon achieves a surprisingly strong level of play. With zero knowledge built in at the start of learning (i.e. given only a “raw” description of the board state), the network learns to play the entire game at a strong intermediate level that surpasses not only conventional commercial programs, but also comparable networks trained via supervised learning on a large corpus of human expert games. The hidden units in the network have apparently discovered useful features, a longstanding goal of computer games research. Furthermore, when a set of hand-crafted features is added to the network’s input representation, the result is a truly staggering level of performance: TD-Gammon is now estimated to play at a strong master level that is extremely close to the world’s best human players. We discuss possible principles underlying the success of TD-Gammon, and the prospects for successful real-world applications of TD learning in other domains.

Human-level control through deep reinforcement learning

Article

Feb 2015
NATURE

The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

Reinforcement Learning: State-Of-The-Art

Book

Jan 2012

Reinforcement learning encompasses both a science of adaptive behavior of rational beings in uncertain environments and a computational methodology for finding optimal behaviors for challenging problems in control, optimization and adaptive behavior of intelligent agents. As a field, reinforcement learning has progressed tremendously in the past decade. The main goal of this book is to present an up-to-date series of survey articles on the main contemporary sub-fields of reinforcement learning. This includes surveys on partially observable environments, hierarchical task decompositions, relational knowledge representation and predictive state representations. Furthermore, topics such as transfer, evolutionary methods and continuous spaces in reinforcement learning are surveyed. In addition, several chapters review reinforcement learning methods in robotics, in games, and in computational neuroscience. In total seventeen different subfields are presented by mostly young experts in those areas, and together they truly represent a state-of-the-art of current reinforcement learning research. Marco Wiering works at the artificial intelligence department of the University of Groningen in the Netherlands. He has published extensively on various reinforcement learning topics. Martijn van Otterlo works in the cognitive artificial intelligence group at the Radboud University Nijmegen in The Netherlands. He has mainly focused on expressive knowledge representation in reinforcement learning settings.

Reinforcement Learning: An Introduction

Article

Feb 1998

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911) The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have wit- nessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algo- rithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome. The book is divided into three parts. In the first part, the authors introduce and elaborate on the es- sential characteristics of the reinforcement learning problem, namely, the problem of learning "poli- cies" or mappings from environmental states to actions so as to maximize the amount of "reward"

Grasping virtual fish: A step towards robotic deep learning from demonstration in virtual reality

Jan 2017
1181-1187

J S Dyrstad
J R Mathiassen

Dyrstad JS, Mathiassen JR. Grasping virtual fish: A step towards robotic deep learning from demonstration in virtual reality. 2017 IEEE Int. Conf. Robot. Biomimetics, 2017, p. 1181-7.

Transferring Human Manipulation Knowledge to Industrial Robots Using Reinforcement Learning

Abstract and Figures

Recommended publications

Transferring Human Manipulation Knowledge to Industrial Robots Using Reinforcement Learning

Transferring Human Manipulation Knowledge to Robots with Inverse Reinforcement Learning *

skrl: Modular and Flexible Library for Reinforcement Learning

Learning and generalising object extraction skill for contact-rich disassembly tasks: an introductor...