
Learning Object Manipulation with Dexterous Hand-Arm Systems from Human Demonstration

August 2020


DOI:10.1109/IROS45743.2020.9340966

Conference: IROS 2020

Authors:

Philipp Ruppel

University of Hamburg

Jianwei Zhang

University of Hamburg

We present a novel learning and control framework that combines artificial neural networks with online trajectory optimization to learn dexterous manipulation skills from human demonstration and to transfer the learned behaviors to real robots. Humans can perform the demonstrations with their own hands and with real objects. An instrumented glove is used to record motions and tactile data. Our system learns neural control policies that generalize to modified object poses directly from limited amounts of demonstration data. Outputs from the neural policy network are combined at runtime with kinematic and dynamic safety and feasibility constraints as well as a learned regularizer to obtain commands for a real robot through online trajectory optimization. We test our approach on multiple tasks and robots.


Learning Object Manipulation with Dexterous Hand-Arm Systems

from Human Demonstration

Philipp Ruppel, Jianwei Zhang

{ruppel, zhang}@informatik.uni-hamburg.de


I. INTRODUCTION

Humanoid robot hands offer unique opportunities for

robotic manipulation by being ideally suited to handle a vast

number of objects and tools that were originally designed

for human hands and by potentially allowing for simple

and intuitive teaching of robots through human demonstra-

tion without requiring explicit task-speciﬁc programming of

motions or task goals. As an additional beneﬁt, humanoid

robot hands are modeled after a system that has proven to

be extremely versatile and effective for millennia. However,

despite many recent advances in robotics and AI, control

and learning for dexterous manipulation with humanoid robot

hands in the real world remains a signiﬁcant challenge.

To make teaching simple and convenient, we want to

allow humans to demonstrate tasks using their own hands

with real objects, instead of having to use teleoperation, ki-

naesthetic teaching, or additional task-speciﬁc programming

or annotation. While kinaesthetic teaching would provide

robot poses directly and teleoperation could rely on human

feedback to correct for errors and to ensure safety, our

system has to produce accurate and safe robot commands

autonomously. To apply current reinforcement learning tech-

niques to robotic manipulation, human programmers usually

have to implement task-speciﬁc rewards and simulation en-

vironments. Dynamic motion primitives can require explicit

data annotation and task-speciﬁc programming to adapt and

initiate different motion primitives. We do not want to burden

human teachers with having to perform these additional

steps and instead want our system to learn the required task

information directly from demonstration data.

Authors are with the Department of Informatics, University of Hamburg.

This work was partially supported by the German Research Foundation

(DFG) and the National Science Foundation of China (NSFC) in project

Crossmodal Learning, TRR-169, www.crossmodal-learning.org

Fig. 1: We learn manipulation tasks from human demon-

stration (top-left) and execute the learned behaviors on real

robots (bottom-left, middle, right).

We ﬁrst record human demonstrations using an instru-

mented glove and extract motion trajectories as well as

tactile information. To efﬁciently learn stable control policies

that generalize to previously unseen situations, we introduce

a set of trajectory-based training and data augmentation

methods. We present and compare three different network

architectures: a feed-forward policy, a deep recurrent struc-

ture that implicitly learns hidden state information, and a

model-based approach which ﬁrst learns neural models and

then trains a multimodal policy network to also consider

tactile sensations and state variables. At runtime, an online

trajectory optimizer uses abstract and hardware-independent

commands from the neural network as well as a learned

regularizer to generate joint-space commands for a particular

robot under kinematic and dynamic safety and feasibility

constraints. Optimizing trajectories over multiple time steps

allows our controller to avoid kinematic constraints and

collisions while it is still possible to do so without violating

dynamic limits or further deviating from the motion goals.

We also use trajectory optimization for stable hand tracking.

See ﬁgure 2 for an overview of our system. We test and

demonstrate our methods on four different manipulation

tasks and on three different robots.

II. RELATED WORK

Inverse kinematics can find joint angles for a robot arm to reach a Cartesian goal pose, while inverse dynamics computes joint velocities or torques from Cartesian goal velocities or forces [1] [2]. It can be desirable to optimize trajectories over

multiple time steps simultaneously to fulﬁll kinematic as well

as dynamic objectives and constraints. This can be accom-

plished using stochastic [3] or gradient-based [4] methods.

Schulman et al. [5] generate collision-free trajectories using

penalty terms. Mordatch et al. [6] use a special contact model

to generate trajectories for simulated manipulation problems.

Trajectory optimization for robots with many degrees of free-

dom is typically performed ofﬂine due to high computation

time. However, for certain general classes of constrained

optimization problems, it has been shown that interior-point

methods can ﬁnd solutions efﬁciently [7] [8] [9] [10].

Reinforcement learning adjusts a policy to maximize a

reward function through trial and error. For practical prob-

lems with sparse rewards, the required numbers of trials

can be prohibitively large. It is often possible to accelerate

reinforcement learning by programming smooth task-speciﬁc

shaped reward functions. Marcin et al. use reinforcement

learning in simulated environments and a special reward

function to ﬁnd ﬁnger motions for rotating a cube [11].

Akkaya et al. [12] and Li et al. [13] also learn motions

for rotating the top-facing side of a Rubik’s Cube and

combine the learned behaviors with a traditional Rubik’s

Cube solver. Rajeswaran et al. teleoperate a virtual robot

hand in a simulated environment. They use reinforcement

learning to train a policy network to imitate the demon-

strated motions using the same simulated robot hand and

to maximize additional task-speciﬁc reward functions [14].

Abbeel et al. learn cost functions from human demonstration

for controlling a car in a simplified driving simulator [15]. If we used current reinforcement learning techniques for

our work, human teachers would not only have to perform

demonstrations, but would also have to prepare task-speciﬁc

simulation environments with accurate object models.

Ijspeert et al. [16] capture human motions and represent

the recorded joint-space trajectories using basis functions and

an attractor term. The attractor acts as a low-pass ﬁlter and

can be tuned for smooth or accurate control. Users have

to choose between repetitive and non-repetitive basis func-

tions, and for repetitive motions, annotate phases. Repetitive

motions are executed indeﬁnitely and have to be stopped

by the user or by a program. For a single goal position,

the entire motion primitive can be shifted by a ﬁxed offset.

Paraschos et al. [17] learn probability distributions from sets

of joint-space trajectories to shape the transitions between

motion primitives. Automatically adapting to multiple object

poses, achieving rotational invariance, combining repetitive

and non-repetitive motions, generating accurate as well as

smooth and feasible motions, integrating additional sensor

modalities, etc., would require further extensions. It is not

always clear how these features could be integrated without

task-speciﬁc programming or annotation.

Fig. 2: Overview of our training methods (a) and of our system during execution (b).

Several teleoperation systems have been developed for

controlling humanoid robot hands. These methods can rely

on the human operator to ensure safety and to compensate

for position offsets. Recent successful approaches mainly use

relative objectives between ﬁngertips [18] or treat the hand

and the arm separately [19] [20]. For autonomous execution,

our controller has to not only reproduce relative motions,

but it also has to achieve accurate absolute positioning and

enforce safety and feasibility constraints.

Convolutional neural networks [21] [22] use special con-

nection patterns and weight sharing to achieve transla-

tion invariance. Weight sharing can also be used to pro-

cess unordered point sets [23].

III. DATA ACQUISITION AND TRAJECTORY RECONSTRUCTION

During each demonstration, we record ﬁnger and object

motions as well as tactile sensations. Finger motions and

tactile information are captured using an instrumented glove.

We construct tactile sensors from conductive fabric and

pressure-sensitive piezo-resistive materials. One sensor is

attached to each ﬁngertip. Human motions are recorded

using LEDs on the ﬁngertips and hand joints, and the line

cameras of a Phasespace Impulse X2 system.

To achieve stable hand tracking without manual data clean-

up, we introduce a trajectory-based reconstruction method.

Each line sensor is modeled as a one-dimensional pinhole camera with position $P_C$, orientation matrix $R_C$, and polynomial distortion terms $d_i$. We optimize 3D marker positions $p_j$ to minimize a robust loss $l_j$ [24] with re-projection error $e_j$ for observation $o_j$. During calibration, we also optimize camera parameters.

$$q_j = R_C^{-1} \cdot (p_j - P_C) \tag{1}$$

$$s_j = \frac{q_{j,1}}{q_{j,3}}, \qquad e_j = s_j + \sum_{i=1} d_i \, s_j^{2i} - o_j \tag{2}$$

$$l_j = \begin{cases} \frac{1}{2} e_j^2 & \text{if } \|e_j\| < \delta \\ \delta \left( \|e_j\| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases} \tag{3}$$
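The camera model and robust loss of Eqs. (1) to (3) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the default threshold `delta=1.0`, and the reading of Eq. (3) as a Huber-type loss with a factor of 1/2 are ours, not taken from the original implementation.

```python
import numpy as np

def reprojection_error(p, cam_pos, cam_rot, distortion, observation):
    """Residual e_j of Eq. (2) for a one-dimensional pinhole camera."""
    # Eq. (1): express the marker position in the camera frame.
    q = np.linalg.inv(cam_rot) @ (np.asarray(p, float) - np.asarray(cam_pos, float))
    s = q[0] / q[2]  # perspective division (q_{j,1} / q_{j,3} in 1-based notation)
    d = np.asarray(distortion, dtype=float)
    powers = 2 * np.arange(1, d.size + 1)  # exponents 2i for i = 1, 2, ...
    return s + np.sum(d * s ** powers) - observation

def robust_loss(e, delta=1.0):
    """Huber-type loss: quadratic for small residuals, linear for outliers."""
    a = abs(e)
    if a < delta:
        return 0.5 * e ** 2
    return delta * (a - 0.5 * delta)
```

Because the loss grows only linearly beyond `delta`, a single spurious LED detection pulls much less on the reconstructed marker position than it would under a purely quadratic loss.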

To ﬁll in data gaps caused by marker occlusions and to

exploit observations from other time frames as additional

evidence for the correct 3D marker positions, we add a

dynamic regularization term $d_{i,t}$ with regularization weights $w_f$, $w_g$ between marker positions $p_{i,t}$.

$$d_{i,t} = |p_{i,t} - p_{i,t+dt}|^2 \, w_f + |p_{i,t-dt} - 2 p_{i,t} + p_{i,t+dt}|^2 \, w_g \tag{4}$$

To exploit the kinematic structure of the hand, we add an additional link regularizer $l_{i,j,t}$ with weight $w_l$ between directly connected hand markers $m_{i,t}$, $m_{j,t}$. We use soft objective terms instead of hard constraints since the glove can slightly move and stretch.

$$l_{i,j,t} = |m_{i,t} - m_{j,t} - m_{i,t+dt} + m_{j,t+dt}|^2 \, w_l \tag{5}$$
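As a rough sketch, the two regularizers of Eqs. (4) and (5) can be evaluated on a sampled marker trajectory with finite differences. Summing the terms over all frames, the array layout `(T, 3)`, and the function names are assumptions made here for illustration.

```python
import numpy as np

def dynamic_regularizer(p, w_f, w_g):
    """Eq. (4) summed over frames; p holds one marker trajectory, shape (T, 3)."""
    velocity = p[1:] - p[:-1]                      # first difference between frames
    acceleration = p[:-2] - 2.0 * p[1:-1] + p[2:]  # second difference
    return w_f * np.sum(velocity ** 2) + w_g * np.sum(acceleration ** 2)

def link_regularizer(m_i, m_j, w_l):
    """Eq. (5) summed over frames for two directly connected markers."""
    offset = m_i - m_j                 # vector between the two markers per frame
    change = offset[:-1] - offset[1:]  # how much that vector changes per step
    return w_l * np.sum(change ** 2)
```

The link term is zero whenever two connected markers translate together, which is why it tolerates arbitrary hand motion while still penalizing markers that drift apart.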

To accelerate convergence, we solve a sequence of

equations with exponentially increasing temporal resolu-

tion. Each previous solution is used as an initial guess

for the next subdivision step.

IV. NETWORK ARCHITECTURES

After recording human demonstrations and reconstructing

3D trajectories, the obtained data is used to train a neural

policy. We substitute objects and the robot with simpliﬁed

but differentiable Cartesian template models. This frees the

network from having to learn hardware-speciﬁc details about

a particular robot, and it allows us to focus our machine

learning efforts on task information. It also allows us to

train our policies directly using efficient gradient-based optimization. Both the objects and the hand are represented as point sets. Hand points $p_{t,j}$ can be controlled by the network through velocity commands $v_{t,j}$.

$$\frac{d p_{t,j}}{dt} = v_{t,j} \tag{6}$$

We provide relative position vectors of the hand and object points, velocities $v_t$, and if available additional state variables $s_t$ and tactile measurements $h_t$ as inputs $I_t$. Relative position vectors are obtained by subtracting the arithmetic mean of the $N$ hand points $p_{t,j}$.

$$I_t = \left( p_t - \sum_{j=0}^{N} \frac{p_{t,j}}{N}, \; v_t, \; s_t, \; h_t \right) \tag{7}$$

As in convolutional neural networks, translational invariance

is ensured by the network architecture and rotational invari-

ance is achieved through data augmentation.
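The input construction of Eq. (7) can be illustrated as follows. The function name, the argument layout, and the flattening order are hypothetical; the sketch only shows why subtracting the hand centroid makes the input independent of a global translation.

```python
import numpy as np

def build_policy_input(hand_points, object_points, velocities, state=(), tactile=()):
    """Assemble I_t: centroid-relative positions, velocities, state, tactile data."""
    centroid = hand_points.mean(axis=0)  # arithmetic mean of the hand points
    relative = np.concatenate([hand_points, object_points]) - centroid
    return np.concatenate([
        relative.ravel(),
        velocities.ravel(),
        np.ravel(state),    # optional object state variables s_t
        np.ravel(tactile),  # optional tactile measurements h_t
    ])
```

Shifting the hand and the objects by the same offset leaves the returned vector unchanged, so the policy never sees absolute workspace coordinates.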

A. Feed-Forward Policy Network

For simple tasks such as reach-to-grasp, which neither

require memory nor tactile perception, a simple unimodal

feed-forward policy should be sufﬁcient. Our feed-forward

network shown in ﬁgure 3a consists of 5 densely connected

layers with 2048 input neurons, 512 neurons in each hidden

layer, and 15 output neurons. We use ReLU activation for

input and hidden layers and linear activation for the output

layer. Each output neuron controls the velocity of a hand

point along one Cartesian dimension.
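A minimal NumPy forward pass with the layer widths given above (2048, 512, 512, 512, 15) could look like this. The input dimension, the random initialization, and the function names are placeholders; the original system was built with TensorFlow and Keras, and its trained weights are not reproduced here.

```python
import numpy as np

LAYER_WIDTHS = [2048, 512, 512, 512, 15]  # input layer, three hidden layers, output layer

def init_policy(input_dim, rng):
    """Create untrained placeholder weights for the five dense layers."""
    params, fan_in = [], input_dim
    for width in LAYER_WIDTHS:
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
        params.append((w, np.zeros(width)))
        fan_in = width
    return params

def policy_forward(params, x):
    """ReLU on the input and hidden layers, linear activation on the output layer."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return x @ w + b  # 15 Cartesian velocity components
```

With 15 outputs and one output per hand point and Cartesian axis, this configuration would correspond to five controlled hand points.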

Fig. 3: Neural policy and model networks. Panels include (a) feed-forward policy, (b) recurrent policy, and (d) object state model.

B. Recurrent Policy Network

We enable the network to remember previous actions and

observations by adding recurrent connections as shown in

ﬁgure 3b. Inputs are shared with a recurrent branch, which

consists of three densely connected layers. The outputs of

the recurrent branch are concatenated to the inputs of the re-

current and of the feed-forward branch. We use signiﬁcantly

smaller layer sizes for the recurrent branch, with 32 TanH

input units, 16 TanH hidden neurons, and four linear output

neurons, and we apply 10% dropout at the inputs.

C. Neural Object Models

We extend our template models to also simulate tactile

sensations and object state variables. To keep the training

process simple for human teachers, we learn these models

directly from demonstration data.

1) Tactile Object Model: Our tactile models map relative

ﬁngertip positions, ﬁngertip velocities and object state vari-

ables to simulated tactile readings. We represent the tactile

models using PointNet-inspired [23] fully convolutional neu-

ral networks with 1D convolutions over tactile sensor indices

and a ﬁlter width of 1. The architecture consists of 4 layers

with 64 convolutional TanH units in the input and hidden

layers and one linear convolutional unit in the output layer.
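A 1D convolution with filter width 1 applies the same small dense network to every sensor independently, which the following sketch makes explicit. The feature dimension, the random weights, and the function names are assumptions for illustration only.

```python
import numpy as np

def init_pointwise(in_features, rng, widths=(64, 64, 64, 1)):
    """Weights shared across all tactile sensors (untrained placeholders)."""
    params, fan_in = [], in_features
    for width in widths:
        w = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, width))
        params.append((w, np.zeros(width)))
        fan_in = width
    return params

def tactile_model(params, features):
    """features: (num_sensors, in_features) -> one simulated reading per sensor."""
    x = features
    for w, b in params[:-1]:
        x = np.tanh(x @ w + b)  # TanH units in the input and hidden layers
    w, b = params[-1]
    return (x @ w + b)[:, 0]    # single linear output unit
```

Because the weights are shared, reordering the sensors simply reorders the predictions, and the same model applies to any number of fingertips.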

2) Object State Model: During human demonstration,

we measure and record additional object state variables.

At runtime, the state variables are predicted. Our neural

object state model takes relative Cartesian ﬁngertip positions

and velocities, current values for the state variables and

tactile information as inputs. The outputs of the object state

model are added to the current values of the state variables.

The network consists of 4 densely connected layers with

32 TanH units in each input and hidden layer and one

linear output unit for each state variable.

V. TRAJECTORY-BASED TRAINING

We train our policy networks by simulating trajectories

over multiple time steps and propagating gradients back in

time. During each simulation step, the policy network $P$ is called with inputs from a previous time step $S_t$. The velocity outputs of the policy network $P$ are used together with our differentiable template model $M$ and learned models $L$ to compute new values $S_{t+dt}$ for the state variables.

$$S_{t+dt} = \begin{pmatrix} M(S_t, P(S_t)) \\ L(S_t, P(S_t)) \end{pmatrix} \tag{8}$$

State variables $S_{t,m}$ are reset before the first simulation step $t_0$ and at randomly selected time frames with demonstration data $D_{t,m}$ and random perturbations. The reset probability is computed using a constant base $b$, a random exponent $x$ and a uniformly distributed random number generator $r_t$. The random perturbations are composed of a normally distributed random vector $P_{t,m}$ for each point $m$ and a scalar random exponent $f$ for each trajectory segment. We introduce exponential terms to randomly scale the perturbations across multiple orders of magnitude to avoid having to manually fine-tune augmentation parameters. Rotational invariance is achieved through additional online data augmentation, multiplying each demonstration trajectory with a random rotation matrix $R$.

$$S_{t,m} = \begin{cases} R \, D_{t,m} + b^{f} P_{t,m} & \text{if } (t = t_0) \lor (r_t < b^{x}) \\ S'_{t-1,m} & \text{otherwise} \end{cases} \tag{9}$$
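The reset rule of Eq. (9) can be sketched as below. How the exponents and the rotation are sampled is not specified in the text, so the QR-based rotation sampler and all function and parameter names (including `base` for the constant base) are assumptions.

```python
import numpy as np

def random_rotation(rng):
    """Random 3D rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs so the factorization is unique
    if np.linalg.det(q) < 0:     # turn a reflection into a proper rotation
        q[:, 0] = -q[:, 0]
    return q

def should_reset(t, t0, base, x, rng):
    """Reset at the first step, or with probability base**x at any later step."""
    return t == t0 or rng.uniform() < base ** x

def reset_state(demo_t, rotation, base, f, rng):
    """Rotated demonstration points plus Gaussian noise scaled by base**f."""
    noise = rng.normal(size=demo_t.shape)
    return demo_t @ rotation.T + base ** f * noise
```

Drawing the exponents `x` and `f` from a wide range makes both the reset frequency and the noise magnitude vary over several orders of magnitude, which is the effect the exponential terms are meant to achieve.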

We compute a loss value from simulated and demonstrated

states over the entire simulated trajectory, propagate the

gradients back in time until reaching the start of the trajectory

or one of the random resets, and update the network weights

for all contributing time steps. The loss function computes a

weighted error over different modalities including Cartesian

positions and velocities, and if available, tactile information

and object state variables. The network weights are optimized

via a batch gradient descent method [25]. A relatively large

batch size between 128 and 512 should be used to obtain

meaningful gradients despite strong randomization and to

allow for efﬁcient parallelization.

VI. TIME DISCRETIZATION

Even when using a feed-forward policy, our trajectory-

based training method leads to a recurrent structure. During

our experiments, we found that it is usually sufﬁcient to

train with relatively large time steps and that doing so

reduces training time. However, at runtime, we want to

use smaller time steps to allow for fast reaction to sensor

input and to achieve smooth as well as accurate control.

Therefore, we want to reformulate our networks as differential equations and use numerical integration with different step sizes for training and execution. Our continuous-time network $N'$ computes network outputs $o_t$ and the time derivatives of the network activations from current activations $A$ and additional inputs $I$.

$$\left( \frac{dA}{dt}, \; o_t \right) = N'(A, I) \tag{10}$$

A practical obstacle to using this approach is that in current high-performance software libraries for implementing artificial neural networks, the network $N$ effectively performs numerical integration with a fixed step size $s$.

$$(A_{t+s}, \; o_t) = N(A_t, I_t) \tag{11}$$

For an explicit Euler step, ﬁnite differences could directly

recover the exact gradients within numerical precision. In

practice, the activations may be updated incrementally and

we obtain a gradient approximation.

$$\frac{dA}{dt} = \lim_{s \to 0} \frac{N(A_t, I_t)_0 - A_t}{s} \tag{12}$$

The gradients can now be integrated with modiﬁed step sizes.
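The idea of Eqs. (10) to (12) can be illustrated with a toy example: a network that only offers a fixed-step update is queried once, its rate of change is recovered by a finite difference, and that rate is then integrated with a different step size. The leaky-integrator dynamics in `discrete_step` are a stand-in invented for this sketch and are not the networks used in the paper.

```python
import numpy as np

def discrete_step(a, inp, s=0.1):
    """Stand-in for a fixed-step network N: returns (A_{t+s}, o_t)."""
    a_next = a + s * (np.tanh(inp) - a)  # toy dynamics, purely illustrative
    return a_next, a_next.sum()

def continuous_rate(a, inp, s=0.1):
    """Finite-difference estimate of dA/dt from one fixed-size step."""
    a_next, out = discrete_step(a, inp, s)
    return (a_next - a) / s, out

def integrate(a, inp, duration, step):
    """Explicit Euler integration of the recovered rate with a new step size."""
    for _ in range(int(round(duration / step))):
        rate, _ = continuous_rate(a, inp)
        a = a + step * rate
    return a
```

Calling `integrate` with `step` equal to the training step reproduces the original update, while a smaller `step` yields a finer-grained trajectory for execution.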

VII. ROBOT VISION

To allow the robot to manipulate unmodiﬁed objects, we

train a fully convolutional neural network to detect virtual

keypoints. We use pre-trained Mobilenet [22] layers up to

the ﬁfth separable convolutional block to compute feature

embeddings and then add two 32-channel 1x1 convolu-

tional hidden layers and a 1x1 linear convolutional output

layer with one channel for each marker ID. The output

is resampled to the size of the original input image using

bicubic interpolation and maxima in the marker channels

are interpreted as virtual marker detections.
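Extracting virtual marker detections from the resampled output channels can be sketched as a per-channel maximum search. The confidence threshold and the function name are assumptions; the text only states that maxima in the marker channels are interpreted as detections.

```python
import numpy as np

def detect_keypoints(heatmaps, threshold=0.5):
    """heatmaps: (H, W, K) array with one channel per marker ID.

    Returns {marker_id: (row, col)} for every channel whose maximum
    exceeds the (assumed) confidence threshold.
    """
    detections = {}
    for k in range(heatmaps.shape[2]):
        channel = heatmaps[:, :, k]
        row, col = np.unravel_index(np.argmax(channel), channel.shape)
        if channel[row, col] > threshold:
            detections[k] = (int(row), int(col))
    return detections
```

Channels without a sufficiently strong response are skipped, so an occluded keypoint produces no detection rather than a spurious one.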

Camera poses are calibrated using structure from motion.

We attach multiple Aruco [26] tags to the forearm of

the robot and automatically move the arm into randomly

generated poses while recording marker detections. Since

the surface of the robot and the markers is curved, we use

corner-based subpixel reﬁnement. To calibrate the cameras,

we simultaneously optimize camera parameters and the 3D

positions of the marker corners relative to the forearm link

to minimize the reprojection error of each corner.

VIII. ONLINE TRAJECTORY OPTIMIZATION

We translate Cartesian commands from the neural policy

network into hardware-speciﬁc joint angles through kino-

dynamic online trajectory optimization. We ﬁrst simulate

Cartesian trajectories using the most recent measurements,

the policy network, and our template and neural models.

The resulting Cartesian trajectories are converted into sets

of timestamped position goals, which are combined with

additional goals and constraints to optimize robot trajecto-

ries. Each optimization step is initialized with a timeshifted

version of a previous trajectory.

For each trajectory update, we solve a non-linear op-

timization problem through sequential quadratic program-

ming using a primal-dual interior-point method. The op-

timization problem is deﬁned by instances of different

goal classes. Each goal can specify quadratic objectives,

equality constraints, inequality constraints, and box con-

straints. Inequality constraints are automatically converted

into box constraints and equality constraints by inserting

slack variables. We finally solve a sequence of unconstrained linear equations with objective gradients $J_X$, equality and inequality constraint gradients $J_E$ and $J_I$, exponentially adjusted logarithmic barrier gradients $B_X$, $B_S$, and right-hand-side vectors $r_x$, $r_e$, $r_i$ for the joint variables $X$, Lagrange multipliers $L_E$, $L_I$ and slack variables $S_B$.

$$\begin{pmatrix} J_X + B_X I & J_E & J_I & 0 \\ J_E & 0 & 0 & 0 \\ J_I & 0 & 0 & -I \\ 0 & 0 & -I & B_S I \end{pmatrix} \begin{pmatrix} X \\ L_E \\ L_I \\ S_B \end{pmatrix} = \begin{pmatrix} r_x \\ r_e \\ r_i \\ \dots \end{pmatrix} \tag{13}$$

Cartesian trajectories are translated into quadratic position goals. For each template model point $p_{i,t}$ with time $t$ and point index $i$, we assign a corresponding reference point $r_i$ relative to a link pose $L_{i,t}$ and minimize the squared distance $d_{i,t}$ between both point positions.

$$d_{i,t} = \| p_{i,t} - L_{i,t} \, r_i \|^2 \tag{14}$$

For each joint position variable $q_{j,t}$ with time $t$, step size $dt$ and joint index $j$, we specify upper and lower joint position limits $u_j$, $l_j$, a fixed trust region $c$ relative to the last candidate solution $r_{j,t}$, as well as maximum joint velocities $v_j$ and maximum joint accelerations $a_j$.

$$\max(r_{j,t} - c, \; l_j) < q_{j,t} < \min(u_j, \; r_{j,t} + c) \tag{15}$$

$$-v_j < \frac{q_{j,t+dt} - q_{j,t}}{dt} < v_j \tag{16}$$

$$-a_j < \frac{q_{j,t-dt} + q_{j,t+dt} - 2 q_{j,t}}{2\,dt} < a_j \tag{17}$$
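The bounds of Eqs. (15) to (17) can be written as a pair of small helper functions. These only evaluate the limits for a given candidate trajectory; they are not the interior-point solver itself, and the function names and array layout `(T, J)` are hypothetical.

```python
import numpy as np

def joint_bounds(prev_solution, lower, upper, trust):
    """Eq. (15): intersect the joint limits with a trust region around r_{j,t}."""
    lo = np.maximum(prev_solution - trust, lower)
    hi = np.minimum(upper, prev_solution + trust)
    return lo, hi

def within_dynamic_limits(q, dt, v_max, a_max):
    """Check Eqs. (16) and (17) for a joint trajectory q of shape (T, J)."""
    vel = (q[1:] - q[:-1]) / dt
    # Denominator 2*dt follows Eq. (17) as printed.
    acc = (q[:-2] + q[2:] - 2.0 * q[1:-1]) / (2.0 * dt)
    return bool(np.all(np.abs(vel) < v_max) and np.all(np.abs(acc) < a_max))
```

A constant-velocity trajectory has zero second difference, so it is limited only by the velocity bound.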

To prevent jumps during trajectory replacement, we con-

strain the ﬁrst two keyframes of each new trajectory to match

the corresponding two keyframes of the previous trajectory.

Mechanical couplings between ﬁnger joints on underactuated

hands are modeled as additional equality constraints.

For collision avoidance, we construct a convex polyhe-

dral approximation of the workspace in Cartesian space

and approximate the shape of each link by a convex hull

around a set of spheres. Since the workspace approximation

is convex, constraining only the spheres is sufﬁcient to

prevent collisions with the entire link bodies. We insert

pairwise linear constraints between boundary planes and link

spheres. Each boundary plane is represented by a normal $n_k$ and a distance $d_k$. Each sphere has a center $c_l$ relative to a link pose $P_l$ and a radius $r_l$.

$$P_l \, c_l \cdot n_k < d_k - r_l \tag{18}$$
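Eq. (18) can be checked for all sphere and plane pairs at once. In this sketch the link pose is split into a rotation matrix and a translation vector, which is an assumption about the representation, and the function returns a margin that is positive when the constraint holds.

```python
import numpy as np

def sphere_plane_margins(link_rotation, link_translation, centers, radii, normals, distances):
    """Margin (d_k - r_l) - (P_l c_l) . n_k for every sphere l and plane k.

    centers: (S, 3), radii: (S,), normals: (K, 3), distances: (K,).
    A positive entry means the sphere stays inside that boundary plane.
    """
    world = centers @ link_rotation.T + link_translation  # sphere centers in the world frame
    signed = world @ normals.T                            # projection onto each plane normal
    return distances[None, :] - radii[:, None] - signed
```

Since each row depends linearly on the sphere center, the constraints remain linear in the link position once the rotation is fixed.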

TABLE I: Robot experiments for different tasks, robots, network architectures, demonstration counts (D.) and trajectory optimization windows (Traj.). For each experiment, we test whether the task is performed successfully during multiple consecutive trials (Succ.) and for different object poses (Inv.).

| Task | Robot | Network | D. | Traj. | Succ. | Inv. |
|---|---|---|---|---|---|---|
| Pick Place | C5 UR10e | Feed-Fwd. | 10 | 10 | Yes | Yes |
| Wiping | C5 UR10e | Feed-Fwd. | 5 | 10 | Yes | Yes |
| C. Bottle | C5 LBR4+ | Feed-Fwd. | 1 | 10 | No | n/a |
| C. Bottle | C5 LBR4+ | Recurrent | 1 | 10 | Yes | Yes |
| C. Bottle | C6 UR10 | Model-B. | 1 | 10 | Yes | Yes |
| B. Bottle | C5 UR10e | Feed-Fwd. | 1 | 3, 4 | No | n/a |
| B. Bottle | C5 UR10e | Feed-Fwd. | 1 | 5..10 | Yes | Yes |

If multiple solutions can be found which fulﬁll the ob-

jective function almost equally well without violating any

of the constraints, we want to prefer natural hand poses

that would also be preferred by a human. We therefore

introduce a learned regularizer.

$$r_i = \left( \frac{v_i - m_i}{s_i} \right)^2 \tag{19}$$

From an existing hand pose dataset [27] [28], we compute averages and standard deviations for the joint angles and construct a multivariate Gaussian distribution. For each Gaussian with mean $m_i$, standard deviation $s_i$, and corresponding joint variable $v_i$, we add a quadratic regularization term $r_i$.
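The learned regularizer amounts to fitting per-joint statistics once and then penalizing standardized deviations from them. The sketch below reads the quadratic term of Eq. (19) as the squared standardized deviation; the function names and the data layout `(num_samples, num_joints)` are assumptions, and no particular dataset is implied.

```python
import numpy as np

def fit_pose_statistics(joint_angles):
    """Per-joint mean m_i and standard deviation s_i from recorded hand poses."""
    return joint_angles.mean(axis=0), joint_angles.std(axis=0)

def pose_regularizer(v, mean, std):
    """Quadratic penalty on each joint variable v_i, scaled by its spread."""
    return ((v - mean) / std) ** 2
```

Joints that vary little in the dataset receive a steep penalty, while joints with a wide natural range are left largely free to follow the task goals.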

IX. EXPERIMENTS

We test our methods on four different manipulation

problems: a pick-place and a wiping task, opening a chemical

bottle with a wide lid, and opening a beverage bottle with a

small lid. The experiments are performed with real objects

and robots. We use a UR10e arm with a Shadow C5 hand,

a KUKA LBR 4+ arm with a Shadow C5 hand, and a

UR10 arm with a Shadow C6 hand. An overview of our

robot experiments is given in table I.

A. Pick-and-Place Task

The robot has to grasp an elongated box-shaped object and

place it onto a rectangular plate. Both objects are equipped

with LEDs as tracking markers. We collect a total of 10

human demonstrations. Before each demonstration, both

items are moved into different positions and orientations.

During the demonstrations, a human grasps the box and

places it onto the plate. We use the recorded trajectories

to train our feed-forward network. As training data, we use

the positions of two markers on each object, one marker

on each ﬁngertip, and one marker on each knuckle and at

the base of the thumb. At runtime, we use observed marker

positions as inputs and pass outputs from the network to our

trajectory optimizer. The resulting motions are executed on

a UR10e arm with a Shadow C5 hand. The robot is able

to successfully perform the task even if the box, the plate,

and the hand are placed in previously unseen poses. Grasp

poses are adapted if the box is rotated. The lengths of the trajectories are adjusted if the object positions are changed. Figure 4 shows the robot during execution.

Fig. 4: UR10e arm with Shadow C5 hand while performing a pick-and-place task.

Fig. 5: UR10e with Shadow C5 hand during a wiping task.

Fig. 6: Turning the lid of a chemical bottle (feed-forward network, LBR4+ arm, C5 hand).

B. Wiping Task

We record ﬁve demonstrations of a wiping task that

requires grasping a brush, moving to a target object, and

performing oscillating cleaning motions. As for the pick-

and-place experiment, we use the feed-forward architecture

and a UR10e arm with a Shadow C5 hand. At runtime,

the robot approaches and grasps the brush, lifts it, places

it onto the target object, and performs periodic cleaning

motions, with the bristles of the brush wiping across the

surface. The task can be performed successfully for previ-

ously unseen hand, brush and target poses. Figure 5 shows

the robot during the wiping task.

C. Opening a Chemical Bottle

The robot has to turn the lid of a chemical bottle un-

til it has been loosened, grasp the lid, lift it, and place

it next to the bottle. We record a single demonstration

with tactile readings and Cartesian motion trajectories for

the ﬁngertips and bottle position.

Fig. 7: Opening the chemical bottle using our recurrent policy network (bottom), image from an overhead camera (top-left), output of our vision network (top-right).

Fig. 8: Recurrent neural activations while opening the chemical bottle, with approximate sub-task annotations (horizontal axis: time step; vertical axis: activations; annotated phases: approach, turn lid, pick, place).

1) Feed-Forward Policy: We train our feed-forward architecture with the recorded trajectories and use a KUKA LBR 4+ with a Shadow C5 hand for execution. For a first

test, we assume a ﬁxed bottle pose. If the bottle is carefully

placed in the correct position, the robot performs a correct

approach motion, and the ﬁnger motions turn the lid (see

ﬁgure 6). Since the network does not possess memory and

can neither use recurrent models nor tactile information, it

is not able to determine when the lid can be lifted off and

continues to perform turning motions indeﬁnitely.

2) Vision Network: We use our vision network described

in section VII to automatically determine the bottle position

without needing LED markers. After performing SfM-based

calibration, our marker-less tracking method delivers results

that are accurate enough for approaching the bottle and turn-

ing the lid. The object can still be detected and manipulated if

it is placed in different positions and orientations on the table.

3) Recurrent Policy Network: We use the same data as

before to train our recurrent policy network described in

section IV-B for the bottle opening task. While the feed-

forward network keeps performing turning motions indef-

initely and fails to remove the lid, our recurrent network

stops rotating the lid at an appropriate time. It then grasps

the lid, lifts it, performs a sideways motion, lowers the hand,

and places the lid next to the bottle. We execute the policy

on an LBR 4+ arm with a C5 hand. Figure 8 shows the

neural activations in the output layer of the recurrent column

over time. If we dampen the connections between the last layer in the recurrent column and the concatenation layer, the transition from the lid-rotation phase to the pick-place phase is delayed. The overall behavior of the network and the speed of the finger motions remain the same. See figure 7 for photos of the robot during execution as well as an input image and output activations of the vision network.

Fig. 9: Opening a chemical bottle and removing the lid (model-based learning, UR10, C6 hand).

Fig. 10: Turning and removing the lid of a beverage bottle (feed-forward network, UR10e, C5 hand).

4) Crossmodal Model-Based Learning: We use tactile

data collected during demonstration of the bottle opening

task to train a tactile object model as described in section

IV-C.1. We also train a recurrent object state model as

described in section IV-C.2 with lid orientation as a state

variable. Using our model networks, we then train a feed-

forward policy network as described in sections IV-A and V.

During execution, we use predicted object state information

from the object state model and a mixture of predicted

and measured tactile readings. A 50-50 combination leads

to stable yet responsive behavior. We test the policy on

a UR10 arm with a Shadow C6 hand. Each ﬁngertip is

equipped with a tactile pressure sensor. If a human touches

multiple robot ﬁngertips, the robot hand opens, and after

removing the externally induced stimulus, the robot hand

closes again until the ﬁngers touch the lid. Our crossmodal

model-based architecture was able to perform the bottle

opening task successfully in 10 out of 10 trials.

D. Opening a Beverage Bottle

The feed-forward architecture is trained to open a beverage

bottle with a smaller lid. We use a single demonstration with

trajectories of 21 hand markers at the fingertips and joints,

and two markers on the bottle. At runtime, the bottle markers

are located using the tracking system, and the generated

motions are executed on a UR10e arm with a Shadow C5

hand. See figure 10 for different states during execution. The

robot is able to successfully turn the lid. After the lid has

been screwed off, it falls onto the table. While the chemical

bottle requires a recurrent structure to initiate a final pick-

and-place phase, the beverage bottle task can be considered

successfully solved by the simpler feed-forward architecture.

If we set the window size (trajectory length) of the trajectory

optimizer to 3, the fingers push the bottle instead of turning

the lid. With a trajectory length of 5 or above, the lid is

turned successfully.

TABLE II: Average tracking errors for different trajectory

lengths while following Cartesian goal trajectories generated

by our recurrent network for opening the chemical bottle.

Trajectory Length   3        4        5        7        10

MSE                 0.0014   0.0005   0.0003   0.0003   0.0002

E. Trajectory Optimization

Table II shows mean squared tracking errors for different

trajectory lengths while opening the chemical bottle. The

first two time frames are constrained to match a previous

trajectory to allow for smooth trajectory replacement. For

each time step, the non-linear problem is solved to conver-

gence. Optimizing only a single new robot pose or very short

trajectories leads to high tracking errors. The errors quickly

decrease if the trajectory length is increased.
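The role of the two constrained time frames can be illustrated with a toy example. The sketch below is not the non-linear optimizer used in this work: it replaces the kinematic and dynamic constraints with a plain linear least-squares problem that balances goal tracking against a discrete acceleration penalty, and it keeps the first two frames fixed to a previous trajectory. All function names, the smoothness weight, and the example trajectory are assumptions made for illustration.

```python
import numpy as np


def optimize_window(goal, previous, smooth_weight=1.0, n_fixed=2):
    """Toy linear stand-in for one trajectory replacement step.

    goal, previous: arrays of shape (n_frames, n_dims).
    The first n_fixed frames are copied from `previous`; the remaining
    frames minimize tracking error plus a discrete acceleration penalty.
    """
    goal = np.asarray(goal, dtype=float)
    previous = np.asarray(previous, dtype=float)
    n_frames, n_dims = goal.shape
    n_free = n_frames - n_fixed
    if n_free < 1:
        raise ValueError("window must contain at least one free frame")

    rows, rhs = [], []

    # Tracking terms: every free frame should match its goal pose.
    for t in range(n_fixed, n_frames):
        row = np.zeros(n_free)
        row[t - n_fixed] = 1.0
        rows.append(row)
        rhs.append(goal[t])

    # Smoothness terms: penalize x[t-1] - 2 x[t] + x[t+1].
    scale = np.sqrt(smooth_weight)
    for t in range(1, n_frames - 1):
        row = np.zeros(n_free)
        target = np.zeros(n_dims)
        for index, coeff in ((t - 1, 1.0), (t, -2.0), (t + 1, 1.0)):
            if index < n_fixed:
                # Fixed frames are known, so they move to the right-hand side.
                target -= scale * coeff * previous[index]
            else:
                row[index - n_fixed] += scale * coeff
        rows.append(row)
        rhs.append(target)

    free, *_ = np.linalg.lstsq(np.vstack(rows), np.vstack(rhs), rcond=None)
    return np.vstack([previous[:n_fixed], free])


def tracking_mse(trajectory, goal):
    """Mean squared error between an optimized trajectory and its goal."""
    return float(np.mean((np.asarray(trajectory) - np.asarray(goal)) ** 2))


# Hypothetical 2-D goal trajectory along a straight line, window of 5 frames.
goal = np.linspace([0.0, 0.0], [0.4, 0.2], 5)
previous = goal.copy()
window = optimize_window(goal, previous, smooth_weight=1.0)
error = tracking_mse(window, goal)
```

With only three frames, a single free pose has to absorb both the tracking and the smoothness terms, which gives an intuition for why very short windows track poorly; the numbers in Table II come from the real system and are not reproduced by this sketch.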

F. Training Time

The neural networks are trained using TensorFlow [29] on

an NVIDIA GTX 1080. While the feed-forward network and

the model-based approach can learn successful manipulation

policies in about 30 minutes, the recurrent network requires

approximately three hours of training.

X. IMPLEMENTATION

The components of our system are implemented as ROS

[30] nodes and libraries. For neural networks, we use

TensorFlow [29], Python [31], and Keras [32]. The trajectory

optimizer, calibration tools, and the trajectory reconstruction

method are implemented in C++ using Eigen [33] for linear

algebra. Robot models and states are exchanged as MoveIt

[34] objects. For execution, we use ros_control [35], FRI

[36], ur_modern_driver [37], ur_robot_driver, the EtherCAT

interface of the C6 hand, and a custom driver for the C5 hand.

XI. CONCLUSION AND FUTURE WORK

We introduced a novel learning and control framework

that allows human teachers to train humanoid robotic ma-

nipulators by demonstrating tasks using their own hands

with real objects. We successfully tested our approach

on multiple tasks and robots.

Three neural network architectures were presented. A

feed-forward policy network was able to successfully learn

a pick-place, a cleaning, and a bottle-opening task. A

different bottle-opening task could not be finished by the

feed-forward network. Our recurrent networks completed

the bottle opening task by learning to automatically tran-

sition from a periodic turning motion to a final pick-and-

place motion. Our trajectory-based training and data aug-

mentation methods allow the system to learn stable neural

policies that can automatically adapt to modified object

poses from limited amounts of data. As demonstrated by

the pick-place and the wiping task, our system can not only

produce approach motions but also learn to automatically

generate trajectories between objects.

We found that it is possible to learn local object models

which are sufficiently accurate for model-based policy op-

timization directly from demonstration data. In contrast to

previous work based on reinforcement learning, our method

does not require the user to program task-specific reward

functions or simulation environments. By substituting the

robot with simplified but differentiable template models, we

were able to use efficient gradient-based training, and we

could focus our machine learning efforts on task information.

Our trajectory optimizer is fast enough for online control

of hand-arm systems with many degrees of freedom. If

only a single robot state is optimized at a time, as in

inverse kinematics, tracking errors increase and the robot

consistently fails during a bottle opening task. We also

use trajectory optimization to achieve stable hand track-

ing. At runtime, unmodified objects can be manipulated

via learned keypoints. To prefer natural hand poses, we

introduced a learned regularizer.

While our policy networks already accept point lists as

input, we are currently using only small numbers of points

from the motion tracking system or from neural keypoint

detectors. In future work, we want to use point clouds from

depth cameras or raw color images. Tactile perception on our

instrumented gloves could be improved with high-resolution

matrix sensors and we would like to further investigate

methods for using tactile information. It would also be

interesting to test our system on a larger number of tasks.

We plan to further improve our software and to develop

it into a set of public open-source packages.

REFERENCES

[1] P. Beeson and B. Ames, “TRAC-IK: An open-source library for

improved solving of generic inverse kinematics,” in Proc. IEEE RAS

Humanoids Conference, Seoul, Korea, Nov. 2015.

[2] R. Smits, “KDL: Kinematics and Dynamics Library.” [Online].

Available: http://www.orocos.org/kdl

[3] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal,

“STOMP: Stochastic trajectory optimization for motion planning,” in

Proc. IEEE International Conference on Robotics and Automation,

2011, pp. 4569–4574.

[4] M. Zucker et al., “CHOMP: Covariant hamiltonian optimization for

motion planning,” The International Journal of Robotics Research,

vol. 32, pp. 1164–1193, Aug. 2013.

[5] J. Schulman et al., “Motion planning with sequential convex opti-

mization and convex collision checking,” The International Journal of

Robotics Research, vol. 33, pp. 1251–1270, Aug. 2014.

[6] I. Mordatch, Z. Popović, and E. Todorov, “Contact-invariant opti-

mization for hand manipulation,” in Proc. Eurographics conference

on Computer Animation, July 2012, pp. 137–144.

[7] R. Frisch, “The multiplex method for linear programming,” The Indian

Journal of Statistics, pp. 329–362, Sept. 1957.

[8] D. F. Shanno, “Who invented the interior-point method?” Documenta

Mathematica, Extra Volume: Optimization Stories, 2012.

[9] A. V. Fiacco and G. P. McCormick, Nonlinear programming: Sequen-

tial unconstrained minimization techniques. Society for Industrial

and Applied Mathematics, Jan. 1968.

[10] N. Karmarkar, “A new polynomial-time algorithm for linear program-

ming,” Combinatorica, vol. 4, no. 4, pp. 373–395, Dec. 1984.

[11] OpenAI et al., “Learning dexterous in-hand manipulation,” The Inter-

national Journal of Robotics Research, Aug. 2018.

[12] ——, “Solving rubik’s cube with a robot hand,” Oct. 2019.

[13] T. Li et al., “Learning to solve a rubik’s cube with a dexterous hand,”

in Proc. IEEE International Conference on Robotics and Biomimetics,

Dec. 2019.

[14] A. Rajeswaran*, V. Kumar*, et al., “Learning complex dexterous

manipulation with deep reinforcement learning and demonstrations,”

in Proc. Robotics: Science and Systems (RSS), June 2018.

[15] P. Abbeel and A. Ng, “Apprenticeship learning via inverse reinforce-

ment learning,” Proceedings, Twenty-First International Conference

on Machine Learning, ICML 2004, Sept. 2004.

[16] A. J. Ijspeert, J. Nakanishi, and S. Schaal, “Learning attractor land-

scapes for learning motor primitives,” in Proc. Advances in Neural

Information Processing Systems, Jan. 2002, pp. 1523–1530.

[17] A. Paraschos, C. Daniel, J. Peters, and G. Neumann, “Probabilistic

movement primitives,” in Proc. Advances in Neural Information Pro-

cessing Systems, Jan. 2013.

[18] A. Handa et al., “Dexpilot: Vision based teleoperation of dexter-

ous robotic hand-arm system,” in IEEE International Conference on

Robotics and Automation, 2020.

[19] S. Li et al., “Vision-based teleoperation of shadow dexterous hand

using end-to-end deep neural network,” in Proc. IEEE International

Conference on Robotics and Automation, 2019.

[20] ——, “A mobile robot hand-arm teleoperation system by vision and

IMU,” in Proc. IEEE/RSJ International Conference on Intelligent

Robots and Systems, in press.

[21] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based

learning applied to document recognition,” Proceedings of the IEEE,

vol. 86, pp. 2278 – 2324, Dec. 1998.

[22] A. Howard et al., “MobileNets: Efficient convolutional neural net-

works for mobile vision applications,” arXiv, Apr. 2017.

[23] R. Charles, H. Su, M. Kaichun, and L. Guibas, “Pointnet: Deep

learning on point sets for 3d classification and segmentation,” in IEEE

Conference on Computer Vision and Pattern Recognition, July 2017,

pp. 77–85.

[24] P. J. Huber, “Robust estimation of a location parameter,” Annals of

Mathematical Statistics, vol. 35, no. 1, pp. 73–101, Mar. 1964.

[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-

tion,” in Proc. International Conference for Learning Representations,

Dec. 2014.

[26] F. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer,

“Speeded up detection of squared fiducial markers,” Image and Vision

Computing, vol. 76, June 2018.

[27] A. Bernardino, M. Henriques, N. Hendrich, and J. Zhang, “Precision

grasp synergies for dexterous robotic hands,” in Proc. IEEE Interna-

tional Conference on Robotics and Biomimetics, Dec. 2013, pp. 62–67.

[28] N. Hendrich and A. Bernardino, “Affordance-based grasp planning

for anthropomorphic hands from human demonstration,” in Proc.

ROBOT2013: First Iberian Robotics Conference, 2014, pp. 687–701.

[29] M. Abadi, A. Agarwal, P. Barham, et al., “TensorFlow: Large-

scale machine learning on heterogeneous systems,” 2015. [Online].

Available: http://tensorflow.org/

[30] M. Quigley et al., “ROS: an open-source robot operating system,” in

ICRA Workshop on Open Source Software, 2009.

[31] G. van Rossum, “Python tutorial,” Centrum voor Wiskunde en Infor-

matica (CWI), Amsterdam, Tech. Rep. CS-R9526, May 1995.

[32] F. Chollet et al., “Keras,” 2015. [Online]. Available: https://keras.io

[33] G. Guennebaud et al., “Eigen v3,” http://eigen.tuxfamily.org, 2010.

[34] D. Coleman, I. Sucan, S. Chitta, and N. Correll, “Reducing the barrier

to entry of complex robotic software: a moveit! case study,” Journal

of Software Engineering for Robotics, Apr. 2014.

[35] S. Chitta et al., “ros_control: A generic and simple control framework

for ROS,” The Journal of Open Source Software, Dec. 2017.

[36] G. Schreiber, A. Stemmer, and R. Bischoff, “The fast research interface

for the kuka lightweight robot,” in IEEE Workshop on Innovative

Robot Control Architectures for Demanding (Research) Applications

(ICRA 2010), May 2010.

[37] T. Andersen, Optimizing the Universal Robots ROS driver. Technical

University of Denmark, Department of Electrical Engineering, 2015.
