Integrating Multi-Purpose Natural Language
Understanding, Robot’s Memory, and Symbolic
Planning for Task Execution in Humanoid Robots
Mirko Wächterᵃ·*, Ekaterina Ovchinnikovaᵃ, Valerij Wittenbeckᵃ, Peter
Kaiserᵃ, Sandor Szedmakᵈ, Wail Mustafaᶜ, Dirk Kraftᶜ, Norbert Krügerᶜ,
Justus Piaterᵇ, Tamim Asfourᵃ

ᵃ Karlsruhe Institute of Technology (KIT), Adenauerring 2, 76131 Karlsruhe, Germany, lastname@kit.edu
ᵇ University of Innsbruck (UIBK), Technikerstr. 21a, 6020 Innsbruck, Austria, firstname.lastname@uibk.ac.at
ᶜ University of Southern Denmark (SDU), Campusvej 55, 5230 Odense M, Denmark, firstname@mmmi.sdu.dk
ᵈ Aalto University, Konemiehentie 2, 02150 Espoo, Finland, firstname.lastname@aalto.fi
Abstract
We propose an approach for instructing a robot using natural language to solve
complex tasks in a dynamic environment. In this study, we elaborate on a
framework that allows a humanoid robot to understand natural language, de-
rive symbolic representations of its sensorimotor experience, generate complex
plans according to the current world state, and monitor plan execution. The
presented development supports replacing missing objects and suggesting possi-
ble object locations. It is a realization of the concept of structural bootstrapping
developed in the context of the European project Xperience. The framework
is implemented within the robot development environment ArmarX. We eval-
uate the framework on the humanoid robot ARMAR-III in the context of two
experiments: a demonstration of the real execution of a complex task in the
kitchen environment on ARMAR-III and an experiment with untrained users
in a simulation environment.
Keywords: structural bootstrapping, natural language understanding, planning,
task execution, object replacement, humanoid robotics

* Corresponding author

Preprint submitted to Robotics and Autonomous Systems, November 21, 2017
1. Introduction
One of the goals of humanoid robotics research is to model human-like
information processing and the underlying mechanisms for dealing with the real
world. This especially concerns the ability to communicate and collaborate
with humans, to adapt to changing environments, and to apply available knowledge
in previously unseen situations. The concept of structural bootstrapping introduced
in the context of the Xperience project [1] addresses mechanisms of employing
semantic and syntactic similarity to infer which entities can replace each other
with respect to certain roles. For example, if the robot is asked to bring a
lemonade but cannot find it in the kitchen, the robot can suggest replacing
it with another beverage, e.g. a juice, based on the similarity of both objects
being drinkable. Thus, structural bootstrapping allows the robot to reason
about an observed novel entity and its potential functionality and features.
Earlier experiments demonstrated how structural bootstrapping can be applied
at different levels of a robotic architecture including a sensorimotor level, a
symbol-to-signal mediator level, and a planning level [2, 3, 4].
Structural bootstrapping is related to the concept of affordances, i.e. latent
“action possibilities” available to an agent towards an object, given the agent's capa-
bilities and the environment [5]. For example, a bowl can afford pouring into it
or stirring in it and a knife can afford cutting with it. Object affordances can
also be used to support object categorization and infer potential object replace-
ments. For example, if two containers afford pouring into them, then they are
interchangeable towards this action. In the context of structural bootstrapping,
the interplay between the symbolic encoding of the sensorimotor information,
prior knowledge, planning, plan execution monitoring, and natural language
understanding plays a significant role. Natural language (NL) commands and
comments can be used to set goals for the robot, update its knowledge, and
provide it with feedback. Symbolic representations of object affordances can be
used for object replacement. Symbolic representations of robot’s observations
are required to generate realistic plans. Plan execution monitoring is needed to
check if a plan was executed successfully and to solve encountered problems, e.g.
replace missing objects. The main aspects and contributions of the manuscript
are:
• We elaborate on a framework that allows the robot to understand natural
language, generate symbolic representations of its sensorimotor experi-
ence, generate complex plans according to the current world state, and
monitor plan execution. The framework is implemented within the robot
development environment ArmarX (armarx.humanoids.kit.edu), see also [6]. The developed natural
language understanding (NLU) pipeline is intended for a flexible multi-
purpose human-robot communication. Given human utterances, it gen-
erates goals for a planner, symbolic descriptions of the world and human
actions, and representations of human feedback. It grounds ambiguous NL
constructions into the sensorimotor experience of the robot and supports
complex linguistic phenomena, such as ambiguity, negation, anaphora,
and quantification without requiring training data. The NLU component
interacts with related components in a system architecture such as the
robot’s memory, a planner, and a replacement manager.
• We address the mapping of sensorimotor data to symbolic representations
required for linking the sensorimotor experience of the robot to NLU and
symbolic planning and describe how a symbolic domain description is gen-
erated from the robot memory each time an NL utterance needs to be
interpreted.
• We introduce the novel Replacement Manager (RM) component of the
framework, which is responsible for finding possible locations of missing
objects as well as replacing missing objects with suitable alternatives. The
RM utilizes a variety of replacement strategies based on the robot's previous
experience, common-sense knowledge extracted from text corpora, visual
object features, and human feedback.
Parts of the cognitive architecture presented in this manuscript rely on our
previous work. The concept of structural bootstrapping is introduced in the
European project Xperience and in [4]. The ArmarX framework is introduced
in [6]. Our NLU pipeline and the mapping of sensorimotor data to symbolic rep-
resentations are first presented in [7]. The common sense knowledge extraction
from text is described in [8], while the first experiment on object replacement
based on text-derived knowledge is described in [3]. Vision-based object re-
placement is in focus of [9]. The learning framework for computing probable
replacements is outlined in [10, 11].
The novelty of this manuscript lies in the realization of the concept of structural
bootstrapping within a control architecture of a humanoid robot and the demon-
stration of how this concept allows for grounded cognitive behaviors in a complex
setting requiring human-robot communication and collaboration. More specif-
ically, we introduce a novel component of the architecture, the Replacement
Manager, which implements both previously developed and novel replacement
strategies, see Sec. 5. We extend existing components of the architecture, such
as NLU and plan execution and monitoring, with novel functionality allowing
the components to interact with the RM, see Sec. 4 and 6.
For testing the framework, we design and conduct two novel experiments,
see Sec. 7. We test the developed framework on the humanoid robot ARMAR-
III [12] in a scenario requiring planning based on human-robot communication.
Two experiments are described. First, we present a demonstration of the sce-
nario execution on ARMAR-III in a kitchen environment. Second, we test how
well the framework can be employed by untrained users. To do so, we ask the
subjects to control the robot in a visual simulation environment by using natural
language.
The remainder of the article is structured as follows. After presenting the
general system architecture in Sec. 2, we present the domain description gener-
ation from the robot’s memory (Sec. 3). Sec. 4 introduces the natural language
understanding pipeline and Sec. 5 presents the Replacement Manager. Sec. 6
briefly presents the planner employed in the experiments and discusses how plan
execution and monitoring are organized in our framework. Experiments on the
humanoid robot ARMAR-III and in a visual simulation environment are pre-
sented in Sec. 7. Related work is discussed in Sec. 8, while Sec. 9 concludes the
manuscript.
2. System Architecture
The system architecture is implemented within the robot development envi-
ronment ArmarX [6] and is shown in Figure 1. The system architecture consists
of six major building blocks: robot’s memory, domain generation, natural lan-
guage understanding, replacement manager, planning, as well as plan execution
and monitoring.
Figure 1: The system architecture. The Replacement Manager is a novel component; the NL understanding and plan execution & monitoring components are extended.
The robot’s memory is represented within MemoryX, one of the main mod-
ules of the ArmarX architecture (see Sec. 3). The memory stores and offers sym-
bolic and sub-symbolic information about prior knowledge, long-term knowledge
and knowledge about the current world state. Domain descriptions in symbolic
form are generated from the robot’s memory (see Sec. 3). The domain de-
scriptions are also used by the NL understanding component for grounding and
generating the domain knowledge base, see Sec. 4. The developed multi-purpose
NLU framework can distinguish between a) direct commands that can be exe-
cuted without planning (Go to the table), b) commands that require planning and are
converted into planner goals, which are processed by a planner (Set the table),
c) descriptions of the world that are added to the robot’s memory and used by
the planner (The cup is on the table), and d) human feedback (I’m fine with
it). The goal generated by the NLU component is passed to the Replacement
Manager that checks for each object in the goal if this object and its location
are in the domain description (Sec. 5). If an object or its location is unknown,
then the component suggests available alternatives. NLU also provides the Re-
placement Manager with context-based affordances for mentioned objects. After
suggesting a replacement, the Replacement Manager waits for a human confir-
mation or rejection, which is provided by the NLU component. If a replacement
is confirmed, then the goal is rewritten and passed to the planner together with
the domain description. The planner takes a domain description and a goal as
an input. It generates a plan, which is a sequence of grounded atomic actions
(see Sec. 6.1). Plan execution is performed by the Plan Execution & Monitoring
component, which also verifies if the plan is executed correctly (see Sec. 6.2).
Each time an NL utterance is registered and processed by the NLU pipeline
and the planner, the robot’s memory is updated and the required actions are
added to the task stack to be processed by the Plan Execution & Monitoring
component. If the plan execution fails because of missing objects, the com-
ponent queries the Replacement Manager for suitable replacements; for other
failures, the planner is called to re-plan according to the current world state. Human
comments and feedback can be used to update the world state in the robot’s
memory before or during action execution and are considered by the robot to
adjust its plan accordingly.
2.1. Integration into the robotic platform
Integration of existing algorithms in a complex robotic platform poses sev-
eral challenges. First of all, algorithms, e.g. for object localization, that are
used on robotic service platforms in a dynamic environment, rather than in a
controlled lab environment, have to be highly robust and real-time capable. An-
other challenge concerns the interfaces between the algorithms and the robotic
hardware.
In the robot development environment ArmarX employed for the proposed
system, generic interfaces were designed in an Interface Definition Language
(IDL) to allow seamless exchange of implementations of different algorithms.
Communication between different programming languages is realized with the
Internet Communications Engine (Ice) middleware [13], which provides transparent
network communication and interfacing between many different programming
languages with minimal overhead. New components can be easily added to the
system with the minimal effort of implementing a corresponding interface. For
example, new replacement strategies can be added to the Replacement Manager
dynamically and will be used in future calls of the Replacement Manager. More
details on ArmarX can be found in [6].
3. Robot’s Memory and Domain Description Generation
The goal of the domain generation component is to map sensory data to sym-
bolic representations. Since each symbol depends on a different combination of
sensory data, we design the mapping procedure in a modular and extensible
way. First, sensorimotor experience is turned into continuous sub-symbolic rep-
resentations (e.g., coordinates of objects, the robot, and robot’s hands) that are
added to the robot’s memory. These continuous representations are mapped
to object and location names. Finally, symbolic representations of the world
state, expressed as predicates, are generated.
3.1. Memory Structure
MemoryX, a central part of the robot development environment ArmarX, is
responsible for storing and representing different types of robot knowledge in
different memories: prior knowledge memory, long-term memory, and working
memory that provide symbolic entities like actions, objects, states, and loca-
tions. Each memory element, called memory entity, is represented by a name-
value map. Memory entities are organized as a hierarchy, where every entity
can be a parent of another entity. For example, the type cup can have a parent
container, which has a parent object. If a parent has a feature attached,
this feature is also available for its children. This hierarchy is also used as an
ontology to define meta classes and specify common affordances or attributes of
objects. For example, in order to specify that objects from a certain class are
graspable, we add the class graspable as a parent for this class.
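As an illustration of this inheritance mechanism, the following minimal Python sketch (the class and names are ours for illustration, not the MemoryX API) shows how a feature attached to a parent entity becomes available to its children:

class Entity:
    """Illustrative memory entity: a name-value map with a parent link."""
    def __init__(self, name, parent=None, **features):
        self.name = name
        self.parent = parent
        self.features = dict(features)

    def get(self, key):
        # A feature attached to a parent is also available for its children.
        if key in self.features:
            return self.features[key]
        return self.parent.get(key) if self.parent else None

    def is_a(self, type_name):
        # Transitive type membership along the parent chain.
        if self.name == type_name:
            return True
        return self.parent.is_a(type_name) if self.parent else False

obj = Entity("object")
graspable = Entity("graspable", parent=obj, grasp_known=True)
container = Entity("container", parent=graspable)
cup = Entity("cup", parent=container)

assert cup.is_a("object") and cup.get("grasp_known")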
The prior knowledge contains persistent data inserted by the developer, e.g.
accurate object 3D models, environmental models, and affordances as object-
action associations extracted, for example, from text (see Sec. 5). The long-term
memory consists of knowledge stored persistently, e.g. common object loca-
tions that are learned from the robot’s experience during task execution and
persistently stored as heat maps [14]. The working memory contains volatile
knowledge about the current world state, e.g. object existence and position
or relations between entities. The working memory serves as an intermediate
data storage between sensorimotor experience and the symbolic high-level rep-
resentation. The working memory is updated by components like the robot
self-localization, object localization, or natural language understanding, when-
ever they receive new information.
To deal with uncertainties in sensory data, each memory entity value is
accompanied by a probability distribution. In the case of object locations, new
data is fused with the data stored in the memory using a Kalman filter.
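As a sketch of such a fusion step (a scalar, per-axis Kalman update; the actual memory works with full probability distributions over poses), consider:

def kalman_fuse(mean, var, measurement, meas_var):
    """Illustrative 1D Kalman update: fuse the stored position estimate
    (mean, var) with a new localization result (measurement, meas_var)."""
    gain = var / (var + meas_var)
    return mean + gain * (measurement - mean), (1.0 - gain) * var

# stored: cup at x = 1.20 m (variance 0.04); new, more certain: x = 1.35 m
print(kalman_fuse(1.20, 0.04, 1.35, 0.01))  # estimate moves towards 1.35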
3.2. Mapping sensorimotor data to symbols
In order to map sensorimotor experience to symbols, in this study, we em-
ploy the raw sensory data used by the components of self-localization, visual
object recognition, and robot kinematics. For the self-localization, we use laser
scanners and a 2D map of the environment. The self-localization is used to nav-
igate on a labeled 2D graph, in which location labels are defined by a 2D pose
of the location and a variance of the pose. For the visual object recognition,
we use the RGB stereo vision with the texture-based [15] or color-based [16]
approaches. The robot kinematics is used to calculate the positions of the robot's
hands, which is used to determine if an object is grasped. Since the robot does
not have sensors in its hands, we assume that the object is grasped if the pose
of the object in the working memory is close enough to the robot hand pose.
The mapping of continuous sensory data into discrete symbolic data is done
by predicate providers. Each world state predicate is defined in its own predicate
provider module, which outputs a predicate state (unknown, true, false) by
evaluating the content of the working memory or low-level sensorimotor data.
Examples of the predicate providers are: grasped represents an object being
held by an agent using a hand; handEmpty represents the state of a robot's hand;
objectAt and agentAt represent object and robot locations, respectively;
leftGraspable and rightGraspable represent the fact that an object at a
certain location can be grasped by the corresponding hand of the robot. Table 1
shows the full list of predicates that are calculated based on the sensor data.
Predicate providers can access other components (e.g., the working mem-
ory, robot kinematics information, long-term memory) to evaluate the predicate
state. For example, the objectAt predicate provider uses the distance between
the detected object coordinates and the center coordinates of the location label
to determine if an object should be considered to be currently at this location.
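To make this concrete, here is a hypothetical predicate provider in Python (the accessor methods and the threshold value are illustrative assumptions, not ArmarX interfaces):

import math

EPSILON = 0.5  # illustrative distance threshold in meters (ε₂ in Table 1)

def object_at(working_memory, obj, location):
    """Predicate provider sketch returning 'true', 'false', or 'unknown'."""
    obj_pos = working_memory.position_of(obj)        # hypothetical accessor
    loc_center = working_memory.center_of(location)  # hypothetical accessor
    if obj_pos is None:
        return "unknown"                             # object not localized yet
    dist = math.dist(obj_pos, loc_center)
    return "true" if dist < EPSILON else "false"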
Only those objects that are required for fulfilling a particular task are tracked
during the action execution. Higher level components operating on a symbolic
level generate requests for a particular object to be recognized at a particular
location. Other objects are not tracked to reduce the system load and to avoid
false positive object recognitions.

Table 1: Predicates calculated from sensor data.

Predicate with types            | Description                          | Calculation description
inHand(object, hand, robot)     | object is in hand of robot           | distance between object and hand < ε₁
objectAt(object, location)      | object is at landmark location       | distance between object and location < ε₂
agentAt(robot, location)        | robot is at landmark location        | distance between robot and location < ε₃
handEmpty(robot, hand)          | hand of robot is empty               | distance of hand to all objects > ε₄
grasped(robot, hand, object)    | object is grasped with hand of robot | distance of hand to object < ε₅
leftGraspable(object)           | a left-hand grasp is known for object and object is currently at a location suitable for the left hand  | object position in bounding box
rightGraspable(object)          | a right-hand grasp is known for object and object is currently at a location suitable for the right hand | object position in bounding box

Table 2: Non-observable predicates.

Predicate with types              | Description
open(door)                        | door is open
clean(location)                   | cleaning action was performed at location
stirred(container)                | stirring action was performed in container
substanceIn(substance, container) | substance was poured into container
stackable(location)               | more than one object can be located at location (e.g. sink)
toPutAway(object)                 | object needs to be put away to a predefined location (e.g. dirty utensils go into the sink)
inHandOfHuman(object, human)      | object was handed to human
The validity of the generated predicates relies both on the sensor data and
the memory state. Let us consider the grasped predicate as an example. The
employed robot ARMAR-III does not have any sensors in its hands. Therefore,
the grasped predicate is defined by the distance between the object and the
hand. The localization algorithms we apply are not precise enough to localize
the grasped object in the hand, which makes it difficult to determine whether
the object is grasped or not. We solve this problem by introducing the virtual
attachment of the grasped object to the hand. By default, the grasping action
is expected to be successful. Thus, it is assumed that the grasped object is
virtually attached to the hand, i.e. it is moving synchronously with the hand
with an increasing position uncertainty, until new localization data is available.
This strategy can fail if the grasping action was not successful. In this case, the
robot keeps assuming that the object is in the hand, until the object is visually
localized as still being at the original location.
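A minimal sketch of this virtual-attachment bookkeeping (the data structure and the uncertainty growth rate are assumptions made for illustration):

from dataclasses import dataclass

@dataclass
class TrackedObject:
    pose: tuple       # assumed (x, y, z) position in the working memory
    variance: float   # position uncertainty

def update_grasped(obj, hand_pose, dt, localization=None):
    """Sketch of the virtual attachment: while grasped, the object follows
    the hand pose with growing uncertainty until a localization arrives."""
    if localization is not None:
        obj.pose, obj.variance = localization   # observation overrides
    else:
        obj.pose = hand_pose                    # move synchronously with hand
        obj.variance += 0.001 * dt              # assumed growth rate
    return obj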
Additionally, predicates that cannot be perceived by the robot, called non-
observable predicates, are tracked and stored in the robot’s memory. These
predicates are either inserted by the human via speech or are inferred from the
actions of the robot. An example of a non-observable predicate is open applied
to a door. Currently, ARMAR-III cannot perceive a door being open or closed.
However, it needs to access the state of the door in order to plan, for example, for
grasping an object from the fridge. By default, the robot assumes that all doors are
closed. After it has performed the door opening action, it adds the predicate
open applied to the door into its memory. The human can also update the
memory of the robot by saying ”The door of the fridge is open/closed”. In the
experiments described in Sec. 7, we use non-observable predicates that are listed
in Table 2.
3.3. Domain Description Generation
Figure 2 shows how the robot’s memory is used to generate a symbolic do-
main description consisting of static symbol definitions and problem specific
definitions. The symbol definitions consist of types, constants, predicate def-
initions, and action descriptions, while the problem definitions consist of the
symbolic representation of the current world state represented by predicates
and the goal state that should be achieved. Types enumerate available agents,
hands, locations, and object classes stored in the prior knowledge. Constants
represent available instances, on which actions can be performed, and are gen-
erated using entities in the working memory. Each constant can have multiple
types, such that one is the actual type of the corresponding entity, and others
are parents of that particular type including transitive parentship. For exam-
ple, instances of the type cup are also instances of graspable and object.

Figure 2: Components involved in the domain description generation.

This type hierarchy is important for specifying actions over a particular set of types,
e.g., the grasping action has the type graspable as a parameter to ensure that
grasps are only planned on graspable objects. The domain generator derives ac-
tion representations from the long-term memory, where they are associated with
specific robot skills represented by statecharts as described in [17]. Each action
is associated with a set of preconditions and effects represented by predicates.
The generated domain description is used by the NLU component as well as
by the Replacement Manager and the planning component. The NLU compo-
nent uses the domain description to create a knowledge base and to ground NL
references and the Replacement Manager uses the description to check if objects
in the planner goal and their locations are known to the robot. The planning
component uses it as the knowledge base for plan generation.
4. Multi-Purpose Natural Language Understanding
The purposes of the NL understanding component are a) to ground NL ut-
terances to actions, objects, and locations stored in the robot’s memory, b) to
distinguish between commands, descriptions of the world, and human feedback,
c) to provide context-based affordances for mentioned objects, and d) to gen-
erate representations of each type of the NL input suitable for the downstream
components (goal for the planner, object context for the Replacement Manager,
feedback for the plan execution and monitoring component). Our approach is
based on abductive inference, which interprets NL utterances as obser-
vations by linking them to known or assumed facts, see [18].
The NL understanding pipeline shown in Fig. 3 consists of the following
processing modules. The speech input is processed by a speech recognition
component² that converts it into text. The text is then processed by a se-
mantic parser that outputs a logical representation of it. This representation
together with observations stored in the robot’s memory and the lexical and
domain knowledge base constitute an input for an abductive reasoning engine
that produces a mapping to the domain, i.e. symbolic labels known to the
robot. The mapping is further classified and post-processed. The pipeline is
flexible, i.e. each component can be replaced by an alternative. We use the
implementation of the abduction-based NLU that was developed in the context
of knowledge-intensive large-scale text interpretation [20].
Figure 3: Natural language understanding pipeline: speech recognition produces text, the semantic parser outputs a logical form, the abductive reasoner (consulting the robot's memory and the lexical & domain knowledge base) produces a domain mapping, which is then classified by type (goal, world, command) and post-processed.
² In the experiments described in this manuscript, we used the speech recognition system
presented in [19].
4.1. Logical form
We use logical representations of NL utterances described in [21]. In this
framework, a logical form (LF) is a conjunction of propositions and variable
inequalities, which have argument links showing relationships among phrase
constituents. For example, the following LF corresponds to the command Bring
me the juice from the table:
∃e1, x1, x2, x3, x4 (bring-v(e1, x1, x2, x3) ∧ thing(x1) ∧ person(x2) ∧ juice-n(x3) ∧
table-n(x4) ∧ from-p(e1, x4)),

where the variables xᵢ refer to the entities thing, person, juice, and table, and the variable
e1 refers to the eventuality of x1 bringing x3 to x2; see [21] for more details.
In the experiments described below, we used the Boxer semantic parser [22].
Alternatively, any dependency parser can be used if it is accompanied by an LF
converter as described in [23].
4.2. Abductive inference
Abduction is inference to the best explanation. Formally, logical abduction
is defined as follows:
Given: background knowledge B and observations O, where both B and O are
sets of first-order logical formulas,

Find: a hypothesis H such that H ∪ B ⊨ O and H ∪ B ⊭ ⊥, where H is a set of
first-order logical formulas.
Abduction can be applied to discourse interpretation [18]. In this framework,
logical forms of the NL utterances represent observations, which need to be
explained by the background knowledge. The reasoner tries to link parts of
the logical form to what is already known from the overall context and the
background knowledge, and it introduces assumptions if it is provided with
incomplete information. The reasoner prefers minimal hypotheses
to those that introduce more assumptions.
Suppose the command Bring me the juice from the table is turned into an
observation o_c. If the robot's memory contains an observation of a particular
instance of juice being located on the table, this observation will be concate-
nated with o_c and the noun phrase the juice will be grounded to this instance
by the abductive reasoner.
Another example is the disambiguation between a command (Put the juice
on the table) and a world description (The juice is on the table), which depends
on the presence of the action predicate. This disambiguation can be performed
by using the following two background axioms:
goal#objectAt(e1, x1, x2) → put-v(e1, Robot, x1) ∧ on-p(e1, x2)
world#objectAt(x1, x2) → on-p(x1, x2),
where prefixes goal# and world# indicate the type of information conveyed by
the corresponding linguistic structures. In the case of the command, the first
axiom will be applied, because it will explain more atomic observables (put and
on). This axiom represents the fact that commands like Robot, put x1 on x2
imply that there is a goal of x1 being located at x2. In the case of the world
description represented by the bare on prepositional phrase, the second axiom
will be applied. This axiom describes the semantics of the bare on prepositional
phrase not attached to a verb and represents the fact that x1is located at x2.
We use a tractable implementation of abduction based on Integer Linear
Programming (ILP) [20]. The reasoning system converts a problem of abduc-
tion into an ILP problem, and solves the problem by using efficient techniques
developed by the ILP research community. Typically, there exist many hy-
potheses explaining an observation. In the experiments described below, we
use the framework of weighted abduction [18] to rank hypotheses according to
plausibility and select the best hypothesis. This framework allows us to de-
fine assumption costs and axiom weights that are used to estimate the overall
cost of the hypotheses and rank them. As a result, the framework favors
minimal (shortest) hypotheses as well as hypotheses that link parts of obser-
vations together and support discourse coherence, which is crucial for language
understanding, see [24]. However, any other abductive framework and reasoning
engine can be integrated into the pipeline.
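As a toy illustration of this ranking (a drastic simplification of the ILP formulation of [20]: the cost model is reduced to summing assumption costs, and the example costs are invented):

def best_hypothesis(hypotheses):
    """Toy weighted-abduction ranking: each hypothesis is a dict mapping
    the literals it must assume to their costs; the cheapest one wins."""
    return min(hypotheses, key=lambda h: sum(h.values()))

# Interpreting "put the juice on the table": the command reading explains
# both put-v and on-p via one axiom (a single assumption) and is therefore
# cheaper than assuming both observed literals at full cost.
command_reading = {"goal#objectAt(e1,juice,table)": 1.2}
literal_reading = {"put-v(e1,Robot,juice)": 1.0, "on-p(e1,table)": 1.0}
print(best_hypothesis([command_reading, literal_reading]) is command_reading)  # True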
4.3. Lexical and domain knowledge base
In our framework, the background knowledge B is a set of first-order logic
formulas of the form

P₁^w₁ ∧ ... ∧ Pₙ^wₙ → Q₁ ∧ ... ∧ Qₘ,

where Pᵢ and Qⱼ are predicate-argument structures or variable inequalities and
wᵢ are axiom weights.³

³ See [23] for a discussion of the weights.
Lexical knowledge used in the experiments described below was generated
automatically from the lexical-semantic resources WordNet [25] and FrameNet
[26]. First, verbs and nouns were mapped to synonym classes. For example,
the following axiom maps the verb bring to the class of giving:
action#give(e1, agent, recipient, theme) →
bring-v(e1, agent, theme) ∧ to-p(e1, recipient)
Prepositional phrases were mapped to source, destination, location, instru-
ment, etc., predicates. Different syntactic realizations of each predicate for each
verb (e.g., from X, in X, out of X) were derived from syntactic patterns specified
in FrameNet that were linked to the corresponding FrameNet roles. See [27] for
more details on the generation of lexical axioms. A simple spatial axiom was
added to reason about locations, which states that if an object is located at a
part of a location (corner, top, side, etc.), then it is located at the location.
The synonym classes were further manually axiomatized in terms of domain
types, predicates, constants, and actions. For example, the axiom below is used
to process constructions like bring me X from Y :
goal#inHandOfHuman(e1, theme) ∧ world#objectAt(theme, loc) →
action#give(e1, Robot, recipient, theme) ∧ location#source(e1, loc),
which represents the fact that the command evokes the goal of the given object
being in the hand of the human and the indicated source is used to describe the
location of the object in the world. The prefixes (e.g., goal#,world#) indicate
the type of information conveyed by the corresponding linguistic structures.
The framework can also handle numerals, negation, quantifiers represented by
separate predicates in the axioms (e.g., not, repeat). For example, the following
axiom is used to process constructions like put N Xs on Y :
goal#objectAt(e1, theme, loc) ∧ #repeat(theme, n) →
action#puton(e1, Robot, theme, loc) ∧ card(theme, n)
The #repeat predicate is further used by the post-processing component that
multiplies predicates containing the corresponding variable (theme) n times.
Negation is represented by the predicate not, e.g., the following axiom maps
the adjective dirty to the domain:
not(e1) ∧ world#clean(e1, x) → dirty-a(e2, x)
Quantification is also represented by a separate predicate. The repeti-
tion, negation, and quantification predicates are further treated by the post-
processing component.
The hierarchy axioms (red cup → cup) and inconsistency axioms (red cup
xor green cup) were generated automatically from the domain description. Ev-
ery type-parent relation in the description was converted into a hierarchy axiom.
If two types share the same parent and do not share any instances, they were
declared to be inconsistent in the knowledge base.
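A small Python sketch of this axiom generation (the data layout is assumed; the rule is the one just described: same parent plus disjoint instances yields an inconsistency axiom):

def generate_axioms(parents, instances):
    """Sketch: derive hierarchy and inconsistency axioms from the domain.
    parents: {type: parent_type}; instances: {type: set of instance names}."""
    hierarchy = [f"{t} -> {p}" for t, p in parents.items()]
    inconsistency = []
    types = list(parents)
    for i, t1 in enumerate(types):
        for t2 in types[i + 1:]:
            same_parent = parents[t1] == parents[t2]
            disjoint = not (instances.get(t1, set()) & instances.get(t2, set()))
            if same_parent and disjoint:
                inconsistency.append(f"{t1} xor {t2}")
    return hierarchy, inconsistency

print(generate_axioms({"red_cup": "cup", "green_cup": "cup"},
                      {"red_cup": {"cup1"}, "green_cup": {"cup2"}}))
# (['red_cup -> cup', 'green_cup -> cup'], ['red_cup xor green_cup'])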
4.4. Object grounding
If objects are described uniquely, then they can be directly mapped to the
constants in the domain. For example, the red cup in the utterance Give me the
red cup can be mapped to the constant red cup if there is only one red cup in
the domain. However, redundant information that can be recovered from the
context is often omitted in the NL communication, see [28]. In our approach,
grounding of underspecified references is naturally performed by the abductive
reasoner interpreting observations by linking their parts together, see Sec. 4.2.
For example, given the text fragment The red cup is on the table. Give it to
me, the pronoun it in the second sentence will be linked to red cup in the first
sentence and grounded to red cup. To link underspecified references to earlier
object mentions in a robot-human interaction session, we keep all mentions
and concatenate them with each new input LF to be interpreted. Predicates
describing the world from the robot’s memory are also concatenated with LFs
to enable grounding. Given Bring me the cup from the table, the reference the
cup from the table will be grounded to an instance of cup observed as being
located on an instance of table.
If some arguments of an action remain underspecified or unspecified, then
the first instance of the corresponding type will be derived from the domain
description. For example, the execution of the action of putting things down
requires a hand to be specified. In the NL commands this argument is often
omitted (Put the cup on the table), because for humans it does not matter
which hand the robot will use. The structure putdown(cup, table, hand) is
generated by the NLU pipeline for this command. The grounding
function then selects the first available instance of the underspecified predicate.
In the future, we consider using a clarification dialogue, as proposed, for example,
in [29].
4.5. Context-based affordances
Object replacement depends on the corresponding object affordances. For
example, a spoon can be replaced by a fork in the context of eating or by a knife
in the context of stirring. Potential affordances can be extracted from the dialog
context. For example, if the human says I’d like to drink something. Bring me
some juice, then the affordance of the juice is drinking. In order to provide the
Replacement Manager with this information, we store all verbs mentioned in
the dialog session. When the NLU component generates a new planner goal, for
each domain object label represented by a noun phrase, it selects a verb possibly
representing its affordance. The verbs with the highest weights in the
common-sense affordance database generated from corpora (as described in
Sec. 5.1.1) are selected. For object labels for which there is no appropriate verb
in the dialog context, the top affordance from the common-sense affordance
database is provided. Basing the context-based affordance extraction on verbs
is a clear limitation, because any part of speech can refer to an affordance, e.g.,
I'm thirsty. Bring me some juice. In the future, we plan to employ lexical-semantic
and world knowledge to reason about affordances.
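The verb selection just described could look roughly as follows (a sketch; the database layout and fallback behavior are assumptions based on the description above):

def context_affordance(obj, dialog_verbs, affordance_db):
    """Sketch: pick the dialog verb with the highest common-sense score for
    obj; fall back to the top database affordance if no dialog verb fits.
    affordance_db: {(object, verb): normalized frequency score}."""
    candidates = [(affordance_db[(obj, v)], v)
                  for v in dialog_verbs if (obj, v) in affordance_db]
    if candidates:
        return max(candidates)[1]
    # fall back to the object's top affordance in the database
    entries = [(s, v) for (o, v), s in affordance_db.items() if o == obj]
    return max(entries)[1] if entries else None

db = {("juice", "drink"): 0.866, ("juice", "pour"): 0.51}
print(context_affordance("juice", ["drink", "bring"], db))  # 'drink'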
4.6. Classifier
The classifier takes into account prefixes assigned to the inferred predicates.
For example, the abductive reasoner returned the following mapping for the
command Bring me the cup from the table:
action#give(e1, x1, x2, x3) ∧ location#source(e1, x4) ∧ x1 = Robot ∧
x2 = Human ∧ goal#inHandOfHuman(e1, x3, x2) ∧ world#objectAt(x3, x4)
The classifier extracts predicates with prefixes and predicates related to the
corresponding arguments. The following structures will be produced for the
mapping above:
[goal: inHandOfHuman(cup,Human), world: objectAt(cup,table)]
Actions that do not evoke goals or world descriptions are interpreted by
the classifier as direct commands or human action descriptions depending on
the agent. For example, action#grasp(Human, cup) (I'm grasping the cup) will
be interpreted as a human action description, while action#grasp(Robot,cup)
(Grasp the cup) is a direct command.
The classifier can also handle nested predicates. For example, the utterances
1) Help me to move the table, 2) I will help you to move the table, 3) I will help
you by moving the table will be assigned the following structures, correspond-
ingly:
1. [direct command: helpRequest: [requester: Human,
   action: move(Robot, table)]]
2. [human action: help: [helpInAction: move(Robot, table)]]
3. [human action: help: [helpByAction: move(Human, table)]]
The human feedback is currently typed as agreement (e.g., I’m fine with it),
disagreement (e.g., No), or no information (e.g., I don’t know).
4.7. Post-processing
The post-processing component converts the extracted data into the format
required by the downstream modules. Direct commands and human feedback
are immediately processed by the Plan Execution & Monitoring component. Ob-
ject context is used by the Replacement Manager. World descriptions are added
to the robot’s working memory. Goals extracted from utterances are converted
into a planner goal format, so that the not predicate is turned into the corresponding
negation symbol, predicates that need multiplication (indicated by the #repeat
predicate) are multiplied, and quantification predicates are turned into quanti-
fiers. For example, the commands 1) Put two cups on the table and 2) Put all
cups on the table can be converted into the following goal representations in the
PKS syntax [30], correspondingly:
1. (existsK(?x1: cup, ?x2: table) K(objectAt(?x1,?x2)) & (existsK(?x3:
cup) K(objectAt(?x3,?x2)) & K(?x1 != ?x3)))
2. (forallK(?x1: cup) (existsK(?x2: table) K(objectAt(?x1,?x2))))
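A sketch of how the post-processor could unroll such goals into PKS syntax (the function is illustrative; the output format follows example 1 above):

def repeat_goal(obj_type, loc_type, n):
    """Sketch: build a PKS goal that requires n distinct instances of
    obj_type to be at a loc_type (mirrors example 1 above for n = 2)."""
    goal = (f"(existsK(?x1 : {obj_type}, ?x2 : {loc_type}) "
            f"K(objectAt(?x1,?x2))")
    previous = ["?x1"]
    for i in range(3, n + 2):          # further instances: ?x3, ?x4, ...
        distinct = " & ".join(f"K({p} != ?x{i})" for p in previous)
        goal += (f" & (existsK(?x{i} : {obj_type}) "
                 f"K(objectAt(?x{i},?x2)) & {distinct}")
        previous.append(f"?x{i}")
    return goal + ")" * n

print(repeat_goal("cup", "table", 2))
# (existsK(?x1 : cup, ?x2 : table) K(objectAt(?x1,?x2)) &
#  (existsK(?x3 : cup) K(objectAt(?x3,?x2)) & K(?x1 != ?x3)))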
4.8. Processing examples
In the following, we present processing steps for two example sentences.
For simplicity, we do not demonstrate object grounding in these examples. As
described in Sec. 4.4, the grounding is performed by concatenating logical forms
produced by the semantic parser with the earlier object mentions as well as
predicates corresponding to objects known to the robot.
In the example below, two background axioms are employed by the abductive
reasoner. Lexical axiom L1 is derived from FrameNet and maps the phrase
bring from to the giving action class and location indication. Domain axiom
D1, created manually, is used to map the giving action class to the goal of the
object being in the hand of a human.
Text: Bring me the juice from the table

Logical form: ∃e1, x1, x2, x3, x4 (bring-v(e1, x1, x2, x3) ∧ thing(x1) ∧
human(x2) ∧ juice-n(x3) ∧ table-n(x4) ∧ from-p(e1, x4))

Axioms applied:
L1: action#give(e1, agent, recipient, theme) ∧ world#objectAt(theme, location) →
bring-v(e1, agent, recipient, theme) ∧ from-p(e1, location)
D1: goal#inHandOfHuman(e1, theme, recipient) →
action#give(e1, robot, recipient, theme)

Abductive inference: goal#inHandOfHuman(e1, x3, x2) ∧ world#objectAt(x3, x4) ∧
juice-n(x3) ∧ table-n(x4) ∧ human(x2)

Classifier: [goal: inHandOfHuman(juice, human), world: objectAt(juice, table)]

Post-processor: [goal: (existsK(?x1 : juice, ?x2 : human)
K(inHandOfHuman(?x1, ?x2))), world: objectAt(juice, table)]
The next example illustrates the use of the #repeat predicate used to multiply
goal predicates.
Text: Put two glasses on the table

Logical form: ∃e1, x1, x2, x3 (put-v(e1, x1, x2) ∧ thing(x1) ∧
glass-n(x2) ∧ card(x2, 2) ∧ table-n(x3) ∧ on-p(e1, x3))

Axioms applied:
L1: action#puton(e1, agent, theme, location) →
put-v(e1, agent, theme) ∧ on-p(e1, location)
D1: goal#objectAt(e1, theme, location) ∧ #repeat(theme, n) →
action#puton(e1, robot, theme, location) ∧ card(theme, n)

Abductive inference: goal#objectAt(e1, x2, x3) ∧ #repeat(x2, 2) ∧
glass-n(x2) ∧ table-n(x3)

Classifier: [goal: objectAt(glass, table) & repeat(glass, 2)]

Post-processor: [goal: (existsK(?x1 : glass, ?x2 : table)
K(objectAt(?x1, ?x2)) & (existsK(?x3 : glass)
K(objectAt(?x3, ?x2)) & K(?x1 != ?x3)))]
5. Replacement Manager
The Replacement Manager (RM) has two aims: 1) to replace missing objects
with alternatives and 2) to suggest new potential locations for missing objects.
A welcome side product of the RM is that it serves as a preliminary feasibility
checker for the task before the planning process is started. If objects mentioned
in the goal are not present in the generated domain, the planner will not be able
to find a valid plan and thus does not need to be called.
The RM is evoked after the NLU component produces a goal for the planner
or if the plan execution fails because of a missing object (see Sec. 6.2). For
each object and location name occurring in the goal, the RM checks if they are
contained in the robot’s memory (MemoryX ), i.e. the names can be converted
into MemoryX types that have instances with specified locations. If an instance
or its location is missing, the RM attempts a replacement and rewrites the goal
before passing it to the planner.
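Schematically, this pre-check might look as follows (the helper names are hypothetical; the real RM operates on MemoryX types and the generated domain):

def precheck_goal(goal_names, domain_types, find_replacement):
    """Sketch of the RM feasibility pre-check: every object and location
    name in the goal must exist in the generated domain, otherwise a
    replacement is attempted before the planner is called at all."""
    substitutions = {}
    for name in goal_names:
        if name in domain_types:
            continue
        alternative = find_replacement(name)    # queries the strategies
        if alternative is None:
            return None                         # infeasible: skip planning
        substitutions[name] = alternative       # used to rewrite the goal
    return substitutions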
Figure 4 shows how the RM interacts with other components. A speech
command is processed by the NL Understanding component that generates a
goal for the planner and affordances for each object mentioned in the goal. The
RM queries the domain generator, replaces unknown objects in the goal with
known ones, and makes sure that there are valid locations for instances of
all objects mentioned in the goal by inserting object instance hypotheses into
the working memory. The planner will treat these the same as confirmed object
instances, but all actions using an object instance will fail during execution if
the object instance hypothesis is wrong.
If a suitable replacement has been found, the RM rewrites the goal. The
component passes the goal to the planner that generates a plan. The plan
execution is supervised by the Plan Execution & Monitoring component. If the
plan execution fails because of a missing object, the RM is called again.
The RM uses different replacement strategies, which vary with respect to
the considered input data and range from visual shape estimation to evaluation
of large text corpora. The replacement strategies are subdivided into object and
location replacement strategies as follows.
Figure 4: Interaction between the Replacement Manager and other components.
5.1. Object Replacement
Object replacement is performed when a) an object type mentioned in the
goal is unknown or b) a suitable object could not be found at any known location
of the object during the plan execution. We employ two object replacement
strategies based on shared common-sense affordances and shared visual features
as described below. Object replacement requires human feedback. Therefore,
the RM generates a confirmation question for the human and proceeds with the
replacement only if it was confirmed.
5.1.1. Common-sense affordances strategy
The strategy based on shared common-sense affordances employs typical
object affordances generated from textual corpora as described in [8]. For each
noun referring to an object in the domain, we extract affordances expressed by
verbs with assigned scores. This is done as follows. We use a parsed text corpus4
to extract verbs co-occurring with a given noun in the instrument role patterns:
VERB with (a/the)? NOUN (cut with a knife), NOUN for VERBing (knife
for cutting) and in the patient role patterns: VERB (a/the)? NOUN (cut the
bread). Stop words are excluded from consideration. As a result, we generate
an affordance database containing entries of the form ⟨object, affordance, role,
norm_freq⟩, where role can be instrument or patient and norm_freq is a nor-
malized frequency of the co-occurrence of object and affordance in the patterns,
e.g., ⟨juice, drink, patient, 0.866⟩. The affordance database is used to gener-
ate a replacement database consisting of tuples of the form ⟨object1, object2,
affordance, score⟩, indicating that object1 can be replaced by object2 towards
affordance with the confidence equal to score. For example, a spoon is most
likely to be replaced by a fork towards eating, while it is most likely to be
replaced by a stirrer towards stirring.
The confidence score is computed by a relational learning framework in two
steps. First, similarity for object pairs towards affordances is computed. Second,
a function is learned, which generalizes the known relations to all possible object
pairs not observed earlier. The similarity measure is defined as follows.
r(o₁, o₂ | a) = nf(o₁, a) · nf(o₂, a)   if (o₁, a) ∈ D and (o₂, a) ∈ D,
             = 0                        otherwise,                      (1)

⁴ In the experiments described below, the Google Books corpus was used,
http://storage.googleapis.com/books/syntactic-ngrams/index.html.
where D ⊆ O × A is a set of object-affordance pairs such that O is a set of objects
and A is a set of affordances, and nf(o, a) is the normalized frequency of the object
o and affordance a co-occurrence. Based on this similarity measure, a feature
vector for all object pairs can be constructed, φ(o₁, o₂) = (r(o₁, o₂ | a), a ∈ A).
Note that φ can be defined for cases not represented in D.
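In Python, eq. (1) and the feature vector φ can be sketched directly (the database layout is an assumption; the numbers are invented for illustration):

def r(o1, o2, a, D):
    """Similarity of eq. (1): D maps (object, affordance) pairs to the
    normalized co-occurrence frequency nf(o, a)."""
    if (o1, a) in D and (o2, a) in D:
        return D[(o1, a)] * D[(o2, a)]
    return 0.0

def phi(o1, o2, affordances, D):
    """Feature vector over all affordances for an object pair."""
    return [r(o1, o2, a, D) for a in affordances]

D = {("spoon", "eat"): 0.9, ("fork", "eat"): 0.8, ("spoon", "stir"): 0.6}
print(phi("spoon", "fork", ["eat", "stir"], D))  # [0.72, 0.0]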
At the second step, a set of functions connecting the object pairs to the
affordances is learned. This step is needed to propose replacements for object o
towards affordance a even if o and a do not co-occur in the affordance database.
It is also needed to learn replacements towards an unknown affordance. In
[3, 10, 11], the learning procedure and related applications are described in
detail. Let us sketch the main idea in this section.
In this approach, the similarity measure r(o₁, o₂ | a) is represented by proba-
bility density functions of a Gaussian distribution with expected value r(o₁, o₂ | a)
and a common standard deviation σ, i.e. ψ(r(o₁, o₂ | a)) = p(· | r(o₁, o₂ | a), σ).
The functions are defined for all affordances a ∈ A as f_a : O × O → F_σ, where
F_σ is the set of all probability density functions of normal distributions with
standard deviation σ. These functions can express the uncertainty of the simi-
larity measure. The functions {f_a | a ∈ A} are then represented by a linear
operator that maps feature vectors φ(o₁, o₂) of the object pairs into the space of F_σ.
The learning problem consists in maximizing the inner product between
the predicted and the given feature vectors of the similarity scores for all ob-
served object pairs and affordances. This optimization problem is an extension
of the Support Vector Machine, and the Maximum Margin Markov Networks
developed for structured output learning frameworks, see a description and sev-
eral alternatives in [31]. This optimization problem can be solved via its dual
form; the detailed procedure is provided in [32]. After solving the optimization
problem, the predicted similarity score for a tuple ⟨o₁, o₂, a⟩ can be computed as
follows:
ψ(r(o₁, o₂ | a)) = W_a φ(o₁, o₂),                                       (2)

where W_a is the optimal solution for affordance a. Since ψ as a feature vector is
a probability density function, the replacement score measure for an object pair
towards an affordance can be derived by taking the expectation with respect to
ψ, i.e.

r̃(o₁, o₂ | a) = E[ψ(r(o₁, o₂ | a))].                                    (3)
The NL Understanding component generates an affordance for each object
mentioned in the goal as described in Sec. 4. For the object that needs to
be replaced, the shared common-sense affordances strategy searches the replace-
ment database for the replacement towards the NLU-generated affordance that
has the highest score. Since the replacement database is generated from textual
corpora, it contains only object names represented by nouns ("cup" instead of
"bluecup"). The strategy uses the type hierarchy (e.g., "bluecup" → "cup" →
"container") for finding the replacement. For example, if
”bluecup” needs to be replaced, it will search for replacement options specified
for its parent in the hierarchy. Similarly, if ”cup” is suggested as a replacement,
the strategy will select its leaf child in the hierarchy as an actual replacement
candidate.
5.1.2. Visual features strategy
This strategy compares precomputed affordances of the object to be re-
placed, which are based on its visual features, with affordances extracted from
point cloud data during run-time in the robot’s current field of view. In [9], we
introduced, evaluated, and discussed the strategy presented here in more detail.
The affordances are estimated on the basis of object shapes. More specifically,
we utilize the shape representation based on global 3D descriptors to predict
functional properties of objects [33]. For instance, a container shape affords
pouring into it, dropping into it, etc. Fig. 5 shows ARMAR-III applying the
visual features strategy. The perceived point cloud data is projected into the
memory view of the robot. For each dense point cloud cluster (colored point
cloud data), the affordances based on the shape of the cluster are estimated.
In this example, the bowl as well as the basket afford pouring into, stirring,
and dropping into. The grey bowl represents the perceived pose of the bowl
produced by the object recognition system.
Figure 5: Affordances predicted based on shape representations.
Object shapes are described using histograms of relations between pairs of
3D features. First, we segment the scene using RGBD data obtained from the
Kinect sensor. For each segment, we extract planar 3D surface features called
3D texlets [34] that contain position and orientation. Objects are represented
by sets of pairwise relations, defined globally, for all pairs of texlets. We com-
pute geometric relations of two attributes: angle and scale-invariant distance
(i.e., normalized relative to the object size) between 3D texlets. The distance
relation is chosen to be scale-invariant, because what defines a object affor-
dance is usually independent of scale. The final object descriptor is obtained
by binning these two relations in a 2D histogram, which models the distri-
butions of the relations in fixed-sized feature vectors while considering their
co-occurrence. According to previous investigations [33], the binning size is set
to 12 in both dimensions resulting in a feature vector of 144 dimensions. The re-
sulting descriptor is highly discriminative, leading to fast learning and accurate
estimation.
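A rough sketch of this descriptor computation (assuming unit surface normals per texlet and using the maximum pairwise distance as the object-size normalizer, which is our assumption):

import numpy as np

def shape_descriptor(positions, normals, bins=12):
    """Sketch of the pairwise-relation descriptor: bin the angle and the
    scale-invariant distance of all texlet pairs into a 2D histogram,
    giving a bins x bins (here 144-dimensional) feature vector.
    Assumes at least two texlets with unit-length normals."""
    positions = [np.asarray(p, float) for p in positions]
    normals = [np.asarray(n, float) for n in normals]
    angles, dists = [], []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            cos = np.clip(np.dot(normals[i], normals[j]), -1.0, 1.0)
            angles.append(np.arccos(cos))
            dists.append(np.linalg.norm(positions[i] - positions[j]))
    dists = np.asarray(dists)
    dists = dists / (dists.max() + 1e-9)   # object-size normalization (assumed)
    hist, _, _ = np.histogram2d(angles, dists, bins=bins,
                                range=[[0.0, np.pi], [0.0, 1.0]])
    return (hist / hist.sum()).ravel()     # normalized 144-dim descriptor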
Affordance labels are learned using JointSVM [9]. JointSVM is equivalent
to Structural SVM, which is an extension of SVM for predicting structured
outputs, with a linear output kernel plus a regularization term on the kernel
[35]. As input kernels, we choose polynomial kernels based on previous tests, cf.
[9]. The estimation of both the kernel parameters and the internal parameters
of JointSVM is performed by cross-validation.
JointSVM outputs an indicator vector, i.e. a set of binary labels, defined
on the objects appearing in a scene. The learner treats the full output vector
as one entity, i.e. simultaneously predicts all object labels. The estimation of
the confidence of each affordance label is based on the assumption that if the
predicted vector is close to a known label vector in the training data, then the
confidence should be high, otherwise it is low; see [9] for the implementation.
For estimating if an unknown object with predicted affordances can replace
a known object with known affordances, we compute object similarity. The
similarity measure is defined as follows:
S(y(x)) = S_p(y(x)) − S_n(y(x)),                                        (4)

where y(x) are the predicted affordances, and S_p(y(x)), S_n(y(x)) are the positive
and the negative similarity metrics, respectively, cf. [9]. The positive similarity
accounts for the true positive predictions, y_p(x), while the negative similarity
accounts for the false positive predictions, y_n(x). Both y_p(x) and y_n(x) are
derived from y(x) by considering the known affordances. Then, S_p(y(x)) and
S_n(y(x)) are defined as follows:

S_p(y(x)) = AVG(y_p(x)) × TPR(y(x))                                     (5)

S_n(y(x)) = AVG(y_n(x)) × FPR(y(x))                                     (6)
where AVG indicates the mean value, and TPR and FPR indicate the true
positive rate (recall) and the false positive rate (fall-out), respectively. Based
on a pre-defined threshold, a potential replacement is estimated to be accept-
able or not. In the experiments described below, we used the threshold of 0.3,
estimated as described in [9].
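The acceptability test of eqs. (4)-(6) can be sketched as follows (binarizing the predictions at 0.5 and reading AVG as the mean predicted confidence are our assumptions):

import numpy as np

def replacement_score(pred, known):
    """Sketch of eqs. (4)-(6): pred holds per-affordance confidences in [0,1],
    known is the binary affordance vector of the object to be replaced."""
    pred, known = np.asarray(pred, float), np.asarray(known, int)
    tp = (pred >= 0.5) & (known == 1)     # assumed binarization at 0.5
    fp = (pred >= 0.5) & (known == 0)
    tpr = tp.sum() / max(known.sum(), 1)             # recall
    fpr = fp.sum() / max((known == 0).sum(), 1)      # fall-out
    s_p = (pred[tp].mean() if tp.any() else 0.0) * tpr   # eq. (5)
    s_n = (pred[fp].mean() if fp.any() else 0.0) * fpr   # eq. (6)
    return s_p - s_n                                     # eq. (4)

# accept the replacement if the score exceeds the threshold of 0.3
print(replacement_score([0.9, 0.8, 0.1], [1, 1, 0]) > 0.3)  # True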
Because it is possible to extract affordances during run-time, the robot can
select as a replacement alternative an object for which it otherwise does not
have any knowledge. It can also react to changes in the known objects. For
example, a container can be closed or open, which changes its affordances, e.g.,
it can be poured from or into only when it is open.
5.2. Location Replacement
Location replacement happens when the location of all instances of an object
type mentioned in the goal is unknown or when an object could not be found
during the plan execution. For the location replacement, the RM manipulates
the current working memory of the robot and inserts object instances as un-
confirmed hypotheses at the location suggested by the replacement strategy.
We employ three location replacement strategies based on 1) common locations
learned from the previous experience of the robot, 2) common-sense locations
obtained from textual corpora, 3) the human feedback.
5.2.1. Common locations strategy
The strategy based on common locations uses the feature of ArmarX that
allows the robot to learn typical object locations from its experience [14].
Since the information about object locations is stored in the robot’s memory
as a set of density distributions of points (see Fig. 6), the distributions have to
be mapped to symbolic location labels that can be processed by the planner.
To do so, we link a location label with the expected value of the corresponding
distribution. When the robot is close enough to the object and can actually see
it, then the assumed location is replaced by the actual observed object position.
This strategy is the most used location strategy, since the confidence of the
location hypotheses is high. When the robot is initialized, the locations of the
objects are unknown. Thus, when a command is received, a location hypothesis
needs to be generated for each implied object.
Figure 6: Previously seen locations of an object (green spheres), which are clustered into density
distributions in order to generate common object locations. The cluster on the right side
represents the top location hypothesis, since the object has been seen there more frequently.
5.2.2. Common-sense locations strategy
The strategy based on common-sense knowledge obtained from textual cor-
pora employs the method for extracting typical object locations from text as
described in [8]. For each pair of (object_label, location_label) in our domain,
such that the labels are expressed by nouns, we search the corpus for the
pattern "OBJECT_NOUN (be)? loc_prep LOC_NOUN", where loc_prep is a lo-
cation preposition, e.g., on, in, at. Using this method, we generate a location
database consisting of tuples of the form ⟨object, location, score⟩. Each tuple
proposes a location for a given object, with a confidence score corresponding to
the normalized frequency of their co-occurrence in the corpus. To increase the
likelihood of finding objects, this strategy queries the database not only for the
actual object type, but also all parents of that object type in the type hierarchy.
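A sketch of this hierarchy-aware lookup (the database layout and helper names are assumptions):

def suggest_locations(obj_type, parents, location_db, limit=3):
    """Sketch: query the location database for obj_type and, walking up the
    type hierarchy, for all its parents; return the best-scored locations.
    location_db: {(object, location): normalized frequency score}."""
    hits = []
    t = obj_type
    while t is not None:
        hits += [(s, loc) for (o, loc), s in location_db.items() if o == t]
        t = parents.get(t)          # e.g. "bluecup" -> "cup" -> "container"
    return [loc for s, loc in sorted(hits, reverse=True)[:limit]]

db = {("cup", "cupboard"): 0.7, ("cup", "sink"): 0.4, ("container", "shelf"): 0.3}
print(suggest_locations("bluecup", {"bluecup": "cup", "cup": "container"}, db))
# ['cupboard', 'sink', 'shelf']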
5.2.3. Human feedback strategy
The strategy based on human feedback uses object locations communi-
cated by the human, e.g., The corn is in the fridge. As described in Sec. 4, the
NL Understanding component handles world state descriptions including loca-
tion descriptions and updates the MemoryX working memory correspondingly.
Thus, the strategy consists of generating a question for the human asking for a
location of the missing object and monitoring the MemoryX working memory
updates. After MemoryX was updated, the RM invokes the planner, which
replans given the new information. This strategy is a fall-back mechanism that
is used only if all other strategies fail.
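The fall-back behavior can be sketched roughly as follows; ask_human, memory.location_of, and replan are hypothetical interfaces standing in for the speech output, MemoryX, and planner components.

import time

def human_feedback_strategy(missing_object, ask_human, memory, replan,
                            timeout_s=60.0):
    # Fall-back strategy: ask the human for the missing object's location,
    # then poll the working memory for the update and trigger re-planning.
    ask_human(f"Where is the {missing_object}?")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        hypothesis = memory.location_of(missing_object)  # set by the NLU
        if hypothesis is not None:
            return replan()
        time.sleep(0.5)
    return None  # no answer received; the strategy fails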
6. Planning and Plan Execution & Monitoring
6.1. Planning
We define a plan as a sequence of actions $P = \langle a_1, \ldots, a_n \rangle$ with respect to the initial state $s_0$ and the goal $G$ such that $\langle s_0, P \rangle \models G$. In the experiments described in Sec. 7, we used the state-of-the-art PKS planner [30]. Any other symbolic planner can be used instead.
The domain generation and NLU components of our system provide the
planner with a domain description (Sec. 3) and a goal (Sec. 4) represented in
the PKS syntax as input. Below we show the PKS representation of the state of the world corresponding to the robot being in the center of the kitchen and two cups being on the countertop. (For simplicity, this example omits the definitions of types, predicates, and constants required by PKS.)

agentAt(robot, kitchen_center),
handEmpty(robot, rightHand), handEmpty(robot, leftHand),
objectAt(cup1, countertop), objectAt(cup2, countertop),
leftGraspable(cup1), rightGraspable(cup1),
leftGraspable(cup2), rightGraspable(cup2)
The example below shows the PKS definition of the grasp action.
grasp(?a : robot, ?h : hand, ?l : surface, ?o : graspable) {
preconds:
(K(rightGraspable(?o)) & (existsK(?y : rightHand) K(?y = ?h)) |
(K(leftGraspable(?o)) & (existsK(?y : leftHand)K(?y = ?h)))) &
K(agentAt(?a, ?l)) &
K(objectAt(?o, ?l)) &
K(handEmpty(?a, ?h))
effects:
add(Kf, grasped(?a, ?h, ?o)),
del(Kf, objectAt(?o, ?l)),
del(Kf, handEmpty(?a, ?h))
}
The PKS planner returns sequences of grounded actions with their pre- and post-conditions. Given the domain description example above and the following goal: (existsK(?x1 : cup, ?x2 : table) K(objectAt(?x1, ?x2)) & (existsK(?x3 : cup) K(objectAt(?x3, ?x2)) & K(?x1 != ?x3))), corresponding to the command Put two cups on the table, it generates the plan below (the plan description is shortened for better readability):

move(robot, kitchen_center, countertop)
  pre:  agentAt(robot, kitchen_center)
        (kitchen_center != countertop)
  post: agentAt(robot, countertop)
grasp(robot, leftHand, countertop, cup1)
  pre:  agentAt(robot, countertop)
        objectAt(cup1, countertop)
        handEmpty(robot, leftHand)
        leftGraspable(cup1)
  post: grasped(robot, leftHand, cup1)
        !objectAt(cup1, countertop)
        !handEmpty(robot, leftHand)
grasp(robot, rightHand, countertop, cup2)
  ...
move(robot, countertop, table)
  ...
putdown(robot, leftHand, table, cup1)
  pre:  agentAt(robot, table)
        grasped(robot, leftHand, cup1)
  post: handEmpty(robot, leftHand)
        objectAt(cup1, table)
        !grasped(robot, leftHand, cup1)
putdown(robot, rightHand, table, cup2)
  ...
6.2. Plan Execution and Monitoring
The components described in the previous sections are employed by the Plan Execution and Monitoring (PEM) component. The PEM is the central coordination component for command execution. A simplified control flow for the execution of a planning task is shown in Fig. 7. The PEM is invoked when a new task is sent by an external component (e.g., the NLU component). Different types of tasks are accepted; for each task type, the PEM has an implementation of a ControlMode interface, which knows how to execute that task type. Currently, a task can be a single command or a list of goals that should be achieved. Here, we focus on the goal task type, which requires the planner to produce a plan. After a new task has been received, the PEM calls the Domain Generator to generate a new domain description based on the current world state (Sec. 3).
Figure 7: Plan generation, execution, and monitoring.
Figure 8: Visualization of the robot's working memory (left) and the robot in the real world (right) at the same moment during action execution. The working memory is continuously updated with perceived and predicted data. Only objects that are relevant for the current task or have been relevant for a previous task are tracked in the working memory.
This domain description, together with the received goal, is then passed to the Replacement Manager, which checks whether every object in the goal and its location are present in the domain description. In case of missing objects or object locations, it replaces them and rewrites the goal. The domain description and the rewritten goal are then passed to the planner. If no plan can be found, the PEM synthesizes feedback indicating the failure. A successful plan consists of a sequence of actions with bound variables as well as pre- and post-conditions of the actions, which are passed back to the PEM. These actions are executed one by one.
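The check-and-rewrite step of the Replacement Manager might look roughly like the following sketch; domain.known_objects, domain.location_of, and replace are hypothetical placeholders for the interfaces described above, not the actual implementation.

def prepare_goal(goal_objects, domain, replace):
    # Sketch of the Replacement Manager step: every object mentioned in the
    # goal and its location must appear in the domain description; otherwise
    # a substitute is chosen and the goal is rewritten accordingly.
    substitutions = {}
    for obj in goal_objects:
        if obj not in domain.known_objects or domain.location_of(obj) is None:
            candidate = replace(obj)          # e.g., "glass" -> "cup" (Sec. 5)
            if candidate is None:
                raise RuntimeError(f"no replacement found for {obj}")
            substitutions[obj] = candidate
    return substitutions  # applied to the goal before planning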
Each action that corresponds to a symbolic planning operator is associated
with an ArmarX statechart, which controls the action execution. The ArmarX
statecharts allow us to model robot actions in a hierarchical manner and to
specify control and data flow visually. Due to the hierarchy of statecharts, it is
possible to compose complex skills by using elementary or primitive skills, e.g.
following a trajectory with the tool center point of a robot arm. The primitive skills are based on services provided by the robot development environment ArmarX [6], such as inverse kinematics, motion planning, and the robot's memory. Each top-level action, i.e., the execution of a planning operator such as grasp, is manually designed with respect to the available sensors and algorithms. For example, the action open uses force-torque-based sensing for grasping the door handle and impedance control to open the door. The action grasp uses visual servoing [36] for precise execution with respect to the object localization. The actions stirring and wiping use motions learned from demonstration with a specialization of the Dynamic Movement Primitives action formalism presented in [37]. A detailed description of the ArmarX statecharts can be found in [17].
Action execution might fail because of uncertainties in perception and execution or changes in the environment. To account for such changes, the PEM verifies the pre-conditions of an action before execution and its effects after execution. If action execution fails because of a missing object, the PEM calls the Replacement Manager to replace the object or its location. The world state and the working memory are continuously updated based on sensor data and prediction models. Fig. 8 shows the visualization of the robot's working memory and the robot in the real world at the same moment during the execution of the skill putdown. The world state observer component is queried for the current world state after each action. If any mismatches between a planned world state and a perceived world state are detected, the plan execution is considered to have failed and re-planning is triggered based on the current world state. Additionally, the statecharts report whether they succeeded or failed; failing leads to re-planning. If an action was successfully executed, the next action is selected and executed. After task completion, the robot goes idle and waits for the next task.
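The control flow of Fig. 7 can be summarized in a compact sketch; all callables and action methods below are hypothetical stand-ins for the framework components described in this section, not the ArmarX API.

def execute_task(goal, generate_domain, plan, execute, world_state,
                 replace_missing, max_replans=10):
    # Simplified PEM control flow (cf. Fig. 7).
    for _ in range(max_replans):
        domain = generate_domain(world_state())        # Sec. 3
        actions = plan(domain, goal)                   # symbolic planner, Sec. 6.1
        if actions is None:
            return "failure: no plan found"
        for action in actions:
            if not action.preconditions_hold(world_state()):
                break                                  # world changed: re-plan
            success = execute(action)                  # ArmarX statechart
            if not success or not action.effects_hold(world_state()):
                if action.failed_due_to_missing_object():
                    replace_missing(goal)              # Replacement Manager, Sec. 5
                break                                  # re-plan from current state
        else:
            return "task achieved"
    return "failure: too many re-planning attempts"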
7. Experiments
In this section, we describe two experiments: a demonstration on the humanoid robot ARMAR-III and a visual simulation experiment involving untrained human subjects. Several specific components of our system have been evaluated in our previous work. The accuracy of the NLU pipeline was evaluated in [7]. Common-sense knowledge extraction was evaluated in [8]. In [3], object replacement based on common-sense knowledge was tested. In [9], an evaluation of the visual features replacement strategy was presented. The object localization was evaluated in [15] and [16]. In this study, we test how all components work in combination in a complex setting requiring human-robot communication and collaboration.
7.1. Execution on ARMAR-III
We tested our approach on the humanoid robot ARMAR-III [12] in a kitchen environment. The case study elaborates on a dinner preparation scenario. In this section, we describe the two parts of the Xperience project demo scenario that are relevant for this manuscript: 1) salad preparation and 2) bringing a drink. The accompanying video can be found at https://youtu.be/PyJ5hCW3zQM; the video of the full Xperience project demonstration can be found at https://youtu.be/-8oC-WW5P1I. Apart from the videos, the system was also shown in a live demonstration at the Xperience project review.
In this experiment, the following robot skills were involved: moving, grasping, placing, stirring, pouring, door opening and closing, handing over, and receiving. Fig. 9 shows snapshots of the scenario in chronological execution order.
In the salad preparation part, the human first asks the robot to put a
salad bowl on the sideboard. The command is processed by the NLU compo-
nent, which generates a planner goal. The goal and the domain description are
processed by the planner, which generates a multi-step plan. According to the
plan, the robot moves to the location of the salad bowl, but does not find the
required object at the location. The Plan Execution & Monitoring component
reports the failure in the plan execution, and the Replacement Manager is invoked.

Figure 9: Snapshots of cooperatively rearranging the room and preparing dinner.
It finds a container that has a shape similar to the originally known bowl (Fig. 9a). The RM suggests the new container as a replacement based on shared visual features. The goal is rewritten accordingly and passed to the planner, which produces a new plan that can be executed successfully (Fig. 9b). When the bowl is on the sideboard, the human asks the robot for help in preparing a salad with corn and oil. The command is processed by the NLU and a goal is generated, which implies the corn and the oil being in the bowl and the salad being stirred in the bowl. The RM finds that the location of the corn is unknown. Since the other location replacement strategies fail, it generates a question for the human inquiring about the corn location. The human tells the robot that the corn is located in the fridge. The utterance is processed by the NLU and MemoryX is updated correspondingly. After obtaining the feedback from the human, the RM invokes the planner, which generates a plan. The robot moves to the fridge, opens it (Fig. 9c), moves to the sideboard, pours the corn into the bowl (Fig. 9d), puts the empty can into the sink, and returns to the fridge to close it (Fig. 9e). The actions of putting the can into the sink and closing the fridge are planned because we introduce symbolic rules stating that dirty objects should go into the sink after being manipulated and that the fridge door should be closed at the end of each plan execution. After adding the corn, the robot moves to the oil location, grasps the oil, pours it into the bowl (Fig. 9f), and puts the oil bottle away. The latter action is also planned, since we define a symbolic rule requiring the robot's hands to be empty at the end of each plan execution. Meanwhile, the human cuts other salad ingredients and pours them into the bowl. In order to mix the salad, the robot requires a stirrer. It moves to the assumed stirrer location, but cannot grasp the stirrer there. Instead, the planner plans a spoken request to the human, who passes the stirrer to the robot. The robot returns to the bowl, stirs the salad (Fig. 9g), and puts the stirrer into the sink (Fig. 9h). Finally, the human asks the robot to put the bowl on the dining table, which the robot does (Fig. 9i).
In the second part of the scenario, the human asks for a drink, saying I'd like to drink something. Could you please bring me a lemonade. This command is processed by the NLU component and a goal of the lemonade being
in the hand of the human is generated. Apart from generating the goal, the NLU extracts the affordance of drinking for the object "lemonade". The goal and the predicted affordances are passed to the Replacement Manager. The RM finds that the object "lemonade" is unknown and attempts a replacement by using the affordance "drink". The object "multivitamin juice" is proposed as a possible replacement. The RM generates a confirmation utterance Sorry, I have no lemonade. But I can bring you a multivitamin juice. After the human confirms the replacement, the RM rewrites the goal and passes it to the planner.
According to the generated plan, the robot moves to the assumed location of
the juice, but does not find it there. Plan execution fails and the RM is invoked.
The RM component derives another potential location of the juice from the
database of common locations. The robot finds the juice at the new location,
grasps it, moves to the location of the human, and hands the juice over to the
human (Fig. 9j).
7.2. Simulation experiment
In the simulation experiment, we aimed at testing whether the framework can be employed by untrained users. Thanks to the generic interfaces in ArmarX, the programs used on the real robot can also be executed unchanged in a simulation environment. To provide the user with information about the simulation, ArmarX offers a visualization of the simulated scene, in which the user can track the robot's action execution and its effects on the scene.
This Graphical User Interface is shown in Fig. 10. The user can interact with
the simulated robot by typing in the text dialog window or by speaking into a
microphone.
We asked the subjects to achieve a given goal by controlling the robot with
natural language commands in the simulation environment. The instructions
were formulated as follows. First, users were advised to communicate with
the robot by typing sentences in the dialog widget. The users were shown the initial scene with labels assigned to available locations and to some of the objects that could not easily be recognized visually (Fig. 11a). They were also shown the goal table setup (Fig. 11b) and asked to achieve it by controlling the robot.

Figure 10: Graphical User Interface used for the simulation experiments.

Figure 11: Instruction scenes: (a) initial scene, (b) goal table setup.
In addition, we specified that the drink on the table can be chosen from a
predefined list of drinks represented by images (e.g. beer, juice, milk) and the
salad should contain ingredients represented by images of corn and oil.
The experiment was run offline on site, with no time constraints. In total,
eight subjects (four male, four female, age 20–60) took part in the experiment.
None of the subjects had a background in robotics or related areas. The experimenter was present during the whole experiment to record the experiment transcript (see Table 3), but was not giving any instructions or answering questions.
All subjects were able to solve the tasks using 12.5 utterances on average.
Some subjects provided general task descriptions, e.g. Make a salad with corn
and oil. Others gave detailed commands, e.g. Go to the fridge. Open the fridge.
Take the corn... In total, 103 natural language utterances were processed in this experiment. The Language Understanding component processed 87 utterances fully correctly, 3 partially correctly, and 13 incorrectly; see examples in Table 4.
User id | Text                                          | LU  | RM                     | Memory update | Planner                                  | Execution | User feedback
M2      | "Take two glasses and put them on the table"  | yes | yes: "glass" -> "cup"  | -             | -                                        | -         | -
M2      | "yes"                                         | yes | yes                    | -             | suboptimal: juice removed from the table | yes       | user confused
Table 3: Example experiment transcript. For each input utterance, the transcript records if
each of the evoked components worked correctly. For the first command, the language under-
standing component produced correct goals and passed them to the Replacement Manager,
which produced the suggestion to replace the unknown object ”glass” with the known object
”cup” and the confirmation question for the user. The next utterance ”yes” was correctly
processed by LU; given the confirmation, the RM correctly rewrote the goals; there was no
memory update; the planner produced a suboptimal goal, which implied removing the juice
from the table to free space for putting the cups, while the user asked for the juice to be
placed there earlier; the plan was correctly executed on the robot; the user was confused by
the fact that the juice was removed from the table.
C | "Can you please set the table for two people"
    LU output: [goal: (existsK(?x1 : cup, ?x2 : placesetting) K(objectAt(?x1, ?x2)) &
    (existsK(?x3 : cup, ?x4 : placesetting) K(objectAt(?x3, ?x4)) &
    K(?x1 != ?x3) & K(?x2 != ?x4))), context: (table,set)]
    Comment: -

P | "Bring a glass and put it on the table"
    LU output: [goal: ((existsK(?x1 : glass, ?x2 : human) K(inHandOfHuman(?x1,?x2)),
    (existsK(?x1 : glass, ?x2 : table) K(objectAt(?x1,?x2))), context:
    ((glass,bring), (table,put))]
    Comment: "bring" interpreted as "bring to human" instead of "bring to location"

I | "I need milk"
    LU output: []
    Comment: Underspecified command, modality ("need") not recognized
Table 4: Examples of the correct (C), partially correct (P), and incorrect (I) output of the
language understanding component.
The main source of errors is underspecified commands, e.g., I need milk or Prepare a salad. Introducing clarification questions asked by the robot is a
possible solution to this problem. Another issue concerns the processing of partial input. For example, if the robot asks about the corn location and the user answers in the fridge instead of the corn/it is in the fridge, such an answer is not processed correctly. Currently, we do not have a dialog component that waits for a particular type of input; instead, we process all types of input all the time. This problem can be solved in our framework by assigning higher assumption weights to certain syntactic constructions (e.g., prepositional phrases without a noun), which would force the abductive reasoner to link them to the previous
discourse. A related issue concerns context-dependent metonymy. For example,
in the utterance Put the salad on the table, the reference salad should be linked
to the bowl, in which the salad has been prepared. Additional domain axioms
are needed to establish this link. One more issue is related to missing lexical
items or their meanings in the knowledge base. For example, the verb put was
defined in our knowledge base as referring to moving an object from one location to another, while some of the users used it in the sense of pour, e.g., put the corn into the bowl.
The Replacement Manager component suggested 6 object replacements, 4 of which were accepted by the users. It also generated location hypotheses for all known objects and asked the users for unknown object locations. One issue we encountered with this component concerned cases in which several replacements of the same type were required in a goal expression. For example, the command Put two glasses on the table implies that there should be two different glasses on the table. Since no glasses were available in our kitchen environment, the RM suggested replacing them with cups and chose the type red cup as a replacement. Since there was only one instance of red cup, the rewritten goal implying two different instances could not be fulfilled. To tackle this problem, the RM needs to consider the number of instances of the replaced type and make the replacement only if the required number of instances is available, as sketched below.
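The proposed fix could take roughly the following form; memory.instances_of and the candidates function are hypothetical placeholders for the MemoryX query and the replacement strategies, not existing interfaces.

def replacement_with_count(required_type, required_count, memory, candidates):
    # Suggest a replacement type only if enough distinct instances of it
    # exist in the working memory.
    for candidate_type in candidates(required_type):
        instances = memory.instances_of(candidate_type)
        if len(instances) >= required_count:
            return candidate_type, instances[:required_count]
    return None  # e.g., a single red cup cannot satisfy "two glasses"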
The working memory was successfully updated during execution and after processing human utterances. Given a goal that could be fulfilled and a domain description, the planner produced relevant plans. One issue we encountered is that our framework currently lacks a mechanism to estimate the pragmatic, context-dependent relevance of the generated plans. For example, a user commanded the robot to put a drink on the table, which the robot performed. Some of the following commands may result in a plan that requires the robot to remove the drink, which may contradict the original intention of the user. The ability to store previous goals and to generate new plans without overwriting these goals is a desired functionality for our framework.
The runtime of the main components is depicted in Table 5 and the timing of the action execution on the real robot is shown in Table 6. The host PC used for the measurements was an Intel Core i7-6700HQ CPU @ 2.60 GHz with 16 GB of RAM. The tasks from this simulation experiment were used to measure the runtime.

Component                  Average   Maximum   Minimum
Language Understanding     0.26 s    0.79 s    0.13 s
Visual Features Strategy   2.61 s    4.61 s    1.66 s
Planning System            0.89 s    7.99 s    0.13 s
Action Execution           20.59 s   42.01 s   4.44 s

Table 5: Runtime of the main components in simulation, in seconds.

The NLU component performed its analysis reliably in under one second
(maximum 0.79 s). No input data produced a noticeable delay in the task trig-
gering. The RM and its strategies, except the visual features strategy, are table lookups of precomputed information, e.g., common-sense object replacements, and thus consumed no significant CPU time; these strategies are therefore omitted from Table 5. The visual features strategy, on the other hand, is computationally expensive and required 2.61 s on average. Its runtime mainly depends on how many point cloud clusters (i.e., objects) are found in the current scene. The runtime of the planning component highly depends on the specific goal and the current state of the memory and varies between 0.00015 s and 7.99 s for the given tasks. The minimum of 0.00015 s was achieved when the goal was already fulfilled before planning started. The maximum of 7.99 s was required for the task set the table for two people, with all objects except the corn already being in the working memory. The runtime of the action execution was measured per plan step in the simulation; this measurement includes the full duration of the robot's action execution. The runtime is similar on the real robot, but slightly higher due to lower joint velocities and longer perception times. It strongly depends on the executed action; for the move action, it depends on the traveling distance.
Action Name         Average   Std. Dev. σ
Grasp               38.9 s    2.2 s
Move                8.5 s     2.6 s
Open                121.8 s   1.7 s
Pour                11.5 s    0.5 s
Putdown             20.5 s    0.7 s
RequestFromHuman    13.7 s    1.5 s
Stir                55.6 s    1.15 s

Table 6: Runtime of the action execution on the real robot, in seconds.
8. Related Work
In this section, the related work is organized according to specific aspects
relevant for this article.
Structural bootstrapping and replacement. The concept of structural bootstrap-
ping was introduced in the context of the Xperience project [1]. In [4], it was
demonstrated how structural bootstrapping can be performed at different levels
of a robotic architecture consisting of a planning level, a symbol-to-signal medi-
ator level, and a sensorimotor level. The concept was applied to the acquisition of action knowledge [38], the learning of action skills based on exploration and interaction with the environment [2], and the replacement of missing objects [3, 4, 9].
In [39], object replacement is carried out based on a similarity measure
utilizing a) object classes with features and affordances coded manually and
b) visual features such as shape and color intensity. In [3], object replacement
is performed by employing the ROAR database of objects with their affordances
and a learning method predicting unobserved object affordances. Along these
lines, we employ structural bootstrapping for object and location replacement
based on object affordances and predicted object locations.
Affordance estimation. Early work on affordance estimation followed a function-
based approach to object recognition for 3D CAD models of objects such as
chairs [40]. Another approach models affordances of objects as a function of
human-object interactions. For example, in [41], 3D geometric properties are
computed for tracked humans and objects in order to describe human-object in-
teractions. In [42], actions and objects in human demonstrations are recognized
and dependencies between them are modelled. In [43], object affordances are
represented by clustered spatial configurations of human-object interactions.
Other researchers focus on defining a subset of attributes for predicting new
object categories [44, 45, 46]. For example, by learning 2D shape and color patterns, attributes of novel objects can be recognized [47]. In the field of NL processing, there exist well-developed methods for extracting common-sense knowledge
from textual corpora, which can also be applied for mining object affordances,
cf. [48, 49, 50]. For example, [51] use verb-noun co-occurrences in Google Syn-
tactic N-Grams as well as Latent Semantic Analysis and Word2Vec semantic
vectors to extract verbs that are most likely to refer to affordances of objects
represented by given nouns. In our work, we extract object affordances from
visual features, textual corpora, and dialog context.
Grounding NL. Approaches to grounding NL into actions, relations, and ob-
jects known to the robot can be roughly subdivided into symbolic and statisti-
cal. Symbolic approaches rely on sets of rules to map linguistic constructions
into pre-specified action spaces and sets of environmental features. In [52], sim-
ple rules are used to map NL instructions having a pre-defined structure to
robot skills and task hierarchies. In [53], NL instructions are processed with a
dependency parser and background axioms are used to make assumptions and
fill the gaps in the NL input. In [54], background knowledge about robot ac-
tions is axiomatized using Markov Logic Networks. In [55], a knowledge base
of known actions, objects, and locations is used for a Bayes-based grounding
model. Symbolic approaches work well for small pre-defined domains, but most
of them employ manually written rules, which limits their coverage and scal-
ability. In order to increase the linguistic coverage, some of the systems use
lexical-semantic resources like WordNet, FrameNet, and VerbNet [56, 54]. In
this study, we follow this approach and generate our lexical axioms from WordNet and FrameNet.
Statistical approaches rely on annotated corpora to learn mappings between
linguistic structures and grounded predicates representing the external world.
In [57], reinforcement learning is applied to interpret NL directions in terms of
landmarks on a map. In [58], machine translation is used to translate from NL
route instructions to a map of an environment built by a robot. In [59], Gen-
eralized Grounding Graphs are presented that define a probabilistic graphical
model dynamically according to linguistic parse structures. In [28], a verb-
environment-instruction library is used to learn the relations between the lan-
guage, environment states, and robotic instructions in a machine learning frame-
work. Statistical approaches are generally better at handling NL variability. An
obvious drawback of these approaches is that they generate noise and require a
significant amount of annotated training data, which can be difficult to obtain
for each new application domain and set of action primitives.
Some recent work focuses on building joint models explicitly considering
perception at the same time as parsing [60, 61]. The framework presented in
this article is in line with this approach, because abductive inference considers
both the linguistic and perceptual input as an observation to be interpreted
given the background knowledge.
Our approach to grounding is also in line with [62], which proposes to ground language and other kinds of symbolic information to so-called Object-Action Complexes (OACs). OACs provide a framework in which experience is sys-
tematically structured and linked to specific actions, which on the one hand
are described by symbolic state transitions allowing for planning, and on the
other hand, by the relevant sub-symbolic information allowing for reasoning in
continuous and ambiguous signal space. Our mapping of linguistic structures to
predicate providers (Sec. 3) and statecharts (Sec. 6.2) can be seen as a realization
of the OAC framework.
Planning. With respect to the action execution, the existing approaches can
be classified into those directly mapping NL instructions into action sequences
[63, 64, 54] and those employing a planner [65, 56, 53, 66]. We employ a planner,
because it allows us to account for the dynamically changing environment, which
is essential for the human-robot collaboration. Similar to [66], we translate a
NL command into a goal description.
World descriptions. Although most NLU systems in robotics focus directly on instruction interpretation, a few systems detect world descriptions implicitly contained in human commands [53, 67, 55]. These descriptions are further used in the planning context, as is done in our approach.
In addition, we detect world descriptions not embedded into the context of an
instruction and process human action descriptions and feedback.
Linking planning, NL, and sensorimotor experience. Interaction between NL
instructions, resulting symbolic plans, and sensorimotor experience during plan
execution has been explored previously in the literature. In [63], symbolic representations of objects, object locations, and robot actions are mapped on the fly to the sensorimotor information. During the execution of predefined plans,
the plan execution monitoring component evaluates the outcome of each robot’s
action as success or failure. In [68], symbolic representations are generated based
on several sensorimotor features needed for segmentation and inference of tasks.
In [53], the planner knowledge base is updated each time a NL instruction re-
lated to the current world state is provided and the planner re-plans taking into
consideration the new information. In [65], symbolic planning is employed to
plan a sequence of motion primitives for executing a predefined baking primitive
given the current world state. Replacement of missing objects and re-planning
is performed in [39, 69]. In line with these studies, mapping sensorimotor data to symbols, plan execution monitoring, and object replacement are all part of our system.
9. Conclusion
We have presented a realization of the concept of structural bootstrapping
in a framework integrating sensorimotor experience, robot’s memory, natural
language understanding, and planning in a robotic architecture developed in
the context of the Xperience project. We showed that the developed framework
is flexible enough to be used for action execution on a humanoid robot in a com-
plex domain and can process input from untrained users in a scenario requiring
human-robot interaction and collaboration.
The limitations of each system component are discussed in detail in the corresponding sections. The main issues that need to be addressed in future work are a) processing of underspecified or partial linguistic input, b) incorporating the pragmatic, context-dependent relevance of the generated plans, and c) equipping the knowledge base with deeper domain knowledge required both for language understanding and for relevant planning and replacement. Another current limitation concerns the inability of the framework to learn new objects and environment models. New objects and the information required for interacting with them, e.g., 3D mesh models, object localization descriptors, and grasp information, are currently added manually to the prior knowledge database. Similarly, a semantic model of the environment has to be developed manually. Adjusting the system to a new environment requires landmarks with suitable poses for performing the various manipulation actions.
Acknowledgements
The research leading to these results has received funding from the European Union Seventh Framework Programme under grant agreement No. 270273 (Xperience). We would also like to thank M. Do, C. Geib, M. Grotz, M. Kröhnert, D. Schiebener, and N. Vahrenkamp for their various contributions to the underlying system, which made this article possible.
[1] Xperience Project, Website, available online at http://www.xperience.
org.
[2] M. Do, J. Schill, J. Ernesti, T. Asfour, Learn to wipe: A case study of
structural bootstrapping from sensorimotor experience, in: Proc. of ICRA,
2014.
[3] A. Agostini, M. Javad Aein, S. Szedmak, E. E. Aksoy, J. Piater, F. Wörgötter, Using structural bootstrapping for object substitution in robotic executions of human-like manipulation tasks, in: Proc. of IROS, 2015, pp. 6479–6486.
[4] F. Wörgötter, C. Geib, M. Tamosiunaite, E. E. Aksoy, J. Piater, H. Xiong, A. Ude, B. Nemec, D. Kraft, N. Krüger, M. Wächter, T. Asfour, Structural bootstrapping - a novel concept for the fast acquisition of action-knowledge, IEEE Trans. on Autonomous Mental Development 7 (2) (2015) 140–154.
[5] J. J. Gibson, The theory of affordances, Hilldale, 1977.
[6] N. Vahrenkamp, M. Wächter, M. Kröhnert, K. Welke, T. Asfour, The ArmarX Framework - Supporting high level robot programming through state disclosure, Information Technology 57 (2) (2015) 99–111.
[7] E. Ovchinnikova, M. Wächter, V. Wittenbeck, T. Asfour, Multi-purpose natural language understanding linked to sensorimotor experience in humanoid robots, in: Proc. of Humanoids, 2015, pp. 365–372.
[8] P. Kaiser, M. Lewis, R. P. A. Petrick, T. Asfour, M. Steedman, Extracting
common sense knowledge from text for robot planning, in: Proc. of ICRA,
2014, pp. 3749–3756.
[9] W. Mustafa, M. Wächter, S. Szedmak, A. Agostini, D. Kraft, T. Asfour, J. Piater, F. Wörgötter, N. Krüger, Affordance estimation for vision-based object replacement on a humanoid robot, in: Proc. of ISR, 2016, in press.
[10] S. Szedmak, E. Ugur, J. Piater, Knowledge propagation and relation learn-
ing for predicting action effects, 2014.
[11] S. Krivic, S. Szedmak, H. Xiong, J. Piater, Learning missing edges via
kernels in partially-known graphs, in: European Symposium on Artificial
Neural Networks ESANN, Computational Intelligence and Machine Learn-
ing, 2015.
[12] T. Asfour, K. Regenstein, P. Azad, J. Schröder, A. Bierbaum, N. Vahrenkamp, R. Dillmann, ARMAR-III: An integrated humanoid platform for sensory-motor control, in: Proc. of Humanoids, 2006, pp. 169–175.
[13] M. Henning, A new approach to object-oriented middleware, Internet Com-
puting, IEEE 8 (1) (2004) 66–75. doi:10.1109/MIC.2004.1260706.
[14] K. Welke, P. Kaiser, A. Kozlov, N. Adermann, T. Asfour, M. Lewis,
M. Steedman, Grounded spatial symbols for task planning based on ex-
perience, in: Proc. of Humanoids, 2013, pp. 484–491.
[15] P. Azad, T. Asfour, R. Dillmann, Combining Harris Interest Points and the
SIFT Descriptor for Fast Scale-Invariant Object Recognition, in: Proc. of
IROS, 2009, pp. 4275–4280.
[16] P. Azad, D. Münch, T. Asfour, R. Dillmann, 6-DoF model-based tracking of arbitrarily shaped 3D objects, in: Proc. of ICRA, 2011, pp. 5204–5209.
[17] M. Wächter, S. Ottenhaus, M. Kröhnert, N. Vahrenkamp, T. Asfour, The ArmarX statechart concept: Graphical programming of robot behaviour, Frontiers - Software Architectures for Humanoid Robotics.
[18] J. R. Hobbs, M. E. Stickel, D. E. Appelt, P. A. Martin, Interpretation as
abduction, Artif. Intell. 63 (1-2) (1993) 69–142.
[19] H. Soltau, F. Metze, C. Fügen, A. Waibel, A one-pass decoder based on polymorphic linguistic context assignment, in: Proc. of ASRU, 2001, pp. 214–217.
[20] N. Inoue, E. Ovchinnikova, K. Inui, J. R. Hobbs, Weighted abduction for
discourse processing based on integer linear programming, in: Plan, Activ-
ity, and Intent Recognition, 2014, pp. 33–55.
[21] J. R. Hobbs, Ontological promiscuity, in: Proc. of ACL, 1985, pp. 60–69.
[22] J. Bos, Wide-Coverage Semantic Analysis with Boxer, in: Proc. of STEP,
Research in Computational Semantics, 2008, pp. 277–286.
[23] E. Ovchinnikova, R. Israel, S. Wertheim, V. Zaytsev, N. Montazeri,
J. Hobbs, Abductive Inference for Interpretation of Metaphors, in: Proc.
of ACL Workshop on Metaphor in NLP, 2014, pp. 33–41.
[24] E. Ovchinnikova, A. S. Gordon, J. Hobbs, Abduction for Discourse Inter-
pretation: A Probabilistic Framework, in: Proc. of JSSP, 2013, pp. 42–50.
[25] C. Fellbaum, WordNet: An Electronic Lexical Database, 1998.
[26] C. F. Baker, C. J. Fillmore, J. B. Lowe, The Berkeley FrameNet project,
in: Proc. of COLING-ACL, 1998, pp. 86–90.
[27] E. Ovchinnikova, Integration of world knowledge for natural language un-
derstanding, Springer, 2012.
[28] D. K. Misra, J. Sung, K. Lee, A. Saxena, Tell me dave: Context-sensitive
grounding of natural language to manipulation instructions, Proc. of RSS.
[29] R. Ros, S. Lemaignan, E. A. Sisbot, R. Alami, J. Steinwender, K. Hamann,
F. Warneken, Which one? grounding the referent based on efficient human-
robot interaction, in: Proc. of RO-MAN, 2010, pp. 570–575.
[30] R. P. Petrick, F. Bacchus, A Knowledge-Based Approach to Planning with
Incomplete Information and Sensing, in: Proc. of AIPS, 2002, pp. 212–222.
[31] G. Bakir, T. Hofman, B. Schölkopf, A. J. Smola, B. Taskar, S. V. N. Vishwanathan (Eds.), Predicting Structured Data, MIT Press, 2007.
[32] M. Ghazanfar, A. Prugel-Bennett, S. Szedmak, Kernel mapping recom-
mender system algorithms, Information Sciences 208 (2012) 81–104.
[33] W. Mustafa, N. Pugeault, A. G. Buch, N. Krüger, Multi-view object instance recognition in an industrial context, Robotica (2015) 1–22.
[34] D. Kraft, W. Mustafa, M. Popović, J. B. Jessen, A. G. Buch, T. R. Savarimuthu, N. Pugeault, N. Krüger, Using surfaces and surface relations in an early cognitive vision system, Machine Vision and Applications 26 (7-8) (2015) 933–954.
[35] T. Joachims, T. Finley, C.-N. J. Yu, Cutting-plane training of structural
svms, Machine Learning 77 (1) (2009) 27–59.
[36] N. Vahrenkamp, S. Wieland, P. Azad, D. Gonzalez-Aguirre, T. Asfour,
R. Dillmann, Visual servoing for humanoid grasping and manipulation
tasks, in: IEEE/RAS International Conference on Humanoid Robots (Hu-
manoids), 2008, pp. 406–412.
[37] J. Ernesti, L. Righetti, M. Do, T. Asfour, S. Schaal, Encoding of periodic
and their transient motions by a single dynamic movement primitive, in:
IEEE/RAS International Conference on Humanoid Robots (Humanoids),
2012, pp. 57–64.
[38] E. E. Aksoy, M. Tamosiunaite, R. Vuga, A. Ude, C. Geib, M. Steedman, F. Wörgötter, Structural bootstrapping at the sensorimotor level for the fast acquisition of action knowledge for cognitive robots, in: Proc. of ICDL, 2013, pp. 1–8.
[39] I. Awaad, G. K. Kraetzschmar, J. Hertzberg, Finding ways to get the job
done: An affordance-based approach, in: Proc. of ICAPS, 2014.
[40] L. Stark, K. Bowyer, Function-based generic recognition for multiple object
categories, CVGIP: Image Understanding 59 (1) (1994) 1–21.
[41] H. S. Koppula, R. Gupta, A. Saxena, Learning human activities and ob-
ject affordances from rgb-d videos, The International Journal of Robotics
Research 32 (8) (2013) 951–970.
[42] H. Kjellström, J. Romero, D. Kragić, Visual object-action recognition: Inferring object affordances from human demonstration, Computer Vision and Image Understanding 115 (1) (2011) 81–90.
[43] B. Yao, J. Ma, L. Fei-Fei, Discovering object functionality, in: Proc. of
ICCV, 2013, pp. 2512–2519.
[44] D. Parikh, K. Grauman, Relative attributes, in: Proc. of ICCV, IEEE,
2011, pp. 503–510.
[45] C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object
classes by between-class attribute transfer, in: Proc. of CVPR, IEEE, 2009,
pp. 951–958.
[46] X. Yu, Y. Aloimonos, Attribute-based transfer learning for object cate-
gorization with zero/one training example, in: Proc. of ECCV, 2010, pp.
127–140.
[47] V. Ferrari, A. Zisserman, Learning visual attributes, in: Proc. of NIPS,
2007, pp. 433–440.
[48] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, T. M.
Mitchell, Toward an architecture for never-ending language learning., in:
Proc. of AAAI, 2010.
[49] C. L. Teo, Y. Yang, H. Daumé III, C. Fermüller, Y. Aloimonos, A corpus-guided framework for robotic visual perception, in: Proc. of Workshop on Language-Action Tools for Cognitive Artificial Agents, 2011, pp. 36–42.
[50] K. Zhou, M. Zillich, H. Zender, M. Vincze, Web mining driven object lo-
cality knowledge acquisition for efficient robot behavior, in: Proc. of IROS,
IEEE, 2012, pp. 3962–3969.
[51] Y.-W. Chao, Z. Wang, R. Mihalcea, J. Deng, Mining semantic affordances
of visual object categories, in: Proc. of CVPR, 2015, pp. 4259–4267.
[52] P. E. Rybski, K. Yoon, J. Stolarz, M. M. Veloso, Interactive robot task
training through dialog and demonstration, in: Proc. of HRI, 2007, pp.
49–56.
[53] R. Cantrell, K. Talamadupula, P. W. Schermerhorn, J. Benton, S. Kamb-
hampati, M. Scheutz, Tell me when and why to do it!: run-time planner
model updates via natural language instruction, in: Proc. of HRI, 2012,
pp. 471–478.
[54] D. Nyga, M. Beetz, Everything robots always wanted to know about house-
work (but were afraid to ask), in: Proc. of IROS, 2012, pp. 243–250.
[55] T. Kollar, V. Perera, D. Nardi, M. Veloso, Learning environmental knowl-
edge from task-based human-robot dialog, in: Proc. of ICRA, 2013, pp.
4304–4309.
[56] D. J. Brooks, C. Lignos, C. Finucane, M. S. Medvedev, I. Perera, V. Raman,
H. Kress-Gazit, M. Marcus, H. A. Yanco, Make it so: Continuous, flexible
natural language interaction with an autonomous robot, in: Proc. of the
Grounding Language for Physical Systems Workshop at AAAI, 2012.
[57] A. Vogel, D. Jurafsky, Learning to follow navigational directions, in: Proc.
of ACL, 2010, pp. 806–814.
[58] C. Matuszek, D. Fox, K. Koscher, Following directions using statistical
machine translation, in: Proc. of ACM/IEEE, 2010, pp. 251–258.
[59] T. Kollar, S. Tellex, M. R. Walter, A. Huang, A. Bachrach, S. Hemachan-
dra, E. Brunskill, A. Banerjee, D. Roy, S. Teller, et al., Generalized ground-
ing graphs: A probabilistic framework for understanding grounded lan-
guage, JAIR.
[60] J. Krishnamurthy, T. Kollar, Jointly learning to parse and perceive: Con-
necting natural language to the physical world, Trans. of ACL 1 (2013)
193–206.
[61] C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, D. Fox, A joint model
of language and perception for grounded attribute learning, arXiv preprint
arXiv:1206.6423.
[62] N. Krüger, C. Geib, J. Piater, R. Petrick, M. Steedman, F. Wörgötter, A. Ude, T. Asfour, D. Kraft, D. Omrcen, A. Agostini, R. Dillmann, Object-action complexes: Grounded abstractions of sensori-motor processes, Robotics and Autonomous Systems 59 (10) (2011) 740–757.
[63] M. Beetz, U. Klank, I. Kresse, A. Maldonado, L. Mosenlechner, D. Panger-
cic, T. Ruhr, M. Tenorth, Robotic roommates making pancakes, in: Proc.
of Humanoids, 2011, pp. 529–536.
[64] C. Matuszek, E. Herbst, L. Zettlemoyer, D. Fox, Learning to parse natural
language commands to a robot control system, in: Experimental Robotics,
Springer, 2013, pp. 403–415.
[65] M. Bollini, S. Tellex, T. Thompson, N. Roy, D. Rus, Interpreting and
executing recipes with a cooking robot, in: Experimental Robotics, 2013,
pp. 481–495.
[66] J. Dzifcak, M. Scheutz, C. Baral, P. Schermerhorn, What to do and how to
do it: Translating natural language directives into temporal and dynamic
logic representation for goal management and action execution, in: Proc.
of ICRA, 2009, pp. 4163–4168.
[67] F. Duvallet, M. R. Walter, T. Howard, S. Hemachandra, J. Oh, S. Teller,
N. Roy, A. Stentz, Inferring maps and behaviors from natural language
instructions, in: Proc. of ISER, 2014.
[68] K. Ramirez-Amaro, E. Dean-Leon, I. Dianov, F. Bergner, G. Cheng, Gen-
eral recognition models capable of integrating multiple sensors for different
domains, in: Proc. of Humanoids, 2016, pp. 306–311.
[69] S. Konecný, S. Stock, F. Pecora, A. Saffiotti, Planning domain + execution semantics: a way towards robust execution?, in: Proc. of Qualitative Representations for Robots, AAAI Spring Symposium, 2014.
... Human instructions parsing Early methods in robotic programming focused on interpreting human instructions using graphical models (Kollar et al., 2010;Tellex et al., 2011), grammar-based interpreters (Wächter et al., 2018), and semantic parsing techniques (Kollar et al., 2014;Thomason et al., 2015), to extract high-level tasks or attributes. This method expands the gap between human instruction and robotic execution. ...
Article
Full-text available
This paper delves into the potential of Large Language Model (LLM) agents for industrial robotics, with an emphasis on autonomous design, decision-making, and task execution within manufacturing contexts. We propose a comprehensive framework that includes three core components: (1) matches manufacturing tasks with process parameters, emphasizing the challenges in LLM agents’ understanding of human-imposed constraints; (2) autonomously designs tool paths, highlighting the LLM agents’ proficiency in planar tasks and challenges in 3D spatial tasks; and (3) integrates embodied intelligence within industrial robotics simulations, showcasing the adaptability of LLM agents like GPT-4. Our experimental results underscore the distinctive performance of the GPT-4 agent, especially in Component 3, where it is outstanding in task planning and achieved a success rate of 81.88% across 10 samples in task completion. In conclusion, our study accentuates the transformative potential of LLM agents in industrial robotics and suggests specific avenues, such as visual semantic control and real-time feedback loops, for their enhancement.
... The prototypes shown in Figure 1. have been developed by our teams in the past years. Our continuous 2 effort is now more or less toward achieving the top-down design of a cognitive mind which will make future humanoid robots to be teachable with the use of natural languages [Xie, M. (2003); Wächter, M. et al, (2018)]. ...
Preprint
Full-text available
Teachability has been extensively studied under the context of making industrial robots to be programmable and reprogrammable. However, it is only recently that the artificial intelligence (AI) research community is accelerating the research works with the objective of making humanoid robots and many other robots to be teachable under the context of using natural languages. We human beings spend many years to learn knowledge and skills despite our extraordinary mental capabilities of being teachable with the use of natural languages. Therefore, if we would like to develop human-like robots such as humanoid robots, it is inevitable for us to face the issue of making future humanoid robots to be teachable with the use of natural languages as well. In this paper, we present the key details of a top-down design for achieving a teachable mind which consists of two major processes: the first one is the process which enables humanoid robots to gain innate mental capabilities of transforming incoming signals into meaningful crisp data, and the second one is the process which enables humanoid robots to gain innate mental capabilities of undertaking incremental and deep learning with the main focus of associating conceptual labels in a natural language to meaningful crisp data. These two processes consist of the two necessary and sufficient conditions for future humanoid robots to be teachable with the use of natural languages. In addition, this paper outlines a very likely new finding underlying human brain’s neural systems as well as the obvious mathematics underlying artificial deep neural networks. These outlines provide us the strong reason to separate the study of mind from the study of brain. Hopefully, the content discussed in this paper will help the AI research community to venture into the right direction which is to make future humanoid robots, non-humanoid robots, and many other systems to achieve human-like self-intelligence at cognitive level with the use of natural languages.
... Thus, anchoring is more suited for robotic applications within well defined domains where the designer has "all" knowledge and the robot never stumbles upon anything unexpected-by-the-designer (which it would, thus, not "understand"). Hence, in such architectures, appropriate procedures are defined for world state estimation from sensing and planning where pre-programmed actions are then triggered for execution [24], [25]. Clearly, these approaches have only limited generalization abilities, mostly based on re-shuffling of the developed structures [26], [27] where also re-purposing of the structures using high level reasoning can be achieved to some degree [28], [29]. ...
Preprint
Full-text available
The question how neural systems (of humans) can perform reasoning is still far from being solved. We posit that the process of forming Concepts is a fundamental step required for this. We argue that, first, Concepts are formed as closed representations, which are then consolidated by relating them to each other. Here we present a model system (agent) with a small neural network that uses realistic learning rules and receives only feedback from the environment in which the agent performs virtual actions. First, the actions of the agent are reflexive. In the process of learning, statistical regularities in the input lead to the formation of neuronal pools representing relations between the entities observed by the agent from its artificial world. This information then influences the behavior of the agent via feedback connections replacing the initial reflex by an action driven by these relational representations. We hypothesize that the neuronal pools representing relational information can be considered as primordial Concepts, which may in a similar way be present in some pre-linguistic animals, too. We argue that systems such as this can help formalizing the discussion about what constitutes Concepts and serve as a starting point for constructing artificial cogitating systems.
Article
Teachability has been extensively studied under the context of making industrial robots to be programmable and reprogrammable. However, it is only recently that the artificial intelligence (AI) research community is accelerating the research works with the objective of making humanoid robots and many other robots to be teachable under the context of using natural languages. We human beings spend many years learning knowledge and skills despite our extraordinary mental capabilities of being teachable with the use of natural languages. Therefore, if we would like to develop human-like robots such as humanoid robots, it is inevitable for us to face the issue of making future humanoid robots teachable with the use of natural languages as well. In this paper, we present the key details of a top-down design for achieving a teachable mind which consists of two major processes: the first one is the process that enables humanoid robots to gain innate mental capabilities of transforming incoming signals into meaningful crisp data, and the second one is the process which enables humanoid robots to gain innate mental capabilities of undertaking incremental and deep learning with the main focus of associating conceptual labels in a natural language to meaningful crisp data. These two processes consist of the two necessary and sufficient conditions for future humanoid robots to be teachable with the use of natural languages. In addition, this paper outlines a very likely new finding underlying the human brain’s neural systems as well as the obvious mathematics underlying artificial deep neural networks. These outlines provide us with a strong reason to separate the study of the mind from the study of the brain. Hopefully, the content discussed in this paper will help the AI research community to venture into the right direction which is to make future humanoid robots, non-humanoid robots, and many other systems to achieve human-like self-intelligence at the cognitive level with the use of natural languages.
Article
The question how neural systems (of humans) can perform reasoning is still far from being solved. We posit that the process of forming Concepts is a fundamental step required for this. We argue that, first, Concepts are formed as closed representations, which are then consolidated by relating them to each other. Here we present a model system (agent) with a small neural network that uses realistic learning rules and receives only feedback from the environment in which the agent performs virtual actions. First, the actions of the agent are reflexive. In the process of learning, statistical regularities in the input lead to the formation of neuronal pools representing relations between the entities observed by the agent from its artificial world. This information then influences the behavior of the agent via feedback connections replacing the initial reflex by an action driven by these relational representations. We hypothesize that the neuronal pools representing relational information can be considered as primordial Concepts, which may in a similar way be present in some pre-linguistic animals, too. This system provides formal grounds for further discussions on what could be understood as a Concept and shows that associative learning is enough to develop concept-like structures.
Chapter
Industrial robots today are still mostly pre-programmed to perform a specific task. Despite previous research in human-robot interaction in the academia, adopting such systems in industrial settings is not trivial and has rarely been done. In this paper, we introduce a robotic system that we control with high-level verbal commands, leveraging some of the latest neural approaches to language understanding and a cognitive architecture for goal-directed but reactive execution. We show that a large-scale pre-trained language model can be effectively fine-tuned for translating verbal instructions into robot tasks, better than other semantic parsing methods, and that our system is capable of handling through dialogue a variety of exceptions that happen during human-robot interaction including unknown tasks, user interruption, and changes in the world state.
Article
Adpositions and case markers are ubiquitous in natural language and express a wide range of meaning relations that can be of crucial relevance for many NLP and AI tasks. However, capturing their semantics in a comprehensive yet concise, as well as cross-linguistically applicable way has remained a challenge over the years. To address this, we adapt the largely language-agnostic SNACS framework to German, defining language-specific criteria for identifying adpositional expressions and piloting a supersense-annotated German corpus. We compare our approach with prior work on both German and multilingual adposition semantics, and discuss our empirical findings in the context of potential applications.
Article
Full-text available
Many task domains require robots to interpret and act upon natural language commands which are given by people and which refer to the robot's physical surroundings. Such interpretation is known variously as the symbol grounding problem, grounded semantics and grounded language acquisition. This problem is challenging because people employ diverse vocabulary and grammar, and because robots have substantial uncertainty about the nature and contents of their surroundings, making it difficult to associate the constitutive language elements (principally noun phrases and spatial relations) of the command text to elements of those surroundings. Symbolic models capture linguistic structure but have not scaled successfully to handle the diverse language produced by untrained users. Existing statistical approaches can better handle diversity, but have not to date modeled complex linguistic structure, limiting achievable accuracy. Recent hybrid approaches have addressed limitations in scaling and complexity, but have not effectively associated linguistic and perceptual features. Our framework, called Generalized Grounding Graphs (G^3), addresses these issues by defining a probabilistic graphical model dynamically according to the linguistic parse structure of a natural language command. This approach scales effectively, handles linguistic diversity, and enables the system to associate parts of a command with the specific objects, places, and events in the external world to which they refer. We show that robots can learn word meanings and use those learned meanings to robustly follow natural language commands produced by untrained users. We demonstrate our approach for both mobility commands and mobile manipulation commands involving a variety of semi-autonomous robotic platforms, including a wheelchair, a micro-air vehicle, a forklift, and the Willow Garage PR2.
Article
Adapting plans to changes in the environment by finding alternatives and taking advantage of opportunities is a common human behavior. The need for such behavior is often rooted in the uncertainty produced by our incomplete knowledge of the environment. While several existing planning approaches deal with such issues, artificial agents still lack the robustness that humans display in accomplishing their tasks. In this work, we address this brittleness by combining Hierarchical Task Network planning, Description Logics, and the notions of affordances and conceptual similarity. The approach allows a domestic service robot to find ways to get a job done by making substitutions. We show how knowledge is modeled, how the reasoning process is used to create a constrained planning problem, and how the system handles cases where plan generation fails due to missing or unavailable objects. The results of the evaluation for two tasks in a domestic service domain show the viability of the approach in finding and making the appropriate goal transformations.
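A minimal sketch of the substitution step described above, with Jaccard overlap of affordance sets standing in for the Description Logics reasoning; the affordance table and threshold are invented.

```python
# Invented affordance table; the real system reasons over a DL ontology.
AFFORDANCES = {
    "mug":   {"contain", "grasp"},
    "glass": {"contain", "grasp"},
    "plate": {"grasp"},
}

def similarity(a, b):
    """Jaccard overlap of affordance sets, a stand-in for conceptual similarity."""
    sa, sb = AFFORDANCES[a], AFFORDANCES[b]
    return len(sa & sb) / len(sa | sb)

def substitute(required, available, threshold=0.6):
    """Return the most similar available object, or None if nothing qualifies."""
    score, best = max((similarity(required, o), o) for o in available)
    return best if score >= threshold else None

# The plan needs a mug, but only a glass and a plate are in view.
print(substitute("mug", ["glass", "plate"]))   # -> 'glass'
```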
Article
We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.
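The read-and-learn cycle can be caricatured as below. The corpus, seed beliefs, and patterns are invented, and the coupled-training constraints and precision estimates of the real system are omitted; note how a pattern induced on one pass extracts a new instance on the next.

```python
# Toy read-and-learn cycle over a tiny invented corpus.
corpus = ["Paris is a city", "Berlin is a city",
          "Berlin is a capital", "Oslo is a capital"]
kb = {"city": {"Paris"}}             # seed belief
patterns = {"city": {" is a city"}}  # seed extraction pattern

for day in range(3):                 # each pass stands in for one "day"
    # (1) read: apply current patterns to grow the knowledge base
    for s in corpus:
        for cat, pats in patterns.items():
            for p in pats:
                if s.endswith(p):
                    kb[cat].add(s[: -len(p)])
    # (2) learn: induce new patterns from sentences about known instances
    for s in corpus:
        for cat, instances in kb.items():
            for e in list(instances):
                if s.startswith(e + " "):
                    patterns[cat].add(s[len(e):])

print(kb)   # {'city': {'Paris', 'Berlin', 'Oslo'}}
```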
Article
This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment. For example, given an image, LSP can map the statement “blue mug on the table” to the set of image segments showing blue mugs on tables. LSP learns physical representations for both categorical (“blue,” “mug”) and relational (“on”) language, and also learns to compose these representations to produce the referents of entire statements. We further introduce a weakly supervised training procedure that estimates LSP’s parameters using annotated referents for entire statements, without annotated referents for individual words or the parse structure of the statement. We perform experiments on two applications: scene understanding and geographical question answering. We find that LSP outperforms existing, less expressive models that cannot represent relational language. We further find that weakly supervised training is competitive with fully supervised training while requiring significantly less annotation effort.
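The compositional part of the model can be illustrated with hand-written predicates over toy image segments; real LSP learns these classifiers from weakly supervised perceptual features, and everything below is invented.

```python
# Toy image segments with symbolic attributes (a real system would
# compute these from perceptual features).
segments = [
    {"id": 1, "color": "blue",  "cat": "mug",   "y": 2},
    {"id": 2, "color": "red",   "cat": "mug",   "y": 2},
    {"id": 3, "color": "white", "cat": "table", "y": 0},
]

# Hand-written stand-ins for LSP's learned categorical and relational models.
def blue(s):  return s["color"] == "blue"
def mug(s):   return s["cat"] == "mug"
def on(a, b): return b["cat"] == "table" and a["y"] > b["y"]

# "blue mug on the table": compose the categorical predicates, then
# filter by the relational one.
referents = [a["id"] for a in segments
             if blue(a) and mug(a) and any(on(a, b) for b in segments)]
print(referents)   # [1]
```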
Conference Paper
This paper deals with the problem of learning unknown edges with attributes in a partially given multigraph. The method is an extension of Maximum Margin Multi-Valued Regression (M3VM) to the case where those edges are characterized by different attributes. It is applied to a large-scale problem in which an agent tries to learn unknown object-object relations by exploiting the relations it already knows. The method can handle not only binary relations but also complex, structured relations such as text, images, collections of labels, and categories, which can be represented by kernels. We compare its performance with that of a specialized, state-of-the-art matrix completion method.
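As a rough stand-in for completing a partially observed relation matrix (this is plain low-rank factorization, not M3VM's max-margin formulation, and the data are invented):

```python
import numpy as np

# Partially observed object-object relation matrix; NaN marks unknown edges.
R = np.array([[1, 1, 0, np.nan],
              [1, 1, np.nan, 0],
              [0, np.nan, 1, 1],
              [np.nan, 0, 1, 1]], dtype=float)
mask = ~np.isnan(R)
n, rank, lr = R.shape[0], 2, 0.02

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n, rank))   # low-rank factor to learn

# Gradient descent on the squared error over the observed entries only.
for _ in range(3000):
    E = np.where(mask, U @ U.T - R, 0.0)   # residual on known edges
    U -= lr * (E + E.T) @ U

print(np.round(U @ U.T, 1))   # unknown edges are now predicted (close to 0 here)
```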