Content-based control of goal-directed
attention during human action
perception
Yiannis Demiris and Bassam Khadhouri
Abstract
During the perception of human actions by robotic assistants, the
robotic assistant needs to direct its computational and sensor resources
to relevant parts of the human action. In previous work we have intro-
duced HAMMER (Hierarchical Attentive Multiple Models for Execution
and Recognition) (Demiris and Khadhouri, 2006), a computational archi-
tecture that forms multiple hypotheses with respect to what the demon-
strated task is, and multiple predictions with respect to the forthcoming
states of the human action. To confirm their predictions, the hypothe-
ses request information from an attentional mechanism, which allocates
the robot’s resources as a function of the saliency of the hypotheses. In
this paper we augment the attention mechanism with a component that
considers the content of the hypotheses’ requests, with respect to the con-
tent’s reliability, utility and cost. This content-based attention component
further optimises the utilisation of the resources while remaining robust
to noise. Such computational mechanisms are important for the develop-
ment of robotic devices that will rapidly respond to human actions, either
for imitation or collaboration purposes.
Yiannis Demiris and Bassam Khadhouri are with the Department of Electrical and Elec-
tronic Engineering, Imperial College London. y.demiris@imperial.ac.uk
Interaction Studies 9:2 (2008), pp. 353–376.
DOI 10.1075/is.9.2.10dem
issn 1572–0373 / e-issn 1572–0381
1 Introduction
To increase the utility and flexibility of robotic assistants, it is useful to equip
them with the capacity to understand the actions of humans in their environ-
ment. Such understanding can lead to easy programming of robotic assistants
to perform new tasks through human demonstration (Demiris and Johnson,
2003; Schaal et al., 2003), as well as the potential for collaborating with hu-
mans to perform a common task. This quest leads to several challenges: while
observing a demonstration, it is not clear what aspect of the demonstration
needs to be attended to, either for imitating it or for determining the best
course of action in a collaboration scenario. As a result, there is no obvious top-
down (goal-directed) control as to where to direct the robot’s attention. While
saliency-based approaches for the control of attention (Itti and Koch, 2000)
can be very useful, for example by directing the attention to the most rapidly
moving object, a more goal-directed attentional control is needed (Demiris and
Khadhouri, 2006).
In the next few sections we will review our approach to the goal-directed
control of attention (Demiris and Khadhouri, 2006), in dynamic scenes involving
human actions. We subsequently present our experimental setup consisting of a
robot camera observing a human during different behaviour demonstrations, and
demonstrate how a content-based control of top-down (or goal directed) attention
requests can optimise the use of the computational and sensor resources of the
observer robot.
2 Background
An important issue that remains unresolved in imitation and collaboration re-
search is where the attention of the observer should be focused when a demon-
strator performs an action. Whilst there is little agreement on a universal
definition of attention (Tsotsos, 2001), from an engineering point of view it can
be defined as a mechanism for allocating the limited perceptual and motor re-
sources of an agent to the most relevant sensory stimuli. The control inputs to
the attention mechanism can be divided into two categories: stimulus-driven (or
bottom-up) and goal-directed (or top-down). Stimulus-driven attention models
work by attaching levels of saliency to low-level features of the visual scene,
e.g. colour, texture or movement, and then deriving the corresponding saliency
maps, fusing them, and applying a winner-take-all strategy to direct the atten-
tion to the most salient part of the scene (Itti and Koch, 2000). However, it is
well known from human psychophysical experiments that top-down information
can also influence bottom-up processing (e.g. (Wolfe, 1994; Treue and Trujillo,
1999)). The top-down information can be derived from the task requirements in
hand, or from other sources of knowledge about the observed action, for exam-
ple from linguistic descriptions (Steels and Baillie, 2003). Wolfe (Wolfe, 1994)
put forward the hypothesis that the goal-directed element of attention selects
bottom-up features that are relevant to the current task (in what is termed
visual search) by varying the weighting of the feature maps. However, the fun-
damental question of what determines the features that are relevant to the task
has not been answered in a principled way. This is the case particularly when
the tasks are performed by someone else without the observer having access to
the internal motivational systems of the demonstrator. The task demonstrated
is not known in advance, therefore the top-down selection of features to attend
to is not obvious, and an online method for selecting features dynamically is
needed.
In previous work we have utilised the motor theory of perception (Demiris
and Hayes, 2002) for deriving a principled mechanism for the goal-directed con-
trol of attention during action perception. The particular method we employ
in the HAMMER architecture relies on the concept of motor simulation. Men-
tal simulation theories of cognitive function (Hesslow, 2002), of which motor
simulation is an instance, advocate the use of the observer’s cognitive and mo-
tor structures in a dual role: on-line, for the purposes of perceiving and acting
overtly, and off-line, for simulating and imagining actions and their consequences
(Jeannerod, 2001).
The key contribution of this paper is the extension of this mechanism to add
content-based factors to the arbitration process.
3 Overview of the HAMMER architecture
HAMMER is organized around, and contributes towards, three concepts:
The basic building block involves a pair of inverse and forward models
(Narendra and Balakrishnan, 1997; Wolpert and Kawato, 1998; Demiris
and Hayes, 2002). These are used in the dual role of either executing or
perceiving an action, an idea first proposed and implemented in (Demiris,
1999; Demiris and Hayes, 1999).
These building blocks are arranged in a hierarchical, parallel manner
(Demiris and Johnson, 2003).
The limited computational and sensor resources are taken explicitly into
consideration: we do not assume that all state information is instantly
available to the inverse model that requires it, but instead formulate state
needs as requests to an attention mechanism. We will describe how this provides a
principled approach to the top-down control of attention during imitation.
HAMMER uses the concepts of inverse and forward models (Narendra
and Balakrishnan, 1997; Karniel, 2002; Wolpert and Kawato, 1998). An inverse
model (akin to the concepts of a behavior, a controller, or action) is a function
that takes as inputs the current state of the system and the target goal(s), and
outputs the control commands that are needed to achieve or maintain those
goal(s). On the other side of the spectrum a forward model (akin to the concept
of internal predictor), is a function that takes as inputs the current state of the
system and a control command to be applied on it and outputs the predicted
next state of the controlled system. Inverse and forward models can be learned
through reinforcement learning, and motor babbling (Dearden and Demiris,
2005).
The building block of HAMMER is an inverse model paired with a forward
model. When HAMMER is asked to rehearse or execute a certain action, the
inverse model module receives information about the current state (and, op-
tionally, about the target goal(s)), and it outputs the motor commands that
it hypothesises are necessary to achieve or maintain these implicit or explicit
target goal(s). The forward model provides a prediction of the upcoming states
should these motor commands be executed. This estimate is returned to
the inverse model, allowing it to adjust any parameters of the action.
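To make the pairing concrete, the following minimal Python sketch illustrates the interface just described; the class names, the single-variable toy dynamics and the proportional gain are assumptions for illustration only, not the original HAMMER implementation.

# Minimal sketch of an inverse/forward model pair (illustrative only; the class
# names, the toy dynamics and the gain are assumptions).

class InverseModel:
    """Maps (current state, goal) to a motor command via a proportional rule."""
    def __init__(self, goal, gain=0.5):
        self.goal = goal                      # target value of the controlled variable
        self.gain = gain                      # assumed controller gain

    def command(self, state):
        return self.gain * (self.goal - state)

class ForwardModel:
    """Maps (current state, motor command) to a predicted next state."""
    def predict(self, state, command):
        return state + command                # assumed toy dynamics

if __name__ == "__main__":
    inv, fwd = InverseModel(goal=1.0), ForwardModel()
    state = 0.0
    for _ in range(5):                        # internally rehearse the action
        u = inv.command(state)                # motor command towards the goal
        state = fwd.predict(state, u)         # predicted next state, fed back
        print(round(state, 3))                # 0.5, 0.75, 0.875, ...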
Interestingly, as first proposed and demonstrated in (Demiris, 1999; Demiris
and Hayes, 1999), the pairs of inverse and forward models can be used in a
dual role, to both plan and execute an action, as well as perceive it when
demonstrated by another agent. For HAMMER to determine whether a visu-
ally perceived demonstrated action matches a particular inverse-forward model
coupling, the demonstrator’s current state as perceived by the imitator is fed
to the inverse model. The inverse model generates the motor commands that it
would output if it was in that state and wanted to execute this particular action.
The motor commands are inhibited from being sent to the motor system. The
forward model outputs an estimated next state, which is a prediction of what
the demonstrator’s next state will be. This predicted state is compared with
the demonstrator’s actual state at the next time step. This comparison results
in an error signal that can be used to increase or decrease the inverse model’s
confidence value, which is an indicator of how closely the demonstrated action
matches a particular imitator’s action.
Multiple pairs of inverse and forward models can operate in parallel (Demiris,
1999; Demiris and Hayes, 1999, 2002). Fig. 1 shows the basic structure. When
the demonstrator agent executes a particular action the perceived states are fed
into all of the imitator’s available inverse models. As described earlier, this gen-
erates multiple motor commands (representing multiple hypotheses as to what
Figure 1: The basic architecture, showing multiple inverse models (B1 to Bn)
receiving the system state, suggesting motor commands (M1 to Mn), with which
the corresponding forward models (F1 to Fn) form predictions regarding the
system’s next state (P1 to Pn); these predictions are verified at the next time
state, resulting in a set of error signals (E1 to En)
action is being demonstrated) that are sent to the corresponding forward mod-
els. The forward models generate predictions about the demonstrator’s next
state: these are compared with the actual demonstrator’s state at the next time
step, and the error signal resulting from this comparison affects the confidence
values of the inverse models. At the end of the demonstration (or earlier if
required) the inverse model with the highest confidence value, i.e. the one that
is the closest match to the demonstrator’s action, is selected. This architec-
ture has been implemented in real-dynamics robot simulations (Demiris, 1999;
Demiris and Hayes, 1999, 2002), and robotic platforms (Demiris and Johnson,
2003; Johnson and Demiris, 2004) and has offered plausible explanations and
testable predictions regarding the behaviour of biological imitation mechanisms
in humans and monkeys (review in (Demiris, 1999; Demiris and Johnson, 2007)).
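The recognition loop described in this section can be sketched as follows; the candidate goals, the toy dynamics and the observed demonstrator trajectory are assumptions for illustration, and the qualitative correctness test stands in for the comparison between prediction and observed next state.

# Sketch of action perception with multiple inverse/forward model pairs running
# in parallel (illustrative; goals, dynamics and trajectory are assumed).

def inverse_command(state, goal, gain=0.5):
    return gain * (goal - state)              # command the pair *would* issue

def forward_predict(state, command):
    return state + command                    # assumed toy forward dynamics

hypotheses = {"reach_left": -1.0, "stay": 0.0, "reach_right": 1.0}
confidence = {name: 0 for name in hypotheses}

observed = [0.0, 0.1, 0.25, 0.4, 0.6, 0.75, 0.9]   # assumed demonstrator states

for t in range(len(observed) - 1):
    state, next_state = observed[t], observed[t + 1]
    for name, goal in hypotheses.items():
        u = inverse_command(state, goal)       # inhibited, never sent to motors
        predicted = forward_predict(state, u)  # prediction of the next state
        # Qualitative check: does the observed motion go in the predicted direction?
        correct = (predicted - state) * (next_state - state) > 0
        confidence[name] += 1 if correct else -1

print(max(confidence, key=confidence.get))     # "reach_right"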
4 Top-down control of attention
The architecture as stated so far assumes that the complete state information
will be available for and fed to all the available inverse models. However, the
sensory and memory capacities of the observer are limited, and obtaining the
complete state information is costly (for example, it might require multiple
saccades), so in order to increase the efficiency of the architecture, we do not
seek to obtain all the state information for all the inverse models. Since each of
the inverse models requires a subset of the global state information (for example,
one might only need the arm position rather than full body state information),
we can optimize this process by allowing each inverse model to request a subset of
the information from an attention mechanism, thus exerting a top-down control
on the attention mechanism. Since HAMMER is inspired by the simulation
theory of mind view of action perception, it asserts the following:
for a given action, the information that the attention system will try to extract
during the action’s demonstration is the state of the variables the corresponding
inverse model would have to control if it was executing this action. For example,
the inverse model for executing an arm movement will request the state of the
arm when used in perception mode. This novel approach provides a principled
way for supplying top-down signals to the attention system. Depending on the
hypotheses that the observer has on what the ongoing demonstrated task is, the
attention will be directed to the features of the task needed to confirm one of the
hypotheses. Since there are multiple hypotheses, thus multiple state requests,
the saliency of each request can be made a function of the confidence that each
inverse model possesses. This removes the need for ad-hoc ways for computing
the saliency of top-down requests. Top-down control can then be integrated
with saliency information from the stimuli itself, allowing a control decision to
be made as to where to focus the observer’s attention. An overall diagram of
this is shown in figure 2.
In the experiments we reported in (Demiris and Khadhouri, 2006), we in-
vestigated a number of arbitration mechanisms and scheduling algorithms. We
found that the best strategy to use for allocation was what was termed RR-
HCAW: initially the resources are distributed equally across behaviours (RR:
Round Robin algorithm), switching to a strategy that favours the behaviour
Figure 2: The HAMMER architecture (Demiris and Khadhouri, 2006), incor-
porating the attention system; B1 to Bn are the available inverse models, and
the arbitration block has to decide which of their requests to satisfy.
with the highest confidence (HCAW: Highest Confidence Always Wins) once such a
behaviour can be clearly identified.
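As an illustration, the sketch below captures the RR-HCAW switching in Python; the "clear lead" test and the margin it uses are placeholders rather than the criterion used in the experiments (see Section 6.3), and the toy confidence updates are assumed.

# Illustrative RR-HCAW arbitration: Round Robin until one behaviour takes a
# clear lead, then Highest Confidence Always Wins. The lead test and margin
# below are assumed placeholders.

def select_behaviour(confidences, rr_pointer, margin=3):
    """Return (index of the behaviour whose request is serviced, new RR pointer)."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    leader, runner_up = order[0], order[1]
    if confidences[leader] >= confidences[runner_up] + margin:
        return leader, rr_pointer               # HCAW phase: clear winner
    selected = rr_pointer % len(confidences)    # RR phase: cycle through all
    return selected, rr_pointer + 1

if __name__ == "__main__":
    confs, ptr = [0, 0, 0, 0], 0
    for step in range(10):
        chosen, ptr = select_behaviour(confs, ptr)
        confs[chosen] += 1 if chosen == 2 else -1   # pretend behaviour 2 matches
        print(step, chosen, confs)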
5 Introducing the content-based control of attention
In the work we have reported so far (Demiris and Khadhouri, 2006), distribu-
tion of resources is done on a single behaviour basis; one behaviour is selected
and its request is serviced. By introducing content-based control, the archi-
tecture refines its selection process by going one step further and examining the
content of each request, across all behaviours; as we will see, this allows
for better scheduling of resources and significantly improves the performance
characteristics of the architecture of (Demiris and Khadhouri, 2006).
5.1 Adding content-based factors
In the work reported in this paper, in addition to the confidence values which
are used again to reflect how well a hypothesised behaviour matches the demon-
strated one, a number of new factors that add content-based control are used.
We list below the new factors that are used, grouped in three categories: Relia-
bility of content, Utility of content, and Cost of content; their usage can be seen
in figure 4.
5.1.1 Reliability of content
Is the request asking for the location of a dynamic or static object1?
Dynamic objects would require much more frequent update than static
objects, because their location can change constantly. A static object
however is most likely to have its current location matching that of the
memory from the last time it was seen. Hence it would not need frequent
1 A dynamic object is an object that can move by itself, e.g. the hand, whereas a static
object requires force from a dynamic object to move it.
updating, and a short-term memory component holding the details of the
various object locations can be used (this can, for example, save us a
saccade). We can thus give less priority to requests for locations of static
objects that had their locations updated recently in the memory, giving
more importance to dynamic objects’ requests.
When was the last time this request was attended to? Is the current
location of the object asked for in this request valid? This knowledge
can be used to validate whether the location of an object (be it static,
dynamic, or static in contact with a dynamic object) has expired.
Can a simple prediction of the object’s location be performed?
While the object is not being attended to, a prediction is generated to es-
timate its current location. This is based on whether the object is static,
dynamic or static in contact with a dynamic object. If the object is dy-
namic (or static in contact with a dynamic object), then a simple linear
extrapolation method is used to predict its next location.
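A minimal sketch of these reliability checks is given below; the expiry windows (in frames) and the two-point linear extrapolation are assumed for illustration.

# Illustrative reliability-of-content checks: a short-term memory of object
# locations, a staleness test, and a simple linear extrapolation for dynamic
# objects. The expiry windows are assumed parameters.

EXPIRY_FRAMES = {"static": 20, "dynamic": 2}

class ObjectMemory:
    def __init__(self, kind):
        self.kind = kind                 # "static" or "dynamic"
        self.history = []                # list of (frame, (x, y)) observations

    def update(self, frame, position):
        self.history.append((frame, position))

    def is_reliable(self, frame):
        """Is the stored location still trustworthy at this frame?"""
        if not self.history:
            return False
        last_frame, _ = self.history[-1]
        return frame - last_frame <= EXPIRY_FRAMES[self.kind]

    def predict(self, frame):
        """Estimate the current location without attending to the object."""
        if self.kind == "static" or len(self.history) < 2:
            return self.history[-1][1]
        # Dynamic object: linear extrapolation from the last two observations.
        (t0, (x0, y0)), (t1, (x1, y1)) = self.history[-2], self.history[-1]
        dt = max(t1 - t0, 1)
        vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
        return (x1 + vx * (frame - t1), y1 + vy * (frame - t1))

if __name__ == "__main__":
    hand = ObjectMemory("dynamic")
    hand.update(0, (10.0, 50.0))
    hand.update(3, (16.0, 50.0))
    print(hand.is_reliable(4), hand.predict(9))   # True (28.0, 50.0)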
5.1.2 Utility of content
How many behaviours are submitting this request?
More importance is given to a more popular request that is required by a
larger number of behaviours. This is because it would help enable more
behaviours to be processed.
Will the content attended to also serve the request of the winning be-
haviour? In other words, is it being made by the behaviour that is cur-
rently selected by the scheduling algorithm? If it is, the request’s priority
is increased, since it would always remain top priority for the attention
model to serve the most-likely behaviour, to ensure correct recognition of
the demonstrated action.
How many behaviours will be completely satisfied if this request is at-
tended to?
This is different from “How many behaviours are submitting this request?”
since a behaviour can have more than one request, meaning that answering
a request may not be sufficient to completely satisfy the behaviour that
made it. This information is only used to grade the importance of a
request.
When was the last time the behaviour was attended?
This ensures that behaviours which have been left unattended for a long
time, are given a chance, even if they are not winning behaviours.
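The utility factors above could be combined as in the sketch below; note that the architecture itself combines them by strict priority with tie-breaking (Section 5.2), so the weighted sum and the weights used here are assumptions for illustration.

# Illustrative utility score for a request; the weights are assumed.

def utility(request, behaviours, winning, frame):
    """behaviours: dict name -> {"requests": set, "last_attended": frame}."""
    submitting = [b for b, info in behaviours.items() if request in info["requests"]]
    fully_satisfied = [b for b in submitting if behaviours[b]["requests"] == {request}]
    idle_bonus = max((frame - behaviours[b]["last_attended"] for b in submitting), default=0)
    return (10 * len(fully_satisfied)                 # behaviours completed by this request
            + 5 * len(submitting)                     # popularity of the request
            + (20 if winning in submitting else 0)    # serves the winning behaviour
            + idle_bonus)                             # behaviours left unattended for long

if __name__ == "__main__":
    behaviours = {
        "B1": {"requests": {"hand"}, "last_attended": 40},
        "B5": {"requests": {"hand", "objA", "objB"}, "last_attended": 10},
    }
    print(utility("hand", behaviours, winning="B1", frame=50))   # 80
    print(utility("objA", behaviours, winning="B1", frame=50))   # 45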
5.1.3 Cost of obtaining the content
How far is the requested content from the currently attended location,
thus what is the cost of the saccade to the new location? This information
is important in order to extend the model’s ability to deal with camera
saccades to objects that lie outside the current camera input.
The previous memory of the object’s location is used, with the help of the
prediction mentioned earlier, to estimate the distance (and direction) of
the saccade needed in order to update its location. The distance of the
saccade acts as a cost function that is used as an input to optimal path
criteria. This enables the determination of the shortest path needed to
update the objects that need saccading to.
Experiments that simulate saccades were also incorporated, and a saccade
cost was introduced. This saccade cost was proportional to the distance the
camera has to saccade in order to view the object. The cost was normalised to
a value between 1 and 10. Therefore, for an image that has a width of x pixels
and a height of y pixels, and assuming that the centre of the attention box is at
(x1, y1) and the location for the object to saccade to is (x2, y2), then the cost is
calculated as:
SC = \mathrm{round}\left( \frac{10\,\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}}{\sqrt{x^2 + y^2}} \right)

Figure 3: A general overview of the architecture
The result of this calculation would freeze the allocation of resources for SC
number of frames (which simulates the time it takes for the camera to do the
saccade).
In some of the experiments shown later, the objects used were placed outside
the visual scene to force saccades. Throughout the experiments, only a simula-
tion of the saccade, with its associated cost value (SC) based on the distance
of the object from the attention box, was implemented.
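For concreteness, the cost calculation above can be transcribed directly into Python; the example coordinates are assumed.

# Saccade cost: distance from the current attention box centre (x1, y1) to the
# object location (x2, y2), normalised by the image diagonal and scaled to
# roughly 1-10. Attention is then frozen for SC frames.
import math

def saccade_cost(x1, y1, x2, y2, width, height):
    distance = math.hypot(x1 - x2, y1 - y2)
    diagonal = math.hypot(width, height)
    return round(10 * distance / diagonal)

if __name__ == "__main__":
    # Example with the 160x120 images used in the experiments.
    print(saccade_cost(80, 60, 150, 20, 160, 120))   # prints 4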
Figure 3 shows the top-level of how the content-based approach will work.
First, it is necessary to perform an overall initial scan to identify which objects
(from the robot’s database) are present in the scene and to identify the state
of these objects, whether they are together, or isolated, etc. The content based
control is subsequently activated, following the steps described in figure 4 which
are described below. Based on the selection of this block, the behaviours that
can be computed are then processed.
Figure 4: A detailed overview of the content-based control architecture
5.2 The content-based control algorithm
There are two main goals that are desired from this new approach, listed here
in their priority order:
To give priority to the winning behaviour using the HCAW scheduling al-
gorithm when there is a clear winner, and otherwise to give priority to all
behaviours using the RR scheduling algorithm when there is no winner yet
(which is often the case during the initialisation stages)2. This goal will
ensure that the performance of the content-based control mechanism can,
at least, maintain the performance of the previous work (Demiris and
Khadhouri, 2006).
To maximise the number of behaviours being computed through intelligent
allocation of the resources between the requests, whilst carrying out the
first goal above.
Given the two goals above, the main part of the algorithm in figure 4 is
the block, “Does the top request from the ranked list belong to the selected
behaviour?”. The input to this block is a ranked list of requests and the selected
behaviour from the RR-HCAW algorithm. If this request is being made by the
winning behaviour, then it will be attended to, unless doing so would only
produce redundant information. One example of this is when a request asks
for the location of a static object that was attended to only recently, and
whose location therefore remains unchanged.
If the request however is not being made by the currently winning behaviour,
then the algorithm will still attend to it, on the condition that it is being made
by a behaviour that has remained idle for a long period. In this case, however,
since there is time for the attention model to allocate resources to requests that
serve a behaviour other than the winning behaviour, a further mini-competition
takes place to find the lowest saccade cost. A simple optimal path algorithm
2On the implementation level, the RR-HCAW switching technique from the basic HAM-
MER implementation was used again to decide on when to give priority to every behaviour
equally, and when to give priority to the winning behaviour.
is used to find the nearest location that requires the smallest saccade. If there
are other similar requests that are nearby, requiring a small saccade, then these
too are attended to, in order to minimise the overall number of saccades needed
throughout the experiment. These further saccades will only continue to take
place, however, if the winning behaviour is still being satisfied, or has only been
compromised for a very short period of time.
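The sketch below illustrates this nearest-first, grouped servicing of requests from idle behaviours; the coordinates and the "nearby" threshold are assumptions.

# Illustrative nearest-first servicing of pending requests, opportunistically
# grouping requests that require only a small extra saccade. The "nearby"
# threshold (in pixels) is an assumed parameter.
import math

def plan_saccades(current, pending, nearby=30.0):
    """current: (x, y) attention centre; pending: dict name -> (x, y)."""
    if not pending:
        return []
    dist = lambda p: math.hypot(p[0] - current[0], p[1] - current[1])
    first = min(pending, key=lambda name: dist(pending[name]))
    plan = [first]
    # Also attend to requests close to the chosen one, to save later saccades.
    fx, fy = pending[first]
    for name, (x, y) in pending.items():
        if name != first and math.hypot(x - fx, y - fy) <= nearby:
            plan.append(name)
    return plan

if __name__ == "__main__":
    print(plan_saccades((80, 60), {"objA": (150, 20), "objB": (140, 35)}))
    # ['objB', 'objA']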
As mentioned in the previous section, determining whether the request is
worth attending to, or just a waste of time because of redundancy, was possible
using the reliability of content factors. These factors are therefore used to serve
the block labelled “Is previous value of top request reliable?” in figure 4.
The block at the top of the algorithm in figure 4 receives all the requests
from all the behaviours. The function of this block is to try to prioritise these
requests according to two factors: how useful it will be to answer the request
(i.e. how many behaviours will be computed as a result of attending to it) and
how popular the request is (i.e. how many behaviours are submitting it). Currently
these are combined by giving priority to the more important factor (namely, how
many behaviours will be computed as a result). If two requests are then equal,
the second factor is used to make a decision (namely, how many behaviours are
submitting it). If there is still a draw, one of the two requests is chosen at
random to go first.
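This two-factor prioritisation with random tie-breaking can be expressed as a single sort, as in the sketch below; the request counts are assumed example values.

# Illustrative ranking of requests: primary key = number of behaviours fully
# computed by answering the request, secondary key = number of behaviours
# submitting it, remaining ties broken at random.
import random

def rank_requests(requests):
    """requests: dict name -> (behaviours_completed, behaviours_submitting)."""
    keyed = [(completed, submitting, random.random(), name)
             for name, (completed, submitting) in requests.items()]
    keyed.sort(reverse=True)                  # random value settles exact draws
    return [name for _, _, _, name in keyed]

if __name__ == "__main__":
    print(rank_requests({"hand": (4, 7), "objA": (0, 3), "objB": (0, 3)}))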
The content-based architecture will have the advantage of enabling multiple
behaviours to be computed at a time, instead of only one behaviour previously.
As a result of this, the resulting graphs for the content based control should look
more similar to the original graphs (without the attention mechanism), resulting
in a smaller loss of information. The additional overhead of the scheduling
algorithm however is very little the algorithm produces a ranked list of requests
based on simple counter increment operations (of the constant order of (Max
number of Requests * N) + N in the worst case scenario (where all behaviours
are asking all needed states, where N is the number of behaviors), followed by
a maximum of three binary decision operations.
Figure 5: The experimental setup involves an ActivMedia Peoplebot observing
a human demonstrator acting on objects
6 Experiments
6.1 Experimental Setup
We implemented and tested this architecture on an experimental setup involv-
ing an ActivMedia Peoplebot, following the setup of (Demiris and Khadhouri,
2006; Demiris and Johnson, 2003); in these experiments the on-board camera
was used as the only sensor. A description of the low-level implementational
details of inverse and forward models, and the low-level vision processing can
be found in (Demiris and Khadhouri, 2006). A human demonstrator performed
an object oriented action (an example is shown in figure 5), and the robot,
using HAMMER, attempted to match the action demonstrated (an example is
shown in figure 6), with the equivalent in its repertoire. In the experiments
reported here, the robot captured the demonstration at a rate of 30Hz, with
an image resolution of 160×120, and the demonstrations lasted an average of
3 seconds. In the following sections, we will describe the performance of the
architecture without an attention mechanism and with two implementations of the
attention mechanism (with and without content-based control). We will compare its performance
against previous implementations of this approach (Demiris and Hayes, 2002;
Demiris and Johnson, 2003) to demonstrate the performance improvements that
the content-based attention-control subsystem of HAMMER brings.
Figure 6: Representative frames from a video sequence involving a human reach-
ing for an object
6.2 Demonstrated Behaviour Set
The following seven new behaviours were implemented in these experiments to
demonstrate and test the model architecture explained earlier.
Behaviour 1 - Move empty hand up; requests: Hand location only.
Behaviour 2 - Move empty hand down; requests: Hand location only.
Behaviour 3 - Move empty hand left; requests: Hand location only.
Behaviour 4 - Move empty hand right; requests: Hand location only.
Behaviour 5 - Put objects together; requests: Hand location, object A and
object B.
Behaviour 6 - Separate two objects; requests: Hand location, object A
and object B.
Behaviour 7 - Move two objects together; requests: Hand location, object
A and object B.
In these behaviours, there are three objects, one dynamic (the hand), and two
static (A and B). The hand has 7 requests in total (i.e. made by all behaviours),
whilst objects A and B have 3 requests each. A coke can and an orange were
chosen for objects A and B respectively.
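For reference, the behaviour set and its requests can be summarised in a small lookup structure; this is an illustrative representation only, with shortened labels, not the actual implementation.

# The behaviour set and its state requests, as listed above.
BEHAVIOURS = {
    "B1 move empty hand up":        {"hand"},
    "B2 move empty hand down":      {"hand"},
    "B3 move empty hand left":      {"hand"},
    "B4 move empty hand right":     {"hand"},
    "B5 put objects together":      {"hand", "objA", "objB"},
    "B6 separate two objects":      {"hand", "objA", "objB"},
    "B7 move two objects together": {"hand", "objA", "objB"},
}

# The hand is requested by all 7 behaviours; objects A and B by 3 each.
requests_per_item = {}
for reqs in BEHAVIOURS.values():
    for item in reqs:
        requests_per_item[item] = requests_per_item.get(item, 0) + 1
print(requests_per_item)   # {'hand': 7, 'objA': 3, 'objB': 3} (key order may vary)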
6.3 Experimental Design
Three different sets of experiments were designed, in order to demonstrate how
the added content-based control can improve the attention model in three dif-
ferent ways.
Experiments to show that it will generally improve the recognition results
for the winning behaviour. We performed three sets of experiments to
demonstrate this point, for each of the 7 behaviours, with the following
conditions:
1. No attention mechanism - at each frame all the behaviours receive
the state information they require, regardless of how long it takes
(or how costly it is) to collect it, and all compute and update their
confidences. This condition is the same used in previous experiments
(Demiris and Hayes, 2002; Demiris and Johnson, 2003), and will serve
as the reference condition for comparing the performance of the at-
tention subsystem. Under this condition, the system contains com-
plete knowledge of the whole state information and the resources to
supply all hypotheses with the necessary information for them to be
confirmed/dismissed. By definition, this condition will perform bet-
ter than any attention system that we can devise, since any attention
system will result in sub-sampling of the state information, thus can-
not possibly result in more accurate recognition. However, these are
idealistic circumstances, and embedded systems cannot be assumed
to have the necessary resources to have complete state information.
Hence the need for attention mechanisms; the no-attention con-
dition is included as the reference signal, i.e. the theoretically ideal
condition that cannot always be achieved.
2. RR-HCAW, the best performing strategy from the content-free allo-
cation strategies (Demiris and Khadhouri, 2006), which combines the
Round-Robin (RR) implementation and the HCAW (Highest Confi-
dence Always Wins) condition. Initially all behaviours are given
equal treatment (RR), but when one of them takes a clear lead (its
confidence becomes higher than 50% of the average confidence of
all behaviours with positive confidences), the arbitration mechanism
switches to HCAW.
3. Content-based approach, in addition to the mechanisms of condition 2.
Experiments to quantify explicitly the cost of involving saccades, which
are necessary for more realistic situations where not all the information
required is in the current field of view.
Experiments to show that the content-based control allows the system
to recover from initial errors by (as much as possible) keeping track of
other behaviours to allow them to recover later, where the final winning
behaviour is not the same as the initial winning behaviour.
In all conditions, the same formula is used to update the confidence of the
inverse models, and follows the procedure of Demiris and Khadhouri (2006): the
data requested, once retrieved, are sent to the corresponding inverse model that
asked for them. The motor command of that inverse model is generated and
sent to the corresponding forward model, which forms a prediction of what the
next state will be. We then calculate the prediction error and use it to update
the confidence of the inverse model. Since we are using a qualitative prediction
(e.g. hand closer to target), the prediction will either be correct or not. The
confidence of the inverse model is then updated according to the following rule:
C(t) = \begin{cases} C(t-1) + 1, & \text{if the prediction is correct} \\ C(t-1) - 1, & \text{if the prediction is incorrect} \end{cases} \qquad (1)
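Equation (1) in code form (a trivial sketch; the example sequence of prediction outcomes is assumed):

# Confidence update rule from Equation (1): +1 for a correct qualitative
# prediction, -1 for an incorrect one.

def update_confidence(confidence, prediction_correct):
    return confidence + 1 if prediction_correct else confidence - 1

# Example: a hypothesis whose predictions are right on 4 of 5 frames.
c = 0
for correct in [True, True, False, True, True]:
    c = update_confidence(c, correct)
print(c)   # 3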
Figure 7 serves as a legend for the other figures, and shows the assigned
labels (B1 to B7) for the different behaviours.
Figure 7: The corresponding labels for the behaviours used in the experiments
Figure 8: B1 demonstrated. Evolution of the confidences when no attention
mechanism is employed
6.4 Experiments to show recognition improvement using
content-based control
Figure 8 gives the evolution of the confidences of all the behaviours, whilst
behaviour 1 (as an example) was being demonstrated, when no attention mech-
anism is employed.
Figure 9 gives the evolution of the confidences of all the behaviours, whilst
behaviour 1 (as an example) was being demonstrated, using the attention model
of our previous work (Demiris and Khadhouri, 2006) that only employs resource
scheduling algorithms working on the confidence values.
Figure 10 gives the evolution of the confidences of all the behaviours, whilst
behaviour 1 (as an example) was being demonstrated, when the content-based
approach is added.
Table 1 gives a summary of the performance of the attention mechanism
for each of the behaviours. In all conditions, all behaviours were recognised
correctly. Results of the final confidence value of the demonstrated action are
Figure 9: B1 demonstrated. Evolution of the confidences using the RR-HCAW
condition
Figure 10: B1 demonstrated. Evolution of the confidences using a content-based
approach
Table 1: Confidence for each behaviour for each condition as a percentage of
the confidence reached with no attention present
shown for all experimental conditions. Percentages are calculated by making a
relative comparison between each confidence value and the corresponding one
from the version of the architecture without the attention mechanism. The re-
sults show that with the content-based control activated, the confidence levels of
each of the behaviours are closer to those when no attentional control is present
than the ones of the RR-HCAW condition, the best performing non-content-
based attentional control strategy (Demiris and Khadhouri, 2006). These re-
sults indicate that content-based control can maintain the benefits brought by
an attention mechanism, but at the same time with less impact on the reliability
of the predictions.
6.5 Introducing saccades
The purpose of this subsection is to introduce experiments that quantify the
effects of saccades in content-based control. Saccades are in general demanding
because, during the movement required to perform them, no new data can be
collected, so fewer computations are possible. As a result, the allocation of
resources becomes even more critical, and resources must be allocated correctly
in order to still achieve high confidence values.
Performing a saccade freezes the allocation of resources for SaccadeCost frames
(simulating the time it takes for the camera to perform the saccade, during which
all processes must be frozen).

Figure 11: B1 demonstrated. Evolution of the confidences using a content-based
approach for scenarios requiring saccades
Where possible, objects A and B were placed outside the visual scene to
incorporate saccades. Throughout the experiments, only a simulation of the
saccade, with its associated cost value (SaccadeCost) based on the distance
of the object from the attention box, was implemented, instead of an actual
physical saccade.
As representative examples of the results that were obtained from the ex-
periments, figures 11 and 12 give the evolution of the confidences of all the
behaviours whilst behaviours 1 and 3 were being demonstrated, respectively, using
the added content-based control.
Table 2 gives a summary of the performance of the attention mechanism
for all the other behaviours. In all conditions, all behaviours were recognised
correctly. Results of the final confidence value of the demonstrated action are
shown for all experimental conditions. Percentages are calculated by making a
relative comparison between each confidence value and the corresponding one
from the plain version of the architecture without the attention mechanism.
Looking at the graph in figure 11 more closely, it can be seen that there
are periods of inactivity where no computations are taking place; these occur
about every 20 frames. This represents the saccade cost, which the model
Figure 12: B3 demonstrated. Evolution of the confidences using a content-based
approach for scenarios requiring saccades
Table 2: Confidence for each behaviour for each condition as a percentage of the
confidence reached with no attention present for scenarios requiring saccades
Figure 13: In this experiment, the demonstrator started by demonstrating B3
by moving his hand to the left, and then intentionally changed direction to
move it to the right, hence demonstrating B4
chooses to go through, sacrificing computations, in order to attend to the static
objects that fall outside the scene, refreshing their static locations, which were
designed to expire after 20 frames in these experiments.
Despite the significant loss in computations, it can be seen from Table 2
that all the behaviours were still identified correctly, with a good average
confidence value of 77.85% for the final winning behaviours.
The attention model with the added content-based control is powerful enough
to deal with saccades, causing a loss in recognition rate of only about 15% (since
the recognition rate in the previous section, without saccades, was 92.66%).
6.6 Experiments to demonstrate recovery
The purpose of this section is to demonstrate that the new added content-based
control will enable good tracking of all the behaviours, such that it can recover
by recognising the correct demonstrated behaviour in case of an error occurring
at the beginning.
In order to achieve this, an experiment was designed where the demonstrator
starts off by demonstrating B3 and then quickly alternates to demonstrate B4.
Figure 13 shows the outcome of this experiment.
The attention model is able to recover and correctly recognise that it was
mainly B4 which was demonstrated, rather than B3.
Since the model managed to recover from this worst-case scenario, where the
eventual winning behaviour was initially scoring negatively, the model should
have no problem dealing with easier scenarios, where the even-
tual winner scores positively at the beginning. One such possibility is to
deal with two behaviours that are similar in nature at the beginning, and only
differ towards the end. In this case, both behaviours would score positively at
the beginning, and therefore recovery will be even easier than what was demon-
strated in this experiment.
The added content-based control has an advantage over the previous attention
model because it allows the computation of more than one behaviour at a time,
keeping track of all the behaviours as much as possible. Previously this was not
the case: only the winning behaviour was being updated, so recovery was not as
efficient.
7 Discussion
By allocating the limited computational resources according to a content-based
approach that is based on behaviour needs and requests, one can see further
significant performance gains.
The higher confidence values for the winning behaviours and the improved
allocation of resources allowed the attention model to be pushed to work in
more demanding situations, such as those requiring saccade generation. Experiments
showed that the attention model still managed to correctly identify the winning
behaviours in these situations, with a high average confidence value of 78%.
Furthermore, using this approach produced graphs with shapes that no
longer just clearly identified the winning behaviour, as was previously achieved,
but also better matched the shape of the original graphs of the perfect
scenario with no attention mechanism at all.
This means that most of the information about the other behaviours is
retained, which has also resulted in better performance under noisier condi-
tions, as well as assisting, in some rare circumstances, in correctly identifying
the demonstrated behaviour even if it did not have the highest confidence value
to start with.
Overall therefore, the addition of the content-based control to the attention
model has improved the attention model in three useful ways:
It generally improved the recognition results for the winning behaviours.
It is therefore better equipped to handle tougher situations such as sac-
cades.
It kept good track of other behaviours too, minimising ambiguity in scene
understanding, hence also allowing a later recovery for a demonstrated
behaviour that did not start out as the winning behaviour.
The problem we have addressed, which we have termed utilisation of re-
sources, can be seen as a question of sampling: how coarse can the sampling
become while retaining accuracy of the results, compared to perfect sampling
(here represented by the no-attention condition, as explained above)? In the
RR-HCAW condition, given that only one inverse model is granted its request,
we obtain 1/N samples per time unit, where N is the number of inverse mod-
els putting forward a request, saving (N-1)/N sampling requests per time unit.
It might be the case that our hardware setup can actually handle N requests per
time unit; however, we cannot assume that this is always the case, particularly
in embedded systems. Even if the computational resources are sufficient (and
in this case they are), when non-sharable resources are introduced (e.g. a pan-
and-tilt camera) there will be cases (e.g. any situation where the observed action
takes place over an area larger than the field of view, necessitating saccades)
where perfect sampling will not be possible.
The sampling in the content-based approach remains the same: only one
request is serviced, but we now look at the requests at the content level and
rank them accordingly. The key point here is that if the serviced request is
made by more than one behaviour, we distribute the results to all such behaviours,
which can only improve the results (as seen above) by bringing them closer
to the theoretically optimal but not always achievable situation (no attention).
8 Conclusions
Attention control is an important aspect of understanding human actions. In
our work, we have employed a motor theory approach to the control of at-
tention, putting forward HAMMER as a principled method for calculating the
goal-directed importance of visual information. In this paper we have demon-
strated that by examining the content of the requests of multiple hypotheses,
and calculating its utility, reliability and cost, we can further optimise the sensor
usage of the robotic assistant during the observation of human actions. This will
enable robotic assistants to respond to human actions rapidly and efficiently.
References
Dearden, A. and Demiris, Y. (2005). Learning forward models for robotics. In
Proceedings of the International Joint Conference on Artificial Intelligence
(IJCAI), pages 1440–1445.
Demiris, J. (1999). Movement imitation mechanisms in robots and humans.
PhD thesis, University of Edinburgh, Scotland, UK.
Demiris, J. and Hayes, G. M. (1999). Active and passive routes to imitation.
In Dautenhahn, K. and Nehaniv, C., editors, Proceedings of the AISB Sym-
posium on Imitation in Animals and Artifacts, pages 81–87.
Demiris, Y. and Hayes, G. (2002). Imitation as a dual route process featuring
predictive and learning components: a biologically-plausible computational
model. In Dautenhahn, K. and Nehaniv, C., editors, Imitation in Animals
and Artifacts, chapter 13, pages 327–361. MIT Press.
Demiris, Y. and Johnson, M. (2003). Distributed, predictive perception of
actions: a biologically inspired architecture for imitation and learning. Con-
nection Science, 15(4):231–243.
Demiris, Y. and Johnson, M. (2007). Simulation theory of understanding others:
A robotics perspective. In Dautenhahn, K. and Nehaniv, C., editors, Imitation
and Social Learning in Robots, Humans and Animals: Behavioural, Social and
Communicative Dimensions, pages 89–102. Cambridge University Press.
Demiris, Y. and Khadhouri, B. (2006). Hierarchical, attentive multiple models
for execution and recognition. Robotics and Autonomous Systems, 54:361–
369.
Hesslow, G. (2002). Conscious thought as simulation of behaviour and percep-
tion. Trends in Cognitive Sciences, 6(6):242–247.
Itti, L. and Koch, C. (2000). A saliency-based search mechanism for overt and
covert shifts of visual attention. Vision Research, 40(10-12):1489–1506.
Jeannerod, M. (2001). Neural simulation of actions: a unifying mechanism for
motor cognition. NeuroImage, 14:103–109.
Johnson, M. and Demiris, Y. (2004). Abstraction in recognition to solve the
correspondence problem for robot imitation. In Proceedings of TAROS, pages
63–70, Essex.
Karniel, A. (2002). Three creatures named forward model. Neural Networks,
15:305–307.
Narendra, K. S. and Balakrishnan, J. (1997). Adaptive control using multiple
models. IEEE Transactions on Automatic Control, 42(2):171–187.
Schaal, S., Ijspeert, A., and Billard, A. (2003). Computational approaches to
motor learning by imitation. Phil. Trans. R. Soc London B, 358:537–547.
Steels, L. and Baillie, J. (2003). Shared grounding of event descriptions by
autonomous robots. Robotics and Autonomous Systems, 43:163–173.
Treue, S. and Trujillo, J. C. M. (1999). Feature-based attention influences
motion processing gain in macaque visual cortex. Nature, 399:575–579.
Tsotsos, J. (2001). Motion understanding: Task-directed attention and repre-
sentations that link perception with action. International Journal of Com-
puter Vision, 45(3):265–280.
Wolfe, J. M. (1994). Visual search in continuous, naturalistic stimuli. Vision
Research, 34:1187–1195.
Wolpert, D. M. and Kawato, M. (1998). Multiple paired forward and inverse
models for motor control. Neural Networks, 11:1317–1329.
... Saccadic movements are then performed according to a certain confidence level attributed to each of the competing models, and to the saliency of a feature (either hand or object). Previous approaches required knowledge of the features of the different targets present in the environment to detect them [5], or knowledge of their positions. In this work, we propose a model that can overcome these limits, allowing for simultaneous exploration of the environment and recognition of the actions, exploiting both sources of information to achieve faster action recognition. ...
... A number of different models, at least one for each of the possible targets of the action, compete for both attention allocation and for the final discrimination of the action goal. Following HAMMER guidelines [5], the discrimination between the available action hypotheses is based on the computation of a confidence value that measures theFigure 2. Example of experimental setup, with iCub looking at target objects and arm movements (bottom right). In the top-left the iCub's gaze is following the hand, than, top-right, the gaze anticipate the hand movement, finally, bottom-left, the gaze is focused on the target object before the hand reaches it. ...
Article
Full-text available
Exploratory gaze movements are fundamental for gathering the most relevant information regarding the partner during social interactions. Inspired by the cognitive mechanisms underlying human social behaviour, we have designed and implemented a system for a dynamic attention allocation which is able to actively control gaze movements during a visual action recognition task exploiting its own action execution predictions. Our humanoid robot is able, during the observation of a partner's reaching movement, to contextually estimate the goal position of the partner's hand and the location in space of the candidate targets. This is done while actively gazing around the environment, with the purpose of optimizing the gathering of information relevant for the task. Experimental results on a simulated environment show that active gaze control, based on the internal simulation of actions, provides a relevant advantage with respect to other action perception approaches, both in terms of estimation precision and of time required to recognize an action. Moreover, our model reproduces and extends some experimental results on human attention during an action perception.
... Saccadic movements are performed according to a certain confidence level attributed to each of the competing models, and to the saliency of a feature (either hand or object). Differently from previous approaches, which required the knowledge of the features of the different targets present in the environment to detect them [5], or the knowledge of their positions, in this work we propose a model that can overcome these limits, allowing for simultaneous exploration of the environment and recognition of the actions, exploiting both source of information to achieve faster action recognition. Also, according to a foveal model of vision, we consider that visual information gets more reliable and less noisy moving from the periphery to the center of the visual field. ...
... 1. Example of experimental setup, with iCub looking at target objects and arm movement (its own pointing movement is not relevant here) the action goal. Following HAMMER guidelines [5], the discrimination between the available action hypotheses is based on the computation of a confidence value that measures the overall Euclidean distance between the predicted action trajectories and the observed motion trajectories. To compute such prediction, HAMMER uses a combination of forward and inverse model pairs which are the same models that can be used for action control. ...
Conference Paper
Full-text available
Exploratory gaze movements are fundamental for gathering the most relevant information regarding the partner during social interactions. We have designed and implemented a system for dynamic attention allocation which is able to actively control gaze movements during a visual action recognition task. During the observation of a partner’s reaching movement, the robot is able to contextually estimate the goal position of the partner hand and the location in space of the candidate targets, while moving its gaze around with the purpose of optimizing the gathering of information relevant for the task. Experimental results on a simulated environment show that active gaze control provides a relevant advantage with respect to typical passive observation, both in term of estimation precision and of time required for action recognition.
... The inputs to an attention mechanism can either be stimulus-driven (bottom-up) or goal-directed (top-down) [5]. Stimulus-driven means attending to the most salient regions of the environment and applying a winner-takes-all strategy [6]. ...
... The HAMMER (Hierarchical Attentive Multiple-Models for Execution and Recognition of actions) architecture provides a base for the predictive system (see Fig. 1). It is comprised of three main components: the inverse models (plan generators), the forward models (predictors) and the evaluator [5,13]. ...
Conference Paper
Full-text available
In RTS-style games it is important to be able to predict the movements of the opponent's forces to have the best chance of performing appropriate counter-moves. Resorting to using perfect global state information is generally considered to be `cheating' by the player, so to perform such predictions scouts (or observers) must be used to gather information. This means being in the right place at the right time to observe the opponent. In this paper we show the effect of imposing partial observability onto an RTS game with regard to making predictions, and we compare two different mechanisms that decide where best to direct the attention of the observers to maximise the benefit of predictions.
... However, in the presence of proper constraints, the problem changes, any typical interactions between infants and caretakers are characterized by behavior that creates joint attention around the object of interest, thus effectively reducing the search space to something very noticeable. Joint attention need intentional understanding (Kaplan & Hafner, 2006) but these notions works in a "transitive" way: guessing human intentions with reasoning helps the robot to engage in a joint attention episode with the agent and thus retrieve the most relevant visual information (Demiris & Khadhouri, 2006;Ognibene, Chinellato, Sarabia, & Demiris, 2013). ...
Article
Full-text available
The development of reasoning systems exploiting expert knowledge from interactions with humans is a non-trivial problem, particularly when considering how the information can be coded in the knowledge representation. For example, in human development, the acquisition of knowledge at one level requires the consolidation of knowledge from lower levels. How is the accumulated experience structured to allow the individual to apply knowledge to new situations, allowing reasoning and adaptation? We investigate how this can be done automatically by an iCub that interacts with humans to acquire knowledge via demonstration. Once consolidated, this knowledge is used in further acquisitions of experience concerning preconditions and consequences of actions. Finally, this knowledge is translated into rules that allow reasoning and planning for novel problem solving, including a Tower of Hanoi scenario. We thus demonstrate proof of concept for an interaction system that uses knowledge acquired from human interactions to reason about new situations.
... The HAMMER architecture is a framework based on simulation theory, designed to empower robots with capabilities to understand and imitate human actions based on the four factors described in the previous section. This framework has been implemented in real-dynamics robot simulators [17,14] and real robotic platforms [15,41,18]. Open source versions of the architecture have been freely released [60] with support for the NAO and iCub humanoids. ...
Article
During social interactions, humans are capable of initiating and responding to rich and complex social actions despite having incomplete world knowledge, and physical, perceptual and computational constraints. This capability relies on action perception mechanisms that exploit regularities in observed goal-oriented behaviours to generate robust predictions and reduce the workload of sensing systems. To achieve this essential capability, we argue that the following three factors are fundamental. First, human knowledge is frequently hierarchically structured, both in the perceptual and execution domains. Second, human perception is an active process driven by current task requirements and context; this is particularly important when the perceptual input is complex (e.g. human motion) and the agent has to operate under embodiment constraints. Third, learning is at the heart of action perception mechanisms, underlying the agent's ability to add new behaviours to its repertoire. Based on these factors, we review multiple instantiations of a hierarchically-organised biologically-inspired framework for embodied action perception, demonstrating its flexibility in addressing the rich computational contexts of action perception and learning in robotic platforms.
... Inference of intentions has also been further studied by Wolpert et al. [52], [53], [60], [78], [79] and Demiris et al. [55]- [59], [80], [81]: Their approach is to generate agent behaviors under a number of alternative hypotheses, which are then tested by comparison with observed behaviors. This means that "multiple simulations" are run in parallel, and the most salient one(s) are selected. ...
Article
This position paper introduces the concept of artificial “co-drivers” as an enabling technology for future intelligent transportation systems. In Sections I and II, the design principles of co-drivers are introduced and framed within general human-robot interactions. Several contributing theories and technologies are reviewed, specifically those relating to relevant cognitive architectures, human-like sensory-motor strategies, and the emulation theory of cognition. In Sections III and IV, we present the co-driver developed for the EU project interactIVe as an example instantiation of this notion, demonstrating how it conforms to the given guidelines. We also present substantive experimental results and clarify the limitations and performance of the current implementation. In Sections IV and V, we analyze the impact of the co-driver technology. In particular, we identify a range of application fields, showing how it constitutes a universal enabling technology for both smart vehicles and cooperative systems, and naturally sets out a program for future research.
... First, by understanding the goal or intention of the action, it is possible to direct the robot's attention towards a more informative spot. This helps the robot to obtain more context-related information that can boost perception performance, as discussed in [23]. This is essentially a top-down approach to directing the robot's attention. ...
Article
We study an incremental learning process in which a set of generic basic actions is used to learn higher-level, task-dependent action sequences. A task-dependent action sequence is learned by associating the goal given by a human demonstrator with the task-independent, general-purpose actions in the action repertoire. This process of contextualization is done using probabilistic parsing. We propose stochastic context-free grammars as the representational framework, owing to their robustness to noise, structural flexibility, and the ease of defining task-independent actions. We demonstrate our implementation in a real-world scenario using a humanoid robot and report the implementation issues we encountered.
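As a toy illustration of the grammar-based representation described above (a hand-written example, not the authors' grammar), the sketch below encodes a small stochastic context-free grammar whose terminals are generic basic actions and computes the probability that a high-level task generates an observed action sequence.

```python
# Hand-written toy grammar: a high-level task expands into generic basic
# actions; each rule carries a probability, and the probability of a
# sequence is summed over all derivations that generate it exactly.

RULES = {
    # non-terminal: list of (expansion, probability)
    "MoveObject": [(("Reach", "Grasp", "Transport", "Release"), 0.7),
                   (("Reach", "Grasp", "Release"), 0.3)],
    "Reach":      [(("reach",), 1.0)],
    "Grasp":      [(("grasp",), 1.0)],
    "Transport":  [(("transport",), 1.0)],
    "Release":    [(("release",), 1.0)],
}


def sequence_probability(symbol, actions):
    """Probability that `symbol` generates exactly the observed action list."""
    if symbol.islower():                      # terminal basic action
        return 1.0 if actions == [symbol] else 0.0
    return sum(prob * split_probability(list(expansion), actions)
               for expansion, prob in RULES[symbol])


def split_probability(symbols, actions):
    """Sum over all ways of splitting `actions` among the listed `symbols`."""
    if not symbols:
        return 1.0 if not actions else 0.0
    head, rest = symbols[0], symbols[1:]
    total = 0.0
    for i in range(len(actions) + 1):
        p_head = sequence_probability(head, actions[:i])
        if p_head > 0.0:
            total += p_head * split_probability(rest, actions[i:])
    return total


observed = ["reach", "grasp", "transport", "release"]
print(sequence_probability("MoveObject", observed))   # 0.7 with this toy grammar
```

A probabilistic parser over such a grammar can then rank alternative task interpretations of a demonstration by their probability, which is the robustness-to-noise property the abstract appeals to.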
... Despite the importance of attention in task learning, few studies have focused on it (see (Demiris and Khadhouri, 2008) and (Thomaz and Breazeal, 2008) for rare examples of studies on a robot's attention). A big challenge here is that a top-down architecture for controlling attention cannot be adopted if robots are assumed not to know in advance what to learn. ...
Article
We address the question of how a robot's attention shapes the way people teach. When demonstrating a task to a robot, human partners often emphasize important aspects of the task by modifying their body movement, as caregivers do toward infants. This phenomenon has recently been investigated in developmental robotics; however, what causes such action modification has not yet been identified. This paper presents an experiment examining the influences that a robot's attention has on the task demonstrations of human partners. Our hypothesis is that a robot's bottom-up attention based on visual salience induces partners to exaggerate their body movement, to segment the movement frequently, to approach the robot closely, and so on, all of which are homologous to the modifications seen in infant-directed action. We present quantitative results supporting our hypothesis and discuss which properties of bottom-up attention contribute to eliciting such action modifications.
Conference Paper
Directing robot attention to recognise activities and to anticipate events such as goal-directed actions is a crucial skill for human-robot interaction. Unfortunately, issues such as intrinsic time constraints, the spatially distributed nature of the entailed information sources, and the existence of a multitude of unobservable states affecting the system, such as latent intentions, have long rendered the achievement of such skills a rather elusive goal. The problem tests the limits of current attention control systems. It requires an integrated solution for tracking, exploration and recognition, which have traditionally been seen as separate problems in active vision. We propose a probabilistic generative framework based on information gain maximisation and a mixture of Kalman filters that uses predictions in both recognition and attention control. This framework can efficiently use the observations of one element in a dynamic environment to provide information on other elements, and consequently enables guided exploration. Interestingly, the sensor control policy, directly derived from first principles, represents the intuitive trade-off between finding the most discriminative clues and maintaining overall awareness. Experiments on a simulated humanoid robot observing a human executing goal-oriented actions demonstrated improvements in recognition time and precision over baseline systems.
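The information-gain principle behind this framework can be illustrated separately from the Kalman-filter machinery. The sketch below assumes a discrete belief over action hypotheses and hand-specified observation likelihoods for each candidate fixation, and picks the fixation with the largest expected entropy reduction; all names and numbers are illustrative, not taken from the paper.

```python
# Illustrative gaze selection by expected information gain over a discrete
# belief about which action is being performed.  Likelihood tables are
# assumed, hand-written values for the example.

import math


def entropy(belief):
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def posterior(belief, likelihood):
    unnorm = {h: belief[h] * likelihood[h] for h in belief}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}


def expected_information_gain(belief, obs_models):
    """obs_models maps each possible observation to P(observation | hypothesis)."""
    prior_entropy = entropy(belief)
    gain = 0.0
    for likelihood in obs_models.values():
        p_obs = sum(belief[h] * likelihood[h] for h in belief)
        if p_obs > 0:
            gain += p_obs * (prior_entropy - entropy(posterior(belief, likelihood)))
    return gain


def select_fixation(belief, fixations):
    """fixations maps each candidate fixation to its observation models."""
    return max(fixations, key=lambda f: expected_information_gain(belief, fixations[f]))


belief = {"reach_cup": 0.5, "reach_book": 0.5}
fixations = {
    "look_at_cup":  {"hand_near": {"reach_cup": 0.8, "reach_book": 0.2},
                     "hand_far":  {"reach_cup": 0.2, "reach_book": 0.8}},
    "look_at_wall": {"nothing":   {"reach_cup": 1.0, "reach_book": 1.0}},
}
print(select_fixation(belief, fixations))   # -> "look_at_cup"
```

The same quantity naturally encodes the trade-off mentioned above: fixations that cannot discriminate between hypotheses (the wall in the example) receive zero expected gain.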
Article
A considerable part of the imitation problem is finding mechanisms that link the recognition of actions that are being demonstrated to the execution of the same actions by the imitator. In a situation where a human is instructing a robot, the problem is made more complicated by the difference in morphology. In this paper we present an imitation framework that allows a robot to recognise and imitate object-directed actions performed by a human demonstrator by solving the correspondence problem. The recognition is achieved using an abstraction mechanism that focuses on the features of the demonstration that are important to the imitator. The abstraction mechanism is applied to experimental scenarios in which a robot imitates human-demonstrated tasks of transporting objects between tables.
Article
We do not exist alone. Humans and most other animal species live in societies where the behaviour of an individual influences and is influenced by other members of the society. Within societies, an individual learns not only on its own, through classical conditioning and reinforcement, but to a large extent through its conspecifics, by observation and imitation. Species from rats to birds to humans have been observed to turn to their conspecifics for efficient learning of useful knowledge. One of the most important mechanisms for the transmission of this knowledge is imitation.
Chapter
An interdisciplinary overview of current research on imitation in animals and artifacts. The effort to explain the imitative abilities of humans and other animals draws on fields as diverse as animal behavior, artificial intelligence, computer science, comparative psychology, neuroscience, primatology, and linguistics. This volume represents a first step toward integrating research from those studying imitation in humans and other animals, and those studying imitation through the construction of computer software and robots. Imitation is of particular importance in enabling robotic or software agents to share skills without the intervention of a programmer and in the more general context of interaction and collaboration between software agents and humans. Imitation provides a way for the agent, whether biological or artificial, to establish a "social relationship" and learn about the demonstrator's actions, in order to include them in its own behavioral repertoire. Building robots and software agents that can imitate other artificial or human agents in an appropriate way involves complex problems of perception, experience, context, and action, solved in nature in various ways by animals that imitate.
Article
Most models of visual search, whether involving overt eye movements or covert shifts of attention, are based on the concept of a saliency map, that is, an explicit two-dimensional map that encodes the saliency or conspicuity of objects in the visual environment. Competition among neurons in this map gives rise to a single winning location that corresponds to the next attended target. Inhibiting this location automatically allows the system to attend to the next most salient location. We describe a detailed computer implementation of such a scheme, focusing on the problem of combining information across modalities, here orientation, intensity and color information, in a purely stimulus-driven manner. The model is applied to common psychophysical stimuli as well as to a very demanding visual search task. Its successful performance is used to address the extent to which the primate visual system carries out visual search via one or more such saliency maps and how this can be tested.
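A compact sketch of this scheme, under simplifying assumptions (three pre-computed feature maps, naive min-max normalisation, and a fixed circular inhibition radius), is given below; it is intended only to illustrate the combine / winner-take-all / inhibition-of-return cycle, not to reproduce the published model.

```python
# Toy saliency-map attention loop: normalise and combine feature maps,
# select the most salient location (winner-take-all), then suppress it
# (inhibition of return) so attention shifts to the next location.

import numpy as np


def normalise(feature_map):
    span = feature_map.max() - feature_map.min()
    return (feature_map - feature_map.min()) / span if span > 0 else feature_map * 0.0


def saliency_map(intensity, colour, orientation):
    return (normalise(intensity) + normalise(colour) + normalise(orientation)) / 3.0


def attend(saliency, n_shifts=3, inhibition_radius=2):
    """Yield successive attended locations, inhibiting each winner in turn."""
    s = saliency.copy()
    for _ in range(n_shifts):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        yield (int(y), int(x))
        yy, xx = np.ogrid[:s.shape[0], :s.shape[1]]
        s[(yy - y) ** 2 + (xx - x) ** 2 <= inhibition_radius ** 2] = 0.0


rng = np.random.default_rng(0)
intensity, colour, orientation = (rng.random((16, 16)) for _ in range(3))
for location in attend(saliency_map(intensity, colour, orientation)):
    print("attend to", location)
```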
Article
According to the simulation theory, ‘human beings are able to use the resources of their own minds to simulate the psychological etiology of the behaviour of others’, typically by making decisions within a ‘pretend context’ (Gordon, 1999). During observation of another agent's behaviour, the execution apparatus of the observer is taken ‘off-line’ and is used as a manipulable model of the observed behaviour. From a roboticist's point of view, the fundamental characteristics of the simulation theory are: (1) the utilization of the same computational systems for a dual purpose, both behaviour generation and recognition; and (2) the taking of systems off-line (suspending their normal input/output), which necessitates redirecting and suppressing the input and output from/to the system to achieve this dual use, feeding in ‘pretend states’ derived through a perspective-taking process while suppressing the current ones coming from the visual sensors. Simulation theory is often set as a rival to the ‘theory-theory’, in which a separate theoretical reasoning system is used: the observer perceives and reasons about the observed behaviour not by simulating it, but by utilizing a set of causal laws about behaviours. It is important to note that in the experiments we describe here, ‘understanding’ the demonstrated behaviour equates to inferring its goal in terms of sensorimotor states, and does not imply inference of the emotional, motivational and intentional components of mental states.
Article
The paper describes a system for open-ended communication by autonomous robots about event descriptions anchored in reality through the robot's sensori-motor apparatus. The events are dynamic, and agents must continually track changing situations at multiple levels of detail through their vision systems. We are specifically concerned with the question of how grounding can become shared through the use of external (symbolic) representations, such as natural language expressions.
Article
According to the motor theories of perception, the motor systems of an observer are actively involved in the perception of actions when these are performed by a demonstrator. In this paper we review our computational architecture, HAMMER (Hierarchical Attentive Multiple Models for Execution and Recognition), where the motor control systems of a robot are organised in a hierarchical, distributed manner, and can be used in the dual role of (a) competitively selecting and executing an action, and (b) perceiving it when performed by a demonstrator. We subsequently demonstrate that such an arrangement can provide a principled method for the top-down control of attention during action perception, resulting in significant performance gains. We assess these performance gains under a variety of resource allocation strategies.
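To make the resource-allocation aspect concrete, the sketch below caricatures the top-down policy in a few lines: each hypothesis requests the feature it needs to verify its prediction, and a limited per-cycle perception budget is granted in order of hypothesis confidence. The budget value and the request format are illustrative assumptions rather than details of the architecture.

```python
# Assumed request format: (hypothesis name, current confidence, requested feature).
# The perception budget (feature computations per cycle) is an illustrative value.

def allocate_attention(requests, budget=2):
    """Grant the requests of the most confident hypotheses, one feature each,
    until the per-cycle perception budget is exhausted."""
    ranked = sorted(requests, key=lambda r: r[1], reverse=True)
    granted = []
    for name, confidence, feature in ranked:
        if len(granted) == budget:
            break
        if feature not in (f for _, f in granted):   # compute each feature only once
            granted.append((name, feature))
    return granted


requests = [
    ("pick up cup",   0.7, "hand position"),
    ("wave",          0.2, "hand velocity"),
    ("point at book", 0.6, "head orientation"),
]
print(allocate_attention(requests))
# -> [('pick up cup', 'hand position'), ('point at book', 'head orientation')]
```

Hypotheses whose requests are not granted simply wait for a later cycle, which is how the limited sensor and computational resources end up concentrated on the currently most plausible interpretations.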