Towards Active Event Recognition∗

Dimitri Ognibene, Yiannis Demiris

Personal Robotics Lab, Imperial College London, UK

{d.ognibene,y.demiris}@imperial.ac.uk

Abstract

Directing robot attention to recognise activities and to anticipate events like goal-directed actions is a crucial skill for human-robot interaction. Unfortunately, issues like intrinsic time constraints, the spatially distributed nature of the entailed information sources, and the existence of a multitude of unobservable states affecting the system, like latent intentions, have long rendered achievement of such skills a rather elusive goal. The problem tests the limits of current attention control systems. It requires an integrated solution for tracking, exploration and recognition, which traditionally have been seen as separate problems in active vision. We propose a probabilistic generative framework based on information gain maximisation and a mixture of Kalman Filters that uses predictions in both recognition and attention control. This framework can efficiently use the observations of one element in a dynamic environment to provide information on other elements, and consequently enables guided exploration. Interestingly, the sensor control policy, directly derived from first principles, represents the intuitive trade-off between finding the most discriminative clues and maintaining overall awareness. Experiments on a simulated humanoid robot observing a human executing goal-oriented actions demonstrated improvement on recognition time and precision over baseline systems.

1 Introduction

Anticipating the evolution of the external environment, which may comprise other agents, is essential to prepare and produce effective action responses [Pezzulo and Ognibene, 2012; Pezzulo, 2008; Balkenius and Johansson, 2007]. In this context an autonomous agent has to face three main difficulties: 1) acquire a predictive model of the environment; 2) compute predictions in a limited time [Zilberstein and Russell, 1996]; 3) control its sensors to perceive the current state of the environment and predict its evolution.

Figure 1: Experimental setup: iCub looking at target objects and arm movements (bottom right). The top-left image shows the iCub's gaze following the hand. In the top-right, hand movements of the human are anticipated by the iCub gaze. Finally, the bottom-left image shows gaze focused on the target object before the hand reaches it. These three images are grabbed from the iCub camera during an interaction trial.

∗This research was funded by EFAA FP7-ICT-Project, Grant Agreement no: 270490. The authors want to thank Yan Wu, Miguel Sarabia, Kyuhwa Lee, Eris Chinellato, Helgi P. Helgason, Margarita Kotti, Harold Soh, Sotirios Chatzis and Giovanni Pezzulo for their suggestions and IJCAI reviewers for the helpful comments.

Active Perception for Anticipation. This paper focuses specifically on the third problem, which can be named ‘Active Perception for Anticipation’ (APA). It consists in finding effective sensor control strategies to gather the information necessary to feed the available predictive models. The APA problem is proposed as a new component of the active vision paradigm [Aloimonos et al., 1988; Ballard, 1991; Bajcsy, 1988; Suzuki and Floreano, 2008], which complements and integrates other components like active monitoring during action execution [Sridharan et al., 2010], object detection [Vijayanarasimhan and Kapoor, 2010; de Croon and Postma, 2007; Vogel and de Freitas, 2008], tracking [Sommerlade and Reid, 2008; Gould et al., 2007] and active object recognition [Paletta et al., 2005; Denzler and Brown, 2002]. These components exploit task-specific knowledge to improve perception performance and reduce computational cost, but active vision has also been shown to play an important role in improving learning performance [Andreopoulos and Tsotsos, 2013; Walther et al., 2005], even when the task is not known [Ognibene et al., 2010]. An example which can help in understanding the APA problem is the prediction of the motion of one element of interest (EI), say an asteroid, in an environment composed of multiple elements (e.g., an asteroid field) with known dynamics, i.e. ideal motion and bouncing, when the state of some elements is not precisely known and the sensors' receptive field is limited but controllable. A short-term prediction of the trajectory of the EI can easily be produced by just tracking it and passively detecting other elements, which can give rise to an interaction and change the motion of the EI (e.g., an impact), when they enter the field of view.

In most conditions longer-term predictions are desired to produce more effective responses. The reliability and temporal reach of the prediction (e.g., 1 minute without impacts at 99% probability) depend on the sensor control strategy. The current motion of the EI is a ‘dynamic context’ which can be used to actively find other elements that could produce an interaction. The agent can predictively direct its sensors along the trajectory of the EI and then explore the areas close to the trajectory. The system should also come back to track the EI regularly, otherwise it may lose the EI (e.g., due to an impact with a very fast undetected element). Thus an effective attentional strategy is necessary to produce longer and more reliable predictions.

When the interactions can take place over long ranges, as when attraction and repulsion forces are present, the motion of each element is part of the ‘dynamic context’, giving information related to the state and presence of the other relevant elements (e.g., if an asteroid is accelerating, the source of attraction can be found in the direction of the acceleration). At the same time, having more information on the elements in the environment can allow for more precise and longer predictions (e.g., predicting not only the impacts but the non-linear trajectories too, which in turn helps in predicting impacts). This increases the importance of an active sensor strategy that explores some candidate areas for the presence of long-range interacting elements.

Active Event Recognition. In this paper we are interested in an even more complex condition, where the actual interaction between elements (e.g., repulsion, attraction or just bouncing) depends on some hidden states of the elements themselves (e.g., electric charge). In this case accurate estimations of motion and position of some elements can contribute to unveiling the value of the hidden state of other elements. In this condition, an efficient active sensor-control policy has the additional role of selecting when and what elements to track to permit the estimations necessary to recognise the most informative elements (e.g., charged ones). We define Active Event Recognition (AER) as a sensor control strategy aimed at unveiling the class of the event, which is the value of the hidden states determining the subsequent evolution of the environment (see figure 2). AER is thus an essential part of the APA problem.

While AER, like all the previous conditions, shows the importance of an active control of sensors for effective predictions in dynamic environments, AER is of particular interest because it can be directly mapped to a broad set of conditions comprising goal-directed actions executed by other agents, e.g. a hand reaching a cup (see figure 1), two people walking towards each other, a car avoiding an obstacle, or a prey escaping from a predator. Anticipating such behaviourally-relevant¹ events still poses particularly demanding challenges in terms of the timely detection of relevant elements in an event and the recognition of the discriminative dynamics (e.g., motion of the hand). When the state of an element (e.g., prey) is uncertain, the ‘dynamic context’ (presence and dynamics of other elements, e.g., a running predator) can provide valuable information for prediction. In general, to anticipate these ‘non-local’ events, and to produce an AER, the observer must couple its sensing behaviours with the independent dynamics of the environment, which is as yet unknown to the observer. The solution to this chicken-and-egg problem demands a principled integration of tracking, exploration, search and recognition capabilities. However, these are perceived as separate problems in the active vision literature.

A principled integration requires the selection of the next sensor configuration (e.g., stop tracking the demonstrator and look for a graspable object) by merging, evaluating and exploiting different sources of information (e.g., noisy observations of hand movements and target positions). Solving this online is computationally complex and knowledge demanding. In particular, the evaluation of the informative contribution of the dynamic context may require the prediction of the distribution of the expected trajectories².

Related Works. Some previous attempts at a principled visual attention system, such as [Navalpakkam and Itti, 2006], which uses only local visual features, cannot direct attention to objects which are outside the visual field. Others, like Sprague and Ballard [2007], formalise the role of attention as subserving action execution, using independent models for the elements in the environment. Thus they cannot predict the elements' interactions, like others' actions, and employ them for action selection or attention allocation. Attention systems like [de Croon and Postma, 2007; Paletta et al., 2005; Denzler and Brown, 2002] utilise the contextual information connected to low-level visual features and suffer from limited generalisation capabilities and reduced applicability to dynamic environments.

Proposed Approach. To achieve the efficient spatio-temporal coupling between the agent's sensors and the environment, we propose a probabilistic generative framework based on a mixture of Kalman Filters (KFs). It exploits several KF properties, like fast analytical update and computation of entropy, to reduce the computational complexity and evaluate the dynamic context. The sensor-control policy is directly derived from the principle of maximisation of expected information gain. It explicitly attempts to reduce the uncertainty on the event which is currently taking place.

¹ Anticipation and prediction of others' goals is at the basis of most human-human and human-robot interactions [Demiris, 2007].

² In the case of agent actions, this corresponds to the online solution of the intention recognition problem [Baker et al., 2009; Ramirez and Geffner, 2011; Demiris, 2007]. This is a hard problem which can be further complicated by partial observability and exacerbated by the uncertainty of the environment.

Swift evaluation of the dynamic context can be achieved by making two assumptions. Firstly, events can be represented as linear dynamic systems³. Consequently, the state of the dynamic system implicitly characterises the expected kinematics. Secondly, a limited number of elements participate in each event. The expensive multidimensional optimisation for selecting the next sensor configuration can be approximated by choosing, from a set of candidates, the configuration that maximises the expected information gain for the event. Effective candidates can be built by reusing the predictions of the KF. This also allows the system to focus its sensors on positions outside of their current field of view and to select between visually similar elements, thus overcoming the limits of some other approaches like [Navalpakkam and Itti, 2006].

We apply this framework to the problem of directing the attention of an iCub humanoid robot to perceive a goal-directed action, an archetype of non-local event. To furnish the necessary models of events we follow [Demiris and Khadhouri, 2008] and the simulation theory of action perception, reusing the trajectory planner knowledge of the robot [Gori et al., 2012]. The transition function of each KF of the mixture is an instance of the same stable linear dynamic system that is used in the planner, but has a different attractor corresponding to a different target. Demiris and Khadhouri [2008] recognise actions by running in parallel the different models corresponding to the various action hypotheses. Their system controls attention through the direct competition between the models to access the information they need. It results in the selection of the elements which are necessary to control the currently winning action⁴. If the predictions of the different models for the selected elements are similar, then the observations may not be useful for the recognition of the event. Our system instead directs attention to directly discriminate between the different events using information gain maximisation.

Using a robot simulator and synthetic and recorded human goal-directed actions, we compare our framework with the approach we proposed in [Ognibene et al., 2012]. The latter extends [Demiris and Khadhouri, 2008] by integrating its predictive models with a separate KF for each element in the environment.

2 Active Event Recognition

In this section the AER problem is defined and a solution based on a mixture of KFs using Information Gain (AERIG) is described.

Problem definition. The graphical model in figure 2 displays the formulation of the problem. The discrete hidden stochastic variable $V$ represents the class of the event which is taking place, characterised by a different dynamic of the environment that the agent must predict and recognise. The environment is composed of a fixed set of elements $I = \{i = 1 \ldots n\}$ and thus its state $X^k$ at time $k$ is composed of the states $X^k_i$ of the different elements. For each value of $V$ the evolution of $X^k$ is determined by a different dynamic system with different independence conditions between the elements. At each time step the agent receives for each element $i$ an observation $O^k_i$ which depends on the current configuration of the sensors $\theta^k$. The states and observations are continuous variables.

Figure 2: Perception of a non-local event. The discrete variable $V$ represents the class of the event. The variables $X_i$ are the state variables of various elements in the environment. Their next state is determined by their previous state and by $V$. The dashed connections indicate that the connections are sparse and that an element $X_i$ is affected by an element $X_j$ with $i \neq j$ only for a limited set of values of $V$. Each element $i$ provides a different observation $O^k_i$ which depends on the current state of the element and on the sensor configuration $\Theta$.

³ While the coupling between an agent and its environment as a dynamic system has long been studied in various disciplines like artificial life and robotics [Nolfi and Floreano, 2000; Beer, 1995], it has not received attention in the field of action recognition, where it can deliver several computational advantages over normative methods [Baker et al., 2009] and is more parsimonious in terms of knowledge, allowing direct use without the extensive environment-based training necessary for models like [Bruce and Gordon, 2004].

⁴ This attention system also results in predictive gaze saccades toward the action target. The use of predictive models to control attention during action perception is supported by some experimental results like [Flanagan and Johansson, 2003].

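At every time step the goal of the system is to select the configuration $\hat{\theta}^k$ that will minimise the expected uncertainty over $V$ (quantified by the entropy $H$):

$$\hat{\theta}^k = \operatorname*{argmin}_{\theta^k} \int p(o^k \mid \theta^k)\, H(V \mid o^k, \theta^k)\, do^k \qquad (1)$$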

Proposed solution. For the recognition of the event and for the selection of the sensor configuration it is necessary to compute the posterior $P(v \mid o^k; \theta^k)$. Given a prior distribution $P(v, x^k_{1:N}) = P(x^k_{1:N} \mid v)\,P(v)$ and the independence of the observed event from the sensor configuration, $P(v \mid \theta) = P(v)$, the update expression of the posterior $P(v \mid o^{k+1}, \theta^{k+1})$ can be derived through Bayes' rule:

$$P(v \mid o^{k+1}, \theta^{k+1}) = \frac{P(o^{k+1} \mid v, \theta^{k+1})\, P(v)}{P(o^{k+1} \mid \theta^{k+1})} \qquad (2)$$
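As a rough illustration of eq. 2, the sketch below updates the probability of each event hypothesis from one observation. It is not the authors' implementation: the function names are hypothetical, the per-hypothesis predicted observations are given directly rather than produced by Kalman Filters, and the covariances are assumed diagonal to keep the example in plain Python.

```python
import math

def gaussian_pdf(o, mean, var):
    """Gaussian density with diagonal covariance (one variance per axis)."""
    log_p = 0.0
    for x, m, s2 in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * s2) + (x - m) ** 2 / s2)
    return math.exp(log_p)

def update_posterior(prior, predictions, observation):
    """Eq. 2: P(v|o) = P(o|v) P(v) / P(o), with P(o) = sum over v of P(o|v) P(v)."""
    joint = [p * gaussian_pdf(observation, mean, var)
             for p, (mean, var) in zip(prior, predictions)]
    evidence = sum(joint)  # P(o | theta), the denominator of eq. 2
    return [j / evidence for j in joint]

# Two hypothetical events predicting the hand at different positions (assumed values).
prior = [0.5, 0.5]
predictions = [((0.0, 0.0), (0.04, 0.04)),   # (mean, variances) under v = 1
               ((0.3, 0.0), (0.04, 0.04))]   # (mean, variances) under v = 2
posterior = update_posterior(prior, predictions, observation=(0.05, 0.0))
print(posterior)
```

The observation lies closer to the prediction of the first hypothesis, so its posterior probability increases while the two values still sum to one.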

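The computation of eq. 1 and eq. 2 in the general case can be computationally prohibitive. The proposed solution is based on the assumption that, once $v$ is fixed, the dynamics are linear and the probability distributions are normal. This enables the use of a mixture of KFs with a distinct KF for each value of $v$. Denoting with $\bar{o}^{k+1}_{v,\theta}$ the mean expected observation and with $S^{k+1}_{v,\theta}$ its covariance matrix, both of which are conditioned on $v$ and on $\theta = \theta^{k+1}$ and computed during the KF update, the following can be derived:

$$\hat{\theta}^{k+1} = \operatorname*{argmin}_{\theta^{k+1}} \sum_{v} P(v) \left[ \frac{1}{2} \ln \left| S^{k+1}_{v,\theta} \right| + \int \mathcal{N}\!\left(o;\, \bar{o}^{k+1}_{v,\theta},\, S^{k+1}_{v,\theta}\right) \ln\!\big(P(o \mid \theta^{k+1})\big)\, do \right] \qquad (3)$$

where $|S|$ is the determinant of a matrix $S$ and $\mathcal{N}(o; \bar{o}, S)$ is the normal density with mean $\bar{o}$ and covariance $S$. The first-order Taylor expansion of $P(o \mid \theta)$ at point $\bar{o}^{k+1}_{v}$ results in:

$$\hat{\theta}^{k+1} \approx \operatorname*{argmin}_{\theta} \sum_{v} P(v) \left[ \frac{1}{2} \ln \left| S^{k+1}_{v,\theta} \right| + \ln \sum_{v'} P(v')\, \mathcal{N}\!\left(\bar{o}^{k+1}_{v,\theta};\, \bar{o}^{k+1}_{v',\theta},\, S^{k+1}_{v',\theta}\right) \right] \qquad (4)$$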

The first term of eq. 4 is the average of the expected entropy of the observations for each model $v$ at the new observation point. The second term is a measure of how much, in the new sensor configuration, the observations predicted by a model will differ from those predicted by other models. Thus, the proposed formulation integrates a trade-off between discriminating the event hypotheses and maintaining their perception quality.
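To make the two terms of eq. 4 concrete, here is a minimal sketch that evaluates the approximate objective for a single candidate configuration. The function names and numeric values are assumptions for illustration, and the covariances are taken as diagonal so that the log-determinant is a sum of log-variances.

```python
import math

def gaussian_pdf(o, mean, var):
    """Gaussian density with diagonal covariance (one variance per axis)."""
    log_p = 0.0
    for x, m, s2 in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * s2) + (x - m) ** 2 / s2)
    return math.exp(log_p)

def eq4_objective(prior, predictions):
    """Approximate objective of eq. 4 for one candidate sensor configuration.

    predictions[v] = (mean, var): predicted observation under hypothesis v.
    """
    total = 0.0
    for p_v, (mean_v, var_v) in zip(prior, predictions):
        # First term: half the log-determinant of a diagonal covariance.
        half_log_det = 0.5 * sum(math.log(s2) for s2 in var_v)
        # Second term: log of the mixture density evaluated at this model's mean.
        mixture_at_mean = sum(p_w * gaussian_pdf(mean_v, mean_w, var_w)
                              for p_w, (mean_w, var_w) in zip(prior, predictions))
        total += p_v * (half_log_det + math.log(mixture_at_mean))
    return total

prior = [0.5, 0.5]
var = (0.01, 0.01)
overlapping = [((0.0, 0.0), var), ((0.0, 0.0), var)]  # models predict the same observation
separated = [((0.0, 0.0), var), ((1.0, 0.0), var)]    # models predict distinct observations
score_overlapping = eq4_objective(prior, overlapping)
score_separated = eq4_objective(prior, separated)
print(score_overlapping, score_separated)
```

With equal noise in both cases, the configuration in which the hypotheses predict clearly different observations scores lower and would therefore be preferred by the argmin. For two equiprobable, well-separated hypotheses the gap approaches $\ln 2$, the entropy of the event variable.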

We will now use the assumption that each event (each value of $v$) involves only a limited subset of the elements of the environment, the set $I_v \subset I$. The dynamics of these elements will be coupled: formally, their transition probabilities depend on the state of the other elements of $I_v$, that is $P(x^k_i \mid x^{k-1}_{1:N}, v) \equiv P(x^k_i \mid x^{k-1}_{I_v}, v),\ i \in I_v$. The dynamics of those elements which do not participate in the event will be independent of each other, $P(x^k_i \mid x^{k-1}_{1:N}, v) \equiv P(x^k_i \mid x^{k-1}_i),\ i \notin I_v$. This allows decomposition of each KF corresponding to a value of $v$ into a set of KFs of lower dimension. One of the KFs will model the coupled dynamics of the elements in $I_v$, while the dynamics of each element $i \notin I_v$ will be modelled by an independent KF. This decomposition delivers two advantages: 1) the computational cost of updating a KF is cubic in the dimension of the state space, so updating a set of lower-dimensional KFs is more efficient than updating a single high-dimensional one; 2) the KFs for the elements which do not belong to $I_v$ are independent of $v$, which permits the results of their update to be reused across hypotheses.

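The KF decomposition can be utilised for efficient computation of eq. 4. The determinant in the first term can be computed as the product of the determinant of the covariance matrix $\tilde{S}^{k+1}_{v,\theta}$ of the observation distribution predicted by the KF of $I_v$ and the determinants of the matrices $\dot{S}^{k+1}_{i,\theta}$ from the KFs of the other elements, so that its logarithm becomes a sum:

$$\ln \left| S^{k+1}_{v,\theta} \right| = \ln \left| \tilde{S}^{k+1}_{v,\theta} \right| + \sum_{i \in I - I_v} \ln \left| \dot{S}^{k+1}_{i,\theta} \right| \qquad (5)$$

The second term of eq. 4 can also be decomposed in a similar way, leading to reduced computational complexity.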
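Eq. 5 relies on the fact that the determinant of a block-diagonal matrix is the product of the determinants of its blocks. The sketch below checks this numerically in plain Python. The covariance values are made up for illustration, and the block-diagonal structure of the full observation covariance is an assumption that follows from the independence conditions stated above.

```python
import math

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def block_diag(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                out[offset + i][offset + j] = value
        offset += len(b)
    return out

# Hypothetical covariances: one coupled block for I_v, one block per other element.
coupled = [[0.5, 0.1, 0.0], [0.1, 0.4, 0.05], [0.0, 0.05, 0.3]]
independent = [[[0.2, 0.0], [0.0, 0.2]], [[0.3, 0.02], [0.02, 0.25]]]

full = block_diag([coupled] + independent)
lhs = math.log(det(full))                                                   # ln|S|
rhs = math.log(det(coupled)) + sum(math.log(det(s)) for s in independent)   # right-hand side of eq. 5
print(lhs, rhs)
```

The same example hints at the cost argument: an operation that is cubic in the dimension costs on the order of $7^3 = 343$ units on the full 7-dimensional matrix, against $3^3 + 2^3 + 2^3 = 43$ on the separate blocks.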

Solving eq. 4 directly, even after decomposition, can still be complex. Instead, a finite set $\hat{\Theta}$ of candidate sensor configurations is defined and the one with the highest expected information gain is selected. In this work, the elements of $\hat{\Theta}$ are the configurations which will minimise the noise on any of the elements according to any of the possible hypotheses. E.g., $\hat{\Theta}$ can be the set of different camera configurations focusing on the predicted positions of the different objects according to the different possible events.
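The candidate-based selection described above can be sketched as follows. All names here are hypothetical. The objective passed to the selector is a stand-in that only sums log-variances under the distance-dependent noise model of Section 3 ($\sigma = 0.1 + 0.2 \times d$, read here as a standard deviation with $d$ in unspecified units, both assumptions); in the full method the objective would be eq. 4.

```python
import math

def noise_sigma(distance):
    """Distance-dependent noise, sigma = 0.1 + 0.2 * d (assumed to be a standard deviation)."""
    return 0.1 + 0.2 * distance

def build_candidates(predicted_positions):
    """One gaze point per predicted element position, over all hypotheses, without duplicates."""
    candidates = []
    for per_hypothesis in predicted_positions.values():
        for position in per_hypothesis.values():
            if position not in candidates:
                candidates.append(position)
    return candidates

def select_configuration(candidates, objective):
    """Pick the candidate with the lowest objective value (highest expected gain)."""
    return min(candidates, key=objective)

# Hypothetical predictions: each event couples the hand with a different object.
predicted_positions = {
    1: {"hand": (0.0, 0.0), "object_1": (0.3, 0.0)},
    2: {"hand": (0.0, 0.0), "object_2": (0.0, 0.3)},
}
elements = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3)]

def stand_in_objective(gaze):
    # Stand-in for eq. 4: total log-variance of all elements seen from this gaze point.
    return sum(math.log(noise_sigma(math.dist(gaze, e)) ** 2) for e in elements)

candidates = build_candidates(predicted_positions)
best = select_configuration(candidates, stand_in_objective)
print(candidates, best)
```

Restricting the search to a handful of predicted positions replaces a continuous optimisation over gaze directions with a short enumeration, at the price of never selecting a configuration that no hypothesis predicts.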

3 Experimental Evaluation: Active Action Recognition

The framework proposed in the previous section, AERIG, has been applied to control the gaze of an iCub humanoid robot that has to anticipate the target of a reaching action performed by a human partner (see figure 1). In the tests the robot showed human-like attentional behaviour while observing human actions, switching its attention from tracking the hand to the anticipated target (see videos at http://www.imperial.ac.uk/personalrobotics).

The most important contribution of the approach is the predictive recognition of the event and the temporal advantage for action preparation it permits. However, due to human reaction delays, it is complex to assess the temporal performance of the approach with a real robot. In fact, during the tests, unlike during normal operations, it is important to compare synchronously the timings of the performer's intention and action with the evolution of the robot's estimations. Simulation results are therefore reported instead, with different versions of the sensor model: a fixed camera (NO Visual Attention Controller, NOVAC) randomly placed near one of the elements, an ideal camera with instantaneous and precise gazing movements (IDEALVAC), and a simulated camera controller reproducing the speed and trajectories of the real iCub camera controller [Pattacini, 2010] (REALVAC).

The robot's perception of the position of an element (an object or the other agent's hand) is considered affected by Gaussian noise whose variance is a linear function of the distance $d$ between the gaze focus point and the element ($\sigma = 0.1 + 0.2 \times d$)⁵. This observation model is a simplification of human vision, which has much higher resolution in the centre [Findlay and Gilchrist, 2003]. The system was tested both on synthetic data, generated by the same control model used by the robot, and on natural human reaching actions recorded with the Kinect. AERIG is compared to a heuristic attention controller proposed in [Ognibene et al., 2012] (HEURISTIC). The heuristic balances element position uncertainty with event uncertainty by selecting as the next gaze target the element with the highest product between its position entropy and the confidence of the event in which it takes part⁶.

Figure 3: Evolution of the average success rate on simulated data with 4 objects. [Plot: correctness ratio (1000 runs) against time steps (1 step = 0.1 sec); curves for the AERIG and HEURISTIC agents under Ideal VAC, Real VAC and No VAC sensor control, at different distances (cm).]

⁵ An object-specific factor can be applied to reflect its intrinsic difficulty in recognition and localisation. This observation noise model is correlated with the state, thus the optimality properties of the Kalman filter cannot be guaranteed.

⁶ The effector position is estimated using a KF and the same motion model used by AERIG [Gori et al., 2012]. The target position needed by the motion model is estimated by an independent KF whose model assumes the target to be still; this does not allow the motion of the effector to be used to correct the target position. The confidence $c$ is updated at every time step using the prediction error $e^k_h$ on the hand position: $c^{k+1} = c^k + (1 + e^{k+1}_h)^{-1}$.

In this task, the environment ($I$) is composed of the hand of the human performer ($h$) and of a fixed number $N_e$ of graspable objects ($i = 1 \ldots N_e$). Each event $v$ corresponds to a hand movement to reach a different object in the environment (thus the domain of $V$ is $\{1 \ldots N_e\}$). The state $X^k_i$ of an object $i$ represents its position, while the state $X^k_h$ of the hand consists of both speed and position. The set of coupled elements $I_v$ is composed of the hand and the object with index $i \equiv v$. The KF has the same transition matrix which is used to control a reaching action execution, thus complying with the simulation theory of action perception [Demiris, 2007; Dindo et al., 2011]. The equation employed by a given model $v$ to compute the next effector position $x^{k+1}_{p,v}$, when the target is at position $x_v$, is:

$$x^{k+1}_{p,v} = x^k_{h,v} + \tau \left\{ \dot{x}^k_{h,v} + \tau \left[ K \left( x_v - x^k_{h,v} \right) - D\, \dot{x}^k_{h,v} \right] \right\} \qquad (6)$$

$K$ and $D$ are the usual gains of a PD controller, while $\tau$ is the related time integration parameter. They are set to the same default values used for action generation: $K = 1.5$, $D = 3$ and $\tau = 0.16$; $\tau$ can be modulated to model different motion speeds. The Kalman filter associated with each single element assumes that the element is still (thus the transition matrix is the identity), with its position affected only by zero-mean Gaussian noise with variance 0.0025. The actual system is open source and available online at http://www.imperial.ac.uk/personalrobotics.
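As a minimal sketch of the reaching model, the code below iterates the position update of eq. 6 in one dimension with the reported values $K = 1.5$, $D = 3$ and $\tau = 0.16$. The text gives only the position update; the velocity update used here (velocity plus $\tau$ times the PD acceleration) is an assumption chosen to be consistent with eq. 6, and the start and target positions are made up.

```python
K, D, TAU = 1.5, 3.0, 0.16  # gains and integration step reported in the text

def step(position, velocity, target, k=K, d=D, tau=TAU):
    """One update of the hand state under the PD attractor towards `target`."""
    acceleration = k * (target - position) - d * velocity
    next_velocity = velocity + tau * acceleration                     # assumed velocity update
    next_position = position + tau * (velocity + tau * acceleration)  # position update of eq. 6
    return next_position, next_velocity

# Hypothetical 1-D reach from 0.0 towards a target at 0.4.
position, velocity, target = 0.0, 0.0, 0.4
trajectory = [position]
for _ in range(200):
    position, velocity = step(position, velocity, target)
    trajectory.append(position)

print(round(trajectory[-1], 4))
```

Because the update is linear in position and velocity, each hypothesis $v$ differs only in the attractor `target`, which is what lets a bank of Kalman Filters share one transition structure.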

3.1 Results

Figure 3 reports the evolution of the “success rate”, i.e. how often the currently estimated event is actually the correct one, with simulated effector trajectories in an environment with 4 objects at different distances from each other. The average duration of a trial depends on the distance between the objects. In each condition 1000 runs were executed. The figure shows significant improvement (up to two times better) when AERIG is adopted, delivering more accurate and faster recognition in all conditions, especially when the task is more difficult. It also shows the crucial role of active sensor control. Similar results were observed with a varying number of objects, thus showing that AERIG can scale with the number of elements. On the computational side, on a conventional 2010 PC (2.8 GHz), AERIG takes 0.4 ms per frame with 16 targets.

Figure 4 displays the average evolution of the attentional behaviour of the AERIG and HEURISTIC agents with the real gaze controller. The graph shows that neither system focuses only on the effector; both continuously alternate between it and the targets. Thus AERIG, which uses the effector motion information to correct the estimated position of the target, can more precisely direct its saccades towards the target and thus improve the overall prediction. The AERIG system explores the environment and does not focus only on the current most probable target and effector position. Subsequently it increases the saccades towards those elements and in particular the target. The HEURISTIC model instead focuses only on the elements of the most probable event.

Figure 4: Evolution of the average gaze focus during a trial with REALVAC and 30 cm of distance between the targets. It shows only the saccades directed towards the elements of the currently most probable action. [Plot: ratio per trial (1000 runs) against time steps; gaze focus on the predicted performer position and on the predicted action target, for the AERIG and HEURISTIC agents.]

Figure 5 displays the results of AERIG when applied to human reaching actions recorded with the Kinect. The 3D positions of the right hand of the user skeleton, extracted using the OpenNI [Ope, 2010] and NITE [Pri, 2010] libraries, were recorded during 24 reaching actions towards 4 objects positioned on a circle of radius 15 cm (see figure 6). The actions, with an average duration of 1 sec, were performed naturally from different starting positions at an average distance of 40 cm from the circle centre. At each time step, the actual position perceived by the robot was affected by noise depending on the current gaze position in the same way as in the previous experiment. The recorded trajectories are clearly not straight, unlike those predicted by the models used.

Figure 5: Evolution of the average success rate on recorded human action with 4 objects at a distance of 15 cm. [Plot: success ratio (100 runs) against time steps (1 step = 0.06 sec); curves for the AERIG and HEURISTIC agents under Ideal VAC and No VAC sensor control.]

Figure 6: Samples of recorded reaching action trajectories (2D projection) towards two different objects (circles). [Plot axes: x (mm), y (mm).]

These data show that AERIG delivers improvement in event recognition even when its knowledge is noisy and does not match the real dynamics of the events. While both AERIG and the heuristic model use the same action model, and thus suffer from similar problems with the recorded data, AERIG is more robust because it is sensitive to uncertainty when selecting the actual event. We also hypothesise that the robustness of AERIG is partially due to the use of stable dynamic systems to model the interactions in the environment. The trajectories are also affected by noise, which is especially evident near the target objects. This noise cannot be reduced by the active control of the camera, which helps explain the shape of the performance curves in Figure 5. Figure 7 displays the performance of the systems when the robot sensor has a limited field of view (radius of 40 cm around the gaze focus) and reduced noise ($\sigma = 0.03 + 0.06 \times d$), with four objects at a distance of 30 cm. This result, while preliminary, shows that only AERIG with an active camera can provide useful event recognition⁷. This is due to the use of the effector motion information, which is used to adjust the initial random guesses of the target position, which in the other conditions were based on peripheral vision. These adjustments are subsequently used to confirm the presence of the target with direct gazes.

4 Conclusions

This paper introduces AERIG, a probabilistic framework for the active perception of multi-element dynamic events, like goal-oriented actions. It addresses the requirements of the APA problem, which consists of finding effective sensor control strategies to gather the information necessary to predict the evolution of the environment. AERIG actively directs sensors to the most discriminative configurations to uncover the actual dynamics of the environment. This leads to the spatio-temporal coupling of the sensors with the environment dynamics.

Figure 7: Evolution of the average success rate of the system with limited field of view. [Plot: success ratio (100 runs) against time steps (1 step = 0.1 sec); curves for the AERIG and HEURISTIC agents under Ideal VAC and No VAC sensor control.]

⁷ The policy used by the systems to manage missing observations was: a) if the element was predicted to be observed, then generate a random observation outside of the field of view and reduce the probability of the event to the minimum event probability; b) otherwise generate a noisy observation of the element centred on the predicted position.

While actively recognising a complex dynamic context [Rothkopf et al., 2007] has crucial behavioural importance, this problem has not been sufficiently addressed in the attention literature. In fact, most attention models assume artificial laboratory conditions and offer limited explanations of task-oriented attention in dynamic environments [Tatler et al., 2011]. Thus, it would be intriguing to compare the predictions of the proposed framework with actual human behaviour in natural conditions. AERIG applied to the recognition of goal-oriented actions represents a sound and robust implementation of the simulation theory of action perception [Demiris, 2007]. While some theories [Flanagan and Johansson, 2003] support the reuse of the same attention control schemas used during action execution, our approach focuses on the direct reduction of uncertainty during the perception of the action. In this task the tests showed substantial performance improvement over baseline approaches. Interestingly, the framework allows a mathematical formulation of the intuitive trade-off between a conservative behaviour, which tries to observe all the events and elements, and a greedy one that is only interested in determining the real event which is taking place.

In this work, we tested our approach only on stable, approximately linear systems with a limited number of elements, and used only position information. In the future, we plan to extend the framework to use more complex models and to integrate it with the overall agent control architecture.

References

[Aloimonos et al., 1988] J. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision. International Journal of Computer Vision, 1(4):333–356, January 1988.

[Andreopoulos and Tsotsos, 2013] A. Andreopoulos and J. K. Tsotsos. A computational learning theory of active object recognition under uncertainty. International Journal of Computer Vision, 101(1):95–142, 2013.

[Bajcsy, 1988] R. Bajcsy. Active perception. Proceedings of the IEEE, Special issue on Computer Vision, 76(8):966–1005, 1988.

[Baker et al., 2009] C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning. Cognition, 2009.

[Balkenius and Johansson, 2007] C. Balkenius and B. Johansson. Anticipatory models in gaze control: a developmental model. Cognitive Processing, 8(3):167–174, 2007.

[Ballard, 1991] D. H. Ballard. Animate vision. Artificial Intelligence, 48:57–86, 1991.

[Beer, 1995] R. D. Beer. A dynamical systems perspective on agent-environment interaction. Artificial Intelligence, 72:173–215, 1995.

[Bruce and Gordon, 2004] A. Bruce and G. Gordon. Better motion prediction for people-tracking. In Proc. of the 2004 IEEE International Conference on Robotics and Automation, New Orleans, LA, USA, May 2004.

[de Croon and Postma, 2007] G. de Croon and E. O. Postma. Sensory-motor coordination in object detection. In Proceedings of the IEEE Symposium on Artificial Life, ALIFE'07, pages 147–154, 2007.

[Demiris and Khadhouri, 2008] Y. Demiris and B. Khadhouri. Content-based control of goal-directed attention during human action perception. Interaction Studies, 9(2):353–376, 2008.

[Demiris, 2007] Y. Demiris. Prediction of intent in robotics and multi-agent systems. Cognitive Processing, 8(3):151–158, 2007.

[Denzler and Brown, 2002] J. Denzler and C. M. Brown. Information theoretic sensor data selection for active object recognition and state estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):145–157, 2002.

[Dindo et al., 2011] H. Dindo, D. Zambuto, and G. Pezzulo. Motor simulation via coupled internal models using sequential Monte Carlo. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, July 2011.

[Findlay and Gilchrist, 2003] J. M. Findlay and I. D. Gilchrist. Active vision: The psychology of looking and seeing. Oxford University Press, 2003.

[Flanagan and Johansson, 2003] J. R. Flanagan and R. S. Johansson. Action plans used in action observation. Nature, 424(6950):769–771, 2003.

[Gori et al., 2012] I. Gori, U. Pattacini, F. Nori, G. Metta, and G. Sandini. DForC: a real-time method for reaching, tracking and obstacle avoidance in humanoid robots. In IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS2012), Osaka, Japan, November 29 - December 1, 2012.

[Gould et al., 2007] S. Gould, J. Arfvidsson, A. Kaehler, B. Sapp, M. Messner, G. Bradski, P. Baumstarck, S. Chung, and A. Y. Ng. Peripheral-foveal vision for real-time object recognition and tracking in video. In Proceedings of IJCAI-07, 2007.

[Navalpakkam and Itti, 2006] V. Navalpakkam and L. Itti. An integrated model of top-down and bottom-up attention for optimal object detection. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

[Nolfi and Floreano, 2000] S. Nolfi and D. Floreano. Evolutionary Robotics: The Biology, Intelligence, and Technology. MIT Press, Cambridge, MA, USA, 2000.

[Ognibene et al., 2010] D. Ognibene, G. Pezzulo, and G. Baldassarre. How can bottom-up information shape learning of top-down attention control skills? In Proceedings of 9th International Conference on Development and Learning, 2010.

[Ognibene et al., 2012] D. Ognibene, E. Chinellato, M. Sarabia, and Y. Demiris. Towards contextual action recognition and target localization with active allocation of attention. In Proc. of First Int. Conf. on Living Machines, pages 192–203, 2012.

[Ope, 2010] OpenNI organization. OpenNI User Guide, November 2010. Last viewed 19-01-2011 11:32.

[Paletta et al., 2005] L. Paletta, G. Fritz, and C. Seifert. Cascaded sequential attention for object recognition with informative local descriptors and Q-learning of grouping strategies. In Proc. of Computer Vision and Pattern Recognition (CVPR), 2005.

[Pattacini, 2010] U. Pattacini. Modular Cartesian Controllers for Humanoid Robots: Design and Implementation on the iCub. PhD thesis, RBCS, Istituto Italiano di Tecnologia, Genoa, 2010.

[Pezzulo and Ognibene, 2012] G. Pezzulo and D. Ognibene. Proactive action preparation: Seeing action preparation as a continuous and proactive process. Motor Control, 16(3):386–424, 2012.

[Pezzulo, 2008] G. Pezzulo. Coordinating with the future: the anticipatory nature of representation. Minds and Machines, 18(2):179–225, 2008.

[Pri, 2010] PrimeSense Inc. Prime Sensor NITE 1.3 Algorithms notes, 2010. Last viewed 19-01-2011 15:34.

[Ramirez and Geffner, 2011] M. Ramirez and H. Geffner. Goal recognition over POMDPs: Inferring the intention of a POMDP agent. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 2011.

[Rothkopf et al., 2007] C. A. Rothkopf, D. H. Ballard, and M. M. Hayhoe. Task and context determine where you look. Journal of Vision, 7(14):1610–1620, 2007.

[Sommerlade and Reid, 2008] R. Sommerlade and I. Reid. Information theoretic active scene exploration. In Proceedings of 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, June 2008.

[Sprague et al., 2007] N. Sprague, D. Ballard, and A. Robinson. Modeling embodied visual behaviors. ACM Trans. Appl. Percept., 4(2):11, 2007.

[Sridharan et al., 2010] M. Sridharan, J. Wyatt, and R. Dearden. Planning to see: A hierarchical approach to planning visual actions on a robot using POMDPs. Artificial Intelligence, 174(11):704–725, 2010.

[Suzuki and Floreano, 2008] M. Suzuki and D. Floreano. Enactive robot vision. Adaptive Behavior, 16(2-3):122–128, 2008.

[Tatler et al., 2011] B. W. Tatler, M. M. Hayhoe, M. F. Land, and D. Ballard. Eye guidance in natural vision: Reinterpreting salience. Journal of Vision, 11(5):1–23, 2011.

[Vijayanarasimhan and Kapoor, 2010] S. Vijayanarasimhan and A. Kapoor. Visual recognition and detection under bounded computational resources. In Proc. of Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1006–1013, June 2010.

[Vogel and de Freitas, 2008] J. Vogel and N. de Freitas. Target-directed attention: Sequential decision-making for gaze planning. In The International Conference of Robotics and Automation (ICRA), pages 2372–2379, 2008.

[Walther et al., 2005] D. Walther, U. Rutishauser, C. Koch, and P. Perona. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding, 100(1-2):41–63, October 2005.

[Zilberstein and Russell, 1996] S. Zilberstein and S. Russell. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181–213, 1996.
