Towards Active Event Recognition∗
Dimitri Ognibene, Yiannis Demiris
Personal Robotics Lab, Imperial College London, UK
{d.ognibene,y.demiris}@imperial.ac.uk
Abstract
Directing robot attention to recognise activities and
to anticipate events like goal-directed actions is a
crucial skill for human-robot interaction. Unfor-
tunately, issues like intrinsic time constraints, the
spatially distributed nature of the entailed infor-
mation sources, and the existence of a multitude
of unobservable states affecting the system, like
latent intentions, have long rendered achievement
of such skills a rather elusive goal. The problem
tests the limits of current attention control systems.
It requires an integrated solution for tracking, ex-
ploration and recognition, which traditionally have
been seen as separate problems in active vision. We
propose a probabilistic generative framework based
on information gain maximisation and a mixture of
Kalman Filters that uses predictions in both recog-
nition and attention-control. This framework can
efficiently use the observations of one element in
a dynamic environment to provide information on
other elements, and consequently enables guided
exploration. Interestingly, the sensor control policy, directly derived from first principles, represents the intuitive trade-off between finding the
most discriminative clues and maintaining overall
awareness. Experiments on a simulated humanoid
robot observing a human executing goal-oriented
actions demonstrated improvement in recognition
time and precision over baseline systems.
1 Introduction
Anticipating the evolution of the external environment, which
may comprise other agents, is essential to prepare and pro-
duce effective action responses [Pezzulo and Ognibene, 2012;
Pezzulo, 2008; Balkenius and Johansson, 2007]. In this con-
text an autonomous agent has to face three main difficulties:
1) acquire a predictive model of the environment; 2) compute predictions in a limited time [Zilberstein and Russell, 1996]; 3) control its sensors to perceive the current state of the environment and predict its evolution.
∗This research was funded by EFAA FP7-ICT-Project, Grant Agreement no: 270490. The authors want to thank Yan Wu, Miguel Sarabia, Kyuhwa Lee, Eris Chinellato, Helgi P. Helgason, Margarita Kotti, Harold Soh, Sotirios Chatzis and Giovanni Pezzulo for their suggestions and IJCAI reviewers for the helpful comments.
Figure 1: Experimental setup: iCub looking at target objects and arm movements (bottom right). The top-left image shows the iCub’s gaze following the hand. In the top-right image, hand movements of the human are anticipated by the iCub gaze. Finally, the bottom-left image shows gaze focused on the target object before the hand reaches it. These three images are grabbed from the iCub camera during an interaction trial.
Active Perception for Anticipation. This paper focuses
specifically on the third problem, which can be named ‘Ac-
tive Perception for Anticipation’ (APA). It consists in finding
effective sensor control strategies to gather the information
necessary to feed the available predictive models. The APA
problem is proposed as a new component of the active vi-
sion paradigm [Aloimonos et al., 1988; Ballard, 1991; Ba-
jcsy, 1988; Suzuki and Floreano, 2008], which complements
and integrates other components like active monitoring dur-
ing action execution [Sridharan et al., 2010], object detection
[Vijayanarasimhan and Kapoor, 2010; de Croon and Postma,
2007; Vogel and de Freitas, 2008], tracking [Sommerlade and
Reid, 2008; Gould et al., 2007] and active object recognition
[Paletta et al., 2005; Denzler and Brown, 2002]. These com-
ponents exploit task-specific knowledge to improve percep-
tion performance and reduce computational cost, but active
vision has also been shown to play an important role in improving learning performance [Andreopoulos and Tsotsos, 2013; Walther et al., 2005] even when the task is not known [Ognibene et al., 2010]. An example, which can help in understanding the APA problem, is the prediction of the motion of one element of interest (EI), say an asteroid, in an environment composed of multiple elements (e.g., an asteroid field) with known dynamics, i.e. ideal motion and bouncing, when the state of some elements is not precisely known and the sensors’ receptive field is limited but controllable. A short-term prediction of the trajectory of the EI can easily be produced by just tracking it and passively detecting other elements, which can give rise to an interaction and change the motion of the EI (e.g., an impact), when they enter the field of view.
In most conditions longer term predictions are desired to
produce more effective responses. The reliability and tempo-
ral reach of the prediction (e.g., 1 minute without impacts at
99% probability) depends on the sensor control strategy. The
current motion of the EI is a ‘dynamic context’ which can
be used to actively find other elements that could produce
an interaction. The agent can predictively direct its sensors
along the trajectory of the EI and successively explore the
areas close to the trajectory. The system should also come
back to track the EI regularly, otherwise it may lose the EI
(e.g., due to an impact with a very fast undetected element).
Thus an effective attentional strategy is necessary to produce
longer and more reliable predictions.
When the interactions can take place over long ranges, as
when attraction and repulsion forces are present, the motion
of each element is part of the ‘dynamic context’ giving infor-
mation related to the state and presence of the other relevant
elements (e.g., if an asteroid is accelerating the source of at-
traction can be found in the direction of the acceleration). Si-
multaneously, having more information on the elements in the
environment can allow for more precise and longer predic-
tions (e.g., predicting not only the impacts but the non-linear
trajectories too, which in turn helps to predict impacts). This
increases the importance of an active sensor strategy that ex-
plores some candidate areas for the presence of long-range
interacting elements.
Active Event Recognition. In this paper we are interested in
an even more complex condition, where the actual interaction
between elements (e.g., repulsion, attraction or just bouncing)
depends on some hidden states of the elements themselves
(e.g., electric charge). In this case accurate estimations of
motion and position of some elements can contribute to unveiling the value of the hidden state of other elements. In this condition, an efficient active sensor-control policy has the additional role of selecting when and what elements to track to permit the estimations necessary to recognise the most informative elements (e.g., charged ones). We define Active Event Recognition (AER) as a sensor control strategy aimed at unveiling the class of the event, which is the value of the hidden states determining the successive evolution of the environment (see figure 2). AER is thus an essential part of the APA problem.
While AER, like all the previous conditions, shows the im-
portance of an active control of sensors for effective predic-
tions in dynamic environments, AER is of particular interest
because it can be directly mapped to a broad set of conditions comprising goal-directed actions executed by other agents, e.g. a hand reaching a cup (see figure 1), two people walking towards each other, a car avoiding an obstacle, or a prey escaping from a predator. Anticipating such behaviourally-relevant¹ events still poses particularly demanding challenges in terms of the timely detection of relevant elements in an event and the recognition of the discriminative dynamics (e.g., motion of the hand). When the state of an element (e.g., prey) is uncertain, the ‘dynamic context’ (presence and dynamics of other elements, e.g., a running predator) can provide valuable information for prediction. In general, to anticipate these ‘non-local’ events, and to produce an AER, the observer must couple its sensing behaviours with the independent dynamics of the environment, which is as yet unknown to the observer. The solution to this chicken-and-egg problem demands a principled integration of tracking, exploration, search and recognition capabilities. However, these are perceived as separate problems in the active vision literature.
A principled integration requires the selection of the next
sensor configuration (e.g., stop tracking the demonstrator and
look for a graspable object) by merging, evaluating and exploiting different sources of information (e.g., noisy observations of hand movements and target positions). Solving this online is computationally complex and knowledge demanding. In particular the evaluation of the informative contribution of the dynamic context may require the prediction of the distribution of the expected trajectories².
Related Works. Some previous attempts at principled visual attention systems, such as [Navalpakkam and Itti, 2006], which uses only local visual features, cannot direct attention to objects which are outside the visual field. Others, like Sprague and Ballard [2007], formalise the role of attention to subserve action execution, using independent models for the elements in the environment. Thus they cannot predict the elements’ interactions, like others’ actions, and employ them for action selection or attention allocation. Attention systems like [de Croon and Postma, 2007; Paletta et al., 2005; Denzler and Brown, 2002] utilise the contextual information connected to low-level visual features and suffer from limited generalisation capabilities and reduced applicability to dynamic environments.
Proposed Approach. To achieve the efficient spatial-
temporal coupling between the agent’s sensors and the en-
vironment, we propose a probabilistic generative framework
based on a mixture of Kalman Filters (KFs). It exploits sev-
eral KF properties, like fast analytical update and computa-
tion of entropy, to reduce the computational complexity and
evaluate the dynamic context. The sensor-control policy is di-
rectly derived from the principle of maximisation of expected
information gain. It explicitly attempts to reduce the uncertainty about the event which is currently taking place.
¹ Anticipation and prediction of others’ goals is at the basis of most human-human and human-robot interactions [Demiris, 2007].
² In the case of agent actions, this corresponds to the online solution of the intention recognition problem [Baker et al., 2009; Ramirez and Geffner, 2011; Demiris, 2007]. This is a hard problem which can be further complicated by partial observability and exacerbated by the uncertainty of the environment.
Swift evaluation of the dynamic context can be achieved by
making two assumptions: Firstly, events can be represented
as linear dynamic systems³. Consequently, the state of the
dynamic system implicitly characterises the expected kine-
matics. Secondly, a limited number of elements participate in
each event. The expensive multidimensional optimisation for
selecting the next sensor configuration can be approximated
by choosing the configuration that maximises the expected
information gain for the event from a set of candidates. Ef-
fective candidates can be built by reusing the predictions of
the KF. This also allows the system to focus its sensors on positions outside of their current field of view and to select between visually similar elements, thus overcoming the limits
of some other approaches like [Navalpakkam and Itti, 2006].
We apply this framework to the problem of directing the attention of an iCub humanoid robot to perceive a goal-directed action, an archetype of a non-local event. To furnish the necessary models of events we follow [Demiris and Khadhouri, 2008] and the simulation theory of action perception, reusing the trajectory planner knowledge of the robot [Gori et al., 2012]. The transition function of each KF of the mixture is an instance of the same stable linear dynamic system that is used in the planner but has a different attractor corresponding to a different target. [Demiris and Khadhouri, 2008] recognise actions by running in parallel the different models corresponding to the various action hypotheses. Their system controls attention through the direct competition between the models to access the information they need. This results in the selection of the elements which are necessary to control the currently winning action⁴. If the predictions of the different models for the selected elements are similar then the observations may not be useful for the recognition of the event. Instead, our system directs attention to directly discriminate between the different events using information gain maximisation.
Using a robot simulator and synthetic and recorded human
goal-directed actions, we compare our framework with the
approach we proposed in [Ognibene et al., 2012]. The latter extends [Demiris and Khadhouri, 2008] by integrating its predictive models with a separate KF for each element in the environment.
2 Active Event Recognition
In this section the AER is defined and a solution based on a
mixture of KF using Information Gain (AERIG) is described.
Problem definition. The graphical model in figure 2 displays the formulation of the problem. The discrete hidden stochastic variable $V$ represents the class of the event which is taking place, characterised by a different dynamic of the environment that the agent must predict and recognise. The environment is composed of a fixed set of elements $I = \{i = 1 \ldots n\}$ and thus its state $X^k$ at time $k$ is composed of the states $X_i^k$ of the different elements. For each value of $V$ the evolution of $X^k$ is determined by a different dynamic system with different independence conditions between the elements. At each time step the agent receives for each element $i$ an observation $o_i^k$ which depends on the current configuration of the sensors $\theta^k$. The states and observations are continuous variables.
³ While the coupling between an agent and its environment as a dynamic system has long been studied in various disciplines like artificial life and robotics [Nolfi and Floreano, 2000; Beer, 1995], it has not received attention in the field of action recognition, where it can deliver several computational advantages over normative methods [Baker et al., 2009] and is more parsimonious in terms of knowledge, allowing direct use without the extensive environment-based training necessary for models like [Bruce and Gordon, 2004].
⁴ This attention system also results in predictive gaze saccades toward the action target. The use of predictive models to control attention during action perception is supported by some experimental results like [Flanagan and Johansson, 2003].
Figure 2: Perception of a non-local event. The discrete variable $V$ represents the class of the event. The variables $X_i$ are the state variables of various elements in the environment. Their next state is determined by their previous state and by $V$. The dashed connections indicate that the connections are sparse and that an element $X_i$ is affected by an element $X_j$ with $i \neq j$ only for a limited set of values of $V$. Each element $i$ provides a different observation $O_i^k$ which depends on the current state of the element and on the sensor configuration $\Theta$. [Graphical model with nodes $V$, $X_1^{k-1} \ldots X_n^{k-1}$, $X_1^k \ldots X_n^k$, $\Theta^k$ and $O_1^k \ldots O_n^k$.]
At every time step the goal of the system is to select the configuration $\hat{\theta}^k$ that will minimise the expected uncertainty over $V$ (quantified by entropy $H$):
$$\hat{\theta}^k = \arg\min_{\theta^k} \int_O p(o^k \mid \theta^k)\, H(V \mid o^k, \theta^k)\, do^k \qquad (1)$$
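As a minimal numerical sketch of eq. 1, assume a scalar observation and two event hypotheses whose predicted observations are Gaussian. The function names, the grid-based integration and the two candidate configurations below are illustrative assumptions and are not taken from the original system. The expected posterior entropy is approximated by a Riemann sum over the observation space, and the candidate with the lower value is preferred.

```python
import numpy as np

def normal_pdf(o, mean, std):
    """Density of a scalar normal distribution."""
    return np.exp(-0.5 * ((o - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def expected_posterior_entropy(prior, means, stds, grid):
    """Riemann-sum approximation of the integral in eq. 1 (scalar observation)."""
    prior = np.asarray(prior, dtype=float)
    # lik[v, j] = p(o_j | v, theta) for each hypothesis v and grid point o_j
    lik = np.array([normal_pdf(grid, m, s) for m, s in zip(means, stds)])
    joint = prior[:, None] * lik                    # P(v) p(o | v, theta)
    evidence = joint.sum(axis=0)                    # p(o | theta)
    post = joint / np.maximum(evidence, 1e-300)     # P(v | o, theta)
    entropy = -np.sum(post * np.log(np.maximum(post, 1e-300)), axis=0)
    do = grid[1] - grid[0]
    return float(np.sum(evidence * entropy) * do)

grid = np.linspace(-10.0, 10.0, 4001)
prior = [0.5, 0.5]
# Hypothetical candidate A: the two events predict well-separated observations.
h_a = expected_posterior_entropy(prior, means=[-1.0, 1.0], stds=[0.5, 0.5], grid=grid)
# Hypothetical candidate B: the two events predict almost the same observation.
h_b = expected_posterior_entropy(prior, means=[-0.1, 0.1], stds=[0.5, 0.5], grid=grid)
best = "A" if h_a < h_b else "B"
```

Under these assumed numbers the candidate whose hypotheses predict well-separated observations yields the lower expected entropy, which is the kind of configuration eq. 1 favours.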
Proposed solution. For the recognition of the event and for the selection of the sensor configuration it is necessary to compute the posterior $P(v \mid o^k; \theta^k)$. Given a prior distribution $P(v, x^k_{1:N}) = P(x^k_{1:N} \mid v) P(v)$ and the independence of the observed event from the sensor configuration, $P(v \mid \theta) = P(v)$, the update expression of the posterior $P(v \mid o^{k+1}, \theta^{k+1})$ can be derived using Bayes’ rule:
$$P(v \mid o^{k+1}, \theta^{k+1}) = \frac{P(o^{k+1} \mid v, \theta^{k+1})\, P(v)}{P(o^{k+1} \mid \theta^{k+1})} \qquad (2)$$
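The update of eq. 2 can be sketched as follows, assuming (as the text does further on) that each hypothesis $v$ predicts a Gaussian observation with mean $\bar{o}_v$ and covariance $S_v$. The function names and the numbers in the usage lines are hypothetical and serve only as an illustration.

```python
import numpy as np

def gaussian_logpdf(o, mean, cov):
    """Log-density of a multivariate normal N(o; mean, cov)."""
    o, mean, cov = np.atleast_1d(o), np.atleast_1d(mean), np.atleast_2d(cov)
    diff = o - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(o) * np.log(2.0 * np.pi) + logdet + maha)

def update_event_posterior(prior, pred_means, pred_covs, o):
    """Eq. 2: P(v | o) is proportional to P(o | v) P(v)."""
    log_post = np.log(prior) + np.array(
        [gaussian_logpdf(o, m, S) for m, S in zip(pred_means, pred_covs)]
    )
    log_post -= log_post.max()        # subtract the maximum for numerical stability
    post = np.exp(log_post)
    return post / post.sum()          # the normaliser plays the role of P(o | theta)

# Hypothetical two-hypothesis example with 2D observations.
prior = np.array([0.5, 0.5])
pred_means = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
pred_covs = [0.1 * np.eye(2), 0.1 * np.eye(2)]
post = update_event_posterior(prior, pred_means, pred_covs, o=np.array([0.9, 0.0]))
```

Since the assumed observation lies much closer to the second predicted mean, the posterior mass shifts towards the second hypothesis.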
The computation of eq. 1 and eq. 2 in the general case can be computationally prohibitive. The proposed solution is based on the assumption that, once $v$ is fixed, the dynamics are linear and the probability distributions are normal. This enables the use of a mixture of KFs with a distinct KF for each value of $v$. Denoting with $\bar{o}^{k+1}_{v,\theta^{k+1}}$ the mean expected observation and with $S^{k+1}_{v,\theta^{k+1}}$ its covariance matrix, both of which are conditioned on $v$ and $\theta$ and computed during the KF update, the following can be derived:
$$\hat{\theta}^{k+1} = \arg\min_{\theta^{k+1}} \sum_{v} P(v) \left[ \frac{1}{2} \ln \left| S^{k+1}_{v,\theta^{k+1}} \right| + \int_O N\!\left(o;\, \bar{o}^{k+1}_{v,\theta^{k+1}}, S^{k+1}_{v,\theta^{k+1}}\right) \ln\!\left(P(o \mid \theta^{k+1})\right) do \right] \qquad (3)$$
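where $|S|$ is the determinant of a matrix $S$. The first-order Taylor expansion of $P(o \mid \theta)$ at point $\bar{o}^{k+1}_{v}$ results in:
$$\hat{\theta}^{k+1} \approx \arg\min_{\theta^{k+1}} \sum_{v} P(v) \left[ \frac{1}{2} \ln \left| S^{k+1}_{v,\theta^{k+1}} \right| + \ln \sum_{v'}^{V} P(v')\, N\!\left(\bar{o}^{k+1}_{v,\theta^{k+1}};\, \bar{o}^{k+1}_{v',\theta^{k+1}}, S^{k+1}_{v',\theta^{k+1}}\right) \right] \qquad (4)$$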
The first term of eq. 4 is the average of the expected entropy of the observations for each model $v$ at the new observation point. The second term is a measure of how much, in the new sensor configuration, the observations predicted by a model will differ from those predicted by other models. Thus, the proposed formulation integrates a trade-off between discriminating the event hypotheses and maintaining their perception quality.
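The objective of eq. 4 can be evaluated directly from the predicted observation means and covariances of each hypothesis. The sketch below is an illustration under assumed inputs: the function names and the two candidate configurations are hypothetical, and the covariances are taken to be the same for both candidates so that only the second (discriminative) term differs.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(x; mean, cov)."""
    x, mean, cov = np.atleast_1d(x), np.atleast_1d(mean), np.atleast_2d(cov)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha))

def eq4_score(prior, pred_means, pred_covs):
    """Approximate objective of eq. 4 for one candidate sensor configuration.

    Lower is better: the first term rewards low observation noise, the
    second rewards predictions that differ between hypotheses.
    """
    score = 0.0
    for v, p_v in enumerate(prior):
        _, logdet = np.linalg.slogdet(np.atleast_2d(pred_covs[v]))
        mixture_at_mean = sum(
            p_w * gaussian_pdf(pred_means[v], pred_means[w], pred_covs[w])
            for w, p_w in enumerate(prior)
        )
        score += p_v * (0.5 * logdet + np.log(mixture_at_mean))
    return score

prior = [0.5, 0.5]
cov = 0.01 * np.eye(2)
# Hypothetical candidate X: the two events predict different observations.
score_x = eq4_score(prior, [[0.0, 0.0], [1.0, 0.0]], [cov, cov])
# Hypothetical candidate Y: both events predict the same observation.
score_y = eq4_score(prior, [[0.0, 0.0], [0.0, 0.0]], [cov, cov])
```

With equal noise in both candidates, the one whose hypotheses disagree (X) receives the lower score and would be selected.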
We will now use the assumption that each event (i.e. each value of $v$) involves only a limited subset of elements of the environment, the set $I_v \subset I$. The dynamics of these elements will be coupled; formally, their transition probabilities depend on the state of the other elements of $I_v$, that is $P(x_i^k \mid x_{1:N}^{k-1}, v) \equiv P(x_i^k \mid x_{I_v}^{k-1}, v),\ i \in I_v$. The dynamics of those elements which do not participate in the event will be independent of each other, $P(x_i^k \mid x_{1:N}^{k-1}, v) \equiv P(x_i^k \mid x_i^{k-1}),\ i \notin I_v$. This allows the decomposition of each KF corresponding to a value of $v$ into a set of KFs of lower dimension. One of the KFs will model the coupled dynamics of the elements in $I_v$ while the dynamics of each element $i \notin I_v$ will be modelled by an independent KF. This decomposition delivers two advantages: 1) The computational cost of updating a KF is cubic in the dimension of the state space. Thus, updating a set of lower-dimensional KFs is more efficient than updating a single high-dimensional one; 2) The KFs for the elements which do not belong to $I_v$ are independent of $v$, which permits reuse of the results of the KF update.
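A rough back-of-the-envelope count illustrates the two advantages. The sketch assumes a 6-dimensional hand state (3D position and speed), 3-dimensional object states, one hypothesis per object, and a cost of exactly $d^3$ per update of a $d$-dimensional KF; constants and observation dimensions are ignored, so the figures are only indicative.

```python
def full_cost(n_objects, hand_dim=6, obj_dim=3):
    """One joint KF over all elements for each of the n_objects hypotheses."""
    d = hand_dim + obj_dim * n_objects
    return n_objects * d ** 3

def decomposed_cost(n_objects, hand_dim=6, obj_dim=3):
    """One coupled hand+target KF per hypothesis, plus one independent KF
    per object whose update is shared by all hypotheses."""
    coupled = n_objects * (hand_dim + obj_dim) ** 3
    independent = n_objects * obj_dim ** 3
    return coupled + independent

# With 4 objects: 4 * 18**3 = 23328 versus 4 * 9**3 + 4 * 3**3 = 3024.
ratio = full_cost(4) / decomposed_cost(4)
```

Under these assumptions the decomposition is roughly 7.7 times cheaper for 4 objects, and the gap widens as the number of objects grows.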
The KF decomposition can be utilised for efficient computation of eq. 4. The first term can be computed from the product of the determinant of the covariance matrix $\tilde{S}^{k+1}_{v,\theta}$ of the observation distribution predicted by the KF of $I_v$ and the determinants of those, $\dot{S}^{k+1}_{i,\theta}$, from the KFs of the other elements:
$$\ln \left| S^{k+1}_{v,\theta} \right| = \ln \left| \tilde{S}^{k+1}_{v,\theta} \right| + \sum_{i \in I \setminus I_v} \ln \left| \dot{S}^{k+1}_{i,\theta} \right| \qquad (5)$$
The second term of eq. 4 can also be decomposed in a similar way, leading to reduced computational complexity.
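Eq. 5 follows from the fact that, when the KFs are independent, the joint observation covariance is block diagonal, so its determinant is the product of the determinants of the blocks. A quick numerical check, using randomly generated covariances of assumed sizes (6x6 for the coupled hand and target block, 3x3 for each remaining object), is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    """Random symmetric positive-definite matrix used as a stand-in covariance."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)

S_coupled = random_spd(6)                        # stands in for the coupled block
S_indep = [random_spd(3) for _ in range(3)]      # stand in for the independent blocks

# Assemble the full block-diagonal covariance explicitly.
dim = 6 + 3 * len(S_indep)
S_full = np.zeros((dim, dim))
S_full[:6, :6] = S_coupled
offset = 6
for S_i in S_indep:
    S_full[offset:offset + 3, offset:offset + 3] = S_i
    offset += 3

lhs = np.linalg.slogdet(S_full)[1]               # log-determinant of the full matrix
rhs = np.linalg.slogdet(S_coupled)[1] + sum(
    np.linalg.slogdet(S_i)[1] for S_i in S_indep
)
```

The two quantities agree up to floating-point error, so the large matrix never needs to be formed in practice.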
Solving eq. 4 directly, even after decomposition, can still be complex. Instead, a finite set $\hat{\Theta}$ of candidate sensor configurations is defined and the one with the highest expected information gain is selected. In this work, the elements of $\hat{\Theta}$ are the configurations which minimise the noise on any of the elements according to any of the possible hypotheses. E.g., $\hat{\Theta}$ can be the set of different camera configurations focusing on the predicted positions of the different objects according to the different possible events.
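One possible way to build and search such a candidate set is sketched below. The data layout (a dictionary of predicted positions per hypothesis and element), the function names and the toy scoring function are assumptions made for illustration; in the framework described above the score would be the objective of eq. 4 evaluated for each candidate gaze point.

```python
import numpy as np

def build_candidates(predicted_positions):
    """Collect the distinct predicted positions of all elements under all hypotheses.

    predicted_positions[v][element] is the position of `element` predicted
    by the KF associated with hypothesis v.
    """
    candidates = []
    for per_element in predicted_positions.values():
        for pos in per_element.values():
            pos = np.asarray(pos, dtype=float)
            if not any(np.allclose(pos, c) for c in candidates):
                candidates.append(pos)
    return candidates

def select_gaze(candidates, score_fn):
    """Return the candidate gaze point with the lowest score."""
    scores = [score_fn(c) for c in candidates]
    return candidates[int(np.argmin(scores))]

# Hypothetical predictions: the hand differs between hypotheses, the objects do not.
predicted = {
    1: {"hand": [0.1, 0.0, 0.0], "obj1": [0.5, 0.0, 0.0], "obj2": [0.0, 0.5, 0.0]},
    2: {"hand": [0.0, 0.1, 0.0], "obj1": [0.5, 0.0, 0.0], "obj2": [0.0, 0.5, 0.0]},
}
cands = build_candidates(predicted)
# Toy score for demonstration only: distance to an arbitrary reference point.
reference = np.array([0.0, 0.1, 0.0])
best = select_gaze(cands, lambda c: float(np.linalg.norm(c - reference)))
```

Because the candidates come from the KF predictions, they can lie outside the current field of view, and the search cost grows only with the number of elements times the number of hypotheses.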
3 Experimental Evaluation: Active action
recognition
The framework proposed in the previous section, AERIG,
has been applied to control the gaze of an iCub humanoid
robot that has to anticipate the target of a reaching ac-
tion performed by a human partner (see figure 1). In
the tests the robot showed human-like attentional behaviour
while observing human actions, switching its attention from
tracking the hand to the anticipated target (see videos at
http://www.imperial.ac.uk/personalrobotics).
The most important contribution of the approach is the pre-
dictive recognition of the event and the temporal advantage
for action preparation it permits. However, due to human
reaction delays, it is complex to assess the temporal perfor-
mance of the approach with a real robot. In fact, during the
tests, unlike during normal operations, it is important to com-
pare synchronously the timings of performer’s intention and
action with the evolution of the robot estimations. Instead
simulation results are reported with different versions of the
sensor model: with a fixed camera (NO Visual Attention
Controller, NOVAC) randomly placed near one of the ele-
ments, an ideal camera with instantaneous and precise gaz-
ing movements (IDEALVAC), and with the simulated cam-
era controller reproducing the speed and trajectories of the
real iCub camera controller [Pattacini, 2010] (REALVAC).
The robot’s perception of the position of an element (an object or the other agent’s hand) is considered affected by Gaussian noise whose variance is a linear function of the distance $d$ between the gaze focus point and the element ($\sigma = 0.1 + 0.2 \times d$)⁵. This observation model is a simplification of human vision, which has much higher resolution in the centre [Findlay and Gilchrist, 2003]. The system was tested both on synthetic data, generated by the same control model used by the robot, and on natural human reaching actions recorded with the Kinect. AERIG is compared to a heuristic attention controller proposed in [Ognibene et al., 2012] (HEURISTIC). The heuristic balances element position uncertainty with event uncertainty by selecting as the next gaze target the element with the highest product between its position entropy and the confidence of the event in which it takes part⁶. In this task, the environment ($I$) is composed of the hand of the human performer ($h$) and a fixed number $N_e$ of graspable objects ($i = 1 \ldots N_e$). Each event $v$ corresponds to a hand movement to reach a different object in the environment (thus the domain of $V$ is $\{1 \ldots N_e\}$). The state $X_i^k$ of an object $i$ represents its position, while the state $X_h^k$ of the hand consists of both speed and position. The set of coupled elements $I_v$ is composed of the hand and the object with index $i \equiv v$. The KF has the same transition matrix which is used to control the execution of a reaching action, thus complying with the simulation theory of action perception [Demiris, 2007; Dindo et al., 2011]. The equation employed by a given model $v$ to compute the next effector position $x^{k+1}_{p,v}$, when the target is at position $x^k_v$, is:
$$x^{k+1}_{p,v} = x^k_{h,v} + \tau \left\{ \dot{x}^k_{h,v} + \tau \left[ K \left( x^k_v - x^k_{h,v} \right) - D\, \dot{x}^k_{h,v} \right] \right\} \qquad (6)$$
$K$ and $D$ are typical constants of a PD controller while $\tau$ is the related time integration parameter. They are set to the same default values used for action generation, $K = 1.5$, $D = 3$ and $\tau = 0.16$; $\tau$ can be modulated to model different motion speeds. The Kalman filter associated with each single element assumes that the element is still (thus the transition matrix is the identity) with its position affected only by zero-mean Gaussian noise with variance 0.0025. The actual system is open source and available online at http://www.imperial.ac.uk/personalrobotics.
⁵ An object-specific factor can be applied to reflect its intrinsic difficulty in recognition and localisation. This observation noise model is correlated with the state, thus the optimality properties of the Kalman filter cannot be guaranteed.
⁶ The effector position is estimated using a KF and the same motion model used by AERIG [Gori et al., 2012]. The target position needed by the motion model is estimated by an independent KF whose model assumes the target to be still; this does not allow the motion of the effector to be used to correct the target position. The confidence $c$ is updated at every time step using the prediction error $e_h^k$ on the hand position: $c^{k+1} = c^k + (1 + e_h^{k+1})^{-1}$.
Figure 3: Evolution of the average success rate on simulated data with 4 objects. [Plot: x-axis “Time Steps (1 = 0.1 sec)”, y-axis “Correctness Ratio (1000 run)”; curves for the AERIG and HEURISTIC agents under Ideal VAC, Real VAC and No VAC sensor control, at object distances of 15, 30 and 60 cm.]
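The reaching model of eq. 6 can be written as a linear transition on a stacked state, which is what makes it usable inside a KF. The sketch below uses the reported values of $K$, $D$ and $\tau$; the velocity update, the stacked state layout [position, speed, target] and the target position are assumptions introduced for illustration, since the text gives only the position equation.

```python
import numpy as np

K, D, TAU = 1.5, 3.0, 0.16   # values reported in the text

def pd_step(x, xdot, target, k=K, d=D, tau=TAU):
    """One step of eq. 6. The velocity update is an assumption consistent with it."""
    xdot_next = xdot + tau * (k * (target - x) - d * xdot)
    x_next = x + tau * xdot_next          # same as the right-hand side of eq. 6
    return x_next, xdot_next

def transition_matrix(k=K, d=D, tau=TAU, dim=3):
    """Equivalent linear map on the stacked state [x, xdot, target] (target still)."""
    eye = np.eye(dim)
    zero = np.zeros((dim, dim))
    a = 1.0 - tau * d                     # velocity damping factor
    return np.block([
        [(1.0 - tau ** 2 * k) * eye, tau * a * eye, tau ** 2 * k * eye],
        [-tau * k * eye,             a * eye,       tau * k * eye],
        [zero,                       zero,          eye],
    ])

target = np.array([0.3, 0.1, 0.0])        # hypothetical target position
x, xdot = np.zeros(3), np.zeros(3)
state = np.concatenate([x, xdot, target])
F = transition_matrix()
for _ in range(200):
    x, xdot = pd_step(x, xdot, target)    # step-by-step form
    state = F @ state                     # matrix form
```

Both forms produce the same trajectory, and with these gains the hand converges to the target, which is the stable attractor behaviour the KF of each hypothesis relies on. Using a different target in the stacked state gives the transition model of a different hypothesis.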
3.1 Results
Figure 3 reports the evolution of the “success rate”, i.e. how often the currently estimated event is actually the correct one, with simulated effector trajectories in an environment with 4 objects at different distances from each other. The average duration of a trial depends on the distance between the objects. In each condition 1000 runs were executed. The figure shows significant improvement (up to two times better) when AERIG is adopted, delivering more accurate and faster recognition in all conditions, especially when the task is more difficult. It also shows the crucial role of active sensor control. Similar results were observed with a varying number of objects, thus showing that AERIG can scale with the number of elements. On the computational side, on a conventional 2010 PC (2.8 GHz), AERIG takes 0.4 ms per frame with 16 targets.
Figure 4 displays the average evolution of the attentional behaviour of the AERIG and HEURISTIC agents with the real gaze controller. The graph shows that neither system focuses just on the effector; both continuously alternate between it and the targets. Thus AERIG, which uses the effector motion information to correct the estimated position of the target, can more precisely direct its saccades towards the target and thus improve the overall prediction. The AERIG system explores the environment and does not focus only on the current most probable target and effector position. Subsequently it increases the saccades towards those elements and in particular the target. The HEURISTIC model instead focuses only on the elements of the most probable event. Figure 5 displays the results of AERIG when applied to human reaching actions recorded with the Kinect. The 3D positions of the right hand of the user skeleton, extracted using the OpenNI [Ope, 2010] and NITE [Pri, 2010] libraries, were recorded during 24 reaching actions towards 4 objects positioned on a circle of radius 15 cm (see figure 6). The actions, with an average duration of 1 sec, were performed naturally from different starting positions at an average distance of 40 cm from the circle centre. At each time step, the actual position perceived by the robot was affected by noise depending on the current gaze position in the same way as in the previous experiment. It is easy to see that the trajectories are not straight, unlike those predicted by the models used.
Figure 4: Evolution of the average gaze focus during a trial with REALVAC and 30 cm of distance between the targets. It shows only the saccades directed towards the elements of the currently most probable action. [Plot: x-axis “Time Steps”, y-axis “Ratio x trial (1000 run)”; gaze focus on the predicted performer position and on the predicted action target for the AERIG and Heuristic agents.]
Figure 5: Evolution of the average success rate on recorded human action with 4 objects at a distance of 15 cm. [Plot: x-axis “Time steps (1 = 0.06 secs)”, y-axis “Success ratio (100 runs)”; AERIG and HEURISTIC agents under Ideal VAC and No VAC sensor control.]
Figure 6: Samples of recorded reaching action trajectories (2D projection) towards two different objects (circles). [Plot: axes x (mm) and y (mm).]
These data show that AERIG delivers improvement in event recognition even when its knowledge is noisy and does not match the real dynamics of the events. While both AERIG and the heuristic model use the same action model, and thus suffer from similar problems with the recorded data, AERIG is more robust because it is sensitive to uncertainty when selecting the actual event. We also hypothesise that the robustness of AERIG is partially due to the use of stable dynamic systems to model the interactions in the environment. The trajectories are also affected by noise, which is especially evident near the target objects. This noise cannot be reduced by the active control of the camera. This helps explain the shape of the performance curves in Figure 5. Figure 7 displays the performance of the systems when the robot sensor has a limited field of view (radius of 40 cm around the gaze focus) and reduced noise ($\sigma = 0.03 + 0.06 \times d$), with four objects at a distance of 30 cm. This result, while preliminary, shows that only AERIG with an active camera can provide useful event recognition⁷. This is due to the use of the effector motion information to adjust the initial random guesses of the target position, which in the other conditions were based on peripheral vision. These adjustments are subsequently used to confirm the presence of the target with direct gazes.
4 Conclusions
This paper introduces AERIG, a probabilistic framework for the active perception of multi-element dynamic events, like goal-oriented actions. It addresses the requirements of the APA problem, which consists of finding effective sensor control strategies to gather the information necessary to predict the evolution of the environment. AERIG actively directs sensors to the most discriminative configurations to uncover the actual dynamics of the environment. This leads to the spatio-temporal coupling of the sensors with the environment dynamics.
⁷ The policy used by the systems to manage missing observations was: a) if the element was predicted to be observed then generate a random observation outside of the field of view and reduce the probability of the event to the minimum event probability, b) otherwise generate a noisy observation of the element centred on the predicted position.
Figure 7: Evolution of the average success rate of the system with limited field of view. [Plot: x-axis “Timesteps (0.1 secs)”, y-axis “Success ratio (100 runs)”; AERIG and HEURISTIC agents under Ideal VAC and No VAC sensor control.]
While actively recognising a complex dynamic context
[Rothkopf et al., 2007] has crucial behavioural importance, this problem has not been sufficiently addressed in the literature on attention. In fact, most of the attention models assume
artificial laboratory conditions and offer limited explanations
of task oriented attention in dynamic environments [Tatler et
al., 2011]. Thus, it would be intriguing to compare the pre-
dictions of the proposed framework with actual human be-
haviour in natural conditions. AERIG applied to the recog-
nition of goal oriented actions represents a sound and robust
implementation of the simulation theory of action perception
[Demiris, 2007]. While some theories [Flanagan and Johansson, 2003] support the reuse of the same attention control
schemas used during action execution, our approach focuses
on the direct reduction of uncertainty during the perception
of the action. In this task the tests showed substantial perfor-
mance improvement over baseline approaches. Interestingly
the framework allows a mathematical formulation of the intu-
itive trade-off between a conservative behaviour, which tries
to observe all the events and elements, and a greedy one that
is only interested in determining the real event which is taking
place.
In this work, we tested our approach only on stable, approximately linear systems with a limited number of elements and only used position information. In the future, we plan to extend the framework to use more complex models and to integrate it with the overall agent control architecture.
References
[Aloimonos et al., 1988] J. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision. International Journal of Computer Vision, 1(4):333–356, January 1988.
[Andreopoulos and Tsotsos, 2013] A. Andreopoulos and J. K. Tsotsos. A computational learning theory of active object recognition under uncertainty. Int. J. Comp. Vis., 101(1):95–142, 2013.
[Bajcsy, 1988] R. Bajcsy. Active perception. Proceedings of the IEEE, Special issue on Computer Vision, 76(8):966–1005, 1988.
[Baker et al., 2009] C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning. Cognition, 2009.
[Balkenius and Johansson, 2007] C. Balkenius and B. Johansson. Anticipatory models in gaze control: a developmental model. Cognitive Processing, 8(3):167–174, 2007.
[Ballard, 1991] D. H. Ballard. Animate vision. Artificial Intelligence, 48:57–86, 1991.
[Beer, 1995] R. D. Beer. A dynamical systems perspective on agent-environment interaction. Artificial Intelligence, 72:173–215, 1995.
[Bruce and Gordon, 2004] A. Bruce and G. Gordon. Better motion prediction for people-tracking. In Proc. of the 2004 IEEE International Conference on Robotics and Automation, New Orleans, LA, USA, May 2004.
[de Croon and Postma, 2007] G. de Croon and E. O. Postma. Sensory-motor coordination in object detection. In Proceedings of the IEEE Symposium on Artificial Life, ALIFE’07, pages 147–154, 2007.
[Demiris and Khadhouri, 2008] Y. Demiris and B. Khadhouri. Content-based control of goal-directed attention during human action perception. Interaction Studies, 9(2):353–376, 2008.
[Demiris, 2007] Y. Demiris. Prediction of intent in robotics and multi-agent systems. Cognitive Processing, 8(3):151–158, 2007.
[Denzler and Brown, 2002] J. Denzler and C. M. Brown. Information Theoretic Sensor Data Selection for Active Object Recognition and State Estimation. Transactions in Pattern Analysis and Machine Intelligence, 24(2):145–157, 2002.
[Dindo et al., 2011] H. Dindo, D. Zambuto, and G. Pezzulo. Motor simulation via coupled internal models using sequential Monte Carlo. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, July 2011.
[Findlay and Gilchrist, 2003] J. M. Findlay and I. D. Gilchrist. Active vision: The psychology of looking and seeing. Oxford University Press, 2003.
[Flanagan and Johansson, 2003] J. Randall Flanagan and Roland S. Johansson. Action plans used in action observation. Nature, 424(6950):769–771, 2003.
[Gori et al., 2012] I. Gori, U. Pattacini, F. Nori, G. Metta, and G. Sandini. DForC: a real-time method for reaching, tracking and obstacle avoidance in humanoid robots. In IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS2012), Osaka, Japan, November 29 - December 1, 2012.
[Gould et al., 2007] S. Gould, J. Arfvidsson, A. Kaehler, B. Sapp, M. Messner, G. Bradski, P. Baumstarck, S. Chung, and A. Y. Ng. Peripheral-foveal vision for real-time object recognition and tracking in video. In Proceedings of IJCAI07, 2007.
[Navalpakkam and Itti, 2006] V. Navalpakkam and L. Itti. An integrated model of top-down and bottom-up attention for optimal object detection. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[Nolfi and Floreano, 2000] S. Nolfi and D. Floreano. Evolutionary Robotics: The Biology, Intelligence, and Technology. MIT Press, Cambridge, MA, USA, 2000.
[Ognibene et al., 2010] D. Ognibene, G. Pezzulo, and G. Baldassarre. How can bottom-up information shape learning of top-down attention control skills? In Proceedings of 9th International Conference on Development and Learning, 2010.
[Ognibene et al., 2012] D. Ognibene, E. Chinellato, M. Sarabia, and Y. Demiris. Towards contextual action recognition and target localization with active allocation of attention. In Proc. of First Int. Conf. on Living Machines, pages 192–203, 2012.
[Ope, 2010] OpenNI organization. OpenNI User Guide, November 2010. Last viewed 19-01-2011 11:32.
[Paletta et al., 2005] L. Paletta, G. Fritz, and C. Seifert. Cascaded sequential attention for object recognition with informative local descriptors and Q-learning of grouping strategies. In Proc. of Computer Vision and Pattern Recognition (CVPR), 2005.
[Pattacini, 2010] U. Pattacini. Modular Cartesian Controllers for Humanoid Robots: Design and Implementation on the iCub. PhD thesis, RBCS, Istituto Italiano di Tecnologia, Genoa, 2010.
[Pezzulo and Ognibene, 2012] G. Pezzulo and D. Ognibene. Proactive action preparation: Seeing action preparation as a continuous and proactive process. Motor Control, 16(3):386–424, 2012.
[Pezzulo, 2008] G. Pezzulo. Coordinating with the future: the anticipatory nature of representation. Minds and Machines, 18(2):179–225, 2008.
[Pri, 2010] PrimeSense Inc. Prime Sensor NITE 1.3 Algorithms notes, 2010. Last viewed 19-01-2011 15:34.
[Ramirez and Geffner, 2011] M. Ramirez and H. Geffner. Goal recognition over POMDPs: Inferring the intention of a POMDP agent. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 2011.
[Rothkopf et al., 2007] C. A. Rothkopf, D. H. Ballard, and M. M. Hayhoe. Task and context determine where you look. Journal of Vision, 7(14):1610–1620, 2007.
[Sommerlade and Reid, 2008] R. Sommerlade and I. Reid. Information theoretic active scene exploration. In Proceedings of 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, June 2008.
[Sprague et al., 2007] N. Sprague, D. Ballard, and A. Robinson. Modeling embodied visual behaviors. ACM Trans. Appl. Percept., 4(2):11, 2007.
[Sridharan et al., 2010] M. Sridharan, J. Wyatt, and R. Dearden. Planning to see: A hierarchical approach to planning visual actions on a robot using POMDPs. Artificial Intelligence, 174(11):704–725, 2010.
[Suzuki and Floreano, 2008] M. Suzuki and D. Floreano. Enactive robot vision. Adaptive Behavior, 16(2-3):122–128, 2008.
[Tatler et al., 2011] B. W. Tatler, M. M. Hayhoe, M. F. Land, and D. Ballard. Eye guidance in natural vision: Reinterpreting salience. Journal of Vision, 11(5):1–23, 2011.
[Vijayanarasimhan and Kapoor, 2010] S. Vijayanarasimhan and A. Kapoor. Visual recognition and detection under bounded computational resources. In Proc. of Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1006–1013, June 2010.
[Vogel and de Freitas, 2008] J. Vogel and N. de Freitas. Target-directed attention: Sequential decision-making for gaze planning. In The International Conference of Robotics and Automation (ICRA), pages 2372–2379, 2008.
[Walther et al., 2005] D. Walther, U. Rutishauser, C. Koch, and P. Perona. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding, 100(1-2):41–63, October 2005.
[Zilberstein and Russell, 1996] S. Zilberstein and S. Russell. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181–213, 1996.