Towards Active Event Recognition
Dimitri Ognibene, Yiannis Demiris
Personal Robotics Lab, Imperial College London, UK
{d.ognibene,y.demiris}@imperial.ac.uk
Abstract
Directing robot attention to recognise activities and
to anticipate events like goal-directed actions is a
crucial skill for human-robot interaction. Unfor-
tunately, issues like intrinsic time constraints, the
spatially distributed nature of the entailed infor-
mation sources, and the existence of a multitude
of unobservable states affecting the system, like
latent intentions, have long rendered achievement
of such skills a rather elusive goal. The problem
tests the limits of current attention control systems.
It requires an integrated solution for tracking, ex-
ploration and recognition, which traditionally have
been seen as separate problems in active vision. We
propose a probabilistic generative framework based
on information gain maximisation and a mixture of
Kalman Filters that uses predictions in both recog-
nition and attention-control. This framework can
efficiently use the observations of one element in
a dynamic environment to provide information on
other elements, and consequently enables guided
exploration. Interestingly, the sensor-control policy, directly derived from first principles, represents the intuitive trade-off between finding the most discriminative clues and maintaining overall awareness. Experiments on a simulated humanoid robot observing a human executing goal-oriented actions demonstrated improvements in recognition time and precision over baseline systems.
1 Introduction
Anticipating the evolution of the external environment, which
may comprise other agents, is essential to prepare and pro-
duce effective action responses [Pezzulo and Ognibene, 2012;
Pezzulo, 2008; Balkenius and Johansson, 2007]. In this con-
text an autonomous agent has to face three main difficulties:
1) acquire a predictive model of the environment; 2) compute predictions in a limited time [Zilberstein and Russell, 1996]; 3) control its sensors to perceive the current state of the environment and predict its evolution.

This research was funded by the EFAA FP7-ICT project, Grant Agreement no. 270490. The authors want to thank Yan Wu, Miguel Sarabia, Kyuhwa Lee, Eris Chinellato, Helgi P. Helgason, Margarita Kotti, Harold Soh, Sotirios Chatzis and Giovanni Pezzulo for their suggestions and the IJCAI reviewers for their helpful comments.

Figure 1: Experimental setup: iCub looking at target objects and arm movements (bottom right). The top-left image shows the iCub's gaze following the hand. In the top-right, hand movements of the human are anticipated by the iCub's gaze. Finally, the bottom-left image shows gaze focused on the target object before the hand reaches it. These three images were grabbed from the iCub camera during an interaction trial.
Active Perception for Anticipation. This paper focuses
specifically on the third problem, which can be named ‘Ac-
tive Perception for Anticipation’ (APA). It consists in finding
effective sensor control strategies to gather the information
necessary to feed the available predictive models. The APA
problem is proposed as a new component of the active vi-
sion paradigm [Aloimonos et al., 1988; Ballard, 1991; Ba-
jcsy, 1988; Suzuki and Floreano, 2008], which complements
and integrates other components like active monitoring dur-
ing action execution [Sridharan et al., 2010], object detection
[Vijayanarasimhan and Kapoor, 2010; de Croon and Postma,
2007; Vogel and de Freitas, 2008], tracking [Sommerlade and
Reid, 2008; Gould et al., 2007] and active object recognition
[Paletta et al., 2005; Denzler and Brown, 2002]. These com-
ponents exploit task-specific knowledge to improve percep-
tion performance and reduce computational cost, but active
vision has also been shown to play an important role in improving learning performance [Andreopoulos and Tsotsos, 2013; Walther et al., 2005] even when the task is not known [Ognibene et al., 2010]. An example, which can help in understanding the APA problem, is the prediction of the motion
of one element of interest (EI), say an asteroid, in an environ-
ment composed of multiple elements (e.g., asteroid field) with
known dynamics, i.e., ideal motion and bouncing, when the state of some elements is not precisely known and the sensor's receptive field is limited but controllable. A short-term prediction of the trajectory of the EI can easily be produced by
just tracking it and passively detecting other elements, which
can give rise to an interaction and change the motion of the
EI (e.g., an impact) when they enter into the field of view.
In most conditions longer term predictions are desired to
produce more effective responses. The reliability and tempo-
ral reach of the prediction (e.g., 1 minute without impacts at
99% probability) depends on the sensor control strategy. The
current motion of the EI is a ‘dynamic context’ which can
be used to actively find other elements that could produce
an interaction. The agent can predictively direct its sensors
along the trajectory of the EI and successively explore the
areas close to the trajectory. The system should also come
back to track the EI regularly, otherwise it may lose the EI
(e.g., due to an impact with a very fast undetected element).
Thus an effective attentional strategy is necessary to produce
longer and more reliable predictions.
When the interactions can take place over long ranges, like
when attraction and repulsion forces are present, the motion
of each element is part of the ‘dynamic context’ giving infor-
mation related to the state and presence of the other relevant
elements (e.g., if an asteroid is accelerating the source of at-
traction can be found in the direction of the acceleration). Si-
multaneously, having more information on the elements in the
environment can allow for more precise and longer predic-
tions (e.g., predicting not only the impacts but the non-linear
trajectories too, which in turn helps predicting impacts). This
increases the importance of an active sensor strategy that ex-
plores some candidate areas for the presence of long-range
interacting elements.
Active Event Recognition. In this paper we are interested in
an even more complex condition, where the actual interaction
between elements (e.g., repulsion, attraction or just bouncing)
depends on some hidden states of the elements themselves
(e.g., electric charge). In this case accurate estimates of the motion and position of some elements can contribute to unveiling the value of the hidden state of other elements. In this condition, an efficient active sensor-control policy has the additional role of selecting when and which elements to track, to permit the estimations necessary to recognise the most informative elements (e.g., charged ones). We define Active Event Recognition (AER) as a sensor control strategy aimed at unveiling the class of the event, that is, the value of the hidden states determining the successive evolution of the environment (see figure 2). AER is thus an essential part of the APA problem.
While AER, like all the previous conditions, shows the im-
portance of an active control of sensors for effective predic-
tions in dynamic environments, AER is of particular interest
because it can be directly mapped to a broad set of conditions comprising goal-directed actions executed by other agents, e.g. a hand reaching a cup (see figure 1), two people moving toward each other, a car avoiding an obstacle, or a prey escaping from a predator. Anticipating such behaviourally-relevant events¹ still poses particularly demanding challenges
in terms of the timely detection of relevant elements in an
event and the recognition of the discriminative dynamics
(e.g., motion of the hand). When the state of an element (e.g.,
prey) is uncertain, the ‘dynamic context’ (presence and dy-
namics of other elements e.g., a running predator) can pro-
vide valuable information for prediction. In general, to an-
ticipate these ‘non-local’ events, and to produce an AER, the
observer must couple its sensing behaviours with the independent dynamics of the environment, which are yet unknown to the observer. The solution to this chicken-and-egg problem demands a principled integration of tracking, exploration, search and recognition capabilities. However, these are treated as separate problems in the active vision literature.
A principled integration requires the selection of the next
sensor configuration (e.g., stop tracking the demonstrator and look for a graspable object) by merging, evaluating and exploiting different sources of information (e.g., noisy observations of hand movements and target positions). Solving this online is computationally complex and knowledge demanding. In particular, the evaluation of the informative contribution of the dynamic context may require predicting the distribution of the expected trajectories².
Related Works. Some previous attempts at principled visual attention systems, such as [Navalpakkam and Itti, 2006], use only local visual features and thus cannot direct attention to objects outside the visual field. Others, like Sprague et al. [2007], formalise the role of attention as subserving action execution, using independent models for the elements in the environment. Thus they cannot predict the elements' interactions, like others' actions, nor employ them for action selection or attention allocation. Attention systems like [de Croon and Postma, 2007; Paletta et al., 2005; Denzler and Brown, 2002] utilise the contextual information connected to low-level visual features and suffer from limited generalisation capabilities and reduced applicability to dynamic environments.
Proposed Approach. To achieve the efficient spatial-
temporal coupling between the agent’s sensors and the en-
vironment, we propose a probabilistic generative framework
based on a mixture of Kalman Filters (KFs). It exploits sev-
eral KF properties, like fast analytical update and computa-
tion of entropy, to reduce the computational complexity and
evaluate the dynamic context. The sensor-control policy is di-
rectly derived from the principle of maximisation of expected
information gain. It explicitly attempts to reduce the uncertainty on the event which is currently taking place.

¹ Anticipation and prediction of others' goals is at the basis of most human-human and human-robot interactions [Demiris, 2007].

² In the case of agent actions, this corresponds to the online solution of the intention recognition problem [Baker et al., 2009; Ramirez and Geffner, 2011; Demiris, 2007]. This is a hard problem which can be further complicated by partial observability and exacerbated by the uncertainty of the environment.
Swift evaluation of the dynamic context can be achieved by
making two assumptions: Firstly, events can be represented
as linear dynamic systems³. Consequently, the state of the
dynamic system implicitly characterises the expected kine-
matics. Secondly, a limited number of elements participate in
each event. The expensive multidimensional optimisation for
selecting the next sensor configuration can be approximated
by choosing the configuration that maximises the expected
information gain for the event from a set of candidates. Ef-
fective candidates can be built by reusing the predictions of
the KF. This also allows the system to direct its sensors to positions outside of their current field of view and to select between visually similar elements, thus overcoming the limits
of some other approaches like [Navalpakkam and Itti, 2006].
We apply this framework to the problem of directing the
attention of the iCub humanoid robot to perceive a goal-directed action, an archetype of non-local event. To furnish the necessary models of events we follow [Demiris and Khadhouri, 2008] and the simulation theory of action perception, reusing the trajectory planner knowledge of the robot [Gori
et al., 2012]. The transition function of each KF of the mix-
ture is an instance of the same stable linear dynamic system
that is used in the planner but has a different attractor corre-
sponding to a different target. [Demiris and Khadhouri, 2008] recognises actions by running in parallel the different models corresponding to the various action hypotheses. It controls attention through the direct competition between the models to access the information they need. This results in the selection of the elements which are necessary to control the currently winning action⁴. If the predictions of the different models for
the selected elements are similar then the observations may
not be useful for the recognition of the event. Instead our
system directs attention to directly discriminate between the
different events using information gain maximisation.
Using a robot simulator and synthetic and recorded human
goal-directed actions, we compare our framework with the
approach we proposed in [Ognibene et al., 2012]. The latter extends [Demiris and Khadhouri, 2008] by integrating its predictive models with a separate KF for each element in the environment.
2 Active Event Recognition
In this section AER is defined and a solution based on a mixture of KFs using Information Gain (AERIG) is described.
Problem definition. The graphical model in figure 2 displays
³ While the coupling between an agent and its environment as a dynamic system has long been studied in various disciplines like artificial life and robotics [Nolfi and Floreano, 2000; Beer, 1995], it has received little attention in the field of action recognition, where it can deliver several computational advantages over normative methods [Baker et al., 2009] and is more parsimonious in terms of knowledge, allowing direct use without the extensive environment-based training necessary for models like [Bruce and Gordon, 2004].

⁴ This attention system also results in predictive gaze saccades toward the action target. The use of predictive models to control attention during action perception is supported by experimental results like [Flanagan and Johansson, 2003].
V"
On
Xi
k$1"
X2 X2 k"k$1"
X1
k$1"
Xn Xn k"k$1"
Xi
k"
X1
k"
ϴk"
Oi
O1
O2
k"
k"
k"
k"
Figure 2: Perception of a non-local event. The discrete variable
V represents the class of the event. The variables Xiare the state
variables of various elements in the environment. Their next state
is determined by their previous state and by V. The dashed con-
nections indicate that the connection are sparse and that an element
Xiis affected by an element Xjwith i6=jonly for a limited set
of values of V. Each element iprovides a different observation Ok
i
which depends on the current state of the element and on the sensor
configuration Θ.
the formulation of the problem. The discrete hidden stochastic variable $V$ represents the class of the event which is taking place, characterised by a different dynamics of the environment that the agent must predict and recognise. The environment is composed of a fixed set of elements $I = \{i = 1 \dots n\}$, and thus its state $X^k$ at time $k$ is composed of the states $X^k_i$ of the different elements. For each value of $V$ the evolution of $X^k$ is determined by a different dynamic system with different independence conditions between the elements. At each time step the agent receives for each element $i$ an observation $o^k_i$ which depends on the current configuration of the sensors $\theta^k$. The states and observations are continuous variables.
At every time step the goal of the system is to select the configuration $\hat{\theta}^k$ that will minimise the expected uncertainty over $V$ (quantified by the entropy $H$):

$$\hat{\theta}^k = \operatorname*{argmin}_{\theta^k} \int_O p(o^k \mid \theta^k)\, H(V \mid o^k, \theta^k)\, do^k \qquad (1)$$
Proposed solution. For the recognition of the event and for the selection of the sensor configuration it is necessary to compute the posterior $P(v \mid o^k; \theta^k)$. Given a prior distribution $P(v, x^k_{1:N}) = P(x^k_{1:N} \mid v)\, P(v)$ and the independence of the observed event from the sensor configuration, $P(v \mid \theta) = P(v)$, the update expression of the posterior $P(v \mid o^{k+1}, \theta^{k+1})$ can be derived through Bayes' rule:

$$P(v \mid o^{k+1}, \theta^{k+1}) = \frac{P(o^{k+1} \mid v, \theta^{k+1})\, P(v)}{P(o^{k+1} \mid \theta^{k+1})} \qquad (2)$$
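Under the Gaussian assumptions introduced in the next paragraph, the update in eq. 2 reduces to reweighting the prior by the predictive likelihood of the new observation under each hypothesis. The following is a minimal sketch of that step, assuming each hypothesis $v$ supplies the predicted observation mean and covariance from its KF; the function name and the NumPy/SciPy usage are ours, not the authors' released code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_event_posterior(prior, obs, pred_means, pred_covs):
    """Bayes update of P(v | o^{k+1}, theta^{k+1}) as in eq. 2.

    prior      : (V,) array, P(v) before the new observation
    obs        : (d,) array, observation obtained with configuration theta^{k+1}
    pred_means : list of V (d,) arrays, predicted observation mean per hypothesis
    pred_covs  : list of V (d,d) arrays, predicted observation covariance per hypothesis
    """
    # P(o^{k+1} | v, theta^{k+1}) from each hypothesis' Kalman filter prediction
    likelihoods = np.array([
        multivariate_normal.pdf(obs, mean=m, cov=S)
        for m, S in zip(pred_means, pred_covs)
    ])
    unnormalised = prior * likelihoods          # numerator of eq. 2
    return unnormalised / unnormalised.sum()    # division by P(o^{k+1} | theta^{k+1})
```

The same per-hypothesis predicted means and covariances can be reused by the sensor-selection criterion derived below.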
The computation of eq. 1 and eq. 2 in the general case can pose severe computational complexities. The solution proposed is based on the assumption that, once $v$ is fixed, the dynamics is linear and the probability distributions are normal. This enables the use of a mixture of KFs with a distinct KF for each value of $v$. Denoting with $\bar{o}^{k+1}_{v,\theta^{k+1}}$ the mean expected observation and with $S^{k+1}_{v,\theta^{k+1}}$ its covariance matrix, both of which are conditioned on $v$ and $\theta$ and computed during the KF update, the following can be derived:

$$\hat{\theta}^{k+1} = \operatorname*{argmin}_{\theta^{k+1}} \sum_v P(v) \left[ \frac{1}{2} \ln \left| S^{k+1}_{v,\theta^{k+1}} \right| + \int_O \mathcal{N}\!\left(o;\, \bar{o}^{k+1}_{v,\theta^{k+1}},\, S^{k+1}_{v,\theta^{k+1}}\right) \ln\!\left(P(o \mid \theta^{k+1})\right) do \right] \qquad (3)$$
where $|S|$ is the determinant of a matrix $S$. The first-order Taylor expansion of $P(o \mid \theta)$ at the point $\bar{o}^{k+1}_v$ results in:

$$\hat{\theta}^{k+1} \approx \operatorname*{argmin}_{\theta} \sum_v P(v) \left[ \frac{1}{2} \ln \left| S^{k+1}_{v,\theta^{k+1}} \right| + \ln \sum_{v'}^{V} P_{v'}\, \mathcal{N}\!\left(\bar{o}_{v,\theta};\, \bar{o}^{k+1}_{v',\theta^{k+1}},\, S^{k+1}_{v',\theta^{k+1}}\right) \right] \qquad (4)$$
The first term of eq. 4 is the average of the expected entropy
of the observations for each model $v$ on the new observation
point. The second term is a measure of how much, in the new
sensor configuration, the observations predicted by a model
will differ from those predicted by other models. Thus, the
proposed formulation integrates a trade-off between discrim-
inating the event hypotheses and maintaining their perception
quality.
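A sketch of this criterion under the paper's Gaussian assumptions: for a given candidate configuration, the score below sums, over hypotheses, the expected observation entropy term and the log mixture density that penalises non-discriminative configurations (eq. 4, up to additive constants). A lower score marks a more informative configuration; the names and the SciPy usage are illustrative, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def eq4_score(prior, pred_means, pred_covs):
    """Approximate expected posterior entropy of V for one candidate configuration (eq. 4).

    prior      : (V,) array, current P(v)
    pred_means : list of V (d,) arrays, predicted observation means for this configuration
    pred_covs  : list of V (d,d) arrays, predicted observation covariances for this configuration
    Returns the quantity minimised in eq. 4 (lower = more informative configuration).
    """
    score = 0.0
    for p_v, m_v, S_v in zip(prior, pred_means, pred_covs):
        # expected entropy of the observation predicted by model v (up to constants)
        log_det_term = 0.5 * np.linalg.slogdet(S_v)[1]
        # log mixture density at model v's predicted mean: large when the other
        # hypotheses predict similar observations, i.e. the view is not discriminative
        mixture = sum(
            p_w * multivariate_normal.pdf(m_v, mean=m_w, cov=S_w)
            for p_w, m_w, S_w in zip(prior, pred_means, pred_covs)
        )
        score += p_v * (log_det_term + np.log(mixture))
    return score
```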
We will now use the assumption that each event (each value of $v$) involves only a limited subset of the elements of the environment, the set $I_v \subseteq I$. The dynamics of these elements will be coupled; formally, their transition probabilities depend on the state of the other elements of $I_v$, that is $P(x^k_i \mid x^{k-1}_{1:N_e}, v) \approx P(x^k_i \mid x^{k-1}_{I_v}, v),\; i \in I_v$. The dynamics of those elements which do not participate in the event will be independent of each other, $P(x^k_i \mid x^{k-1}_{1:N}, v) \approx P(x^k_i \mid x^{k-1}_i),\; i \notin I_v$. This allows decomposing each KF corresponding to a value of $v$ into a set of KFs of lower dimension. One of the KFs will model the coupled dynamics of the elements in $I_v$, while the dynamics of each element $i \notin I_v$ will be modelled by an independent KF. This decomposition delivers two advantages: 1) the computational cost of updating a KF is cubic in the dimension of the state space, thus updating a set of lower-dimensional KFs is more efficient than updating a single high-dimensional one; 2) the KFs for the elements which do not belong to $I_v$ are independent of $v$, which permits reusing the results of the KF update.
The KF decomposition can be utilised for efficient computation of eq. 4. The first term can be computed as the product of the determinant of the covariance matrix $\tilde{S}^{k+1}_{v,\theta}$ of the observation distribution predicted by the KF of $I_v$ and of those $\dot{S}^{k+1}_{i,\theta}$ from the KFs of the other elements:

$$\ln \left| S^{k+1}_{v,\theta} \right| = \ln \left| \tilde{S}^{k+1}_{v,\theta} \right| + \sum_{i}^{I \setminus I_v} \ln \left| \dot{S}^{k+1}_{i,\theta} \right| \qquad (5)$$

The second term of eq. 4 can also be decomposed in a similar way, leading to reduced computational complexity.
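For concreteness, the decomposition in eq. 5 amounts to summing log-determinants over independent blocks, as in this small sketch; the block-diagonal structure across elements outside $I_v$ is the assumption stated above, and the function name is illustrative.

```python
import numpy as np

def log_det_decomposed(S_coupled, S_independent):
    """ln|S^{k+1}_{v,theta}| computed block-wise as in eq. 5.

    S_coupled     : (dc, dc) predicted observation covariance for the coupled set I_v
    S_independent : iterable of (di, di) covariances from the per-element KFs, i not in I_v
    """
    total = np.linalg.slogdet(S_coupled)[1]
    total += sum(np.linalg.slogdet(np.atleast_2d(S_i))[1] for S_i in S_independent)
    return total
```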
Solving eq. 4 directly, even after decomposition, can still be complex. Instead, a finite set $\hat{\Theta}$ of candidate sensor configurations is defined and the one with the highest expected information gain is selected. In this work, the elements of $\hat{\Theta}$ are the configurations which will minimise the noise on any of the elements according to any of the possible hypotheses. E.g., $\hat{\Theta}$ can be the set of different camera configurations focusing on the predicted positions of the different objects according to the different possible events.
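A sketch of this candidate-based selection, assuming one candidate gaze point per (element, hypothesis) pair placed at that element's predicted position; `expected_cost` stands for a scoring function such as the eq. 4 score sketched earlier, and all names are illustrative.

```python
import numpy as np

def select_gaze_target(predicted_positions, expected_cost):
    """Pick the candidate sensor configuration with the lowest expected cost.

    predicted_positions : dict mapping (element, hypothesis) -> (3,) array, the
                          predicted position of each element under each event
    expected_cost       : callable taking a candidate gaze point and returning the
                          eq. 4 score obtained if the camera fixates that point
    """
    candidates = list(predicted_positions.values())
    costs = [expected_cost(c) for c in candidates]
    return candidates[int(np.argmin(costs))]
```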
3 Experimental Evaluation: Active Action Recognition
The framework proposed in the previous section, AERIG,
has been applied to control the gaze of an iCub humanoid
robot that has to anticipate the target of a reaching ac-
tion performed by a human partner (see figure 1). In
the tests the robot showed human-like attentional behaviour
while observing human actions, switching its attention from
tracking the hand to the anticipated target (see videos at
http://www.imperial.ac.uk/personalrobotics).
The most important contribution of the approach is the pre-
dictive recognition of the event and the temporal advantage
for action preparation it permits. However, due to human
reaction delays, it is difficult to assess the temporal performance of the approach with a real robot: during the tests, unlike during normal operation, the timing of the performer's intention and action must be compared synchronously with the evolution of the robot's estimates. Instead,
simulation results are reported with different versions of the
sensor model: with a fixed camera (NO Visual Attention
Controller, NOVAC) randomly placed near one of the ele-
ments, an ideal camera with instantaneous and precise gaz-
ing movements (IDEALVAC), and with the simulated cam-
era controller reproducing the speed and trajectories of the
real iCub camera controller [Pattacini, 2010] (REALVAC).
The robot's perception of the position of an element (an object or the other agent's hand) is affected by Gaussian noise whose variance is a linear function of the distance between the gaze focus point and the element ($\sigma = 0.1 + 0.2 \times d$)⁵. This observation model is a simplification of human vision, which has much higher resolution at the centre [Findlay and Gilchrist, 2003]. The system was tested on both synthetic
data, generated by the same control model used by the robot,
and on natural human reaching actions recorded with the
Kinect. AERIG is compared to a heuristic attention controller
proposed in [Ognibene et al., 2012] (HEURISTIC). The heuristic balances element position uncertainty with event uncertainty by selecting, as the next gaze target, the element with the highest product between its position entropy and the confidence of the event in which it takes part⁶. In this task, the environment ($I$) is composed of the hand of the human performer ($h$) and of a fixed number $N_e$ of graspable objects
⁵ An object-specific factor can be applied to reflect its intrinsic difficulty in recognition and localisation. This observation noise model is correlated with the state, thus the optimality properties of the Kalman filter cannot be guaranteed.

⁶ The effector position is estimated using a KF and the same motion model used by AERIG [Gori et al., 2012]. The target position needed by the motion model is estimated by an independent KF whose model assumes the target to be still; this does not allow using the motion of the effector to correct the target position. The confidence $c$ is updated at every time step using the prediction error $e^k_h$ on the hand position: $c^{k+1} = c^k + (1 + e^{k+1}_h)^{-1}$.
Figure 3: Evolution of the average success rate (correctness ratio over 1000 runs) on simulated data with 4 objects, as a function of time steps (1 step = 0.1 sec), for sensor control Ideal VAC / Real VAC / No VAC, agents AERIG and HEURISTIC, and inter-object distances of 15, 30 and 60 cm.
($i = 1 \dots N_e$). Each event $v$ corresponds to a hand movement to reach a different object in the environment (thus the domain of $V$ is $\{1 \dots N_e\}$). The state $X^k_i$ of an object $i$ represents its position, while the state $X^k_h$ of the hand consists of both speed and position. The set of coupled elements $I_v$ is composed of the hand and of the object with index $i_v$. The KF has the same transition matrix which is used to control a reaching action execution, thus complying with the simulation theory of action perception [Demiris, 2007; Dindo et al., 2011]. The equation employed by a given model $v$ to compute the next effector position $x^{k+1}_{p,v}$, when the target is at position $x^k_v$, is:

$$x^{k+1}_{p,v} = x^k_{h,v} + \tau \left\{ \dot{x}^k_{h,v} + \tau \left[ K \left( x_v - x^k_{h,v} \right) - D\, \dot{x}^k_{h,v} \right] \right\} \qquad (6)$$
$K$ and $D$ are typical constants of a PD controller, while $\tau$ is the related time-integration parameter. They are set to the same default values used for action generation, $K = 1.5$, $D = 3$ and $\tau = 0.16$; $\tau$ can be modulated to model different motion speeds. The Kalman filter associated with each single element assumes that the element is still (thus the transition matrix is the identity), with its position affected only by zero-mean Gaussian noise with variance 0.0025. The actual system is open source and available online at http://www.imperial.ac.uk/personalrobotics.
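The following sketch builds, per spatial dimension, the coupled transition matrix implied by eq. 6 for the state $[x_h, \dot{x}_h, x_v]$, together with the identity transition for lone objects. Eq. 6 fixes the position row; the velocity row shown here (semi-implicit Euler for the same PD attractor) is our assumption, since the paper does not spell it out.

```python
import numpy as np

K, D, TAU = 1.5, 3.0, 0.16   # PD gains and integration step reported in the paper

def coupled_transition_1d(k=K, d=D, tau=TAU):
    """Per-dimension transition matrix for the coupled state [x_h, xdot_h, x_v].

    Assumed velocity update: xdot' = (1 - tau*d) * xdot + tau*k * (x_v - x_h)
    Position update (eq. 6): x_h'  = x_h + tau * xdot'
    Target (still object):   x_v'  = x_v
    """
    return np.array([
        [1.0 - tau**2 * k, tau * (1.0 - tau * d), tau**2 * k],
        [-tau * k,         1.0 - tau * d,         tau * k   ],
        [0.0,              0.0,                   1.0       ],
    ])

# Independent single-element model: identity transition, small process noise.
A_object = np.eye(1)            # the object is assumed to be still
Q_object = 0.0025 * np.eye(1)   # zero-mean Gaussian noise with variance 0.0025
```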
3.1 Results
Figure 3 reports the evolution of the "success rate": how many times the currently estimated event is actually the correct one, with simulated effector trajectories in an environment with 4 objects at different distances from each other. The average duration of a trial depends on the distance between the objects. In each condition 1000 runs were executed. The figure shows a significant improvement (up to two times better) when AERIG is adopted, delivering more accurate and faster recognition in all conditions, especially when the task is more difficult. It also shows the crucial role of active sensor control. Similar results were observed with a varying number of objects, showing that AERIG can scale with the number of elements. On the computational side, on a conventional 2010 PC (2.8 GHz), AERIG takes 0.4 ms per frame with 16 targets.
Figure 4 displays the average evolution of the attentional be-
haviour of the AERIG and HEURISTIC agents with the real
Figure 4: Evolution of the average gaze focus (ratio per trial over 1000 runs) during a trial with REALVAC and 30 cm of distance between the targets, as a function of time steps, for the AERIG and HEURISTIC agents; gaze focus is split between the predicted performer position and the predicted action target. It shows only the saccades directed towards the elements of the currently most probable action.
gaze controller. The graph shows that both systems do not focus just on the effector but continuously alternate between it and the targets. Thus AERIG, which uses the effector motion information to correct the estimated position of the target, can more precisely direct its saccades towards the target and thus improve the overall prediction. The AERIG system explores the environment and does not focus only on the current most probable target and effector position. Subsequently it increases the saccades towards those elements and in particular the target. The HEURISTIC model instead focuses only on the elements of the most probable event. Figure 5
displays the results of AERIG when applied to human reach-
ing actions recorded with the Kinect. The 3D positions of the
right hand of the user skeleton, extracted using the OpenNI
[Ope, 2010] and NITE [Pri, 2010] libraries, were recorded
during 24 reaching actions towards 4 objects positioned on
a circle of radius 15 cm (see figure 6). The actions, with an
average duration of 1 sec, were performed naturally from dif-
ferent starting positions at an average distance of 40 cm from
the circle centre. At each time step, the actual position perceived by the robot was affected by noise depending on the current gaze position, in the same way as in the previous experiment.
Figure 5: Evolution of the average success rate (success ratio over 100 runs) on recorded human actions with 4 objects at a distance of 15 cm, as a function of time steps (1 step = 0.06 sec), for sensor control Ideal VAC / No VAC and agents AERIG and HEURISTIC.
Figure 6: Samples of recorded reaching action trajectories (2D projection, x and y in mm) towards two different objects (circles).
It is easy to see that the recorded trajectories are not straight, unlike those predicted by the models used.
These data show that AERIG delivers an improvement in event recognition even when its knowledge is noisy and does not match the real dynamics of the events. While both AERIG and the heuristic model use the same action model, and thus suffer from similar problems with the recorded data, AERIG is more robust because it is sensitive to uncertainty when selecting the actual event. We also hypothesise that the robustness of AERIG is partially due to the use of stable dynamic systems to model the interactions in the environment. The trajectories are also affected by noise which is especially evident near the target objects. This noise cannot be reduced by
the active control of the camera. This helps explain the shape
of the performance in Figure 5. Figure 7 displays the perfor-
mance of the systems when the robot sensor has a limited field
of view (radius of 40 cm around the gaze focus) and reduced noise ($\sigma = 0.03 + 0.06 \times d$), with four objects at a distance of 30 cm. This result, while preliminary, shows that only AERIG with an active camera can provide useful event recognition⁷. This is due to the use of the effector motion information to adjust the initial random guesses of the target positions, which in the other conditions were based on peripheral vision. These adjustments are subsequently used to confirm the presence of the target with direct gazes.
4 Conclusions
This paper introduces AERIG, a probabilistic framework for
the active perception of multi-element dynamic events, like
goal-oriented actions. It addresses the requirements of the
APA problem, which consists of finding effective sensor con-
trol strategies to gather the information necessary to predict
the evolution of the environment. AERIG actively directs
⁷ The policy used by the systems to manage missing observations was: a) if the element was predicted to be observed, then generate a random observation outside of the field of view and reduce the probability of the event to the minimum event probability; b) otherwise generate a noisy observation of the element centred on the predicted position.
Figure 7: Evolution of the average success rate (success ratio over 100 runs) of the system with limited field of view, as a function of time steps (0.1 sec), for sensor control Ideal VAC / No VAC and agents AERIG and HEURISTIC.
sensors to the most discriminative configurations to uncover
the actual dynamics of the environment. This leads to the
spatio-temporal coupling of the sensors with the environment
dynamics.
While actively recognising a complex dynamic context
[Rothkopf et al., 2007] has crucial behavioural importance, this problem has not been sufficiently addressed in the attention literature. In fact, most attention models assume
artificial laboratory conditions and offer limited explanations
of task oriented attention in dynamic environments [Tatler et
al., 2011]. Thus, it would be intriguing to compare the pre-
dictions of the proposed framework with actual human be-
haviour in natural conditions. AERIG applied to the recog-
nition of goal oriented actions represents a sound and robust
implementation of the simulation theory of action perception
[Demiris, 2007]. While some theories [Flanagan and Johans-
son, 2003] support the reuse of the same attention control
schemas used during action execution, our approach focuses
on the direct reduction of uncertainty during the perception
of the action. In this task the tests showed substantial perfor-
mance improvement over baseline approaches. Interestingly
the framework allows a mathematical formulation of the intu-
itive trade-off between a conservative behaviour, which tries
to observe all the events and elements, and a greedy one that
is only interested in determining the real event which is taking
place.
In this work, we tested our approach only on stable, approximately linear systems with a limited number of elements and used only position information. In the future, we plan to extend the framework to use more complex models and to integrate it with the overall agent control architecture.
References

[Aloimonos et al., 1988] J. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision. International Journal of Computer Vision, 1(4):333-356, January 1988.

[Andreopoulos and Tsotsos, 2013] A. Andreopoulos and J. K. Tsotsos. A computational learning theory of active object recognition under uncertainty. Int. J. Comp. Vis., 101(1):95-142, 2013.

[Bajcsy, 1988] R. Bajcsy. Active perception. Proceedings of the IEEE, Special issue on Computer Vision, 76(8):966-1005, 1988.

[Baker et al., 2009] C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning. Cognition, 2009.

[Balkenius and Johansson, 2007] C. Balkenius and B. Johansson. Anticipatory models in gaze control: a developmental model. Cognitive Processing, 8(3):167-174, 2007.

[Ballard, 1991] D. H. Ballard. Animate vision. Artificial Intelligence, 48:57-86, 1991.

[Beer, 1995] R. D. Beer. A dynamical systems perspective on agent-environment interaction. Artificial Intelligence, 72:173-215, 1995.

[Bruce and Gordon, 2004] A. Bruce and G. Gordon. Better motion prediction for people-tracking. In Proc. of the 2004 IEEE International Conference on Robotics and Automation, New Orleans, LA, USA, May 2004.

[de Croon and Postma, 2007] G. de Croon and E. O. Postma. Sensory-motor coordination in object detection. In Proceedings of the IEEE Symposium on Artificial Life, ALIFE'07, pages 147-154, 2007.

[Demiris and Khadhouri, 2008] Y. Demiris and B. Khadhouri. Content-based control of goal-directed attention during human action perception. Interaction Studies, 9(2):353-376, 2008.

[Demiris, 2007] Y. Demiris. Prediction of intent in robotics and multi-agent systems. Cognitive Processing, 8(3):151-158, 2007.

[Denzler and Brown, 2002] J. Denzler and C. M. Brown. Information theoretic sensor data selection for active object recognition and state estimation. Transactions in Pattern Analysis and Machine Intelligence, 24(2):145-157, 2002.

[Dindo et al., 2011] H. Dindo, D. Zambuto, and G. Pezzulo. Motor simulation via coupled internal models using sequential Monte Carlo. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, July 2011.

[Findlay and Gilchrist, 2003] J. M. Findlay and I. D. Gilchrist. Active Vision: The Psychology of Looking and Seeing. Oxford University Press, 2003.

[Flanagan and Johansson, 2003] J. Randall Flanagan and Roland S. Johansson. Action plans used in action observation. Nature, 424(6950):769-771, 2003.

[Gori et al., 2012] I. Gori, U. Pattacini, F. Nori, G. Metta, and G. Sandini. DForC: a real-time method for reaching, tracking and obstacle avoidance in humanoid robots. In IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS 2012), Osaka, Japan, November 29 - December 1, 2012.

[Gould et al., 2007] S. Gould, J. Arfvidsson, A. Kaehler, B. Sapp, M. Messner, G. Bradski, P. Baumstarck, S. Chung, and A. Y. Ng. Peripheral-foveal vision for real-time object recognition and tracking in video. In Proceedings of IJCAI 2007, 2007.

[Navalpakkam and Itti, 2006] V. Navalpakkam and L. Itti. An integrated model of top-down and bottom-up attention for optimal object detection. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

[Nolfi and Floreano, 2000] S. Nolfi and D. Floreano. Evolutionary Robotics: The Biology, Intelligence, and Technology. MIT Press, Cambridge, MA, USA, 2000.

[Ognibene et al., 2010] D. Ognibene, G. Pezzulo, and G. Baldassarre. How can bottom-up information shape learning of top-down attention control skills? In Proceedings of the 9th International Conference on Development and Learning, 2010.

[Ognibene et al., 2012] D. Ognibene, E. Chinellato, M. Sarabia, and Y. Demiris. Towards contextual action recognition and target localization with active allocation of attention. In Proc. of the First Int. Conf. on Living Machines, pages 192-203, 2012.

[Ope, 2010] OpenNI organization. OpenNI User Guide, November 2010. Last viewed 19-01-2011 11:32.

[Paletta et al., 2005] L. Paletta, G. Fritz, and C. Seifert. Cascaded sequential attention for object recognition with informative local descriptors and Q-learning of grouping strategies. In Proc. of Computer Vision and Pattern Recognition (CVPR), 2005.

[Pattacini, 2010] U. Pattacini. Modular Cartesian Controllers for Humanoid Robots: Design and Implementation on the iCub. PhD thesis, RBCS, Istituto Italiano di Tecnologia, Genoa, 2010.

[Pezzulo and Ognibene, 2012] G. Pezzulo and D. Ognibene. Proactive action preparation: Seeing action preparation as a continuous and proactive process. Motor Control, 16(3):386-424, 2012.

[Pezzulo, 2008] G. Pezzulo. Coordinating with the future: the anticipatory nature of representation. Minds and Machines, 18(2):179-225, 2008.

[Pri, 2010] PrimeSense Inc. Prime Sensor NITE 1.3 Algorithms notes, 2010. Last viewed 19-01-2011 15:34.

[Ramirez and Geffner, 2011] M. Ramirez and H. Geffner. Goal recognition over POMDPs: Inferring the intention of a POMDP agent. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 2011.

[Rothkopf et al., 2007] C. A. Rothkopf, D. H. Ballard, and M. M. Hayhoe. Task and context determine where you look. Journal of Vision, 7(14):1610-1620, 2007.

[Sommerlade and Reid, 2008] R. Sommerlade and I. Reid. Information theoretic active scene exploration. In Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, June 2008.

[Sprague et al., 2007] N. Sprague, D. Ballard, and A. Robinson. Modeling embodied visual behaviors. ACM Trans. Appl. Percept., 4(2):11, 2007.

[Sridharan et al., 2010] M. Sridharan, J. Wyatt, and R. Dearden. Planning to see: A hierarchical approach to planning visual actions on a robot using POMDPs. Artificial Intelligence, 174(11):704-725, 2010.

[Suzuki and Floreano, 2008] M. Suzuki and D. Floreano. Enactive robot vision. Adaptive Behavior, 16(2-3):122-128, 2008.

[Tatler et al., 2011] B. W. Tatler, M. M. Hayhoe, M. F. Land, and D. Ballard. Eye guidance in natural vision: Reinterpreting salience. Journal of Vision, 11(5):1-23, 2011.

[Vijayanarasimhan and Kapoor, 2010] S. Vijayanarasimhan and A. Kapoor. Visual recognition and detection under bounded computational resources. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1006-1013, June 2010.

[Vogel and de Freitas, 2008] J. Vogel and N. de Freitas. Target-directed attention: Sequential decision-making for gaze planning. In The International Conference on Robotics and Automation (ICRA), pages 2372-2379, 2008.

[Walther et al., 2005] D. Walther, U. Rutishauser, C. Koch, and P. Perona. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding, 100(1-2):41-63, October 2005.

[Zilberstein and Russell, 1996] S. Zilberstein and S. Russell. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181-213, 1996.
... Several tasks focus on robots' ability to explore an unknown environment [61][62][63]. Furthermore, in social contexts, unobservable factors such as others' intentions must be actively considered to allow for efficient human-robot collaboration [64][65][66][67]. ...
... More specifically, to infer the best action sequences (policies), one must also predict the future observations they would produce. This is realized in the active inference framework by minimizing the expected free energy over a time interval T. We can express this as the sum of two terms, including (i) the variational information gain term [64,[69][70][71], or epistemic value [55], defined as the expected KL divergence between the distribution of the latent states conditioned on the expected observations q(z t+1:T | x t+1:T , a t:T−1 ) and the prior distribution on the latent states q(z t+1:T | a t:T−1 ) that represents the reduction in uncertainty on the latent states z t+1:T provided by the expected observations x t+1:T , and (ii) the extrinsic or pragmatic value log p(x t+1:T | C), where C denotes the agent's preferences, which is a parameter of the distribution p(x t+1:T | ·) 11 . This results in the following expression 12 . ...
... The impact of missing observations for the observer agent has also been analyzed [199]. A further step has been proposed by active methods for activity recognition [64,201,202] that use the same world model both to interpret others' actions as well as selecting actions that would improve the recognition process, e.g. by giving access to the most informative observations [65] and allow the completion of a joint task [190]. While even the initial formulations of this approach were computationally aware [196], their efficiency is often affected by the length of the observed behavior and the environment complexity, resulting in methods that can seldom be applied online on a robot. ...
... Several tasks focus on robots' ability to explore an unknown environment [59][60][61]. Furthermore, in social contexts, unobservable factors such as others' intentions must be actively considered to allow for efficient human-robot collaboration [62][63][64][65]. ...
... This is realized in the active inference framework by minimizing the expected free energy over a time interval T . We can express this as the sum of two terms, including i) the variational information gain term [62,[67][68][69], or epistemic value [54], defined as the expected KL divergence between the distribution of the latent states conditioned on the expected observations q(z t+1:T |x t+1:T , a t:T −1 ) and the prior distribution on the latent states q(z t+1:T |a t:T −1 ) that represents the reduction in uncertainty on the latent states z t+1:T provided by the expected observations x t+1:T , and ii) the extrinsic or pragmatic value log p(x t+1:T |C), where C denotes the agent's preferences. This results in the following expression 10 . ...
... The impact of missing observations for the observer agent has also been analyzed [189]. A further step has been proposed by active methods for activity recognition [62,191,192] that use the same world model both to interpret others' actions as well as selecting actions that would improve the recognition process, e.g., by giving access to the most informative observations [63] and allow the completion of a joint task [180]. While even the initial formulations of this approach were computationally aware [186], their efficiency is often affected by the length of the observed behavior and the environment complexity, resulting in methods that can seldom be applied online on a robot. ...
Preprint
Full-text available
Creating autonomous robots that can actively explore the environment, acquire knowledge and learn skills continuously is the ultimate achievement envisioned in cognitive and developmental robotics. Their learning processes should be based on interactions with their physical and social world in the manner of human learning and cognitive development. Based on this context, in this paper, we focus on the two concepts of world models and predictive coding. Recently, world models have attracted renewed attention as a topic of considerable interest in artificial intelligence. Cognitive systems learn world models to better predict future sensory observations and optimize their policies, i.e., controllers. Alternatively, in neuroscience, predictive coding proposes that the brain continuously predicts its inputs and adapts to model its own dynamics and control behavior in its environment. Both ideas may be considered as underpinning the cognitive development of robots and humans capable of continual or lifelong learning. Although many studies have been conducted on predictive coding in cognitive robotics and neurorobotics, the relationship between world model-based approaches in AI and predictive coding in robotics has rarely been discussed. Therefore, in this paper, we clarify the definitions, relationships, and status of current research on these topics, as well as missing pieces of world models and predictive coding in conjunction with crucially related concepts such as the free-energy principle and active inference in the context of cognitive and developmental robotics. Furthermore, we outline the frontiers and challenges involved in world models and predictive coding toward the further integration of AI and robotics, as well as the creation of robots with real cognitive and developmental capabilities in the future.
... It also offers a mechanism of changes of points of view and perspectives on a world model, including for perspective taking in social cognition, which is critical for the development of strategies of action planning in humans (see [14], [15]). We used Euclidean geometry as a standard baseline geometry for comparison [16]. Our geometrical rationale implies a different understanding of how agents' actions in their environment (here in the behavioral sense of the term) are implemented compared to usual active inference. ...
... The impact of changes of perspective in exploration has long been of interest. [16]. [18] proposed an internal 3-D egocentric, spherical representation of space, to modulate information sampling and uncertainty as a function of distance, and control a robot attention through Bayesian inference. ...
Preprint
Full-text available
In human spatial awareness, information appears to be represented according to 3-D projective geometry. It structures information integration and action planning within an internal representation space. The way different first person perspectives of an agent relate to each other, through transformations of a world model, defines a specific perception scheme for the agent. In mathematics, this collection of transformations is called a `group' and it characterizes a geometric space by acting on it. We propose that imbuing world models with a `geometric' structure, given by a group, is one way to capture different perception schemes of agents. We explore how changing the geometric structure of a world model impacts the behavior of an agent. In particular, we focus on how such geometrical operations transform the formal expression of epistemic value in active inference as driving an agent's curiosity about its environment, and impact exploration behaviors accordingly. We used group action as a special class of policies for perspective-dependent control. We compared the Euclidean versus projective groups. We formally demonstrate that the groups induce distinct behaviors. The projective group induces nonlinear contraction and dilatation that transform entropy and epistemic value as a function of the choice of frame, which fosters exploration behaviors. This contribution opens research avenues in which a geometry structures \textit{a priori} an agent's internal representation space for information integration and action planning.
... Due to recent technological developments, the interactions between AI and humans have become pervasive and heterogeneous, extending from voice assistants or recommender systems supporting the online experience of millions of users to autonomous driving cars. Principled models to represent the human collaborators' needs are being adopted [1] while robotic perception in complex environments is becoming more flexible and adaptive [2]- [5] even in social contexts, robot sensory limits are starting to be actively managed [6], [7]. However, robots and intelligent systems still have a limited understanding of how sensory limits affect human partners' behaviour and lead them to rely on internal beliefs about the state of the world. ...
... From a cognitive modelling perspective, the proposed architecture detaches from the motor simulation tradition common in robotics [6], [7], [41]- [43], where the motor control system is used for both action execution and perception. In our model, the motor control system is only the source of the reference signal for the training of the social perception system. ...
Preprint
Full-text available
In complex environments, where the human sensory system reaches its limits, our behaviour is strongly driven by our beliefs about the state of the world around us. Accessing others' beliefs, intentions, or mental states in general, could thus allow for more effective social interactions in natural contexts. Yet these variables are not directly observable. Theory of Mind (TOM), the ability to attribute to other agents' beliefs, intentions, or mental states in general, is a crucial feature of human social interaction and has become of interest to the robotics community. Recently, new models that are able to learn TOM have been introduced. In this paper, we show the synergy between learning to predict low-level mental states, such as intentions and goals, and attributing high-level ones, such as beliefs. Assuming that learning of beliefs can take place by observing own decision and beliefs estimation processes in partially observable environments and using a simple feed-forward deep learning model, we show that when learning to predict others' intentions and actions, faster and more accurate predictions can be acquired if beliefs attribution is learnt simultaneously with action and intentions prediction. We show that the learning performance improves even when observing agents with a different decision process and is higher when observing beliefs-driven chunks of behaviour. We propose that our architectural approach can be relevant for the design of future adaptive social robots that should be able to autonomously understand and assist human partners in novel natural environments and tasks.
... For brevity, we use "autonomy" to mean "fully autonomous flight." We focus on tasks involving active vision [2,18]. For example, based on its current scene interpretation, a drone may drop down to lower altitude without human intervention "to take Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. ...
Conference Paper
Full-text available
Fully autonomous flight by low-cost, lightweight commercial off-the-shelf (COTS) drones could transform many use cases involving real-time computer vision. We show how such autonomy can be achieved using edge computing from a flight platform costing less than $800, and composed of a 320 g COTS drone with a 26 g COTS wearable device as payload. In spite of the extreme austerity of this platform and thermal limits on its LTE transmission, the system is able perform tasks such as detecting and then tracking a target. It is also able to visually navigate around obstacles. Such capabilities are only found on heavier and more expensive drones today.
... To this end, we exploit the Information Gain (IG), an informationtheoretic measure of the expected reduction in uncertainty from additional observations of a specific area. The usage of information theory for exploration and mapping has been demonstrated across application domains [6,16,22,21,20], and precision agriculture in particular [14,24]. Here, we exploit IG to support exploration and coordination among robots. ...
Preprint
Full-text available
Monitoring crop fields to map features like weeds can be efficiently performed with unmanned aerial vehicles (UAVs) that can cover large areas in a short time due to their privileged perspective and motion speed. However, the need for high-resolution images for precise classification of features (e.g., detecting even the smallest weeds in the field) contrasts with the limited payload and ight time of current UAVs. Thus, it requires several flights to cover a large field uniformly. However, the assumption that the whole field must be observed with the same precision is unnecessary when features are heterogeneously distributed, like weeds appearing in patches over the field. In this case, an adaptive approach that focuses only on relevant areas can perform better, especially when multiple UAVs are employed simultaneously. Leveraging on a swarm-robotics approach, we propose a monitoring and mapping strategy that adaptively chooses the target areas based on the expected information gain, which measures the potential for uncertainty reduction due to further observations. The proposed strategy scales well with group size and leads to smaller mapping errors than optimal pre-planned monitoring approaches.
... In terms of the mathematical model of HRI, partially observable Markov decision process (POMDP) is the most common Markovian model because some internal parameters of the human partners such as intent are not directly observable (Broz et al., 2013;Ognibene and Demiris, 2013;Mavridis, 2015). Dealing with POMDP is computationally demanding and it usually requires large amounts of data for learning (Bernstein et al., 2002). ...
Article
Full-text available
This paper offers a new hybrid probably approximately correct (PAC) reinforcement learning (RL) algorithm for Markov decision processes (MDPs) that intelligently maintains favorable features of both model-based and model-free methodologies. The designed algorithm, referred to as the Dyna-Delayed Q-learning (DDQ) algorithm, combines model-free Delayed Q-learning and model-based R-max algorithms while outperforming both in most cases. The paper includes a PAC analysis of the DDQ algorithm and a derivation of its sample complexity. Numerical results are provided to support the claim regarding the new algorithm’s sample efficiency compared to its parents as well as the best known PAC model-free and model-based algorithms in application. A real-world experimental implementation of DDQ in the context of pediatric motor rehabilitation facilitated by infant-robot interaction highlights the potential benefits of the reported method.
... To this end, we exploit the Information Gain (IG), an informationtheoretic measure of the expected reduction in uncertainty from additional observations of a specific area. The usage of information theory for exploration and mapping has been demonstrated across application domains [6,16,22,21,20], and precision agriculture in particular [14,24]. Here, we exploit IG to support exploration and coordination among robots. ...
Chapter
Monitoring crop fields to map features like weeds can be efficiently performed with unmanned aerial vehicles (UAVs) that can cover large areas in a short time due to their privileged perspective and motion speed. However, the need for high-resolution images for precise classification of features (e.g., detecting even the smallest weeds in the field) contrasts with the limited payload and flight time of current UAVs. Thus, it requires several flights to cover a large field uniformly. However, the assumption that the whole field must be observed with the same precision is unnecessary when features are heterogeneously distributed, like weeds appearing in patches over the field. In this case, an adaptive approach that focuses only on relevant areas can perform better, especially when multiple UAVs are employed simultaneously. Leveraging on a swarm-robotics approach, we propose a monitoring and mapping strategy that adaptively chooses the target areas based on the expected information gain, which measures the potential for uncertainty reduction due to further observations. The proposed strategy scales well with group size and leads to smaller mapping errors than optimal pre-planned monitoring approaches.
... Working memory has been defined as short-term memory used in order to proactively reinterpret the information in order to better operate in the environment (Miyake and Shah, 1999;Oberauer, 2019). Different computational models of attention for artificial agents have been proposed (Nagai et al., 2003;Triesch et al., 2006;Ognibene and Demiris, 2013) to respond to visual (Itti and Koch, 2001) and auditory stimuli (Treisman, 1996). However, these models do not fully consider the potential role of working memory related to the process of attentional focus redeployment. ...
Article
One of the fundamental prerequisites for effective collaboration between interactive partners is the mutual sharing of the attentional focus on the same perceptual events. This is referred to as joint attention. In the psychological, cognitive, and social sciences, its defining elements have been widely pinpointed. The field of human-robot interaction has also extensively exploited joint attention, which has been identified as a fundamental prerequisite for proficient human-robot collaboration. However, joint attention between robots and human partners is often encoded in prefixed robot behaviours that do not fully address the dynamics of interactive scenarios. We provide autonomous attentional behaviour for robots, based on multi-sensory perception, that robustly relocates the focus of attention onto the same targets the human partner attends to. Further, we investigated how such joint attention between a human and a robot partner improved with a new biologically inspired, memory-based attention component. We assessed the model with the humanoid robot iCub performing a joint task with a human partner in a real-world unstructured scenario. The model showed robust performance in capturing the stimulus, making a localisation decision within the right time frame, and then executing the right action. We then compared the attention performance of the robot against human performance when stimulated from the same source across different modalities (audio-visual and audio only). The comparison showed that the model behaves with temporal dynamics compatible with those of humans, providing an effective solution for memory-based joint attention in real-world unstructured environments. Further, we analysed the localisation performance (reaction time and accuracy); the results showed that the robot performed better in the audio-visual condition than in the audio-only condition. The performance of the robot in the audio-visual condition was comparable with the behaviour of the human participants, whereas it was less efficient in audio-only localisation. After a detailed analysis of the internal components of the architecture, we conclude that the differences in performance are due to ego-noise, which significantly affects audio-only localisation performance.
Article
We present some theoretical results related to the problem of actively searching a 3D scene to determine the positions of one or more pre-specified objects. We investigate the effects that input noise, occlusion, and the VC-dimensions of the related representation classes have in terms of localizing all objects present in the search region, under finite computational resources and a search cost constraint. We present a number of bounds relating the noise-rate of low level feature detection to the VC-dimension of an object representable by an architecture satisfying the given computational constraints. We prove that under certain conditions, the corresponding classes of object localization and recognition problems are efficiently learnable in the presence of noise and under a purposive learning strategy, as there exists a polynomial upper bound on the minimum number of examples necessary to correctly localize the targets under the given models of uncertainty. We also use these arguments to show that passive approaches to the same problem do not necessarily guarantee that the problem is efficiently learnable. Under this formulation, we prove the existence of a number of emergent relations between the object detection noise-rate, the scene representation length, the object class complexity, and the representation class complexity, which demonstrate that selective attention is not only necessary due to computational complexity constraints, but it is also necessary as a noise-suppression mechanism and as a mechanism for efficient object class learning. These results concretely demonstrate the advantages of active, purposive and attentive approaches for solving complex vision problems.
Conference Paper
We present the Dynamic Force Field Controller (DForC), a reliable and effective framework for real-time reaching and tracking in the presence of obstacles in the context of humanoid robotics. It is inspired by well-established work on artificial potential fields, providing a robust basis for sidestepping a number of issues related to the inverse kinematics of complex manipulators. DForC is composed of two layers organized in descending order of abstraction: (1) at the highest level, potential fields are employed to outline collision-free trajectories on the fly that drive the robot end-effector toward fixed or moving targets while accounting for obstacles; (2) at the bottom level, an optimization algorithm is exploited in place of traditional techniques based on the transposed or pseudo-inverse Jacobian, in order to deal with constraints specified in joint space and additional conditions related to the robot structure. As demonstrated by experiments conducted on the iCub robot, our method proves particularly flexible with respect to environmental changes, allowing for safe tracking and generating reliable paths in practically every situation.
Conference Paper
Exploratory gaze movements are fundamental for gathering the most relevant information about the partner during social interactions. We have designed and implemented a system for dynamic attention allocation that actively controls gaze movements during a visual action recognition task. During the observation of a partner's reaching movement, the robot contextually estimates the goal position of the partner's hand and the location in space of the candidate targets, while moving its gaze to optimize the gathering of task-relevant information. Experimental results in a simulated environment show that active gaze control provides a relevant advantage over typical passive observation, both in terms of estimation precision and of the time required for action recognition.
Article
This book focuses on Active Vision: the psychology of looking and seeing. The authors present a view of the process of seeing, with a particular emphasis on visual attention. They contend that the regular sampling of the environment with eye movements is the normal process of visual attention. Several sections of the book are devoted to the neurophysiological substrates underpinning the processes of active vision. Topics covered include visual orienting, visual selection, covert attention, eye movements, natural scenes and activities, human neuropsychology, and space constancy and trans-saccadic integration.
Article
In this paper, we aim to elucidate the processes that occur during action preparation from both a conceptual and a computational point of view. We first introduce the traditional, serial model of goal-directed action and discuss from a computational viewpoint its subprocesses occurring during the two phases of covert action preparation and overt motor control. Then, we discuss recent evidence indicating that these subprocesses are highly intertwined at representational and neural levels, which undermines the validity of the serial model and points instead to a parallel model of action specification and selection. Within the parallel view, we analyze the case of delayed choice, arguing that action preparation can be proactive, and preparatory processes can take place even before decisions are made. Specifically, we discuss how prior knowledge and prospective abilities can be used to maximize utility even before deciding what to do. To support our view, we present a computational implementation of (an approximated version of) proactive action preparation, showing its advantages in a simulated tennis-like scenario.
Conference Paper
Studies support the need for high-resolution imagery to identify persons in surveillance videos. However, the use of telephoto lenses sacrifices a wider field of view and thereby increases the uncertainty about other, possibly more interesting events in the scene. Using zoom lenses offers the possibility of enjoying the benefits of both a wide field of view and high resolution, but not simultaneously. We approach this problem of balancing finite imaging resources - of exploration versus exploitation - using an information-theoretic approach. We argue that the camera parameters - pan, tilt and zoom - should be set to maximise information gain, or equivalently to minimise the conditional entropy of the scene model, comprised of multiple targets and an as yet unobserved one. The information content of the former is supplied directly by the uncertainties computed using a Kalman filter tracker, while the latter is modelled using a “background” Poisson process whose parameters are learned from extended scene observations; together these yield an entropy for the scene. We support our argument with quantitative and qualitative analyses in simulated and real-world environments, demonstrating that this approach yields sensible exploration behaviours in which the camera alternates between obtaining close-up views of the targets and paying attention to the background, especially to areas of known high activity.
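As a rough illustration of the entropy-based criterion in this abstract (a sketch under simplifying assumptions, not the authors' system), the snippet below scores candidate camera settings by the expected entropy of a toy scene model: Gaussian track covariances that shrink via a Kalman update when a target is in view, plus a Poisson-derived penalty for leaving the background unobserved. All field names, noise levels, and rates are invented.

```python
# Toy expected-entropy scoring of pan/tilt/zoom settings; illustrative only.
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian with covariance `cov`."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def poisson_entropy(lam):
    """Gaussian approximation to the entropy of a Poisson(lam) count."""
    return 0.5 * np.log(2 * np.pi * np.e * max(lam, 1e-6))

def expected_scene_entropy(tracks, setting):
    """tracks: list of 2x2 prior covariances from a Kalman tracker.
    setting: which targets are in view, the per-target measurement noise
    (small when zoomed in), and whether the wide background is covered."""
    total = 0.0
    for i, cov in enumerate(tracks):
        if i in setting["in_view"]:
            R = setting["obs_noise"] * np.eye(2)      # measurement covariance
            K = cov @ np.linalg.inv(cov + R)          # Kalman gain
            cov = (np.eye(2) - K) @ cov               # posterior covariance
        total += gaussian_entropy(cov)
    if not setting["covers_background"]:
        total += poisson_entropy(setting["bg_rate"])  # unseen-activity penalty
    return total

tracks = [np.diag([4.0, 4.0]), np.diag([4.0, 4.0])]
settings = [
    {"in_view": {0}, "obs_noise": 0.1, "covers_background": False, "bg_rate": 2.0},
    {"in_view": {0, 1}, "obs_noise": 1.0, "covers_background": True, "bg_rate": 2.0},
]
best = min(settings, key=lambda s: expected_scene_entropy(tracks, s))
print("setting minimising expected scene entropy:", best["in_view"])
```

With these toy numbers the wide view wins; sharpening one track enough (lower obs_noise, higher prior uncertainty) tips the choice toward zooming in, which is the exploration-versus-exploitation trade-off the abstract describes.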
Conference Paper
During the perception of human actions by robotic assistants, the robotic assistant needs to direct its computational and sensor resources to the relevant parts of the human action. In previous work we introduced HAMMER (Hierarchical Attentive Multiple Models for Execution and Recognition) (Demiris and Khadhouri, 2006), a computational architecture that forms multiple hypotheses about what the demonstrated task is, and multiple predictions about the forthcoming states of the human action. To confirm their predictions, the hypotheses request information from an attentional mechanism, which allocates the robot's resources as a function of the saliency of the hypotheses. In this paper we augment the attention mechanism with a component that considers the content of the hypotheses' requests with respect to reliability, utility and cost. This content-based attention component further optimises the utilisation of the resources while remaining robust to noise. Such computational mechanisms are important for the development of robotic devices that can rapidly respond to human actions, whether for imitation or collaboration purposes.
Article
We introduce a formalism for optimal sensor parameter selection for iterative state estimation in static systems. Our optimality criterion is the reduction of uncertainty in the state estimation process, rather than an estimator-specific metric (e.g., minimum mean squared estimate error). The claim is that state estimation becomes more reliable if the uncertainty and ambiguity in the estimation process can be reduced. We use Shannon's information theory to select information-gathering actions that maximize mutual information, thus optimizing the information that the data conveys about the true state of the system. The technique explicitly takes into account the a priori probabilities governing the computation of the mutual information. Thus, a sequential decision process can be formed by treating the a priori probability at a certain time step in the decision process as the a posteriori probability of the previous time step. We demonstrate the benefits of our approach in an object recognition application using an active camera for sequential gaze control and viewpoint selection. We describe experiments with discrete and continuous density representations that suggest the effectiveness of the approach.
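The following is a small, self-contained sketch of the mutual-information criterion described above for a discrete state and a handful of candidate viewpoints; the likelihood tables and viewpoint names are assumptions made for illustration, and the posterior returned at each step becomes the prior of the next step, as in the sequential formulation.

```python
# Sketch of mutual-information-driven viewpoint selection for a discrete
# state (e.g., an object class). Likelihood tables and view names are made up.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(prior, likelihood):
    """I(S;O) = H(S) - sum_o P(o) H(S|o), with likelihood[o, s] = P(o | s, a)."""
    p_obs = likelihood @ prior                      # predictive P(o)
    mi = entropy(prior)
    for o, po in enumerate(p_obs):
        if po > 0:
            posterior = likelihood[o] * prior / po  # Bayes' rule for outcome o
            mi -= po * entropy(posterior)
    return mi

def select_view_and_update(prior, likelihoods, observation_fn):
    """Pick the view maximising I(S;O), observe, and return the posterior,
    which then serves as the prior of the next step of the sequence."""
    best_view = max(likelihoods, key=lambda v: mutual_information(prior, likelihoods[v]))
    o = observation_fn(best_view)
    posterior = likelihoods[best_view][o] * prior
    return best_view, posterior / posterior.sum()

# Two candidate views of a 3-class recognition problem (rows: observations,
# columns: true classes; each column sums to one).
likelihoods = {
    "front": np.array([[0.8, 0.1, 0.1], [0.1, 0.45, 0.45], [0.1, 0.45, 0.45]]),
    "side":  np.array([[0.4, 0.3, 0.3], [0.3, 0.6, 0.1], [0.3, 0.1, 0.6]]),
}
prior = np.array([1 / 3, 1 / 3, 1 / 3])
view, posterior = select_view_and_update(prior, likelihoods, lambda v: 1)
print(view, posterior)
```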
Article
Flexible, general-purpose robots need to autonomously tailor their sensing and information processing to the task at hand. We pose this challenge as the task of planning under uncertainty. In our domain, the goal is to plan a sequence of visual operators to apply on regions of interest (ROIs) in images of a scene, so that a human and a robot can jointly manipulate and converse about objects on a tabletop. We pose visual processing management as an instance of probabilistic sequential decision making, and specifically as a Partially Observable Markov Decision Process (POMDP). The POMDP formulation uses models that quantitatively capture the unreliability of the operators and enable a robot to reason precisely about the trade-offs between plan reliability and plan execution time. Since planning in practical-sized POMDPs is intractable, we partially ameliorate this intractability for visual processing by defining a novel hierarchical POMDP based on the cognitive requirements of the corresponding planning task. We compare our hierarchical POMDP planning system (HiPPo) with a non-hierarchical POMDP formulation and the Continual Planning (CP) framework that handles uncertainty in a qualitative manner. We show empirically that HiPPo and CP outperform the naive application of all visual operators on all ROIs. The key result is that the POMDP methods produce more robust plans than CP or the naive visual processing. In summary, visual processing problems represent a challenging domain for planning techniques and our hierarchical POMDP-based approach for visual processing management opens up a promising new line of research.
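HiPPo itself plans over a hierarchical POMDP; as a much simpler, hedged illustration of the underlying trade-off between an operator's reliability and its execution time, the sketch below greedily picks the visual operator whose expected information gain, minus a time penalty, is largest for a single ROI. Operator names, confusion matrices, and costs are invented and the greedy rule is only an approximation of the planning problem.

```python
# Toy myopic stand-in for reliability-vs-time planning over visual operators.
import numpy as np

LABELS = ["empty", "cup", "book"]

def entropy(b):
    b = b[b > 0]
    return -np.sum(b * np.log2(b))

def expected_entropy_after(belief, confusion):
    """confusion[o, s] = P(operator reports o | true label s)."""
    p_obs = confusion @ belief
    return sum(po * entropy(confusion[o] * belief / po)
               for o, po in enumerate(p_obs) if po > 0)

def pick_operator(belief, operators, time_weight=0.5):
    """Score = expected information gain - time_weight * cost (seconds)."""
    def score(op):
        gain = entropy(belief) - expected_entropy_after(belief, op["confusion"])
        return gain - time_weight * op["cost"]
    return max(operators, key=score)

operators = [
    {"name": "fast_colour_check", "cost": 0.1,
     "confusion": np.array([[0.7, 0.2, 0.2], [0.2, 0.6, 0.2], [0.1, 0.2, 0.6]])},
    {"name": "slow_sift_match", "cost": 1.0,
     "confusion": np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]])},
]
belief = np.array([1 / 3, 1 / 3, 1 / 3])
print(pick_operator(belief, operators)["name"])
```

Raising `time_weight` makes the cheap, unreliable operator preferable, which mirrors the reliability-versus-execution-time trade-off the POMDP models capture quantitatively.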
Article
Humans are adept at inferring the mental states underlying other agents’ actions, such as goals, beliefs, desires, emotions and other thoughts. We propose a computational framework based on Bayesian inverse planning for modeling human action understanding. The framework represents an intuitive theory of intentional agents’ behavior based on the principle of rationality: the expectation that agents will plan approximately rationally to achieve their goals, given their beliefs about the world. The mental states that caused an agent’s behavior are inferred by inverting this model of rational planning using Bayesian inference, integrating the likelihood of the observed actions with the prior over mental states. This approach formalizes in precise probabilistic terms the essence of previous qualitative approaches to action understanding based on an “intentional stance” [Dennett, D. C. (1987). The intentional stance. Cambridge, MA: MIT Press] or a “teleological stance” [Gergely, G., Nádasdy, Z., Csibra, G., & Biró, S. (1995). Taking the intentional stance at 12 months of age. Cognition, 56, 165–193]. In three psychophysical experiments using animated stimuli of agents moving in simple mazes, we assess how well different inverse planning models based on different goal priors can predict human goal inferences. The results provide quantitative evidence for an approximately rational inference mechanism in human goal inference within our simplified stimulus paradigm, and for the flexible nature of goal representations that human observers can adopt. We discuss the implications of our experimental results for human action understanding in real-world contexts, and suggest how our framework might be extended to capture other kinds of mental state inferences, such as inferences about beliefs, or inferring whether an entity is an intentional agent.
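A hedged, one-dimensional sketch of Bayesian inverse planning as described above: the observer assumes a softmax-rational policy whose action values are the negative distances to each candidate goal, and inverts it with Bayes' rule to obtain a posterior over goals from observed moves. The goal names, positions, and rationality parameter are toy assumptions, not the stimuli or models of the cited experiments.

```python
# Toy inverse planning: infer the goal of an approximately rational agent.
import numpy as np

GOALS = {"left_door": 0, "right_door": 10}
ACTIONS = (-1, +1)  # step left or right along a corridor

def action_likelihood(position, action, goal, beta=2.0):
    """P(action | position, goal) under a softmax-rational policy,
    where the value of an action is minus the resulting distance to the goal."""
    values = np.array([-abs((position + a) - goal) for a in ACTIONS])
    probs = np.exp(beta * values)
    probs /= probs.sum()
    return probs[ACTIONS.index(action)]

def infer_goal(trajectory, prior=None):
    """trajectory: list of (position, action). Returns P(goal | trajectory)."""
    names = list(GOALS)
    posterior = np.array(prior if prior is not None else [1 / len(names)] * len(names))
    for position, action in trajectory:
        likelihood = np.array([action_likelihood(position, action, GOALS[g]) for g in names])
        posterior *= likelihood          # Bayes' rule, one observed action at a time
        posterior /= posterior.sum()
    return dict(zip(names, posterior))

# Two rightward steps from the middle already make "right_door" far more likely.
print(infer_goal([(5, +1), (6, +1)]))
```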