Two selection stages provide efficient object-based
attentional control for dynamic vision
Gerriet Backer
Krauss Software GmbH
Cremlingen, Germany
Gerriet.Backer@krauss-software.de
Bärbel Mertsching
AG IMA, Department of Computer Science
University Hamburg, Germany
mertsching@informatik.uni-hamburg.de
Abstract— In this paper, we introduce semiattentive computations as the result of replacing the usual single selection stage of visual attention by two consecutive selection stages. They are motivated by shortcomings of conventional attention models and correlate well with findings on human attention. The first selection stage employs preattentive saliency computations for the complete available input and selects a small number of discrete items. These are subject to the semiattentive processes of tracking and information accumulation. The second stage selects a single element from the result of the first selection stage for the conventional focus of attention. The implementation and efficiency of this scheme are demonstrated in this paper. Its main advantage is the efficient selection and inhibition of objects in dynamic scenes. It allows the serialized accumulation of information in a changing environment and provides an up-to-date world model. The focus of this paper is on the quality of the computed world model and the object-related computations.
I. INTRODUCTION
Attentional mechanisms are mainly used to reduce the
amount of data for complex computations. They determine
important, salient objects or areas and select them, one after
another, to be subjected to these computations. Thus attention
is a general method for serializing complex operations. Computations in those schemes
are either preattentively applied to the complete input data in
parallel or attentively and serially only to the selected area.
Complex accurate computations are usually done attentively
while simple computations are assigned to the preattentive
part.
The spotlight metaphor describes this behavior: at each time
only one region of space is illuminated. This area is the focus
of attention (FOA) and as such the place where complex
operations are applied. The spotlight can move to include other
regions. It is moved to regions of high saliency. The saliency
of an area can be determined by either data-driven bottom-up
information or model-driven top-down information. The focus
of computational attention models is mostly on the data-driven
information. The predominant selection unit for attention is
space. Only a few models deviate from this view and use
either features or objects as selection unit.
The rest of the article is organized as follows. A short review
of the predominant attention models in chapter II leads us to
an analysis of their drawbacks when operating in dynamic
environments. Chapter III proposes necessary modifications
that lead to our new architecture. An implementation of this
architecture, highlighting its object-based aspects, follows.
After an analysis and comparison of the properties of our
model in chapter IV, we conclude with an outlook on further
developments (chapter V).
II. PREVIOUS WORK
A. Conventional models of visual attention
A classic model of visual attention was proposed by Koch
and Ullman in 1985 [1]. With its parallel feature
extraction stage, a master map of attention for integrating the
saliency, a WTA process applied to this map, and the
scanning of maxima by means of an inhibition map, it already
provided many aspects present in today's attention models. The
model (outlined in figure 1) is closely related to models of
human visual attention like the Feature Integration Theory by
Treisman [2] or the Guided Search model by Wolfe [3], [4].
Fig. 1. Simplified version of conventional attention models. (Block diagram: in the preattentive stage, feature maps computed from the input data are integrated into a saliency map; in the attentive stage, a WTA process with an inhibition map selects the FOA.)
Computer models that build on this architecture were in-
troduced by e.g. Milanese et al. [5], Leavers [6], Itti et al.
[7], and Maki [8]. Other models are more concerned with the
transformation of a scene part into a constant reference frame
like the routing circuits of Olshausen [9] or the inhibitory beam
of Tsotsos [10].
Attentional control is of special importance in active vision
[11], where the activity of the system - mostly in the form
of directing a camera - has to be determined according to the
properties of the environment and the state and goal of the
system. Active vision is a form of overt visual attention that
is closely related to covert attention by selection from internal
representations.
Applications of visual attention in computer vision include
object recognition methods [12], control of vehicles [13], and
navigation [14]. Especially object recognition profits from
the availability of a segmented single object in contrast to
a cluttered scene.
B. Beyond the spotlight
Accounts that go beyond the spotlight metaphor are mainly
found in models of natural visual attention. Pylyshyn [15],
[16] proposed the so-called FINST-theory to give an account
of findings from various experimental paradigms. He was able
to show that one can keep track of a small number (about 4
or 5) of independently moving objects among other identical
objects [17]. Accounts of a fast serial scanning of the objects
by a single focus of attention could be ruled out due to the
necessary speed of the focus. There is also an attention-related
limit on the fast, parallel and error-free counting (so-called
subitizing) of a small number of items (about 4 or 5). This
led to the assumption that some indices are available pointing
to moving objects and sticking to them without the need for
focal attention. Indexed items are more easily available for
focal attention.
Object-based theories of visual attention [18], [19] challenge
the predominant spatial accounts of attention. According to
them, objects are the meaningful units in visual selection.
The partitioning of the scene into objects determines the
assignment of attention. The empirical evidence comes from
experiments in which, for identical spatial layouts, the suggested
grouping into objects caused additional costs associated with
the processing of multiple objects. Some recent empirical
findings point towards an integration of object-based effects
in spatial selection. A possible compromise suggests that
although attention selects a spatial part of the scene, the space
is determined by a fast object-based segmentation of the scene,
or by grouping effects.
Examples of the successful incorporation of object-based
approaches into computer models of visual attention have
been demonstrated at various levels by Fellenz [20], Maki et
al. [21], and Dickinson et al. [22]. We aim to contribute to
these achievements with a special focus on dynamic aspects
of controlling attention.
C. Limitations of conventional models
By applying conventional models (see section II-A) to
dynamic scenes we identify three major problems:
- Inhibition of return is bound to static locations instead of moving objects.
- Extracted information cannot be bound to moving objects.
- Selection and feature integration do not take into account the dynamic environment.
By inhibiting recently selected locations, inhibition of return
(IOR) allows the scanning of a scene by a serial process. The
area with maximal activation in the master map of attention is
marked in the inhibition map with high activity. The activity in
the inhibition map is slowly decaying and inhibits the master
map of attention, so that another area will show the highest
activity. Using this static inhibition map, it is not possible to
inhibit moving objects. Imagine a scene with a highly salient
moving object and a number of salient static objects. After the
moving object is selected and processed, it is marked in the
inhibition map. As soon as it moves out of the inhibited area
it becomes the most salient object and is selected again. This
prevents the system from scanning the scene and selecting
among the static objects. For human visual attention Tipper et
al. [23] have demonstrated that the IOR is in fact bound to
moving objects instead of static locations.
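The failure mode described above can be illustrated with a small sketch (hypothetical code, not part of the original system): a static inhibition map marks only the cell where the object was selected, so a moving object wins the competition again as soon as it has left that cell.

```python
import numpy as np

def select_with_static_ior(saliency, inhibition, decay=0.9):
    """One selection step of the conventional scheme: decay the static
    inhibition map, pick the maximum of the inhibited saliency, and
    mark the winning location."""
    inhibition *= decay
    masked = saliency * (1.0 - np.clip(inhibition, 0.0, 1.0))
    y, x = np.unravel_index(np.argmax(masked), masked.shape)
    inhibition[y, x] = 1.0          # inhibition is bound to this static cell
    return (int(y), int(x))

# A salient moving object at (2, 2) and a static one at (5, 5).
sal = np.zeros((8, 8)); sal[2, 2] = 1.0; sal[5, 5] = 0.5
inh = np.zeros((8, 8))
first = select_with_static_ior(sal, inh)    # the moving object is selected
sal[2, 2] = 0.0; sal[2, 6] = 1.0            # it moves out of the inhibited cell
second = select_with_static_ior(sal, inh)   # ... and is selected again
```

The static object at (5, 5) is never reached, which is exactly the scanning failure described above.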
When the high-level computations are serialized, their results, e.g.
object identities or classifications, are bound to the location an
object occupied at the moment it was selected. In the case
of moving objects, this information is soon outdated. Without
expensive re-checking, the system is not able to provide an up-
to-date world model, binding the identities to actual locations.
Itti and Koch [24] identified the spatiotemporal integration
of saliency information as an important step in the control
of visual attention. Thus the saliency information has to be
computed taking into account the previous saliency data. This
can lead to problems if there is no knowledge about the
movement of objects raising the saliency values.
In the following we will discuss how these problems can
be overcome in a new model of visual attention.
III. MODELING VISUAL ATTENTION
A. Consequences for modeling visual attention in dynamic
environments
From the above analysis it is clear that we need a method
of binding the saliency information to moving objects. For
the problem of dynamic IOR, we have to bind saliency
information to a small number of recently selected moving
objects. This also provides us with the binding of attentively
computed information to these objects and is thus a solution
for the first two problems.
This binding is necessary for the already selected objects
as well as salient objects that have not been selected yet.
This determines the need for a model-free tracking mechanism.
Objects that have never been selected for focal attention are not
recognized and can thus not be tracked based on knowledge
of their identity. Nevertheless it is not necessary to track all
objects in the scene. Just those who are salient enough to be
candidates for focal selection are relevant. This indicates a
close connection between selection and tracking.
Determining the saliency for these objects is a necessary
first step. To reflect the properties of the environment and
to account for the inherent inaccuracies of the feature
computations, which have to be performed preattentively for the
complete input images, spatial and temporal integration of saliency is
important. It has to compensate for the fact that the objects may be moving
and that the speed constraints on the preattentive feature
computations impose limits on their accuracy and reliability.
B. Model architecture
The processes that have been identified as essential in the
previous chapter can be assigned neither to the preattentive
part nor to the attentive part of the selection. We therefore define
an additional semiattentive stage, in which a small number of
discrete items is represented. These items have to be selected
by a first selection stage, which selects a small number of items
according to their saliency integrated over space and time.
It should be robust and show hysteresis:
selected items remain selected for some time, even if other
items become more salient. Tracking is integrated into this first
selection stage, which allows extracted information to be bound to
moving objects and allows these objects to be inhibited from being
selected by focal attention.
Fig. 2. Architecture of the attention model outlining the three computation stages. Inside the neural field, three-dimensional activity clusters are displayed. (Block diagram: the preattentive stage computes features from the input image sequence and integrates them into a saliency representation; the semiattentive stage contains the neural field and the object files 1, 2, 3, ... that constitute the world model; behavior control and object recognition operate on the FOA in the attentive stage.)
Among the semiattentive algorithms is the generation of
symbolic descriptions for each selected item. These contain
information about the position, size, and trajectories as well
as histories of object selection, mean feature values, and
the results of high-level computations like object recognition.
They are stored in so-called object files, which constitute
the world model of the system. The notion of object files is
borrowed from psychophysical modeling [25], [26], [27] and
emphasizes the symbolic reference to an object preceding the
computation of identity information.
For the focal selection of a single item, a second selection
stage is needed. This stage selects among the items that were
the result of the first selection stage. Second-stage selection
is subject to behaviors. It operates on the symbolic data
associated with an object and can include top-down influences.
The behavior is responsible for controlling the system, it
can e.g. initiate camera movements for foveating an object.
Figure 2 depicts the model architecture. Its implementation is
explained in the following section.
C. Model implementation
1) Feature computations: For the computation of saliency
we employ a number of features designed for fast object-related
information extraction. To achieve robust behavior
in different environments, these features use very different
aspects of the visual information, including edges, areas, color
information, and stereo information where available. The use of
multi-scale computations ensures fast computations and robust
results. We tried to realize a more object-based behavior than
simple filter operations could achieve.
Symmetry: To extract edge information in a biologically
plausible way, Gabor filters of different scales and orientations
are applied to the input. The energy of the Gabor filters
orthogonal to circles of different radii at different scales is
accumulated to compute the strength of rotation symmetry at
every pixel. Symmetry is a strong cue for artificial objects as
well as biological forms and points toward their center.
Eccentricity: A grey-level segmentation of the image,
consisting of a fast initial segmentation into many small
segments followed by a dilation and integration procedure,
provides area-based information on homogeneous object or
object-part candidates. The saliency of segments is evaluated
by a computation of the segments' eccentricity.
Color contrast: The image is first transformed into the
MTM color space [28] to achieve human-like processing of
colors. There, a segmentation takes place. The saliency of each
segment is computed according to the mean color contrast
to its neighboring segments weighted by the length of the
common border.
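The color-contrast computation can be sketched as follows: per-segment saliency as the border-length-weighted mean color difference to the neighboring segments. The segment representation and the city-block color difference are our own simplifying assumptions, not details given in the paper.

```python
def color_contrast_saliency(mean_color, neighbors):
    """Saliency of one segment: mean color contrast to its neighboring
    segments, weighted by the length of the common border.

    mean_color -- mean color of the segment, one value per channel
    neighbors  -- list of (neighbor_mean_color, shared_border_length)
    """
    total_border = sum(length for _, length in neighbors)
    if total_border == 0:
        return 0.0
    weighted = 0.0
    for color, length in neighbors:
        # city-block color difference as a simple contrast measure
        diff = sum(abs(a - b) for a, b in zip(mean_color, color))
        weighted += diff * length
    return weighted / total_border

# A red segment with a short border to a black segment and a long border
# to another red segment: the shared color dominates, so contrast is low.
saliency = color_contrast_saliency(
    (1.0, 0.0, 0.0),
    [((0.0, 0.0, 0.0), 10.0), ((1.0, 0.0, 0.0), 30.0)])
```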
Depth: Gabor filters with vertical components form the
basis for this feature. For different orientations, a modified
cross correlation is applied to the filter energies of two stereo
images using multiple scales. Results from the lower resolution
scales limit the correlation range. A voting scheme selects the
most probable disparity from the correlation results for each
location. It takes into account the results of neighboring pixels,
different orientations and scales. According to the heuristic
that a system should first react to close objects, the saliency
is monotonic with the disparity.
The segmentation results as well as clues from depth and
symmetry can be used to identify visual objects and segment
them. The features have been described in detail in [29], [30],
[31]. The feature saliency is integrated into a representation
by first honoring exclusivity (one single red area is more
salient than a large number of identically colored areas)
followed by a superposition of the feature values. In case stereo
information is available, a 3D representation is created. The
saliency representation provides the information necessary for
the first selection stage.
2) Dynamic neural fields: The close integration of robust
selection and model-free tracking suggests the use of dynamic
neural fields (DNF) as proposed by Amari [32]. Their selection
characteristic is robust, shows hysteresis and spatiotemporal
integration, which makes them the perfect candidates for
this stage as shown in [33]. Neural fields are simulations of
laterally connected cortical areas. Their topology corresponds
to the input they receive. The connections inside the field
are homogeneous, dependent only on the distance between the
neurons. The dynamics of a neuron's activity \(u(x,t)\) at position \(x\) and time \(t\) is defined by the following differential equation:

\[
\tau\,\frac{\partial u(x,t)}{\partial t} = -u(x,t) + h + \int w(x-x')\,f(u(x',t))\,dx' + s(x,t) \tag{1}
\]

Herein, \(h\) is a (negative) resting value, \(w\) is the weight function for the connections between the neurons, \(f\) is a sigmoid function, and \(s\) denotes the input function. The weights
for a DNF are excitatory in a local neighborhood and become
inhibitory for distant neurons. Different implementations use
either connections in a local neighborhood (local inhibition
type) or simulate a completely interconnected neural field
(global inhibition type). While the first type has stable states
with multiple activity clusters, the latter shows no more than
one such cluster. The weights are typically defined by a DoG-
function (for the local inhibition type) or standard distribu-
tions with a constant negative term (for the global inhibition
type). The distinct clusters of positive activity develop at
locations with sustained high input values and follow this
input. Hysteresis and spatiotemporal integration are important
mathematically proven properties of neural fields [32], [34].
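The selection behavior described above can be illustrated with a minimal one-dimensional Euler-integrated simulation of equation (1) using a DoG weight function (local inhibition type). All parameter values here are illustrative assumptions, not those of the actual system.

```python
import numpy as np

def dog_kernel(half_width, sigma_e=2.0, sigma_i=6.0, a_e=2.0, a_i=1.0):
    """Difference-of-Gaussians weights: local excitation, broad inhibition."""
    x = np.arange(-half_width, half_width + 1, dtype=float)
    return (a_e * np.exp(-x**2 / (2.0 * sigma_e**2))
            - a_i * np.exp(-x**2 / (2.0 * sigma_i**2)))

def simulate_dnf(stimulus, steps=200, dt=0.1, tau=1.0, h=-1.0):
    """Euler integration of equation (1) for a 1D local inhibition field."""
    n = len(stimulus)
    u = np.full(n, h)                    # start at the resting level
    w = dog_kernel(n // 2 - 1)           # odd-length kernel, centered
    for _ in range(steps):
        f_u = 1.0 / (1.0 + np.exp(-u))             # sigmoid output
        lateral = np.convolve(f_u, w, mode="same") # lateral interaction
        u = u + dt / tau * (-u + h + lateral + stimulus)
    return u

# Sustained input at one location produces a single stable activity cluster;
# everywhere else the field stays below zero.
s = np.zeros(64); s[30:34] = 3.0
u = simulate_dnf(s)
```

If the input peak is shifted slowly between calls, the cluster follows it, which is the tracking behavior exploited by the first selection stage.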
We have realized different architectures of neural fields
reflecting characteristics of the saliency representation that
is used as input for the neural field [33]. They include
systems of interconnected global inhibition 2D neural fields for
individually weighted superpositions of the saliency features
as well as a single local inhibition 3D DNF. Using these
architectures we aim at integrating saliency only for objects
and use the cues of (three-dimensional) neighborhood or the
homogeneity of an object. All these architectures show a small
number of distinct activity clusters (connected areas of positive
activity) that denote locations of high saliency and follow the
movement of such areas in their input.
These activity clusters are the result of the first selection
stage. They correspond to areas of sustained high saliency in
the input. For each of these clusters, a symbolic description is
created, a so-called object file. The underlying hypothesis is
that each of the clusters corresponds to a basic visual object,
a meaningful part of an object or a collection of objects. The
correspondence between object files and activity clusters is
constantly updated, a process that is easily implemented due to
the well-defined behavior of DNF, which shows spatial limits
for integration, tracking, and inhibition of different objects.
These thresholds determine spatial boundaries beyond which
no correspondence is sought. Inside the boundaries, spatial
distance and similarity of features inside the activity areas
determine the correspondence of object files to activity clusters
and therefore the continuity of object files.
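The correspondence update can be sketched under the simplifying assumption that matching is done by greedy nearest-neighbor search on cluster centroids with a fixed distance threshold; the actual system additionally uses feature similarity inside the activity areas. Function and variable names are our own illustration.

```python
import math

def update_object_files(object_files, clusters, max_dist=10.0):
    """Update the correspondence between object files and activity clusters.

    object_files -- {id: (x, y)}: last known position per object file
    clusters     -- [(x, y)]: current cluster centroids in the neural field
    Beyond max_dist no correspondence is sought; clusters without a
    matching file start a new object file.
    """
    unmatched = list(range(len(clusters)))
    next_id = max(object_files, default=-1) + 1
    updated = {}
    for oid, (ox, oy) in object_files.items():
        best, best_d = None, max_dist
        for i in unmatched:
            d = math.hypot(clusters[i][0] - ox, clusters[i][1] - oy)
            if d < best_d:
                best, best_d = i, d
        if best is not None:                  # continuity of the object file
            updated[oid] = clusters[best]
            unmatched.remove(best)
    for i in unmatched:                       # newly salient item
        updated[next_id] = clusters[i]
        next_id += 1
    return updated

files = {0: (10.0, 10.0), 1: (40.0, 40.0)}
clusters = [(12.0, 11.0), (60.0, 60.0)]   # file 0 moved; file 1 disappeared
files = update_object_files(files, clusters)
```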
3) Second selection stage: The second selection stage is
subject to top-down influences and can be implemented in a
problem-specific way. Its operation is encapsulated in behaviors
that take into account the object file information as well
as the state and goal of the system. The main task of a behavior
is the selection of one of the object files, and thereby the
corresponding activity cluster in the neural field, for focal
attention. The area corresponding to the activity cluster in
the input image is then subjected to high-level computations
like object recognition. It can also be foveated by saccadic
camera movements, so that the system shows overt attention
by controlling an active vision system [31].
The default exploration behavior is achieved by assigning
priority levels to the object files according to the time they
were last selected. Unselected items receive the highest prior-
ity. Within a priority level the object files are ordered by their
saliency. Dynamic IOR for moving objects is implicit to this
behavior and can be achieved by other behaviors in a similar
way. Examples of other behaviors we have implemented
include an alarm system, integrated searching and tracking of
a defined object, and the simulation of visual search.
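The default exploration behavior can be stated compactly. The sketch below uses our own data layout and function name; only the ordering rule (never-selected items first, then least recently selected, ties broken by saliency) comes from the text.

```python
def explore_select(object_files):
    """Default exploration behavior: items that were never selected get
    the highest priority; among previously selected items, the least
    recently selected comes first; ties are broken by saliency."""
    def priority(of):
        never = of["last_selected"] is None
        last = -1 if never else of["last_selected"]
        return (0 if never else 1, last, -of["saliency"])
    return min(object_files, key=priority)

files = [
    {"id": 0, "last_selected": 5,    "saliency": 0.9},
    {"id": 1, "last_selected": None, "saliency": 0.3},
    {"id": 2, "last_selected": None, "saliency": 0.7},
]
chosen = explore_select(files)   # a never-selected item wins despite id 0's saliency
```

Because the object files track moving objects, ordering by last selection time implements dynamic IOR without any spatial inhibition map.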
Due to the symbolic computation on a small number of
simple data structures, the modification of behaviors and the
implementation of additional behaviors (possibly using exist-
ing behaviors) is easily achieved. In operating on individual
items, the behaviors are related to Ullman's visual routines [35].
The indexed items from the visual routine model correspond
to the object files in our model. They differ in that a single
behavior is used for the main system control and a collection
of visual routines is used in the Ullman model, but it would
be possible to replace the monolithic behavior by such an
approach. An important aspect agreed on by Ullman [35]
and Pylyshyn [16] is that indices to a number of items are
important for relational operations. This is also achieved by
the first selection stage of our architecture. Behaviors can use
notions of objects that are “behind”, “higher” or “larger” than
something else.
IV. RESULTS
A. First selection stage and semiattentive computations
Static image feature computations, integration, and selection
by a DNF are depicted in figure 3. The variant shown uses
the stereo information computed during the determination
of stereo saliency to create a 3D representation of overall
saliency. The neural field used is a single three-dimensional
local inhibition type DNF. We used some modifications [30]
to the neural field to realize fast computations in spite of the
high dimensionality and the large number of neurons.
The tracking performance by neural fields is demonstrated
in [33]. The feature saliency reflects the environment proper-
ties. The features have been analyzed further in [30], [31].
B. World model quality
In order to compare the quality of the new approach to more
conventional modeling of attentional control we designed an
experiment involving the exploration of a scene by simulated
Fig. 3. Example of the feature computations (from left to right: symmetry,
eccentricity, color contrast, and depth), saliency integration into the 2D master map,
superposition in the 3D master map, and neural field activation. The activation
clusters in the neural field are colored. The 3D representations are ordered
by increasing distance in reading order. The colored background reflects
the architectural distinction of preattentive and semiattentive components as
shown in fig. 2.
recognition of objects. This allowed us to abstract from special
aspects like feature computation qualities for different inputs
and to concentrate on the architectural design. The goal to
be achieved was to compute a world model containing as
many objects as possible while maintaining accurate position
information for these objects. A number of simple objects
(squares of 5 by 5 pixels) were either stationary or moving
on a straight path (they moved at most 2 pixels in x- and y-
direction between consecutive frames). Noise was added with
half the amplitude of the objects. This data was used as a
simulated 2D master map of attention. Figure 4 shows three
consecutive frames of such a scene.
To these scenes we applied a conventional attention algo-
rithm with a static inhibition map as well as our attention
model. A simulated object recognition was the high-level
algorithm carried out at the focus of attention. The recognition
should take three frames. For our model, we added another
fourth frame to the recognition duration to compensate for the
additional computations necessary for the neural fields. We
compared the resulting world models (identified objects and
their positions) to ground truth and computed the mean number
of recognized objects and the position error. Whenever the
position was off by more than 20 pixels, the object was
counted as not being recognized.

Fig. 4. Three consecutive frames used in the experiment for comparing
the attention models. Two of the objects are static, while three of them are
dynamic.
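The scoring rule can be written down directly; the function below is our own sketch of it, not code from the paper.

```python
import math

def world_model_score(world_model, ground_truth, max_error=20.0):
    """Score one frame of the world model: an object counts as recognized
    only if its stored position is within max_error pixels of the true
    position; also return the mean position error of recognized objects.

    world_model, ground_truth -- {object_id: (x, y)}
    """
    errors = []
    for oid, (tx, ty) in ground_truth.items():
        if oid not in world_model:
            continue
        mx, my = world_model[oid]
        err = math.hypot(mx - tx, my - ty)
        if err <= max_error:
            errors.append(err)
    mean_error = sum(errors) / len(errors) if errors else 0.0
    return len(errors), mean_error

truth = {0: (10.0, 10.0), 1: (50.0, 50.0)}
model = {0: (11.0, 10.0), 1: (90.0, 50.0)}   # object 1 drifted out of range
recognized, error = world_model_score(model, truth)
```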
For this experiment, we used the simplest variant of
our model with a single 2D neural field of local inhibition
type. The choice was made to achieve as much comparability
between the two models as possible. As conventional models
use a 2D representation of overall saliency, we decided to use
the same representation for our neural field model. This ruled
out the use of the more advanced 3D neural field and the
system of global inhibition 2D fields with weighted features
(see [33] for a comparison).
The conventional model was mainly derived from the Koch
and Ullman [1] model. By abstracting from the feature com-
putations as well as the WTA-process, we tried to capture the
essential selection and inhibition scheme of the conventional
attention algorithms that we analyzed in section II-C. The
localization and selection was achieved by blurring the input
(mimicking the selection by neural fields and finding the
center of the input) and selecting the maximum value after
applying inhibition. We used an inhibition map with activity
slowly decaying by a factor of 0.8 after each frame. An object
was marked in the inhibition map using an area of 8 by
8 pixels, taking into account that the distance between two
objects was at least 14 pixels at each moment, so that there
was no danger of inhibiting a different object. The large size
and slow decay were chosen to give the conventional system
a small additional advantage: the long inhibition of objects
that would not inhibit other moving objects due to the large
distance. Under real world circumstances, the classical model
would perform worse than in our experiment, while our model
could still be improved by using more advanced neural field
architectures and saliency representations.
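The baseline's selection step, with the decay factor of 0.8 and the 8-by-8 inhibition marking described above, can be sketched as follows (function names and the marking amplitude are our own illustration):

```python
import numpy as np

def conventional_step(master_map, inhibition, decay=0.8, mark=8):
    """One selection step of the simulated conventional model: decay the
    inhibition map by 0.8 per frame, select the maximum of the inhibited
    master map, and mark an 8-by-8 area around the winner."""
    inhibition *= decay
    masked = master_map - inhibition
    y, x = np.unravel_index(np.argmax(masked), masked.shape)
    half = mark // 2
    inhibition[max(0, y - half):y + half,
               max(0, x - half):x + half] = master_map.max()
    return (int(y), int(x))

m = np.zeros((32, 32)); m[4, 4] = 1.0; m[20, 20] = 0.6
inh = np.zeros((32, 32))
a = conventional_step(m, inh)   # the most salient object is selected first
b = conventional_step(m, inh)   # then the second, since (4, 4) is inhibited
```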
One run of the experiment consisted of the preparation of 40
input frames (master maps) with the desired number of static
and dynamic objects. Ground truth of the identity and location
of the objects was computed. Both models were presented with
the simulated master maps, selected one location/item for focal
attention and started the simulated object recognition. After
three or four frames (depending on the model), the identity
of the selected object was returned by the simulated object
recognition. The identity was transferred to the internal world
model. For the conventional algorithm, it was connected to the
position, where it was selected. Our model used the object files
[Figure 5: three panels for 1, 3, and 5 static objects; each plots the number of recognized objects (0-6) and the position error (0-14 pixels) against the number of dynamic objects (0-5), with curves NeuralField-objects, Conventional-objects, NeuralField-position, and Conventional-position.]
Fig. 5. Comparison of world model quality for two approaches of visual attention and varied numbers of static and dynamic objects. Depicted are the number
of recognized objects and the position error. See text for details.
to connect the identity with the actual position of the activity
cluster. The mean number of recognized objects in the world
model (computed over all frames and all runs) as well as the
mean position error for the recognized objects was computed.
Figure 5 shows the results for different numbers of static
and dynamic objects. We refer to the conventional model
as “Conventional” and to our two-stage selection model as
“NeuralField”. Each data point is based on 50 runs of 40
frames. The mean number of recognized objects is always
smaller than the number of objects present because the mean
is taken over the complete run. The systems need a number of
frames until every object is recognized. Take the runs with five
dynamic and five static objects. The neural field model needs
at least all 40 frames until all 10 objects are recognized (four
frames per object). Therefore its optimal result would be a mean
of five recognized objects. It nearly reaches this optimum.
We find that for every number of static objects, the neural
field model scales much better with the number of dynamic
objects. The advantage of faster simulated recognition is only
exploited by the conventional model when no object is moving.
In all other cases the new model is superior. This is especially
true for the position error. In every condition, the mean
position error is smaller than 0.5 pixels for our new model
while the conventional model shows an error between 0.5 and
5 pixels. We conclude that even with the higher computational
demands associated with the neural fields, the new approach
provides a more efficient way of scanning a scene and keeping
the extracted information up to date than conventional models
of visual attention.
C. System performance
The exploration of a scene by a specified behavior is
depicted in figure 6. It shows the input frames, with object
files marked by bounding boxes, together with the area of
the FOA (below the input frame). Note the selection of
mainly meaningful areas (ball, picture, and robot) due to the
object-related feature computations and the movement of the
bounding box together with the moving objects (ball and
robot). The first re-checking of an item occurs only after
all objects were subjected to focal attention. In this respect, the
system has found an optimal dynamic scanpath.
The experiment was carried out using the simulation envi-
ronment Orbital 3D, which was implemented in our workgroup
[36]. Its suitability for evaluating vision algorithms is due to
the fact that controllable environments of different qualities
can be used to provide reproducible dynamic experiments with
ground truth [37].
D. Correspondence to natural visual attention
We have incorporated some advanced aspects of natural
visual attention models into our attention architecture. It is
therefore natural to ask whether the architecture can serve to
explain additional empirical findings on attention. Besides the
simultaneous tracking of multiple objects and the binding of
IOR to moving objects that are inherent to the model, we take
a look at further effects in natural attention that are known to
be difficult to explain.
The two selection stages contribute to the old debate on
early and late selection. The core problem here is that while
under some circumstances there is evidence for complex com-
putations outside the focus of attention, others find that even
simple computations need attention. By using two selection
stages, the dichotomy of attentive and preattentive processing
is replaced by three stages, adding semiattentive computations.
By shifting computations between the attentive and the semi-
attentive stage in accordance with the computational load,
the complexity of the task, and the state of the system, the
observed variants of serial and parallel processing could be
produced.
Accounts of multiple foci of attention [38] or the striking
effects of flanker compatibility [39] can also be explained
by our model as being related to semiattentive computations.
Take the experiments by Kramer and Hahn [38]. They showed
that it is possible to quickly compare objects at two positions
without identifying distractors lying in between them.

Fig. 6. Exploration of a scene. For 15 frames (in reading order), the current view of the scene is annotated with the bounding boxes and numbers of the object files. The currently selected OF is white with an arrow pointing from the center towards it; OF with already recognized objects are blue, OF unselected so far are red. For each frame, the area of the FOA is depicted separately.

This
ruled out the possibility of one large spotlight of attention. The
presentation speed ruled out a possible “jump” of the focus of
attention from one object to the other. Using our model, the
explanation would not involve multiple foci of attention but
just semiattentive selection and comparison of both items.
The flanker compatibility effects [40] demonstrate the pro-
cessing and recognition of items at positions that are known
to be irrelevant (distractors) when the task is to classify one
item (the target) at a previously known position. At first
glance, this seems to be exactly what selective attention should
prevent. The typical displays used to demonstrate this effect show
a small number of items that are easily recognized (like digits
or letters). The distractors are of the same type as the target.
Applying our model to such displays, each item would be
selected by the first selection stage due to the small overall
number of items present and their similarity to the target. The
identification processes are rather simple and could operate on
the semiattentive stage. Focal attention is then just needed to
bind the correct result to the target and the reaction. Although
it may be more efficient to suppress the computation of letter
and target identities, the Stroop effect [41] suggests that they
are too automated to be suppressed whenever an item is
selected.
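The walkthrough above can be simulated with a toy version of the model; the function name, the response rule, and the treatment of "conflict" as a count are illustrative assumptions, not experimental claims:

```python
def run_flanker_trial(display, target_pos, k=5):
    """Toy two-stage account of a flanker trial.

    Stage 1 selects every item (the displays are sparse and all items
    resemble the target), the semiattentive stage identifies each one,
    and focal attention only binds the identity at the known target
    position to the response.
    """
    # stage 1: all items pass the first selection (few, similar items)
    object_files = dict(list(display.items())[:k])
    # semiattentive stage: identities are computed for every selected
    # item, including the to-be-ignored flankers (the source of the effect)
    identities = {pos: item.upper() for pos, item in object_files.items()}
    # stage 2: focal attention binds only the target's identity
    response = identities[target_pos]
    # flankers whose identity conflicts with the target slow the response
    conflicting = [p for p, ident in identities.items()
                   if p != target_pos and ident != response]
    return response, len(conflicting)

# incompatible flankers: target 'h' flanked by 's'
resp, n_conflicts = run_flanker_trial({-1: "s", 0: "h", 1: "s"}, target_pos=0)
print(resp, n_conflicts)  # 'H' with 2 conflicting flankers
```

On this account the flanker identities are available whether or not they are wanted, which is exactly the automaticity that the Stroop analogy points to.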
V. CONCLUSIONS
The novel architecture of two selection stages in visual
attention, which provides an additional semiattentive computation
stage, was motivated by the problems that conventional approaches
to visual attention exhibit in dynamic scenes. The object-based
computations in every stage of the model allow us to refer
to meaningful entities of the environment. This improves the
selection process itself and simplifies high-level computations
like object recognition. Especially in dynamic environments,
the operation on moving objects is an improvement over purely
spatial approaches.
By creating object files for the discrete activity clusters in
the neural field, the model shows a well-defined transition from
subsymbolic computations to the symbolic domain, where
single visual objects are the subjects of manipulation. The
implementation of behaviors for the second selection stage
allows an encapsulation of top-down influences on the operating
characteristics of the system.
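The subsymbolic-to-symbolic transition can be illustrated as follows; the plain thresholding and flood fill below are a simplified stand-in for the Amari-type field dynamics [32] actually used by the model, and all names are assumptions:

```python
import numpy as np

def activity_clusters(field_activity, threshold=0.5):
    """Label connected supra-threshold regions of a 2D neural field.

    Returns one (centroid, peak_activity) pair per discrete activity
    cluster -- the candidates that become symbolic object files.
    Uses a simple 4-neighborhood flood fill.
    """
    active = field_activity > threshold
    labels = np.zeros(active.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(active)):
        if labels[start]:
            continue
        current += 1
        stack = [start]
        while stack:
            y, x = stack.pop()
            if not (0 <= y < active.shape[0] and 0 <= x < active.shape[1]):
                continue
            if not active[y, x] or labels[y, x]:
                continue
            labels[y, x] = current
            stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    clusters = []
    for lab in range(1, current + 1):
        ys, xs = np.nonzero(labels == lab)
        clusters.append(((ys.mean(), xs.mean()),
                         field_activity[ys, xs].max()))
    return clusters

# two separated bumps of activity -> two object-file candidates
field = np.zeros((8, 8))
field[1:3, 1:3] = 0.9   # first bump
field[5:7, 5:7] = 0.7   # second bump
print(len(activity_clusters(field)))  # 2
```

Each returned cluster is a discrete, addressable entity, so everything downstream (tracking, inhibition, the second selection stage) can manipulate single visual objects rather than raw activity maps.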
Specialization for applications is achieved by additional
features that allow the localization and selection of objects
relevant to the task at hand. The modification or implementation
of behaviors allows the integration of the model into a larger
vision system, interaction with other system components, and the
inclusion of specific knowledge. Note that the system
does not depend on such knowledge, but it can be augmented
and specialized whenever it is available. To provide even better
object candidates in the first selection stage, a segmentation
process based on the feature computations would be a promising
extension of the model.
REFERENCES
[1] C. Koch and S. Ullman, “Shifts in selective visual attention: Towards the
underlying neural circuitry,” Human Neurobiology, vol. 4, pp. 219–227,
1985.
[2] A. Treisman and G. Gelade, “A feature integration theory of attention,”
Cognitive Psychology, vol. 12, pp. 97–136, 1980.
[3] J. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: An alternative
to the feature integration model for visual search,” Journal of Exper-
imental Psychology: Human Perception and Performance, vol. 15, pp.
419–433, 1989.
[4] J. Wolfe, “Guided search 2.0: A revised model of visual search,”
Psychonomic Bulletin and Review, vol. 1, no. 2, pp. 202–238, 1994.
[5] R. Milanese, H. Wechsler, S. Gil, J. Bost, and T. Pun, “Integration
of bottom-up and top-down cues for visual attention using non-linear
relaxation,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Seattle, 1994, pp. 781–785.
[6] V. Leavers, “Preattentive computer vision - towards a 2-stage computer
vision system for the extraction of qualitative descriptors and the cues
for focus of attention,” Image and Vision Computing, vol. 12, no. 9, pp.
583–599, 1994.
[7] L. Itti and C. Koch, “A saliency-based search mechanism for overt and
covert shifts of visual attention,” Vision Research, vol. 40, no. 10–12,
pp. 1489–1506, 2000.
[8] A. Maki, P. Nordlund, and J.-O. Eklundh, “A computational model of
depth-based attention,” in Proc. 13th Int. Conf. on Pattern Recognition,
vol. 4, 1996, pp. 734–738.
[9] B. Olshausen, C. Anderson, and D. Van Essen, “A multiscale dynamic
routing circuit for forming size- and position-invariant object representa-
tions,” Journal of Computational Neuroscience, vol. 2, no. 1, pp. 45–62,
1995.
[10] J. K. Tsotsos, “An inhibitory beam for attentional selection,” in Spatial
visions in humans and robots, L. Harris and M. Jenkins, Eds., 1993.
[11] Y. Aloimonos, I. Weiss, and A. Bandopadhay, “Active vision,” in
Proceedings of the first International Conference on Computer Vision,
1987, pp. 35–54.
[12] L. Pessoa and S. Exel, “Attentional strategies for object recognition,”
in Proceedings of the IWANN, Alicante, Spain, 1999, ser. Lecture Notes
in Computer Science, J. Mira and J. Sánchez-Andrés, Eds., vol. 1606.
Springer, 1999, pp. 850–859.
[13] D. Reece and S. Shafer, “Control of perceptual attention in robot
driving,” Artificial Intelligence, vol. 78, pp. 397–430, 1995.
[14] A. Abbott, “A survey of selective fixation control for machine vision,”
in IEEE Control Systems, 1992, pp. 25–31.
[15] Z. Pylyshyn, J. Burkell, B. Fisher, C. Sears, W. Schmidt, and L. Trick,
“Multiple parallel access in visual attention,” Canadian Journal of
Experimental Psychology, vol. 48, no. 2, pp. 260–283, 1994.
[16] Z. Pylyshyn, “Visual indexes in spatial vision and imagery,” in Visual
Attention, ser. Vancouver Studies in Cognitive Science, R. Wright, Ed.
Oxford University Press, 1998, no. 8, pp. 215–231.
[17] Z. W. Pylyshyn and R. Storm, “Tracking multiple independent targets:
evidence for a parallel tracking mechanism,” Spatial Vision, vol. 3, no. 3,
1988.
[18] S. Vecera and M. Farah, “Does visual attention select objects or
locations?” Journal of Experimental Psychology: General, vol. 123, pp.
146–160, 1994.
[19] S. Tipper and B. Weaver, “The medium of attention: Location-based,
object-centered, or scene-based?” in Visual Attention, R. Wright, Ed.
Oxford University Press, 1998, pp. 77–107.
[20] W. Fellenz and G. Hartmann, “Preattentive grouping and attentive
selection for early visual computation,” in Proceedings of the 13th
International Conference on Pattern Recognition (ICPR), Vienna, Austria,
August 25–30, 1996.
[21] A. Maki, P. Nordlund, and J.-O. Eklundh, “Attentional scene segmen-
tation: Integrating depth and motion,” Computer Vision and Image
Understanding, vol. 78, pp. 351–373, 2000.
[22] S. Dickinson, H. Christensen, J. Tsotsos, and G. Olofsson, “Active ob-
ject recognition integrating attention and viewpoint control,” Computer
Vision and Image Understanding, vol. 67, no. 3, pp. 239–260, 1997.
[23] S. P. Tipper, J. Driver, and B. Weaver, “Object-centered inhibition of re-
turn of visual attention,” Quarterly Journal of Experimental Psychology,
vol. 43A, no. 2, pp. 289–298, May 1991.
[24] L. Itti and C. Koch, “Feature combination strategies for saliency-based
visual attention systems,” Journal of Electronic Imaging, vol. 10, no. 1,
2001.
[25] A. Treisman, “Representing visual objects,” in Attention and Perfor-
mance, D. Meyer and S. Kornblum, Eds. Hillsdale, NJ: Erlbaum, 1991,
vol. 14.
[26] D. Kahneman, A. Treisman, and B. Gibbs, “The reviewing of object
files: object-specific integration of information,” Cognitive Psychology,
vol. 24, no. 2, pp. 175–210, 1992.
[27] J. Wolfe and S. Bennett, “Preattentive object files: Shapeless bundles of
basic features,” Vision Research, vol. 37, pp. 25–43, 1997.
[28] M. Miyahara and Y. Yoshida, “Mathematical transform of (R, G, B) color
data to Munsell (H, V, C) color data,” Visual Communication and Image
Processing, vol. 1001, pp. 650–657, 1988.
[29] B. Mertsching, M. Bollmann, R. Hoischen, and S. Schmalz, “The neural
active vision system NAVIS,” in Handbook of Computer Vision and
Applications, Vol. 3 (Systems and Applications), B. Jähne, H. Haußecker,
and P. Geißler, Eds. Academic Press, 1999, pp. 543–568.
[30] G. Backer and B. Mertsching, “Integrating time and depth into the at-
tentional control of an active vision system,” in Dynamische Perzeption.
Workshop der GI-Fachgruppe 1.0.4 Bildverstehen, Ulm, November 2000,
G. Baratoff and H. Neumann, Eds., 2000, pp. 69–74.
[31] G. Backer, B. Mertsching, and M. Bollmann, “Data- and model-driven
gaze control for an active-vision system,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 23, no. 12, pp. 1415–1429, 2001.
[32] S.-I. Amari, “Dynamics of pattern formation in lateral-inhibition type
neural fields,” Biological Cybernetics, vol. 27, pp. 77–87, 1977.
[33] G. Backer and B. Mertsching, “Using neural field dynamics in the
context of attentional control,” in Proceedings of the ICANN 2002, 2002,
pp. 1237–1242.
[34] K. Kishimoto and S.-I. Amari, “Existence and stability of local excita-
tions in homogeneous neural fields,” Journal of Mathematical Biology,
vol. 7, pp. 303–318, 1979.
[35] S. Ullman, “Visual routines,” Cognition, vol. 18, pp. 97–159, 1984.
[36] M. Bungenstock, A. Baudry, J. Bitterling, and B. Mertsching, “Devel-
opment of a simulation framework for mobile robots,” in Proceedings
of the EUROIMAGE ICAV3D 2001, 2001, pp. 89–92.
[37] G. Backer and B. Mertsching, “Evaluation of attentional control in active
vision systems using a 3d simulation framework,” Journal of the
WSCG - 10th International Conference in Central Europe on Computer
Graphics, Visualization and Computer Vision, vol. 10, 2002, pp. 32–39.
[38] A. Kramer and S. Hahn, “Splitting the beam: Distribution of attention
over noncontiguous regions of the visual field,” Psychological Science,
vol. 6, no. 6, pp. 381–386, 1995.
[39] J. Miller, “The flanker compatibility effect as a function of visual angle,
attention focus, visual transients, and perceptual load: A search for
boundary conditions,” Perception and Psychophysics, vol. 49, pp. 270–
288, 1991.
[40] C. Eriksen and J. Hoffman, “The extent of processing of noise ele-
ments during selective encoding from visual displays,” Perception and
Psychophysics, vol. 14, pp. 155–160, 1973.
[41] J. Stroop, “Studies of interference in serial verbal reactions,” Journal of
Experimental Psychology, vol. 18, pp. 643–662, 1935.