MULTISENSORY ARCHITECTURE FOR INTELLIGENT
SURVEILLANCE SYSTEMS
Integration of segmentation, tracking and activity analysis
Keywords: Intelligent surveillance systems; Monitoring architecture; Segmentation; Tracking; Activity analysis.
Abstract: Intelligent surveillance systems deal with all aspects of threat detection in a given scene; these range from
segmentation to activity interpretation. The proposed architecture is a step towards solving the detection and
tracking of suspicious objects as well as the analysis of the activities in the scene. It is important to include
different kinds of sensors for the detection process. Indeed, their mutual advantages enhance the performance
provided by each sensor on its own. The results of the multisensory architecture offered in the paper, obtained
from testing the proposal on CAVIAR project data sets, are very promising at the three proposed levels,
that is, segmentation based on accumulative computation, tracking based on distance computation and activity
analysis based on finite state automata.
1 INTRODUCTION
Nowadays, obtaining surveillance systems able to
cover all levels of traditional image processing stages
is still a very challenging issue. Intelligent surveil-
lance systems must be able to predict potential sus-
picious behaviors, to work semi-automatically, and
to warn the human operator if necessary. More-
over, for a human operator, the task of looking
at a monitor during hours is very monotonous.
That is why a system able to detect suspicious
situations is essential for an efficient surveillance
(Gascue˜
na and Fern´
andez-Caballero, 2011).
1.1 Segmentation in surveillance
Regarding the detection/segmentation process, it is
important to have different kinds of sensors. In-
deed, their mutual advantages enhance the perfor-
mance provided by each sensor on its own. For
instance, when working with surveillance cameras,
complementing color with thermal-infrared cameras
provides the surveillance systems with the ability to
perform detection in almost any situation, regard-
less of the environmental conditions (e.g. illumina-
tion changes, fog or smoke).
It is known that color is a very important fea-
ture for object recognition. Several approaches can
be found in the bibliography devoted to color image
segmentation (e.g. (L´
ezoray and Charrier, 2009)). In
(Maldonado-Basc´
on et al., 2007)RGB and HSI color
spaces are used for the detection of traffic signals.
There are other proposals that use image features
combined with color to solve the segmentation prob-
lem (e.g. (Moreno-Noguer et al., 2008)).
Concerning infrared cameras, many proposals rely
on the assumption that objects of interest, mostly
pedestrians, possess a temperature higher
than their surroundings (Yilmaz et al., 2003). Ther-
mal infrared video cameras detect relative differences
in the amount of thermal energy emitted/reflected
from objects in the scene. As long as the thermal
properties of a foreground object are slightly differ-
ent (higher or lower) from the background radiation,
the corresponding region in a thermal image appears
at a contrast with its environment. A technique
based on background subtraction for the detection of
objects under different environmental conditions has
been proposed (Davis and Sharma, 2007).
1.2 Tracking in surveillance
Tracking objects of interest in a scene is another
key step prior to video surveillance events detec-
tion (Regazzoni and Marcenaro, 2000). Tracking
approaches can be classified into four main cate-
gories: tracking based on the moving object region,
where the bounding box surrounding an object is
tracked in 2D space (Masoud and Papanikolopoulos,
2001); tracking based on the moving object contour,
where a contour defined by curves delimiting the
moving objects is dynamically updated (Isard and
Blake, 1998); tracking based on a model of the moving
object, where the 3D geometry of a moving object is
defined (Koller et al., 1993); and tracking based on the
features of the moving object, where some features of
the objects are extracted and monitored (e.g. vertices
in vehicle tracking (McCane et al., 2002)).
Besides these categories, there exist other propos-
als such as wavelet analysis-based (Kim et al., 2001)
or Kalman filter-based (Stauffer and Grimson, 1999)
tracking, among others.
1.3 Activity analysis in surveillance
In (Lavee et al., 2009) some recent approaches for
video event understanding are presented. The im-
portance of the two main components of the event
understanding process, abstraction and event mod-
eling, is pointed out. Abstraction corresponds
to the process of molding the data into informa-
tive units to be used as input to the event model
(Natarajan and Nevatia, 2008), while event modeling
is devoted to formally describing events of interest
and enabling recognition of these events as they oc-
cur in the video sequence (Ulusoy and Bishop, 2005).
In close relation to finite state machines theory,
in (Ayers and Shah, 2001) the authors describe a sys-
tem which automatically recognizes human actions in
a room from video sequences. The system recognizes
the actions by using prior knowledge about the lay-
out of the room. Indeed, action recognition is mod-
eled by a state machine, which consists of ’states’
and ’transitions’ between states. Another approach
(Hongeng et al., 2004) models scenario events from
shape and trajectory features using a hierarchical ac-
tivity representation, where events are organized into
several layers of abstraction, providing flexibility and
modularity in the modeling scheme. An event is consid-
ered to be composed of action threads, each thread
being executed by a single actor. A single-thread ac-
tion is represented by a stochastic finite automaton of
event states that are recognized from the characteris-
tics of the trajectory and shape of the moving blob of
the actor.
1.4 The proposal for intelligent
surveillance
The proposed architecture aims to solve the de-
tection and tracking of suspicious objects as
well as the analysis of the activities detected
in the scene. The approach is closely related
to the works of (Ivanov and Bobick, 2000) and
(Hongeng et al., 2004) in the sense that the exter-
nal knowledge about the problem domain is incorpo-
rated into the expected structure of the activity model.
Motion-based image features are linked explicitly to
a symbolic notion of hierarchical activity through
several layers of more abstract activity descriptions.
Atomic actions are detected at a low level and fed to
hand-crafted grammars to detect activity patterns of
interest. The inspiration is also close to the paper by
(Amer et al., 2005), as we work with shape and trajec-
tory to indicate the events related to moving objects.
In comparison to other approaches, such as Bayesian
networks or HMMs (Oliver and Horvitz, 2005), the
proposal is unable to model uncertainty in video
events, but it is presented as a useful tool in video
event understanding because of its simplicity and its
ability to model temporal sequence and to easily in-
corporate new actions.
2 ARCHITECTURE DEFINITION
This section describes in detail the different phases
proposed for the intelligent multisensor surveillance
architecture. Fig. 1 schematically depicts the sys-
tem processing stages. Processing starts after cap-
turing color images as well as infrared images. The
use of two different spectral images allows the detec-
tion of objects of interest independently of lighting
or environmental conditions (day, night, fog, smoke,
etc.). The segmentation algorithm detects motion of
the scene objects. As a result, the segmentation algo-
rithm generates a list of blobs detected in the scene.
The blobs are used as the inputs to the tracking phase.
Following the segmentation process, a simple but
robust tracking approach is proposed. An identifier
is assigned to each object until it leaves the scene or
the segmentation algorithm does not detect it for a de-
termined period of time (application dependent). Fi-
nally, the identified blobs are passed to the activity
analysis stage, where some predefined activities are
modeled by means of state machines. At this stage
the different behaviors and their associated alarm lev-
els are obtained. To configure the parameters needed
for segmentation and tracking, as well as the definition
of the static objects in the scene, a modeling stage has
been included in the proposal.

Figure 1: The proposed architecture
2.1 Segmentation based on
accumulative computation
This section describes the proposed segmen-
tation method integrated into the architecture.
The method has been tested on visual and non
visual spectrum (infrared) image sequences
with promising results in detecting moving ob-
jects (e.g. (Fernández-Caballero et al., 2011;
Fernández-Caballero et al., 2010)). The accumula-
tive computation method consists of the five phases
described next (and also depicted in Fig. 2):
Preprocessing: This phase performs the preprocess-
ing of the input images. Filters such as the mean or
the median are applied in order to enhance the image
contrast and to smooth the image noise.
Grey level bands segmentation: This phase is in
charge of splitting the input image, I(x,y,t), into k
grey level bands. That is, each image is binarized in
ranges according to the k bands following equation 1.
Equation 1 shows how, at a given time t, a pixel can
only belong to a unique grey level band.

$$GLS_k(x,y,t) = \begin{cases} 0, & \text{if } I(x,y,t) \neq k,\; k \in [0,255] \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$
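As an illustration of this banding step, the sketch below splits an 8-bit grey image into k binary masks. The equal-width band boundaries are an assumption made here; the paper does not state how the k ranges are chosen.

```python
import numpy as np

def grey_level_bands(image: np.ndarray, k: int) -> np.ndarray:
    """Split an 8-bit grey image into k binary band masks GLS_k (Eq. 1).

    Returns an array of shape (k, H, W) where band b is 1 at the pixels
    whose grey level falls inside that band and 0 elsewhere, so every
    pixel belongs to exactly one band at a given time t.
    """
    edges = np.linspace(0, 256, k + 1)            # assumed equal-width bands
    bands = np.zeros((k,) + image.shape, dtype=np.uint8)
    for b in range(k):
        inside = (image >= edges[b]) & (image < edges[b + 1])
        bands[b][inside] = 1
    return bands
```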
Accumulative computation: This phase obtains one
sub-layer per each band defined in the previous phase.
Each band stores the pixels' accumulative computation
values. In the first place, the method establishes a
permanence memory value for each pixel, PMk(x,y,t).
It is assumed that motion takes place at a pixel when
the value of that pixel falls into a new band. For each
pixel (x,y), at a given time t, and in a band k, the
following possibilities must be taken into account when
comparing with the pixel at time t-1 (see equation 2).
A complete description may be found in
(Delgado et al., 2010).

$$PM_k(x,y,t) = \begin{cases} v_{des}, & \text{if } GLS_k(x,y,t) = 0 \\ v_{sat}, & \text{if } GLS_k(x,y,t) = 1 \wedge GLS_k(x,y,t-1) = 0 \\ \max\{PM_k(x,y,t-1) - v_{dm},\; v_{des}\}, & \text{if } GLS_k(x,y,t) = 1 \wedge GLS_k(x,y,t-1) = 1 \end{cases} \qquad (2)$$

Figure 2: The phases of the accumulative computation method
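A minimal sketch of the permanence-memory update of equation 2 follows. The values of v_des, v_sat and v_dm (discharge, saturation and decrement) shown here are illustrative; the paper takes them from (Delgado et al., 2010).

```python
import numpy as np

def update_permanence(pm_prev, gls_now, gls_prev,
                      v_des=0, v_sat=255, v_dm=32):
    """One accumulative-computation step per band (Eq. 2).

    All arrays have shape (k, H, W); pm_prev holds PM_k(x,y,t-1) and
    gls_now / gls_prev the band masks at t and t-1.
    """
    pm_prev = pm_prev.astype(np.int32)
    pm = np.full_like(pm_prev, v_des)            # GLS_k == 0 -> discharge value
    appeared = (gls_now == 1) & (gls_prev == 0)  # pixel has just entered the band
    pm[appeared] = v_sat                         # -> saturate (motion detected)
    stayed = (gls_now == 1) & (gls_prev == 1)    # pixel stays inside the band
    pm[stayed] = np.maximum(pm_prev[stayed] - v_dm, v_des)
    return pm
```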
Fusion: This phase fuses the information coming
from the k accumulative computation layers. The aim
is to detect the movement of the objects in the scene.
For this, a new layer is created to store the k sub-
bands of the accumulative computation. Each pixel
is assigned the maximum value of the different sub-
bands following equation 3. Next, a thresholding is
performed to discard regions with low motion. Clos-
ing and opening morphological operations eliminate
isolated pixels and unite close regions wrongly split
by the thresholding operation.

$$S(x,y,t) = \max_k \, PM_k(x,y,t) \qquad (3)$$
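The fusion of equation 3, followed by the thresholding and the closing/opening operations, could look as in the sketch below; the motion threshold and the 3x3 structuring element are illustrative choices, not values given in the paper.

```python
import numpy as np
from scipy import ndimage

def fuse_bands(pm, motion_threshold=128):
    """Fuse the k permanence sub-bands (Eq. 3) and clean the result.

    pm has shape (k, H, W); returns a binary motion mask.
    """
    fused = pm.max(axis=0)                       # S(x,y,t) = max_k PM_k(x,y,t)
    motion = fused >= motion_threshold           # discard low-motion regions
    structure = np.ones((3, 3), dtype=bool)
    motion = ndimage.binary_closing(motion, structure)  # unite split regions
    motion = ndimage.binary_opening(motion, structure)  # drop isolated pixels
    return motion
```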
Objects segmentation: This phase obtains the areas
containing moving regions. As output, a blob list
LB is obtained and passed to the higher layers of the
architecture.
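A possible sketch of this final blob-extraction step, using connected-component labeling to build the blob list LB; the minimum-area filter is an assumption added for illustration.

```python
from scipy import ndimage

def extract_blobs(motion_mask, min_area=50):
    """Label connected moving regions and return their bounding boxes LB."""
    labels, _ = ndimage.label(motion_mask)
    blobs = []
    for sl in ndimage.find_objects(labels):
        if sl is None:
            continue
        ys, xs = sl
        if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
            blobs.append({"xmin": xs.start, "xmax": xs.stop - 1,
                          "ymin": ys.start, "ymax": ys.stop - 1})
    return blobs
```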
2.2 Tracking based on distance
computation
The second level of the proposed architec-
ture consists of tracking the segmented objects
(Moreno-Garcia et al., 2010). This tracking proposal
consists of the four stages described below:
Object labeling: The tracking approach uses the re-
sult of the previous segmentation stage, that is, the
segmented spots, LB, though the tracking algorithm
has its own list of blobs, LT, updated over time with
the tracking results. Firstly, each blob contained in
LB, denoted LBi, where i ∈ {0,1,...,N} and N is the
number of elements in LB, is compared to all blobs
contained in LT. The aim is to calculate the distance
between the centers of the boxes associated to the
blobs. The centers of the LB blobs are calculated as
shown in equations 4 and 5. The centers of the LT
blobs are calculated in a similar way.

$$LB_i.x_c = \frac{LB_i.x_{min} + LB_i.x_{max}}{2} \qquad (4)$$

$$LB_i.y_c = \frac{LB_i.y_{min} + LB_i.y_{max}}{2} \qquad (5)$$

where LBi.xmin and LBi.ymin are the initial coordinates
of blob LBi, and LBi.xmax and LBi.ymax the final ones.
According to equation 6, blobs LTj with a distance
between centers below a prefixed threshold are se-
lected as candidates to be the previous position of
blob LBi at time instant t-1. The blob with the
minimum distance to LBi is selected as the previous
position of the current blob.

$$d = \sqrt{(LB_i.x_c - LT_j.x_c)^2 + (LB_i.y_c - LT_j.y_c)^2} \qquad (6)$$
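The center computation of equations 4 and 5 and the candidate selection of equation 6 reduce to the sketch below, where max_distance stands for the prefixed threshold and its value is illustrative.

```python
import math

def box_center(blob):
    """Center of a blob's bounding box (Eqs. 4 and 5)."""
    xc = (blob["xmin"] + blob["xmax"]) / 2
    yc = (blob["ymin"] + blob["ymax"]) / 2
    return xc, yc

def match_blob(lb_blob, tracked_blobs, max_distance=40.0):
    """Return the tracked blob LT_j closest to lb_blob (Eq. 6), or None."""
    xc, yc = box_center(lb_blob)
    best, best_d = None, max_distance
    for lt in tracked_blobs:
        txc, tyc = box_center(lt)
        d = math.hypot(xc - txc, yc - tyc)      # distance between centers
        if d <= best_d:
            best, best_d = lt, d
    return best
```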
Blob updating: Once the segmented blobs, LB, are
associated to their identifiers, a smoothing process is
performed to reduce the effect of noise introduced
during the detection process. This way, the box size
is smoothed to avoid abrupt variations. If a foreseen
blob is not detected during the segmentation process,
L'T = LT − LB, a prediction about its possible trajec-
tory is performed through a mean distance increment,
based on ∆xc and ∆yc, and a displacement angle cal-
culated between consecutive frames. If the perma-
nence memory value reaches its minimum, the blob is
discarded as it is considered to leave the scene.
Size adjustment: Given that the output of the seg-
mentation algorithm is not always accurate, the de-
tected blobs might include some noise that modifies
the size of their containing boxes. In order to alle-
viate the effects of noise, the obtained blobs are soft-
ened according to a weighted mean of their height and
width. Also, depending on the motion direction of a
blob, its position may be modified. Fig. 3 depicts a sit-
uation where a blob LTj moves between two consec-
utive times with a significant variation in size. Only
the height component, LTj(t).h (denoted as LTj.h),
is shown to keep the figure simple. The height and
width are calculated as shown in equations 7 and 8,
respectively.

Figure 3: Height resizing parameters.

$$LT_j.h = \frac{M-\lambda}{M} \cdot \overline{LT.h} + \frac{\lambda}{M} \cdot LT_j.h \qquad (7)$$

$$LT_j.w = \frac{M-\lambda}{M} \cdot \overline{LT.w} + \frac{\lambda}{M} \cdot LT_j.w \qquad (8)$$
where M is the number of previous instances of blob
LTj used to calculate the mean, λ is the weight given to
the current height and width values, and LT.h and LT.w
are the mean height and width, respectively. Once the
blob's height and width are smoothed, the blob's po-
sition is also enhanced. For this, the displacement an-
gle, θ, the blob center, (LTj.xc, LTj.yc), and the new
height and width are taken into account. This is per-
formed through equations 9 and 10 for the x-axis maxi-
mum and minimum coordinates, and equations 11 and
12 for the y-axis components.

$$LT_j.x_{max} = \begin{cases} \tfrac{3}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta \geq 0 \\ \tfrac{2}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta < 0 \end{cases} \qquad (9)$$

$$LT_j.x_{min} = \begin{cases} -\tfrac{2}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta \geq 0 \\ -\tfrac{3}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta < 0 \end{cases} \qquad (10)$$

$$LT_j.y_{max} = \begin{cases} \tfrac{3}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta \geq 0 \\ \tfrac{2}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta < 0 \end{cases} \qquad (11)$$

$$LT_j.y_{min} = \begin{cases} -\tfrac{2}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta \geq 0 \\ -\tfrac{3}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta < 0 \end{cases} \qquad (12)$$
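A sketch of the size adjustment of equations 7 to 12, under the reading that λ weights the current size and the box is shifted 3/5 ahead of and 2/5 behind the center along the motion direction; the values of M and λ, and the dictionary-based blob representation, are illustrative.

```python
import math

def smooth_box(lt_blob, mean_h, mean_w, theta, M=5, lam=1.0):
    """Smooth a tracked blob's size and re-centre its box (Eqs. 7-12).

    mean_h / mean_w are the mean height and width over the last M
    instances of the blob and theta is the displacement angle.
    """
    h = lt_blob["ymax"] - lt_blob["ymin"]
    w = lt_blob["xmax"] - lt_blob["xmin"]
    h = (M - lam) / M * mean_h + lam / M * h          # Eq. 7
    w = (M - lam) / M * mean_w + lam / M * w          # Eq. 8
    xc = (lt_blob["xmin"] + lt_blob["xmax"]) / 2
    yc = (lt_blob["ymin"] + lt_blob["ymax"]) / 2
    # The box extends further in the direction of motion (Eqs. 9-12).
    fwd_x, back_x = (3 / 5, 2 / 5) if math.cos(theta) >= 0 else (2 / 5, 3 / 5)
    fwd_y, back_y = (3 / 5, 2 / 5) if math.sin(theta) >= 0 else (2 / 5, 3 / 5)
    return {"xmin": xc - back_x * w, "xmax": xc + fwd_x * w,
            "ymin": yc - back_y * h, "ymax": yc + fwd_y * h}
```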
Figure 4: Tolerance areas and their thresholds.

Trajectory prediction: This phase uses the L'T blobs
list; that is, this phase works with the blobs not up-
dated in the previous segmentation phase. For each of
them, LT'k, its permanence memory value is reduced
under the assumption that the object is probably not
present in the scene, as shown in equation 13.

$$p(LT'_k) = p(LT'_k) - \delta \qquad (13)$$

where δ is a predefined discharge value. Two different
permanence zones are defined in the image, each one
with a different threshold, µl and µh, with µh >> µl (see
Fig. 4). In the external zone a blob is discarded if its
permanence value drops below µl. As a result, a
blob close to the image border is discarded more easily,
as it is assumed to abandon the scene. Inner blobs not
detected during the segmentation phase are assumed
to be still.
Trajectories are predicted for those blobs con-
tained in L'T with permanence above the aforesaid
thresholds. This involves the previous calculation
of the mean displacement between frames, ∆xc and ∆yc.
Thus, the blob coordinates are calculated as:

$$x_{min,t} = x_{min,t-1} + \cos\theta \cdot \Delta x_c \qquad (14)$$

$$x_{max,t} = x_{max,t-1} + \cos\theta \cdot \Delta x_c \qquad (15)$$

$$y_{min,t} = y_{min,t-1} + \sin\theta \cdot \Delta y_c \qquad (16)$$

$$y_{max,t} = y_{max,t-1} + \sin\theta \cdot \Delta y_c \qquad (17)$$

The content of list L'T updates the information of
LT, the input to the next phase, which uses the blobs
and their associated identifiers to define the activities
carried out in the scene.
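The bookkeeping of equations 13 to 17 could be sketched as follows; the values of δ and of the two thresholds µl and µh, and the dictionary-based blob representation, are assumptions made for illustration.

```python
import math

def predict_missing(blob, theta, dxc, dyc, delta=20,
                    mu_low=40, mu_high=200, in_border_zone=False):
    """Handle a tracked blob not found by segmentation (Eqs. 13-17).

    dxc / dyc are the mean per-frame displacements of the blob centre.
    Returns None when the blob is discarded.
    """
    blob["permanence"] -= delta                       # Eq. 13
    threshold = mu_low if in_border_zone else mu_high
    if blob["permanence"] < threshold:
        # Border blobs are discarded; inner blobs are assumed to be still.
        return None if in_border_zone else blob
    # Otherwise predict the new position along the displacement angle.
    blob["xmin"] += math.cos(theta) * dxc             # Eq. 14
    blob["xmax"] += math.cos(theta) * dxc             # Eq. 15
    blob["ymin"] += math.sin(theta) * dyc             # Eq. 16
    blob["ymax"] += math.sin(theta) * dyc             # Eq. 17
    return blob
```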
2.3 Activity analysis based on finite
state automaton
The purpose of activity description is to reasonably
choose a group of motion words or short expressions
to report the activities of moving objects or humans in
natural scenes.
Action | Origin vertex | Description
Object information | Object speed | Makes it possible to define whether an object is still, walking, running, etc.
Object information | Object trajectory | Apart from speed, the direction and moving direction of an object can be obtained.
Environment interaction | Direction | The system determines if a person is approaching a specific area of the scenario. By taking the object's speed and trajectory as reference, the object's goal is inferred.
Environment interaction | Position | By knowing the important areas of the scenario, the system is capable of determining the relative position of dynamic objects. This way, it can detect if a person is standing in one of the areas.
Object interaction | Proximity | The system detects the distance between objects.
Object interaction | Orientation | The system determines whether an object is approaching another or whether they are both approaching each other.
Object interaction | Grouping | The system uses the parameters generated in the two previous points to detect object grouping (thanks to their proximity and direction).

Table 1: Local activities
Objects of Interest: From the ETISEO classification
proposal (available at
http://www-sop.inria.fr/orion/ETISEO), four categories
are established for dynamic objects and two for static
objects. Dynamic objects are a person, a group of
people (made up of two or more people), a portable
object (such as a briefcase) and other dynamic objects
(able to move on their own), classified as moving
object. Static objects may be areas and pieces of
equipment. The latter can be labeled as a portable
object if a dynamic object, a person or a group,
interacts with it and it starts moving.
Description of Local Activities: In order to gener-
alize the detection process we start with simple func-
tionalities that detect simple actions of the active ob-
jects in the scene. Using these functions, more com-
plex behavior patterns are built. Simple actions are
defined in Table 1.
Description of Global Activities: Interpreting a vi-
sual scene is a task that, in general, resorts to a large
body of prior knowledge and experience of the viewer
(Neumann and Möller, 2008). Through the actions or
queries described in the previous section, basic pat-
terns (e.g. the object speed or direction) and more
complex patterns (e.g. the theft of an object) can be
found. It is essential to define the desired behavioral
pattern in each situation by using the basic actions
or queries. For each specific scene, a state diagram
and a set of rules are designed to indicate the patterns.
Thus, the proposed video surveillance system is able
to detect simple actions or queries and to adapt to a
great deal of situations. It is also configured to detect
the behavioral patterns necessary in each case and to
associate an alarm level to each one.

Figure 5: Position and direction analysis.
Image Preprocessing: Input image segmentation is
not enough to detect the activities in the scene. Hence,
the system takes the initial segmentation data and in-
fers the new necessary parameters (such as the objects'
speed or direction). For this, the preprocessing tech-
niques described in Table 2 are necessary.
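A minimal sketch of the speed and direction hypotheses described in Table 2, computing speed as displacement over elapsed time and direction as the angle of the line through the previous and current positions; the function name and the (x, y) center representation are assumptions made for illustration.

```python
import math

def speed_and_direction(prev_pos, curr_pos, dt):
    """Speed and direction hypotheses between two analysed instants.

    prev_pos / curr_pos are (x, y) blob centres on the rectified plane
    and dt is the analysis interval between them.
    """
    dx = curr_pos[0] - prev_pos[0]
    dy = curr_pos[1] - prev_pos[1]
    speed = math.hypot(dx, dy) / dt     # displacement divided by elapsed time
    theta = math.atan2(dy, dx)          # angle through previous and current positions
    return speed, theta
```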
Specification of Simple Behaviors: The system has
to respond to a series of queries intended to find out
behavioral patterns of objects in the scene (see Table
3). These queries are defined as functions and return
a logical value, which is true if they are fulfilled for a
specific object. They are represented in the following
format:
query(parameter1, parameter2, ..., parametern)
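A few of the Table 3 queries written as boolean functions, as a sketch; the object and static-object dictionaries, the centre-distance measure in isCloseTo, and the angular tolerance in hasDirection are assumptions, not details given in the paper.

```python
import math

def has_speed_between(obj, vmin, vmax):
    """hasSpeedBetween(min, max): True if the object's speed lies in [min, max]."""
    return vmin <= obj["speed"] <= vmax

def is_close_to(obj, distance, static_object):
    """isCloseTo(distance, staticObject): True if the object is within
    `distance` of the static object's centre."""
    d = math.hypot(obj["xc"] - static_object["xc"],
                   obj["yc"] - static_object["yc"])
    return d < distance

def has_direction(obj, static_object, tolerance=math.pi / 8):
    """hasDirection(staticObject): True if the object moves towards the
    static object, within an angular tolerance."""
    target = math.atan2(static_object["yc"] - obj["yc"],
                        static_object["xc"] - obj["xc"])
    diff = abs(math.atan2(math.sin(obj["theta"] - target),
                          math.cos(obj["theta"] - target)))
    return diff < tolerance
```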
Specification of Complex Behaviors: Concerning
the complex behaviors, two categories are differen-
tiated:
Local Complex Behaviors. Objects in the scene
are associated to a state machine that indicates the
state they are in (what they are doing at a time instant).
This state machine can be seen as a directed graph
where the vertices are the possible states of the object
and the edges are the basic functions or queries previ-
ously discussed. An edge has at least one associated
condition: the expected outcome (true or false) of the
assessment of a query qi, indicating an action of the
object. Therefore, an edge can have more than one
query associated with it.
Preprocessing | Details
Speed Hypothesis | The average speed of each object is calculated by dividing the displacement (∆x) by the time elapsed (∆t) in each frame.
Direction and Moving Direction Hypothesis | To find out the direction of objects, the angle of the straight line that passes through the positions of the previous and current instants is calculated.
Image Rectification | Perspective distortion occurs because the distance between the furthest points from the camera is less than the distance between the closest points. The real position is measured through the weighted distance of the four manually placed points closest to the position to be interpolated.
Data Smoothing | The data taken at two time instants is separated by enough time to avoid small distortions, but this separation is small enough to enable accurate results. The distance between both consecutive time instants is called the analysis interval. At each analysis interval, the value of the hypotheses is updated, but the old value is not automatically substituted: the mean of the values is used to calculate the value at that instant.

Table 2: Preprocessing techniques
Type | Queries | Description
Movement | hasSpeedBetween(min, max) | True if an object moves with a speed within the range [min, max].
Movement | hasSpeedGreaterThan(speed) | True if an object moves with a speed greater than speed.
Direction | hasDirection(staticObject) | True if an object goes toward staticObject, staticObject being a static object of the scene.
Direction | isFollowing() | True if a dynamic object is following another one. The displacement angle is used.
Position | isInsideZone(staticObject) | True if a dynamic object is in area staticObject.
Position | isCloseTo(distance, staticObject) | True if the distance to staticObject is less than distance.
Position | enterInScene() | True if an object appears for the first time in the scene.

Table 3: Simple queries
For an edge with several associated queries, all of
them have to be fulfilled for the transition to fire. If a
more complex rule is needed, where disjunctions also
appear so that an object changes states, the rule must
be divided into two edges.
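A sketch of such a local state machine with conjunctive edges follows; the class layout, state names and (query, expected) pairs are illustrative, not the paper's implementation.

```python
class LocalStateMachine:
    """Directed graph whose vertices are object states and whose edges
    carry one or more queries; an edge fires only if every associated
    query returns the expected value (conjunction)."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.edges = []                    # (src, dst, [(query, expected), ...])

    def add_edge(self, src, dst, conditions):
        self.edges.append((src, dst, conditions))

    def step(self, obj):
        for src, dst, conditions in self.edges:
            if src != self.state:
                continue
            if all(query(obj) == expected for query, expected in conditions):
                self.state = dst           # all queries fulfilled -> transition
                break
        return self.state
```

For instance, an edge from an illustrative "walking" state to a "near cashpoint" state could carry the single condition (lambda o: is_close_to(o, 50, cashpoint), True), reusing the query predicates sketched earlier.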
Global Complex Behaviors. To detect global be-
havior patterns, more than just the local state machines
is needed, since the states of the global state machines
are composed of the local ones. These patterns are repre-
sented through state machines whose vertices repre-
sent a possible state in the scene. Just like in the local
state machines, the edges are made up of a series of
queries that must be fulfilled at a certain time for the
scene to change states.

Figure 7: Class View.
3 IMPLEMENTATION
The prototype implementation based on the proposed
surveillance architecture must fulfill the following ob-
jectives:
Obtain a detector of strange behaviors and intru-
sions in a monitored environment.
Provide a web interface to allow a view indepen-
dent of the platform.
Provide a scalable prototype.
The proposal captures information from multiple
sources, from traditional surveillance sensors (such as
IR barriers or motion detectors) to different spectrum
cameras (color, thermal and so on). These sensors are
placed on a two-dimensional map to show the current
system state.
3.1 Class view
The architecture design is defined by the class view
(see Fig. 7). This diagram shows the classes and in-
terfaces that integrate the system, as well as their re-
lations. Thus, not only the static view of the system
is established, but also the interactions among the dif-
ferent components.

Figure 8: Implementation View.

The control process class is responsible for con-
trolling the acquisition, segmentation, tracking and
behaviors detection (activities). The proposal oper-
ates in two modes, desktop mode and web mode. The
desktop mode is designed to allow a higher interaction
with the user, providing interactive views of the differ-
ent cameras, selected from the map, and warning the
user about the alarms through a pop-up mechanism.
On the other hand, the web mode shows read-only in-
formation about the state of the sensors located on the
map and the triggered alarms. By means of a thresh-
olding mechanism, the application triggers two dif-
ferent alarm types: Pre-Alarm, for values below the
threshold, and Alarm, for values above the threshold.
As behavior patterns are detected through a state ma-
chine mechanism, an alarm value is associated with each
state, which is compared to the threshold in order
to differentiate between "Alarm" and "Pre-Alarm".
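The thresholding that separates the two alarm types could be as simple as the following sketch; the threshold value is illustrative.

```python
def classify_alarm(state_alarm_value, threshold=0.5):
    """Map the alarm value associated with a behavior state to the two
    alarm types used by the interface."""
    return "Alarm" if state_alarm_value > threshold else "Pre-Alarm"
```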
3.2 Implementation view
The implementation view defines the components that
hold the classes defined in the class view. These com-
ponents define the architecture of the system. As
seen in Fig. 8, a DLL module is implemented for
each stage of the proposed architecture, namely Im-
age Capture to capture images from the cameras, Seg-
mentation based on accumulative computation, Track-
ing based on distance computation and Activities based
on finite state automata. Moreover, a control class is
added to organize the surveillance architecture and a
user interface to visualize results and manage config-
urations.
3.3 Interface
As aforementioned, the system can operate in two dif-
ferent modes. Regarding the desktop mode (see Fig.
6), the interface provides a map to allocate the dif-
ferent sensors (IR barrier, opening sensor, camera).
A color code is utilized to indicate the sensor state
(grey means "unresponsive", green "ok", yellow "Pre-
Alarm" and red "Alarm"). IR barriers and motion de-
tectors are set in red on the map when the sensor is
activated. The cameras can be set in yellow in case of
Pre-Alarm, or red in case of Alarm. Besides, the inter-
face provides different features: turning the system
operation on or off, and showing sensor coverage and id.
Moreover, when selecting a camera on the map, the result
of its operation is shown. The system offers a view of
up to eight cameras simultaneously. The view of the
camera process is scaled according to the number of
selected cameras. In the lower left side of the inter-
face, the alarm detection log is shown. Each time the
system triggers an alarm, a new line is added to the
log. The information that appears in the log is based on
the detected behavior. Thus, the alarm information
is summarized in four columns: alarm type,
hour of alarm, behavior, and alarm state (accepted or
canceled by the user). Under an alarm condition,
user interaction is required to confirm or cancel the
alarm. Once the user has confirmed the alarm, the
sensor involved returns to the normal state (green color).

Figure 6: Desktop prototype interface.
On the other hand, the web mode is designed to
show concise information about the system state. As
seen in Fig. 9, the interface shows a map where the
sensors are placed whilst keeping the color code uti-
lized in the desktop mode, as well as the alarm log
structure.
4 DATA AND RESULTS
The tests were carried out using the cases that
the CAVIAR project
(http://homepages.inf.ed.ac.uk/rbf/CAVIAR/) makes
available for researchers. For the tests, videos recorded
with a wide angle camera in the lobby of INRIA
Labs in Grenoble, France, were used. In these
scenes, there are different people interacting with the
environment and with each other. The datasets used
were walk1, browse2, browse3, rest1 and browse
while waiting (Bww).

Figure 9: Web Capture.
Since segmentation forms the first phase in the
proposed architecture, its results must be as accurate
as possible, because the remaining blocks are built
over it. Its results are shown in Table 4, presenting
an F-score close to 95%, a sensitivity of 92% and an
accuracy of 97%, with most values close to 100%.

Dataset | Accuracy | Sensitivity | F-Score
Browse2 | 0.885 | 0.982 | 0.935
Browse3 | 0.992 | 0.855 | 0.919
Bww | 0.995 | 0.964 | 0.979
Walk1 | 0.996 | 0.915 | 0.954
Rest1 | 0.993 | 0.917 | 0.953
Mean | 0.972 | 0.927 | 0.948

Table 4: Segmentation algorithm results with CAVIAR datasets

Figure 10: Tracking results. Left: Browse2 sequence; center: Walk1 sequence; right: Bww1 sequence.
In the absence of a tool to verify the tracking and ac-
tivity results in a quantitative way, these are shown
qualitatively. In the first place, the results of the tracking
algorithm are offered (see Fig. 10). This algorithm
uses as input the previously segmented blobs.
For the activities phase, the tests were focused on a
complex behavior: position and direction analysis. It
must be pointed out that activity detection takes only
one out of every six frames. This provides robustness
against the detection and tracking noise, while pre-
serving enough accuracy to perform inferences about
object trajectories. Global behavior patterns are rep-
resented through state machines whose vertices repre-
sent a possible state in the scene. As with the
local state machines, the edges are made up of a series
of queries that must be fulfilled at a certain time for
the scene to change states (see Fig. 11).

Figure 11: Global diagram, holdup at a cashpoint.

To finish with our tests, Table 5 shows an extract
of the results for position and direction analysis. The
sequence used was "Browse2".

Frame | Alert | Object | State | Objective
437 | Pre-Alarm | 1 | Initial | Cashpoint
443 | Alarm | 1 | Going towards the cashpoint | Cashpoint
563 | Alarm | 1 | Close to the cashpoint | Cashpoint

Table 5: Position and direction detection.
5 CONCLUSIONS
This article has introduced an intelligent surveillance
system by integrating segmentation, tracking and ac-
tivities detection algorithms. The system is able to
detect behaviors and report information to the user
thanks to an attractive and functional interface. As
future work, it is planned to add new sensor types
for surveillance and to move to a distributed archi-
tecture.
REFERENCES
Amer, A., Dubois, E., and Mitiche, A. (2005). Rule-based
real-time detection of context-independent events in
video shots. Real-Time Imaging, 11:244–256.
Ayers, D. and Shah, M. (2001). Monitoring human behavior
from video taken in an office environment. Image and
Vision Computing, 19(12):833–846.
Davis, J. and Sharma, V. (2007). Background-subtraction in
thermal imagery using contour saliency. International
Journal of Computer Vision, 71:161–181.
Delgado, A., López, M., and Fernández-Caballero, A.
(2010). Real-time motion detection by lateral inhi-
bition in accumulative computation. Engineering Ap-
plications of Artificial Intelligence, 23:129–139.
Fernández-Caballero, A., Castillo, J., Martínez-Cantos, J.,
and Martínez-Tomás, R. (2010). Optical flow or image
subtraction in human detection from infrared camera
on mobile robot. Robotics and Autonomous Systems,
58:1273–1281.
Fernández-Caballero, A., Castillo, J., Serrano-Cuerda, J.,
and Maldonado-Bascón, S. (2011). Real-time human
segmentation in infrared videos. Expert Systems with
Applications, 38:2577–2584.
Gascueña, J. and Fernández-Caballero, A. (2011). Agent-
oriented modeling and development of a person-
following mobile robot. Expert Systems with Appli-
cations, 38(4):4280–4290.
Hongeng, S., Nevatia, R., and Bremond, F. (2004). Video-
based event recognition: activity representation and
probabilistic recognition methods. Computer Vision
and Image Understanding, 96(2):129–162.
Isard, M. and Blake, A. (1998). Condensation - conditional
density propagation for visual tracking. International
Journal of Computer Vision, 29:5–28.
Ivanov, Y. and Bobick, A. (2000). Recognition of visual
activities and interactions by stochastic parsing. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 22(8):852–872.
Kim, J., Lee, C., Lee, K., Yun, T., and Kim, H. (2001).
Wavelet-based vehicle tracking for automatic traffic
surveillance. In Proceedings of IEEE Region 10 In-
ternational Conference on Electrical and Electronic
Technology, volume 1, pages 313–316.
Koller, D., Danilidis, K., and Nagel, H.-H. (1993). Model-
based object tracking in monocular image sequences
of road traffic scenes. International Journal of Com-
puter Vision, 10:257–281.
Lavee, G., Rivlin, E., and Rudzsky, M. (2009). Understand-
ing video events: a survey of methods for automatic
interpretation of semantic occurrences in video. IEEE
Transactions on Systems, Man, and Cybernetics, Part
C: Applications and Reviews, 39(5):489–504.
Lézoray, O. and Charrier, C. (2009). Color image segmen-
tation using morphological clustering and fusion with
automatic scale selection. Pattern Recognition Let-
ters, 30:397–406.
Maldonado-Bascón, S., Lafuente-Arroyo, S., Gil-Jiménez,
P., Gómez-Moreno, H., and López-Ferreras, F. (2007).
Road-sign detection and recognition based on support
vector machines. IEEE Transactions on Intelligent
Transportation Systems, 8(2):264–278.
Masoud, O. and Papanikolopoulos, N. (2001). A novel
method for tracking and counting pedestrians in real-
time using a single camera. IEEE Transactions on Ve-
hicular Technology, 50(5):1267–1278.
McCane, B., Galvin, B., and Novins, K. (2002). Algorith-
mic fusion for more robust feature tracking. Interna-
tional Journal of Computer Vision, 49:79–89.
Moreno-Garcia, J., Rodriguez-Benitez, L., Fernández-
Caballero, A., and López, M. (2010). Video sequence
motion tracking by fuzzification techniques. Applied
Soft Computing, 10:318–331.
Moreno-Noguer, F., Sanfeliu, A., and Samaras, D. (2008).
Dependent multiple cue integration for robust track-
ing. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 30:670–685.
Natarajan, P. and Nevatia, R. (2008). View and scale in-
variant action recognition using multiview shape-flow
models. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 1–8.
Neumann, B. and Möller, R. (2008). On scene interpretation
with description logics. Image and Vision Computing,
26:82–101.
Oliver, N. and Horvitz, E. (2005). A comparison of HMMs
and dynamic Bayesian networks for recognizing office
activities. In 10th International Conference on User
Modeling, pages 199–209.
Regazzoni, C. and Marcenaro, L. (2000). Object detection
and tracking in distributed surveillance systems using
multiple cameras. Kluwer Academic Publishers.
Stauffer, C. and Grimson, W. (1999). Adaptive back-
ground mixture models for real-time tracking. In 1999
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition, volume 2.
Ulusoy, I. and Bishop, C. (2005). Generative versus dis-
criminative methods for object recognition. In IEEE
Computer Society Conference on Computer Vision
and Pattern Recognition, volume 2, pages 258–265.
Yilmaz, A., Shafique, K., and Shah, M. (2003). Target
tracking in airborne forward looking infrared imagery.
Image and Vision Computing, 21(7):623–635.