MULTISENSORY ARCHITECTURE FOR INTELLIGENT
SURVEILLANCE SYSTEMS
Integration of segmentation, tracking and activity analysis
Keywords: Intelligent surveillance systems; Monitoring architecture; Segmentation; Tracking; Activity analysis.
Abstract: Intelligent surveillance systems deal with all aspects of threat detection in a given scene; these range from
segmentation to activity interpretation. The proposed architecture is a step towards solving the detection and
tracking of suspicious objects as well as the analysis of the activities in the scene. It is important to include
different kinds of sensors for the detection process. Indeed, their mutual advantages enhance the performance
provided by each sensor on its own. The results of the multisensory architecture offered in the paper, obtained
from testing the proposal on CAVIAR project data sets, are very promising at the three proposed levels,
that is, segmentation based on accumulative computation, tracking based on distance computation and activity
analysis based on finite state automata.
1 INTRODUCTION
Nowadays, obtaining surveillance systems able to
cover all levels of traditional image processing stages
is still a very challenging issue. Intelligent surveil-
lance systems must be able to predict potential sus-
picious behaviors, to work semi-automatically, and
to warn the human operator if necessary. More-
over, for a human operator, the task of looking
at a monitor during hours is very monotonous.
That is why a system able to detect suspicious
situations is essential for an efficient surveillance
(Gascue˜
na and Fern´
andez-Caballero, 2011).
1.1 Segmentation in surveillance
Regarding the detection/segmentation process, it is
important to have different kinds of sensors. In-
deed, their mutual advantages enhance the perfor-
mance provided by each sensor on its own. For
instance, when working with surveillance cameras,
complementing color with thermal-infrared cameras
provides the surveillance systems with the ability to
perform detection in almost any situation, regard-
less of the environmental conditions (e.g. illumina-
tion changes, fog or smoke).
It is known that color is a very important fea-
ture for object recognition. Several approaches can
be found in the bibliography devoted to color image
segmentation (e.g. (L´
ezoray and Charrier, 2009)). In
(Maldonado-Basc´
on et al., 2007)RGB and HSI color
spaces are used for the detection of traffic signals.
There are other proposals that use image features
combined with color to solve the segmentation prob-
lem (e.g. (Moreno-Noguer et al., 2008)).
Concerning infrared cameras, many proposals rely
on the assumption that objects of interest, mostly
pedestrians, possess a temperature higher
than their surroundings (Yilmaz et al., 2003). Ther-
mal infrared video cameras detect relative differences
in the amount of thermal energy emitted/reflected
from objects in the scene. As long as the thermal
properties of a foreground object are slightly differ-
ent (higher or lower) from the background radiation,
the corresponding region in a thermal image appears
at a contrast with its environment. A technique
based on background subtraction for the detection of
objects under different environmental conditions has
been proposed (Davis and Sharma, 2007).
1.2 Tracking in surveillance
Tracking objects of interest in a scene is another
key step prior to video surveillance events detec-
tion (Regazzoni and Marcenaro, 2000). Tracking
approaches can be classified into four main cate-
gories: tracking based on the moving object region,
where the bounding box surrounding an object is
tracked in 2D space (Masoud and Papanikolopoulos,
2001); tracking based on the moving object contour,
where a contour defined by curves delimiting the
moving objects is dynamically updated (Isard and
Blake, 1998); tracking based on a model of the moving
object, where the 3D geometry of a moving object is
defined (Koller et al., 1993); and tracking based on the
features of the moving object, where some features of
the objects are extracted and monitored (e.g. vertices
in vehicle tracking (McCane et al., 2002)).
Besides these categories, there exist other propos-
als such as wavelet analysis-based (Kim et al., 2001)
or Kalman filter-based (Stauffer and Grimson, 1999)
tracking, among others.
1.3 Activity analysis in surveillance
In (Lavee et al., 2009) some recent approaches for
video event understanding are presented. The im-
portance of the two main components of the event
understanding process, abstraction and event mod-
eling, is pointed out. Abstraction corresponds
to the process of molding the data into informa-
tive units to be used as input to the event model
(Natarajan and Nevatia, 2008), while event modeling
is devoted to formally describing events of interest
and enabling recognition of these events as they oc-
cur in the video sequence (Ulusoy and Bishop, 2005).
In close relation to finite state machines theory,
in (Ayers and Shah, 2001) the authors describe a sys-
tem which automatically recognizes human actions in
a room from video sequences. The system recognizes
the actions by using prior knowledge about the lay-
out of the room. Indeed, action recognition is mod-
eled by a state machine, which consists of ’states’
and ’transitions’ between states. Another approach
(Hongeng et al., 2004) models scenario events from
shape and trajectory features using a hierarchical ac-
tivity representation, where events are organized into
several layers of abstraction, providing flexibility and
modularity in the modeling scheme. An event is consid-
ered to be composed of action threads, each thread
being executed by a single actor. A single-thread ac-
tion is represented by a stochastic finite automaton of
event states that are recognized from the characteris-
tics of the trajectory and shape of the moving blob of
the actor.
1.4 The proposal for intelligent
surveillance
The proposed architecture aims to solve the de-
tection and tracking of suspicious objects as
well as the analysis of the activities detected
in the scene. The approach is closely related
to the works of (Ivanov and Bobick, 2000) and
(Hongeng et al., 2004) in the sense that the exter-
nal knowledge about the problem domain is incorpo-
rated into the expected structure of the activity model.
Motion-based image features are linked explicitly to
a symbolic notion of hierarchical activity through
several layers of more abstract activity descriptions.
Atomic actions are detected at a low level and fed to
hand-crafted grammars to detect activity patterns of
interest. The inspiration is also close to the paper by
(Amer et al., 2005), as we work with shape and trajec-
tory to indicate the events related to moving objects.
In comparison to other approaches, such as Bayesian
networks or HMMs (Oliver and Horvitz, 2005), the
proposal is unable to model uncertainty in video
events, but it is presented as a useful tool in video
event understanding because of its simplicity and its
ability to model temporal sequence and to easily in-
corporate new actions.
2 ARCHITECTURE DEFINITION
This section describes in detail the different phases
proposed for the intelligent multisensor surveillance
architecture. Fig. 1 schematically depicts the sys-
tem processing stages. Processing starts after cap-
turing color images as well as infrared images. The
use of two different spectral images allows the detec-
tion of objects of interest independently of lighting
or environmental conditions (day, night, fog, smoke,
etc.). The segmentation algorithm detects motion of
the scene objects. As a result, the segmentation algo-
rithm generates a list of blobs detected in the scene.
The blobs are used as the inputs to the tracking phase.
Following the segmentation process, a simple but
robust tracking approach is proposed. An identifier
is assigned to each object until it leaves the scene or
the segmentation algorithm does not detect it for a de-
termined period of time (application dependent). Fi-
nally, the identified blobs are passed to the activity
analysis stage, where some predefined activities are
modeled by means of state machines. At this stage
the different behaviors and their associated alarm lev-
els are obtained. To configure the parameters needed
for segmentation and tracking, as well as the definition
of the static objects in the scene, a modeling stage has
been included in the proposal.

Figure 1: The proposed architecture
2.1 Segmentation based on
accumulative computation
This section describes the proposed segmen-
tation method integrated into the architecture.
The method has been tested on visual and non
visual spectrum (infrared) image sequences
with promising results in detecting moving ob-
jects (e.g. (Fernández-Caballero et al., 2011;
Fernández-Caballero et al., 2010)). The accumula-
tive computation method consists of the five phases
described next (and also depicted in Fig. 2):
Preprocessing: This phase performs the preprocess-
ing of the input images. Filters such as the mean or
the median are applied in order to enhance the image
contrast and to smooth the image noise.
Grey level bands segmentation: This phase is in
charge of splitting the input image, I(x,y,t), into k
grey level bands. That is, each image is binarized in
ranges according to the k bands following equation 1.
Equation 1 shows how, at a given time t, a pixel can
only belong to a unique grey level band.

$$GLS_k(x,y,t) = \begin{cases} 0, & \text{if } I(x,y,t) \neq k,\; k \in [0,255] \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$
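As an illustration of this banding step, the sketch below splits an 8-bit grey image into k binary masks. The equal-width band boundaries are an assumption made here; the paper does not state how the k ranges are chosen.

```python
import numpy as np

def grey_level_bands(image: np.ndarray, k: int) -> np.ndarray:
    """Split an 8-bit grey image into k binary band masks GLS_k (Eq. 1).

    Returns an array of shape (k, H, W) where band b is 1 at the pixels
    whose grey level falls inside that band and 0 elsewhere, so every
    pixel belongs to exactly one band at a given time t.
    """
    edges = np.linspace(0, 256, k + 1)            # assumed equal-width bands
    bands = np.zeros((k,) + image.shape, dtype=np.uint8)
    for b in range(k):
        inside = (image >= edges[b]) & (image < edges[b + 1])
        bands[b][inside] = 1
    return bands
```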
Accumulative computation: This phase obtains one
sub-layer per each band defined in the previous phase.
Each band stores the pixels' accumulative computation
values. In the first place, the method establishes a
permanence memory value for each pixel, PMk(x,y,t).
It is assumed that motion takes place at a pixel when
the value of that pixel falls into a new band. For each
pixel (x,y), at a given time t, and in a band k, the
following possibilities must be taken into account when
comparing with the pixel at time t-1 (see equation 2).
A complete description may be found in
(Delgado et al., 2010).

$$PM_k(x,y,t) = \begin{cases} v_{des}, & \text{if } GLS_k(x,y,t) = 0 \\ v_{sat}, & \text{if } GLS_k(x,y,t) = 1 \wedge GLS_k(x,y,t-1) = 0 \\ \max\{PM_k(x,y,t-1) - v_{dm},\; v_{des}\}, & \text{if } GLS_k(x,y,t) = 1 \wedge GLS_k(x,y,t-1) = 1 \end{cases} \qquad (2)$$

Figure 2: The phases of the accumulative computation method
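A minimal sketch of the permanence-memory update of equation 2 follows. The values of v_des, v_sat and v_dm (discharge, saturation and decrement) shown here are illustrative; the paper takes them from (Delgado et al., 2010).

```python
import numpy as np

def update_permanence(pm_prev, gls_now, gls_prev,
                      v_des=0, v_sat=255, v_dm=32):
    """One accumulative-computation step per band (Eq. 2).

    All arrays have shape (k, H, W); pm_prev holds PM_k(x,y,t-1) and
    gls_now / gls_prev the band masks at t and t-1.
    """
    pm_prev = pm_prev.astype(np.int32)
    pm = np.full_like(pm_prev, v_des)            # GLS_k == 0 -> discharge value
    appeared = (gls_now == 1) & (gls_prev == 0)  # pixel has just entered the band
    pm[appeared] = v_sat                         # -> saturate (motion detected)
    stayed = (gls_now == 1) & (gls_prev == 1)    # pixel stays inside the band
    pm[stayed] = np.maximum(pm_prev[stayed] - v_dm, v_des)
    return pm
```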
Fusion: This phase fuses the information coming
from the k accumulative computation layers. The aim
is to detect the movement of the objects in the scene.
For this, a new layer is created to store the k sub-
bands of the accumulative computation. Each pixel
is assigned the maximum value of the different sub-
bands following equation 3. Next, a thresholding is
performed to discard regions with low motion. Clos-
ing and opening morphological operations eliminate
isolated pixels and unite close regions wrongly split
by the thresholding operation.

$$S(x,y,t) = \max_k \, PM_k(x,y,t) \qquad (3)$$
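The fusion of equation 3, followed by the thresholding and the closing/opening operations, could look as in the sketch below; the motion threshold and the 3x3 structuring element are illustrative choices, not values given in the paper.

```python
import numpy as np
from scipy import ndimage

def fuse_bands(pm, motion_threshold=128):
    """Fuse the k permanence sub-bands (Eq. 3) and clean the result.

    pm has shape (k, H, W); returns a binary motion mask.
    """
    fused = pm.max(axis=0)                       # S(x,y,t) = max_k PM_k(x,y,t)
    motion = fused >= motion_threshold           # discard low-motion regions
    structure = np.ones((3, 3), dtype=bool)
    motion = ndimage.binary_closing(motion, structure)  # unite split regions
    motion = ndimage.binary_opening(motion, structure)  # drop isolated pixels
    return motion
```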
Objects segmentation: This phase obtains the areas
containing moving regions. As output, a blob list
LB is obtained and passed to the higher layers of the
architecture.
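A possible sketch of this final blob-extraction step, using connected-component labeling to build the blob list LB; the minimum-area filter is an assumption added for illustration.

```python
from scipy import ndimage

def extract_blobs(motion_mask, min_area=50):
    """Label connected moving regions and return their bounding boxes LB."""
    labels, _ = ndimage.label(motion_mask)
    blobs = []
    for sl in ndimage.find_objects(labels):
        if sl is None:
            continue
        ys, xs = sl
        if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
            blobs.append({"xmin": xs.start, "xmax": xs.stop - 1,
                          "ymin": ys.start, "ymax": ys.stop - 1})
    return blobs
```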
2.2 Tracking based on distance
computation
The second level of the proposed architec-
ture consists of tracking the segmented objects
(Moreno-Garcia et al., 2010). This tracking proposal
consists of the four stages described below:
Object labeling: The tracking approach uses the re-
sult of the previous segmentation stage, that is, the
segmented spots, LB, though the tracking algorithm
has its own list of blobs, LT, updated over time with
the tracking results. Firstly, each blob contained in
LB, denoted LBi, where i ∈ {0,1,...,N} and N is the
number of elements in LB, is compared to all blobs
contained in LT. The aim is to calculate the distance
between the centers of the boxes associated to the
blobs. The centers of the LB blobs are calculated as
shown in equations 4 and 5. The centers of the LT
blobs are calculated in a similar way.

$$LB_i.x_c = \frac{LB_i.x_{min} + LB_i.x_{max}}{2} \qquad (4)$$

$$LB_i.y_c = \frac{LB_i.y_{min} + LB_i.y_{max}}{2} \qquad (5)$$

where LBi.xmin and LBi.ymin are the initial coordinates
of blob LBi, and LBi.xmax and LBi.ymax the final ones.
According to equation 6, blobs LTj with a distance
between centers below a prefixed threshold are se-
lected as candidates to be the previous position of
blob LBi at time instant t-1. The blob with the
minimum distance to LBi is selected as the previous
position of the current blob.

$$d = \sqrt{(LB_i.x_c - LT_j.x_c)^2 + (LB_i.y_c - LT_j.y_c)^2} \qquad (6)$$
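The center computation of equations 4 and 5 and the candidate selection of equation 6 reduce to the sketch below, where max_distance stands for the prefixed threshold and its value is illustrative.

```python
import math

def box_center(blob):
    """Center of a blob's bounding box (Eqs. 4 and 5)."""
    xc = (blob["xmin"] + blob["xmax"]) / 2
    yc = (blob["ymin"] + blob["ymax"]) / 2
    return xc, yc

def match_blob(lb_blob, tracked_blobs, max_distance=40.0):
    """Return the tracked blob LT_j closest to lb_blob (Eq. 6), or None."""
    xc, yc = box_center(lb_blob)
    best, best_d = None, max_distance
    for lt in tracked_blobs:
        txc, tyc = box_center(lt)
        d = math.hypot(xc - txc, yc - tyc)      # distance between centers
        if d <= best_d:
            best, best_d = lt, d
    return best
```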
Blob updating: Once the segmented blobs, LB, are
associated to their identifiers, a smoothing process is
performed to reduce the effect of noise introduced
during the detection process. This way, the box size
is smoothed to avoid abrupt variations. If a foreseen
blob is not detected during the segmentation process,
L'T = LT − LB, a prediction about its possible trajec-
tory is performed through a mean distance increment,
based on ∆xc and ∆yc, and a displacement angle cal-
culated between consecutive frames. If the perma-
nence memory value reaches its minimum, the blob is
discarded as it is considered to leave the scene.
Size adjustment: Given that the output of the seg-
mentation algorithm is not always accurate, the de-
tected blobs might include some noise that modifies
the size of their containing boxes. In order to alle-
viate the effects of noise, the obtained blobs are soft-
ened according to a weighted mean of their height and
width. Also, depending on the motion direction of a
blob, its position may be modified. Fig. 3 depicts a sit-
uation where a blob LTj moves between two consec-
utive times with a significant variation in size. Only
the height component, LTj(t).h (denoted as LTj.h),
is shown to keep the figure simple. The height and
width are calculated as shown in equations 7 and 8,
respectively.

Figure 3: Height resizing parameters.

$$LT_j.h = \frac{M-\lambda}{M} \cdot \overline{LT.h} + \frac{\lambda}{M} \cdot LT_j.h \qquad (7)$$

$$LT_j.w = \frac{M-\lambda}{M} \cdot \overline{LT.w} + \frac{\lambda}{M} \cdot LT_j.w \qquad (8)$$
where M is the number of previous instances of blob
LTj used to calculate the mean, λ is the weight given to
the current height and width values, and LT.h and LT.w
are the mean height and width, respectively. Once the
blob's height and width are smoothed, the blob's po-
sition is also enhanced. For this, the displacement an-
gle, θ, the blob center, (LTj.xc, LTj.yc), and the new
height and width are taken into account. This is per-
formed through equations 9 and 10 for the x-axis maxi-
mum and minimum coordinates, and equations 11 and
12 for the y-axis components.

$$LT_j.x_{max} = \begin{cases} \tfrac{3}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta \geq 0 \\ \tfrac{2}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta < 0 \end{cases} \qquad (9)$$

$$LT_j.x_{min} = \begin{cases} -\tfrac{2}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta \geq 0 \\ -\tfrac{3}{5} \cdot LT_j.w + LT_j.x_c, & \text{if } \cos\theta < 0 \end{cases} \qquad (10)$$

$$LT_j.y_{max} = \begin{cases} \tfrac{3}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta \geq 0 \\ \tfrac{2}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta < 0 \end{cases} \qquad (11)$$

$$LT_j.y_{min} = \begin{cases} -\tfrac{2}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta \geq 0 \\ -\tfrac{3}{5} \cdot LT_j.h + LT_j.y_c, & \text{if } \sin\theta < 0 \end{cases} \qquad (12)$$
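A sketch of the size adjustment of equations 7 to 12, under the reading that λ weights the current size and the box is shifted 3/5 ahead of and 2/5 behind the center along the motion direction; the values of M and λ, and the dictionary-based blob representation, are illustrative.

```python
import math

def smooth_box(lt_blob, mean_h, mean_w, theta, M=5, lam=1.0):
    """Smooth a tracked blob's size and re-centre its box (Eqs. 7-12).

    mean_h / mean_w are the mean height and width over the last M
    instances of the blob and theta is the displacement angle.
    """
    h = lt_blob["ymax"] - lt_blob["ymin"]
    w = lt_blob["xmax"] - lt_blob["xmin"]
    h = (M - lam) / M * mean_h + lam / M * h          # Eq. 7
    w = (M - lam) / M * mean_w + lam / M * w          # Eq. 8
    xc = (lt_blob["xmin"] + lt_blob["xmax"]) / 2
    yc = (lt_blob["ymin"] + lt_blob["ymax"]) / 2
    # The box extends further in the direction of motion (Eqs. 9-12).
    fwd_x, back_x = (3 / 5, 2 / 5) if math.cos(theta) >= 0 else (2 / 5, 3 / 5)
    fwd_y, back_y = (3 / 5, 2 / 5) if math.sin(theta) >= 0 else (2 / 5, 3 / 5)
    return {"xmin": xc - back_x * w, "xmax": xc + fwd_x * w,
            "ymin": yc - back_y * h, "ymax": yc + fwd_y * h}
```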
Figure 4: Tolerance areas and their thresholds.

Trajectory prediction: This phase uses the L'T blobs
list; that is, this phase works with the blobs not up-
dated in the previous segmentation phase. For each of
them, LT'k, its permanence memory value is reduced
under the assumption that the object is probably not
present in the scene, as shown in equation 13.

$$p(LT'_k) = p(LT'_k) - \delta \qquad (13)$$

where δ is a predefined discharge value. Two different
permanence zones are defined in the image, each one
with a different threshold, µl and µh, with µh >> µl (see
Fig. 4). In the external zone a blob is discarded if its
permanence value drops below µl. As a result, a
blob close to the image border is discarded more easily,
as it is assumed to abandon the scene. Inner blobs not
detected during the segmentation phase are assumed
to be still.
Trajectories are predicted for those blobs con-
tained in L'T with permanence above the aforesaid
thresholds. This involves the previous calculation
of the mean displacement between frames, ∆xc and ∆yc.
Thus, the blob coordinates are calculated as:

$$x_{min,t} = x_{min,t-1} + \cos\theta \cdot \Delta x_c \qquad (14)$$

$$x_{max,t} = x_{max,t-1} + \cos\theta \cdot \Delta x_c \qquad (15)$$

$$y_{min,t} = y_{min,t-1} + \sin\theta \cdot \Delta y_c \qquad (16)$$

$$y_{max,t} = y_{max,t-1} + \sin\theta \cdot \Delta y_c \qquad (17)$$

The content of list L'T updates the information of
LT, the input to the next phase, which uses the blobs
and their associated identifiers to define the activities
carried out in the scene.
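The bookkeeping of equations 13 to 17 could be sketched as follows; the values of δ and of the two thresholds µl and µh, and the dictionary-based blob representation, are assumptions made for illustration.

```python
import math

def predict_missing(blob, theta, dxc, dyc, delta=20,
                    mu_low=40, mu_high=200, in_border_zone=False):
    """Handle a tracked blob not found by segmentation (Eqs. 13-17).

    dxc / dyc are the mean per-frame displacements of the blob centre.
    Returns None when the blob is discarded.
    """
    blob["permanence"] -= delta                       # Eq. 13
    threshold = mu_low if in_border_zone else mu_high
    if blob["permanence"] < threshold:
        # Border blobs are discarded; inner blobs are assumed to be still.
        return None if in_border_zone else blob
    # Otherwise predict the new position along the displacement angle.
    blob["xmin"] += math.cos(theta) * dxc             # Eq. 14
    blob["xmax"] += math.cos(theta) * dxc             # Eq. 15
    blob["ymin"] += math.sin(theta) * dyc             # Eq. 16
    blob["ymax"] += math.sin(theta) * dyc             # Eq. 17
    return blob
```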
2.3 Activity analysis based on finite
state automaton
The purpose of activity description is to reasonably
choose a group of motion words or short expressions
to report the activities of moving objects or humans in
natural scenes.
Action | Origin vertex | Description
Object information | Object speed | Makes it possible to define whether an object is still, walking, running, etc.
Object information | Object trajectory | Apart from speed, the direction and moving direction of an object can be obtained.
Environment interaction | Direction | The system determines if a person is approaching a specific area of the scenario. By taking the object's speed and trajectory as reference, the object's goal is inferred.
Environment interaction | Position | By knowing the important areas of the scenario, the system is capable of determining the relative position of dynamic objects. This way, it can detect if a person is standing in one of the areas.
Object interaction | Proximity | The system detects the distance between objects.
Object interaction | Orientation | The system determines whether an object is approaching another or whether they are both approaching each other.
Object interaction | Grouping | The system uses the parameters generated in the two previous points to detect object grouping (thanks to their proximity and direction).

Table 1: Local activities
Objects of Interest: From the ETISEO classification
proposal (available at
http://www-sop.inria.fr/orion/ETISEO), four categories
are established for dynamic objects and two for static
objects. Dynamic objects are a person, a group of
people (made up of two or more people), a portable
object (such as a briefcase) and other dynamic objects
(able to move on their own), classified as moving
object. Static objects may be areas and pieces of
equipment. The latter can be labeled as a portable
object if a dynamic object, a person or a group,
interacts with it and it starts moving.
Description of Local Activities: In order to gener-
alize the detection process we start with simple func-
tionalities that detect simple actions of the active ob-
jects in the scene. Using these functions, more com-
plex behavior patterns are built. Simple actions are
defined in Table 1.
Description of Global Activities: Interpreting a vi-
sual scene is a task that, in general, resorts to a large
body of prior knowledge and experience of the viewer
(Neumann and Möller, 2008). Through the actions or
queries described in the previous section, basic pat-
terns (e.g. the object speed or direction) and more
complex patterns (e.g. the theft of an object) can be
found. It is essential to define the desired behavioral
pattern in each situation by using the basic actions
or queries. For each specific scene, a state diagram
and a set of rules are designed to indicate the patterns.
Thus, the proposed video surveillance system is able
to detect simple actions or queries and to adapt to a
great deal of situations. It is also configured to detect
the behavioral patterns necessary in each case and to
associate an alarm level to each one.

Figure 5: Position and direction analysis.
Image Preprocessing: Input image segmentation is
not enough to detect the activities in the scene. Hence,
the system takes the initial segmentation data and in-
fers the new necessary parameters (such as the objects'
speed or direction). For this, the preprocessing tech-
niques described in Table 2 are necessary.
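A minimal sketch of the speed and direction hypotheses described in Table 2, computing speed as displacement over elapsed time and direction as the angle of the line through the previous and current positions; the function name and the (x, y) center representation are assumptions made for illustration.

```python
import math

def speed_and_direction(prev_pos, curr_pos, dt):
    """Speed and direction hypotheses between two analysed instants.

    prev_pos / curr_pos are (x, y) blob centres on the rectified plane
    and dt is the analysis interval between them.
    """
    dx = curr_pos[0] - prev_pos[0]
    dy = curr_pos[1] - prev_pos[1]
    speed = math.hypot(dx, dy) / dt     # displacement divided by elapsed time
    theta = math.atan2(dy, dx)          # angle through previous and current positions
    return speed, theta
```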
Specification of Simple Behaviors: The system has
to respond to a series of queries intended to find out
behavioral patterns of objects in the scene (see Table
3). These queries are defined as functions and return
a logical value, which is true if they are fulfilled for a
specific object. They are represented in the following
format:
query(parameter1, parameter2, ..., parametern)
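A few of the Table 3 queries written as boolean functions, as a sketch; the object and static-object dictionaries, the centre-distance measure in isCloseTo, and the angular tolerance in hasDirection are assumptions, not details given in the paper.

```python
import math

def has_speed_between(obj, vmin, vmax):
    """hasSpeedBetween(min, max): True if the object's speed lies in [min, max]."""
    return vmin <= obj["speed"] <= vmax

def is_close_to(obj, distance, static_object):
    """isCloseTo(distance, staticObject): True if the object is within
    `distance` of the static object's centre."""
    d = math.hypot(obj["xc"] - static_object["xc"],
                   obj["yc"] - static_object["yc"])
    return d < distance

def has_direction(obj, static_object, tolerance=math.pi / 8):
    """hasDirection(staticObject): True if the object moves towards the
    static object, within an angular tolerance."""
    target = math.atan2(static_object["yc"] - obj["yc"],
                        static_object["xc"] - obj["xc"])
    diff = abs(math.atan2(math.sin(obj["theta"] - target),
                          math.cos(obj["theta"] - target)))
    return diff < tolerance
```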
Specification of Complex Behaviors: Concerning
the complex behaviors, two categories are differen-
tiated:
Local Complex Behaviors. Objects in the scene
are associated to a state machine that indicates the
state they are in (what they are doing at a time instant).
This state machine can be seen as a directed graph
where the vertices are the possible states of the object
and the edges are the basic functions or queries previ-
ously discussed. An edge has at least one associated
condition: the expected outcome (true or false) of the
assessment of a query qi, indicating an action of the
object. Therefore, an edge can have more than one
query associated with it.
Preprocessing | Details
Speed Hypothesis | The average speed of each object is calculated by dividing the displacement (∆x) by the time elapsed (∆t) in each frame.
Direction and Moving Direction Hypothesis | To find out the direction of objects, the angle of the straight line that passes through the positions of the previous and current instants is calculated.
Image Rectification | Perspective distortion occurs because the distance between the furthest points from the camera is less than the distance between the closest points. The real position is measured through the weighted distance of the four manually placed points closest to the position to be interpolated.
Data Smoothing | The data taken at two time instants is separated by enough time to avoid small distortions, but this separation is small enough to enable accurate results. The distance between both consecutive time instants is called the analysis interval. At each analysis interval, the value of the hypotheses is updated, but the old value is not automatically substituted: the mean of the values is used to calculate the value at that instant.

Table 2: Preprocessing techniques
Type | Queries | Description
Movement | hasSpeedBetween(min, max) | True if an object moves with a speed within the range [min, max].
Movement | hasSpeedGreaterThan(speed) | True if an object moves with a speed greater than speed.
Direction | hasDirection(staticObject) | True if an object goes toward staticObject, staticObject being a static object of the scene.
Direction | isFollowing() | True if a dynamic object is following another one. The displacement angle is used.
Position | isInsideZone(staticObject) | True if a dynamic object is in area staticObject.
Position | isCloseTo(distance, staticObject) | True if the distance to staticObject is less than distance.
Position | enterInScene() | True if an object appears for the first time in the scene.

Table 3: Simple queries
For an edge with several associated queries, all of
them have to be fulfilled for the transition to fire. If a
more complex rule is needed, where disjunctions also
appear so that an object changes states, the rule must
be divided into two edges.
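A sketch of such a local state machine with conjunctive edges follows; the class layout, state names and (query, expected) pairs are illustrative, not the paper's implementation.

```python
class LocalStateMachine:
    """Directed graph whose vertices are object states and whose edges
    carry one or more queries; an edge fires only if every associated
    query returns the expected value (conjunction)."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.edges = []                    # (src, dst, [(query, expected), ...])

    def add_edge(self, src, dst, conditions):
        self.edges.append((src, dst, conditions))

    def step(self, obj):
        for src, dst, conditions in self.edges:
            if src != self.state:
                continue
            if all(query(obj) == expected for query, expected in conditions):
                self.state = dst           # all queries fulfilled -> transition
                break
        return self.state
```

For instance, an edge from an illustrative "walking" state to a "near cashpoint" state could carry the single condition (lambda o: is_close_to(o, 50, cashpoint), True), reusing the query predicates sketched earlier.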
Global Complex Behaviors. To detect global be-
havior patterns, more than just the local state machines
is needed, since the states of the global state machines
are composed of the local ones. These patterns are repre-
sented through state machines whose vertices repre-
sent a possible state in the scene. Just like in the local
state machines, the edges are made up of a series of
queries that must be fulfilled at a certain time for the
scene to change states.

Figure 7: Class View.
3 IMPLEMENTATION
The prototype implementation based on the proposed
surveillance architecture must fulfill the following ob-
jectives:
Obtain a detector of strange behaviors and intru-
sions in a monitored environment.
Provide a web interface to allow a view indepen-
dent of the platform.
Provide a scalable prototype.
The proposal captures information from multiple
sources, from traditional surveillance sensors (such as
IR barriers or motion detectors) to different spectrum
cameras (color, thermal and so on). These sensors are
placed on a two-dimensional map to show the current
system state.
3.1 Class view
The architecture design is defined by the class view
(see Fig. 7). This diagram shows the classes and in-
terfaces that integrate the system, as well as their re-
lations. Thus, not only the static view of the system
is established, but also the interactions among the dif-
ferent components.

Figure 8: Implementation View.

The control process class is responsible for con-
trolling the acquisition, segmentation, tracking and
behaviors detection (activities). The proposal oper-
ates in two modes, desktop mode and web mode. The
desktop mode is designed to allow a higher interaction
with the user, providing interactive views of the differ-
ent cameras, selected from the map, and warning the
user about the alarms through a pop-up mechanism.
On the other hand, the web mode shows read-only in-
formation about the state of the sensors located on the
map and the triggered alarms. By means of a thresh-
olding mechanism, the application triggers two dif-
ferent alarm types: Pre-Alarm, for values below the
threshold, and Alarm, for values above the threshold.
As behavior patterns are detected through a state ma-
chine mechanism, an alarm value is associated with each
state, which is compared to the threshold in order
to differentiate between "Alarm" and "Pre-Alarm".
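The thresholding that separates the two alarm types could be as simple as the following sketch; the threshold value is illustrative.

```python
def classify_alarm(state_alarm_value, threshold=0.5):
    """Map the alarm value associated with a behavior state to the two
    alarm types used by the interface."""
    return "Alarm" if state_alarm_value > threshold else "Pre-Alarm"
```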
3.2 Implementation view
The implementation view defines the components that
hold the classes defined in the class view. These com-
ponents define the architecture of the system. As
seen in Fig. 8, a DLL module is implemented for
each stage of the proposed architecture, namely Im-
age Capture to capture images from the cameras, Seg-
mentation based on accumulative computation, Track-
ing based on distance computation and Activities based
on finite state automata. Moreover, a control class is
added to organize the surveillance architecture and a
user interface to visualize results and manage config-
urations.
3.3 Interface
As aforementioned, the system can operate in two dif-
ferent modes. Regarding the desktop mode (see Fig.
6), the interface provides a map to allocate the dif-
ferent sensors (IR barrier, opening sensor, camera).
A color code is utilized to indicate the sensor state
(grey means "unresponsive", green "ok", yellow "Pre-
Alarm" and red "Alarm"). IR barriers and motion de-
tectors are set in red on the map when the sensor is
activated. The cameras can be set in yellow in case of
Pre-Alarm, or red in case of Alarm. Besides, the inter-
face provides different features: turning the system
operation on or off, and showing sensor coverage and id.
Moreover, when selecting a camera on the map, the result
of its operation is shown. The system offers a view of
up to eight cameras simultaneously. The view of the
camera process is scaled according to the number of
selected cameras. In the lower left side of the inter-
face, the alarm detection log is shown. Each time the
system triggers an alarm, a new line is added to the
log. The information that appears in the log is based on
the detected behavior. Thus, the alarm information
is summarized in four columns: alarm type,
hour of alarm, behavior, and alarm state (accepted or
canceled by the user). Under an alarm condition,
user interaction is required to confirm or cancel the
alarm. Once the user has confirmed the alarm, the
sensor involved returns to the normal state (green color).

Figure 6: Desktop prototype interface.
On the other hand, the web mode is designed to
show concise information about the system state. As
seen in Fig. 9, the interface shows a map where the
sensors are placed whilst keeping the color code uti-
lized in the desktop mode, as well as the alarm log
structure.
4 DATA AND RESULTS
The tests were carried out using the cases that
the CAVIAR project
(http://homepages.inf.ed.ac.uk/rbf/CAVIAR/) makes
available for researchers. For the tests, videos recorded
with a wide angle camera in the lobby of INRIA
Labs in Grenoble, France, were used. In these
scenes, there are different people interacting with the
environment and with each other. The datasets used
were walk1, browse2, browse3, rest1 and browse
while waiting (Bww).

Figure 9: Web Capture.
Since segmentation forms the first phase in the
proposed architecture, its results must be as accurate
as possible, because the remaining blocks are built
over it. Its results are shown in Table 4, presenting
an F-score close to 95%, a sensitivity of 92% and an
accuracy of 97%, with most values close to 100%.

Dataset | Accuracy | Sensitivity | F-Score
Browse2 | 0.885 | 0.982 | 0.935
Browse3 | 0.992 | 0.855 | 0.919
Bww | 0.995 | 0.964 | 0.979
Walk1 | 0.996 | 0.915 | 0.954
Rest1 | 0.993 | 0.917 | 0.953
Mean | 0.972 | 0.927 | 0.948

Table 4: Segmentation algorithm results with CAVIAR datasets

Figure 10: Tracking results. Left: Browse2 sequence; center: Walk1 sequence; right: Bww1 sequence.
In the absence of a tool to verify the tracking and ac-
tivity results in a quantitative way, these are shown
qualitatively. In the first place, the results of the tracking
algorithm are offered (see Fig. 10). This algorithm
uses as input the previously segmented blobs.
For the activities phase, the tests were focused on a
complex behavior: position and direction analysis. It
must be pointed out that activity detection takes only
one out of every six frames. This provides robustness
against the detection and tracking noise, while pre-
serving enough accuracy to perform inferences about
object trajectories. Global behavior patterns are rep-
resented through state machines whose vertices repre-
sent a possible state in the scene. As with the
local state machines, the edges are made up of a series
of queries that must be fulfilled at a certain time for
the scene to change states (see Fig. 11).

Figure 11: Global diagram, holdup at a cashpoint.

To finish with our tests, Table 5 shows an extract
of the results for position and direction analysis. The
sequence used was "Browse2".

Frame | Alert | Object | State | Objective
437 | Pre-Alarm | 1 | Initial | Cashpoint
443 | Alarm | 1 | Going towards the cashpoint | Cashpoint
563 | Alarm | 1 | Close to the cashpoint | Cashpoint

Table 5: Position and direction detection.
5 CONCLUSIONS
This article has introduced an intelligent surveillance
system by integrating segmentation, tracking and ac-
tivities detection algorithms. The system is able to
detect behaviors and report information to the user
thanks to an attractive and functional interface. As
future work, it is planned to add new sensor types
for surveillance and to move to a distributed archi-
tecture.
REFERENCES
Amer, A., Dubois, E., and Mitiche, A. (2005). Rule-based
real-time detection of context-independent events in
video shots. Real-Time Imaging, 11:244–256.
Ayers, D. and Shah, M. (2001). Monitoring human behavior
from video taken in an office environment. Image and
Vision Computing, 19(12):833–846.
Davis, J. and Sharma, V. (2007). Background-subtraction in
thermal imagery using contour saliency. International
Journal of Computer Vision, 71:161–181.
Delgado, A., López, M., and Fernández-Caballero, A.
(2010). Real-time motion detection by lateral inhi-
bition in accumulative computation. Engineering Ap-
plications of Artificial Intelligence, 23:129–139.
Fernández-Caballero, A., Castillo, J., Martínez-Cantos, J.,
and Martínez-Tomás, R. (2010). Optical flow or image
subtraction in human detection from infrared camera
on mobile robot. Robotics and Autonomous Systems,
58:1273–1281.
Fernández-Caballero, A., Castillo, J., Serrano-Cuerda, J.,
and Maldonado-Bascón, S. (2011). Real-time human
segmentation in infrared videos. Expert Systems with
Applications, 38:2577–2584.
Gascueña, J. and Fernández-Caballero, A. (2011). Agent-
oriented modeling and development of a person-
following mobile robot. Expert Systems with Appli-
cations, 38(4):4280–4290.
Hongeng, S., Nevatia, R., and Bremond, F. (2004). Video-
based event recognition: activity representation and
probabilistic recognition methods. Computer Vision
and Image Understanding, 96(2):129–162.
Isard, M. and Blake, A. (1998). Condensation - conditional
density propagation for visual tracking. International
Journal of Computer Vision, 29:5–28.
Ivanov, Y. and Bobick, A. (2000). Recognition of visual
activities and interactions by stochastic parsing. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 22(8):852–872.
Kim, J., Lee, C., Lee, K., Yun, T., and Kim, H. (2001).
Wavelet-based vehicle tracking for automatic traffic
surveillance. In Proceedings of IEEE Region 10 In-
ternational Conference on Electrical and Electronic
Technology, volume 1, pages 313–316.
Koller, D., Danilidis, K., and Nagel, H.-H. (1993). Model-
based object tracking in monocular image sequences
of road traffic scenes. International Journal of Com-
puter Vision, 10:257–281.
Lavee, G., Rivlin, E., and Rudzsky, M. (2009). Understand-
ing video events: a survey of methods for automatic
interpretation of semantic occurrences in video. IEEE
Transactions on Systems, Man, and Cybernetics, Part
C: Applications and Reviews, 39(5):489–504.
Lézoray, O. and Charrier, C. (2009). Color image segmen-
tation using morphological clustering and fusion with
automatic scale selection. Pattern Recognition Let-
ters, 30:397–406.
Maldonado-Bascón, S., Lafuente-Arroyo, S., Gil-Jiménez,
P., Gómez-Moreno, H., and López-Ferreras, F. (2007).
Road-sign detection and recognition based on support
vector machines. IEEE Transactions on Intelligent
Transportation Systems, 8(2):264–278.
Masoud, O. and Papanikolopoulos, N. (2001). A novel
method for tracking and counting pedestrians in real-
time using a single camera. IEEE Transactions on Ve-
hicular Technology, 50(5):1267–1278.
McCane, B., Galvin, B., and Novins, K. (2002). Algorith-
mic fusion for more robust feature tracking. Interna-
tional Journal of Computer Vision, 49:79–89.
Moreno-Garcia, J., Rodriguez-Benitez, L., Fernández-
Caballero, A., and López, M. (2010). Video sequence
motion tracking by fuzzification techniques. Applied
Soft Computing, 10:318–331.
Moreno-Noguer, F., Sanfeliu, A., and Samaras, D. (2008).
Dependent multiple cue integration for robust track-
ing. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 30:670–685.
Natarajan, P. and Nevatia, R. (2008). View and scale in-
variant action recognition using multiview shape-flow
models. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 1–8.
Neumann, B. and Möller, R. (2008). On scene interpretation
with description logics. Image and Vision Computing,
26:82–101.
Oliver, N. and Horvitz, E. (2005). A comparison of HMMs
and dynamic Bayesian networks for recognizing office
activities. In 10th International Conference on User
Modeling, pages 199–209.
Regazzoni, C. and Marcenaro, L. (2000). Object detection
and tracking in distributed surveillance systems using
multiple cameras. Kluwer Academic Publishers.
Stauffer, C. and Grimson, W. (1999). Adaptive back-
ground mixture models for real-time tracking. In 1999
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition, volume 2.
Ulusoy, I. and Bishop, C. (2005). Generative versus dis-
criminative methods for object recognition. In IEEE
Computer Society Conference on Computer Vision
and Pattern Recognition, volume 2, pages 258–265.
Yilmaz, A., Shafique, K., and Shah, M. (2003). Target
tracking in airborne forward looking infrared imagery.
Image and Vision Computing, 21(7):623–635.