Unsupervised Fast Anomaly Detection in Crowds
Xiaoshuai Sun, Hongxun Yao, Rongrong Ji, Xianming Liu, Pengfei Xu
Department of Computer Science, Harbin Institute of Technology
No.92, West Dazhi Street, Harbin, P. R. China, 150001
{xiaoshuaisun, h.yao, rrji, xmliu, pfxu}@hit.edu.cn Tel: +86-451-86416485
ABSTRACT
In this paper, we propose a fast and robust unsupervised framework for anomaly detection and localization in crowded scenes. Our method avoids modeling the normal state of the crowd, which is a very complex task due to the large within-class variance of normal target appearance and motion patterns. For each video frame, we extract spatial-temporal features of 3D blocks and generate a saliency map using a block-based center-surround difference operator. A motion vector matrix is then obtained by the adaptive rood pattern search block-matching algorithm followed by distance normalization. An attractive motion disorder descriptor is proposed to measure the global intensity of anomalies in the scene. Finally, we classify frames into normal and anomalous ones with a binary classifier. In the experiments, we compare our method against several state-of-the-art approaches on the UCSD dataset, a widely used anomaly detection and localization benchmark. As the only unsupervised approach, our method achieves competitive results with near real-time processing speed.
Categories and Subject Descriptors
H.3.1 [Information Systems]: Content Analysis and Indexing;
H.5.1 [Multimedia Information Systems]: Video
General Terms
Algorithms, Experimentation, Human Factors
Keywords
Unsupervised anomaly detection, motion estimation, attractive
motion disorder descriptor
1. INTRODUCTION
As reviewed in [1, 2], monitoring surveillance videos, especially videos of crowded scenes, is a very expensive and tiring task. Thus, automatic detection of anomalous events in crowds has become an attractive topic in computer vision and pattern recognition research. Due to the unreliability of trajectory analysis in crowded scenes [3], recent works focus on designing robust dynamic scene representations that avoid multi-target tracking [4, 5, 6, 7, 8]. Adam et al. [4] maintain probabilities of optical flow in local regions, using histograms. Kim and Grauman [5] utilized a mixture of probabilistic PCA models to model local optical flow patterns, and enforce global consistency using a Markov Random Field (MRF). Inspired by classical studies of crowd behavior, Mehran et al. [6] characterized crowd behavior using concepts such as social force. These concepts lead to optical flow measurements of target interaction within the crowd, which are combined with a latent Dirichlet Allocation (LDA) model for anomaly detection. Mahadevan et al. [8] proposed a unified framework for joint modeling of appearance and dynamics of the scene, under which outliers are labeled as anomalies.
However, scene representation is not the only problem in the anomaly detection task. Modeling the normal state of the crowded scene is another challenging problem due to the large within-class variance of normal target appearance and motion patterns. Figure 1 shows the moving targets that appear in a 20-second video clip, which contains different target appearances and movements. In real-world applications, the length of video with normal crowd behavior will be much longer than 20 seconds, so it is nearly impossible to model a normal state containing thousands of patterns with different spatial-temporal appearance. Compared with supervised or semi-supervised learning of the normal states [2, 3, 4, 5, 6, 7, 8, 9], it may be more practical to directly model the global intensity of anomalous events in a purely unsupervised manner.
From experimental observations, we found that abnormal contents or unusual human behaviors consistently attract the attention of human observers, which means most anomalies are more attractive, or more salient, than the other contents in the environment. Besides, the presence of anomalies will probably turn ordered crowd movements into a disordered state. Based on these observations, we propose an unsupervised framework for the anomaly detection and localization task, which uses an Attractive Motion Disorder descriptor to directly measure the overall intensity of anomalies and avoids modeling the crowd's normal behavior. Our descriptor is constructed by fusing the statistical features of visual saliency and motion vectors, and is inspired by both perceptual and computational observations on normal and anomalous videos.
*Area Chair: Kiyoharu Aizawa
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
MM’11, November 28-December 1, 2011, Scottsdale, Arizona, USA.
Copyright 2011 ACM 978-1-4503-0616-4/11/11…$10.00.
(a) Normal moving targets (b) The crowded scene
Figure 1. Large within-class variance of the normal target appearance and motion patterns in a crowded scene.
2. METHOD
The proposed unsupervised anomaly detection framework is shown in Figure 2. Temporal derivatives of spatial-temporal video blocks are extracted as visual features. Saliency is then computed by a block-based center-surround difference operator. Motion disorder is measured by the standard deviation of motion vectors estimated by an adaptive block-matching algorithm. By analyzing the statistical distributions of visual saliency and the motion vector matrix, we construct the attractive motion disorder descriptor to measure the global anomalous intensity, with which video frames are classified into normal or anomalous frames by a binary classifier. Localization of the detected anomalies is achieved using the spatial-temporal saliency map.
[Figure 2 block diagram: Testing Video → Visual Saliency Detection (Spatial-temporal Feature Extraction, Block-based center-surround difference) → Saliency Map; Testing Video → Motion Estimation (Adaptive rood pattern search block-matching, Distance Normalization) → Motion Vector Matrix; both feed the Attractive Motion Disorder Descriptor → Binary Classifier → outputs: Frame Label and Anomaly Localization.]
Figure 2. The framework of the proposed method.
2.1 Center-Surround Saliency Detection
Saliency is an important concept for computational visual
attention modeling, which could be quantitatively measured by
center-surround difference [10, 11], information maximization
[12], incremental coding length [13] and site entropy rate [14],
etc. In our case, we first extract spatial-temporal local features from the video and then generate the saliency map using a block-based center-surround operation, which is more computationally efficient while sharing the plausibility of previous works. The visual field is segmented into 24×32 3D sub-blocks, each represented by a gradient-based spatial-temporal descriptor. The descriptor of a sub-block is constructed from the absolute values of the temporal derivatives of all pixels in the block, stacked into a 1-D feature vector. A center-surround difference operator, akin to the visual receptive fields of the human visual system, is adopted as a quantitative measurement of visual saliency. In traditional models [10, 11], the center-surround difference was computed across different spatial scales using Difference of Gaussian filters. In our case, for computational efficiency, we only compute the difference between a center block and its surrounding eight neighbors. The saliency of a given block is defined as the average center-surround difference, measured by the Manhattan distance between the features of the center and its surrounding blocks:
S_{i,j} = \frac{1}{8} \sum_{m=-1}^{1} \sum_{n=-1}^{1} \left\| F_{i+m,j+n} - F_{i,j} \right\|_{1}, \quad (m,n) \neq (0,0),  (1)
Figure 3 shows examples of the spatial-temporal saliency maps computed using the temporal gradient features and the block-based center-surround difference operator. It is easy to notice that anomalies tend to appear at the locations with the largest saliency values in the scene.
Figure 3. Examples of the spatial-temporal saliency maps.
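As an illustrative sketch of the block-based saliency computation above (not the authors' code), the following NumPy snippet assumes three consecutive grayscale frames resized to 120×160 so that 5×5×3 blocks tile into a 24×32 grid; all function and variable names are ours.

```python
import numpy as np

def block_saliency(frames, block_h=5, block_w=5):
    """Block-based center-surround saliency (cf. Eq. (1)); a sketch with assumed shapes.

    frames : (T, H, W) array of consecutive grayscale frames
             (e.g., T = 3, H = 120, W = 160, giving a 24 x 32 block grid).
    """
    T, H, W = frames.shape
    rows, cols = H // block_h, W // block_w

    # Spatio-temporal features: absolute temporal derivatives (frame differences).
    dt = np.abs(np.diff(frames.astype(np.float32), axis=0))       # (T-1, H, W)

    # Stack each 3D block's derivative values into a 1-D feature vector F[i, j].
    F = dt.reshape(T - 1, rows, block_h, cols, block_w)
    F = F.transpose(1, 3, 0, 2, 4).reshape(rows, cols, -1)        # (rows, cols, feat_dim)

    # Saliency: average Manhattan (L1) distance between a block and its 8 neighbours.
    S = np.zeros((rows, cols), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            dists = []
            for m in (-1, 0, 1):
                for n in (-1, 0, 1):
                    if (m, n) == (0, 0):
                        continue
                    ii, jj = i + m, j + n
                    if 0 <= ii < rows and 0 <= jj < cols:
                        dists.append(np.abs(F[ii, jj] - F[i, j]).sum())
            S[i, j] = np.mean(dists)
    return S
```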
2.2 Attractive Motion Disorder Descriptor
The motion vectors obtained by the adaptive rood pattern search block-matching algorithm [15] are used as motion descriptors for each sub-block. The visual field is segmented into 12×16 sub-blocks of equal size. Note that motion vectors can also be obtained directly from compressed-domain data if the video is compressed using a motion compensation technique. Let M_{i,j} denote the motion vector of the sub-block in the ith row and jth column; we apply distance normalization to eliminate the scale variance of the motion vectors caused by the geometric setting of the camera:
M'_{i,j} = M_{i,j} \cdot \left( \frac{H - i}{H} \right)^{\gamma},  (2)
where M is the motion vector matrix, H is the height of M, and γ = 0.5 is a distance compensation parameter, which is fixed in our experiments. After normalization, object motions appearing in all sub-blocks can be measured nearly equally by M'.
Figure 4 illustrates motion estimation and normalization results.
Figure 4. Motion estimation. From left to right: input video
frame, motion estimation result by adaptive rood pattern
search block-matching [15] and normalized motion vectors.
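For illustration, a minimal NumPy sketch of the distance normalization step is given below. It assumes the 12×16 motion-vector matrix has already been produced by a block matcher (ARPS [15] in the paper, though any estimator would do here), and the compensation factor follows our reading of Equation (2).

```python
import numpy as np

def distance_normalize(M, gamma=0.5):
    """Distance normalization of a motion-vector matrix (our reading of Eq. (2)).

    M : (rows, cols, 2) array of per-block motion vectors (e.g., 12 x 16 x 2),
        with row index 0 at the top of the frame (far from the camera).
    Rows closer to the camera (larger row index) are scaled down so that object
    motion in all sub-blocks is measured on a roughly common scale.
    """
    rows = M.shape[0]                                   # H, the height of M
    i = np.arange(rows, dtype=np.float32)               # 0-based row indices
    scale = ((rows - i) / rows) ** gamma                # ((H - i) / H)^gamma
    return M * scale[:, None, None]
```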
There are various measurements of system disorder, such as entropy and standard deviation. Entropy is an important concept in physics and information theory, and is widely used as a quantitative measure of uncertainty or unpredictability. Standard deviation is an easy-to-compute statistic describing the variance or diversity of a group of data. In practice, we use the standard deviation to measure motion disorder, because it leads to better overall performance while requiring much less computation than the other measurements.
Given the spatial-temporal saliency map S and the motion vector matrix M', we define the Attractive Motion Disorder (AMD) descriptor A by:

A = \alpha \cdot \max(S) + (1 - \alpha) \cdot \mathrm{std}(M'),  (3)

where α ∈ [0, 1] is a fusing parameter and std(·) denotes the
standard deviation of the input matrix. The descriptor can be
regarded as a quantitative measurement of the global intensity of all anomalous events appearing in the visual field. A higher AMD value indicates a larger probability of the appearance of anomalies. Figure 5 illustrates the distribution of the AMD descriptor (α = 0.5) in normal and anomalous videos.
Figure 5. Distribution of AMD descriptor in normal (blue)
and anomalous (green) video frames of UCSD Ped_1 dataset.
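A minimal sketch of the AMD computation and a simple thresholding classifier follows; the saliency map S and normalized motion matrix M' are assumed to come from the previous steps, and the fixed-threshold classifier is only the simplest possible instance of the binary classifier, not necessarily the one used in the paper.

```python
import numpy as np

def amd_descriptor(S, M_norm, alpha=0.5):
    """Attractive Motion Disorder descriptor, Eq. (3): A = alpha*max(S) + (1-alpha)*std(M')."""
    return alpha * float(S.max()) + (1.0 - alpha) * float(np.std(M_norm))

def classify_frames(amd_scores, threshold):
    """Label frames as anomalous when their AMD score exceeds a decision threshold.

    The paper keeps the binary classifier generic; a fixed threshold on the AMD score
    (swept to draw the ROC curve in Section 3) is the simplest instance, shown here.
    """
    return np.asarray(amd_scores, dtype=np.float32) > threshold
```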
2.3 Anomaly Detection and Localization
Video frames can be classified into normal or abnormal frames by a binary classifier using the AMD descriptor. As described in Section 1, anomalous regions tend to attract more visual attention than the other events happening in the scene. Thus, the saliency map can be used as a reference for localization and segmentation of the anomalous regions. In practice, we adopt Equation 4 to segment the anomalies, which was first proposed in [16] for non-parametric proto-object segmentation:
O(x, y) = \begin{cases} 1 & \text{if } S'(x, y) > \text{threshold}, \\ 0 & \text{otherwise}, \end{cases}  (4)
where O is the localization binary map and S' = S ∗ G is a refined saliency map smoothed by a Gaussian filter G (3×3, σ = 1). We set the threshold to 7×E(S) empirically, where E(S) is the
mean intensity of the saliency map. Examples of anomaly
detection and localization results are shown in Figure 6.
Figure 6. Anomaly detection in crowded videos. From left to
right: Detected abnormal frame, corresponding saliency map
and localization result of the anomalous region.
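A short sketch of this localization step is given below, assuming SciPy's Gaussian filter for the 3×3, σ = 1 smoothing and a simple block-to-pixel upsampling for visualization; the interface and constants other than those stated in the text are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def localize_anomalies(S, block_h=5, block_w=5):
    """Anomaly localization by thresholding the smoothed saliency map (Eq. (4)).

    S : the 24 x 32 block-level saliency map. Returns the binary map at block
    resolution and an upsampled pixel-level mask (120 x 160 for 5 x 5 blocks).
    """
    # Refined saliency map S' = S * G, with G a 3x3 Gaussian (sigma = 1).
    S_ref = gaussian_filter(S.astype(np.float32), sigma=1.0, truncate=1.0)

    # Empirical threshold: 7 x E(S), where E(S) is the mean saliency intensity.
    threshold = 7.0 * float(S.mean())
    block_mask = S_ref > threshold

    # Upsample the block-level mask to pixel resolution for visualization.
    pixel_mask = np.kron(block_mask.astype(np.uint8),
                         np.ones((block_h, block_w), dtype=np.uint8)).astype(bool)
    return block_mask, pixel_mask
```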
3. EXPERIMENTS
We evaluate the proposed approach on the UCSD dataset [8] (http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm), a well-annotated, publicly available dataset for the evaluation of anomaly detection and localization in crowded scenes. The dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways, at a resolution of 238 × 158 and 10 fps. The circulation of non-pedestrian entities in the walkways and anomalous pedestrian motion patterns are regarded as abnormal events. Commonly appearing anomalies include bikers, skaters, small carts, and people walking across a
walkway or in the grass. The videos are split into two subsets, Ped_1 and Ped_2, each corresponding to a different scene. The videos recorded from each scene are further split into clips of around 200 frames each. Ped_1 contains 34 training clips and 36 testing clips, while Ped_2 contains 16 training clips and 14 testing clips. For each clip, the ground-truth annotation includes a binary flag per frame, indicating whether an anomaly is present in
that frame.
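For reference, a minimal loader for one clip might look as follows; the per-clip directory of individual frame images and the file extension are assumptions about the downloaded dataset, while the 120×160 resize matches the preprocessing described in the next paragraph.

```python
import glob
import cv2
import numpy as np

def load_clip(clip_dir, size=(160, 120), pattern="*.tif"):
    """Load one clip's frames as a (T, 120, 160) grayscale array (layout assumed)."""
    frames = []
    for path in sorted(glob.glob(f"{clip_dir}/{pattern}")):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:                         # skip unreadable files
            continue
        frames.append(cv2.resize(img, size))    # cv2.resize expects (width, height)
    return np.stack(frames).astype(np.float32)
```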
In practice, all video frames are resized to 120×160 in order to reduce the computational cost. For each frame, we extract the spatial-temporal features of 5×5×3 3D video blocks and generate a 24×32 saliency map using the proposed block-based center-surround difference operator. A 12×16 motion vector matrix is then obtained by the adaptive rood pattern search block-matching algorithm and distance normalization. Based on the saliency map and the motion vector matrix, we compute the AMD descriptor using Equation 3 (α = 0.5), which describes the overall intensity of anomalies appearing in the frame. Finally, each video frame is classified as normal or anomalous by a binary classifier.
The evaluation on the UCSD dataset contains two components: anomaly detection and localization. By varying the parameters of the tested approach, an ROC curve can be drawn to evaluate the anomaly detection performance. Figure 7 illustrates the ROC curves on the UCSD dataset for various state-of-the-art approaches and our approach, while Figure 8 shows some visual examples of anomaly localization and segmentation results of the tested approaches. In addition to Figure 7, Table 1 reports the area under the ROC curve (AUC) of the tested methods, where a larger AUC score means better classification performance. According to the experimental results, our method, as the only completely unsupervised, training-free approach, achieves competitive results against the state-of-the-art methods with near real-time processing speed. The visual results indicate that our method accurately localizes the anomalous events in the crowded scene and produces better segmentation results with well-defined boundaries.
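The frame-level evaluation can be reproduced with a standard ROC/AUC computation. The sketch below assumes the per-frame AMD scores and the dataset's binary ground-truth flags have already been collected into arrays, and uses scikit-learn purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def frame_level_evaluation(amd_scores, gt_flags):
    """Frame-level ROC and AUC from per-frame AMD scores and ground-truth anomaly flags."""
    scores = np.asarray(amd_scores, dtype=np.float64)
    labels = np.asarray(gt_flags, dtype=np.int32)
    fpr, tpr, thresholds = roc_curve(labels, scores)    # sweep the decision threshold
    return fpr, tpr, roc_auc_score(labels, scores)
```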
Figure 7. ROC curves of the tested approaches on the UCSD Ped_1 dataset. The tested approaches include our method, the MDT-based approach [8], the Social Force Model [6], the mixture of probabilistic PCA models (MPPCA [5]), and the optical flow monitoring method of Adam et al. [4].
Table 1. Area under the ROC curve (AUC)
Method  MDT [8]  SF [6]  MPPCA [5]  Adam [4]  Ours
AUC     0.7895   0.7413  0.6554     0.6350    0.7919
4. CONCLUSION
In this paper, we proposed an unsupervised framework for fast anomaly detection and localization in crowded scenes. Instead of modeling the normal states, we directly model the intensity of anomalies using the attractive motion disorder descriptor, which is constructed by fusing the statistical features of the saliency map and the motion vector matrix. Saliency detection and motion estimation are conducted by a block-based center-surround difference operator and the adaptive rood pattern search block-matching algorithm, both of which are highly efficient and lead to a near real-time overall processing speed. Experimental results on a widely used benchmark dataset demonstrate the effectiveness of the proposed framework. Our future work lies in integrating other reliable features, such as a location distribution prior, into the framework to further improve the overall performance.
5. ACKNOWLEDGEMENT
This work was supported by the National Natural Science
Foundation of China (Grant No. 61071180 and Key Program
Grant No. 61133003).
6. REFERENCES
[1] N. Haering, P. Venetianer, and A. Lipton. “The evolution of
video surveillance: an overview”. Machine Vision and
Applications, 19(5-6):279–290, 2008.
[2] L. Seidenari, M. Bertini. “Non-parametric anomaly detection
exploiting space-time features”. ACM Multimedia, pp.1139–
1142, 2010.
[3] F. Jiang, Y. Wu, and A. Katsaggelos. “A dynamic
hierarchical clustering method for trajectory-based unusual
video event detection”. IEEE TIP, 18(4):907–913, 2009.
[4] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz. “Robust real-time unusual event detection using multiple fixed-location monitors”. IEEE TPAMI, 30(3):555–560, 2008.
[5] J. Kim and K. Grauman. “Observe locally, infer globally: A
space-time MRF for detecting abnormal activities with
incremental updates”. CVPR, pp. 2921–2928, 2009.
[6] R. Mehran, A. Oyama, and M. Shah. “Abnormal crowd
behavior detection using social force model”. CVPR,
pp.935–942, 2009.
[7] L. Kratz and K. Nishino. “Anomaly detection in extremely
crowded scenes using spatio-temporal motion pattern
models”. CVPR, pp.1446–1453, 2009.
[8] V. Mahadevan, W. Li, V. Bhalodia and N. Vasconcelos.
“Anomaly Detection in Crowded Scenes”. CVPR, 2010.
[9] O. Boiman and M. Irani. “Detecting irregularities in images
and in video”. IJCV, 74(1):17–31, Aug. 2007.
[10] L. Itti, C. Koch and E. Niebur. “A model of saliency-based
visual attention for rapid scene analysis”. IEEE TPAMI,
20(11), 1998.
[11] D. Gao, V. Mahadevan, and N. Vasconcelos. “The discriminant center-surround hypothesis for bottom-up saliency”. Advances in Neural Information Processing Systems, pp. 497–504, 2007.
[12] N. Bruce and J. Tsotsos. “Saliency based on information
maximization”. Advances in Neural Information Processing
Systems, pp.155–162, 2006.
[13] X. Hou and L. Zhang. “Dynamic visual attention: searching for coding length increments”. NIPS, pp. 681–688, 2008.
[14] W. Wang, Y. Wang, Q. Huang, and W. Gao, “Measuring
Visual Saliency by Site Entropy Rate”. CVPR, pp. 2368–
2375, 2010.
[15] Y. Nie and K. Ma. “Adaptive rood pattern search algorithm for fast block matching motion estimation”. IEEE TIP, 11(12):1442–1448, 2002.
[16] X. Hou and L. Zhang. “Saliency detection: a spectral residual approach”. CVPR, 2007.
Figure 8. Comparison of anomaly localization results from (i) our approach, (ii) the MDT approach, and (iii) the SF-MPPCA approach. The results of MDT and SF-MPPCA are provided by Mahadevan et al. [8].