Introduction to the Special Section
on Video Surveillance
Robert T. Collins, Alan J. Lipton, and Takeo Kanade,
Fellow, IEEE
Automated video surveillance addresses real-time observation of people and vehicles within a busy environment, leading to a description of their actions and interactions. The technical issues include moving object detection and tracking, object classification, human motion analysis, and activity understanding, touching on many of the core topics of computer vision, pattern analysis, and artificial intelligence. Video surveillance has spawned large research projects in the United States, Europe, and Japan, and has been the topic of several international conferences and workshops in recent years.
There are immediate needs for automated surveillance systems in commercial, law enforcement, and military applications. Mounting video cameras is cheap, but finding available human resources to observe the output is expensive. Although surveillance cameras are already prevalent in banks, stores, and parking lots, video data currently is used only “after the fact” as a forensic tool, thus losing its primary benefit as an active, real-time medium. What is needed is continuous 24-hour monitoring of surveillance video to alert security officers to a burglary in progress or to a suspicious individual loitering in the parking lot, while there is still time to prevent the crime. In addition to the obvious security applications, video surveillance technology has been proposed to measure traffic flow, detect accidents on highways, monitor pedestrian congestion in public spaces, compile consumer demographics in shopping malls and amusement parks, log routine maintenance tasks at nuclear facilities, and count endangered species. The numerous military applications include patrolling national borders, measuring the flow of refugees in troubled areas, monitoring peace treaties, and providing secure perimeters around bases and embassies.
The 11 papers in this special section illustrate topics
and techniques at the forefront of video surveillance
research. These papers can be loosely organized into
three categories.
Detection and tracking involves real-time extraction of moving objects from video and continuous tracking over time to form persistent object trajectories. C. Stauffer and W.E.L. Grimson introduce unsupervised statistical learning techniques to cluster object trajectories produced by adaptive background subtraction into descriptions of normal scene activity. Viewpoint-specific trajectory descriptions from multiple cameras are combined into a common scene coordinate system using a calibration technique described by L. Lee, R. Romano, and G. Stein, who automatically determine the relative exterior orientation of overlapping camera views by observing a sparse set of moving objects on flat terrain. Two papers address the accumulation of noisy motion evidence over time. R. Pless, T. Brodský, and Y. Aloimonos detect and track small objects in aerial video sequences by first compensating for the self-motion of the aircraft, then accumulating residual normal flow to acquire evidence of independent object motion. L. Wixson notes that motion in the image does not always signify purposeful travel by an independently moving object (examples of such “motion clutter” are wind-blown tree branches and sun reflections off rippling water) and devises a flow-based salience measure to highlight objects that tend to move in a consistent direction over time.
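To make the first category concrete, here is a minimal sketch of per-pixel background subtraction in Python with NumPy. It maintains a single running Gaussian per pixel; Stauffer and Grimson's adaptive method actually fits a mixture of Gaussians at each pixel so that multimodal backgrounds (swaying branches, flickering monitors) are modeled as well, so the function below, including its name and parameters, is an illustrative simplification rather than their algorithm.

```python
import numpy as np

def update_background(frame, mean, var, alpha=0.01, k=2.5):
    """One step of running-Gaussian background subtraction.

    frame, mean, var: float arrays of the same (H, W) shape.
    alpha: learning rate; k: foreground threshold in std. devs.
    Returns (foreground mask, updated mean, updated variance).
    """
    diff = frame - mean
    # A pixel is foreground if it deviates from the background
    # model by more than k standard deviations.
    fg = np.abs(diff) > k * np.sqrt(var)
    # Only background pixels update the model, so a moving object
    # is not absorbed into the background while it is visible.
    bg = ~fg
    mean = np.where(bg, (1 - alpha) * mean + alpha * frame, mean)
    var = np.where(bg, (1 - alpha) * var + alpha * diff ** 2, var)
    return fg, mean, np.maximum(var, 1e-4)
```

Connected components of the resulting foreground mask become object detections, which a tracker then links over time into the trajectories that Stauffer and Grimson cluster.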
Human motion analysis is concerned with detecting periodic motion signifying a human gait and acquiring descriptions of human body pose over time. R. Cutler and L.S. Davis plot an object's self-similarity across all pairs of frames to form distinctive patterns that classify bipedal, quadrupedal, and rigid object motion. Y. Ricquebourg and P. Bouthemy track apparent contours in XT slices of an XYT sequence volume to robustly delineate and track articulated human body structure. I. Haritaoglu, D. Harwood, and L.S. Davis present W4, a surveillance system specialized to the task of looking at people. The W4 system can locate people and segment their body parts, build simple appearance models for tracking, disambiguate between and separately track multiple individuals in a group, and detect carried objects such as boxes and backpacks.
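The self-similarity idea from Cutler and Davis can also be sketched compactly. Assuming a tracker has already produced a stack of aligned, equally sized image chips of one object, the hypothetical helpers below build the frame-to-frame similarity matrix and read a repetition period off its Fourier spectrum; the published method analyzes the full 2D lattice structure of this matrix to separate bipedal, quadrupedal, and rigid motion, which this one-row spectrum does not attempt.

```python
import numpy as np

def self_similarity_matrix(chips):
    """chips: (T, H, W) stack of tracked-object image chips,
    aligned and resized to a common size by the tracker.
    Returns a (T, T) matrix of mean absolute differences."""
    T = chips.shape[0]
    flat = chips.reshape(T, -1).astype(np.float64)
    S = np.zeros((T, T))
    for i in range(T):
        # Row i compares frame i against every other frame.
        S[i] = np.abs(flat - flat[i]).mean(axis=1)
    return S

def dominant_period(S):
    """Estimate a motion period (in frames) from one row of S.
    Periodic motion makes the row oscillate; the strongest
    nonzero Fourier bin gives the repetition frequency."""
    row = S[0] - S[0].mean()
    spectrum = np.abs(np.fft.rfft(row))
    spectrum[0] = 0.0  # ignore the DC component
    k = int(np.argmax(spectrum))
    return None if k == 0 else len(row) / k
```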
Activity analysis deals with parsing temporal sequences of object observations to produce high-level descriptions of agent actions and multiagent interactions. In our opinion, this will be the most important area of future research in video surveillance. N.M. Oliver, B. Rosario, and A.P. Pentland introduce Coupled Hidden Markov Models (CHMMs) to detect and classify interactions consisting of two interleaved agent action streams and present a training method based on synthetic agents to address the problem of parameter estimation from limited real-world training examples. M. Brand and V. Kettnaker present an entropy-minimization approach to estimating HMM topology and parameter values, thereby simultaneously clustering video sequences into events and creating classifiers to detect those events in the future.
Y.A. Ivanov and A.F. Bobick recognize gestures and multiobject interactions from noisy, low-level tracking data by parsing a stochastic context-free grammar (SCFG) that defines multiple events that can be occurring simultaneously in the scene. T. Wada and T. Matsuyama present a hypothesize-and-test approach to recognizing multiple object behaviors directly from video sequences using a Nondeterministic Finite Automaton (NFA) that allows all feasible interpretation states to be simultaneously active. They also introduce a colored-token propagation mechanism to keep track of the partial interpretations being assembled for different objects over time and present extensions to handle multiple simultaneous video streams.
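The common primitive beneath the HMM-style papers in this category is scoring an observation sequence against a trained model. As a minimal illustration (not the CHMM coupling of Oliver et al. nor Brand and Kettnaker's entropy minimization; the names and parameters below are our own), this sketch implements the standard scaled forward algorithm for a discrete-output HMM.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs : sequence of observation symbol indices
    pi  : (N,) initial state distribution
    A   : (N, N) transition matrix, A[i, j] = P(state j | state i)
    B   : (N, M) emission matrix, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha = alpha / scale
    for o in obs[1:]:
        # Propagate through the transition model, then weight
        # by how well each state explains the new observation.
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        log_lik += np.log(scale)  # rescale to avoid underflow
        alpha = alpha / scale
    return log_lik
```

Given one trained model per event class, a new trajectory is labeled with the class whose model assigns it the highest log-likelihood; topology estimation, state coupling, and grammar parsing are the respective papers' refinements on top of this kernel.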
Several of these papers represent work funded under the recent DARPA Video Surveillance and Monitoring (VSAM) research program. Carnegie Mellon University was chosen to lead this effort by developing an end-to-end testbed system that integrates a wide range of advanced surveillance techniques: real-time moving object detection and tracking from stationary and moving camera platforms, recognition of generic object classes (e.g., human, sedan, truck) and specific object types (e.g., campus police car, FedEx van), object pose estimation with respect to a geospatial site model, active camera control and multicamera cooperative tracking, human gait analysis, recognition of simple multiagent activities, real-time data dissemination, data logging, and dynamic scene visualization. We invite the reader to visit the VSAM web page at http://www.cs.cmu.edu/~vsam/ for more information.
Discussions of video surveillance research with nonpractitioners invariably lead to comments about Big Brother. Although this is obviously not the goal of current video surveillance research, the concern is reasonable. In 1998, the NYC Surveillance Camera Project run by the New York Civil Liberties Union documented nearly 2,500 surveillance cameras viewing public spaces within Manhattan. The vast majority are privately owned cameras installed outside businesses and apartment complexes, with no mechanism to correlate information between them. However, it would not be infeasible for a sufficiently well-funded government to install a network of thousands of cameras capable of tracking individual citizens as they walk through the city. As the two research paths of video surveillance and biometric identification begin to merge, this scenario becomes even more troubling. Is the promise of never being mugged worth the loss of privacy implied by always being watched? These larger societal questions stray outside the scope of this technical journal, but now is a good time to begin to specify what data should be collected, how long it should be stored, and who has access, so that an ethical framework will be in place to guide the development and application of the powerful technology that will soon be available.
Robert T. Collins
Alan J. Lipton
Takeo Kanade

R.T. Collins and T. Kanade are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. E-mail: {rcollins, tk}@cs.cmu.edu.
A.J. Lipton is with DiamondBack Vision, Inc., Washington, DC. E-mail: ajl@dbvision.net.
Robert T. Collins received the PhD degree in computer science in 1993 from the University of Massachusetts at Amherst for work on scene reconstruction using stochastic projective geometry. He is a member of the Research Faculty at the Robotics Institute of Carnegie Mellon University (CMU). From 1992 to 1996, he was technical director of the DARPA RADIUS project at the University of Massachusetts, culminating in the ASCENDER system for populating 3D site models from multiple, oblique aerial views. From 1996 to 1999, Dr. Collins was technical codirector of the DARPA Video Surveillance and Monitoring (VSAM) project at CMU. This project developed real-time, automated video understanding algorithms that guide a network of active video sensors to monitor the activities of people and vehicles in a complex scene. Dr. Collins has published for more than a decade on topics in video surveillance, 3D site modeling, multi-image stereo, projective geometry, and knowledge-based scene understanding.
Alan J. Lipton received the PhD degree in electrical and computer systems engineering from Monash University, Melbourne, Australia, in 1996. For his thesis, he studied the problem of mobile robot navigation by natural landmark recognition using on-board vision sensing. He is a senior scientist at DiamondBack Vision, Inc., an internet startup company based in Washington, D.C. From 1997 through 2000, he served on the faculty of CMU's Robotics Institute. During his time at CMU, Dr. Lipton was a project comanager of DARPA's Video Surveillance and Monitoring (VSAM) project. On this project, Dr. Lipton developed algorithms for detection and tracking of people and vehicles from video streams, integration and fusion of video data, user interfaces for vision system networks, and intelligent sensor control.
Takeo Kanade received the BE degree in electrical engineering from Kyoto University in 1968, the ME degree in 1970, and the PhD degree in 1973. He is the U.A. and Helen Whitaker Professor of Computer Science and Robotics and director of the Robotics Institute at Carnegie Mellon University. He has made widely known technical contributions in multiple areas of computer vision, robotics, and sensor design. At CMU, he has led many major projects on vision and robotics sponsored by NSF, DARPA, NASA, DOE, and NIMH. He is a member of the National Academy of Engineering and a fellow of the IEEE, ACM, and AAAI. He has received several awards, including the Joseph F. Engelberger Award, the JARA Award, the Yokogawa Prize, the Hip Society Otto AuFranc Award, and the Marr Prize. He is founding chief editor of the International Journal of Computer Vision.