Introduction to the Special Section
on Video Surveillance
Robert T. Collins, Alan J. Lipton, and Takeo Kanade,
Fellow, IEEE
Automated video surveillance addresses real-time observation of people and vehicles within a busy environment, leading to a description of their actions and interactions. The technical issues include moving object detection and tracking, object classification, human motion analysis, and activity understanding, touching on many of the core topics of computer vision, pattern analysis, and artificial intelligence. Video surveillance has spawned large research projects in the United States, Europe, and Japan, and has been the topic of several international conferences and workshops in recent years.
There are immediate needs for automated surveillance systems in commercial, law enforcement, and military applications. Mounting video cameras is cheap, but finding available human resources to observe the output is expensive. Although surveillance cameras are already prevalent in banks, stores, and parking lots, video data currently is used only “after the fact” as a forensic tool, thus losing its primary benefit as an active, real-time medium. What is needed is continuous 24-hour monitoring of surveillance video to alert security officers to a burglary in progress or to a suspicious individual loitering in the parking lot, while there is still time to prevent the crime. In addition to the obvious security applications, video surveillance technology has been proposed to measure traffic flow, detect accidents on highways, monitor pedestrian congestion in public spaces, compile consumer demographics in shopping malls and amusement parks, log routine maintenance tasks at nuclear facilities, and count endangered species. The numerous military applications include patrolling national borders, measuring the flow of refugees in troubled areas, monitoring peace treaties, and providing secure perimeters around bases and embassies.
The 11 papers in this special section illustrate topics
and techniques at the forefront of video surveillance
research. These papers can be loosely organized into
three categories.
Detection and tracking involves real-time extraction of moving objects from video and continuous tracking over time to form persistent object trajectories. C. Stauffer and W.E.L. Grimson introduce unsupervised statistical learning techniques to cluster object trajectories produced by adaptive background subtraction into descriptions of normal scene activity. Viewpoint-specific trajectory descriptions from multiple cameras are combined into a common scene coordinate system using a calibration technique described by L. Lee, R. Romano, and G. Stein, who automatically determine the relative exterior orientation of overlapping camera views by observing a sparse set of moving objects on flat terrain. Two papers address the accumulation of noisy motion evidence over time. R. Pless, T. Brodský, and Y. Aloimonos detect and track small objects in aerial video sequences by first compensating for the self-motion of the aircraft, then accumulating residual normal flow to acquire evidence of independent object motion. L. Wixson notes that motion in the image does not always signify purposeful travel by an independently moving object (examples of such “motion clutter” are wind-blown tree branches and sun reflections off rippling water) and devises a flow-based salience measure to highlight objects that tend to move in a consistent direction over time.
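To make the first category concrete, here is a minimal sketch of per-pixel background subtraction in Python with NumPy. It maintains a single running Gaussian per pixel; Stauffer and Grimson's adaptive method actually fits a mixture of Gaussians at each pixel so that multimodal backgrounds (swaying branches, flickering monitors) are modeled as well, so the function below, including its name and parameters, is an illustrative simplification rather than their algorithm.

```python
import numpy as np

def update_background(frame, mean, var, alpha=0.01, k=2.5):
    """One step of running-Gaussian background subtraction.

    frame, mean, var: float arrays of the same (H, W) shape.
    alpha: learning rate; k: foreground threshold in std. devs.
    Returns (foreground mask, updated mean, updated variance).
    """
    diff = frame - mean
    # A pixel is foreground if it deviates from the background
    # model by more than k standard deviations.
    fg = np.abs(diff) > k * np.sqrt(var)
    # Only background pixels update the model, so a moving object
    # is not absorbed into the background while it is visible.
    bg = ~fg
    mean = np.where(bg, (1 - alpha) * mean + alpha * frame, mean)
    var = np.where(bg, (1 - alpha) * var + alpha * diff ** 2, var)
    return fg, mean, np.maximum(var, 1e-4)
```

Connected components of the resulting foreground mask become object detections, which a tracker then links over time into the trajectories that Stauffer and Grimson cluster.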
Human motion analysis is concerned with detecting periodic motion signifying a human gait and acquiring descriptions of human body pose over time. R. Cutler and L.S. Davis plot an object's self-similarity across all pairs of frames to form distinctive patterns that classify bipedal, quadrupedal, and rigid object motion. Y. Ricquebourg and P. Bouthemy track apparent contours in XT slices of an XYT sequence volume to robustly delineate and track articulated human body structure. I. Haritaoglu, D. Harwood, and L.S. Davis present W4, a surveillance system specialized to the task of looking at people. The W4 system can locate people and segment their body parts, build simple appearance models for tracking, disambiguate between and separately track multiple individuals in a group, and detect carried objects such as boxes and backpacks.
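The self-similarity idea from Cutler and Davis can also be sketched compactly. Assuming a tracker has already produced a stack of aligned, equally sized image chips of one object, the hypothetical helpers below build the frame-to-frame similarity matrix and read a repetition period off its Fourier spectrum; the published method analyzes the full 2D lattice structure of this matrix to separate bipedal, quadrupedal, and rigid motion, which this one-row spectrum does not attempt.

```python
import numpy as np

def self_similarity_matrix(chips):
    """chips: (T, H, W) stack of tracked-object image chips,
    aligned and resized to a common size by the tracker.
    Returns a (T, T) matrix of mean absolute differences."""
    T = chips.shape[0]
    flat = chips.reshape(T, -1).astype(np.float64)
    S = np.zeros((T, T))
    for i in range(T):
        # Row i compares frame i against every other frame.
        S[i] = np.abs(flat - flat[i]).mean(axis=1)
    return S

def dominant_period(S):
    """Estimate a motion period (in frames) from one row of S.
    Periodic motion makes the row oscillate; the strongest
    nonzero Fourier bin gives the repetition frequency."""
    row = S[0] - S[0].mean()
    spectrum = np.abs(np.fft.rfft(row))
    spectrum[0] = 0.0  # ignore the DC component
    k = int(np.argmax(spectrum))
    return None if k == 0 else len(row) / k
```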
Activity analysis deals with parsing temporal sequences of object observations to produce high-level descriptions of agent actions and multiagent interactions. In our opinion, this will be the most important area of future research in video surveillance. N.M. Oliver, B. Rosario, and A.P. Pentland introduce Coupled Hidden Markov Models (CHMMs) to detect and classify interactions consisting of two interleaved agent action streams and present a training method based on synthetic agents to address the problem of parameter estimation from limited real-world training examples. M. Brand and V. Kettnaker present an entropy-minimization approach to estimating HMM topology and parameter values, thereby simultaneously clustering video sequences into events and creating classifiers to detect those events in the future.
Y.A. Ivanov and A.F. Bobick recognize gestures and multiobject interactions from noisy, low-level tracking data by parsing a stochastic context-free grammar (SCFG) that defines multiple events that can be occurring simultaneously in the scene. T. Wada and T. Matsuyama present a hypothesize-and-test approach to recognizing multiple object behaviors directly from video sequences using a Nondeterministic Finite Automaton (NFA) that allows all feasible interpretation states to be simultaneously active. They also introduce a colored-token propagation mechanism to keep track of the partial interpretations being assembled for different objects over time and present extensions to handle multiple simultaneous video streams.
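The common primitive beneath the HMM-style papers in this category is scoring an observation sequence against a trained model. As a minimal illustration (not the CHMM coupling of Oliver et al. nor Brand and Kettnaker's entropy minimization; the names and parameters below are our own), this sketch implements the standard scaled forward algorithm for a discrete-output HMM.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs : sequence of observation symbol indices
    pi  : (N,) initial state distribution
    A   : (N, N) transition matrix, A[i, j] = P(state j | state i)
    B   : (N, M) emission matrix, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha = alpha / scale
    for o in obs[1:]:
        # Propagate through the transition model, then weight
        # by how well each state explains the new observation.
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        log_lik += np.log(scale)  # rescale to avoid underflow
        alpha = alpha / scale
    return log_lik
```

Given one trained model per event class, a new trajectory is labeled with the class whose model assigns it the highest log-likelihood; topology estimation, state coupling, and grammar parsing are the respective papers' refinements on top of this kernel.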
Several of these papers represent work funded under the recent DARPA Video Surveillance and Monitoring (VSAM) research program. Carnegie Mellon University was chosen to lead this effort by developing an end-to-end testbed system that integrates a wide range of advanced surveillance techniques: real-time moving object detection and tracking from stationary and moving camera platforms, recognition of generic object classes (e.g., human, sedan, truck) and specific object types (e.g., campus police car, FedEx van), object pose estimation with respect to a geospatial site model, active camera control and multicamera cooperative tracking, human gait analysis, recognition of simple multiagent activities, real-time data dissemination, data logging, and dynamic scene visualization. We invite the reader to visit the VSAM web page at http://www.cs.cmu.edu/~vsam/ for more information.
Discussions of video surveillance research with nonpractitioners invariably lead to comments about Big Brother. Although this is obviously not the goal of current video surveillance research, the concern is reasonable. In 1998, the NYC Surveillance Camera Project run by the New York Civil Liberties Union documented nearly 2,500 surveillance cameras viewing public spaces within Manhattan. The vast majority are privately owned cameras installed outside businesses and apartment complexes, with no mechanism to correlate information between them. However, it would not be infeasible for a sufficiently well-funded government to install a network of thousands of cameras capable of tracking individual citizens as they walk through the city. As the two research paths of video surveillance and biometric identification begin to merge, this scenario becomes even more troubling. Is the promise of never being mugged worth the loss of privacy implied by always being watched? These larger societal questions stray outside the scope of this technical journal, but now is a good time to begin to specify what data should be collected, how long it should be stored, and who has access, so that an ethical framework will be in place to guide the development and application of the powerful technology that will soon be available.
Robert T. Collins
Alan J. Lipton
Takeo Kanade

R.T. Collins and T. Kanade are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. E-mail: {rcollins, tk}@cs.cmu.edu.
A.J. Lipton is with DiamondBack Vision, Inc., Washington, DC. E-mail: ajl@dbvision.net.
Robert T. Collins received the PhD degree in computer science in 1993 from the University of Massachusetts at Amherst for work on scene reconstruction using stochastic projective geometry. He is a member of the Research Faculty at the Robotics Institute of Carnegie Mellon University (CMU). From 1992 to 1996, he was technical director of the DARPA RADIUS project at the University of Massachusetts, culminating in the ASCENDER system for populating 3D site models from multiple, oblique aerial views. From 1996 to 1999, Dr. Collins was technical codirector of the DARPA Video Surveillance and Monitoring (VSAM) project at CMU. This project developed real-time, automated video understanding algorithms that guide a network of active video sensors to monitor the activities of people and vehicles in a complex scene. Dr. Collins has published for more than a decade on topics in video surveillance, 3D site modeling, multi-image stereo, projective geometry, and knowledge-based scene understanding.
Alan J. Lipton received the PhD degree in electrical and computer systems engineering from Monash University, Melbourne, Australia, in 1996. For his thesis, he studied the problem of mobile robot navigation by natural landmark recognition using on-board vision sensing. He is a senior scientist at DiamondBack Vision, Inc., an internet startup company based in Washington, D.C. From 1997 through 2000, he served on the faculty of CMU's Robotics Institute. During his time at CMU, Dr. Lipton was a project comanager of DARPA's Video Surveillance and Monitoring (VSAM) project. On this project, Dr. Lipton developed algorithms for detection and tracking of people and vehicles from video streams, integration and fusion of video data, user interfaces for vision system networks, and intelligent sensor control.
Takeo Kanade received the BE degree in electrical engineering from Kyoto University in 1968, the ME degree in 1970, and the PhD degree in 1973. He is the U.A. and Helen Whitaker Professor of Computer Science and Robotics and director of the Robotics Institute at Carnegie Mellon University. He has made widely known technical contributions in multiple areas of computer vision, robotics, and sensor design. At CMU, he has led many major projects on vision and robotics sponsored by NSF, DARPA, NASA, DOE, and NIMH. He is a member of the National Academy of Engineering and a fellow of the IEEE, ACM, and AAAI. He has received several awards, including the Joseph F. Engelberger Award, the JARA Award, the Yokogawa Prize, the Hip Society Otto AuFranc Award, and the Marr Prize. He is founding chief editor of the International Journal of Computer Vision.