Video Anomaly Detection by the Combination of C3D and LSTM
Yuxuan Zhao
Department of Computing, School of Advanced Technology
Xi'an Jiaotong-Liverpool University
Suzhou, China
Yuxuan.zhao@xjtlu.edu.cn

Gabriela Mogos
Department of Computing, School of Advanced Technology
Xi'an Jiaotong-Liverpool University
Suzhou, China
Gabriela.Mogos@xjtlu.edu.cn

Ka Lok Man
Department of Computing, School of Advanced Technology
Xi'an Jiaotong-Liverpool University
Suzhou, China
Swinburne University of Technology Sarawak, Malaysia
imec-DistriNet, KU Leuven, Belgium
Kazimieras Simonavicius University, Lithuania
Vytautas Magnus University, Lithuania
Ka.man@xjtlu.edu.cn
Abstract: Video anomaly detection is a significant problem in computer vision. It requires methods to detect unusual events in videos. The core of this task is to produce a correct understanding of the input video. To achieve this, methods need to extract both spatial and temporal features. Research on image processing has shown that deep convolutional neural networks perform well on spatial feature extraction. The problem therefore becomes how to extract temporal features from the video. This paper proposes a model that combines two effective temporal feature processing methods, 3D Convolution (C3D) and Long Short-Term Memory (LSTM), to handle video anomaly detection. We conduct experiments on a well-known video anomaly dataset, UCF-Crime, and achieve better performance than other methods.
Keywords: video anomaly detection; computer vision; deep learning; C3D; LSTM
I. INTRODUCTION
Video anomaly detection is the problem of detecting unforeseeable and emergency events in videos. Performed manually, this task is tough and time-consuming, so anomaly detection methods aim to detect anomalies in input videos automatically. To achieve this, methods focus on spatial and temporal feature extraction. Unlike images, videos contain temporal features in addition to spatial ones, and these temporal features can improve detection accuracy. However, exploiting them requires a detection method that can extract useful temporal information. In this paper, we provide a method that combines two widely used models, the 3D Convolutional Neural Network (C3D) [1] and Long Short-Term Memory (LSTM) [2], for temporal feature extraction to achieve better performance. To make these two models work together, some modifications are made to both. In addition, we improve the video processing stage to cope with limited computing power.
II. METHODOLOGY
The proposed method aims to achieve a better detection result than previous methods. To this end, it combines C3D and LSTM to improve detection performance. Considering the limitations of computing power and training time, several improvements have been made to both structures.
The general working process of the model is shown in Fig. 1. The input video is first sampled as RGB frames. The frames are then processed by C3D to obtain both spatial and temporal features. Finally, an LSTM layer is used to enhance the temporal information and produce the final detection result.
A. Video Processing
The first challenge is computing power. Since C3D features have one more dimension, they are more complex than the features of a traditional 2D CNN. In addition, longer sequences need multiple LSTM layers, or more parameters per layer, to extract the temporal information. As a result, more layers would have to be added, or more parameters used in a single LSTM layer, which again runs into the computing power problem. The size of the C3D features therefore limits the number of LSTM layers and parameters.

Fig. 1. The general structure of the proposed method.

To solve this problem, the number of training samples should be reduced for both C3D and LSTM. For example, we could pick one part of the video, or pick one frame out of every five. However, if we only pick one part of the video, it is hard to decide which part contains the unforeseeable events. If we pick one frame out of every five, detection performance suffers because the information in the skipped frames is lost. Instead, we divide each video into clips of 16 frames and extract C3D features for every clip. Each clip thus produces a clip-based feature that the LSTM can take as input for further processing; a minimal sketch of this splitting step is given after the list below. For example, if a video contains 800 frames, we get 800/16 = 50 clips. This method has the following advantages.
• It reduces the pressure on computing power: neither C3D nor the LSTM has to handle the frames of a whole video at once.
• It avoids the drawbacks of the other sample reduction methods: all frames are still used as input, so no information is wasted.
• It allows C3D and the LSTM to work together.
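To make the clip-splitting step concrete, the sketch below shows one way to divide a frame tensor into non-overlapping 16-frame clips. This is a minimal illustration in PyTorch, not the authors' released code; the function name split_into_clips and the 112x112 frame size are our own assumptions.

```python
import torch

CLIP_LEN = 16  # frames per clip, as described above

def split_into_clips(frames: torch.Tensor, clip_len: int = CLIP_LEN) -> torch.Tensor:
    """Split a video tensor of shape (T, C, H, W) into non-overlapping clips.

    Returns a tensor of shape (num_clips, C, clip_len, H, W), the
    channel-first per-clip layout expected by 3D convolutions.
    """
    t = frames.shape[0] - frames.shape[0] % clip_len  # drop any trailing frames
    clips = frames[:t].reshape(t // clip_len, clip_len, *frames.shape[1:])
    return clips.permute(0, 2, 1, 3, 4)  # (N, C, T, H, W)

# Example from the text: an 800-frame video yields 800 / 16 = 50 clips.
video = torch.randn(800, 3, 112, 112)
print(split_into_clips(video).shape)  # torch.Size([50, 3, 16, 112, 112])
```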
B. Clip Processing (C3D)
Convolutional structures have been widely used for spatial feature extraction. C3D was developed to extend this classic model to temporal feature extraction. The features produced by C3D have one more dimension, which carries the temporal information, so the network can output both spatial and temporal features. However, the extra dimension also leads to a large number of calculations during training. The clip-based input alleviates this problem to a certain extent, but further modifications to the C3D structure itself are also needed.
Fig. 2 shows the C3D structure in the proposed method. Compared with the original C3D network [1], two convolutional groups have been removed to simplify the whole model. In addition, every 3D convolutional kernel is set to 3×3×3, smaller than the kernel sizes commonly used in traditional 2D CNNs.

Fig. 2. The structure of the C3D.
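The paper does not list the exact layer configuration, so the following is only a plausible sketch of such a reduced C3D backbone, assuming a PyTorch implementation; the channel widths, pooling schedule, and 512-dimensional output are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class SimplifiedC3D(nn.Module):
    """A reduced C3D backbone using only 3x3x3 kernels.

    The original C3D has five convolutional groups; two are dropped here,
    mirroring the simplification described in Section II.B.
    """
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # preserve early temporal resolution
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)  # collapse each clip to one vector
        self.fc = nn.Linear(256, feature_dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (N, 3, 16, H, W) -> (N, feature_dim), one feature per clip
        x = self.features(clips)
        return self.fc(self.pool(x).flatten(1))
```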
C. LSTM
LSTM has been used for unforeseeable event detection in previous work. In the proposed model, the LSTM layer handles the clip-based features produced by C3D and outputs video-based features for the final detection. Compared with frame-based input, clip-based features reduce the amount of LSTM computation. The LSTM in the proposed method is therefore a single unidirectional LSTM layer [3]. This simple structure speeds up the training process and achieves better performance.
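A minimal sketch of such an LSTM head is given below, again assuming PyTorch; the hidden size, the binary normal/anomalous output, and the use of the final hidden state for classification are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ClipLSTMHead(nn.Module):
    """Unidirectional LSTM over the sequence of clip-based C3D features."""

    def __init__(self, feature_dim: int = 512, hidden_dim: int = 256,
                 num_classes: int = 2):  # 2 = normal vs. anomalous (assumption)
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, feature_dim)
        _, (h_n, _) = self.lstm(clip_feats)
        return self.classifier(h_n[-1])  # classify from the final hidden state

# Example: 50 clip features of dimension 512 from a single 800-frame video.
scores = ClipLSTMHead()(torch.randn(1, 50, 512))
print(scores.shape)  # torch.Size([1, 2])
```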
III. EXPERIMENTS AND RESULTS
A. Dataset
We conduct experiments on the UCF-Crime dataset [4]. It consists of long untrimmed surveillance videos covering 13 real-world anomaly classes: Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. The basic information about this dataset is shown below.
• Number of videos: 1,900
• Number of labels: 13
• Total length: 128 hours
• Average frames per video: 7,274
• Frame rate: 30 fps
Besides the challenge of anomalous activity detection itself, some characteristics of this dataset make the experiments more difficult.
• Some videos replay the same footage several times within one file (e.g., Abuse001).
• Some videos contain different views of a single event (e.g., Fighting032). This leads to dramatic changes in the optical flow, which may cause the temporal stream to extract wrong features.
• The lengths of different videos vary greatly.
B. Results and Discussion
Table I shows the detection accuracy of the different methods. The proposed method achieves the highest accuracy among them, 82.35%. Compared with the second-best method, it still provides an improvement of roughly 7%.
The experimental results show that the proposed method performs well on the video anomaly detection task. C3D can work with LSTM to improve the detection result. In addition, the video processing method in this model, which uses clip-based frame groups, is shown to effectively reduce the amount of computation.
TABLE I. PERFORMANCE OF DIFFERENT MODELS ON THE UCF-CRIME DATASET

Method               Accuracy (%)
Hasan et al. [5]     50.6
Lu et al. [6]        65.51
Sultani et al. [4]   75.41
Proposed method      82.35
IV. CONCLUSION
This paper proposes a new model for video anomaly detection. The model combines C3D and LSTM to achieve better detection accuracy. C3D is used to extract clip-based spatiotemporal features from the video. A unidirectional LSTM layer is then used to enhance the temporal features and produce the final video-based features and the detection result for the whole stream. The model is evaluated on the UCF-Crime video dataset. The results show that our method achieves the highest detection accuracy, 82.35%, compared with the other methods. This demonstrates that C3D and LSTM can work together on the video anomaly detection task. Future work will focus on simplifying this model: since both C3D and LSTM demand substantial computing power and lengthen the training time, further modifications are needed.
ACKNOWLEDGMENT
This article is supported by Xi'an Jiaotong-Liverpool University (XJTLU), Suzhou, China, through the Research Development Fund (RDF-15-01-01). Ka Lok Man wishes to thank the AI University Research Centre (AI-URC), Xi'an Jiaotong-Liverpool University, Suzhou, China, for supporting his related research contributions to this article through the XJTLU Key Programme Special Fund (KSF-E-65) and the Suzhou-Leuven IoT & AI Cluster Fund.
REFERENCES
[1] Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489-4497).
[2] Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9(8), pp. 1735-1780.
[3] Zhao, Y., Man, K.L., Smith, J., Siddique, K. and Guan, S.U., 2020. Improved two-stream model for human action recognition. EURASIP Journal on Image and Video Processing, 2020(1), pp. 1-9.
[4] Sultani, W., Chen, C. and Shah, M., 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6479-6488).
[5] Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K. and Davis, L.S., 2016. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 733-742).
[6] Lu, C., Shi, J. and Jia, J., 2013. Abnormal event detection at 150 FPS in MATLAB. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2720-2727).