Video Anomaly Detection by the Combination of C3D and LSTM
Yuxuan Zhao
Department of Computing, School of Advanced Technology
Xi'an Jiaotong-Liverpool University
Suzhou, China
Yuxuan.zhao@xjtlu.edu.cn

Gabriela Mogos
Department of Computing, School of Advanced Technology
Xi'an Jiaotong-Liverpool University
Suzhou, China
Gabriela.Mogos@xjtlu.edu.cn

Ka Lok Man
Department of Computing, School of Advanced Technology
Xi'an Jiaotong-Liverpool University
Suzhou, China
Swinburne University of Technology Sarawak, Malaysia
imec-DistriNet, KU Leuven, Belgium
Kazimieras Simonavicius University, Lithuania
Vytautas Magnus University, Lithuania
Ka.man@xjtlu.edu.cn
Abstract: Video anomaly detection is a significant problem in computer vision. It requires methods to detect unusual events in videos. The core of this task is to produce a correct understanding of the input video. To achieve this, methods need to extract both spatial and temporal features. Research on image processing has shown that deep convolutional neural networks perform well on spatial feature extraction. The problem therefore becomes how to extract temporal features from the video. This paper proposes a model that combines two effective temporal feature processing methods, 3D Convolution (C3D) and Long Short-Term Memory (LSTM), to handle video anomaly detection. We conduct experiments on a well-known video anomaly dataset, UCF-Crime, and achieve better performance than other methods.
Keywords: video anomaly detection; computer vision; deep learning; C3D; LSTM
I. INTRODUCTION
Video anomaly detection is the problem of detecting unforeseeable and emergency events in videos. Performed manually, this task is tough and time-consuming, so anomaly detection methods aim to detect anomalies in input videos automatically. To achieve this, methods focus on spatial and temporal feature extraction. Unlike images, videos contain temporal features in addition to spatial ones, and these temporal features can improve detection accuracy. However, exploiting them requires a detection method that can extract useful temporal information. In this paper, we provide a method that combines two widely used models, the 3D Convolutional Neural Network (C3D) [1] and Long Short-Term Memory (LSTM) [2], for temporal feature extraction to achieve better performance. To make these two models work together, some modifications are made to both. In addition, we improve the video processing stage to cope with limited computing power.
II. METHODOLOGY
The proposed method aims to achieve a better detection result than previous methods. To this end, it combines C3D and LSTM to improve detection performance. Considering the limitations of computing power and training time, several improvements have been made to both structures.
The general working process of the model is shown in Fig. 1. The input video is first sampled as RGB frames. The frames are then processed by C3D to obtain both spatial and temporal features. Finally, an LSTM layer is used to enhance the temporal information and produce the final detection result.
A. Video Processing
The first challenge is computing power. Since C3D features have one more dimension, they are more complex than the features of a traditional 2D CNN. In addition, longer sequences need multiple LSTM layers, or more parameters per layer, to extract the temporal information. As a result, more layers would have to be added, or more parameters used in a single LSTM layer, which again runs into the computing power problem. The size of the C3D features therefore limits the number of LSTM layers and parameters.

Fig. 1. The general structure of the proposed method.

To solve this problem, the number of training samples should be reduced for both C3D and LSTM. For example, we could pick one part of the video, or pick one frame out of every five. However, if we only pick one part of the video, it is hard to decide which part contains the unforeseeable events. If we pick one frame out of every five, detection performance suffers because the information in the skipped frames is lost. Instead, we divide each video into clips of 16 frames and extract C3D features for every clip. Each clip thus produces a clip-based feature that the LSTM can take as input for further processing; a minimal sketch of this splitting step is given after the list below. For example, if a video contains 800 frames, we get 800/16 = 50 clips. This method has the following advantages.
• It reduces the pressure on computing power: neither C3D nor the LSTM has to handle the frames of a whole video at once.
• It avoids the drawbacks of the other sample reduction methods: all frames are still used as input, so no information is wasted.
• It allows C3D and the LSTM to work together.
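To make the clip-splitting step concrete, the sketch below shows one way to divide a frame tensor into non-overlapping 16-frame clips. This is a minimal illustration in PyTorch, not the authors' released code; the function name split_into_clips and the 112x112 frame size are our own assumptions.

```python
import torch

CLIP_LEN = 16  # frames per clip, as described above

def split_into_clips(frames: torch.Tensor, clip_len: int = CLIP_LEN) -> torch.Tensor:
    """Split a video tensor of shape (T, C, H, W) into non-overlapping clips.

    Returns a tensor of shape (num_clips, C, clip_len, H, W), the
    channel-first per-clip layout expected by 3D convolutions.
    """
    t = frames.shape[0] - frames.shape[0] % clip_len  # drop any trailing frames
    clips = frames[:t].reshape(t // clip_len, clip_len, *frames.shape[1:])
    return clips.permute(0, 2, 1, 3, 4)  # (N, C, T, H, W)

# Example from the text: an 800-frame video yields 800 / 16 = 50 clips.
video = torch.randn(800, 3, 112, 112)
print(split_into_clips(video).shape)  # torch.Size([50, 3, 16, 112, 112])
```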
B. Clip Processing (C3D)
Convolutional structures have been widely used for spatial feature extraction. C3D was developed to extend this classic model to temporal feature extraction. The features produced by C3D have one more dimension, which carries the temporal information, so the network can output both spatial and temporal features. However, the extra dimension also leads to a large number of calculations during training. The clip-based input alleviates this problem to a certain extent, but further modifications to the C3D structure itself are also needed.
Fig. 2 shows the C3D structure in the proposed method. Compared with the original C3D network [1], two convolutional groups have been removed to simplify the whole model. In addition, every 3D convolutional kernel is set to 3×3×3, smaller than the kernel sizes commonly used in traditional 2D CNNs.

Fig. 2. The structure of the C3D.
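The paper does not list the exact layer configuration, so the following is only a plausible sketch of such a reduced C3D backbone, assuming a PyTorch implementation; the channel widths, pooling schedule, and 512-dimensional output are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class SimplifiedC3D(nn.Module):
    """A reduced C3D backbone using only 3x3x3 kernels.

    The original C3D has five convolutional groups; two are dropped here,
    mirroring the simplification described in Section II.B.
    """
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # preserve early temporal resolution
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)  # collapse each clip to one vector
        self.fc = nn.Linear(256, feature_dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (N, 3, 16, H, W) -> (N, feature_dim), one feature per clip
        x = self.features(clips)
        return self.fc(self.pool(x).flatten(1))
```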
C. LSTM
LSTM has been used for unforeseeable event detection in previous work. In the proposed model, the LSTM layer handles the clip-based features produced by C3D and outputs video-based features for the final detection. Compared with frame-based input, clip-based features reduce the amount of LSTM computation. The LSTM in the proposed method is therefore a single unidirectional LSTM layer [3]. This simple structure speeds up the training process and achieves better performance.
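A minimal sketch of such an LSTM head is given below, again assuming PyTorch; the hidden size, the binary normal/anomalous output, and the use of the final hidden state for classification are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ClipLSTMHead(nn.Module):
    """Unidirectional LSTM over the sequence of clip-based C3D features."""

    def __init__(self, feature_dim: int = 512, hidden_dim: int = 256,
                 num_classes: int = 2):  # 2 = normal vs. anomalous (assumption)
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, feature_dim)
        _, (h_n, _) = self.lstm(clip_feats)
        return self.classifier(h_n[-1])  # classify from the final hidden state

# Example: 50 clip features of dimension 512 from a single 800-frame video.
scores = ClipLSTMHead()(torch.randn(1, 50, 512))
print(scores.shape)  # torch.Size([1, 2])
```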
III. EXPERIMENTS AND RESULTS
A. Dataset
We conduct experiments on the UCF-Crime dataset [4]. It consists of long untrimmed surveillance videos covering 13 real-world anomaly classes: Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. The basic information about this dataset is shown below.
• Number of videos: 1,900
• Number of labels: 13
• Total length: 128 hours
• Average frames per video: 7,274
• Frame rate: 30 fps
Besides the challenge of anomalous activity detection itself, some characteristics of this dataset make the experiments more difficult.
• Some videos replay the same footage several times within one file (e.g., Abuse001).
• Some videos contain different views of a single event (e.g., Fighting032). This leads to dramatic changes in the optical flow, which may cause the temporal stream to extract wrong features.
• The lengths of different videos vary greatly.
B. Results and Discussion
Table I shows the detection accuracy of the different methods. The proposed method achieves the highest accuracy among them, 82.35%. Compared with the second-best method, it still provides an improvement of roughly 7%.
The experimental results show that the proposed method performs well on the video anomaly detection task. C3D can work with LSTM to improve the detection result. In addition, the video processing method in this model, which uses clip-based frame groups, is shown to effectively reduce the amount of computation.
TABLE I. PERFORMANCE OF DIFFERENT MODELS ON THE UCF-CRIME DATASET

Method               Accuracy (%)
Hasan et al. [5]     50.6
Lu et al. [6]        65.51
Sultani et al. [4]   75.41
Proposed method      82.35
IV. CONCLUSION
This paper proposes a new model for video anomaly detection. The model combines C3D and LSTM to achieve better detection accuracy. C3D is used to extract clip-based spatiotemporal features from the video. A unidirectional LSTM layer is then used to enhance the temporal features and produce the final video-based features and the detection result for the whole stream. The model is evaluated on the UCF-Crime video dataset. The results show that our method achieves the highest detection accuracy, 82.35%, compared with the other methods. This demonstrates that C3D and LSTM can work together on the video anomaly detection task. Future work will focus on simplifying this model: since both C3D and LSTM demand substantial computing power and lengthen the training time, further modifications are needed.
ACKNOWLEDGMENT
This article is supported by Xi'an Jiaotong-Liverpool University (XJTLU), Suzhou, China, through the Research Development Fund (RDF-15-01-01). Ka Lok Man wishes to thank the AI University Research Centre (AI-URC), Xi'an Jiaotong-Liverpool University, Suzhou, China, for supporting his related research contributions to this article through the XJTLU Key Programme Special Fund (KSF-E-65) and the Suzhou-Leuven IoT & AI Cluster Fund.
REFERENCES
[1] Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489-4497).
[2] Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9(8), pp. 1735-1780.
[3] Zhao, Y., Man, K.L., Smith, J., Siddique, K. and Guan, S.U., 2020. Improved two-stream model for human action recognition. EURASIP Journal on Image and Video Processing, 2020(1), pp. 1-9.
[4] Sultani, W., Chen, C. and Shah, M., 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6479-6488).
[5] Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K. and Davis, L.S., 2016. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 733-742).
[6] Lu, C., Shi, J. and Jia, J., 2013. Abnormal event detection at 150 FPS in MATLAB. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2720-2727).