Real-Time Human Action Recognition
Using CNN Over Temporal Images for Static
Video Surveillance Cameras
Cheng-Bin Jin, Shengzhe Li, Trung Dung Do, and Hakil Kim
Information and Communication Engineering, Inha University, Incheon, Korea
{sbkim,szli,dotrungdung}@vision.inha.ac.kr, hikim@inha.ac.kr
Abstract. This paper proposes a real-time human action recognition approach
to static video surveillance systems. This approach predicts human actions using
temporal images and convolutional neural networks (CNN). CNN is a type of
deep learning model that can automatically learn features from training videos.
Although the state-of-the-art methods have shown high accuracy, they consume
a lot of computational resources. Another problem is that many methods assume
exact knowledge of human positions. Moreover, most of the current
methods build complex handcrafted features for specific classifiers. Therefore,
these kinds of methods are difficult to apply in real-world applications. In this
paper, a novel CNN model based on temporal images and a hierarchical action
structure is developed for real-time human action recognition. The hierarchical
action structure includes three levels: action layer, motion layer, and posture
layer. The top layer represents subtle actions; the bottom layer represents pos-
ture. Each layer contains one CNN, which means that this model has three
CNNs working together; layers are combined to represent many different kinds
of action with a large degree of freedom. The developed approach was imple-
mented and achieved superior performance on the ICVL action dataset; the
algorithm can run at around 20 frames per second.
Keywords: Video surveillance · Action recognition · Temporal images ·
Convolutional neural network · Hierarchical action structure
1 Introduction
The ability of a computer to recognize human actions can be important in many
real-word applications including intelligent video surveillance, kinematic analysis,
video retrieval, and criminal investigation. Based on the types of input video, action
recognition can be divided into four classes: surveillance videos, sport videos, movies
and user videos, and first-person videos. Different types of videos have different char-
acteristics: Surveillance video [1,2] usually uses a static camera that records from a
side or top view. Therefore, the background of surveillance video is relatively simple,
and research objects of surveillance are people or cars. Currently, millions of sur-
veillance cameras are in place throughout the world. This means that more than 800 K
video hours are generated per day. The objective of action recognition in the surveillance
field is the understanding of video content; it is necessary to have a program that can auto-
matically label human events. The viewpoint of sport video [3] is the same as that of
surveillance video. However, the objects in sport videos are usually fast-moving
people. Sport videos need to be segmented manually before performing
post-processing. Movies and user videos [4] are recorded with moving cameras; the
view is almost always from the front or the side. The problems of jittery video and a
dynamic and complicated background make this kind of video more difficult to process
than the previous ones. First-person videos [5, 6] have become popular since Google
launched Google Glass. Because this technology employs a moving camera, videos
obtained using Google Glass are very dynamic.
However, accurate action recognition is a very challenging task due to large
variations in appearance. For example, occlusions, non-rigid motion, scale variation,
view-point changes, subtle action, and clothing colors similar to the background color
are all important problems. Manual collection of training samples is another difficult
task. It requires much human effort and is time consuming. The other challenging task
is to create an approach that can process video in real-time and that can be applied in
real-world environments. The number of approaches to recognizing human
action in video has grown at a tremendous rate. Prior approaches can be divided into
appearance-based methods, motion-based methods, space-time based methods, and
deep learning-based methods, the last of which has become a hot topic recently.
Motion history images (MHI) or temporal images [7–9] make up the most popular
appearance-based method. The advantages of the MHI method are that it is simple, fast,
and it works very well in controlled environments. However, MHI is sensitive to errors
of background subtraction. The fatal flaw of MHI is that it cannot capture interior
motions and shapes. MHI can only capture silhouettes, but silhouettes tell little about
actions. Other appearance-based methods are the active shape model, the learned
dynamic prior model, and the motion prior model.
Motion-based methods [10] (generic and parametric optical flow, and temporal
activity models) enable an analysis of the temporal information of sub-actions. These
methods can also be used to model long-term activities with variable structures of
action. However, important questions remain, such as how many sub-action units are
meaningful, and how can these sub-action units be found for a target activity?
Both of these are open problems that need to be solved.
Space-time methods are based on handcrafted features; they were proposed to
handle complex dynamic scenes. There are many different descriptors (e.g. HOG, HOF,
Cuboids, HOG3D, and extended SURF) and detectors (Harris3D, Cuboids, Hessian,
and regular dense sampling) [11,12]. In order to improve performance, spatio-temporal
grids [13] and analysis of co-occurrence between action and scene [14] are considered.
The spatio-temporal grid method divides regions of interest into many areas and then
linearly combines the descriptors extracted from the different areas. Co-occurrence is a
method that considers the relationship between action and scene; it assigns weights to
the classified results in order to update the final result.
The convolutional neural network (CNN) model [15] is a type of deep learning
model. It is a class of supervised machine learning algorithms that can learn a hierarchy
of features by building high-level features from low-level ones [16]. Some researchers
have started to use CNN to recognize human actions [17,18]. However, it will be
necessary to determine what a good CNN architecture is. This is a question that is still
difficult to answer, and a problem that will require further research.
The key contributions of this paper can be summarized as follows:
• This paper proposes a novel model for human detection, human tracking, and
recognition of actions in real-time. The model does not make any assumptions (e.g.,
ground truth of human region, small scale, or viewpoint changes) about the
circumstances.
• A hierarchical action structure is described for real-time human action recognition.
In this structure, three CNNs work together; layers are combined to represent many
different kinds of action with a large degree of freedom.
• Different temporal images are used in the 3 layers of the hierarchical action
structure. Experimental results show that these kinds of temporal images are very
suitable for use in video surveillance systems, without compromising accuracy or
processing time.
The rest of this paper is organized as follows: a definition of the hierarchical action
structure is provided in Sect. 2. Different temporal images in 3 layers, and the CNN
architecture, are described in Sect. 3. The experimental results for the ICVL (Inha
Computer Vision Lab) action dataset are reported in Sect. 4. Section 5 provides
conclusions of the paper.
2 Hierarchical Action Structure
The hierarchical action structure developed for real-time human action recognition is
shown in Fig. 1. The structure includes three layers: the action layer, motion layer, and
posture layer. The action layer has four classes: nothing, texting, smoking, and
others. The motion layer has classes of stationary, walking, and running. The
posture layer has classes of sitting and standing. There are certain types of common
information between the posture layer and the motion layer. For the action layer, the
posture layer provides supplementary information. The 3 layers together deliver a
complete set of information for human actions.
Fig. 1. Hierarchical action structure
The advantage of this structure is that it uses 9 action categories to represent
various action combinations. To add a new action, revising the overall structure is not
necessary; rather, it is possible to revise only the corresponding layer and re-train that
layer, as sketched below.
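To make the structure concrete, the following minimal sketch (an illustration, not code from the paper; the combine helper and its output format are assumptions) shows how the three per-layer predictions compose, and why adding a class touches only one layer:

```python
# Minimal sketch of the hierarchical action structure (Fig. 1). The class
# lists come from the paper; combine() and its output format are assumptions.
ACTION_CLASSES = ["nothing", "texting", "smoking", "others"]  # action layer
MOTION_CLASSES = ["stationary", "walking", "running"]         # motion layer
POSTURE_CLASSES = ["sitting", "standing"]                     # posture layer

def combine(action_id: int, motion_id: int, posture_id: int) -> str:
    """Merge the three per-layer CNN predictions into one description.
    9 classes (4 + 3 + 2) cover 4 * 3 * 2 = 24 action combinations."""
    return " / ".join([POSTURE_CLASSES[posture_id],
                       MOTION_CLASSES[motion_id],
                       ACTION_CLASSES[action_id]])

# Adding a new action (e.g. a hypothetical "phoning") only extends
# ACTION_CLASSES and re-trains the action-layer CNN; the other two
# layers are untouched.
print(combine(1, 0, 1))  # -> "standing / stationary / texting"
```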
3 Human Action Recognition Using CNN
The objective of the paper is to propose a real-time algorithm for recognizing human
action in surveillance video; at the same time, the method should not employ any
assumptions about the video. In order to process video in real-time, human detection is
a precondition of action recognition: this is also a big challenge that is still the subject
of much research. First, the approach delineated in this paper performs motion
detection using the Gaussian Mixture Model (GMM); after this, the system detects
humans in the motion regions using Histograms of Oriented Gradients (HOG) [19]. To increase
system speed, a tracking-by-detection technique is employed in the developed algorithm.
Occluded and difficult-to-detect humans (those detected in previous frames but lost in
the current frame) are tracked using a Kalman filter. The algorithm flow chart is
shown in Fig. 2.
Fig. 2. Algorithm flow chart
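As an illustration of this front end, the following is a minimal sketch using OpenCV's stock GMM background subtractor and HOG people detector (the parameter values, input file name, and motion-gating logic are assumptions; the paper does not specify them):

```python
import cv2

# Sketch of the pipeline in Fig. 2: GMM motion detection -> HOG human
# detection inside motion regions -> Kalman-filter tracking.
gmm = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture("surveillance.avi")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (640, 360))   # resized as in Sect. 4
    fg_mask = gmm.apply(frame)              # GMM foreground (motion) mask
    if cv2.countNonZero(fg_mask) > 0:       # run HOG only when motion exists
        boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
        # boxes would next be matched to existing tracks; a Kalman filter
        # (cv2.KalmanFilter) predicts positions of humans detected in
        # previous frames but lost in the current one.
cap.release()
```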
3.1 Temporal Images
In the training stage, manually cropped human images and action labels are used as
training data. Every layer of the structure has one independent CNN that requires
different temporal images. Binary Difference Images (BDI), Motion History Images
(MHI), and Weighted Summation Images (WSI) are used in the 3 layers. BDI is a
specific form of temporal image, given by Eq. (1):
$$
b(x,y,t) =
\begin{cases}
1, & \text{if } f(x,y,t) - f(x,y,t_0) > \text{threshold} \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$
As its name suggests, BDI is a binary image: pixels are set to 1 if the difference
from the reference image is larger than the threshold. $x$ and $y$ are indexes in the
image; $f(x,y,t)$ is the current frame; $f(x,y,t_0)$ is the first frame of the input video.
MHI is defined in Eqs. (2)–(4) [8]:
$$
\delta(x,y,t) =
\begin{cases}
1, & \text{if } f(x,y,t) - f(x,y,t-1) > \text{threshold} \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$

$$
h_{\tau}(x,y,t) =
\begin{cases}
\tau_{max}, & \text{if } \delta(x,y,t) = 1 \\
\max\left(0,\; h_{\tau}(x,y,t-1) - \Delta\tau\right), & \text{otherwise}
\end{cases}
\tag{3}
$$

$$
\Delta\tau = \frac{\tau_{max} - \tau_{min}}{n}
\tag{4}
$$
From Eq. (3), $h_{\tau}(x,y,t)$ (the MHI) is generated from the difference between the
current frame and the previous frame $f(x,y,t-1)$. For every frame, the MHI is updated
from the previous MHI; therefore, it does not need to be recalculated over the whole
set of frames. In Eq. (4), $n$ is the number of frames to be considered. WSI is the
weighted summation of BDI and MHI, given by Eq. (5), subject to the constraint
$w_{BDI} + w_{MHI} = 1$. The temporal images in this paper are constructed using
$\tau_{max} = 255$, $\tau_{min} = 0$, $n = 10$, $w_{BDI} = 0.4$, and $w_{MHI} = 0.6$.
$$
s(x,y,t) = \mathbf{w}^{T}
\begin{bmatrix}
\tau_{max} \cdot b(x,y,t) \\
h_{\tau}(x,y,t)
\end{bmatrix},
\qquad
\mathbf{w} =
\begin{bmatrix}
w_{BDI} \\
w_{MHI}
\end{bmatrix}
\tag{5}
$$
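For concreteness, here is a minimal NumPy sketch of one incremental update of the three temporal images, using the parameter values above (the threshold value and the assumption of grayscale float frames are not specified in the paper):

```python
import numpy as np

TAU_MAX, TAU_MIN, N = 255.0, 0.0, 10
W_BDI, W_MHI = 0.4, 0.6              # constraint: W_BDI + W_MHI = 1
THRESHOLD = 30.0                     # assumed value; not given in the paper
DELTA_TAU = (TAU_MAX - TAU_MIN) / N  # Eq. (4)

def update_temporal_images(frame, first_frame, prev_frame, prev_mhi):
    """One incremental update of BDI, MHI, and WSI (Eqs. 1-3 and 5).
    All inputs are assumed to be grayscale float32 arrays of equal shape."""
    bdi = (frame - first_frame > THRESHOLD).astype(np.float32)  # Eq. (1)
    delta = (frame - prev_frame > THRESHOLD)                    # Eq. (2)
    mhi = np.where(delta, TAU_MAX,
                   np.maximum(0.0, prev_mhi - DELTA_TAU))       # Eq. (3)
    wsi = W_BDI * TAU_MAX * bdi + W_MHI * mhi                   # Eq. (5)
    return bdi, mhi, wsi
```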
3.2 CNN Architecture
According to the desired objectives, a variety of CNN architectures can be devised. To
keep the model simple, a light CNN architecture is developed for human action rec-
ognition on the ICVL action dataset. This model is shown in Fig. 3.
This model consists of 2 convolutional layers (C1 and C3), 2 subsampling layers
(S2 and S4), and 2 fully connected layers (F5 and F6). The last fully connected layer
(F6) is fully connected to the action categories via softmax. The numbers of kernels in
the two convolutional layers are 4 and 32; the stride of the convolution is 1; the kernel
sizes are 5 × 5 and 7 × 7. Each subsampling layer uses 2 × 2 max pooling with a stride
of 2. The two fully connected layers together use 512 neurons.
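A sketch of this architecture in PyTorch is given below (an illustration under assumptions: the paper specifies the kernel counts and sizes but not the input resolution or how the 512 neurons are split, so a 64 × 64 single-channel input and a 512-unit F5 are assumed here):

```python
import torch.nn as nn

class LightActionCNN(nn.Module):
    """Sketch of the light CNN in Fig. 3: C1-S2-C3-S4-F5-F6 with softmax.
    Assumes a 64 x 64 single-channel temporal image as input."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=5, stride=1),   # C1: 4 kernels, 5 x 5
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # S2: 2 x 2 max pooling
            nn.Conv2d(4, 32, kernel_size=7, stride=1),  # C3: 32 kernels, 7 x 7
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # S4
        )
        # With a 64 x 64 input: 64 -> 60 -> 30 -> 24 -> 12, so 32*12*12 features.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, 512),               # F5 (512 units assumed)
            nn.ReLU(),
            nn.Linear(512, num_classes),                # F6 -> softmax scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One CNN per layer of the hierarchical structure (4, 3, and 2 classes):
action_net, motion_net, posture_net = (LightActionCNN(n) for n in (4, 3, 2))
```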
4 Experimental Results
For this section, experiments were performed to evaluate the proposed method on the
ICVL action dataset. The dataset consists of surveillance video recorded at Inha
University: 158 videos from 11 different indoor and outdoor cameras with a
resolution of 1280 × 640 at 20 fps. The durations of the videos are from 1 min to
6 min; each frame has 3 labels for the action, motion, and posture layers. Different
training and test data are used in the proposed method; statistics for the data used in the
experiments are provided in Table 1.
The performances of the 3 different layers are evaluated using frame-by-frame
metrics. The metric is calculated according to:
$$
P_d = 1 - \sum_{j=1}^{N_c} \sum_{i=1}^{N_{total}} \frac{N_j}{N_{total}}\,
I\!\left(y_i, \varphi_d(x_i)\right),
\qquad d \in \{A, M, P\}
\tag{6}
$$
Table 1. Number of videos in the training and test sets from the ICVL action dataset

Camera   C01  C02  C03  C04  C05  C06  C07  C08  C09  C10  C11  Tot.
Train     13   15   21   12   23    8   17   14   23    0    0   146
Test       1    1    2    1    1    1    1    1    1    1    1    12
Total     14   16   23   13   24    9   18   15   24    1    1   158
Fig. 3. Architecture of CNN
$$
I\!\left(y_i, \varphi_d(x_i)\right) =
\begin{cases}
0, & y_i = \varphi_d(x_i) \\
1, & \text{otherwise}
\end{cases}
\tag{7}
$$
where $P_A$, $P_M$, and $P_P$ are the precisions of the action layer, motion layer, and
posture layer, respectively. $N_c$ is the number of action classes in the corresponding
layer; $N_{total}$ is the number of frames in the evaluated videos; $N_j$ is the number of
frames belonging to class $j$. $I(y_i, \varphi_d(x_i))$ equals 0 if the label $y_i$ of frame $i$
and the CNN prediction $\varphi_d(x_i)$ for input frame $x_i$ are the same, and 1
otherwise; $i$ is the frame number.
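A small NumPy sketch of this frame-by-frame metric follows (an unweighted simplification: the per-class $N_j/N_{total}$ weighting of Eq. (6) is omitted, and the array names are assumptions):

```python
import numpy as np

def layer_precision(labels: np.ndarray, predictions: np.ndarray) -> float:
    """Frame-by-frame precision of one layer, following Eqs. (6)-(7):
    1 minus the fraction of frames whose predicted class differs from
    the ground-truth label. (The per-class weighting in Eq. (6) is
    omitted here for simplicity.)"""
    errors = (labels != predictions).astype(np.float64)  # I(y_i, phi_d(x_i))
    return 1.0 - errors.mean()

# Hypothetical example: posture layer over 6 frames (0 = sitting, 1 = standing)
y_true = np.array([1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 1, 1])
print(layer_precision(y_true, y_pred))  # -> 0.8333...
```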
The precisions of the 3 different layers are shown in Fig. 4. It can be seen that the
median precisions of the posture, motion, and action layers are 97.77 %, 85.99 %, and
71.29 %, respectively. The performance of the posture layer is very impressive; the
motion layer is quite stable; the performance of the action layer is somewhat weaker.
These results suggest that the appearance-based method is not good at representing
subtle actions such as texting and smoking, whose movements are small.
The confusion matrix for the ICVL action dataset is shown in Fig. 5. The figure
shows certain levels of confusion between running and walking, texting and nothing,
and smoking and nothing. Some possible explanations for these results are that they are
caused by the light architecture of the CNN and that there exists an imbalance in the
number of training samples. In the ICVL action dataset, there are many sets of sample
data for standing, walking, nothing, and texting; however, there are only a few samples
for other types of action. For example, the dataset has 18,504 training samples for
nothing, but just 1,106 for smoking: a more than sixteen-fold difference. However, the
multiple layers of the hierarchical action structure can eliminate many
misclassifications.
Fig. 4. Precisions of 3 different layers

Figure 6 shows certain actions that were correctly recognized and certain actions
that were misclassified. Detected bounding boxes, trajectories of objects, object IDs,
and 3-layer action results are shown in Fig. 6. The object ID and the trajectory were
updated by the tracker whenever the detection method failed to detect the object.
The top row shows actions that were correctly recognized by the proposed model; the
bottom row shows those that were misclassified by the model.
In order to evaluate the processing time, the experimental environment was
established on a computer with an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz and
8 GB of RAM (two 4 GB modules). The input video was resized to 640 × 360 from the
original 1280 × 640. The processing time was measured on 12 videos and reported as
the average. The average processing time for one frame is 46.9319 ms: GMM takes
11.3046 ms, HOG takes 12.6045 ms, the 3 temporal images take 1.2569 ms, the
3 CNNs take 3.6577 ms per human, and the other processes (e.g. initialization,
Kalman filtering, post-processing, and displaying of results, etc.) take 18.2083 ms. As
can be seen above, the developed algorithm runs at more than 20 frames per second.

Fig. 5. Confusion matrix for classification results on the ICVL action dataset

Fig. 6. Correctly recognized and misclassified results
5 Conclusions
This paper has proposed a real-time human action recognition approach that does not
use any assumptions about the circumstances of the video in question. The developed
approach constructs temporal images from several static images using BDI, MHI, and
WSI. Temporal images are very simple and fast; they are quite suitable for use in static
surveillance cameras, without compromising processing time or precision.
Using the proposed hierarchical action structure, the model can employ a limited
number of actions to represent many different kinds of action with a large degree of
freedom. Further, this structure makes it easy to add new actions simply by re-training
the corresponding layer, and it can effectively reduce misclassifications.
In this paper, a light CNN architecture is considered for action recognition. There
are other deep architectures, such as Recurrent Neural Networks and Deep Belief
Networks, which have achieved promising performance for speech recognition and
image recognition. It would be interesting to employ such models for action recog-
nition or to make the current CNN deeper and more complicated.
Acknowledgements. This research was funded by the MSIP (Ministry of Science, ICT &
Future Planning), Korea in the ICT R&D Program 2015 (Project ID: 1391203002-130010200).
References
1. Oh, S., Hoogs, A., Perera, A., et al.: A large-scale benchmark dataset for event recognition in
surveillance video. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Providence, Rhode Island, pp. 3153–3160 (2011)
2. Vahdat, A., Gao, B., Ranjbar, M., et al.: A discriminative key pose sequence model for
recognizing human interactions. In: 2011 IEEE International Conference on Computer
Vision Workshops (ICCV Workshops), Barcelona, pp. 1729–1736 (2011)
3. Lan, T., Wang, Y., Yang, W., et al.: Discriminative latent models for recognizing contextual
group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34(8), 1549–1562 (2012)
4. Kim, I., Oh, S., Vahdat, A., et al.: Segmental multi-way local pooling for video recognition.
In: Proceedings of the 21st ACM International Conference on Multimedia, MM 2013,
New York, pp. 637–640 (2013)
5. Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera
views. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Providence, Rhode Island, pp. 2847–2854 (2012)
6. Ryoo, M.S., Matthies, L.: First-person activity recognition: what are they doing to me? In:
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland,
Oregon, pp. 2730–2737 (2013)
7. Davis, J.W., Bobick, A.F.: The representation and recognition of human movement using
temporal templates. In: 1997 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, San Juan, pp. 928–934 (1997)
8. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates.
IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
9. Blank, M., Gorelick, L., Schechtman, E., et al.: Actions as space-time shapes. In: 2005
Tenth IEEE International Conference on Computer Vision (ICCV), Beijing, pp. 1395–1402
(2005)
10. Tang, K., Fei-Fei, L., Koller, D.: Learning latent temporal structure for complex event
detection. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Providence, Rhode Island, pp. 1250–1257 (2012)
11. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: 2013 IEEE
International Conference on Computer Vision (ICCV), Sydney, New South Wales,
pp. 3551–3558 (2013)
12. Jiang, Z., Lin, Z., Davis, L.S.: A unified tree-based framework for joint action localization,
recognition and segmentation. Comput. Vis. Image Underst. 117(10), 1345–1355 (2013)
13. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale,
deformable part model. In: 2008 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Anchorage, Alaska, pp. 1–8 (2008)
14. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: 2009 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Miami, Florida, pp. 2929–2936 (2009)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
16. Ji, S., Xu, W., Yang, M., et al.: 3D convolutional neural networks for human action
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
17. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In:
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus,
Ohio, pp. 1653–1660 (2014)
18. Sun, L., Jia, K., Chan, T., et al.: DL-SFA: deeply-learned slow feature analysis for action
recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Columbus, Ohio, pp. 2625–2632 (2014)
19. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, California,
pp. 886–893 (2005)