Real-Time Human Action Recognition
Using CNN Over Temporal Images for Static
Video Surveillance Cameras
Cheng-Bin Jin, Shengzhe Li, Trung Dung Do, and Hakil Kim
Information and Communication Engineering, Inha University, Incheon, Korea
{sbkim,szli,dotrungdung}@vision.inha.ac.kr,
hikim@inha.ac.kr
Abstract. This paper proposes a real-time human action recognition approach
to static video surveillance systems. This approach predicts human actions using
temporal images and convolutional neural networks (CNN). CNN is a type of
deep learning model that can automatically learn features from training videos.
Although the state-of-the-art methods have shown high accuracy, they consume
a lot of computational resources. Another problem is that many methods assume
exact knowledge of human positions. Moreover, most of the current
methods build complex handcrafted features for specific classifiers. Therefore,
these kinds of methods are difficult to apply in real-world applications. In this
paper, a novel CNN model based on temporal images and a hierarchical action
structure is developed for real-time human action recognition. The hierarchical
action structure includes three levels: action layer, motion layer, and posture
layer. The top layer represents subtle actions; the bottom layer represents pos-
ture. Each layer contains one CNN, which means that this model has three
CNNs working together; layers are combined to represent many different kinds
of action with a large degree of freedom. The developed approach was imple-
mented and achieved superior performance for the ICVL action dataset; the
algorithm can run at around 20 frames per second.
Keywords: Video surveillance · Action recognition · Temporal images ·
Convolutional neural network · Hierarchical action structure
1 Introduction
The ability of a computer to recognize human actions can be important in many
real-world applications including intelligent video surveillance, kinematic analysis,
video retrieval, and criminal investigation. Based on the types of input video, action
recognition can be divided into four classes: surveillance videos, sport videos, movies
and user videos, and first-person videos. Different types of videos have different
characteristics: surveillance video [1,2] usually uses a static camera that records from a
side or top view. Therefore, the background of surveillance video is relatively simple,
and the research objects of surveillance are people or cars. Currently, millions of
surveillance cameras are in place throughout the world. This means that more than
800 K video hours are generated per day. The objective of action recognition in the surveillance
field is to understand the video. It is necessary to have a program that can auto-
matically label human events. The viewpoint of sport video [3] is the same as that of
surveillance video. However, the objects in sport videos are usually fast-moving
people. Sport videos need to be segmented manually before performing
post-processing. Movies and user videos [4] are recorded with moving cameras; the
view is almost always from the front or the side. The problems of jittery video and a
dynamic and complicated background make this kind of video more difficult to process
than the previous ones. First-person videos [5,6] have become popular since Google
launched Google Glass. Because this technology employs a moving camera, videos
obtained using Google Glass are very dynamic.
However, accurate action recognition is a very challenging task due to large
variations in appearance. For example, occlusions, non-rigid motion, scale variation,
view-point changes, subtle action, and clothing colors similar to the background color
are all important problems. Manual collection of training samples is another difficult
task; it requires much human effort and is time consuming. A further challenge
is to create an approach that can process video in real time and be applied in
real-world environments. The number of approaches to recognizing human
action in video has grown at a tremendous rate. Prior approaches can be divided into
appearance-based methods, motion-based methods, space-time-based methods, and
deep learning-based methods, the last of which has recently become a hot topic.
Motion history images (MHI) or temporal images [7–9] make up the most popular
appearance-based method. The advantages of the MHI method are that it is simple, fast,
and works very well in controlled environments. However, MHI is sensitive to errors
in background subtraction. The fatal flaw of MHI is that it cannot capture interior
motions and shapes. MHI can only capture silhouettes, but silhouettes tell little about
actions. Other appearance-based methods are the active shape model, the learned
dynamic prior model, and the motion prior model.
Motion-based methods [10] (generic and parametric optical flow, and temporal
activity models) enable an analysis of the temporal information of sub-actions. These
methods can also be used to model long-term activities with variable structures of
action. However, important questions remain, such as how many sub-action units are
meaningful and how it is possible to find these sub-action units for a target activity.
Both of these are open problems that need to be solved.
Space-time methods are based on handcrafted features. They were proposed to
handle complex dynamic scenes. There are many different descriptors (e.g., HOG, HOF,
Cuboids, HOG3D, and extended SURF) and detectors (Harris3D, Cuboids, Hessian,
and regular dense sampling) [11,12]. In order to improve performance, spatio-temporal
grids [13] and analysis of the co-occurrence between action and scene [14] have been considered.
The spatio-temporal grid method divides regions of interest into many
areas and then linearly combines the descriptors extracted from the different areas.
Co-occurrence is a method that considers the relationship between action and scene; it
assigns a certain weight to the classified results in order to update the result.
The convolutional neural network (CNN) model [15] is a deep learning
model. It is a class of supervised machine learning algorithms that can learn a hierarchy
of features by building high-level features from low-level ones [16]. Some researchers
have started to use CNNs to recognize human actions [17,18]. However, it will be
necessary to determine what a good CNN architecture is. This is a question that is still
difficult to answer, and a problem that will require further research.
The key contributions of this paper can be summarized as follows:
– This paper proposes a novel model for human detection, human tracking, and
recognition of actions in real time. The model does not make any assumptions (e.g.,
ground truth of the human region, small scale, or viewpoint changes) about the
circumstances.
– A hierarchical action structure is described for real-time human action recognition. In
this structure, three CNNs work together; the layers are combined to represent many
different kinds of action with a large degree of freedom.
– Different temporal images are used in the 3 layers of the hierarchical action
structure. Experimental results show that these kinds of temporal images are well
suited to video surveillance systems in terms of both accuracy and processing time.
The rest of this paper is organized as follows: a definition of the hierarchical action
structure is provided in Sect. 2. The different temporal images in the 3 layers, and the CNN
architecture, are described in Sect. 3. The experimental results for the ICVL (Inha
Computer Vision Lab) action dataset are reported in Sect. 4. Section 5 provides the
conclusions of the paper.
2 Hierarchical Action Structure
The hierarchical action structure developed for real-time human action recognition is
shown in Fig. 1. The structure includes three layers: the action layer, motion layer, and
posture layer. The action layer has four classes. These are nothing, texting, smoking,
and others. The motion layer has classes of stationary, walking, and running. The
posture layer has classes of sitting and standing. There are certain types of common
information between the posture layer and the motion layer. For the action layer, the
posture layer provides supplementary information. The 3 layers together deliver a
complete set of information for human actions.
Fig. 1. Hierarchical action structure
The advantage of this structure is that it uses 9 action categories to represent
various combinations of actions. To add a new action, it is not necessary to revise the
overall structure; rather, only the corresponding layer needs to be revised and
re-trained.
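To make the combinations concrete, the following minimal Python sketch (not from the paper; the function name and label format are hypothetical) shows how independent predictions from the three layers could be merged into one composite action description.

```python
# Hypothetical sketch: combining per-layer CNN outputs into a composite label.
# The class lists follow the hierarchy described above (Fig. 1).

ACTION_CLASSES = ["nothing", "texting", "smoking", "others"]
MOTION_CLASSES = ["stationary", "walking", "running"]
POSTURE_CLASSES = ["sitting", "standing"]

def combine_layers(posture_idx: int, motion_idx: int, action_idx: int) -> str:
    """Merge the three per-layer predictions into one description,
    e.g. 'standing / walking / texting'."""
    return " / ".join([
        POSTURE_CLASSES[posture_idx],
        MOTION_CLASSES[motion_idx],
        ACTION_CLASSES[action_idx],
    ])

# Example: a standing person who is walking while texting.
print(combine_layers(posture_idx=1, motion_idx=1, action_idx=1))
```

Because the 4 + 3 + 2 = 9 categories are combined rather than enumerated jointly, adding a new class touches only one list and one CNN.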
3 Human Action Recognition Using CNN
The objective of the paper is to propose a real-time algorithm for recognizing human
action in surveillance video; at the same time, the method should not employ any
assumptions about the video. In order to process video in real-time, human detection is
a precondition of action recognition: this is also a big challenge that is still the subject
of much research. First, the approach delineated in this paper performs motion
detection using a Gaussian Mixture Model (GMM); after this, the system detects
humans in the motion regions using Histograms of Oriented Gradients (HOG) [19]. To
increase system speed, a tracking-by-detection technique is employed in the developed
algorithm. Occluded and difficult-to-detect humans (those detected in previous frames
but lost in the current frame) are handled using the Kalman filter. The algorithm flow
chart is shown in Fig. 2.
Fig. 2. Algorithm ow chart
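As a rough illustration of the detection front-end in Fig. 2, the OpenCV sketch below chains GMM background subtraction with HOG person detection. It is not the authors' implementation: the input file name, parameter values, and motion-pixel threshold are assumptions, and the tracking-by-detection and Kalman-filter stages are omitted.

```python
# Illustrative sketch of GMM motion detection followed by HOG human detection.
import cv2

bg_subtractor = cv2.createBackgroundSubtractorMOG2()      # GMM background model
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture("surveillance.avi")                # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (640, 360))                 # resolution used in Sect. 4
    fg_mask = bg_subtractor.apply(frame)                  # motion (foreground) mask
    if cv2.countNonZero(fg_mask) > 500:                   # assumed motion threshold
        # Detect people only when the scene contains motion.
        boxes, _ = hog.detectMultiScale(frame, winStride=(8, 8))
        for (x, y, w, h) in boxes:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```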
3.1 Temporal Images
In the training stage, manually cropped human images and action labels are used as
training data. Every layer of the structure has one independent CNN that requires
different temporal images. Binary Difference Images (BDI), Motion History Images
(MHI), and Weighted Summation Images (WSI) are used in the 3 layers. BDI is the
specific form of the temporal images. It is given by Eq. (1):

b(x, y, t) = \begin{cases} 1, & \text{if } f(x, y, t) - f(x, y, t_0) > \text{threshold} \\ 0, & \text{otherwise} \end{cases} \qquad (1)

As its name suggests, BDI is a binary image. Pixels in the image are set to 1 if
the difference from another image is bigger than the threshold. x and y are indices in
the image; f(x, y, t) is the current frame; f(x, y, t_0) is the first frame of the input video.
MHI is defined in Eqs. (2)–(4) [8]:

d(x, y, t) = \begin{cases} 1, & \text{if } f(x, y, t) - f(x, y, t-1) > \text{threshold} \\ 0, & \text{otherwise} \end{cases} \qquad (2)

h_\tau(x, y, t) = \begin{cases} \tau_{max}, & \text{if } d(x, y, t) = 1 \\ \max\big(0, h_\tau(x, y, t-1) - \Delta\tau\big), & \text{otherwise} \end{cases} \qquad (3)

\Delta\tau = \frac{\tau_{max} - \tau_{min}}{n} \qquad (4)

From Eq. (3), the MHI h_\tau(x, y, t) is generated from the difference between the
current frame and the previous frame f(x, y, t-1). For every frame, the MHI is calculated from the
previous MHI, so this value does not need to be recalculated over the whole set of frames.
In Eq. (4), n is the number of frames to be considered.
WSI is the weighted summation of BDI and MHI, given by Eq. (5), with the
constraint w_{BDI} + w_{MHI} = 1. The temporal images in this paper are constructed using
values of \tau_{max} = 255, \tau_{min} = 0, n = 10, w_{BDI} = 0.4, and w_{MHI} = 0.6.

s(x, y, t) = \mathbf{w}^{T} \begin{bmatrix} \tau_{max} \, b(x, y, t) \\ h_\tau(x, y, t) \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_{BDI} \\ w_{MHI} \end{bmatrix} \qquad (5)
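The following NumPy sketch follows Eqs. (1)–(5) with the parameter values quoted above (τ_max = 255, τ_min = 0, n = 10, w_BDI = 0.4, w_MHI = 0.6). It is an illustration, not the authors' code; in particular, the pixel-difference threshold is an assumption, since the paper does not report its value.

```python
# Sketch of the three temporal images (BDI, MHI, WSI) from Eqs. (1)-(5).
import numpy as np

TAU_MAX, TAU_MIN, N = 255.0, 0.0, 10
W_BDI, W_MHI = 0.4, 0.6                 # constraint: W_BDI + W_MHI = 1
THRESHOLD = 25.0                        # assumed; not reported in the paper
DELTA_TAU = (TAU_MAX - TAU_MIN) / N     # Eq. (4)

def bdi(frame, first_frame):
    """Eq. (1): binary difference against the first frame of the video."""
    diff = frame.astype(np.float32) - first_frame.astype(np.float32)
    return (diff > THRESHOLD).astype(np.float32)

def mhi(frame, prev_frame, prev_mhi):
    """Eqs. (2)-(3): recursive motion history image update."""
    d = (frame.astype(np.float32) - prev_frame.astype(np.float32)) > THRESHOLD  # Eq. (2)
    return np.where(d, TAU_MAX, np.maximum(0.0, prev_mhi - DELTA_TAU))          # Eq. (3)

def wsi(bdi_img, mhi_img):
    """Eq. (5): weighted summation of BDI and MHI."""
    return W_BDI * TAU_MAX * bdi_img + W_MHI * mhi_img
```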
3.2 CNN Architecture
According to the desired objectives, a variety of CNN architectures can be devised. To
keep the model simple, a light CNN architecture is developed for human action rec-
ognition on the ICVL action dataset. This model is shown in Fig. 3.
This model consists of 2 convolutional layers (C1 and C3), 2 subsampling layers
(S2 and S4), and 2 fully connected layers (F5 and F6). The last fully connected layer
(F6) is fully connected to the action categories via softmax. The numbers of kernels in
the two convolutional layers are 4 and 32; the stride of the convolutions is 1; the kernel
sizes are 5 × 5 and 7 × 7. The subsampling layers use 2 × 2 max pooling with a stride of 2.
The two fully connected layers together use 512 neurons.
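A PyTorch sketch of this light architecture is shown below. It is not the authors' implementation: the input size (a single-channel 64 × 64 temporal image), the ReLU non-linearities, and the exact split of the 512 fully connected neurons between F5 and F6 are assumptions that the paper does not specify.

```python
# Sketch of the light CNN: C1/S2/C3/S4 plus two fully connected layers and softmax.
import torch
import torch.nn as nn

class LightActionCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=5, stride=1),    # C1: 4 kernels, 5x5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),        # S2: 2x2 max pooling
            nn.Conv2d(4, 32, kernel_size=7, stride=1),    # C3: 32 kernels, 7x7
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),        # S4: 2x2 max pooling
        )
        # With a 64x64 input, the feature map here is 32 x 12 x 12.
        # F5 and F6 together use 512 neurons; the split below is an assumption.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, 512 - num_classes),   # F5
            nn.ReLU(inplace=True),
            nn.Linear(512 - num_classes, num_classes),    # F6
        )

    def forward(self, x):
        return torch.softmax(self.classifier(self.features(x)), dim=1)

# One independent CNN per layer of the hierarchy (action, motion, posture).
action_net, motion_net, posture_net = (LightActionCNN(n) for n in (4, 3, 2))
```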
4 Experimental Results
For this section, experiments were performed to evaluate the proposed method on the
ICVL action dataset. The dataset consists of surveillance video data recorded at Inha
University. It consists of 158 videos recorded by 11 different indoor and outdoor cameras
with a resolution of 1280 × 640 at 20 fps. The durations of the videos range from 1 min to
6 min; each frame has 3 labels, one each for the action, motion, and posture layers. Different
training and test data are used in the proposed method; statistics for the data used in the
experiments are provided in Table 1.
The performances of the 3 different layers are evaluated using frame-by-frame
metrics. The metric is calculated according to:
P_\delta = 1 - \sum_{j=1}^{N_c} \sum_{i=1}^{N_{total}} \frac{N_j}{N_{total}} \, I\big(y_i, \varphi_\delta(x_i)\big), \qquad \delta \in \{A, M, P\} \qquad (6)
Table 1. Number of videos in the training and test sets from the ICVL action dataset

Camera/Data  C01  C02  C03  C04  C05  C06  C07  C08  C09  C10  C11  Tot.
Training      13   15   21   12   23    8   17   14   23    0    0   146
Test           1    1    2    1    1    1    1    1    1    1    1    12
Total         14   16   23   13   24    9   18   15   24    1    1   158
Fig. 3. Architecture of CNN
I\big(y_i, \varphi_\delta(x_i)\big) = \begin{cases} 0, & y_i = \varphi_\delta(x_i) \\ 1, & \text{otherwise} \end{cases} \qquad (7)

where P_A, P_M, and P_P are the precisions of the action layer, motion layer, and
posture layer, respectively. N_c is the number of action classes in the corresponding
layer; N_{total} is the number of frames in the evaluated videos; N_j is the number of
frames belonging to class j. I(y_i, \varphi_\delta(x_i)) takes the value 0 if the label y_i of
frame i and the prediction \varphi_\delta(x_i) are the same, and 1 otherwise. \varphi_\delta(x_i)
represents the result of the corresponding CNN for the input frame x_i; i is the frame number.
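Below is a hedged Python sketch of this frame-by-frame evaluation. Because Eq. (6) is reproduced from an imperfect extraction, the sketch simply reports the fraction of correctly labelled frames for one layer, using the indicator of Eq. (7); this is how the per-layer precisions are interpreted here.

```python
# Sketch: per-layer frame-by-frame precision (one interpretation of Eqs. (6)-(7)).
import numpy as np

def frame_precision(labels, predictions):
    """labels, predictions: per-frame class indices for one layer (A, M, or P)."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    indicator = (labels != predictions).astype(np.float64)   # Eq. (7): 1 if misclassified
    return 1.0 - indicator.mean()                             # 1 - misclassification rate

# Hypothetical posture-layer example (0 = sitting, 1 = standing).
y_true = [1, 1, 1, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
print(frame_precision(y_true, y_pred))   # 0.875
```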
The precisions of the 3 different layers are shown in Fig. 4. It can be seen that the
median precisions of the posture, motion, and action layers are 97.77 %, 85.99 %, and
71.29 %, respectively. The performance of the posture layer is very impressive; the
motion layer is quite stable; the performance of the action layer is slightly weak. These
results demonstrate that the appearance-based method is not good at representing subtle
actions such as texting and smoking, whose movements are small.
The confusion matrix for the ICVL action dataset is shown in Fig. 5. The figure
shows certain levels of confusion between running and walking, texting and nothing,
and smoking and nothing. Possible explanations for these results are the light
architecture of the CNN and the imbalance in the number of training samples. In the
ICVL action dataset, there are many samples for standing, walking, nothing, and
texting; however, there are only a few samples for the other types of action. For
example, the dataset has 18,504 training samples for nothing, but just 1,106 for
smoking, a difference of more than sixteen-fold. However, the multiple layers of the
hierarchical action structure can eliminate many misclassifications.
Figure 6 shows certain actions that were correctly recognized and certain actions
that were misclassified. Detected bounding boxes, trajectories of objects, object IDs,
Fig. 4. Precisions of 3 different layers
and 3-layer action results are shown in Fig. 6. The object ID and the trajectory were
updated by tracking whenever the detection method failed to detect the object.
The top row shows actions that were correctly recognized by the proposed model; the
bottom row shows those that were misclassified by the model.
In order to provide an evaluation of the processing time, the experimental
environment was established on a computer with an Intel(R) Core(TM) i7-3770 CPU @
3.40 GHz and two 4 GB RAM modules. The input video was resized to 640 × 360 from the
original 1280 × 640. The processing time was tested on 12 videos and reported as the
average. The average processing time for one frame is 46.9319 ms:
GMM takes 11.3046 ms, HOG takes 12.6045 ms, the 3 temporal images take 1.2569 ms,
the 3 CNNs take 3.6577 ms per human, and the other processes (e.g., initialization,
Kalman filtering, post-processing, and displaying of results) take 18.2083 ms. As
can be seen above, the developed algorithm runs at more than 20 frames per second.
Fig. 5. Confusion matrix for classification results on the ICVL action dataset
Fig. 6. Correctly recognized and misclassified results
5 Conclusions
This paper has proposed a real-time human action recognition approach that does not
use any assumptions about the circumstances of the video in question. The developed
approach constructs temporal images from several static images using BDI, MHI, and
WSI. Temporal images are very simple and fast to compute, and they are well suited to
fixed surveillance video cameras in terms of both precision and processing time.
Using the proposed hierarchical action structure, the model can employ a limited
number of action classes to represent many different kinds of action with a large degree
of freedom. Further, this structure makes it easy to add new actions simply by
re-training the corresponding layer, and it can effectively reduce misclassifications.
In this paper, a light CNN architecture is considered for action recognition. There
are other deep architectures, such as Recurrent Neural Networks and Deep Belief
Networks, which have achieved promising performance for speech recognition and
image recognition. It would be interesting to employ such models for action recog-
nition or to make the current CNN deeper and more complicated.
Acknowledgements. This research was funded by the MSIP (Ministry of Science, ICT &
Future Planning), Korea in the ICT R&D Program 2015 (Project ID: 1391203002-130010200).
References
1. Oh, S., Hoogs, A., Perera, A., et al.: A large-scale benchmark dataset for event recognition in surveillance video. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, Rhode Island, pp. 3153–3160 (2011)
2. Vahdat, A., Gao, B., Ranjbar, M., et al.: A discriminative key pose sequence model for recognizing human interactions. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, pp. 1729–1736 (2011)
3. Lan, T., Wang, Y., Yang, W., et al.: Discriminative latent models for recognizing contextual group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34(8), 1549–1562 (2011)
4. Kim, I., Oh, S., Vahdat, A., et al.: Segmental multi-way local pooling for video recognition. In: Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, New York, pp. 637–640 (2013)
5. Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera views. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, Rhode Island, pp. 2847–2854 (2012)
6. Ryoo, M.S., Matthies, L.: First-person activity recognition: what are they doing to me? In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, Oregon, pp. 2730–2737 (2013)
7. Davis, J.W., Bobick, A.F.: The representation and recognition of human movement using temporal templates. In: 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, pp. 928–934 (1997)
8. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
9. Blank, M., Gorelick, L., Schechtman, E., et al.: Actions as space-time shapes. In: 2005 Tenth IEEE International Conference on Computer Vision (ICCV), Beijing, pp. 1395–1402 (2005)
10. Tang, K., Fei-Fei, L., Koller, D.: Learning latent temporal structure for complex event detection. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, Rhode Island, pp. 1250–1257 (2012)
11. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, New South Wales, pp. 3551–3558 (2013)
12. Jiang, Z., Lin, Z., Davis, L.S.: A unified tree-based framework for joint action localization, recognition and segmentation. Comput. Vis. Image Underst. 117(10), 1345–1355 (2013)
13. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, pp. 1–8 (2008)
14. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida, pp. 2929–2936 (2009)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
16. Ji, S., Xu, W., Yang, M., et al.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
17. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, pp. 1653–1660 (2014)
18. Sun, L., Jia, K., Chan, T., et al.: DL-SFA: deeply-learned slow feature analysis for action recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, pp. 2625–2632 (2014)
19. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, California, pp. 886–893 (2005)
