Real-Time Human Action Recognition
Using CNN Over Temporal Images for Static
Video Surveillance Cameras
Cheng-Bin Jin, Shengzhe Li, Trung Dung Do, and Hakil Kim
Information and Communication Engineering, Inha University, Incheon, Korea
{sbkim,szli,dotrungdung}@vision.inha.ac.kr, hikim@inha.ac.kr
Abstract. This paper proposes a real-time human action recognition approach
to static video surveillance systems. This approach predicts human actions using
temporal images and convolutional neural networks (CNN). CNN is a type of
deep learning model that can automatically learn features from training videos.
Although the state-of-the-art methods have shown high accuracy, they consume
a lot of computational resources. Another problem is that many methods assume
exact knowledge of human positions. Moreover, most of the current
methods build complex handcrafted features for specific classifiers. Therefore,
these kinds of methods are difficult to apply in real-world applications. In this
paper, a novel CNN model based on temporal images and a hierarchical action
structure is developed for real-time human action recognition. The hierarchical
action structure includes three levels: action layer, motion layer, and posture
layer. The top layer represents subtle actions; the bottom layer represents pos-
ture. Each layer contains one CNN, which means that this model has three
CNNs working together; layers are combined to represent many different kinds
of action with a large degree of freedom. The developed approach was imple-
mented and achieved superior performance on the ICVL action dataset; the
algorithm can run at around 20 frames per second.
Keywords: Video surveillance · Action recognition · Temporal images ·
Convolutional neural network · Hierarchical action structure
1 Introduction
The ability of a computer to recognize human actions can be important in many
real-word applications including intelligent video surveillance, kinematic analysis,
video retrieval, and criminal investigation. Based on the types of input video, action
recognition can be divided into four classes: surveillance videos, sport videos, movies
and user videos, and first-person videos. Different types of videos have different char-
acteristics: Surveillance video [1,2] usually uses a static camera that records from a
side or top view. Therefore, the background of surveillance video is relatively simple,
and research objects of surveillance are people or cars. Currently, millions of sur-
veillance cameras are in place throughout the world. This means that more than 800 K
video hours are generated per day. The objective of action recognition in the surveillance
field is the understanding of video content; it is necessary to have a program that can auto-
matically label human events. The viewpoint of sport video [3] is the same as that of
surveillance video. However, the objects in sport videos are usually fast-moving
people. Sport videos need to be segmented manually before performing
post-processing. Movies and user videos [4] are recorded with moving cameras; the
view is almost always from the front or the side. The problems of jittery video and a
dynamic and complicated background make this kind of video more difficult to process
than the previous ones. First-person videos [5, 6] have become popular since Google
launched Google Glass. Because this technology employs a moving camera, videos
obtained using Google Glass are very dynamic.
However, accurate action recognition is a very challenging task due to large
variations in appearance. For example, occlusions, non-rigid motion, scale variation,
view-point changes, subtle action, and clothing colors similar to the background color
are all important problems. Manual collection of training samples is another difficult
task. It requires much human effort and is time consuming. The other challenging task
is to create an approach that can process video in real-time and that can be applied in
real-world environments. The number of approaches to recognizing human
action in video has grown at a tremendous rate. Prior approaches can be divided into
appearance-based methods, motion-based methods, space-time based methods, and
deep learning-based methods, the last of which has become a hot topic recently.
Motion history images (MHI) or temporal images [7–9] make up the most popular
appearance-based method. The advantages of the MHI method are that it is simple, fast,
and it works very well in controlled environments. However, MHI is sensitive to errors
of background subtraction. The fatal flaw of MHI is that it cannot capture interior
motions and shapes. MHI can only capture silhouettes, but silhouettes tell little about
actions. Other appearance-based methods are the active shape model, the learned
dynamic prior model, and the motion prior model.
Motion-based methods [10] (generic and parametric optical flow, and temporal
activity models) enable an analysis of the temporal information of sub-actions. These
methods can also be used to model long-term activities with variable structures of
action. However, important questions remain, such as how many sub-action units are
meaningful, and how can these sub-action units be found for a target activity?
Both of these are open problems that need to be solved.
Space-time methods are based on handcrafted features; they were proposed to
handle complex dynamic scenes. There are many different descriptors (e.g. HOG, HOF,
Cuboids, HOG3D, and extended SURF) and detectors (Harris3D, Cuboids, Hessian,
and regular dense sampling) [11,12]. In order to improve performance, spatio-temporal
grids [13] and analysis of co-occurrence between action and scene [14] are considered.
The spatio-temporal grid method divides regions of interest into many areas and then
linearly combines the descriptors extracted from the different areas. Co-occurrence is a
method that considers the relationship between action and scene; it assigns weights to
the classified results in order to update the final result.
The convolutional neural network (CNN) model [15] is a type of deep learning
model. It is a class of supervised machine learning algorithms that can learn a hierarchy
of features by building high-level features from low-level ones [16]. Some researchers
have started to use CNN to recognize human actions [17,18]. However, it will be
necessary to determine what a good CNN architecture is. This is a question that is still
difficult to answer, and a problem that will require further research.
The key contributions of this paper can be summarized as follows:
• This paper proposes a novel model for human detection, human tracking, and
recognition of actions in real-time. The model does not make any assumptions (e.g.,
ground truth of human region, small scale, or viewpoint changes) about the
circumstances.
• A hierarchical action structure is described for real-time human action recognition.
In this structure, three CNNs work together; layers are combined to represent many
different kinds of action with a large degree of freedom.
• Different temporal images are used in the 3 layers of the hierarchical action
structure. Experimental results show that these kinds of temporal images are very
suitable for use in video surveillance systems, without compromising accuracy or
processing time.
The rest of this paper is organized as follows: a definition of the hierarchical action
structure is provided in Sect. 2. Different temporal images in 3 layers, and the CNN
architecture, are described in Sect. 3. The experimental results for the ICVL (Inha
Computer Vision Lab) action dataset are reported in Sect. 4. Section 5 provides
conclusions of the paper.
2 Hierarchical Action Structure
The hierarchical action structure developed for real-time human action recognition is
shown in Fig. 1. The structure includes three layers: the action layer, motion layer, and
posture layer. The action layer has four classes: nothing, texting, smoking, and
others. The motion layer has classes of stationary, walking, and running. The
posture layer has classes of sitting and standing. There are certain types of common
information between the posture layer and the motion layer. For the action layer, the
posture layer provides supplementary information. The 3 layers together deliver a
complete set of information for human actions.
Fig. 1. Hierarchical action structure
The advantage of this structure is that it uses 9 action categories to represent
various action combinations. To add a new action, revising the overall structure is not
necessary; rather, it is possible to revise only the corresponding layer and re-train that
layer, as sketched below.
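To make the structure concrete, the following minimal sketch (an illustration, not code from the paper; the combine helper and its output format are assumptions) shows how the three per-layer predictions compose, and why adding a class touches only one layer:

```python
# Minimal sketch of the hierarchical action structure (Fig. 1). The class
# lists come from the paper; combine() and its output format are assumptions.
ACTION_CLASSES = ["nothing", "texting", "smoking", "others"]  # action layer
MOTION_CLASSES = ["stationary", "walking", "running"]         # motion layer
POSTURE_CLASSES = ["sitting", "standing"]                     # posture layer

def combine(action_id: int, motion_id: int, posture_id: int) -> str:
    """Merge the three per-layer CNN predictions into one description.
    9 classes (4 + 3 + 2) cover 4 * 3 * 2 = 24 action combinations."""
    return " / ".join([POSTURE_CLASSES[posture_id],
                       MOTION_CLASSES[motion_id],
                       ACTION_CLASSES[action_id]])

# Adding a new action (e.g. a hypothetical "phoning") only extends
# ACTION_CLASSES and re-trains the action-layer CNN; the other two
# layers are untouched.
print(combine(1, 0, 1))  # -> "standing / stationary / texting"
```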
3 Human Action Recognition Using CNN
The objective of the paper is to propose a real-time algorithm for recognizing human
action in surveillance video; at the same time, the method should not employ any
assumptions about the video. In order to process video in real-time, human detection is
a precondition of action recognition: this is also a big challenge that is still the subject
of much research. First, the approach delineated in this paper performs motion
detection using the Gaussian Mixture Model (GMM); after this, the system detects
humans in the motion regions using Histograms of Oriented Gradients (HOG) [19]. To increase
system speed, a tracking-by-detection technique is employed in the developed algorithm.
Occluded and difficult-to-detect humans (those detected in previous frames but lost in
the current frame) are tracked using a Kalman filter. The algorithm flow chart is
shown in Fig. 2.
Fig. 2. Algorithm flow chart
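As an illustration of this front end, the following is a minimal sketch using OpenCV's stock GMM background subtractor and HOG people detector (the parameter values, input file name, and motion-gating logic are assumptions; the paper does not specify them):

```python
import cv2

# Sketch of the pipeline in Fig. 2: GMM motion detection -> HOG human
# detection inside motion regions -> Kalman-filter tracking.
gmm = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture("surveillance.avi")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (640, 360))   # resized as in Sect. 4
    fg_mask = gmm.apply(frame)              # GMM foreground (motion) mask
    if cv2.countNonZero(fg_mask) > 0:       # run HOG only when motion exists
        boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
        # boxes would next be matched to existing tracks; a Kalman filter
        # (cv2.KalmanFilter) predicts positions of humans detected in
        # previous frames but lost in the current one.
cap.release()
```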
3.1 Temporal Images
In the training stage, manually cropped human images and action labels are used as
training data. Every layer of the structure has one independent CNN that requires
different temporal images. Binary Difference Images (BDI), Motion History Images
(MHI), and Weighted Summation Images (WSI) are used in the 3 layers. BDI is a
specific form of temporal image, given by Eq. (1):
$$
b(x,y,t) =
\begin{cases}
1, & \text{if } f(x,y,t) - f(x,y,t_0) > \text{threshold} \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$
As its name suggests, BDI is a binary image: pixels are set to 1 if the difference
from the reference image is larger than the threshold. $x$ and $y$ are indexes in the
image; $f(x,y,t)$ is the current frame; $f(x,y,t_0)$ is the first frame of the input video.
MHI is defined in Eqs. (2)–(4) [8]:
$$
\delta(x,y,t) =
\begin{cases}
1, & \text{if } f(x,y,t) - f(x,y,t-1) > \text{threshold} \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$

$$
h_{\tau}(x,y,t) =
\begin{cases}
\tau_{max}, & \text{if } \delta(x,y,t) = 1 \\
\max\left(0,\; h_{\tau}(x,y,t-1) - \Delta\tau\right), & \text{otherwise}
\end{cases}
\tag{3}
$$

$$
\Delta\tau = \frac{\tau_{max} - \tau_{min}}{n}
\tag{4}
$$
From Eq. (3), $h_{\tau}(x,y,t)$ (the MHI) is generated from the difference between the
current frame and the previous frame $f(x,y,t-1)$. For every frame, the MHI is updated
from the previous MHI; therefore, it does not need to be recalculated over the whole
set of frames. In Eq. (4), $n$ is the number of frames to be considered. WSI is the
weighted summation of BDI and MHI, given by Eq. (5), subject to the constraint
$w_{BDI} + w_{MHI} = 1$. The temporal images in this paper are constructed using
$\tau_{max} = 255$, $\tau_{min} = 0$, $n = 10$, $w_{BDI} = 0.4$, and $w_{MHI} = 0.6$.
$$
s(x,y,t) = \mathbf{w}^{T}
\begin{bmatrix}
\tau_{max} \cdot b(x,y,t) \\
h_{\tau}(x,y,t)
\end{bmatrix},
\qquad
\mathbf{w} =
\begin{bmatrix}
w_{BDI} \\
w_{MHI}
\end{bmatrix}
\tag{5}
$$
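For concreteness, here is a minimal NumPy sketch of one incremental update of the three temporal images, using the parameter values above (the threshold value and the assumption of grayscale float frames are not specified in the paper):

```python
import numpy as np

TAU_MAX, TAU_MIN, N = 255.0, 0.0, 10
W_BDI, W_MHI = 0.4, 0.6              # constraint: W_BDI + W_MHI = 1
THRESHOLD = 30.0                     # assumed value; not given in the paper
DELTA_TAU = (TAU_MAX - TAU_MIN) / N  # Eq. (4)

def update_temporal_images(frame, first_frame, prev_frame, prev_mhi):
    """One incremental update of BDI, MHI, and WSI (Eqs. 1-3 and 5).
    All inputs are assumed to be grayscale float32 arrays of equal shape."""
    bdi = (frame - first_frame > THRESHOLD).astype(np.float32)  # Eq. (1)
    delta = (frame - prev_frame > THRESHOLD)                    # Eq. (2)
    mhi = np.where(delta, TAU_MAX,
                   np.maximum(0.0, prev_mhi - DELTA_TAU))       # Eq. (3)
    wsi = W_BDI * TAU_MAX * bdi + W_MHI * mhi                   # Eq. (5)
    return bdi, mhi, wsi
```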
3.2 CNN Architecture
According to the desired objectives, a variety of CNN architectures can be devised. To
keep the model simple, a light CNN architecture is developed for human action rec-
ognition on the ICVL action dataset. This model is shown in Fig. 3.
This model consists of 2 convolutional layers (C1 and C3), 2 subsampling layers
(S2 and S4), and 2 fully connected layers (F5 and F6). The last fully connected layer
(F6) is fully connected to the action categories via softmax. The numbers of kernels in
the two convolutional layers are 4 and 32; the stride of the convolution is 1; the kernel
sizes are 5 × 5 and 7 × 7. Each subsampling layer uses 2 × 2 max pooling with a stride
of 2. The two fully connected layers together use 512 neurons.
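A sketch of this architecture in PyTorch is given below (an illustration under assumptions: the paper specifies the kernel counts and sizes but not the input resolution or how the 512 neurons are split, so a 64 × 64 single-channel input and a 512-unit F5 are assumed here):

```python
import torch.nn as nn

class LightActionCNN(nn.Module):
    """Sketch of the light CNN in Fig. 3: C1-S2-C3-S4-F5-F6 with softmax.
    Assumes a 64 x 64 single-channel temporal image as input."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=5, stride=1),   # C1: 4 kernels, 5 x 5
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # S2: 2 x 2 max pooling
            nn.Conv2d(4, 32, kernel_size=7, stride=1),  # C3: 32 kernels, 7 x 7
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # S4
        )
        # With a 64 x 64 input: 64 -> 60 -> 30 -> 24 -> 12, so 32*12*12 features.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, 512),               # F5 (512 units assumed)
            nn.ReLU(),
            nn.Linear(512, num_classes),                # F6 -> softmax scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One CNN per layer of the hierarchical structure (4, 3, and 2 classes):
action_net, motion_net, posture_net = (LightActionCNN(n) for n in (4, 3, 2))
```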
4 Experimental Results
For this section, experiments were performed to evaluate the proposed method on the
ICVL action dataset. The dataset consists of surveillance video recorded at Inha
University: 158 videos from 11 different indoor and outdoor cameras with a
resolution of 1280 × 640 at 20 fps. The durations of the videos are from 1 min to
6 min; each frame has 3 labels for the action, motion, and posture layers. Different
training and test data are used in the proposed method; statistics for the data used in the
experiments are provided in Table 1.
The performances of the 3 different layers are evaluated using frame-by-frame
metrics. The metric is calculated according to:
$$
P_d = 1 - \sum_{j=1}^{N_c} \sum_{i=1}^{N_{total}} \frac{N_j}{N_{total}}\,
I\!\left(y_i, \varphi_d(x_i)\right),
\qquad d \in \{A, M, P\}
\tag{6}
$$
Table 1. Number of videos in the training and test sets from the ICVL action dataset

Camera   C01  C02  C03  C04  C05  C06  C07  C08  C09  C10  C11  Tot.
Train     13   15   21   12   23    8   17   14   23    0    0   146
Test       1    1    2    1    1    1    1    1    1    1    1    12
Total     14   16   23   13   24    9   18   15   24    1    1   158
Fig. 3. Architecture of CNN
$$
I\!\left(y_i, \varphi_d(x_i)\right) =
\begin{cases}
0, & y_i = \varphi_d(x_i) \\
1, & \text{otherwise}
\end{cases}
\tag{7}
$$
where $P_A$, $P_M$, and $P_P$ are the precisions of the action layer, motion layer, and
posture layer, respectively. $N_c$ is the number of action classes in the corresponding
layer; $N_{total}$ is the number of frames in the evaluated videos; $N_j$ is the number of
frames belonging to class $j$. $I(y_i, \varphi_d(x_i))$ equals 0 if the label $y_i$ of frame $i$
and the CNN prediction $\varphi_d(x_i)$ for input frame $x_i$ are the same, and 1
otherwise; $i$ is the frame number.
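A small NumPy sketch of this frame-by-frame metric follows (an unweighted simplification: the per-class $N_j/N_{total}$ weighting of Eq. (6) is omitted, and the array names are assumptions):

```python
import numpy as np

def layer_precision(labels: np.ndarray, predictions: np.ndarray) -> float:
    """Frame-by-frame precision of one layer, following Eqs. (6)-(7):
    1 minus the fraction of frames whose predicted class differs from
    the ground-truth label. (The per-class weighting in Eq. (6) is
    omitted here for simplicity.)"""
    errors = (labels != predictions).astype(np.float64)  # I(y_i, phi_d(x_i))
    return 1.0 - errors.mean()

# Hypothetical example: posture layer over 6 frames (0 = sitting, 1 = standing)
y_true = np.array([1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 1, 1])
print(layer_precision(y_true, y_pred))  # -> 0.8333...
```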
The precisions of the 3 different layers are shown in Fig. 4. It can be seen that the
median precisions of the posture, motion, and action layers are 97.77 %, 85.99 %, and
71.29 %, respectively. The performance of the posture layer is very impressive; the
motion layer is quite stable; the performance of the action layer is somewhat weaker.
These results suggest that the appearance-based method is not good at representing
subtle actions such as texting and smoking, whose movements are small.
The confusion matrix for the ICVL action dataset is shown in Fig. 5. The figure
shows certain levels of confusion between running and walking, texting and nothing,
and smoking and nothing. Some possible explanations for these results are that they are
caused by the light architecture of the CNN and that there exists an imbalance in the
number of training samples. In the ICVL action dataset, there are many sets of sample
data for standing, walking, nothing, and texting; however, there are only a few samples
for other types of action. For example, the dataset has 18,504 training samples for
nothing, but just 1,106 for smoking: a more than sixteen-fold difference. However, the
multiple layers of the hierarchical action structure can eliminate many
misclassifications.
Fig. 4. Precisions of 3 different layers

Figure 6 shows certain actions that were correctly recognized and certain actions
that were misclassified. Detected bounding boxes, trajectories of objects, object IDs,
and 3-layer action results are shown in Fig. 6. The object ID and the trajectory were
updated by the tracker whenever the detection method failed to detect the object.
The top row shows actions that were correctly recognized by the proposed model; the
bottom row shows those that were misclassified by the model.
In order to evaluate the processing time, the experimental environment was
established on a computer with an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz and
8 GB of RAM (two 4 GB modules). The input video was resized to 640 × 360 from the
original 1280 × 640. The processing time was measured on 12 videos and reported as
the average. The average processing time for one frame is 46.9319 ms: GMM takes
11.3046 ms, HOG takes 12.6045 ms, the 3 temporal images take 1.2569 ms, the
3 CNNs take 3.6577 ms per human, and the other processes (e.g. initialization,
Kalman filtering, post-processing, and displaying of results, etc.) take 18.2083 ms. As
can be seen above, the developed algorithm runs at more than 20 frames per second.

Fig. 5. Confusion matrix for classification results on the ICVL action dataset

Fig. 6. Correctly recognized and misclassified results
5 Conclusions
This paper has proposed a real-time human action recognition approach that does not
use any assumptions about the circumstances of the video in question. The developed
approach constructs temporal images from several static images using BDI, MHI, and
WSI. Temporal images are very simple and fast; they are quite suitable for use in static
surveillance cameras, without compromising processing time or precision.
Using the proposed hierarchical action structure, the model can employ a limited
number of actions to represent many different kinds of action with a large degree of
freedom. Further, this structure makes it easy to add new actions simply by re-training
the corresponding layer, and it can effectively reduce misclassifications.
In this paper, a light CNN architecture is considered for action recognition. There
are other deep architectures, such as Recurrent Neural Networks and Deep Belief
Networks, which have achieved promising performance for speech recognition and
image recognition. It would be interesting to employ such models for action recog-
nition or to make the current CNN deeper and more complicated.
Acknowledgements. This research was funded by the MSIP (Ministry of Science, ICT &
Future Planning), Korea in the ICT R&D Program 2015 (Project ID: 1391203002-130010200).
References
1. Oh, S., Hoogs, A., Perera, A., et al.: A large-scale benchmark dataset for event recognition in
surveillance video. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Providence, Rhode Island, pp. 3153–3160 (2011)
2. Vahdat, A., Gao, B., Ranjbar, M., et al.: A discriminative key pose sequence model for
recognizing human interactions. In: 2011 IEEE International Conference on Computer
Vision Workshops (ICCV Workshops), Barcelona, pp. 1729–1736 (2011)
3. Lan, T., Wang, Y., Yang, W., et al.: Discriminative latent models for recognizing contextual
group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34(8), 1549–1562 (2012)
4. Kim, I., Oh, S., Vahdat, A., et al.: Segmental multi-way local pooling for video recognition.
In: Proceedings of the 21st ACM International Conference on Multimedia, MM 2013,
New York, pp. 637–640 (2013)
5. Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera
views. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Providence, Rhode Island, pp. 2847–2854 (2012)
6. Ryoo, M.S., Matthies, L.: First-person activity recognition: what are they doing to me? In:
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland,
Oregon, pp. 2730–2737 (2013)
7. Davis, J.W., Bobick, A.F.: The representation and recognition of human movement using
temporal templates. In: 1997 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, San Juan, pp. 928–934 (1997)
8. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates.
IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
9. Blank, M., Gorelick, L., Schechtman, E., et al.: Actions as space-time shapes. In: 2005
Tenth IEEE International Conference on Computer Vision (ICCV), Beijing, pp. 1395–1402
(2005)
10. Tang, K., Fei-Fei, L., Koller, D.: Learning latent temporal structure for complex event
detection. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Providence, Rhode Island, pp. 1250–1257 (2012)
11. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: 2013 IEEE
International Conference on Computer Vision (ICCV), Sydney, New South Wales,
pp. 3551–3558 (2013)
12. Jiang, Z., Lin, Z., Davis, L.S.: A unified tree-based framework for joint action localization,
recognition and segmentation. Comput. Vis. Image Underst. 117(10), 1345–1355 (2013)
13. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale,
deformable part model. In: 2008 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Anchorage, Alaska, pp. 1–8 (2008)
14. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: 2009 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Miami, Florida, pp. 2929–2936 (2009)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
16. Ji, S., Xu, W., Yang, M., et al.: 3D convolutional neural networks for human action
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
17. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In:
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus,
Ohio, pp. 1653–1660 (2014)
18. Sun, L., Jia, K., Chan, T., et al.: DL-SFA: deeply-learned slow feature analysis for action
recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Columbus, Ohio, pp. 2625–2632 (2014)
19. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, California,
pp. 886–893 (2005)