Manuscript received November 5, 2019
Manuscript revised November 20, 2019
Feature Fusion Based Human Action Recognition in Still Images
Abdul Sattar Chan1, Kashif Saleem2, Zuhaibuddin Bhutto3, Mudasar Latif Memon4, Murtaza Hussain
Shaikh5, Saleem Ahmed6, and Ahsan Raza Siyal7
1Electrical Engineering Department, Sukkur IBA University, Sukkur, Pakistan
2Telecommunication Engineering Department, Dawood University of Engineering & Technology, Karachi, Pakistan
3Department of Computer Systems Engineering, Balochistan University of Engineering & Technology, Pakistan
4IBA Community College Naushehro Feroze, Sukkur IBA University, Pakistan
5Department of Computer Systems Engineering, Kyungsung University, Busan, South Korea
6Electronics Engineering Department, Dawood University of Engineering & Technology, Karachi, Pakistan
7Computer System Engineering Department, Dawood University of Engineering & Technology, Karachi, Pakistan
Summary
Recognizing human actions in still images is a challenging task that involves reasoning about human body postures and interactions with objects. In this paper, a novel method is proposed in which three networks are used to determine the human pose, the most relatable object in the scene, and the overall scenario that includes the actor and all objects around him. Before the proposed method is tested, the performance of conventional transfer learning is evaluated using four popular pre-trained convolutional neural networks for feature extraction; classification is performed by a support vector machine (SVM), where only the principal components of the extracted features are passed to the SVM to predict the human action in the scene. The proposed model is evaluated on the Stanford40 dataset, which contains images of 40 human actions, each annotated with a bounding box around the person performing the action. The dataset comprises 9,532 images in total, with 180-300 images per class; for the experiment, only 10 classes of the dataset are used to evaluate the proposed model. Experimental results show that the proposed method achieves high robustness and accuracy.
Keywords:
convolutional neural networks, transfer learning, support vector machine.
1. Introduction
Human action recognition based on videos has long been an active research area in computer vision [1][2]. On the other hand, still image-based human action recognition has not been in the spotlight and has received less focus from modern researchers. Lately, the research community has paid increasing attention to it, making efforts to set up benchmarks and sort out issues, as in the PASCAL VOC action recognition challenge [3]. Unlike video-based recognition, where image sequences play a vital role [4], in still image-based action recognition the main idea is to predict action labels that provide an interpretation of human actions and their contact with the objects present in the scene [5].
The convolutional neural network (CNN) has emerged as a key development in computer vision, replacing many conventional computer vision techniques. CNN or ConvNet models not only improve image classification accuracy but are also employed to extract features in depth estimation, semantic segmentation, and object detection [6][7]. Since CNNs have high computational cost and memory requirements to train and deploy, hardware with high specifications is also essential. A system deployed for human action monitoring, or to automate surveillance, thievery detection, and warning systems in banks and malls, requires real-time processing capability even on an embedded board with comparatively little computational power and memory. Unlike desktop PCs, embedded boards are limited in computing power, memory, and power consumption; for these reasons, the deployment of deep neural network-based algorithms and systems that require extensive computation is restricted on embedded systems. For that reason, a study into the optimization of convolutional neural network technology is needed to overcome such limitations.
Therefore, in order to tackle these limitations, this paper proposes a method for detecting human actions in still images with performance comparable to state-of-the-art methods but with improved accuracy and a smaller memory footprint. Feature extraction is carried out with four different popular pre-trained networks for performance evaluation; principal component analysis reduces the dimensionality of the feature matrix, and a support vector machine then classifies the action in the scene.
2. Related Work
Action recognition based on videos has been well established over the years, with a long list of literature [1], [27], [28]. For still image-based action recognition, different parameters have been investigated and experimentally tested for efficient human action recognition with high accuracy and low computational power consumption. Existing methods can be grouped into three categories.
The first scheme is based on human poses: it applies human part detectors to find the parts of the human body and encodes them into a pose for action recognition [8]. In [9], the authors train a convolutional neural network for the estimation of human poses.
The second scheme is based on the situation or circumstances. This category considers not only human poses but also human-object interactions as an aid to human action recognition. In [10], the authors create pairs of human poses and the objects the human is interacting with, and pick discriminative pairs for action recognition. Yao in [11] considered multiple interactions in a scene, including human poses, human-object interaction, and the affiliation among objects. In [12], pre-trained object detectors are deployed to detect the objects most related to the person in the scene.
The third approach is part-based. In [13], local patches of an image are used as parts to train a model similar to the classifier for action recognition in [14]. In [15], human actions in a scene are recognized using only image labels to locate humans in the scene. Multiple detectors are used to detect the human upper body and face. Once the humans are detected, the most related objects are detected on the basis of their relative locations.
3. Proposed Method
In machine learning, transfer learning, or knowledge transfer, is a method that utilizes previously learned knowledge to solve a new problem. For training models with a small dataset, transfer learning using pre-trained deep ConvNets is very useful, because ConvNets face overfitting problems on small datasets. Overfitting can also be avoided by increasing the size of the dataset, but this incurs high annotation costs and requires heavy computation, which increases complexity. In this case, the transfer learning method is used by utilizing pre-trained deep representations for the construction of a new architecture [16]. In this paper, we have employed four popular pre-trained models: ResNet-18 [17], VGG16 and VGG19 [18], and GoogLeNet [19].
ResNet-18 is a convolutional neural network pre-trained on more than a million images from 1000 categories of the ImageNet dataset [20]. The network consists of 18 layers in total with an input size of 224 by 224, and thanks to this extensive learning of feature representations for a wide range of images, it can classify 1000 different categories such as keyboard, mouse, and pencil. Both VGG16 and VGG19 are convolutional neural networks pre-trained on the ImageNet dataset [20]. The networks consist of 16 and 19 layers respectively and have an input size of 224 by 224. GoogLeNet is a convolutional neural network 22 layers deep, pre-trained on the ImageNet dataset [20] and capable of classifying images into 1000 categories, such as mouse, pencil, keyboard, and many animals. The network has an input size of 224 by 224.
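As a minimal illustration, the four pre-trained backbones can be loaded as follows. This sketch assumes a Python environment with PyTorch and torchvision; the paper does not name its framework, so this is one plausible choice, not the authors' implementation.

```python
# Hypothetical sketch: loading the four ImageNet pre-trained backbones
# with torchvision (an assumption; the paper names no framework).
import torchvision.models as models

backbones = {
    "resnet18": models.resnet18(pretrained=True),
    "vgg16": models.vgg16(pretrained=True),
    "vgg19": models.vgg19(pretrained=True),
    "googlenet": models.googlenet(pretrained=True),
}

# All four networks expect 224x224 RGB inputs normalized with ImageNet
# statistics, matching the input size stated in the text.
for name, net in backbones.items():
    net.eval()  # inference mode: used as fixed feature extractors
```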
In the proposed approach, features are extracted by the pre-trained models, with the output taken from the 5th pooling layer of each network. Principal component analysis is performed on the features extracted from the pre-trained models to reduce computation, followed by a support vector machine (SVM) classifier for action recognition. The block diagram of the conventional transfer learning system is shown in figure 1; the first row indicates the source architecture and the second row shows the target.
Fig. 1 Overview of the conventional transfer learning system.
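The baseline pipeline can be sketched as follows. This is a non-authoritative reconstruction assuming VGG16 as the backbone (whose `features` module in torchvision ends at the 5th pooling layer) and scikit-learn for PCA and the SVM; the number of principal components is an illustrative placeholder, as the paper does not report it.

```python
# Sketch of the conventional transfer-learning baseline:
# pool5 features -> PCA -> SVM. Assumes PyTorch + scikit-learn;
# n_components=256 is an illustrative guess, not from the paper.
import torch
import torchvision.models as models
from sklearn.decomposition import PCA
from sklearn.svm import SVC

vgg16 = models.vgg16(pretrained=True).eval()

def pool5_features(images: torch.Tensor) -> torch.Tensor:
    """Return flattened 5th-pooling-layer activations.

    `images` is a batch of normalized 224x224 RGB tensors; in
    torchvision, `vgg16.features` ends with the 5th max-pool, so its
    output (N, 512, 7, 7) is exactly the pool5 feature map.
    """
    with torch.no_grad():
        fmap = vgg16.features(images)
    return fmap.flatten(start_dim=1)  # (N, 25088)

def fit_baseline(X_train, y_train, n_components=256):
    """X_train: (N, 3, 224, 224) image batch; y_train: action labels."""
    feats = pool5_features(X_train).numpy()
    pca = PCA(n_components=n_components).fit(feats)      # reduce 25088-dim vectors
    svm = SVC(kernel="linear").fit(pca.transform(feats), y_train)
    return pca, svm
```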
In the proposed method, three major factors that constitute an action are considered: the human pose, the most relatable object within the scene, and the overall scenario. To include these factors, three parallel networks are used, followed by feature fusion and a convolutional neural network; classification is performed by an SVM classifier.
Fig. 2 Overview of the proposed method: feature fusion followed by CNN and SVM classifier.
Initially, given an input image $I$, we use different networks to detect all humans, their poses, and the most relatable objects in the scene, creating a set of detected bounding boxes $B = \{b_1, b_2, \ldots, b_N\}$, where $N$ represents the total number of detected bounding boxes. The detected boxes for the human and for the objects are represented as $b_h$ and $b_o$ respectively, and the detection confidence scores for both are represented as $s_h$ and $s_o$ respectively. Human pose estimation and recognition for matching actions are obtained by transfer learning from the datasets of [22][23].
The action prediction score $S^a$ of the given image is calculated for each candidate action $a \in A$, where $A$, with dimension $|A|$, includes all action classes, given each triple of human-object-scenario bounding boxes ($b_h$, $b_o$, and $b_{sc}$), where $b_{sc}$ represents the scenario bounding box, which includes the actor and all other objects, giving the overall aspect of the scene a chance to play in the prediction score. $S^a$ depends firstly on the individual confidence scores of the actor $s_h$ and the object $s_o$, secondly on the human-object-scenario confidence score $s^a_{h,o,sc}$, and thirdly on the pose feature representation $s^a_p$. The action prediction score is given as

$$S^a = s_h \cdot s_o \cdot s^a_{h,o,sc} \cdot s^a_p \qquad (1)$$

The sigmoid activation is utilized for classification to avoid competition between the predicted classes. The training objective is to minimize the binary cross-entropy loss between the action labels $y$ and the predicted scores $\hat{y}$:

$$L = \frac{1}{M} \sum_{j=1}^{M} L_j \qquad (2)$$

$$L_j = -\sum_{i} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log\left(1 - \hat{y}_{ij}\right) \right] \qquad (3)$$

where $L$ and $L_j$ represent the average cross-entropy loss over an $M$-sample batch and the total cross-entropy loss of the $j$th prediction respectively, $y_{ij}$ is the action label for the $i$th action in the $j$th prediction, and $\hat{y}_{ij}$ represents the prediction score for the $i$th action. Figure 2 describes the proposed method, with feature fusion followed by a CNN network and an SVM classifier.
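As a non-authoritative illustration of how equations (1)-(3) could be computed, the following sketch assumes batched score tensors in PyTorch; the names and shapes are ours, not the paper's.

```python
# Illustrative sketch of equations (1)-(3); tensor names/shapes are assumed.
import torch
import torch.nn.functional as F

def action_scores(s_h, s_o, s_hosc, s_pose):
    """Multiplicative score fusion of eq. (1).

    s_h, s_o:  detection confidences of actor and object, shape (B, 1)
    s_hosc:    human-object-scenario confidence per action, shape (B, A)
    s_pose:    pose-based score per action, shape (B, A)
    Returns per-action scores of shape (B, A).
    """
    return s_h * s_o * s_hosc * s_pose

def action_loss(scores, labels):
    """Sigmoid + binary cross-entropy of eqs. (2)-(3).

    Sigmoid (rather than softmax) lets several actions score highly
    without competing. `labels` is a (B, A) multi-hot tensor; summing
    over actions gives L_j of eq. (3), and the batch mean matches the
    1/M averaging of eq. (2).
    """
    y_hat = torch.sigmoid(scores)
    per_sample = F.binary_cross_entropy(y_hat, labels, reduction="none").sum(dim=1)
    return per_sample.mean()
```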
4. Experiments and Results
In this section, we discuss the experimental setup, training process, and results of the proposed method. The proposed method is tested on the open-source Stanford40 dataset [21]. The dataset contains images of 40 different human actions, with approximately 180 to 300 images per class; each image in the dataset has a bounding box around the person performing the action. For experimental purposes, only 10 classes are used to evaluate the proposed method with the four different pre-trained models. Some sample images from the classes used in the experiment are shown in figure 3.
Fig. 3 Sample images from the Stanford40 dataset.
Feature extraction is performed by the four different pre-trained networks, each with the same input size of 224 by 224; principal component analysis is then applied to the extracted features, followed by an SVM classifier to classify actions among the 10 classes. The per-model experimental results are shown in table 1.
Table 1: Comparison of classification results on the Stanford40 dataset

  Method      Mean AP (%)
  ResNet-18   87.132
  VGG16       85.748
  VGG19       83.657
  GoogLeNet   84.387
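As a hedged illustration of how such per-model mean AP figures can be computed, the following sketch assumes scikit-learn one-vs-rest SVM decision scores and the `pca` and `svm` objects from the earlier baseline sketch; the paper does not state its evaluation code.

```python
# Sketch: mean average precision over the 10 classes, assuming
# scikit-learn; `pca` and `svm` come from the baseline sketch above,
# and `feats_test` holds pool5 features of the test images.
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

def mean_ap(svm, pca, feats_test, y_test, n_classes=10):
    # One-vs-rest decision scores, shape (N, n_classes) with the
    # default decision_function_shape="ovr" of SVC.
    scores = svm.decision_function(pca.transform(feats_test))
    y_bin = label_binarize(y_test, classes=np.arange(n_classes))
    aps = [average_precision_score(y_bin[:, c], scores[:, c])
           for c in range(n_classes)]
    return 100.0 * float(np.mean(aps))  # percentage, as in Table 1
```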
Next, the proposed method is tested on the same dataset. First, the input image is processed through three different networks to find bounding boxes for human detection, pose estimation, and object detection. The distance between each object detected in the scene and the detected human bounding box is calculated, and the object with the minimum distance is declared the most relatable object in the scene. Another network estimates the human pose, which participates in the action prediction score, and a network utilizing previously learned knowledge, followed by an SVM classifier, detects the overall scenario, which includes the actor and all the objects in the scene. Finally, all scores are interrelated in the decision fusion to provide a final decision. Our method is found to perform better than the conventional transfer learning methods, providing a better accuracy of 86.413%. Figure 4 shows some of the actions recognized by the proposed method. The mean AP comparison of the proposed method is shown in table 2, which illustrates that the proposed method achieves better results than the other existing methods.
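The "most relatable object" rule described above can be sketched as a nearest-box search. The center-to-center Euclidean distance used here is one plausible reading, since the paper says only "minimum distance" without defining the metric.

```python
# Sketch of most-relatable-object selection: the detected object whose
# bounding box lies closest to the human box. Center-to-center Euclidean
# distance is an assumption; the paper does not define the metric.
import numpy as np

def box_center(box):
    """Box given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def most_relatable_object(human_box, object_boxes):
    """Return the index of the detected object nearest to the human."""
    h = box_center(human_box)
    dists = [np.linalg.norm(box_center(b) - h) for b in object_boxes]
    return int(np.argmin(dists))
```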
Fig. 4 Classified actions from test dataset with true labels.
Table 2: Comparison of classification results with existing methods on the Stanford40 dataset

  Method                               Mean AP (%)
  Khan [24]                            75.4
  Semantic parts [25]                  80.6
  Image classification (VGG16 model)   81.4
  Zhang [26]                           82.6
  Proposed method                      87.1
5. Conclusion
In this paper, a human action recognition method is proposed based on three networks utilizing transfer learning from pre-trained convolutional neural network architectures and SVM classifiers. The pre-trained network architectures are used for human pose estimation, detection of objects in the scene, and the overall scenario. This is followed by decision fusion, where the confidence scores of the three different networks are related and the final decision is produced. It was demonstrated that transfer learning can effectively utilize already learned knowledge for a new task when the training dataset is small. Training a deep learning model from scratch is computationally expensive and time-consuming, which can be avoided by using transfer learning. The performance of the proposed method was evaluated on the Stanford40 dataset and achieved 87.13% overall accuracy based on the ResNet-18 pre-trained deep network.
References
[1] R. Poppe, “A survey on vision-based human action recognition”, Image and Vision Computing, vol. 28, no. 6, pp. 976–990, 2010.
[2] G. Cheng, Y. Wan, A. Saudagar, K. Namuduri, and B. Buckles, “Advances in human action recognition: A survey”, arXiv preprint, pp. 1–30, 2015.
[3] M. Everingham, L. V. Gool, C. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results”, http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html.
[4] J. Wu, Y. Zhang, and W. Lin, “Towards good practices for
action video encoding,” in Proc. IEEE Int’l Conf. on
Computer Vision and Pattern Recognition, 2014, pp. 2577–
2584.
[5] G. D. Guo and A. Lai, “A survey on still image based human action recognition”, Pattern Recognition, vol. 47, no. 10, pp. 3343–3361, 2014.
[6] C. Szegedy, W. Liu, Y. Jia, and P. Sermanet, “Going deeper with convolutions”, arXiv preprint, 2014.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition”, Proc. Int’l Conf. on Machine Learning (ICML), vol. 32, pp. 647–655, 2014.
[8] S. Maji, L. Bourdev, and J. Malik, “Action recognition from
a distributed representation of pose and appearance”, IEEE
Int’l Conf. on Computer Vision and Pattern Recognition,
2011, pp. 3177–3184.
[9] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler,
“Efficient object localization using convolutional networks”,
IEEE Int’l Conf. on Computer Vision and Pattern
Recognition, 2015, pp. 648–656.
[10] V. Delaitre, J. Sivic, and I. Laptev, “Learning person-object
interactions for action recognition in still images”, Advances
in Neural Information Processing Systems, 2011
[11] B. Yao and L. Fei-Fei, “Recognizing human-object
interactions in still images by modeling the mutual context of
objects and human poses”, IEEE Trans. on Pattern Analysis
and Machine Intelligence, vol. 34, no. 9, pp. 1691–1703,
2012.
[12] G. Gkioxari, R. Girshick, and J. Malik, “Contextual action
recognition with R*CNN”, IEEE Int’l Conf. on Computer
Vision, 2015, pp. 1080–1088.
[13] G. Sharma, F. Jurie, and C. Schmid, “Expanded parts model
for semantic description of humans in still images”,
arXiv:1509.04186, 2015.
[14] P. Felzenszwalb, R. Girshick, D. McAllester, and D.
Ramanan, “Object detection with discriminatively trained
part-based models”, IEEE Trans. on Pattern Analysis and
Machine Intelligence, vol. 32, no. 9, pp. 1627– 1645, 2010.
[15] C. Schmid, and V. Ferrari, “Weakly supervised learning of
interactions between humans and objects”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp.
601–614, 2012.
[16] Y. C. Su, T. H. Chiu, C. Y. Yeh, H. F. Huang, “Transfer
Learning for Video Recognition with Scarce Training Data
for Deep Convolutional Neural Network”, arXiv preprint
arXiv:1409.4127, 2014
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition”, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv preprint arXiv:1409.1556, 2014.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions”, IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[20] ImageNet. http://www.image-net.org
[21] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, and L. Fei-Fei, “Human Action Recognition by Learning Bases of Action Attributes and Parts”, International Conference on Computer Vision (ICCV), Barcelona, Spain, November 6-13, 2011.
[22] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, “Detectron”, https://github.com/facebookresearch/detectron, 2018.
[23] A. Recasens, A. Khosla, C. Vondrick, and A. Torralba,
“Where are they looking?” in NIPS, pp. 199–207, 2015.
[24] F. S. Khan, J. Xu, J. van de Weijer, A. D. Bagdanov, R. M.
Anwer, and A. M. Lopez, “Recognizing actions through
action specific person detection”, IEEE Transactions on
Image Processing, vol. 24, no. 11, pp. 4422–4432, 2015.
[25] Z. Zhao, H. Ma, and X. Chen, “Semantic parts based top-
down pyramid for action recognition”, Pattern Recognition
Letters, vol. 84, pp. 134–141, 2016.
[26] Y. Zhang, L. Cheng, J. Wu, J. Cai, M. N. Do, and J. Lu, “Action recognition in still images with minimum annotation efforts”, IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5479–5490, Nov. 2016.
[27] A. R. Siyal, Z. Bhutto, K. Saleem, A. S. Chan, M. L. Memon, M. H. Shaikh, and S. Ahmed, “Ship detection in satellite imagery by multiple classifier network”, International Journal of Computer Science and Network Security (IJCSNS), vol. 19, no. 8, pp. 142–148, Aug. 2019.
[28] Z. Bhutto, M. Z. Tunio, A. Hussain, J. Shah, I. Ali, and M. H. Shaikh, “Scaling of color fusion in stitching images”, International Journal of Computer Science and Network Security (IJCSNS), vol. 19, no. 4, pp. 61–64, Apr. 2019.