IJCSNS International Journal of Computer Science and Network Security, VOL.19 No.11, November 2019
Manuscript received November 5, 2019
Manuscript revised November 20, 2019
Feature Fusion Based Human Action Recognition in Still Images
Abdul Sattar Chan1, Kashif Saleem2, Zuhaibuddin Bhutto3, Mudasar Latif Memon4, Murtaza Hussain
Shaikh5, Saleem Ahmed6, and Ahsan Raza Siyal7
1Electrical Engineering Department, Sukkur IBA University, Sukkur, Pakistan
2Telecommunication Engineering Department, Dawood University of Engineering & Technology, Karachi, Pakistan
3Department of Computer Systems Engineering, Balochistan University of Engineering & Technology, Pakistan
4IBA Community College Naushehro Feroze, Sukkur IBA University, Pakistan
5Department of Computer Systems Engineering, Kyungsung University, Busan, South Korea
6Electronics Engineering Department, Dawood University of Engineering & Technology, Karachi, Pakistan
7Computer System Engineering Department, Dawood University of Engineering & Technology, Karachi, Pakistan
Summary
Recognizing human actions from still images is a challenging task that involves reasoning about human-object interactions and body postures. In this paper, a novel method is proposed in which three networks are used to determine the human pose, the most relatable object in the scene, and the overall scenario that includes the actor and all objects around them. Before testing the proposed method, the performance of the conventional transfer learning approach is evaluated: four popular pre-trained convolutional neural networks are used for feature extraction, classification is performed by a support vector machine (SVM), and only the principal components of the extracted features are passed to the SVM for predicting the human action in the scene. The Stanford40 dataset is used to evaluate the proposed model; it contains images of 40 human actions, and every image has a bounding box around the person performing the action. There are 9532 images in total, with 180-300 images per class, and only 10 classes of the dataset are used for the evaluation of the proposed model. Experimental results show that the proposed method achieves high robustness and accuracy.
Keywords:
convolutional neural networks, transfer learning, support vector machine.
1. Introduction
Human action recognition based on videos has long been an active research area in computer vision [1][2]. In contrast, human action recognition from still images has received far less attention from modern researchers. Lately, the research community has increased its attention and is making efforts to set up benchmarks and address related problems, such as the PASCAL VOC action recognition challenge [3]. Unlike video-based recognition, where image sequences play a vital role [4], still image-based action recognition must predict an action label from a single frame, providing an interpretation of human poses and their contact with the objects present in the scene [5].
The convolutional neural network (CNN) has emerged as a key development in computer vision and has largely replaced conventional hand-crafted approaches. CNN (or ConvNet) models not only improve image classification accuracy but are also employed to extract features for depth estimation, semantic segmentation, and object detection [6][7]. However, CNNs have high computational and memory requirements for training and deployment, so hardware with high specifications is usually essential. A system deployed for human action monitoring, or to automate surveillance, theft detection, and warning systems in banks and malls, requires real-time processing even on an embedded board with comparatively little computational power and memory. Unlike desktop PCs, embedded boards are limited in computing power, memory, and power consumption; for these reasons, the deployment of deep neural network-based algorithms and systems that require extensive computation is restricted on embedded platforms. It is therefore necessary to study the optimization of convolutional neural network technology to overcome such limitations.
To tackle these limitations, this paper proposes a method for recognizing human actions in still images with performance comparable to state-of-the-art methods but with improved accuracy and a smaller memory footprint. Feature extraction is carried out by four different popular pre-trained networks for performance evaluation, principal component analysis reduces the dimensionality of the feature matrix, and a support vector machine then classifies the action in the scene.
2. Related Work
Action recognition based on videos has been well established over the years, with a long list of literature [1][27][28]. For still image-based action recognition, different parameters have been investigated and experimentally tested to achieve efficient recognition with high accuracy and low computational cost. Existing methods can be grouped into three categories.
The first scheme is based on human poses: human part detectors are applied to detect the parts of the human body, which are then encoded into a pose representation for action recognition [8]. In [9], the authors train a convolutional neural network for human pose estimation.
The second scheme is based on the situation or circumstances. This category considers not only human poses but also human-object interactions as an aid to action recognition. In [10], the authors create pairs of human poses and the objects the human is interacting with, and pick discriminative pairs for action recognition. Yao in [11] considered multiple interactions in a scene, including human poses, human-object interaction, and the affiliation among objects. In [12], pre-trained object detectors are deployed to detect the objects most related to the person in the scene.
The third approach is part-based. In [13], local patches of an image are used as parts to train a model similar to a part-based classifier for action recognition [14]. In [15], human actions in a scene are recognized using only image-level labels to locate the humans; multiple detectors are used to detect the human upper body and face, and after the humans are detected, the most related objects are found on the basis of their relative locations.
3. Proposed Method
In machine learning, transfer learning, or knowledge transfer, is a method that utilizes previously learned knowledge to solve a new problem. For training models with a small dataset, transfer learning using pre-trained deep ConvNets is very useful, because ConvNets face overfitting problems when the dataset is small. Overfitting can also be avoided by enlarging the dataset, but this requires costly annotation and additional computation, which increases complexity. In this case, the transfer learning method is used by utilizing pre-trained deep representations for the construction of the new architecture [16]. In this paper, we have employed four popular pre-trained models: ResNet18 [17], VGG16 and VGG19 [18], and GoogLeNet [19].
ResNet-18 is a convolutional neural network pre-trained on more than a million images from the 1000 categories of the ImageNet dataset [20]. The network is 18 layers deep, has an input size of 224 by 224, and can classify images into 1000 categories (such as keyboard, mouse, and pencil) owing to the rich feature representations it has learned over a wide range of images. Both VGG16 and VGG19 are convolutional neural networks pre-trained on the ImageNet dataset [20]; they are 16 and 19 layers deep, respectively, and have an input size of 224 by 224. GoogLeNet is a 22-layer convolutional neural network, also pre-trained on the ImageNet dataset [20], capable of classifying images into 1000 categories such as mouse, pencil, keyboard, and many animals; it likewise has an input size of 224 by 224.
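As a concrete illustration of this setup, the sketch below loads the four backbones as fixed ImageNet-pre-trained feature extractors. The paper does not name an implementation framework, so torchvision is assumed here purely for illustration.

```python
# Illustrative only: the paper does not name a framework, so torchvision is
# assumed; all four backbones ship with ImageNet weights and 224x224 inputs.
import torchvision.models as models

backbones = {
    "resnet18":  models.resnet18(weights="IMAGENET1K_V1"),
    "vgg16":     models.vgg16(weights="IMAGENET1K_V1"),
    "vgg19":     models.vgg19(weights="IMAGENET1K_V1"),
    "googlenet": models.googlenet(weights="IMAGENET1K_V1"),
}

# The networks are used only as fixed feature extractors, so freeze them.
for net in backbones.values():
    net.eval()
    for p in net.parameters():
        p.requires_grad = False
```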
In the baseline transfer learning approach, features are extracted by the pre-trained models, taking the output of the 5th pooling layer of the network. Principal component analysis is performed on the extracted features to reduce computation, followed by a support vector machine (SVM) classifier for action recognition. The block diagram in figure 1 gives an overview of this conventional transfer learning system; the first row indicates the source architecture and the second row shows the target.
Fig. 1 Overview of the conventional transfer learning system.
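A minimal sketch of this baseline pipeline (pool5 features, PCA, then SVM) is shown below. A PyTorch/scikit-learn implementation with VGG16 as the backbone is assumed; the number of retained principal components (256) is an illustrative choice, not a value reported in the paper.

```python
# Sketch of the conventional transfer-learning baseline of Fig. 1, assuming
# torchvision and scikit-learn; pool5 of VGG16 yields a 25088-dim feature.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA
from sklearn.svm import SVC

vgg16 = models.vgg16(weights="IMAGENET1K_V1").eval()
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def pool5_features(pil_image):
    """Flattened output of the 5th pooling layer for one cropped person image."""
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0)   # 1 x 3 x 224 x 224
        f = vgg16.features(x)                    # ends at pool5: 1 x 512 x 7 x 7
    return f.flatten(1).squeeze(0).numpy()       # 25088-dim vector

# With X_train/X_test stacked from pool5_features and y_train the action labels:
# pca = PCA(n_components=256).fit(X_train)                 # 256 is an assumption
# clf = SVC(kernel="linear").fit(pca.transform(X_train), y_train)
# predictions = clf.predict(pca.transform(X_test))
```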
In the proposed method, three major factors that constitute an action are considered: the human pose, the most relatable object within the scene, and the overall scenario. To include these factors, three parallel networks are used, followed by feature fusion and a convolutional neural network; classification is performed by an SVM classifier.
Fig. 2 The overview of the proposed method: feature fusion followed by CNN and SVM classifier.
Given an input image $I$, we use different networks to detect all humans, their poses, and the most relatable objects in the scene, creating a set of detected bounding boxes $(b_1, \ldots, b_N)$, where $N$ is the total number of detected bounding boxes. The detected boxes for the human and for the objects are denoted $b_h$ and $b_o$ respectively, and their detection confidence scores are $s_h$ and $s_o$ respectively. Human pose estimation and its matching to actions are obtained by transfer learning from existing datasets [22][23].
The action prediction score $S_a$ for the given image is calculated for each candidate action $a$, where $a$ ranges over the $A$ action classes, given the human, object, and scenario bounding boxes ($b_h$, $b_o$ and $b_s$). Here $b_s$ is the scenario bounding box, which includes the actor and all other objects, so that the overall aspect of the scene also contributes to the prediction score. $S_a$ depends firstly on the individual confidence scores of the actor $s_h$ and the object $s_o$, secondly on the human-object-scenario confidence score $s_{h,o,s}$, and thirdly on the pose feature representation $f_p$. The action prediction score is given as

$$S_a = s_h \cdot s_o \cdot \sigma\left(s_{h,o,s} + f_p\right) \qquad (1)$$
The sigmoid activation $\sigma$ is utilized for classification to avoid competition between the predicted classes. The training objective is to minimize the binary cross-entropy loss between the action labels $y$ and the predicted scores $S$:

$$\mathcal{L}(y, S) = \frac{1}{M} \sum_{j=1}^{M} \mathcal{L}_j \qquad (2)$$

$$\mathcal{L}_j = -\sum_{i=1}^{A} \left[\, y_{ij} \log(S_{ij}) + (1 - y_{ij}) \log(1 - S_{ij}) \,\right] \qquad (3)$$

where $\mathcal{L}(y, S)$ and $\mathcal{L}_j$ represent the average cross-entropy loss over an $M$-sample batch and the total cross-entropy loss of a single prediction, respectively, $y_{ij}$ is the label of the $i$th action class in the $j$th prediction, and $S_{ij}$ is the predicted score for the $i$th action. Figure 2 describes the proposed method, with feature fusion followed by a CNN and an SVM classifier.
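The following numerical sketch mirrors Eqs. (1)-(3) as reconstructed above. Because the exact fusion formula is not fully recoverable from the source, the way the scores are combined inside the sigmoid should be read as one plausible interpretation, not the authors' exact definition.

```python
# Sketch of the score fusion and training loss; arrays are NumPy, shapes noted.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def action_scores(s_h, s_o, s_hos, f_p):
    """Eq. (1): per-class scores S_a from the actor confidence s_h, object
    confidence s_o, human-object-scenario score s_hos and pose term f_p,
    each an array of length A (one entry per action class)."""
    return s_h * s_o * sigmoid(s_hos + f_p)

def bce_loss(Y, S, eps=1e-7):
    """Eqs. (2)-(3): binary cross-entropy summed over the A classes of each
    prediction, then averaged over the M predictions in the batch.
    Y, S are (M, A) arrays of labels in {0, 1} and scores in (0, 1)."""
    S = np.clip(S, eps, 1.0 - eps)
    per_prediction = -np.sum(Y * np.log(S) + (1.0 - Y) * np.log(1.0 - S), axis=1)
    return per_prediction.mean()
```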
4. Experiments and Results
In this section, we discuss the experimental setup, training process, and results of the proposed method. The method is tested on the open-source Stanford40 dataset [21], which covers 40 different human actions with approximately 180 to 300 images per class; each image in the dataset has a bounding box around the person performing the action. For experimental purposes, only 10 classes are used to evaluate the proposed method with the four different pre-trained models. Some sample images from the classes used in the experiments are shown in figure 3.
Fig. 3 Some Sample images from Stanford 40 dataset.
Feature extraction is performed by the four different pre-trained networks, all of which have the same input size of 224 by 224; principal component analysis is then performed on the extracted features, followed by an SVM classifier that classifies the actions among the 10 classes. The experimental results for each pre-trained model are shown in table 1.
Table 1: Comparison of transfer learning classification results on the Stanford40 dataset

Method       Mean AP (%)
ResNet18     87.132
VGG16        85.748
GoogLeNet    84.387
Next, the proposed method is tested on the same dataset. First, the input image is processed by three different networks to obtain bounding boxes for human detection, pose estimation, and object detection. The distance between each object detected in the scene and the detected human bounding box is calculated, and the object with the minimum distance is declared the most relatable object in the scene. Another network estimates the human pose, which participates in the action prediction score, and a network utilizing previously learned knowledge, followed by an SVM classifier, captures the overall scenario, which includes the actor and all the objects in the scene. Finally, all scores are combined in the decision fusion stage to produce the final decision. Our method performs better than the conventional transfer learning methods, providing an accuracy of 86.413%. Figure 4 shows some of the actions recognized by the proposed method. The mean AP comparison of the proposed method is shown in table 2, which illustrates that the proposed method achieves better results than the other existing methods.
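The minimum-distance rule for selecting the most relatable object can be sketched as follows; treating the distance as the distance between bounding-box centres is an assumption, since the paper does not state which distance measure is used.

```python
# Pick the detected object whose bounding box is closest to the person box.
# Boxes are assumed to be (x1, y1, x2, y2) tuples; distance is centre-to-centre.
import math

def centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def most_relatable_object(person_box, object_boxes):
    """Index of the object box with minimum distance to the person box."""
    px, py = centre(person_box)
    distances = [math.hypot(px - cx, py - cy)
                 for cx, cy in (centre(b) for b in object_boxes)]
    return min(range(len(distances)), key=distances.__getitem__)

# Example: a cup near the person's hand beats a chair across the room.
# most_relatable_object((50, 50, 150, 250), [(140, 120, 170, 160), (300, 200, 400, 380)])
```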
Fig. 4 Classified actions from test dataset with true labels.
Table 2: Comparison of classification results on the Stanford40 dataset

Method                                 Mean AP (%)
Khan [24]                              75.4
Semantic parts [25]                    80.6
Image classification (VGG16 model)     81.4
Zhang [26]                             82.6
Proposed method                        87.1
5. Conclusion
In this paper, a human action recognition method is proposed based on three networks, utilizing transfer learning with pre-trained convolutional neural network architectures and SVM classifiers. The pre-trained networks are used to determine the human pose, the objects in the scene, and the overall scenario. This is followed by decision fusion, where the confidence scores of the three networks are combined and the final decision is produced. It was demonstrated that transfer learning can effectively reuse previously learned knowledge for a new task when the training dataset is small. Training a deep learning model from scratch is computationally expensive and time-consuming, which can be avoided by using transfer learning. The performance of the proposed method was evaluated on the Stanford40 dataset, achieving 87.13% overall accuracy with the ResNet18 pre-trained deep network.
References
[1] R. Poppe, “A survey on vision-based human action recognition”, Image and Vision Computing, vol. 28, no. 6, pp. 976-990, 2010.
[2] G. Cheng, Y. Wan, A. Saudagar, K. Namuduri, and B. Buckles, “Advances in human action recognition: A survey”, arXiv preprint, pp. 1-30, 2015.
[3] M. Everingham, L. V. Gool, C. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results”, http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[4] J. Wu, Y. Zhang, and W. Lin, “Towards good practices for action video encoding”, in Proc. IEEE Int’l Conf. on Computer Vision and Pattern Recognition, 2014, pp. 2577-2584.
[5] G. D. Guo and A. Lai, “A survey on still image based human action recognition”, Pattern Recognition, vol. 47, no. 10, pp. 3343-3361, 2014.
[6] C. Szegedy, W. Liu, Y. Jia, and P. Sermanet, “Going deeper with convolutions”, arXiv preprint, 2014.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition”, ICML, vol. 32, pp. 647-655, 2014.
[8] S. Maji, L. Bourdev, and J. Malik, “Action recognition from a distributed representation of pose and appearance”, IEEE Int’l Conf. on Computer Vision and Pattern Recognition, 2011, pp. 3177-3184.
[9] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks”, IEEE Int’l Conf. on Computer Vision and Pattern Recognition, 2015, pp. 648-656.
[10] V. Delaitre, J. Sivic, and I. Laptev, “Learning person-object
interactions for action recognition in still images”, Advances
in Neural Information Processing Systems, 2011
[11] B. Yao and L. Fei-Fei, “Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses”, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1691-1703, 2012.
[12] G. Gkioxari, R. Girshick, and J. Malik, “Contextual action recognition with R*CNN”, IEEE Int’l Conf. on Computer Vision, 2015, pp. 1080-1088.
[13] G. Sharma, F. Jurie, and C. Schmid, “Expanded parts model
for semantic description of humans in still images”,
arXiv:1509.04186, 2015.
[14] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models”, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.
[15] C. Schmid and V. Ferrari, “Weakly supervised learning of interactions between humans and objects”, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 601-614, 2012.
[16] Y. C. Su, T. H. Chiu, C. Y. Yeh, and H. F. Huang, “Transfer learning for video recognition with scarce training data for deep convolutional neural network”, arXiv preprint arXiv:1409.4127, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[18] Simonyan, Karen, and Andrew Zisserman. "Very deep
convolutional networks for large-scale image recognition",
arXiv preprint arXiv:1409.1556 (2014).
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[20] ImageNet. http://www.image-net.org
[21] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, and L. Fei-Fei, “Human Action Recognition by Learning Bases of Action Attributes and Parts”, International Conference on Computer Vision (ICCV), Barcelona, Spain, November 6-13, 2011.
[22] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, “Detectron”, https://github.com/facebookresearch/detectron, 2018.
[23] A. Recasens, A. Khosla, C. Vondrick, and A. Torralba,
“Where are they looking?” in NIPS, pp. 199–207, 2015.
[24] F. S. Khan, J. Xu, J. van de Weijer, A. D. Bagdanov, R. M. Anwer, and A. M. Lopez, “Recognizing actions through action specific person detection”, IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4422-4432, 2015.
[25] Z. Zhao, H. Ma, and X. Chen, “Semantic parts based top-down pyramid for action recognition”, Pattern Recognition Letters, vol. 84, pp. 134-141, 2016.
[26] Y. Zhang, L. Cheng, J. Wu, J. Cai, M. N. Do, and J. Lu, “Action recognition in still images with minimum annotation efforts”, IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5479-5490, Nov 2016.
[27] A. R. Siyal, Z. Bhutto, K. Saleem, A. S. Chan, M. L. Memon,
M. H. Shaikh, S. Ahmed, “Ship detection in satellite imagery
by multiple classifier network”, International Journal of
Computer Science and Network Security (IJCSNS), vol. 10,
no. 8, pp. 142-148, Aug. 2019.
[28] Z. Bhutto, M. Z. Tunio, A. Hussain, J. Shah, I. Ali, and M. H.
Shaikh, “Scaling of color fusion in stitching images”,
International Journal of Computer Science and Network
Security (IJCSNS), vol. 10, no. 4, pp. 61-64, Apr. 2019.