Fig 1 - uploaded by Congqi Cao
Illustration of body joint guided pooling in a 3D CNN. Different colors of feature maps represent different channels. Each channel of the feature maps is a 3D spatio-temporal cube. We pool the activations on 3D convolutional feature maps according to the positions of body joints. By aggregating the pooled activations of all the clips belonging to one video together, we obtain a descriptor of the video. 
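The pooling scheme the caption describes can be sketched in a few lines: given one clip's 3D convolutional feature maps and the body-joint positions, look up the activation vector at each joint's (downscaled) spatio-temporal location. The function name, the stride values, and the nearest-position indexing here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def joint_guided_pool(feat, joints, stride=(2, 16, 16)):
    """Pool 3D conv activations at body-joint positions (sketch).

    feat:   (C, T, H, W) feature maps of one clip
    joints: (J, 3) joint positions as (t, y, x) in input-video coords
    stride: assumed cumulative (temporal, vertical, horizontal)
            downsampling between the input and this feature map
    Returns a (J, C) matrix: one C-dim activation vector per joint.
    """
    C, T, H, W = feat.shape
    pooled = np.empty((len(joints), C))
    for j, (t, y, x) in enumerate(joints):
        # map input coordinates to feature-map coordinates, clamp to bounds
        ti = min(int(t) // stride[0], T - 1)
        yi = min(int(y) // stride[1], H - 1)
        xi = min(int(x) // stride[2], W - 1)
        pooled[j] = feat[:, ti, yi, xi]
    return pooled
```

Each clip thus yields a (J, C) block of joint-centered activations; the caption's final step aggregates these blocks over all clips of a video.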

Source publication
Article
Full-text available
Three-dimensional convolutional neural networks (3D CNNs) have been established as a powerful tool for simultaneously learning features from both spatial and temporal dimensions, which makes them well suited to video-based action recognition. In this work, we propose not to directly use the activations of fully-connected layers of a 3D CNN as the vi...

Context in source publication

Context 1
... the preliminary version of our work [19], we proposed an efficient way of pooling activations on 3D feature maps with the guidance of body joint positions to generate video descriptors, as illustrated in Figure 1. The split video clips are fed into a 3D CNN for convolutional computation. ...
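The aggregation step mentioned in this context — combining the pooled activations of all clips belonging to one video into a single descriptor — could look like the following sketch. Max-pooling over clips is one plausible aggregation and is an assumption here, not necessarily what [19] uses.

```python
import numpy as np

def video_descriptor(clip_descriptors):
    """Aggregate per-clip pooled activations into one video descriptor.

    clip_descriptors: list of (J, C) arrays, one per clip
                      (J joints, C channels).
    Flattens each clip's joint activations and max-pools over clips;
    max-pooling is an assumed choice of aggregation.
    """
    stacked = np.stack([d.reshape(-1) for d in clip_descriptors])
    return stacked.max(axis=0)  # (J*C,) video-level descriptor
```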

Similar publications

Preprint
Full-text available
The goal of this work is to reduce driver's range anxiety by estimating the real-time energy consumption of electric vehicles using deep convolutional neural network. The real-time estimate can be used to accurately predict the remaining range for the vehicle and hence, can reduce driver's range anxiety. In contrast to existing techniques, the non-...
Article
Full-text available
The deep convolution neural network has shown great potential in the field of human action recognition. For the sake of obtaining compact and discriminative feature representation, this paper proposes multiple pooling strategies using CNN features. We explore three different pooling strategies, which are called space-time feature pooling (STFP), ti...
Conference Paper
Full-text available
This paper proposes a new interval observer for joint estimation of the state and unknown inputs of a discrete-time linear parameter-varying (LPV) system with an unmeasurable parameter vector. This system is assumed to be subject to unknown inputs and unknown but bounded disturbances and measurement noise, while the parameter-varying matrices are e...
Article
Full-text available
Attributes-Based Re-Identification is a way of identifying individuals when presented with multiple pictures taken under varying conditions. The method typically builds a classifier to detect the presence of certain appearance characteristics in an image, and creates feature descriptors based on the output of the classifier. We improve attribute de...

Citations

... This section validates the achievement of the SOSDCNN-HAR prototype on three distinct datasets: the UCF-Sports Action Dataset [25], the Penn Action dataset [26], and the NW-UCLA dataset [27]. Table 1 offers a detailed accuracy investigation of the SOSDCNN-HAR prototype with recent prototypes on the trial UCF-Sports Action Dataset [28][29][30]. The outcomes indicated that the SVM and DT prototypes resulted in lower precision percentages of 77.81% and 78.18%, respectively. ...
... In particular, the VA appeared to be greater than TA. ...
Article
Full-text available
Cognitive assessment plays a vital role in clinical care and research related to cognitive aging and cognitive health. Lately, researchers have worked towards providing solutions to measure individual cognitive health; however, those solutions remain difficult to use in real-world settings, so using deep neural networks to evaluate cognitive health is becoming a hot research topic. Deep learning and human activity recognition are two domains that have received attention over the past few years: the former for its excellent performance and recent achievements in fields such as speech and image recognition, and the latter for its relevance to application fields like health monitoring and ambient assisted living. This research develops a novel Symbiotic Organism Search with a Deep Convolutional Neural Network-based Human Activity Recognition (SOSDCNN-HAR) model for cognitive health assessment. The goal of the SOSDCNN-HAR model is to recognize human activities in an end-to-end way. For noise elimination, the SOSDCNN-HAR model employs the Wiener filtering (WF) technique. In addition, it uses a RetinaNet-based feature extractor for automated feature extraction. Moreover, the SOS procedure is exploited as a hyperparameter optimization tool to enhance recognition efficiency. Furthermore, a gated recurrent unit (GRU) model is employed as a classifier to assign the proper class labels. The performance of the SOSDCNN-HAR model is validated using a set of benchmark datasets. A far-reaching experimental examination reported the superiority of the SOSDCNN-HAR model over current approaches, with enhanced precision of 86.51% and 89.50% on the Penn Action and NW-UCLA datasets, respectively.
... So, the JTDPAHBRD scheme was developed by applying the PAHBRNN, which divides the feature maps related to the human skeleton in every clip into different parts depending on the body structure [17]. Those part-based features were hierarchically learned by the separate PABRNNs to obtain and fuse the long-range spatiotemporal information related to the different body parts. ...
Article
Many researchers are now focusing on Human Action Recognition (HAR) based on various deep-learning features related to body joints and their trajectories in videos. Among many schemes, Joints and Trajectory-pooled 3D-Deep Geometric Positional Attention-based Hierarchical Bidirectional Recurrent convolutional Descriptors (JTDGPAHBRD) can provide a video descriptor by learning geometric features and trajectories of the body joints. However, the spatial-temporal dynamics of the different geometric features of the skeleton structure were not explored in depth. To solve this problem, this article develops a Graph Convolutional Network (GCN) in addition to the JTDGPAHBRD to create a video descriptor for HAR. The GCN can obtain complementary information, such as higher-level spatial-temporal features, between consecutive frames to enhance end-to-end learning. In addition, to improve feature representation ability, a search space with several adaptive graph components is created. Then, a sampling- and computation-efficient evolution scheme is applied to explore this space. Moreover, the resulting GCN provides the temporal dynamics of the skeleton pattern, which are fused with the geometric features of the skeleton body joints and trajectory coordinates from the JTDGPAHBRD to create a more effective video descriptor for HAR. Finally, extensive experiments show that the JTDGPAHBRD-GCN model outperforms existing HAR models on the Penn Action Dataset (PAD).
... Intuitively, the most informative parts for action recognition generally lie around joint positions, as also reported by Cao et al. [48]; therefore, as a first step, we boost the features that lie around 2D joints, enabling the model to pay more attention to the body joint areas. Since not all body parts are equally informative, the boosting degree should depend on the informativeness of each area. ...
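A minimal sketch of the joint-area boosting idea described in this excerpt: add a Gaussian bump around each 2D joint, scaled by that joint's informativeness score, and multiply the feature map by the resulting mask. The Gaussian form, the `sigma` value, and the additive mask construction are illustrative assumptions, not the cited method's exact formulation.

```python
import numpy as np

def boost_joint_areas(feat, joints, scores, sigma=2.0):
    """Re-weight spatial features around 2D joints (sketch).

    feat:   (C, H, W) feature map
    joints: (J, 2) joint coordinates (y, x) in feature-map space
    scores: (J,) informativeness score per joint in [0, 1]
    Builds a spatial mask that is 1 everywhere and rises near
    informative joints, then multiplies the features by it.
    """
    C, H, W = feat.shape
    yy, xx = np.mgrid[0:H, 0:W]
    mask = np.ones((H, W))
    for (y, x), s in zip(joints, scores):
        bump = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        mask += s * bump  # stronger boost for more informative joints
    return feat * mask
```

With this form, features far from any joint pass through unchanged (mask ≈ 1), which matches the excerpt's intent of emphasizing rather than gating.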
Preprint
Among the existing modalities for 3D action recognition, 3D flow has been poorly examined, although it conveys rich motion cues for human actions. Presumably, its susceptibility to noise renders it intractable, thus challenging the learning process within deep models. This work demonstrates the use of 3D flow sequences by a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided from the skeleton domain, for emphasizing motion features close to the body joint areas and according to their informativeness. Towards this end, an extended deep skeleton model is also introduced to learn the most discriminant action motion dynamics, so as to estimate an informativeness score for each joint. Subsequently, a late fusion scheme is adopted between the two models for learning the high-level cross-modal correlations. Experimental results on the currently largest and most challenging dataset, NTU RGB+D, demonstrate the effectiveness of the proposed approach, achieving state-of-the-art results.
... Penn-Action Dataset: Table 1 shows the results of the comparison experiments with other SoTA action recognition technologies on the Penn-Action dataset: 1) body joint guided 3D deep convolutional descriptors (3D Deep) [8], 2) evolution of pose estimation maps (EV-Pose) [29], 3) multitask CNN [32], 4) Bayesian hierarchical dynamic model (HDM-BG) [55], 5) view-invariant probabilistic embedding (Pr-VIPE) [40], and 6) a unified framework for skeleton-based action recognition (UNIK) [49]. UNIK [49] was pretrained using the Posetics dataset reconstructed from the Kinetics-400 [25] dataset, and Pr-VIPE [40] was pretrained using the Human3.6M ...
... During the experiment, STAR-transformer and the three methods [8,29,32] using the RGB of the video frames and the pose (skeleton) feature together showed a high overall performance of 98% or higher. However, the three methods [55,40,49] using only the pose feature showed a relatively low performance of 93% to 97%. ...
Conference Paper
Full-text available
In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons are output as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into STAR-transformer. The STAR-transformer encoder layer consists of a full self-attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance in comparison to previous state-of-the-art methods.
... Performance comparison (accuracy): [29] Ensemble Model, 0.93; [30] Scalable Neural Network, 0.95; [31] Two Stream Bilinear C3D, 0.84; Our Study, LogRF+RF, 0.99. ... fitness exercises. We utilized a multi-class exercise dataset based on human skeleton movement points to conduct our experimental research. ...
Article
Full-text available
Human pose and gesture estimation are crucial in correcting physiotherapy fitness exercises. In recent years, advancements in computer vision and machine learning approaches have led to the development of sophisticated pose estimation models that accurately track and analyze human movements in real time. This technology enables physiotherapists and fitness trainers to gain valuable insights into their clients' exercise forms and techniques, facilitating more effective exercise corrections and personalized training regimens. This research aims to propose an efficient artificial intelligence method for human pose estimation during physiotherapy fitness exercises. We utilized a multi-class exercise dataset based on human skeleton movement points to conduct our experimental research. The dataset comprises 133 features derived from human skeleton movements during various exercises, resulting in high feature dimensionality that affects the performance of human pose estimation with machine learning and deep learning methods. We have introduced a novel Logistic regression Recursive Feature elimination (LogRF) method for feature selection. Extensive experiments demonstrate that, using the top twenty selected features, the random forest method outperformed state-of-the-art studies with a high performance score of 0.998. The performance of each applied method is validated through a k-fold approach and further enhanced using hyperparameter tuning. Our proposed study assists specialists in identifying and addressing potential biomechanical issues, improper postures, and incorrect movement patterns, which are essential for injury prevention and optimizing exercise outcomes. Furthermore, this study enhances remote monitoring and guidance capabilities, allowing physiotherapists to continually support their patients' progress with prescribed exercises.
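The LogRF idea in this abstract — recursive feature elimination driven by logistic-regression weights, after which a stronger classifier is trained on the surviving features — can be sketched with plain numpy. The gradient-descent settings and the drop-one-feature-per-round loop are assumptions; the paper does not specify its implementation here.

```python
import numpy as np

def logistic_rfe(X, y, n_keep, lr=0.1, epochs=200):
    """Sketch of LogRF-style feature selection: repeatedly fit a
    logistic regression and drop the feature with the smallest
    weight magnitude until n_keep features remain.

    X: (n_samples, n_features) matrix, y: (n_samples,) binary labels.
    Returns the list of surviving feature indices.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        Xs = X[:, keep]
        w = np.zeros(Xs.shape[1])
        for _ in range(epochs):  # plain gradient descent on log-loss
            p = 1.0 / (1.0 + np.exp(-(Xs @ w)))
            w -= lr * Xs.T @ (p - y) / len(y)
        keep.pop(int(np.argmin(np.abs(w))))  # eliminate weakest feature
    return keep
```

In the paper's pipeline, the selected columns would then feed a random forest; here the sketch stops at the selection step.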
... In this article, this acceleration has been calculated and modeled by using the severity of the optical flow changes in three consecutive frames. To calculate spatial features, this article uses the VGG19 network [8] trained on ImageNet and then selects the features from the penultimate layer as the feature vectors. Another network that considers the changes in the optical flow as a feature is the TDD network [9], which was trained and created using the UCF101 dataset. ...
Article
Full-text available
The use of surveillance cameras has made it possible to analyze a huge amount of data for automated surveillance. Security systems in schools, hotels, hospitals, and other secure areas are required to identify violent activities that can cause social, economic, and environmental damage. Detecting the moving objects in each frame is a fundamental phase in the analysis of the video stream and in violence recognition. Therefore, a three-step approach is presented in this article. In our method, the separation of the frames containing the motion information and the detection of the violent behavior are applied at two levels of the network. First, the people in the video frames are identified by using a convolutional neural network. In the second step, a sequence of 16 frames containing the identified people is injected into the 3D CNN. Furthermore, we optimize the 3D CNN by using visual inference and a neural network optimization tool that transforms the pre-trained model into an intermediate representation. Finally, this method uses the OPENVINO toolbox to perform the optimization operations and increase performance. To evaluate the accuracy of our algorithm, two datasets have been analyzed: Violence in Movies and Hockey Fight. The results show that the final accuracy is 99.9% and 96% on the two datasets, respectively.
... It can also be seen that the loss values saturate as the number of epochs increases. To demonstrate the enhanced performance of the SSDCNN-HPE model, a comparative accuracy analysis on the Penn Action dataset is made in Table 3 and Fig. 15 [31][32][33][34][35]. From the experimental results, it can be noticed that the JAR-PSV and PAAP models resulted in minimal accuracy values of 0.860 and 0.790, respectively. ...
Article
Full-text available
Human pose estimation (HPE) is a procedure for determining the structure of the body pose, and it is considered a challenging issue in the computer vision (CV) community. HPE finds applications in several fields, namely activity recognition and human-computer interfaces. Despite its benefits, HPE is still a challenging process due to variations in visual appearance, lighting, occlusions, dimensionality, etc. To resolve these issues, this paper presents a squirrel search optimization with a deep convolutional neural network for HPE (SSDCNN-HPE) technique. The major intention of the SSDCNN-HPE technique is to identify the human pose accurately and efficiently. Primarily, the video frame conversion process is performed and pre-processing takes place via a bilateral-filtering-based noise removal process. Then, the EfficientNet model is applied to identify the body points of a person with no problem constraints. Besides, the hyperparameter tuning of the EfficientNet model takes place by the use of the squirrel search algorithm (SSA). In the final stage, the multiclass support vector machine (M-SVM) technique is utilized for the identification and classification of human poses. The design of bilateral filtering followed by an SSA-based EfficientNet model for HPE depicts the novelty of the work. To demonstrate the enhanced outcomes of the SSDCNN-HPE approach, a series of simulations is executed. The experimental results reported the improvement of the SSDCNN-HPE system over recent existing techniques in terms of different measures.
... However, it is difficult and challenging to balance and make full use of spatiotemporal information for human action recognition based on skeleton sequences (Kim and Reiter, 2017). The mainstream approaches usually represent skeleton sequences as pseudo-images as the standard input of CNNs (Cao et al., 2018; Hou et al., 2018; Xu et al., 2018; Li C. et al., 2019). In these methods, the spatial structure and temporal dynamic information of the skeleton sequences are encoded as columns and rows of a tensor, respectively. ...
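The pseudo-image encoding mentioned in this context can be sketched as follows: joint coordinates become pixel intensities, with time along one axis and joints along the other. The normalization to [0, 255] is an assumption; the cited methods differ in the exact mapping.

```python
import numpy as np

def skeleton_to_pseudo_image(seq):
    """Encode a skeleton sequence as a pseudo-image for a 2D CNN
    (sketch of the pseudo-image line of work cited above).

    seq: (T, J, 3) array of T frames, J joints, 3D coordinates.
    Returns a (T, J, 3) uint8 image: rows = time, columns = joints,
    channels = x/y/z, intensities min-max normalized to [0, 255].
    """
    lo, hi = seq.min(), seq.max()
    img = (seq - lo) / (hi - lo + 1e-8) * 255.0
    return img.astype(np.uint8)
```

The temporal dynamics then appear as intensity patterns along the rows, which is exactly the "columns and rows of a tensor" encoding the excerpt describes.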
Article
Full-text available
Graph convolution networks (GCNs) have been widely used in the field of skeleton-based human action recognition. However, it is still difficult to improve recognition performance and reduce parameter complexity. In this paper, a novel multi-scale attention spatiotemporal GCN (MSA-STGCN) is proposed for human violence action recognition by learning spatiotemporal features from four different skeleton modality variants. Firstly, the original joint data are preprocessed to obtain joint position, bone vector, joint motion, and bone motion data as inputs of the recognition framework. Then, a spatial multi-scale graph convolution network based on the attention mechanism is constructed to obtain the spatial features from joint nodes, while a temporal graph convolution network in the form of hybrid dilation convolution is designed to enlarge the receptive field of the feature map and capture multi-scale context information. Finally, the specific relationships in the different skeleton data are explored by fusing the multi-stream information related to human joints and bones. To evaluate the performance of the proposed MSA-STGCN, a skeleton violence action dataset, Filtered NTU RGB+D, was constructed based on NTU RGB+D 120. We conducted experiments on the constructed Filtered NTU RGB+D and Kinetics Skeleton 400 datasets to verify the performance of the proposed recognition framework. The proposed method achieves an accuracy of 95.3% on the Filtered NTU RGB+D with 1.21M parameters, and accuracies of 36.2% (Top-1) and 58.5% (Top-5) on the Kinetics Skeleton 400. The experimental results on these two skeleton datasets show that the proposed recognition framework can effectively recognize violence actions without adding parameters.
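The four skeleton modality variants named in this abstract (joint position, bone vector, joint motion, bone motion) can be derived from raw joint coordinates as in the sketch below; the bone list and array layout are assumptions for illustration.

```python
import numpy as np

def skeleton_streams(joints, bones):
    """Derive the four input streams used by multi-stream skeleton
    models: joint positions, bone vectors, joint motion, bone motion.

    joints: (T, J, C) joint coordinates over T frames
    bones:  list of (child, parent) joint-index pairs, one per bone
    """
    # bone vector: child joint minus its parent joint, per frame
    bone_vec = np.stack([joints[:, c] - joints[:, p] for c, p in bones],
                        axis=1)
    # motion streams: frame-to-frame displacement
    joint_motion = joints[1:] - joints[:-1]
    bone_motion = bone_vec[1:] - bone_vec[:-1]
    return joints, bone_vec, joint_motion, bone_motion
```

Each stream would feed its own branch of the recognition framework before the fusion step the abstract describes.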
... It has been demonstrated in [194] that the central concept consists of using the locations of body joints to select visual features in space and time with traditional feature-extraction methods. It has been suggested that 3D convolutions are the best method for addressing the temporal dimension of image sequences; however, these methods have a large number of parameters and are unable to effectively benefit from the abundance of still images during training [195]. There is a third approach to integrating the temporal component; however, these methods rely on the estimation of optical flow, which can be quite challenging. ...
... In 2D action recognition methods, the information on the body joints is typically only used to isolate localized visual characteristics [194]. In most cases, such information either cannot be generated or has lower precision when poses are estimated [195]. For 3D action recognition, skeleton data is the primary source of information, as opposed to video for video-based recognition. ...
Article
Full-text available
Human pose estimation (HPE) has developed over the past decade into a vibrant field of research with a variety of real-world applications, such as 3D reconstruction, virtual testing, and person re-identification. Information about human poses is also a critical component in many downstream tasks, such as activity recognition and movement tracking. This review focuses on the key aspects of deep learning in the development of both 2D & 3D HPE. It provides detailed information on the variety of databases, performance metrics, and human body models incorporated for implementing HPE methodologies. This paper discusses a variety of applications of HPE across domains like activity recognition, animation and gaming, virtual reality, video tracking, etc. The paper presents an analytical study of all the major works that use deep learning methods for various downstream tasks in each domain for both 2D & 3D HPE. Finally, it discusses issues and limitations in the current topic of HPE and recommends potential future research directions in order to make meaningful progress in this area.
... Previous approaches have deployed classical methods for feature extraction [17,18], where the central principle revolves around using the locations of body joints to select visual features in space and time. In recent years, 3D convolutions seem to yield the highest classification scores [19,20]; the issue is that they require a high number of parameters and large amounts of memory for training. This can be improved by attention models which focus on body parts [21], with two-stream networks being utilised to merge both RGB images and the expensive optical flow maps [22]. ...
Preprint
Full-text available
With advancements in computer vision taking place day by day, a lot of light has recently been shed on activity recognition. With the range of real-world applications utilizing this field of study increasing across a multitude of industries such as security and healthcare, it becomes crucial for businesses to distinguish which machine learning methods perform better than others in this area. This paper strives to aid in this predicament: building upon previous related work, it employs both classical and ensemble approaches on rich pose estimation (OpenPose) and HAR datasets. Using appropriate metrics to evaluate the performance of each model, the results show that, overall, random forest yields the highest accuracy in classifying ADLs. Nearly all the models perform excellently across both datasets, except for logistic regression and AdaBoost, which perform poorly on the HAR one. With the limitations of this paper also discussed at the end, the scope for further research is vast and can use this paper as a base in aims of producing better results.