Fig 1 - uploaded by Congqi Cao
Illustration of body joint guided pooling in a 3D CNN. Different colors of feature maps represent different channels. Each channel of the feature maps is a 3D spatio-temporal cube. We pool the activations on 3D convolutional feature maps according to the positions of body joints. By aggregating the pooled activations of all the clips belonging to one video together, we obtain a descriptor of the video. 
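The pooling scheme the caption describes can be sketched in a few lines: given one clip's 3D convolutional feature maps and the body-joint positions, look up the activation vector at each joint's (downscaled) spatio-temporal location. The function name, the stride values, and the nearest-position indexing here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def joint_guided_pool(feat, joints, stride=(2, 16, 16)):
    """Pool 3D conv activations at body-joint positions (sketch).

    feat:   (C, T, H, W) feature maps of one clip
    joints: (J, 3) joint positions as (t, y, x) in input-video coords
    stride: assumed cumulative (temporal, vertical, horizontal)
            downsampling between the input and this feature map
    Returns a (J, C) matrix: one C-dim activation vector per joint.
    """
    C, T, H, W = feat.shape
    pooled = np.empty((len(joints), C))
    for j, (t, y, x) in enumerate(joints):
        # map input coordinates to feature-map coordinates, clamp to bounds
        ti = min(int(t) // stride[0], T - 1)
        yi = min(int(y) // stride[1], H - 1)
        xi = min(int(x) // stride[2], W - 1)
        pooled[j] = feat[:, ti, yi, xi]
    return pooled
```

Each clip thus yields a (J, C) block of joint-centered activations; the caption's final step aggregates these blocks over all clips of a video.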

Source publication
Article
Full-text available
Three-dimensional convolutional neural networks (3D CNNs) have been established as a powerful tool for simultaneously learning features from both spatial and temporal dimensions, which makes them well suited to video-based action recognition. In this work, we propose not to directly use the activations of fully-connected layers of a 3D CNN as the vi...

Context in source publication

Context 1
... the preliminary version of our work [19], we proposed an efficient way of pooling activations on 3D feature maps with the guidance of body joint positions to generate video descriptors, as illustrated in Figure 1. The split video clips are fed into a 3D CNN for convolutional computation. ...
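The aggregation step mentioned in this context — combining the pooled activations of all clips belonging to one video into a single descriptor — could look like the following sketch. Max-pooling over clips is one plausible aggregation and is an assumption here, not necessarily what [19] uses.

```python
import numpy as np

def video_descriptor(clip_descriptors):
    """Aggregate per-clip pooled activations into one video descriptor.

    clip_descriptors: list of (J, C) arrays, one per clip
                      (J joints, C channels).
    Flattens each clip's joint activations and max-pools over clips;
    max-pooling is an assumed choice of aggregation.
    """
    stacked = np.stack([d.reshape(-1) for d in clip_descriptors])
    return stacked.max(axis=0)  # (J*C,) video-level descriptor
```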

Similar publications

Preprint
Full-text available
The goal of this work is to reduce driver's range anxiety by estimating the real-time energy consumption of electric vehicles using deep convolutional neural network. The real-time estimate can be used to accurately predict the remaining range for the vehicle and hence, can reduce driver's range anxiety. In contrast to existing techniques, the non-...
Article
Full-text available
The deep convolution neural network has shown great potential in the field of human action recognition. For the sake of obtaining compact and discriminative feature representation, this paper proposes multiple pooling strategies using CNN features. We explore three different pooling strategies, which are called space-time feature pooling (STFP), ti...
Conference Paper
Full-text available
This paper proposes a new interval observer for joint estimation of the state and unknown inputs of a discrete-time linear parameter-varying (LPV) system with an unmeasurable parameter vector. This system is assumed to be subject to unknown inputs and unknown but bounded disturbances and measurement noise, while the parameter-varying matrices are e...
Article
Full-text available
Attributes-Based Re-Identification is a way of identifying individuals when presented with multiple pictures taken under varying conditions. The method typically builds a classifier to detect the presence of certain appearance characteristics in an image, and creates feature descriptors based on the output of the classifier. We improve attribute de...

Citations

... This section validates the achievement of the SOSDCNN-HAR prototype on three distinct datasets: the UCF-Sports Action Dataset [25], the Penn Action dataset [26], and the NW-UCLA dataset [27]. Table 1 offers a detailed accuracy investigation of the SOSDCNN-HAR prototype with recent prototypes on the trial UCF-Sports Action Dataset [28][29][30]. The outcomes indicated that the SVM and DT prototypes resulted in lower precision percentages of 77.81% and 78.18%, respectively. ...
... In particular, the VA appeared to be greater than TA. ...
Article
Full-text available
Cognitive assessment plays a vital role in clinical care and research related to cognitive aging and cognitive health. Lately, researchers have worked towards providing solutions to measure individual cognitive health; however, those solutions remain difficult to use in real-world settings, so using deep neural networks to evaluate cognitive health is becoming a hot research topic. Deep learning and human activity recognition are two domains that have received attention over the past few years: the former for its excellent performance and recent achievements in fields such as speech and image recognition, and the latter for its relevance to application fields like health monitoring and ambient assisted living. This research develops a novel Symbiotic Organism Search with a Deep Convolutional Neural Network-based Human Activity Recognition (SOSDCNN-HAR) model for cognitive health assessment. The goal of the SOSDCNN-HAR model is to recognize human activities in an end-to-end way. For noise elimination, the SOSDCNN-HAR model employs the Wiener filtering (WF) technique. In addition, it uses a RetinaNet-based feature extractor for automated feature extraction. Moreover, the SOS procedure is exploited as a hyperparameter optimization tool to enhance recognition efficiency. Furthermore, a gated recurrent unit (GRU) model is employed as a classifier to assign the proper class labels. The performance of the SOSDCNN-HAR model is validated using a set of benchmark datasets. A far-reaching experimental examination reported the superiority of the SOSDCNN-HAR model over current approaches, with enhanced precision of 86.51% and 89.50% on the Penn Action and NW-UCLA datasets, respectively.
... So, the JTDPAHBRD scheme was developed by applying the PAHBRNN, which divides the feature maps related to the human skeleton in every clip into different parts depending on the body structure [17]. Those part-based features were hierarchically learned by the separate PABRNNs to obtain and fuse the long-range spatiotemporal information related to the different body parts. ...
Article
Many researchers are now focusing on Human Action Recognition (HAR) based on various deep-learning features related to body joints and their trajectories in videos. Among many schemes, Joints and Trajectory-pooled 3D-Deep Geometric Positional Attention-based Hierarchical Bidirectional Recurrent convolutional Descriptors (JTDGPAHBRD) can provide a video descriptor by learning geometric features and trajectories of the body joints. However, the spatial-temporal dynamics of the different geometric features of the skeleton structure were not explored in depth. To solve this problem, this article develops a Graph Convolutional Network (GCN) in addition to the JTDGPAHBRD to create a video descriptor for HAR. The GCN can obtain complementary information, such as higher-level spatial-temporal features, between consecutive frames to enhance end-to-end learning. In addition, to improve feature representation ability, a search space with several adaptive graph components is created. Then, a sampling- and computation-efficient evolution scheme is applied to explore this space. Moreover, the resulting GCN provides the temporal dynamics of the skeleton pattern, which are fused with the geometric features of the skeleton body joints and trajectory coordinates from the JTDGPAHBRD to create a more effective video descriptor for HAR. Finally, extensive experiments show that the JTDGPAHBRD-GCN model outperforms existing HAR models on the Penn Action Dataset (PAD).
... Intuitively, the most informative parts for action recognition generally lie around joint positions, as also reported by Cao et al. [48]; therefore, as a first step, we boost the features that lie around 2D joints, enabling the model to pay more attention to the body joint areas. Since not all body parts are equally informative, the boosting degree should depend on the informativeness of each area. ...
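A minimal sketch of the joint-area boosting idea described in this excerpt: add a Gaussian bump around each 2D joint, scaled by that joint's informativeness score, and multiply the feature map by the resulting mask. The Gaussian form, the `sigma` value, and the additive mask construction are illustrative assumptions, not the cited method's exact formulation.

```python
import numpy as np

def boost_joint_areas(feat, joints, scores, sigma=2.0):
    """Re-weight spatial features around 2D joints (sketch).

    feat:   (C, H, W) feature map
    joints: (J, 2) joint coordinates (y, x) in feature-map space
    scores: (J,) informativeness score per joint in [0, 1]
    Builds a spatial mask that is 1 everywhere and rises near
    informative joints, then multiplies the features by it.
    """
    C, H, W = feat.shape
    yy, xx = np.mgrid[0:H, 0:W]
    mask = np.ones((H, W))
    for (y, x), s in zip(joints, scores):
        bump = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        mask += s * bump  # stronger boost for more informative joints
    return feat * mask
```

With this form, features far from any joint pass through unchanged (mask ≈ 1), which matches the excerpt's intent of emphasizing rather than gating.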
Preprint
Among the existing modalities for 3D action recognition, 3D flow has been poorly examined, although it conveys rich motion cues for human actions. Presumably, its susceptibility to noise renders it intractable, thus challenging the learning process within deep models. This work demonstrates the use of 3D flow sequences by a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided from the skeleton domain, for emphasizing motion features close to the body joint areas and according to their informativeness. Towards this end, an extended deep skeleton model is also introduced to learn the most discriminant action motion dynamics, so as to estimate an informativeness score for each joint. Subsequently, a late fusion scheme is adopted between the two models for learning the high-level cross-modal correlations. Experimental results on the currently largest and most challenging dataset, NTU RGB+D, demonstrate the effectiveness of the proposed approach, achieving state-of-the-art results.
... Penn-Action Dataset: Table 1 shows the results of the comparison experiments with other SoTA action recognition technologies on the Penn-Action dataset: 1) body joint guided 3D deep convolutional descriptors (3D Deep) [8], 2) evolution of pose estimation maps (EV-Pose) [29], 3) multitask CNN [32], 4) Bayesian hierarchical dynamic model (HDM-BG) [55], 5) view-invariant probabilistic embedding (Pr-VIPE) [40], and 6) a unified framework for skeleton-based action recognition (UNIK) [49]. UNIK [49] was pretrained using the Posetics dataset reconstructed from the Kinetics-400 [25] dataset, and Pr-VIPE [40] was pretrained using the Human3.6M ...
... During the experiment, STAR-transformer and the three methods [8,29,32] using the RGB of the video frames and the pose (skeleton) feature together showed a high overall performance of 98% or higher. However, the three methods [55,40,49] using only the pose feature showed a relatively low performance of 93% to 97%. ...
Conference Paper
Full-text available
In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons are output as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into STAR-transformer. The STAR-transformer encoder layer consists of a full self-attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance in comparison to previous state-of-the-art methods.
... Performance comparison (accuracy): [29] Ensemble Model, 0.93; [30] Scalable Neural Network, 0.95; [31] Two Stream Bilinear C3D, 0.84; Our Study, LogRF+RF, 0.99. ... fitness exercises. We utilized a multi-class exercise dataset based on human skeleton movement points to conduct our experimental research. ...
Article
Full-text available
Human pose and gesture estimation are crucial in correcting physiotherapy fitness exercises. In recent years, advancements in computer vision and machine learning approaches have led to the development of sophisticated pose estimation models that accurately track and analyze human movements in real time. This technology enables physiotherapists and fitness trainers to gain valuable insights into their clients' exercise forms and techniques, facilitating more effective exercise corrections and personalized training regimens. This research aims to propose an efficient artificial intelligence method for human pose estimation during physiotherapy fitness exercises. We utilized a multi-class exercise dataset based on human skeleton movement points to conduct our experimental research. The dataset comprises 133 features derived from human skeleton movements during various exercises, resulting in high feature dimensionality that affects the performance of human pose estimation with machine learning and deep learning methods. We have introduced a novel Logistic regression Recursive Feature elimination (LogRF) method for feature selection. Extensive experiments demonstrate that, using the top twenty selected features, the random forest method outperformed state-of-the-art studies with a high performance score of 0.998. The performance of each applied method is validated through a k-fold approach and further enhanced using hyperparameter tuning. Our proposed study assists specialists in identifying and addressing potential biomechanical issues, improper postures, and incorrect movement patterns, which are essential for injury prevention and optimizing exercise outcomes. Furthermore, this study enhances remote monitoring and guidance capabilities, allowing physiotherapists to continually support their patients' progress with prescribed exercises.
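The LogRF idea in this abstract — recursive feature elimination driven by logistic-regression weights, after which a stronger classifier is trained on the surviving features — can be sketched with plain numpy. The gradient-descent settings and the drop-one-feature-per-round loop are assumptions; the paper does not specify its implementation here.

```python
import numpy as np

def logistic_rfe(X, y, n_keep, lr=0.1, epochs=200):
    """Sketch of LogRF-style feature selection: repeatedly fit a
    logistic regression and drop the feature with the smallest
    weight magnitude until n_keep features remain.

    X: (n_samples, n_features) matrix, y: (n_samples,) binary labels.
    Returns the list of surviving feature indices.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        Xs = X[:, keep]
        w = np.zeros(Xs.shape[1])
        for _ in range(epochs):  # plain gradient descent on log-loss
            p = 1.0 / (1.0 + np.exp(-(Xs @ w)))
            w -= lr * Xs.T @ (p - y) / len(y)
        keep.pop(int(np.argmin(np.abs(w))))  # eliminate weakest feature
    return keep
```

In the paper's pipeline, the selected columns would then feed a random forest; here the sketch stops at the selection step.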
... In this article, this acceleration has been calculated and modeled by using the severity of the optical flow changes in three consecutive frames. To calculate spatial features, this article uses the VGG19 network [8] trained on ImageNet and then selects the features from the penultimate layer as the feature vectors. Another network that considers the changes in the optical flow as a feature is the TDD network [9], which was trained and created using the UCF101 dataset. ...
Article
Full-text available
The use of surveillance cameras has made it possible to analyze a huge amount of data for automated surveillance. Security systems in schools, hotels, hospitals, and other secure areas are required to identify violent activities that can cause social, economic, and environmental damage. Detecting the moving objects in each frame is a fundamental phase in the analysis of the video stream and in violence recognition. Therefore, a three-step approach is presented in this article. In our method, the separation of the frames containing the motion information and the detection of the violent behavior are applied at two levels of the network. First, the people in the video frames are identified by using a convolutional neural network. In the second step, a sequence of 16 frames containing the identified people is injected into the 3D CNN. Furthermore, we optimize the 3D CNN by using visual inference and a neural network optimization tool that transforms the pre-trained model into an intermediate representation. Finally, this method uses the OPENVINO toolbox to perform the optimization operations and increase performance. To evaluate the accuracy of our algorithm, two datasets have been analyzed: Violence in Movies and Hockey Fight. The results show that the final accuracy is 99.9% and 96% on the two datasets, respectively.
... It can also be seen that the loss values saturate as the number of epochs increases. To demonstrate the enhanced performance of the SSDCNN-HPE model, a comparative accuracy analysis on the Penn Action dataset is made in Table 3 and Fig. 15 [31][32][33][34][35]. From the experimental results, it can be noticed that the JAR-PSV and PAAP models resulted in minimal accuracy values of 0.860 and 0.790, respectively. ...
Article
Full-text available
Human pose estimation (HPE) is a procedure for determining the structure of the body pose, and it is considered a challenging issue in the computer vision (CV) community. HPE finds applications in several fields, namely activity recognition and human-computer interfaces. Despite its benefits, HPE is still a challenging process due to variations in visual appearance, lighting, occlusions, dimensionality, etc. To resolve these issues, this paper presents a squirrel search optimization with a deep convolutional neural network for HPE (SSDCNN-HPE) technique. The major intention of the SSDCNN-HPE technique is to identify the human pose accurately and efficiently. Primarily, the video frame conversion process is performed and pre-processing takes place via a bilateral-filtering-based noise removal process. Then, the EfficientNet model is applied to identify the body points of a person with no problem constraints. Besides, the hyperparameter tuning of the EfficientNet model takes place by the use of the squirrel search algorithm (SSA). In the final stage, the multiclass support vector machine (M-SVM) technique is utilized for the identification and classification of human poses. The design of bilateral filtering followed by an SSA-based EfficientNet model for HPE depicts the novelty of the work. To demonstrate the enhanced outcomes of the SSDCNN-HPE approach, a series of simulations is executed. The experimental results reported the improvement of the SSDCNN-HPE system over recent existing techniques in terms of different measures.
... However, it is difficult and challenging to balance and make full use of spatiotemporal information for human action recognition based on skeleton sequences (Kim and Reiter, 2017). The mainstream approaches usually represent skeleton sequences as pseudo-images as the standard input of CNNs (Cao et al., 2018; Hou et al., 2018; Xu et al., 2018; Li C. et al., 2019). In these methods, the spatial structure and temporal dynamic information of the skeleton sequences are encoded as columns and rows of a tensor, respectively. ...
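The pseudo-image encoding mentioned in this context can be sketched as follows: joint coordinates become pixel intensities, with time along one axis and joints along the other. The normalization to [0, 255] is an assumption; the cited methods differ in the exact mapping.

```python
import numpy as np

def skeleton_to_pseudo_image(seq):
    """Encode a skeleton sequence as a pseudo-image for a 2D CNN
    (sketch of the pseudo-image line of work cited above).

    seq: (T, J, 3) array of T frames, J joints, 3D coordinates.
    Returns a (T, J, 3) uint8 image: rows = time, columns = joints,
    channels = x/y/z, intensities min-max normalized to [0, 255].
    """
    lo, hi = seq.min(), seq.max()
    img = (seq - lo) / (hi - lo + 1e-8) * 255.0
    return img.astype(np.uint8)
```

The temporal dynamics then appear as intensity patterns along the rows, which is exactly the "columns and rows of a tensor" encoding the excerpt describes.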
Article
Full-text available
Graph convolution networks (GCNs) have been widely used in the field of skeleton-based human action recognition. However, it is still difficult to improve recognition performance and reduce parameter complexity. In this paper, a novel multi-scale attention spatiotemporal GCN (MSA-STGCN) is proposed for human violence action recognition by learning spatiotemporal features from four different skeleton modality variants. Firstly, the original joint data are preprocessed to obtain joint position, bone vector, joint motion, and bone motion data as inputs of the recognition framework. Then, a spatial multi-scale graph convolution network based on the attention mechanism is constructed to obtain the spatial features from joint nodes, while a temporal graph convolution network in the form of hybrid dilation convolution is designed to enlarge the receptive field of the feature map and capture multi-scale context information. Finally, the specific relationships in the different skeleton data are explored by fusing the multi-stream information related to human joints and bones. To evaluate the performance of the proposed MSA-STGCN, a skeleton violence action dataset, Filtered NTU RGB+D, was constructed based on NTU RGB+D 120. We conducted experiments on the constructed Filtered NTU RGB+D and Kinetics Skeleton 400 datasets to verify the performance of the proposed recognition framework. The proposed method achieves an accuracy of 95.3% on the Filtered NTU RGB+D with 1.21M parameters, and accuracies of 36.2% (Top-1) and 58.5% (Top-5) on the Kinetics Skeleton 400. The experimental results on these two skeleton datasets show that the proposed recognition framework can effectively recognize violence actions without adding parameters.
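The four skeleton modality variants named in this abstract (joint position, bone vector, joint motion, bone motion) can be derived from raw joint coordinates as in the sketch below; the bone list and array layout are assumptions for illustration.

```python
import numpy as np

def skeleton_streams(joints, bones):
    """Derive the four input streams used by multi-stream skeleton
    models: joint positions, bone vectors, joint motion, bone motion.

    joints: (T, J, C) joint coordinates over T frames
    bones:  list of (child, parent) joint-index pairs, one per bone
    """
    # bone vector: child joint minus its parent joint, per frame
    bone_vec = np.stack([joints[:, c] - joints[:, p] for c, p in bones],
                        axis=1)
    # motion streams: frame-to-frame displacement
    joint_motion = joints[1:] - joints[:-1]
    bone_motion = bone_vec[1:] - bone_vec[:-1]
    return joints, bone_vec, joint_motion, bone_motion
```

Each stream would feed its own branch of the recognition framework before the fusion step the abstract describes.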
... It has been demonstrated in [194] that the central concept consists of using the locations of body joints to select visual features in space and time with traditional feature-extraction methods. It has been suggested that 3D convolutions are the best method for addressing the temporal dimension of image sequences; however, these methods have a large number of parameters and are unable to effectively benefit from the abundance of still images during training [195]. There is a third approach to integrating the temporal component; however, these methods rely on the estimation of optical flow, which can be quite challenging. ...
... In 2D action recognition methods, the information on the body joints is typically only used to isolate localized visual characteristics [194]. In most cases, such information either cannot be generated or has lower precision when poses are estimated [195]. For 3D action recognition, skeleton data is the primary source of information, as opposed to video for video-based recognition. ...
Article
Full-text available
Human pose estimation (HPE) has developed over the past decade into a vibrant field of research with a variety of real-world applications, such as 3D reconstruction, virtual testing, and person re-identification. Information about human poses is also a critical component in many downstream tasks, such as activity recognition and movement tracking. This review focuses on the key aspects of deep learning in the development of both 2D & 3D HPE. It provides detailed information on the variety of databases, performance metrics, and human body models incorporated for implementing HPE methodologies. This paper discusses a variety of applications of HPE across domains like activity recognition, animation and gaming, virtual reality, video tracking, etc. The paper presents an analytical study of all the major works that use deep learning methods for various downstream tasks in each domain for both 2D & 3D HPE. Finally, it discusses issues and limitations in the current topic of HPE and recommends potential future research directions in order to make meaningful progress in this area.
... Previous approaches have deployed classical methods for feature extraction [17,18], where the central principle revolves around using the locations of body joints to select visual features in space and time. In recent years, 3D convolutions seem to yield the highest classification scores [19,20]; the issue is that they require a high number of parameters and large amounts of memory for training. This can be improved by attention models which focus on body parts [21], with two-stream networks being utilised to merge both RGB images and the expensive optical flow maps [22]. ...
Preprint
Full-text available
With advancements in computer vision taking place day by day, a lot of light has recently been shed on activity recognition. With the range of real-world applications utilizing this field of study increasing across a multitude of industries such as security and healthcare, it becomes crucial for businesses to distinguish which machine learning methods perform better than others in this area. This paper strives to aid in this predicament: building upon previous related work, it employs both classical and ensemble approaches on rich pose estimation (OpenPose) and HAR datasets. Using appropriate metrics to evaluate the performance of each model, the results show that, overall, random forest yields the highest accuracy in classifying ADLs. Nearly all the models perform excellently across both datasets, except for logistic regression and AdaBoost, which perform poorly on the HAR one. With the limitations of this paper also discussed at the end, the scope for further research is vast and can use this paper as a base in aims of producing better results.