Figure (from Mathematical Problems in Engineering): The first group of interaction: kicking

Source publication
Article
Full-text available
This paper proposes a novel approach to decompose two-person interaction into a Positive Action and a Negative Action for more efficient behavior recognition. A Positive Action plays the decisive role in a two-person exchange. Thus, interaction recognition can be simplified to Positive Action-based recognition, focusing on an action representation...

Similar publications

Article
Full-text available
The proportion of welding work in the total man-hours required for shipbuilding processes is significant, and welding man-hours are strongly affected by working posture. Continuous research has been conducted to identify welding postures by exploiting the relationship between man-hours and working posture. However, the results...

Citations

... The dataset is divided into five distinct train-test splits as in [52]. (2) The K3HI: Kinect-Based 3D Human Interaction Dataset [53] is a two-person interaction dataset comprising eight interactions: approaching, departing, kicking, punching, pointing, pushing, exchanging an object, and shaking hands. The data are recorded from 15 volunteers. ...
... The dataset has approximately 320 interactions of duration 20 to 104 frames. The dataset is divided into four distinct train-test splits as in [53]. ...
... In the table, "our models" refers to the three models (M1, M2, M3) discussed in this paper, even though M1 was proposed in [11]. The other models cited in this table ([53][54][55][56][57][58][59][60]) perform classification only (no generation). They take both skeletons as input, similar to our models. ...
Article
Full-text available
The remarkable human ability to predict others' intent during physical interactions develops at a very early age and is crucial for development. Intent prediction, defined as the simultaneous recognition and generation of human-human interactions, has many applications such as in assistive robotics, human-robot interaction, video and robotic surveillance, and autonomous driving. However, models for solving the problem are scarce. This paper proposes two attention-based agent models to predict the intent of interacting 3D skeletons by sampling them via a sequence of glimpses. The novelty of these agent models is that they are inherently multimodal, consisting of perceptual and proprioceptive pathways. The action (attention) is driven by the agent's generation error, and not by reinforcement. At each sampling instant, the agent completes the partially observed skeletal motion and infers the interaction class. It learns where and what to sample by minimizing the generation and classification errors. Extensive evaluation of our models is carried out on benchmark datasets and in comparison to a state-of-the-art model for intent prediction, which reveals that classification and generation accuracies of one of the proposed models are comparable to those of the state of the art even though our model contains fewer trainable parameters. The insights gained from our model designs can inform the development of efficient agents, the future of artificial intelligence (AI).
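A rough sketch of the glimpse mechanism these agent models describe is given below (all names and shapes are illustrative, not the authors' code). At each step the agent completes the partially observed skeleton and then attends to the joint with the largest generation error; here the error is computed against ground truth purely for illustration, whereas the actual agent must estimate it internally.

```python
import numpy as np

def glimpse_rollout(skeleton, completion_model, num_glimpses=5):
    """Toy sketch of generation-error-driven glimpse sampling.

    skeleton: (J, 3) array of 3D joint positions for one frame.
    completion_model: hypothetical callable mapping (masked_skeleton,
        observed_mask) -> (J, 3) prediction of the full skeleton.
    """
    num_joints = skeleton.shape[0]
    observed = np.zeros(num_joints, dtype=bool)
    observed[np.random.randint(num_joints)] = True  # random initial glimpse

    for _ in range(num_glimpses - 1):
        prediction = completion_model(skeleton * observed[:, None], observed)
        # Per-joint generation error (oracle version, for illustration only;
        # the paper's agent drives attention by its own estimated error).
        error = np.linalg.norm(prediction - skeleton, axis=1)
        error[observed] = -np.inf          # only unobserved joints compete
        observed[np.argmax(error)] = True  # attend where generation fails most
    return observed
```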
... Guo et al. [7] propose the Extreme Pose Interaction dataset as well as a two-stream network with cross-interaction attention for interaction modeling. InterFormer [3] uses an attention-based interaction transformer to generate sparse-level reactive motions on the K3HI [10] and DuetDance [13] datasets. SocialDiffusion [25] proposes the first diffusion-based stochastic multi-person motion anticipation model. ...
Preprint
Text-to-motion synthesis is a crucial task in computer vision. Existing methods are limited in their universality, as they are tailored for single-person or two-person scenarios and cannot be applied to generate motions for more individuals. To achieve number-free motion synthesis, this paper reconsiders motion generation and proposes to unify single-person and multi-person motion via a conditional motion distribution. Furthermore, a generation module and an interaction module are designed for our FreeMotion framework to decouple the process of conditional motion generation and finally support number-free motion synthesis. Moreover, based on our framework, current single-person motion spatial control methods can be seamlessly integrated, achieving precise control of multi-person motion. Extensive experiments demonstrate the superior performance of our method and our capability to infer single and multi-human motions simultaneously.
... We divide the dataset into five distinct train-test splits, as in [29]. The Kinect-based 3D Human Interaction (K3HI) Dataset [30] consists of videos of eight two-person interactions: departing, approaching, kicking, pointing, punching, pushing, shaking hands, and exchanging an object. The data are recorded from 15 volunteers. ...
... The dataset has around 320 interaction videos, ranging from 20 to 104 frames. We divide the dataset into four distinct train-test splits, as in [30]. ...
Article
The human ability to infer others' intent is innate and crucial to development. Machines ought to acquire this ability for seamless interaction with humans. In this article, we propose an agent model for predicting the intent of actors in human–human interactions. This requires simultaneous generation and recognition of an interaction at any time, for which end-to-end models are scarce. The proposed agent actively samples its environment via a sequence of glimpses. At each sampling instant, the model infers the observation class and completes the partially observed body motion. It learns the sequence of body locations to sample by jointly minimizing the classification and generation errors. The model is evaluated on videos of two-skeleton interactions under two settings: (first person) one skeleton is the modeled agent and the other skeleton's joint movements constitute its visual observation, and (third person) an audience is the modeled agent and the two interacting skeletons' joint movements constitute its visual observation. Three methods for implementing the attention mechanism are analyzed using benchmark datasets. One of them, where attention is driven by sensory prediction error, achieves the highest classification accuracy in both settings by sampling less than 50% of the skeleton joints, while also being the most efficient in terms of model size. This is the first known attention-based agent to learn end-to-end from two-person interactions for intent prediction, with high accuracy and efficiency.
... None of them considers close interactions between dynamic humans. Of the datasets containing human-human interactions [22,23,30,35,53,72,77], the most recent ones with close human interactions are summarized in Tab. 1. Shake-Five2 [23] and MuPoTS-3D [53] only provide 3D joint locations as reference data, lacking body shape information. The dataset most closely related to ours is CHI3D [22], which employs a motion capture system to fit parametric human models of at most one actor at a time. ...
Preprint
We propose Hi4D, a method and dataset for the automatic analysis of physically close human-human interaction under prolonged contact. Robustly disentangling several in-contact subjects is a challenging task due to occlusions and complex shapes. Hence, existing multi-view systems typically fuse 3D surfaces of close subjects into a single, connected mesh. To address this issue we leverage (i) individually fitted neural implicit avatars and (ii) an alternating optimization scheme that refines pose and surface through periods of close proximity, and thus (iii) segment the fused raw scans into individual instances. From these instances we compile the Hi4D dataset of 4D textured scans of 20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D contains rich interaction-centric annotations in 2D and 3D alongside accurately registered parametric body models. We define varied human pose and shape estimation tasks on this dataset and provide results from state-of-the-art methods on these benchmarks.
... no HRI; DeepMind Kinetics [20], 2017, no HRI; ShakeFive2 [21], 2016, no HRI; K3HI [22], 2013 ... (fragment of a flattened table listing datasets, their year, and whether they include human-robot interaction (HRI))
Preprint
The field of social robotics will likely need to depart from a paradigm of designed behaviours and imitation learning and adopt modern reinforcement learning (RL) methods to enable robots to interact fluidly and efficaciously with humans. In this paper, we present the Social Reward Function as a mechanism to provide (1) a real-time, dense reward function necessary for the deployment of RL agents in social robotics, and (2) a standardised objective metric for comparing the efficacy of different social robots. The Social Reward Function is designed to closely mimic the genetically endowed social perception capabilities of humans in an effort to provide a simple, stable and culture-agnostic reward function. Presently, datasets used in social robotics are either small or significantly out-of-domain. The use of the Social Reward Function will allow larger in-domain datasets to be collected close to the behaviour policy of social robots, which will allow further improvements to both reward functions and the behaviour policies of social robots. We believe this will be the key enabler to developing efficacious social robots in the future.
... The Kinect-based 3D Human Interaction Dataset [16] is a two-person interaction dataset comprising eight interactions: approaching, departing, kicking, punching, pointing, pushing, exchanging an object, and shaking hands. The data are recorded from 15 volunteers. ...
... Since the rapid development of RGB-D (i.e., color plus depth) sensors provides accurate real-time full-body tracking at low cost, it establishes a new way to facilitate the recognition of human actions [11]. Many RGB-D-based benchmark datasets have been created as well. ...
... Feature_Two = 1
10 Select the Salient Point and save the Salient Joint 3D coordinates as salient_position.
11 Compute the horizontal distance hd from salient_position to the plane and the vertical distance vd from salient_position to the normal plane.
12 Assemble ...
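Step 11 in this fragment reduces to standard point-to-plane geometry. A minimal sketch under assumed inputs (each plane given by a point on it and a normal vector; the paper's exact plane construction is not shown in the snippet):

```python
import numpy as np

def point_to_plane_distance(point, plane_point, plane_normal):
    """Unsigned distance from a 3D point to a plane defined by a point
    on the plane and a (not necessarily unit) normal vector."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return abs((point - plane_point) @ n)

# Hypothetical use mirroring step 11 (plane definitions are assumptions):
# hd = point_to_plane_distance(salient_position, p0, plane_normal)
# vd = point_to_plane_distance(salient_position, p0, normal_plane_normal)
```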
Article
Full-text available
Depth sensors are widely used today and have great impact on object pose estimation, camera tracking, human action recognition, and scene reconstruction. This paper presents a novel method for human interaction recognition based on 3D skeleton data captured by a Kinect sensor, using a hierarchical spatial-temporal saliency-based representation. Hierarchical saliency can be conceptualized as Salient Actions at the highest level, determined by the initial movement in an interaction; Salient Points at the middle level, determined by a single time point uniquely identified for all instances of a Salient Action; and Salient Joints at the lowest level, determined by the greatest positional changes of human joints in a Salient Action sequence. Given the interaction saliency at different levels, several types of features, such as spatial displacement and direction relations, are introduced based on action characteristics. Since there are few publicly accessible test datasets, we created a new dataset with eight types of interactions, named K3HI, using the Microsoft Kinect. The method was tested with a Support Vector Machine (SVM) multi-class classifier. In the experiment, the results demonstrate that the average recognition accuracy of the hierarchical saliency-based representation is 90.29%, outperforming methods using other features.
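The lowest level of this hierarchy, Salient Joints, is defined by the greatest positional changes over a Salient Action sequence. A minimal sketch of that selection (array shapes and names are assumptions, not the authors' code):

```python
import numpy as np

def salient_joints(sequence, k=3):
    """Return indices of the k joints with the largest total positional
    change over a skeleton sequence.

    sequence: (T, J, 3) array of J joint positions over T frames.
    """
    # Frame-to-frame displacement of each joint, accumulated over time.
    displacement = np.linalg.norm(np.diff(sequence, axis=0), axis=2).sum(axis=0)
    return np.argsort(displacement)[::-1][:k]
```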
... However, depth cameras can be used to simultaneously acquire color and depth images, providing a rich source of three-dimensional (3D) spatial information about indoor scenes. They have been widely applied in 3D reconstruction [15], action tracking and recognition, etc. [10,11,16,36,45]. In this study, an indoor room was probed using Microsoft Kinect depth cameras to investigate methods for understanding scene semantics and to study activity recognition methods based on the semantic context of the scene. ...
Article
Full-text available
Like outdoor security, indoor security is a critical problem, and human action recognition in indoor areas remains a hot topic. Most studies on human action recognition ignore the semantic information of a scene, whereas indoor environments contain a variety of semantics. Meanwhile, a depth sensor providing color and depth data is well suited to extracting the semantic context of human actions. Hence, this paper proposes an indoor action recognition method using Kinect based on the semantics of a scene. First, we propose a trajectory clustering algorithm for a three-dimensional (3D) scene that combines different characteristics of people, such as spatial location, movement direction, and speed. Based on the clustering results and scene context, we derive a region of interest (ROI) extraction method for indoor scenes, and dynamic time warping (DTW) is used to analyze abnormal action sequences. Finally, the color- and depth-data-based 3D motion history image (3D-MHI) features and the semantic context of the scene were combined to recognize human actions. In the experiment, two datasets were tested, and the results demonstrate that our semantics-based method performs better than other methods.
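Since the abstract leans on dynamic time warping (DTW) to compare action sequences, a self-contained sketch of the classic DTW distance may help; the paper's frame features and cost function are not specified, so Euclidean frame cost is assumed here:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between
    two feature sequences a of shape (Ta, D) and b of shape (Tb, D)."""
    ta, tb = len(a), len(b)
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean frame cost
            acc[i, j] = cost + min(acc[i - 1, j],       # insertion
                                   acc[i, j - 1],       # deletion
                                   acc[i - 1, j - 1])   # match
    return acc[ta, tb]
```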
... There are many datasets of human-human (two-person) and multi-human (more than two persons) interactions. [15] and [16] contain datasets with eight types of interactions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. The LIRIS daily-life human activities dataset is recorded using a Kinect sensor mounted on a robot, capturing human-human and human-object interactions [17]. ...
... Since the harmonic mean of a list of numbers tends strongly toward the smallest elements of the list, it has the advantage of mitigating the impact of outliers with large values. So, using the harmonic mean (Eqs. (16)-(18)) helps discount the irrelevant frames around the main action and mitigates the impact of odd and erroneous values extracted by the depth device. The harmonic mean of the Euclidean distances d_t between any two joints J_x and J_y in a window of W frames is given by Eq. ...
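Written out from the description (the referenced equation itself is elided in the snippet), the window statistic is W divided by the sum of reciprocal distances. A minimal sketch:

```python
import numpy as np

def harmonic_mean_distance(joint_x, joint_y, eps=1e-8):
    """Harmonic mean of per-frame Euclidean distances between two joints.

    joint_x, joint_y: (W, 3) arrays of 3D positions over a W-frame window.
    Small distances dominate the mean, so frames with spuriously large
    distances (sensor glitches) have little influence on the result.
    """
    d = np.linalg.norm(joint_x - joint_y, axis=1)  # d_t for t = 1..W
    return len(d) / np.sum(1.0 / (d + eps))        # eps guards zero distances
```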
Article
Full-text available
Many critical applications require accurate real-time human action recognition. However, there are many hurdles associated with capturing and pre-processing image data, calculating features, and classification, because these steps consume significant storage and computation resources. To circumvent these hurdles, this paper presents a machine learning (ML) based recognition system model that uses reduced-dimension features obtained by projecting the real 3D skeleton modality onto a virtual 2D space. The MMU VAAC dataset is used to test the proposed ML model. The results show a high accuracy rate of 97.88%, only slightly lower than the accuracy obtained with the original 3D modality-based features, while achieving a 75% data reduction ratio relative to using the RGB modality. These results motivate implementing the proposed recognition model on an embedded system platform in the future.
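The dimensionality reduction described, projecting 3D skeleton joints onto a virtual 2D space, can be illustrated with a simple orthographic projection; the paper's actual choice of projection plane is not given here, so the plane basis is an assumed input:

```python
import numpy as np

def project_to_virtual_plane(joints_3d, u, v):
    """Orthographic projection of 3D joints onto a virtual 2D plane.

    joints_3d: (J, 3) joint positions.
    u, v: orthonormal 3D vectors spanning the virtual plane (assumed).
    Returns (J, 2) plane coordinates, shrinking each joint from 3 to 2 values.
    """
    basis = np.stack([u, v], axis=1)  # (3, 2) projection matrix
    return joints_3d @ basis          # (J, 2)
```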
... Indoor surveillance is one of the most important applications of the IoT, and people detection in complex scenarios has always been a difficult problem. In recent years, depth sensors have received more and more attention and have been applied in human action recognition and pedestrian detection [3,4]. Compared to traditional optical cameras, depth sensors do not depend on a light source, and they can compute positional relationships between objects based on distance. ...
... At present, foreground target detection methods that combine color and depth information, such as the RGBD+MoG method [3] proposed by Schiller and Koch and the Depth-Extended Codebook (DECB) [6] proposed by Fernandez-Sanchez et al., achieve good results. However, the DECB method relies on the sample size and the choice of parameters. ...
... As in the previous section, the pedestrian detection test is conducted under three different illumination conditions. We compared the experimental results of the RGBD+ViBe method with those of three other methods: HOG, HOD [3], and Comb-HOD [3]. The Comb-HOD method integrates the HOD and HOG features. ...
Article
Human detection is a popular topic and a difficult problem in surveillance. This paper presents research on human detection in complex indoor spaces using a depth sensor. In recent years, target detection methods based on RGB-D data have mainly included background learning and feature detection operators. The former depends on the initial background knowledge obtained from the first few frames of the video, and the number of frames determines detection quality. The latter is more time-intensive, and insufficient training samples will degrade the detection result. Thus, in this paper we analyze complex scene features thoroughly and integrate color and depth information, proposing an RGBD+ViBe foreground extraction method. Based on the extracted foreground, this research utilizes a 3D Mean-Shift method combined with depth constraints to handle multi-person partial occlusion problems. The experimental results indicate that the proposed RGBD+ViBe method outperforms methods that consider only color or only depth information, as well as the RGBD+MoG method. Furthermore, the proposed 3D Mean-Shift method achieves nearly 90% accuracy in multi-person detection with a false rate of merely 5%, while the accuracies of the HOG, HOD, and Comb-HOD methods are less than 75% and their false rates are around 10%.
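The 3D Mean-Shift step above is, at its core, a standard mode-seeking iteration over the foreground point cloud. A minimal flat-kernel sketch (the paper's depth constraints and kernel are not reproduced; this only illustrates the basic update):

```python
import numpy as np

def mean_shift_3d(points, seed, bandwidth=0.3, max_iters=20, tol=1e-4):
    """Flat-kernel mean-shift toward a local density mode in 3D.

    points: (N, 3) foreground point cloud; seed: (3,) starting center.
    Each iteration moves the center to the mean of points within
    `bandwidth`, converging to a cluster mode (e.g., one person).
    """
    center = np.asarray(seed, dtype=float)
    for _ in range(max_iters):
        mask = np.linalg.norm(points - center, axis=1) < bandwidth
        if not mask.any():
            break                      # no neighbors: stop at current center
        new_center = points[mask].mean(axis=0)
        if np.linalg.norm(new_center - center) < tol:
            break                      # converged to a mode
        center = new_center
    return center
```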