Article

HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion

Abstract

While research on articulated human motion and pose estimation has progressed rapidly in the last few years, there has been no systematic quantitative evaluation of competing methods to establish the current state of the art. Current algorithms make many different choices about how to model the human body, how to exploit image evidence and how to approach the inference problem. We argue that there is a need for common datasets that allow fair comparison between different methods and their design choices. Until recently, gathering ground-truth data for evaluation of results (especially in 3D) was challenging. In this report we present a novel dataset obtained using a unique setup for capturing synchronized video and ground-truth 3D motion. Data was captured simultaneously using a calibrated marker-based motion capture system and multiple high-speed video capture systems. The video and motion capture streams were synchronized in software using a direct optimization method. The resulting HumanEva-I dataset contains multiple subjects performing a set of predefined actions with a number of repetitions. On the order of 50,000 frames of synchronized motion capture and video was collected at 60 Hz, with an additional 37,000 frames of pure motion capture data. The data is partitioned into training, validation, and testing sub-sets. A standard set of error metrics is defined that can be used for evaluation of both 2D and 3D pose estimation and tracking algorithms. Support software and an on-line evaluation system for quantifying results using the test data is being made available to the community. This report provides an overview of the dataset.

... Sigal and Balan [36] present the HumanEva dataset for quantitative evaluation of competing methods for articulated human pose estimation. HumanEva is a standard benchmark for multi-view 3D human pose estimation in the laboratory setting. ...
... The prior is typically quite diffuse (because motion can be fast) but the likelihood function may be very peaky, containing multiple local maxima which are hard to account for in detail. The Annealed Particle Filter (APF) [36] and local searches are ways to tackle this problem. APF has been widely used for articulated human pose estimation due to its ability to precisely estimate the statistics of multi-modal and non-Gaussian processes. ...
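The annealing idea in the excerpt above can be made concrete with a short sketch. Below is a minimal, illustrative Python implementation of a single APF time step under assumed inputs (a generic positive likelihood function, hand-picked annealing exponents, and a shrinking diffusion noise); it is a sketch of the general technique, not the code used in the cited work.

import numpy as np

def apf_step(particles, likelihood, betas=(0.25, 0.5, 1.0), noise_scale=0.1, rng=None):
    # particles : (N, D) array of pose hypotheses.
    # likelihood: maps an (N, D) array of poses to (N,) positive scores.
    # betas     : increasing annealing exponents; a small beta flattens the
    #             likelihood so particles can migrate between local maxima.
    rng = rng or np.random.default_rng()
    for m, beta in enumerate(betas):
        w = likelihood(particles) ** beta              # annealed weights
        w = w / w.sum()
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = particles[idx]                     # resample
        particles = particles + rng.normal(            # diffuse, with noise
            scale=noise_scale * (0.5 ** m),            # shrinking per layer
            size=particles.shape)
    return particles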
... Datasets. HumanEva [36] is a standard benchmark for 3D human pose estimation in the laboratory setting, which allows quantitative evaluation of performance. The dataset consists of HumanEva-I and HumanEva-II, each comprising a set of multi-view sequences. ...
Thesis
Human pose estimation is a challenging problem in computer vision and shares all the difficulties of object detection. This thesis focuses on the problems of human pose estimation in still images or video, including the diversity of appearances, changes in scene illumination and confounding background clutter. To tackle these problems, we build a robust model consisting of the following components. First, top-down and bottom-up methods are combined to estimate human pose. We extend the Pictorial Structure (PS) model to cooperate with the annealed particle filter (APF) for robust multi-view pose estimation. Second, we propose an upper-body-based multiple mixture parts (MMP) model for human pose estimation that contains two stages. In the pre-estimation stage, there are three steps: upper body detection, model category estimation for the upper body, and full model selection for pose estimation. In the estimation stage, we address the problem of a variety of human poses and activities. Finally, a Deep Convolutional Neural Network (DCNN) is introduced for human pose estimation. A Local Multi-Resolution Convolutional Neural Network (LMR-CNN) is proposed to learn the representation of each body part. Moreover, an LMR-CNN-based hierarchical model is defined to meet the structural complexity of limb parts. The experimental results demonstrate the effectiveness of the proposed model.
... Due to their advantage in capturing complex pose variability, sparse representation (SR)-based approaches have shown their effectiveness in recovering 3D poses [17,21,40,66,68]. These approaches estimate a 3D human pose as a whole by fitting its projections to given/detected 2D landmarks, in which the 3D pose is represented as a linear combination of a set of 3D geometry bases learned from existing motion capture (MoCap) datasets [12,19,45]. Under the assumption of sufficient training classes [62], a sparsity constraint is imposed on the inference procedure, so that only a few 3D geometry bases are selected to represent the 3D pose. ...
... However, rich and diverse training samples are often difficult to obtain, since it is almost infeasible to collect 3D annotations in an unconstrained environment (e.g., outdoor scenarios) [11,52]. For example, the widely used MoCap datasets are collected indoors [12,19,45], which greatly limits the diversity of subjects and actions. Therefore, the generalization of SR-based approaches is poor when handling real-world scenarios that contain more diverse poses, resulting in a degradation of estimation results. ...
... The proposed approach is evaluated on four publicly available datasets: Human3.6M [19], HumanEva-I [45], CMU MoCap [12], and MPII Human Pose [3]. Specifically, the last one is used for the qualitative evaluation, and the others for the quantitative evaluations. ...
Article
Full-text available
Recovering the 3D human pose from a single image with 2D joints is a challenging task in computer vision applications. The sparse representation (SR) model has been successfully adopted in 3D pose estimation approaches. However, since existing available training 3D data are often collected in a constrained environment (i.e., indoor) with limited diversity of subjects and actions, most SR-based approaches would have a lower generalization to real-world scenarios that may contain more complex cases. To alleviate this issue, this paper proposes SDM3d, a novel shape decomposition using multiple geometric priors for 3D pose estimation. SDM3d makes a new attempt by separating a 3D pose into the global structure and body deformations that are encoded explicitly via different priors constraints. Furthermore, a joint learning strategy is designed to learn two over-complete dictionaries from training data to capture more geometric priors information. We have evaluated SDM3d on four well-recognized benchmarks, i.e., Human3.6M, HumanEva-I, CMU MoCap, and MPII. The experiment results show the effectiveness of SDM3d.
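The SR recipe described above (a 3D pose expressed as a sparse combination of learned basis poses whose projection must match the 2D landmarks) can be sketched in a few lines. In the Python sketch below, the basis matrix B, the orthographic camera M, and the weight lam are hypothetical placeholders, and the solver is plain iterative soft-thresholding (ISTA) rather than the optimization used in SDM3d.

import numpy as np

def lift_sr(x2d, B, M, lam=0.1, n_iter=500):
    # x2d: (2, J) detected 2D landmarks; B: (K, 3*J) basis poses (one per row);
    # M: (2, 3) orthographic camera. Returns the (3, J) pose and coefficients.
    K, J = B.shape[0], x2d.shape[1]
    # Each basis pose projects to a 2D pattern: these form the columns of A.
    A = np.stack([(M @ B[k].reshape(3, J)).ravel() for k in range(K)], axis=1)
    x = x2d.ravel()
    c = np.zeros(K)
    t = 1.0 / np.linalg.norm(A, 2) ** 2                # ISTA step size
    for _ in range(n_iter):
        g = A.T @ (A @ c - x)                          # gradient of the fit term
        z = c - t * g
        c = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # soft threshold
    return (c @ B).reshape(3, J), c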
... In this paper, we consider a task whose goal is to predict a sequence of 3D pose skeletons in the Human3.6M and HumanEva-I [9] datasets when a previously observed 3D pose sequence is given as input. ...
... For stochastic experiments, we preprocess the Human3.6M [6] and HumanEva-I [9] datasets into an xyz-based representation, as [18], [19] do. Based on that, various metrics for evaluating likelihood and diversity are measured. ...
Preprint
Full-text available
Since many researchers have observed the fruitfulness of the recent diffusion probabilistic model, its effectiveness in image generation is being actively studied. In this paper, our objective is to evaluate the potential of diffusion probabilistic models for 3D human motion-related tasks. To this end, this paper presents a study of employing diffusion probabilistic models to predict future 3D human motion(s) from previously observed motion. Based on the Human3.6M and HumanEva-I datasets, our results show that diffusion probabilistic models are competitive for both single (deterministic) and multiple (stochastic) 3D motion prediction tasks, after finishing a single training process. In addition, we find that diffusion probabilistic models can offer an attractive compromise, since they can strike the right balance between the likelihood and diversity of the predicted future motions. Our code is publicly available on the project website: https://sites.google.com/view/diffusion-motion-prediction.
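As background for the approach above, the sketch below shows the closed-form forward noising step of a DDPM-style diffusion model applied to a pose sequence; a network would then be trained to predict the injected noise. The linear beta schedule is a common default and an assumption here, not the paper's setting.

import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng=None):
    # x0: (T, D) clean pose sequence; t: timestep index;
    # alpha_bar: (num_steps,) cumulative products of (1 - beta).
    rng = rng or np.random.default_rng()
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # the denoising network learns to recover eps from (xt, t)

betas = np.linspace(1e-4, 2e-2, 1000)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)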
... Existing datasets for 3D HPE are recorded using frame-based cameras, and the large majority include RGB color channels recordings. The most commonly used datasets are: HumanEva [133,134], Human3.6M [135], MPI-INF-3DHP [136] and MADS [137]. ...
... The datasets are recorded in a lab environment because they need a marker-based motion tracking system to collect ground truth joint positions. HumanEva is the first 3D pose estimation dataset of substantial size: the first version, HumanEva-I [133], is from 2006 and an extended version, HumanEva-II [134], is from 2010. It contains video sequences recorded using multiple RGB and grayscale cameras. ...
Article
Traditional visual sensor technology is firmly rooted in the concept of sequences of image frames. The sequence of stroboscopic images in these "frame cameras" is very different compared to the information running from the retina to the visual cortex. While conventional cameras have improved in the direction of smaller pixels and higher frame rates, the basics of image acquisition have remained the same. Event-based vision sensors were originally known as "silicon retinas" but are now widely called "event cameras." They are a new type of vision sensor that takes inspiration from the mechanisms developed by nature for the mammalian retina and suggests a different way of perceiving the world. As in the neural system, the sensed information is encoded in a train of spikes, or so-called events, comparable to the action potentials generated in the nerve. Event-based sensors produce sparse and asynchronous output that represents informative changes in the scene. These sensors have advantages in terms of fast response, low latency, high dynamic range, and sparse output. All these characteristics are appealing for computer vision and robotic applications, increasing the interest in this kind of sensor. However, since the sensor's output is very different, algorithms applied to frames need to be rethought and re-adapted. This thesis focuses on several applications of event cameras in scientific scenarios. It aims to identify where they can make the difference compared to frame cameras. The presented applications use the Dynamic Vision Sensor (the event camera developed by the Sensors Group of the Institute of Neuroinformatics, University of Zurich and ETH). To explore some applications in more extreme situations, the first chapters of the thesis focus on the characterization of several advanced versions of the standard DVS. The low-light condition represents a challenging situation for every vision sensor. Taking inspiration from standard Complementary Metal Oxide Semiconductor (CMOS) technology, the DVS pixel performance in a low-light scenario can be improved, increasing sensitivity and quantum efficiency, by using back-side illumination. This thesis characterizes the so-called Back Side Illumination DAVIS (BSI DAVIS) camera and shows results from its application in calcium imaging of neural activity. The BSI DAVIS has shown better performance in low-light scenes due to its high Quantum Efficiency (QE) of 93% and proved to be the best type of technology for microscopy applications. The BSI DAVIS allows detecting fast dynamic changes in neural fluorescence imaging using the green fluorescent calcium indicator GCaMP6f. Event camera advances have pushed the exploration of event-based cameras in computer vision tasks. Chapters of this thesis focus on two of the most active research areas in computer vision: human pose estimation and hand gesture classification. Both chapters report the datasets collected to achieve the task, fulfilling the continuous need for data for this kind of new technology. The Dynamic Vision Sensor Human Pose dataset (DHP19) is an extensive collection of 33 whole-body human actions from 17 subjects. The chapter presents the first benchmark neural network model for 3D pose estimation using DHP19. The network achieves a mean error of less than 8 mm in 3D space, which is comparable with frame-based Human Pose Estimation (HPE) methods.
The gesture classification chapter reports an application running on a mobile device and explores future developments in the direction of embedded portable low-power devices for online processing. The sparse output from the sensor suggests using a small model with a reduced number of parameters and low power consumption. The thesis also describes pilot results from two other scientific imaging applications, for raindrop size measurement and laser speckle analysis, presented in the appendices.
... Due to the challenges of obtaining accurate 3D ground truths, most prior works used one or two of the few publicly available databases for 2D-image-based 3D pose estimation, such as: Human3.6M [19,20], HumanEva-I and HumanEva-II [21,22], Panoptic Studio [10,11], MPI-INF-3DHP [23], or JTA [24]. All these databases were either recorded in a laboratory (a few sequences of MPI-INF-3DHP were recorded outdoors, but the diversity is still very limited) or synthesized, and do not cover the full diversity of possible poses, people's appearances, camera characteristics, illuminations, backgrounds, occlusions, etc. ...
... In 2006 and 2010, Sigal et al. [21,22] published the HumanEva-I and HumanEva-II datasets to facilitate the quantitative evaluation and comparison of 3D human pose estimation algorithms. We used HumanEva-I, which is larger and more diverse than HumanEva-II. ...
Article
Full-text available
Vision-based 3D human pose estimation approaches are typically evaluated on datasets that are limited in diversity regarding many factors, e.g., subjects, poses, cameras, and lighting. However, for real-life applications, it would be desirable to create systems that work under arbitrary conditions ("in-the-wild"). To advance towards this goal, we investigated the commonly used datasets HumanEva-I, Human3.6M, and Panoptic Studio, discussed their biases (that is, their limitations in diversity), and illustrated them in cross-database experiments (for which we used a surrogate for roughly estimating in-the-wild performance). For this purpose, we first harmonized the differing skeleton joint definitions of the datasets, reducing the biases and systematic test errors in cross-database experiments. We further proposed a scale normalization method that significantly improved generalization across camera viewpoints, subjects, and datasets. In additional experiments, we investigated the effect of using more or fewer cameras, training with multiple datasets, applying a proposed anatomy-based pose validation step, and using OpenPose as the basis for the 3D pose estimation. The experimental results showed the usefulness of the joint harmonization, of the scale normalization, and of augmenting virtual cameras to significantly improve cross-database and in-database generalization. At the same time, the experiments showed that there were dataset biases that could not be compensated for and that call for new datasets covering more diversity. We discussed our results and promising directions for future work.
... While research on articulated human motion and pose estimation has progressed rapidly in recent years, there has been no systematic quantitative evaluation of competing methods to establish the current state of the art. Current algorithms make many different choices about how to model the human body, how to exploit image evidence, and how to approach the inference problem [29]. Motion capture systems are commercially available, and the two main types of system are optical and magnetic. At present, neither has a clear advantage over the other, but magnetic systems are significantly cheaper [25]. Physics-based animation: in these methods, the animator provides physical data and the motion is obtained by solving dynamic equations. The motion is controlled globally. ...
... The state of the art in animation models has been complex to define, and different methods have been used to keep it up to date [29]. Computer-made animation models date back to when computers were just beginning to display images, and from there the field has only advanced rapidly to what it is today. The attempt to model the real world on the computer has brought new difficulties, as well as new ways of solving problems, and it is expected to keep advancing. ...
Article
Full-text available
Animation models for computer graphics (UCV)
... That is, if the MTGP is worse than TGP, the MTGP is insignificant and is not selected to predict the arrival rates. Because the MTGP is based on TGP and TGP was introduced by Bo and Sminchisescu [67], the HumanEva-I dataset [69] used by Bo and Sminchisescu [67] is utilized to test the MTGP. In the MTGP algorithm, the kernel is defined as in equation (16) and α = 0.9. ...
... Data Availability: The datasets of the arrival rates, departure rates, and weight factors are stochastically generated in the manuscript. The datasets of the first experiment are obtained from [69]. ...
Article
Full-text available
In this paper, we present a combined programming approach to optimize the traffic signal control problem. The objective of the model is to minimize the total queue length, with weight factors, at the end of each phase. Then, a modified Twin Gaussian Process (MTGP) is employed to predict the arrival rates for the traffic signal control problem. To achieve automatic control of the traffic signal, an intelligent control method for the traffic signal is proposed based on the combining method; that is to say, the combining method of MTGP and LP, called MTGPLP, is embraced in the intelligent control system. Furthermore, some numerical experiments are conducted to test the validity of the model and the MTGPLP approach. In particular, the results of the numerical experiments show that the model is effective with different arrival rates, departure rates, and weight factors, and that the combining method is successful.
... The sparse representation (SR) model [15] is an effective model-based method to combine 2D annotations with 3D priors learned from existing MoCap datasets [16,17,18]. In the SR model, each 3D pose is represented as a linear combination of a few pre-defined 3D bases. ...
... The proposed approach is evaluated on four publicly available datasets: Human3.6M [18], HumanEva-I [17], CMU MoCap [16], and MPII Human Pose [50]. Specifically, the last one is used for the qualitative evaluation, and the others for the quantitative evaluations. ...
Preprint
In this paper, we aim to recover the 3D human pose from the 2D body joints of a single image. The major challenge in this task is the depth ambiguity, since different 3D poses may produce similar 2D poses. Although many recent advances in this problem come from both unsupervised and supervised learning approaches, the performance of most of these approaches is greatly affected by the insufficient diversity and richness of training data. To alleviate this issue, we propose an unsupervised learning approach, which is capable of estimating various complex poses well under limited available training data. Specifically, we propose a Shape Decomposition Model (SDM) in which a 3D pose is considered as the superposition of two parts: a global structure together with some deformations. Based on SDM, we estimate these two parts explicitly by solving two sets of differently distributed combination coefficients of geometric priors. In addition, to obtain geometric priors, a joint dictionary learning algorithm is proposed to extract both coarse and fine pose clues simultaneously from limited training data. Quantitative evaluations on several widely used datasets demonstrate that our approach yields better performance than other competitive approaches. Especially on some categories with more complex deformations, significant improvements are achieved by our approach. Furthermore, qualitative experiments conducted on in-the-wild images also show the effectiveness of the proposed approach.
... Most recent works obtain it from sample sequences of real human motions. Several kinds of actions, such as walking and running, are recorded in motion datasets that are widely used in the Computer Vision [6] and Graphics [7], [5] communities. The motion model of each action can be used for pose tracking in that action. ...
... Motion priors can be modeled and applied to pose tracking as described in Section 2.1. Training data for motion priors can be obtained from various motion datasets [6], [7], [5]. Different kinds of actions (e.g. ...
Preprint
Full-text available
Human pose estimation in images and videos is one of the key technologies for realizing a variety of human activity recognition tasks (e.g., human-computer interaction, gesture recognition, surveillance, and video summarization). This paper presents two types of human pose estimation methodologies: 1) 3D human pose tracking using motion priors and 2) 2D human pose estimation with ensemble modeling.
... We evaluate our approach on two benchmark datasets: the HumanEva dataset [51] and the H3.6M dataset [52]. The HumanEva dataset contains multi-view videos, 3D poses and the camera parameters. ...
... We assume the 2D poses x are known and recover the 3D poses y from x. We use the mean 3D joint error [51] as the evaluation metric, which is the average error over all joints. The results are reported in millimetres (mm). ...
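A minimal sketch of this metric, assuming the predicted and ground-truth joints are already expressed in a common coordinate frame:

import numpy as np

def mean_joint_error(pred, gt):
    # pred, gt: (J, 3) joint positions in millimetres.
    return np.linalg.norm(pred - gt, axis=1).mean()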
Article
We propose a method for estimating 3D human poses from single images or video sequences. The task is challenging because: (a) many 3D poses can have similar 2D pose projections which makes the lifting ambiguous, and (b) current 2D joint detectors are not accurate which can cause big errors in 3D estimates. We represent 3D poses by a sparse combination of bases which encode structural pose priors to reduce the lifting ambiguity. This prior is strengthened by adding limb length constraints. We estimate the 3D pose by minimizing an $L_1$ norm measurement error between the 2D pose and the 3D pose because it is less sensitive to inaccurate 2D poses. We modify our algorithm to output $K$ 3D pose candidates for an image, and for videos, we impose a temporal smoothness constraint to select the best sequence of 3D poses from the candidates. We demonstrate good results on 3D pose estimation from static images and improved performance by selecting the best 3D pose from the $K$ proposals. Our results on video sequences also show improvements (over static images) of roughly 15%.
... In this work, we use a prior on separate and sparse motion to constrain frame-to-frame 3D reconstruction (see figure 1). Instead of using a dictionary of shapes, some authors used articulated skeletons as in [50], [20], [31], probabilistic graphical models as in [47], [3], or explicit regression as in [16], [1]. Most of these methods are based on $\ell_2$-norm minimization, which is strongly affected by noise in the data and in the skeleton modeling. ...
Conference Paper
Full-text available
This work introduces a method to robustly reconstruct 3D human motion from the motion of 2D skeletal landmarks. We propose to use a lasso (least absolute shrinkage and selection operator) optimization framework where the $\ell_1$-norm is computed over the vector of differential angular kinematics and the $\ell_2$-norm is computed over the differential 2D reprojection error. The $\ell_1$-norm term allows us to model sparse kinematic angular motion. The minimization of the reprojection error allows us to assume bounded noise in both the kinematic model and the 2D landmark detection. This bound is controlled by a scale factor associated with the $\ell_2$-norm data term. An a posteriori verification condition is provided to check whether or not the lasso formulation has allowed us to recover the ground-truth 3D human motion. Results on publicly available data demonstrate the effectiveness of the proposed approach relative to state-of-the-art methods. They show that both the sparsity and bounded-noise assumptions encoded in the lasso formulation are robust priors for safely recovering 3D human motion.
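A rough sketch of this lasso formulation, written with scikit-learn's generic Lasso solver; the reprojection Jacobian Jproj, the displacement vector dx, and the weight alpha are assumed inputs, and this is not the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def recover_angular_motion(Jproj, dx, alpha=0.01):
    # Jproj: (2*L, A) Jacobian of the 2D landmarks w.r.t. the joint angles;
    # dx: (2*L,) frame-to-frame 2D landmark displacements.
    # Solves min 0.5/n * ||Jproj @ dtheta - dx||_2^2 + alpha * ||dtheta||_1.
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    model.fit(Jproj, dx)
    return model.coef_  # sparse angular articulation motion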
... Human Models. Most prior work with virtual human agents utilizes body skeletons to model 3D humans [21,53]. However, the surface of the body is important for rendering the virtual agent or modeling interactions with the scene or any objects. ...
Preprint
Full-text available
We present PACE, a novel method for modifying motion-captured virtual agents to interact with and move throughout dense, cluttered 3D scenes. Our approach changes a given motion sequence of a virtual agent as needed to adjust to the obstacles and objects in the environment. We first take the individual frames of the motion sequence most important for modeling interactions with the scene and pair them with the relevant scene geometry, obstacles, and semantics such that interactions in the agent's motion match the affordances of the scene (e.g., standing on a floor or sitting in a chair). We then optimize the motion of the human by directly altering the high-DOF pose at each frame in the motion to better account for the unique geometric constraints of the scene. Our formulation uses novel loss functions that maintain a realistic flow and natural-looking motion. We compare our method with prior motion generating techniques and highlight the benefits of our method with a perceptual study and physical plausibility metrics. Human raters preferred our method over the prior approaches. Specifically, they preferred our method 57.1% of the time versus the state-of-the-art method using existing motions, and 81.0% of the time versus a state-of-the-art motion synthesis method. Additionally, our method performs significantly higher on established physical plausibility and interaction metrics. Specifically, we outperform competing methods by over 1.2% in terms of the non-collision metric and by over 18% in terms of the contact metric. We have integrated our interactive system with Microsoft HoloLens and demonstrate its benefits in real-world indoor scenes. Our project website is available at https://gamma.umd.edu/pace/.
... Turning to multivariate time series (several features evolving over time), we find other methods for handling time series. For example, the motion of a body over time [50] can be characterized by a vector of $\mathbb{R}^3$ evolving over time. In such cases, it has been shown in the literature [51,52] that manifold learning provides a better representation of the data compared to the Euclidean distance. Manifold learning: the idea here is that the data appear at first sight to lie in an ambient space A, but in reality they lie on a manifold of reduced dimension [53]. ...
Thesis
Every year during the winter period, hospitals are heavily impacted by the arrival of winter viruses. These winter viruses, influenza and RSV, are difficult to anticipate. Indeed, these epidemic phenomena are not perfectly periodic and mainly affect patients' length of stay rather than the number of arrivals. It is therefore not possible to anticipate these epidemics by directly analyzing the number of patients arriving at the emergency department per day. A posteriori, to obtain a picture of the epidemic, PCR tests are performed on the hospital's patients. Moreover, a patient arriving at the emergency department is immediately classified according to their symptoms. We therefore propose to combine the positive PCR tests and the number of arrivals per symptom via time-series clustering. This highlights the symptoms linked to the viruses. Thus, to analyze the probable arrival of an epidemic, one can use the number of arrivals for the symptoms that mark the viruses rather than the total number of arrivals at the emergency department. To perform this clustering, we propose an innovative method based on a geometric representation of time series. In particular, we highlight the effectiveness of using Riemannian geometry applied to the Grassmann manifold (via a representation on the Stiefel manifold) to analyze time series.
... These datasets, alongside methods trained and evaluated on them [41], offer enough diverse data to train and test human motion prediction models focused on answering the question "Where is a human going to be during the next N steps?", but they are not adequately labeled with the context that would help to answer "What is (the goal of) the observed human motion?". On the other hand, datasets tailored for models focused on the second question, like CMU's motion capture database [42], HumanEva [43] and G3D [44], excel in action diversity, but they are focused on distinguishing between different actions (jumping, catching, throwing), do not incorporate complicated motion patterns, and usually are not long enough for a long- or mid-term human motion prediction problem. The MoGaze [34] dataset positions itself as an excellent blend of the aforementioned datasets because all the recorded motions have a labeled purpose (an object picking). ...
Article
As robots are progressing towards being ubiquitous and an indispensable part of our everyday environments, such as home, offices, healthcare, education, and manufacturing shop floors, efficient and safe collaboration and cohabitation become imperative. Given that, such environments could benefit greatly from accurate human action prediction. In addition to being accurate, human action prediction should be computationally efficient, in order to ensure a timely reaction, and capable of dealing with changing environments, since unstructured interaction and collaboration with humans usually do not assume static conditions. In this paper, we propose a model for human action prediction based on motion cues and gaze using shared-weight Long Short-Term Memory networks (LSTMs) and feature dimensionality reduction. LSTMs have proven to be a powerful tool in processing time series data, especially when dealing with long-term dependencies; however, to maximize their performance, LSTM networks should be fed with informative and quality inputs. Given that, in this paper, we furthermore conducted an extensive input feature analysis based on (i) signal correlation and their strength to act as stand-alone predictors, and (ii) a multilayer perceptron inspired by the autoencoder architecture. We validated the proposed model on a publicly available MoGaze¹ dataset for human action prediction, as well as on a smaller dataset recorded in our laboratory. Our model outperformed alternatives, such as recurrent neural networks, a fully connected LSTM network, and the strongest stand-alone signals (baselines), and can run in real-time on a standard laptop CPU. Since eye gaze might not always be available in a real-world scenario, we have implemented and tested a multi-layer perceptron for gaze estimation from more easily obtainable motion cues, such as head orientation and hand position. The estimated gaze signal can be utilized during inference of our LSTM-based model, thus making our action prediction pipeline suitable for real-time practical applications.
... Dirac is an open-source video codec developed by the British Broadcasting Corporation (BBC); its name is taken from the British scientist Paul Dirac. Dirac aims to provide high-quality video compression at any size, and it uses wavelet compression rather than discrete cosine transforms [7][8][9]. ...
... For HumanEva, we train with a sequence length of 75, where the inference of w only takes 15 frames with a random start point. The weights of the different loss terms $(\lambda_r, \lambda_{mm}, \lambda_{d,l}, \lambda_{d,u}, \lambda_l, \lambda_a, \lambda_{nf}, \lambda_z, \lambda_w)$ and the normalizing factors $(\alpha_l, \alpha_u)$ are set to (10, 5, 0.1, 0.2, 100, 1, 0.001, 0.5, 0.1) and (15, 50). For H3.6M, we train with a sequence length of 125, where w is inferred from 25 frames. ...
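For readability, the hyper-parameters quoted above can be arranged as a configuration mapping (Python; the key names follow the symbols in the excerpt and are purely illustrative):

humaneva_cfg = {
    "seq_len": 75, "infer_frames": 15,
    "loss_weights": {"r": 10, "mm": 5, "d_l": 0.1, "d_u": 0.2, "l": 100,
                     "a": 1, "nf": 0.001, "z": 0.5, "w": 0.1},
    "norm_factors": {"alpha_l": 15, "alpha_u": 50},
}
h36m_cfg = {"seq_len": 125, "infer_frames": 25}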
Preprint
Full-text available
Studies on the automatic processing of 3D human pose data have flourished in the recent past. In this paper, we are interested in the generation of plausible and diverse future human poses following an observed 3D pose sequence. Current methods address this problem by injecting random variables from a single latent space into a deterministic motion prediction framework, which precludes the inherent multi-modality in human motion generation. In addition, to our knowledge, previous works rarely explore the use of attention to select which frames are used to inform the generation process. To overcome these limitations, we propose the Hierarchical Transformer Dynamical Variational Autoencoder, HiT-DVAE, which implements auto-regressive generation with transformer-like attention mechanisms. HiT-DVAE simultaneously learns the evolution of data and latent space distribution with time-correlated probabilistic dependencies, thus enabling the generative model to learn a more complex and time-varying latent space as well as diverse and realistic human motions. Furthermore, the auto-regressive generation brings more flexibility in observation and prediction, i.e., one can have any length of observation and predict arbitrarily long sequences of poses with a single pre-trained model. We evaluate the proposed method on HumanEva-I and Human3.6M with various evaluation methods, and outperform the state-of-the-art methods on most of the metrics.
... These algorithms can mine the information from the silhouette images, but their performance will differ. The tested database is the HumanEva dataset [36]. The dataset has multi-view (three-view) image sequences of several people, and the ground truth of the 3D human motion model corresponds to each frame of the multi-view images. ...
Article
Full-text available
It is a challenging task to estimate 3D human motion from an image sequence. There are several problems, such as unsatisfactory estimation error, ambiguous matching and transient occlusion. Even though prior information learned from large-scale samples exists, these problems are still difficult to solve. How to extract the features of the high-dimensional (HD) samples of 3D human motion and find the desired one is the key to solving these problems. Some dimension reduction methods can extract the sample features and build a low-dimensional (LD) space in which to view their LD features, but how to search for the relevant, valid and desired LD samples, which can be used to reconstruct the 3D human motions denoted by the corresponding HD samples, remains the bottleneck problem. Thus, a new method called dynamic manifold Boltzmann optimization (DMBO) is proposed to estimate 3D human motion from multi-view images. DMBO can find the best-matching 3D human motion model with the help of self-supervised learning from a Gaussian incremental dimension reduction model (GIDRM). DMBO can avoid local optima during searching and solve the problems above, so that the generation of the accurate 3D human motion corresponding to multi-view images can be achieved.
... We use AMASS to train and evaluate the motion infiller and trajectory predictor. Specifically, we use the Transitions, SSM, and HumanEva [85] subsets for testing and all other subsets for training. 3DPW [94] is an in-the-wild human motion dataset that consists of 60 videos recorded with dynamic cameras in diverse environments. ...
Preprint
We present an approach for 3D global human mesh recovery from monocular videos recorded with dynamic cameras. Our approach is robust to severe and long-term occlusions and tracks human bodies even when they go outside the camera's field of view. To achieve this, we first propose a deep generative motion infiller, which autoregressively infills the body motions of occluded humans based on visible motions. Additionally, in contrast to prior work, our approach reconstructs human meshes in consistent global coordinates even with dynamic cameras. Since the joint reconstruction of human motions and camera poses is underconstrained, we propose a global trajectory predictor that generates global human trajectories based on local body movements. Using the predicted trajectories as anchors, we present a global optimization framework that refines the predicted trajectories and optimizes the camera poses to match the video evidence such as 2D keypoints. Experiments on challenging indoor and in-the-wild datasets with dynamic cameras demonstrate that the proposed approach outperforms prior methods significantly in terms of motion infilling and global mesh recovery.
... The results of IDRPPO and IDRNPPO can be seen in Figure 6, respectively. The error calculation is described in [24]. From Figure 6, the errors of the human running motion and the missing walking motion from IDRPPO are lower than those from IDRNPPO on the whole. ...
Article
Full-text available
Three-dimensional (3D) human motion capture is a hot research topic at present. As networks become more advanced, 3D human motion is indispensable in multimedia works such as images, videos, and games. 3D human motion plays an important role in the publication and expression of all kinds of media. How to capture 3D human motion is a key technology of multimedia production. Therefore, a new algorithm called incremental dimension reduction and projection position optimization (IDRPPO) is proposed in this paper. This algorithm can help to learn sparse 3D human motion samples and generate new ones. Thus, it can provide a technique for making 3D character animation. By taking advantage of the Gaussian incremental dimension reduction model (GIDRM) and projection position optimization, the proposed algorithm can learn the existing samples and establish the relevant mapping between the low-dimensional (LD) data and the high-dimensional (HD) data. Finally, the missing frames of an input 3D human motion and other types of 3D human motion can be generated by the IDRPPO.
... These three independent descriptions are taken to be three views, since the independent descriptions come from different feature spaces. The HumanEva 3D motion dataset [50] has five types of actions, namely boxing, gesturing, jogging, walking and throw-catch, performed by different subjects. We use the dataset sampled by Ma et al. [51]. ...
Article
As social media, virtual communities and networks rapidly grow, multi-view data become more popular. In general, multi-view data contain different feature components in different views. Although these data are extracted in different ways (views) from diverse settings and domains, they describe the same samples, which makes them highly related. Hence, applying (single-view) clustering methods to multi-view data poses difficulty in achieving desirable clustering results. Thus, multi-view clustering methods should be developed that utilize the available multi-view information. Most multi-view clustering techniques currently use k-means, due to its conceptual simplicity, or fuzzy c-means (FCM), in which datapoints can belong to more than one cluster with membership degrees from 0 to 1. However, the use of k-means or FCM may degrade performance due to the presence of noise and outliers, especially on large or high-dimensional datasets. The constraint imposed on the membership degrees of k-means and FCM tends to assign a correspondingly high membership value to an outlier or a noisy data point. To address these drawbacks, possibilistic c-means (PCM) relaxes the membership constraint of k-means and FCM so that outliers and noisy datapoints can be properly identified. On the other hand, there are various extensions of k-means and FCM for multi-view data, but no extension of PCM for multi-view data has been made in the literature. Thus, we use PCM in our proposed multi-view clustering model. In this paper, we propose novel weighted multi-view PCM algorithms designed for clustering multi-view data, with view and feature weights on PCM approaches, called W-MV-PCM and W-MV-PCM with L2 regularization (W-MV-PCM-L2). In multi-view clustering, different views may vary in importance and each view may contain some irrelevant features. In the proposed algorithms, a learning schema is constructed to compute the view weights and the feature weights within each view. This schema is able to identify the importance of each view and, at the same time, to identify and select relevant features in each view. Comparisons of W-MV-PCM-L2 with existing multi-view clustering algorithms are made on both synthetic and real datasets. The experimental results are evaluated using accuracy rate (AR) and external validity indexes, such as the Rand index (RI) and normalized mutual information (NMI). The comparisons of the proposed W-MV-PCM-L2 algorithm with existing algorithms under the AR, RI and NMI criteria show that it is a feasible and effective multi-view clustering algorithm.
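As a concrete reference for the PCM behaviour described above (memberships are not forced to sum to one across clusters, so outliers receive uniformly low typicalities), here is a minimal single-view PCM sketch in Python; the scale parameters eta and the fuzzifier m are user-chosen assumptions, and the weighted multi-view extension proposed in the paper is not reproduced here.

import numpy as np

def pcm(X, centers, eta, m=2.0, n_iter=50):
    # X: (N, D) data; centers: (C, D) initial prototypes; eta: (C,) scales.
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, C)
        u = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))  # typicalities
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]       # weighted means
    return u, centers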
... The HumanEva dataset [87] was built to aid in the development of visual human pose estimation and tracking solutions. This dataset's sensory (video) and GT (motion) data were captured simultaneously using multiple high-speed video capture systems and a calibrated marker-based motion capture system, respectively. ...
Article
Full-text available
The ever-growing capabilities of computers have enabled pursuing Computer Vision through Machine Learning (i.e., MLCV). ML tools require large amounts of information to learn from (ML datasets). These are costly to produce but have received reduced attention regarding standardization. This prevents the cooperative production and exploitation of these resources, impedes countless synergies, and hinders ML research. No global view exists of the MLCV dataset tissue. Acquiring it is fundamental to enable standardization. We provide an extensive survey of the evolution and current state of MLCV datasets (1994 to 2019) for a set of specific CV areas as well as a quantitative and qualitative analysis of the results. Data were gathered from online scientific databases (e.g., Google Scholar, CiteSeerX). We reveal the heterogeneous plethora that comprises the MLCV dataset tissue; their continuous growth in volume and complexity; the specificities of the evolution of their media and metadata components regarding a range of aspects; and that MLCV progress requires the construction of a global standardized (structuring, manipulating, and sharing) MLCV “library”. Accordingly, we formulate a novel interpretation of this dataset collective as a global tissue of synthetic cognitive visual memories and define the immediately necessary steps to advance its standardization and integration.
... Better results are obtained on the Human3.6M [13] and HumanEva-I [14] datasets. ...
Chapter
Full-text available
Research on Human Action Recognition (HAR) has made a lot of progress in recent years, and research based on RGB images is the most extensive. However, there are two main shortcomings: the recognition accuracy is insufficient, and the time consumption of the algorithm is too large. To address these issues, our project attempts to optimize a random-forest-based algorithm by extracting 3D features of the human body, trying to obtain more accurate human behavior recognition results at a lower time cost. In this study, we used the 3D spatial coordinate data of multiple Kinect sensors to overcome these problems and make full use of each data feature. Then, we use the data obtained from multiple Kinects to get more accurate recognition results through post-processing.
... The most commonly used datasets for 3D human pose estimation include Human3.6M [21], HumanEva [22], TotalCapture [23], MPI-INF-3DHP [24], MuPoTS-3D [9] and 3DPW [25]. Human3.6M is a large-scale dataset featuring a single person at the center of videos recorded by a commercialized MoCap system. ...
Preprint
3D human pose estimation (HPE) is crucial in many fields, such as human behavior analysis, augmented reality/virtual reality (AR/VR) applications, and self-driving industry. Videos that contain multiple potentially occluded people captured from freely moving monocular cameras are very common in real-world scenarios, while 3D HPE for such scenarios is quite challenging, partially because there is a lack of such data with accurate 3D ground truth labels in existing datasets. In this paper, we propose a temporal regression network with a gated convolution module to transform 2D joints to 3D and recover the missing occluded joints in the meantime. A simple yet effective localization approach is further conducted to transform the normalized pose to the global trajectory. To verify the effectiveness of our approach, we also collect a new moving camera multi-human (MMHuman) dataset that includes multiple people with heavy occlusion captured by moving cameras. The 3D ground truth joints are provided by accurate motion capture (MoCap) system. From the experiments on static-camera based Human3.6M data and our own collected moving-camera based data, we show that our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods, especially for the scenarios with heavy occlusions.
... The development direction of HPSD is clearly given by the above works, while AQA still remains to be developed, as the material in these datasets is still inadequate. It can be concluded that the functions of datasets have gradually expanded with the development of video techniques [17][9][27][43][32]. However, temporal segmentation related datasets have been lacking in recent years. ...
Preprint
Full-text available
Action recognition is an important and challenging problem in video analysis. Although the past decade has witnessed progress in action recognition with the development of deep learning, such progress has been slow in competitive sports content analysis. To promote the research on action recognition from competitive sports video clips, we introduce a Figure Skating Dataset (FSD-10) for fine-grained sports content analysis. To this end, we collect 1484 clips from the worldwide figure skating championships in 2017-2018, which consist of 10 different actions in men/ladies programs. Each clip is at a rate of 30 frames per second with resolution 1080 $\times$ 720. These clips are then annotated by experts in type, grade of execution, skater info, etc. To build a baseline for action recognition in figure skating, we evaluate state-of-the-art action recognition methods on FSD-10. Motivated by the idea that domain knowledge is of great concern in the sports field, we propose a keyframe-based temporal segment network (KTSN) for classification and achieve remarkable performance. Experimental results demonstrate that FSD-10 is an ideal dataset for benchmarking action recognition algorithms, as it requires accurately extracting action motions rather than action poses. We hope FSD-10, which is designed to have a large collection of fine-grained actions, can serve as a new challenge to develop more robust and advanced action recognition models.
... Large-scale datasets providing accurate 3D groundtruth use either MoCap data (e.g. HumanEva [27] or Human3.6M [10]) or a high number of cameras (such as the Panoptic Studio [11]). ...
Preprint
Full-text available
We present a novel data-driven regularizer for weakly-supervised learning of 3D human pose estimation that eliminates the drift problem that affects existing approaches. We do this by moving the stereo reconstruction problem into the loss of the network itself. This avoids the need to reconstruct 3D data prior to training and unlike previous semi-supervised approaches, avoids the need for a warm-up period of supervised training. The conceptual and implementational simplicity of our approach is fundamental to its appeal. Not only is it straightforward to augment many weakly-supervised approaches with our additional re-projection based loss, but it is obvious how it shapes reconstructions and prevents drift. As such we believe it will be a valuable tool for any researcher working in weakly-supervised 3D reconstruction. Evaluating on Panoptic, the largest multi-camera and markerless dataset available, we obtain an accuracy that is essentially indistinguishable from a strongly-supervised approach making full use of 3D groundtruth in training.
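The re-projection idea described above can be sketched in a few lines of numpy (an illustrative stand-in, not the authors' network loss): the predicted 3D pose is projected into each calibrated view and compared with that view's 2D detections.

import numpy as np

def reprojection_loss(pose3d, cams, joints2d):
    # pose3d: (J, 3) predicted joints in world coordinates;
    # cams: list of (3, 4) projection matrices; joints2d: (V, J, 2) detections.
    Xh = np.hstack([pose3d, np.ones((pose3d.shape[0], 1))])  # homogeneous
    loss = 0.0
    for P, x2d in zip(cams, joints2d):
        proj = (P @ Xh.T).T                # (J, 3) homogeneous image points
        proj = proj[:, :2] / proj[:, 2:3]  # perspective divide
        loss += np.linalg.norm(proj - x2d, axis=1).mean()
    return loss / len(cams)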
... However, if a sample of frame-wise 3D reconstructions is not exactly recovered from the set of possible poses, it will contaminate the smoothing process, and the proposed methods do not provide any guarantee of success. Instead of using a low-rank subspace of shapes, methods based on a known articulated skeleton have also been investigated [49], [22], [30], as well as probabilistic graphical models [44], [7] and explicit regression [18], [2]. Most of these methods are based on $\ell_2$-norm minimization and are sensitive to noise in the data and in the skeleton modeling. ...
Preprint
Full-text available
In this paper, we address the problem of the exact recovery condition in retrieving 3D human motion from 2D landmark motion. We use a skeletal kinematic model to represent the 3D human motion as a vector of angular articulation motion. We address this problem based on the observation that at a high tracking rate, regardless of the global rigid motion, only a few angular articulations have non-zero motion. We propose a first, ideal formulation with the $\ell_0$-norm to minimize the cardinality of non-zero angular articulation motion given an equality constraint on the time-differentiation of the reprojection error. The second, relaxed formulation relies on an $\ell_1$-norm to minimize the sum of absolute values of the angular articulation motion. This formulation has the advantage of being able to provide 3D motion even in the under-determined case, when twice the number of 2D landmarks is smaller than the number of angular articulations. We define a specific property, the Projective Kinematic Space Property (PKSP), that takes into account the reprojection constraint and the kinematic model. We prove that for the relaxed formulation we are able to recover the exact 3D human motion from 2D landmarks if and only if the PKSP property is verified. We further demonstrate that solving the relaxed formulation provides the same ground-truth solution as the ideal formulation if and only if the PKSP condition is fulfilled. Results with simulated sparse skeletal angular motion show the ability of the proposed method to recover the exact location of angular motion. We provide results on publicly available real data (HUMAN3.6M, PANOPTIC and MPI-I3DHP).
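The relaxed $\ell_1$ formulation with an equality constraint can be posed as a linear program via the standard u - v variable splitting. The sketch below (using scipy's linprog) is an illustrative solver under assumed inputs (a reprojection Jacobian Jproj and differential 2D motion dx), not the authors' implementation.

import numpy as np
from scipy.optimize import linprog

def l1_kinematic_recovery(Jproj, dx):
    # Solve min ||dtheta||_1  subject to  Jproj @ dtheta = dx,
    # by splitting dtheta = u - v with u, v >= 0 so the objective is linear.
    _, n = Jproj.shape
    c = np.ones(2 * n)                    # sum(u) + sum(v) = ||dtheta||_1
    A_eq = np.hstack([Jproj, -Jproj])     # Jproj @ (u - v) = dx
    res = linprog(c, A_eq=A_eq, b_eq=dx, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v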
... Existing human pose estimation datasets are either large scale but limited to studio conditions, where annotation can be automated using marker-less multiview solutions [57,36,21,23], simulated [8,77], or generic but small [5,51] because manual annotation [33] is cumbersome. Multi-person pose datasets are even more difficult to find. ...
Preprint
Learning general image representations has proven key to the success of many computer vision tasks. For example, many approaches to image understanding problems rely on deep networks that were initially trained on ImageNet, mostly because the learned features are a valuable starting point to learn from limited labeled data. However, when it comes to 3D motion capture of multiple people, these features are only of limited use. In this paper, we therefore propose an approach to learning features that are useful for this purpose. To this end, we introduce a self-supervised approach to learning what we call a neural scene decomposition (NSD) that can be exploited for 3D pose estimation. NSD comprises three layers of abstraction to represent human subjects: spatial layout in terms of bounding-boxes and relative depth; a 2D shape representation in terms of an instance segmentation mask; and subject-specific appearance and 3D pose information. By exploiting self-supervision coming from multiview data, our NSD model can be trained end-to-end without any 2D or 3D supervision. In contrast to previous approaches, it works for multiple persons and full-frame images. Because it encodes 3D geometry, NSD can then be effectively leveraged to train a 3D pose estimation network from small amounts of annotated data.
Chapter
Attention-only Transformers [34] have been applied to solve Natural Language Processing (NLP) tasks and Computer Vision (CV) tasks. One particular Transformer architecture developed for CV is the Vision Transformer (ViT) [15]. ViT models have been used to solve numerous tasks in the CV area. One interesting task is the pose estimation of a human subject. We present our modified ViT model, Un-TraPEs (UNsupervised TRAnsformer for Pose Estimation), that can reconstruct a subject's pose from its monocular image and estimated depth. We compare the results obtained with such a model against a ResNet [17] trained from scratch and a ViT fine-tuned to the task, and show promising results. Keywords: Computer vision, Image understanding, Pose estimation, Visual transformers, Artificial intelligence and applications
Article
Full-text available
Pose estimation using deep learning (DL) is expected to solve traditional problems faced by sports biomechanics, including limitations resulting from the application of reflective markers. For sports biomechanists to correctly utilize these pose estimation techniques, there is a need to elucidate the estimation and learning procedures used in pose estimation as well as to consider how to utilize them. Therefore, we aimed to review recently published major pose estimation models and to examine the availability of pose estimation in sports biomechanics. We observed that the main models were developed for the simultaneous estimation of multiple persons, but none of these were designed to rigorously estimate the joint center positions mainly required in sports biomechanics. Further, all training datasets for these models consisted of digitized positions that appeared to be the joint centers of people in "in-the-wild" videos; moreover, the annotators were non-professionals termed "crowd-workers". Therefore, regardless of the model quality, the dataset accuracy may be a bottleneck that impedes the estimation accuracy required in sports biomechanics. All the metrics used to verify accuracy involve verification of the average estimation results of multiple joint points across an entire frame or multiple frames. Therefore, even with a high overall estimation accuracy, the accuracy of the estimated positions of individual joints may be low. Taken together, it is difficult to utilize and calculate kinematic variables based on joint positions obtained through pose estimation. However, existing pose estimation may help sports biomechanists calculate the movement periodization and number. To expand the utility of pose estimation in sports biomechanics, sports biomechanists should be actively involved in the development of pose estimation models and datasets.
Article
We present VideoPoseVR, a video-based animation authoring workflow using online videos to author character animations in VR. It leverages the state-of-the-art deep learning approach to reconstruct 3D motions from online videos, caption the motions, and store them in a motion dataset. Creators can import the videos, search in the dataset, modify the motion timeline, and combine multiple motions from videos to author character animations in VR. We implemented a proof-of-concept prototype and conducted a user study to evaluate the feasibility of the video-based authoring approach as well as to gather initial feedback on the prototype. The study results suggest that VideoPoseVR was easy to learn for novice users to author animations and enables rapid exploration of prototyping for applications such as entertainment, skills training, and crowd simulations.
Article
Full-text available
Human segmentation and tracking (HS-T) in video often utilize person detection results. In addition, 3D human pose estimation (3D-HPE) and human activity recognition (HAR) often use human segmentation results to reduce data storage and computational time. With recent advances in deep learning, especially using Convolutional Neural Networks (CNNs), there are excellent results in these relevant tasks. Consequently, they can be applied to building many practical applications such as sports analysis, sports scoring, health protection, teaching, and preserving traditional martial arts. In this paper, we performed a survey of relevant studies, methods, datasets, and results for HS-T, 3D-HPE, and HAR. We also deeply analyze the results of detecting persons, as they affect the results of human segmentation and human tracking. The survey is performed in great detail, down to source code paths. The MADS (Martial Arts, Dancing, and Sports) dataset comprises fast and complex activities. It has been published for the task of estimating human pose. However, before determining the human pose, the person needs to be detected as a segment in the video; in particular, the 3D human pose annotation data differ from the point cloud data generated from RGB-D images. Therefore, we have also prepared 2D human pose annotation data on the 28k images for creating 3D human pose annotation and action labeling data. Moreover, we also evaluated the MADS dataset with many recently published deep learning methods for human segmentation (Mask R-CNN, PointRend, TridentNet, TensorMask, and CenterMask) and tracking, 3D-HPE (RepNet, MediaPipe Pose, and Lifting from the Deep, V2V-PoseNet), and HAR (ST-GCN, DD-net, and PA-GesGCN) in the video. All data and published results are available.
Article
Full-text available
Human pose estimation is a significant problem that has received attention in the computer vision community for recent decades. It is a vital step toward understanding people in videos and still images. In simple terms, a human pose estimation model takes in an image or video and estimates the position of a person's skeletal joints in either 2D or 3D space. Several studies on human pose estimation can be found in the literature; however, each centers on a specific class of approach, for instance, model-based methodologies or human motion analysis. More recently, various Deep Learning (DL) algorithms have emerged to overcome the difficulties of earlier approaches. In this study, an exhaustive review of human pose estimation (HPE), including milestone work and recent advancements, is carried out. This survey discusses the different two-dimensional (2D) and three-dimensional (3D) human pose estimation techniques, along with the classical and deep learning approaches that provide solutions to various computer vision problems. Moreover, the paper considers the different deep learning models used in pose estimation and analyzes 2D and 3D datasets. Some of the evaluation metrics used for estimating human poses are also discussed. By recovering the orientation of individuals, HPE opens the road to several real-life applications, some of which are discussed in this study.
Article
Estimating 3D human poses from a single image is an important task in computer graphics. Most model-based estimation methods represent the labeled/detected 2D poses and the projection of approximated 3D poses using vector representations of body joints. However, such lower-dimensional vector representations fail to maintain the spatial relations of original body joints, because the representations do not consider the inherent structure of body joints. In this paper, we propose JSL3d, a novel joint subspace learning approach with implicit structure supervision based on the Sparse Representation (SR) model, capturing the latent spatial relations of 2D body joints with an end-to-end autoencoder network. JSL3d jointly combines the learned latent spatial relations and 2D joints as inputs for the standard SR inference frame. The optimization is simultaneously processed via geometric priors in both latent and original feature spaces. We have evaluated JSL3d using four large-scale and well-recognized benchmarks, including Human3.6M, HumanEva-I, CMU MoCap and MPII. The experiment results demonstrate the effectiveness of JSL3d.
Article
Parsing humans into semantic parts is crucial to human-centric analysis. In this paper, we propose a human parsing pipeline that uses pose cues, e.g., estimates of human joint locations, to provide pose-guided segment proposals for semantic parts. These segment proposals are ranked using standard appearance cues, deep-learned semantic features, and a novel pose feature called pose-context. The proposals are then selected and assembled using an And-Or graph to output a parse of the person. The And-Or graph is able to handle large variability in human appearance due to pose, choice of clothing, etc. We evaluate our approach on the popular Penn-Fudan pedestrian parsing dataset, showing that it significantly outperforms the state of the art, and perform diagnostics to demonstrate the effectiveness of the different stages of our pipeline.
Article
Recent advances in 3D reconstruction technology allow people to capture and share their experiences in 3D. However, little is known about people's sharing preferences and privacy concerns for these reconstructed experiences. To fill this gap, we first present ReliveReality, an experience-sharing method that utilizes deep learning-based computer vision techniques to reconstruct clothed humans and 3D environments and to estimate 3D pose with only an RGB camera. ReliveReality can be integrated into social virtual environments, allowing others to socially relive a shared experience by moving around it from different perspectives, on desktop or in VR. We conducted a 44-participant within-subject study comparing ReliveReality to viewing recorded videos and to a ReliveReality version with blurring obfuscation. Our results shed light on how people identify with reconstructed avatars, how obfuscation affects reliving experiences, and sharing preferences and privacy concerns for reconstructed experiences. We propose design implications for addressing these issues.
Article
Full-text available
Visual surveillance has exponentially increased the growth of security devices and systems in the digital era. Gait-based person identification is an emerging biometric modality for automatic visual surveillance and monitoring, as walking patterns correlate highly with a subject's identity. Scientific research on person identification using gait has grown dramatically over the past two decades due to its several benefits: it does not require active collaboration from users and can be performed without their cooperation, it is difficult to impersonate, and identification can be validated from low-resolution videos with simple instrumentation. This paper presents a comprehensive overview of the existing techniques, their key stages, and recent developments in vision-based person identification using gait. We review the historical research on gait locomotion and explain how it is used to recognize identity. The article summarizes the different types of features that have been proposed to encode the biomechanics of gait and groups them into categories and subcategories based on similarity in their implementation. We also present the impact of different covariate factors that affect the performance of gait recognition systems and discuss recent work coping with these challenges. Furthermore, a comparison of the recognition accuracies reported by existing algorithms, assessing their performance under verification and identification modes, is presented, along with a detailed summary of publicly available vision-based gait databases. Finally, the paper offers insight into the challenges and open problems of gait recognition that can help set directions for future research in this field.
Article
Full-text available
We present a probabilistic framework to generate character animations based on weak control signals, such that the synthesized motions are realistic while retaining the stochastic nature of human movement. The proposed architecture, which is designed as a hierarchical recurrent model, maps each sub‐sequence of motions into a stochastic latent code using a variational autoencoder extended over the temporal domain. We also propose an objective function which respects the impact of each joint on the pose and compares the joint angles based on angular distance. We use two novel quantitative protocols and human qualitative assessment to demonstrate the ability of our model to generate convincing and diverse periodic and non‐periodic motion sequences without the need for strong control signals.
Preprint
Full-text available
The rise of non-linear and interactive media such as video games has increased the need for automatic movement animation generation. In this survey, we review and analyze different aspects of building automatic movement generation systems using machine learning techniques and motion capture data. We cover topics such as high-level movement characterization, training data, features representation, machine learning models, and evaluation methods. We conclude by presenting a discussion of the reviewed literature and outlining the research gaps and remaining challenges for future work.
Chapter
Human action recognition is an important field in computer vision. Skeleton-based models of humans have attracted more attention in related research because of their strong robustness to external interference factors. In traditional research, features are usually hand-crafted, so effective features are difficult to extract from skeletons. In this paper, a unique method called Pixel Convolutional Networks is proposed for human action recognition, which extracts skeleton features in a natural and intuitive way along two dimensions, space and time. It achieves good performance compared with mainstream methods of the past few years on the large NTU-RGB+D dataset.
Article
This paper presents a deep and extensive performance analysis of the particle filter (PF) algorithm for a very compute-intensive 3D multi-view visual tracking problem. We compare different implementations and parameter settings of the PF algorithm on a CPU platform, taking advantage of the multithreading capabilities of modern processors, and on a graphics processing unit (GPU) platform using the NVIDIA CUDA computing environment as the development framework. We extend our experimental study to each individual stage of the PF algorithm and evaluate the quality-versus-performance trade-off among different ways to design these stages. We have observed that the GPU platform performs better than the multithreaded CPU platform when handling a large number of particles, but we also demonstrate that hybrid CPU/GPU implementations can run almost as fast as GPU-only solutions.
Conference Paper
Full-text available
In this vision paper, we present an approach that makes it possible to protect developed ideas and early concepts even during their systematical development. We take the Design Thinking process as an example, in which interfaces are used for individual stages (understand, observe, define, ideate, prototype, test) to digitally record verbal, written or sketched, and even modeled or constructed outcome. This outcome is recorded and linked to the originating person. To guarantee both proof-of-existence and proof-of-origin, a unique hash is generated from each digital artifact stored and embedded into the Bitcoin Blockchain by the OriginStamp decentralized trusted timestamping service. Once this unique fingerprint is embedded in a transaction in the underlying Blockchain network, it can be proven where particular contributions originated due to the characteristics of Blockchain architecture. By setting up a decentralized tamper-proof means of record keeping, the entire innovation chain from the first ideation to the beginning of production is verifiably stored. By providing a clear proof-of-origin, all innovators (even competitors) could continue to work on existing problem-solving process and add their contribution proportionately, depending on the state of innovation development. This concept enables an Open Innovation ecosystem, which has the potential to increase the innovation potential of companies immensely. Additionally, inventions that are not patentable because they do not comply with the strict regulations of patent law can still be published and protected because the information about the origin of the respective contribution is guaranteed.
Conference Paper
Full-text available
A novel approach for estimating articulated body posture and motion from monocular video sequences is proposed. Human pose is defined as the instantaneous two-dimensional configuration (i.e., the projection onto the image plane) of a single articulated body in terms of the positions of a predetermined set of joints. First, statistical segmentation of the human body from the background is performed, and low-level visual features are computed from the segmented body shape. The goal is to map these generally low-level visual features to body configurations. The system estimates several mappings, each associated with a specific cluster in the visual feature space. Given a set of body motion sequences for training, unsupervised clustering is obtained via the Expectation Maximization algorithm. For each cluster, a function is estimated to build the mapping from low-level features to 2D pose. Given new visual features, the mapping from each cluster is applied to yield a set of possible poses, from which the system selects the most likely pose given the learned probability distribution. The performance of the proposed approach is characterized using real and artificially generated body postures, showing promising results.
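A minimal sketch of this kind of pipeline, with substitutions: scikit-learn's GaussianMixture stands in for the EM clustering, ridge regressors for the per-cluster feature-to-pose mappings, and the data are synthetic placeholders rather than the paper's visual features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 16))     # synthetic low-level visual features
Y = rng.normal(size=(500, 24))     # synthetic 2D poses (12 joints x 2 coords)

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)  # EM clustering
labels = gmm.predict(X)
regressors = [Ridge().fit(X[labels == k], Y[labels == k]) for k in range(4)]

def estimate_pose(x):
    # Each cluster proposes a pose; return the proposal from the cluster
    # with the highest responsibility for this feature vector.
    resp = gmm.predict_proba(x[None])[0]
    proposals = [reg.predict(x[None])[0] for reg in regressors]
    return proposals[int(np.argmax(resp))]

print(estimate_pose(X[0]).shape)   # (24,): one candidate 2D pose
```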
Article
Full-text available
This paper addresses the problem of human motion tracking from multiple image sequences. The human body is described by five articulated mechanical chains, and human body parts are described by volumetric primitives with curved surfaces. If such a surface is observed with a camera, an extremal contour appears in the image whenever the surface turns smoothly away from the viewer. We describe a method that recovers human motion through a kinematic parameterization of these extremal contours. The method exploits the fact that the observed image motion of these contours is a function of both the rigid displacement of the surface and of the relative position and orientation between the viewer and the curved surface. First, we describe a parameterization of extremal-contour point velocities for the case of developable surfaces. Second, we use the zero-reference kinematic representation and derive an explicit formula that links extremal contour velocities to the angular velocities associated with the kinematic model. Third, we show how the chamfer distance may be used to measure the discrepancy between predicted extremal contours and observed image contours; moreover, we show how the chamfer distance can be used as a differentiable multi-valued function and how a tracker based on this distance can be cast into a continuous non-linear optimization framework. Fourth, we describe implementation issues associated with a practical human-body tracker that may use an arbitrary number of cameras. One great methodological and practical advantage of our method is that it relies neither on model-to-image nor on image-to-image point matches. In practice we model people with 5 kinematic chains, 19 volumetric primitives, and 54 degrees of freedom; we observe silhouettes in images gathered with several synchronized and calibrated cameras. The tracker has been successfully applied to several complex motions gathered at 30 frames/second.
Article
Full-text available
This paper addresses the problems of modeling the appearance of humans and distinguishing human appearance from the appearance of general scenes. We seek a model of appearance and motion that is generic in that it accounts for the ways in which people's appearance varies and, at the same time, is specific enough to be useful for tracking people in natural scenes. Given a 3D model of the person projected into an image, we model the likelihood of observing various image cues conditioned on the predicted locations and orientations of the limbs. These cues are taken to be steered filter responses corresponding to edges, ridges, and motion-compensated temporal differences. Motivated by work on the statistics of natural scenes, the statistics of these filter responses for human limbs are learned from training images containing hand-labeled limb regions. Similarly, the statistics of the filter responses in general scenes are learned to define a background distribution. The likelihood of observing a scene given a predicted pose of a person is computed, for each limb, using the likelihood ratio between the learned foreground (person) and background distributions. Adopting a Bayesian formulation allows cues to be combined in a principled way. Furthermore, the use of learned distributions obviates the need for hand-tuned image noise models and thresholds. The paper provides a detailed analysis of the statistics of how people appear in scenes and provides a connection between work on natural image statistics and the Bayesian tracking of people.
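A minimal sketch of the likelihood-ratio idea above: learn histograms of a filter response for limb regions and for general scenes, then score a region by the summed log ratio. The distributions, bin choices, and sample sizes are synthetic stand-ins, not the paper's learned statistics:

```python
import numpy as np

rng = np.random.default_rng(11)
fg_train = rng.normal(1.0, 0.5, size=5000)   # responses on hand-labeled limbs
bg_train = rng.normal(0.0, 1.0, size=5000)   # responses in general scenes

bins = np.linspace(-4, 4, 41)
p_fg, _ = np.histogram(fg_train, bins=bins, density=True)
p_bg, _ = np.histogram(bg_train, bins=bins, density=True)
eps = 1e-6                                   # guards empty histogram bins

def log_likelihood_ratio(responses):
    # Sum of per-pixel log ratios between foreground and background models.
    idx = np.clip(np.digitize(responses, bins) - 1, 0, len(p_fg) - 1)
    return np.sum(np.log((p_fg[idx] + eps) / (p_bg[idx] + eps)))

limb_like = rng.normal(1.0, 0.5, size=100)   # responses at a predicted limb
clutter = rng.normal(0.0, 1.0, size=100)
print(log_likelihood_ratio(limb_like) > log_likelihood_ratio(clutter))  # True
```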
Article
Full-text available
This paper describes a computational approach to edge detection. The success of the approach depends on the definition of a comprehensive set of goals for the computation of edge points. These goals must be precise enough to delimit the desired behavior of the detector while making minimal assumptions about the form of the solution. We define detection and localization criteria for a class of edges, and present mathematical forms for these criteria as functionals on the operator impulse response. A third criterion is then added to ensure that the detector has only one response to a single edge. We use the criteria in numerical optimization to derive detectors for several common image features, including step edges. On specializing the analysis to step edges, we find that there is a natural uncertainty principle between detection and localization performance, which are the two main goals. With this principle we derive a single operator shape which is optimal at any scale. The optimal detector has a simple approximate implementation in which edges are marked at maxima in gradient magnitude of a Gaussian-smoothed image. We extend this simple detector using operators of several widths to cope with different signal-to-noise ratios in the image. We present a general method, called feature synthesis, for the fine-to-coarse integration of information from operators at different scales. Finally we show that step edge detector performance improves considerably as the operator point spread function is extended along the edge.
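A simplified sketch of the core of such a detector (not Canny's original implementation; the 4-direction non-maximum suppression and the threshold below are illustrative choices): mark edges at maxima of the gradient magnitude of a Gaussian-smoothed image.

```python
import numpy as np
from scipy import ndimage

def simple_edge_detector(image, sigma=1.5):
    # Smooth, then estimate the gradient with Sobel derivatives.
    smoothed = ndimage.gaussian_filter(image.astype(float), sigma)
    gx = ndimage.sobel(smoothed, axis=1)
    gy = ndimage.sobel(smoothed, axis=0)
    mag = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180
    edges = np.zeros_like(mag, dtype=bool)
    for r in range(1, mag.shape[0] - 1):
        for c in range(1, mag.shape[1] - 1):
            a = angle[r, c]
            if a < 22.5 or a >= 157.5:   # ~horizontal gradient
                n1, n2 = mag[r, c - 1], mag[r, c + 1]
            elif a < 67.5:               # ~45 degree gradient
                n1, n2 = mag[r - 1, c + 1], mag[r + 1, c - 1]
            elif a < 112.5:              # ~vertical gradient
                n1, n2 = mag[r - 1, c], mag[r + 1, c]
            else:                        # ~135 degree gradient
                n1, n2 = mag[r - 1, c - 1], mag[r + 1, c + 1]
            # Keep only local maxima along the gradient direction.
            edges[r, c] = mag[r, c] >= max(n1, n2) and mag[r, c] > 0.1 * mag.max()
    return edges

img = np.zeros((32, 32)); img[:, 16:] = 1.0   # a vertical step edge
print(simple_edge_detector(img).sum() > 0)    # True: the edge column is found
```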
Article
Full-text available
The ability to recognize humans and their activities by vision is key for a machine to interact intelligently and effortlessly with a human-inhabited environment. Because of many potentially important applications, “looking at people” is currently one of the most active application domains in computer vision. This survey identifies a number of promising applications and provides an overview of recent developments in this domain. The scope of this survey is limited to work on whole-body or hand motion; it does not include work on human faces. The emphasis is on discussing the various methodologies; they are grouped in 2-D approaches with or without explicit shape models and 3-D approaches. Where appropriate, systems are reviewed. We conclude with some thoughts about future directions.
Conference Paper
Full-text available
We present a novel similarity measure (likelihood) for estimating three-dimensional human pose from image silhouettes in model-based vision applications. One of the challenges in such approaches is the construction of a model-to-image likelihood that truly reflects the good configurations of the problem. This is hard, commonly due to violation of the consistency principle, which introduces spurious, unrelated peaks/minima that make the search for model localization difficult. We introduce an entirely continuous formulation which enforces model estimation consistency by means of an attraction/explanation silhouette-based term pair. We subsequently show how the proposed method provides significant consolidation and an improved attraction zone around the desired likelihood configurations, and eliminates some of the spurious ones. Finally, we present a skeleton-based smoothing method for the image silhouettes that stabilizes and accelerates the search process.
Conference Paper
Full-text available
The detection and tracking of three-dimensional human body models has progressed rapidly, but successful approaches typically rely on accurate foreground silhouettes obtained using background segmentation. There are many practical applications where such information is imprecise. Here we develop a new image likelihood function based on the visual appearance of the subject being tracked. We propose a robust, adaptive appearance model based on the Wandering-Stable-Lost framework extended to the case of articulated body parts. The method models appearance using a mixture model that includes an adaptive template, frame-to-frame matching and an outlier process. We employ an annealed particle filtering algorithm for inference and take advantage of the 3D body model to predict self-occlusion and improve pose estimation accuracy. Quantitative tracking results are presented for a walking sequence with a 180 degree turn, captured with four synchronized and calibrated cameras and containing significant appearance changes and self-occlusion in each view.
Conference Paper
Full-text available
The potential success of discriminative learning approaches to 3D reconstruction relies on the ability to efficiently train predictive algorithms using sufficiently many examples that are representative of the typical configurations encountered in the application domain. Recent research indicates that sparse conditional Bayesian Mixture of Experts (cMoE) models (e.g. BME (21)) are adequate modeling tools that not only provide contextual 3D predictions for problems like human pose reconstruction, but can also represent multiple interpretations that result from depth ambiguities or occlusion. However, training conditional predictors requires sophisticated double-loop algorithms that scale unfavorably with the input dimension and the training set size, thus limiting their usage to 10,000 examples or less, so far. In this paper we present large-scale algorithms, referred to as fBME, that combine forward feature selection and bound optimization in order to train probabilistic BME models with one order of magnitude more data (100,000 examples and up) and more than one order of magnitude faster. We present several large scale experiments, including monocular evaluation on the HumanEva dataset (19), demonstrating how the proposed methods overcome the scaling limitations of existing ones.
Conference Paper
Full-text available
Part-based tree-structured models have been widely used for 2D articulated human pose estimation. These approaches admit efficient inference algorithms while capturing the important kinematic constraints of the human body as a graphical model. These methods often fail, however, when multiple body parts fit the same image region, resulting in global pose estimates that poorly explain the overall image evidence. Attempts to solve this problem have focused on the use of strong prior models that are limited to learned activities such as walking. We argue that the problem actually lies with the image observations and not with the prior. In particular, image evidence for each body part is estimated independently of other parts without regard to self-occlusion. To address this we introduce occlusion-sensitive local likelihoods that approximate the global image likelihood using per-pixel hidden binary variables that encode the occlusion relationships between parts. This occlusion reasoning introduces interactions between non-adjacent body parts, creating loops in the underlying graphical model. We deal with this using an extension of an approximate belief propagation algorithm (PAMPAS). The algorithm recovers the real-valued 2D pose of the body in the presence of occlusions, does not require strong priors over body pose and does a quantitatively better job of explaining image evidence than previous methods.
Conference Paper
Full-text available
Human motion tracking is an important problem in computer vision. Most prior approaches have concentrated on efficient inference algorithms and prior motion models; however, few can explicitly account for the physical plausibility of recovered motion. The primary purpose of this work is to enforce physical plausibility in the tracking of a single articulated human subject. Towards this end, we propose a full-body 3D physical simulation-based prior that explicitly incorporates motion control and dynamics into the Bayesian filtering framework. We consider the human's motion to be generated by a "control loop". In this control loop, Newtonian physics approximates the rigid-body motion dynamics of the human and the environment through the application and integration of forces. Collisions generate interaction forces to prevent physically impossible hypotheses. This allows us to properly model human motion dynamics, ground contact and environment interactions. For efficient inference in the resulting high-dimensional state space, we introduce an exemplar-based control strategy to reduce the effective search space. As a result we are able to recover the physically-plausible kinematic and dynamic state of the body from monocular and multi-view imagery. We show, both quantitatively and qualitatively, that our approach performs favorably with respect to standard Bayesian filtering methods.
Conference Paper
Full-text available
We introduce a physics-based model for 3D person tracking. Based on a biomechanical characterization of lower-body dynamics, the model captures important physical properties of bipedal locomotion such as balance and ground contact, generalizes naturally to variations in style due to changes in speed, step-length, and mass, and avoids common problems such as footskate that arise with existing trackers. The model dynamics comprises a two degree-of-freedom representation of human locomotion with inelastic ground contact. A stochastic controller generates impulsive forces during the toe-off stage of walking and spring-like forces between the legs. A higher-dimensional kinematic observation model is then conditioned on the underlying dynamics. We use the model for tracking walking people from video, including examples with turning, occlusion, and varying gait.
Conference Paper
Full-text available
We address the problem of estimating human pose in video sequences, where rough location has been determined. We exploit both appearance and motion information by defining suitable features of an image and its temporal neighbors, and learning a regression map to the parameters of a model of the human body using boosting techniques. Our algorithm can be viewed as a fast initialization step for human body trackers, or as a tracker itself. We extend gradient boosting techniques to learn a multi-dimensional map from (rotated and scaled) Haar features to the entire set of joint angles representing the full body pose. We test our approach by learning a map from image patches to body joint angles from synchronized video and motion capture walking data. We show how our technique enables learning an efficient real-time pose estimator, validated on publicly available datasets.
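A rough sketch of this style of mapping, under stated substitutions: scikit-learn's gradient boosting with a one-regressor-per-output wrapper replaces the paper's single multi-dimensional boosted map, and random arrays stand in for the Haar features and joint angles:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-ins for (rotated and scaled) Haar feature responses and
# the corresponding motion-capture joint angles.
rng = np.random.default_rng(2)
features = rng.normal(size=(500, 30))
angles = rng.normal(size=(500, 10))

# One boosted regressor per output angle; only an approximation of the
# paper's jointly learned multi-dimensional map.
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50))
model.fit(features, angles)
print(model.predict(features[:1]).shape)   # (1, 10): full pose from one patch
```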
Conference Paper
Full-text available
A model of human appearance is presented for efficient pose estimation from real-world images. In common with related approaches, a high-level model defines a space of configurations which can be associated with image measurements and thus scored. A search is performed to identify good configuration(s). Such an approach is challenging because the configuration space is high dimensional, the search is global, and the appearance of humans in images is complex due to background clutter, shape uncertainty and texture. The system presented here is novel in several respects. The formulation allows differing numbers of parts to be parameterised and allows poses of differing dimensionality to be compared in a principled manner based upon learnt likelihood ratios. In contrast with current approaches, this allows a part based search in the presence of self occlusion. Furthermore, it provides a principled automatic approach to other object occlusion. View based probabilistic models of body part shapes are learnt that represent intra and inter person variability (in contrast to rigid geometric primitives). The probabilistic region for each part is transformed into the image using the configuration hypothesis and used to collect two appearance distributions for the part’s foreground and adjacent background. Likelihood ratios for single parts are learnt from the dissimilarity of the foreground and adjacent background appearance distributions. It is important to note the distinction between this technique and restrictive foreground/background specific modelling. It is demonstrated that this likelihood allows better discrimination of body parts in real world images than contour to edge matching techniques. Furthermore, the likelihood is less sparse and noisy, making coarse sampling and local search more effective. A likelihood ratio for body part pairs with similar appearances is also learnt. Together with a model of inter-part distances this better describes correct higher dimensional configurations. Results from applying an optimization scheme to the likelihood model for challenging real world images are presented.
Conference Paper
Full-text available
This paper addresses the problem of probabilistically modeling 3D human motion for synthesis and tracking. Given the high dimensional nature of human motion, learning an explicit probabilistic model from available training data is currently impractical. Instead we exploit methods from texture synthesis that treat images as representing an implicit empirical distribution. These methods replace the problem of representing the probability of a texture pattern with that of searching the training data for similar instances of that pattern. We extend this idea to temporal data representing 3D human motion with a large database of example motions. To make the method useful in practice, we must address the problem of efficient search in a large training set; efficiency is particularly important for tracking. Towards that end, we learn a low dimensional linear model of human motion that is used to structure the example motion database into a binary tree. An approximate probabilistic tree search method exploits the coefficients of this low-dimensional representation and runs in sub-linear time. This probabilistic tree search returns a particular sample human motion with probability approximating the true distribution of human motions in the database. This sampling method is suitable for use with particle filtering techniques and is applied to articulated 3D tracking of humans within a Bayesian framework. Successful tracking results are presented, along with examples of synthesizing human motion using the model.
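A minimal sketch of the search structure described above, with substitutions: PCA supplies the low-dimensional linear motion model, and a deterministic KD-tree replaces the paper's probabilistic binary tree; the motion database is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KDTree

rng = np.random.default_rng(3)
motions = rng.normal(size=(5000, 90))     # synthetic pose windows (3 x 30-DOF frames)

pca = PCA(n_components=8).fit(motions)    # low-dimensional linear motion model
coeffs = pca.transform(motions)
tree = KDTree(coeffs)                     # tree structured over PCA coefficients

query = pca.transform(motions[:1])
dist, idx = tree.query(query, k=5)        # similar example motions, sub-linear time
samples = motions[idx[0]]                 # candidate motions for a particle filter
print(samples.shape)                      # (5, 90)
```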
Conference Paper
Full-text available
Filtering-based algorithms have become popular in tracking human body pose. Such algorithms can suffer the curse of dimensionality due to the high dimensionality of the pose state space; therefore, efforts have been dedicated to either smart sampling or reducing the dimensionality of the original pose state space. In this paper, a novel formulation that employs a dimensionality-reduced state space for multi-hypothesis tracking is proposed. During off-line training, a mixture of factor analyzers is learned. Each factor analyzer can be thought of as a "local dimensionality reducer" that locally approximates the pose manifold. Global coordination between local factor analyzers is achieved by learning a set of linear mixture functions that enforces agreement between local factor analyzers. The formulation allows easy bidirectional mapping between the original body pose space and the low-dimensional space. During online tracking, the clusters of factor analyzers are utilized in a multiple hypothesis tracking algorithm. Experiments demonstrate that the proposed algorithm tracks 3D body pose efficiently and accurately, even when self-occlusion, motion blur and large limb movements occur. Quantitative comparisons show that the formulation produces more accurate 3D pose estimates over time than those that can be obtained via a number of previously-proposed particle filtering based tracking algorithms.
Conference Paper
Full-text available
This paper describes a framework for constructing a linear subspace model of image appearance for complex articulated 3D figures such as humans and other animals. A commercial motion capture system provides 3D data that is aligned with images of subjects performing various activities. Portions of a limb's image appearance are seen from multiple views and for multiple subjects. From these partial views, weighted principal component analysis is used to construct a linear subspace representation of the “unwrapped” image appearance of each limb. The linear subspaces provide a generative model of the object appearance that is exploited in a Bayesian particle filtering tracking system. Results of tracking single limbs and walking humans are presented
Conference Paper
Full-text available
In this contribution we present a silhouette-based human motion estimation system. The system components comprise silhouette extraction based on level sets, a correspondence module that relates image data to model data, and a pose estimation module. Experiments are done in a four-camera setup, and we estimate a model with 21 degrees of freedom at two frames per second. Finally, we compare the motion estimation system with a marker-based tracking system to perform a quantitative error analysis. The results show the applicability of the system for marker-less sports movement analysis.
Conference Paper
Full-text available
In this paper we consider modeling data lying on multiple continuous manifolds. In particular, we model the shape manifold of a person performing a motion observed from different view points along a view circle at fixed camera height. We introduce a model that ties together the body configuration (kinematics) manifold and the visual manifold (observations) in a way that facilitates tracking the 3D configuration with continuous relative view variability. The model exploits the low dimensionality nature of both the body configuration manifold and the view manifold, where each of them is represented separately.
Article
Full-text available
A system capable of analyzing image sequences of human motion is described. The system is structured as a feedback loop between high and low levels: predictions are made at the semantic level and verifications are sought at the image level. The domain of human motion lends itself to a model-driven analysis, and the system includes a detailed model of the human body. All information extracted from the image is interpreted through a constraint network based on the structure of the human model. A constraint propagation operator is defined and its theoretical properties outlined. An implementation of this operator is described, and results of the analysis system for short image sequences are presented.
Article
Full-text available
Interacting and annealing are two powerful strategies that are applied in different areas of stochastic modelling and data analysis. Interacting particle systems approximate a distribution of interest by a finite number of particles where the particles interact between the time steps. In computer vision, they are commonly known as particle filters. Simulated annealing, on the other hand, is a global optimization method derived from statistical mechanics. A recent heuristic approach to fuse these two techniques for motion capturing has become known as annealed particle filter. In order to analyze these techniques, we rigorously derive in this paper two algorithms with annealing properties based on the mathematical theory of interacting particle systems. Convergence results and sufficient parameter restrictions enable us to point out limitations of the annealed particle filter. Moreover, we evaluate the impact of the parameters on the performance in various experiments, including the tracking of articulated bodies from noisy measurements. Our results provide a general guidance on suitable parameter choices for different applications.
Article
Full-text available
Identification of people by analysis of gait patterns extracted from video has recently become a popular research problem. However, the conditions under which the problem is "solvable" are not understood or characterized. To provide a means for measuring progress and characterizing the properties of gait recognition, we introduce the HumanID Gait Challenge Problem. The challenge problem consists of a baseline algorithm, a set of 12 experiments, and a large data set. The baseline algorithm estimates silhouettes by background subtraction and performs recognition by temporal correlation of silhouettes. The 12 experiments are of increasing difficulty, as measured by the baseline algorithm, and examine the effects of five covariates on performance. The covariates are: change in viewing angle, change in shoe type, change in walking surface, carrying or not carrying a briefcase, and elapsed time between sequences being compared. Identification rates for the 12 experiments range from 78 percent on the easiest experiment to 3 percent on the hardest. All five covariates had statistically significant effects on performance, with walking surface and time difference having the greatest impact. The data set consists of 1,870 sequences from 122 subjects spanning five covariates (1.2 Gigabytes of data). The gait data, the source code of the baseline algorithm, and scripts to run, score, and analyze the challenge experiments are available at http://www.GaitChallenge.org. This infrastructure supports further development of gait recognition algorithms and additional experiments to understand the strengths and weaknesses of new algorithms. The more detailed the experimental results presented, the more detailed is the possible meta-analysis and greater is the understanding. It is this potential from the adoption of this challenge problem that represents a radical departure from traditional computer vision research methodology.
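A toy sketch of the baseline's two stages, background-subtracted silhouettes followed by temporal correlation of silhouette sequences, with synthetic frames and an illustrative threshold (not the challenge's actual code):

```python
import numpy as np

def silhouettes(frames, background, thresh=30):
    # Background subtraction: pixels far from the background model are foreground.
    return np.abs(frames.astype(int) - background.astype(int)) > thresh

def sequence_similarity(probe, gallery):
    # Slide the probe over the gallery and keep the best mean silhouette
    # overlap: a simple stand-in for the challenge's temporal correlation.
    n = len(probe)
    scores = [
        np.mean([np.mean(p == g) for p, g in zip(probe, gallery[s:s + n])])
        for s in range(len(gallery) - n + 1)
    ]
    return max(scores)

rng = np.random.default_rng(4)
bg = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
seq = rng.integers(0, 256, size=(30, 64, 64), dtype=np.uint8)
probe = silhouettes(seq[:10], bg)
gallery = silhouettes(seq, bg)
print(sequence_similarity(probe, gallery))   # 1.0 at the matching offset
```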
Conference Paper
Full-text available
We present a 3-level hierarchical model for localizing human bodies in still images from arbitrary viewpoints. We first fit a simple tree-structured model defined on a small landmark set along the body contours by Dynamic Programming (DP). The output is a series of proposal maps that encode the probabilities of partial body configurations. Next, we fit a mixture of view-dependent models by Sequential Monte Carlo (SMC), which handles self-occlusion, anthropometric constraints, and large viewpoint changes. DP and SMC are designed to search in opposite directions such that the DP proposals are utilized effectively to initialize and guide the SMC inference. This hybrid strategy of combining deterministic and stochastic search ensures both the robustness and efficiency of DP, and the accuracy of SMC. Finally, we fit an expanded mixture model with increased landmark density through local optimization. The model hierarchy is trained on a large number of gait images. Extensive tests on cluttered images with varying poses including walking, dancing and various types of sports activities demonstrate the feasibility of the proposed approach.
Conference Paper
Full-text available
The Bayesian estimation of 3D human motion from video sequences is quantitatively evaluated using synchronized, multi-camera, calibrated video and 3D ground truth poses acquired with a commercial motion capture system. While many methods for human pose estimation and tracking have been proposed, to date there has been no quantitative comparison. Our goal is to evaluate how different design choices influence tracking performance. Toward that end, we independently implemented two fairly standard Bayesian person trackers using two variants of particle filtering and propose an evaluation measure appropriate for assessing the quality of probabilistic tracking methods. In the Bayesian framework we compare various image likelihood functions and prior models of human motion that have been proposed in the literature. Our results suggest that in constrained laboratory environments, current methods perform quite well. Multiple cameras and background subtraction, however, are required to achieve reliable tracking suggesting that many current methods may be inappropriate in more natural settings. We discuss the implications of the study and the directions for future research that it entails.
Article
Simulated annealing—moving from a tractable distribution to a distribution of interest via a sequence of intermediate distributions—has traditionally been used as an inexact method of handling isolated modes in Markov chain samplers. Here, it is shown how one can use the Markov chain transitions for such an annealing sequence to define an importance sampler. The Markov chain aspect allows this method to perform acceptably even for high-dimensional problems, where finding good importance sampling distributions would otherwise be very difficult, while the use of importance weights ensures that the estimates found converge to the correct values as the number of annealing runs increases. This annealed importance sampling procedure resembles the second half of the previously-studied tempered transitions, and can be seen as a generalization of a recently-proposed variant of sequential importance sampling. It is also related to thermodynamic integration methods for estimating ratios of normalizing constants. Annealed importance sampling is most attractive when isolated modes are present, or when estimates of normalizing constants are required, but it may also be more generally useful, since its independent sampling allows one to bypass some of the problems of assessing convergence and autocorrelation in Markov chain samplers.
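As a worked note on the construction just described, using the standard presentation's notation (assumed here, since the abstract fixes no symbols): let f_0 be the distribution of interest and f_n the tractable starting distribution, with geometric intermediates and a per-run importance weight as follows.

```latex
% Geometric bridge between the tractable f_n and the target f_0,
% with 1 = \beta_0 > \beta_1 > \cdots > \beta_n = 0:
f_j(x) \propto f_0(x)^{\beta_j}\, f_n(x)^{1-\beta_j}.
% One run samples x_{n-1} \sim f_n, then x_{j-1} \sim T_j(x_j,\cdot) for
% j = n-1,\dots,1, where T_j is a Markov transition leaving f_j invariant.
% The run's importance weight is
w \;=\; \frac{f_{n-1}(x_{n-1})}{f_n(x_{n-1})}\,
        \frac{f_{n-2}(x_{n-2})}{f_{n-1}(x_{n-2})}\,\cdots\,
        \frac{f_0(x_0)}{f_1(x_0)}.
```

Averages weighted by w then converge to expectations under f_0, and the mean of w itself estimates the ratio of the normalizing constants of f_0 and f_n.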
Conference Paper
Tracking the body poses of multiple persons in monocular video is a challenging problem due to the high dimensionality of the state space and issues such as inter-occlusion of the persons' bodies. We propose a three-stage approach with a multi-level state representation that enables hierarchical estimation of 3D body poses. In the first stage, humans are tracked as blobs. In the second stage, parts such as the face, shoulders and limbs are estimated, and estimates are combined by grid-based belief propagation to infer 2D joint positions. The derived belief maps are used as proposal functions in the third stage to infer the 3D pose using data-driven Markov chain Monte Carlo. Experimental results on realistic indoor video sequences show that the method is able to track multiple persons during complex movements such as turning with inter-occlusion.
Article
The problem of tracking curves in dense visual clutter is challenging. Kalman filtering is inadequate because it is based on Gaussian densities which, being unimodal, cannot represent simultaneous alternative hypotheses. The Condensation algorithm uses factored sampling, previously applied to the interpretation of static images, in which the probability distribution of possible interpretations is represented by a randomly generated set. Condensation uses learned dynamical models, together with visual observations, to propagate the random set over time. The result is highly robust tracking of agile motion. Notwithstanding the use of stochastic methods, the algorithm runs in near real-time.
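A toy sketch of the propagate-and-reweight cycle described above, on a 1D state with a hypothetical AR(1) dynamical model and Gaussian observation density (not the paper's learned models):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1000
particles = rng.normal(size=N)                 # initial sample set
weights = np.full(N, 1.0 / N)

def observation_likelihood(x, z, sigma=0.5):
    return np.exp(-0.5 * ((x - z) / sigma) ** 2)

for z in [0.2, 0.5, 0.9]:                      # stream of measurements
    # 1. Factored sampling: select particles proportionally to weight.
    idx = rng.choice(N, size=N, p=weights)
    particles = particles[idx]
    # 2. Predict with the dynamical model plus process noise.
    particles = 0.95 * particles + rng.normal(scale=0.1, size=N)
    # 3. Re-weight by the visual observation density.
    weights = observation_likelihood(particles, z)
    weights /= weights.sum()

print(np.sum(weights * particles))             # posterior mean of the state
```

The multi-modal posterior that defeats the Kalman filter is carried here implicitly, as the spread of the weighted sample set.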
Article
In this paper we present a computationally efficient framework for part-based modeling and recognition of objects. Our work is motivated by the pictorial structure models introduced by Fischler and Elschlager. The basic idea is to represent an object by a collection of parts arranged in a deformable configuration. The appearance of each part is modeled separately, and the deformable configuration is represented by spring-like connections between pairs of parts. These models allow for qualitative descriptions of visual appearance, and are suitable for generic recognition problems. We address the problem of using pictorial structure models to find instances of an object in an image as well as the problem of learning an object model from training examples, presenting efficient algorithms in both cases. We demonstrate the techniques by learning models that represent faces and human bodies and using the resulting models to locate the corresponding objects in novel images.
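A brute-force sketch of the spring-connected part model on synthetic appearance maps; the actual method replaces the inner minimization with generalized distance transforms to make inference efficient, and the offsets and stiffness below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
H, W = 40, 40
app_torso = rng.random((H, W))      # per-location appearance mismatch costs
app_head = rng.random((H, W))
dy, dx = np.mgrid[0:H, 0:W]
k = 0.05                            # spring stiffness
off = (-10, 0)                      # head expected 10 px above the torso

best_cost, best_loc = np.inf, None
for ty in range(H):
    for tx in range(W):
        # Spring cost of every head placement relative to this torso placement;
        # total cost = torso appearance + best (head appearance + deformation).
        spring = k * ((dy - (ty + off[0])) ** 2 + (dx - (tx + off[1])) ** 2)
        cost = app_torso[ty, tx] + (app_head + spring).min()
        if cost < best_cost:
            best_cost, best_loc = cost, (ty, tx)
print("best torso placement:", best_loc, "cost:", round(best_cost, 3))
```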
Article
Quantitative geometric descriptions of the movements of persons are obtained by fitting the projection of a three-dimensional person model to consecutive frames of an image sequence. The kinematics of the person model are given by a homogeneous transformation tree, and its body parts are modeled by right-elliptical cones. The values of a varying number of degrees of freedom (DOFs; body joints, position, and orientation of the person relative to the camera) can be determined according to the application and the kind of image sequence. The determination of the DOFs is posed as an estimation problem which is solved by an iterated extended Kalman filter (IEKF). For this purpose, the person model is augmented by a simple constant-velocity motion model for all DOFs, which is used in the prediction step of the IEKF. In the update step, both region and edge information are used. Various experiments demonstrate the efficiency of our approach.
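A minimal sketch of the prediction/update cycle with a constant-velocity model for a single DOF; the paper's IEKF additionally linearizes a nonlinear measurement model and iterates the update, which is omitted here, and all noise values are illustrative:

```python
import numpy as np

dt = 1.0 / 30.0
F = np.array([[1.0, dt], [0.0, 1.0]])     # constant-velocity transition
Hm = np.array([[1.0, 0.0]])               # only the DOF value is measured
Q = 1e-4 * np.eye(2)                      # process noise covariance
R = np.array([[1e-2]])                    # measurement noise covariance

x = np.zeros(2)                           # state: [angle, angular velocity]
P = np.eye(2)
for z in [0.10, 0.12, 0.15, 0.19]:        # synthetic joint-angle measurements
    x, P = F @ x, F @ P @ F.T + Q         # predict with the motion model
    S = Hm @ P @ Hm.T + R
    K = P @ Hm.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (np.array([z]) - Hm @ x)  # correct with the measurement
    P = (np.eye(2) - K @ Hm) @ P
print(x)                                  # tracked angle and velocity
```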
Article
A comprehensive survey of computer vision-based human motion capture literature from the past two decades is presented. The focus is on a general overview based on a taxonomy of system functionalities, broken down into four processes: initialization, tracking, pose estimation, and recognition. Each process is discussed and divided into subprocesses and/or categories of methods to provide a reference to describe and compare the more than 130 publications covered by the survey. References are included throughout the paper to exemplify important issues and their relations to the various methods. A number of general assumptions used in this research field are identified and the character of these assumptions indicates that the research field is still in an early stage of development. To evaluate the state of the art, the major application areas are identified and performances are analyzed in light of the methods presented in the survey. Finally, suggestions for future research directions are offered.
Conference Paper
A novel approach for accurate markerless motion capture combining a precise tracking algorithm with a database of articulated models is presented. The tracking approach employs an articulated iterative closest point algorithm with soft-joint constraints for tracking body segments in visual hull sequences. The database of articulated models is derived from a combination of human shapes and anthropometric data, contains a large variety of models and closely mimics variations found in the human population. The database provides articulated models that closely match the outer appearance of the visual hulls, e.g. match overall height and volume. This information is paired with a kinematic chain enhanced through anthropometric regression equations. Deviations in the kinematic chain from true joint center locations are compensated by the soft-joint constraints approach. As a result, an accurate and more anatomically correct outcome is obtained, suitable for biomechanical and clinical applications. Joint kinematics obtained using this approach closely matched joint kinematics obtained from a marker-based motion capture system.
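One building block of such a tracker is the rigid alignment of a body segment to matched visual-hull points. Below is a minimal sketch of the standard SVD-based (Kabsch) least-squares solution on synthetic points, with the soft-joint constraint terms left out:

```python
import numpy as np

def rigid_align(src, dst):
    # Least-squares rotation and translation mapping src points onto dst points.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

rng = np.random.default_rng(7)
src = rng.normal(size=(200, 3))               # model points on a body segment
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q if np.linalg.det(Q) > 0 else -Q    # random ground-truth rotation
dst = src @ R_true.T + np.array([0.1, 0.0, -0.2])  # matched visual-hull points
R, t = rigid_align(src, dst)
print(np.allclose(R, R_true, atol=1e-6))      # True
```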
Conference Paper
The goal of this work is to detect a human figure in an image and localize the joints and limbs along with their associated pixel masks. In this work we attempt to tackle this problem in a general setting. The dataset we use is a collection of sports news photographs of baseball players, varying dramatically in pose and clothing. Our approach uses segmentation to guide the recognition algorithm to salient parts of the image. We use this segmentation approach to build limb and torso detectors, the outputs of which are assembled into human figures. We present quantitative results on torso localization, in addition to shortlisted full-body configurations.
Conference Paper
This paper addresses the problem of recovering 3D human pose from a single monocular image, using a discriminative bag-of-words approach. In previous work, the visual words are learned by unsupervised clustering algorithms. They capture the most common patterns and are good features for coarse-grain recognition tasks like object classification. But for tasks which deal with subtle differences such as pose estimation, such a representation may lack the needed discriminative power. In this paper, we propose to jointly learn the visual words and the pose regressors in a supervised manner. More specifically, we learn an individual distance metric for each visual word to optimize the pose estimation performance. The learned metrics rescale the visual words to suppress unimportant dimensions such as those corresponding to background. Another contribution is that we design an Appearance and Position Context (APC) local descriptor that achieves both selectivity and invariance while requiring no background subtraction. We test our approach on both a quasi-synthetic dataset and a real dataset (HumanEva) to verify its effectiveness. Our approach also achieves fast computational speed thanks to the integral histograms used in APC descriptor extraction and fast inference of pose regressors.
Conference Paper
Many computer vision tasks may be expressed as the problem of learning a mapping between image space and a parameter space. For example, in human body pose estimation, recent research has directly modelled the mapping from image features (z) to joint angles (θ). Fitting such models requires training data in the form of labelled (z, θ) pairs, from which are learned the conditional densities p(θ|z). Inference is then simple: given test image features z, the conditional p(θ|z) is immediately computed. However large amounts of training data are required to fit the models, particularly in the case where the spaces are high dimensional. We show how the use of unlabelled data—samples from the marginal distributions p(z) and p(θ)—may be used to improve fitting. This is valuable because it is often significantly easier to obtain unlabelled than labelled samples. We use a Gaussian process latent variable model to learn the mapping from a shared latent low-dimensional manifold to the feature and parameter spaces. This extends existing approaches to (a) use unlabelled data, and (b) represent one-to-many mappings. Experiments on synthetic and real problems demonstrate how the use of unlabelled data improves over existing techniques. In our comparisons, we include existing approaches that are explicitly semi-supervised as well as those which implicitly make use of unlabelled examples.
Conference Paper
The goal of this work is to recover human body configurations from static images. Without assuming a priori knowledge of scale, pose or appearance, this problem is extremely challenging and demands the use of all possible sources of information. We develop a framework which can incorporate arbitrary pairwise constraints between body parts, such as scale compatibility, relative position, symmetry of clothing and smooth contour connections between parts. We detect candidate body parts from bottom-up using parallelism, and use various pairwise configuration constraints to assemble them together into body configurations. To find the most probable configuration, we solve an integer quadratic programming problem with a standard technique using linear approximations. Approximate IQP allows us to incorporate much more information than the traditional dynamic programming and remains computationally efficient. 15 hand-labeled images are used to train the low-level part detector and learn the pairwise constraints. We show test results on a variety of images.
Conference Paper
We present a motion exemplar approach for finding body configuration in monocular videos. A motion correlation technique is employed to measure the motion similarity at various space-time locations between the input video and stored video templates. These observations are used to predict the conditional state distributions of exemplars and joint positions. Exemplar sequence selection and joint position estimation are then solved with approximate inference using Gibbs sampling and gradient ascent. The presented approach is able to find joint positions accurately for people with textured clothing. Results are presented on a dataset containing slow, fast and incline walk videos of various people from different view angles. The results demonstrate an overall improvement compared to previous methods.
Conference Paper
The goal of this work is to learn a parsimonious and informative representation for high-dimensional time series. Conceptually, this comprises two distinct yet tightly coupled tasks: learning a low-dimensional manifold and modeling the dynamical process. These two tasks have a complementary relationship as the temporal constraints provide valuable neighborhood information for dimensionality reduction and, conversely, the low-dimensional space allows dynamics to be learnt efficiently. Solving these two tasks simultaneously allows important information to be exchanged mutually. If nonlinear models are required to capture the rich complexity of time series, then the learning problem becomes harder as the nonlinearities in both tasks are coupled. The proposed solution approximates the nonlinear manifold and dynamics using piecewise linear models. The interactions among the linear models are captured in a graphical model. By exploiting the model structure, efficient inference and learning algorithms are obtained without oversimplifying the model of the underlying dynamical process. Evaluation of the proposed framework with competing approaches is conducted in three sets of experiments: dimensionality reduction and reconstruction using synthetic time series, video synthesis using a dynamic texture database, and human motion synthesis, classification and tracking on a benchmark data set. In all experiments, the proposed approach provides superior performance.
Conference Paper
Recognizing humans, estimating their pose and segmenting their body parts are key to high-level image understanding. Because humans are highly articulated, the range of deformations they undergo makes this task extremely challenging. Previous methods have focused largely on heuristics or pairwise part models in approaching this problem. We propose a bottom-up growing, similar to parsing, of increasingly more complete partial body masks guided by a set of parse rules. At each level of the growing process, we evaluate the partial body masks directly via shape matching with exemplars (and also image features), without regard to how the hypotheses are formed. The body is evaluated as a whole, not the sum of its parts, unlike previous approaches. Multiple image segmentations are included at each of the levels of the growing/parsing, to augment existing hypotheses or to introduce ones. Our method yields both a pose estimate as well as a segmentation of the human. We demonstrate competitive results on this challenging task with relatively few training examples on a dataset of baseball players with wide pose variation. Our method is comparatively simple and could be easily extended to other objects. We also give a learning framework for parse ranking that allows us to keep fewer parses for similar performance.
Article
We develop a modified particle filter which is shown to be effective at searching the high-dimensional configuration spaces (c. 30+ dimensions) encountered in visual tracking of articulated body motion. The algorithm uses a continuation principle, based on annealing, to introduce the influence of narrow peaks in the fitness function gradually. The new algorithm, termed annealed particle filtering, is shown to be capable of recovering full articulated body motion efficiently. A mechanism for achieving a soft partitioning of the search space is described and implemented, and shown to improve the algorithm's performance. Likewise, the introduction of a crossover operator is shown to improve the effectiveness of the search for kinematic trees (such as a human body). Results are given for a variety of agile motions such as walking, running and jumping.
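A toy sketch of the annealing layers on a 2D state with a narrow fitness peak; the schedule, noise decay, and fitness function are illustrative assumptions, not the paper's settings. Each layer re-weights particles with a sharpened fitness w ∝ L(x)^β, resamples, and diffuses with shrinking noise, so early broad layers guide particles toward the narrow peak:

```python
import numpy as np

rng = np.random.default_rng(8)

def fitness(x):
    # Toy likelihood: a narrow peak at (1, -1) on a broad background.
    return np.exp(-20 * np.sum((x - np.array([1.0, -1.0])) ** 2, axis=1)) + 1e-6

N = 500
particles = rng.normal(scale=2.0, size=(N, 2))
for beta, noise in zip([0.1, 0.3, 0.6, 1.0], [0.5, 0.25, 0.1, 0.05]):
    w = fitness(particles) ** beta             # annealed weighting
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)           # resample
    particles = particles[idx] + rng.normal(scale=noise, size=(N, 2))

print(particles.mean(axis=0))                  # close to the peak at (1, -1)
```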
Article
For a machine to be able to ‘see’, it must know something about the object it is ‘looking’ at. A common method in machine vision is to provide the machine with general rather than specific knowledge about the object. An alternative technique, and the one used in this paper, is a model-based approach in which particulars about the object are given and drive the analysis. The computer program described here, the WALKER model, maps images into a description in which a person is represented by a series of hierarchical levels, i.e. a person has an arm, which has a lower arm, which has a hand. The performance of the program is illustrated by superimposing the machine-generated picture over the original photographic images.
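The hierarchical description (a person has an arm, which has a lower arm, which has a hand) amounts to a small tree of parts. The sketch below is our illustration of that idea; the field names and traversal are not WALKER's actual internal representation:

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    """One node in a hierarchical body description."""
    name: str
    angle: float = 0.0                 # joint angle relative to the parent
    children: list = field(default_factory=list)

# person -> arm -> lower_arm -> hand, as in the abstract's example.
hand = Part("hand")
lower_arm = Part("lower_arm", children=[hand])
arm = Part("arm", children=[lower_arm])
person = Part("person", children=[arm])

def walk(part, depth=0):
    """Traverse the hierarchy top-down, as a model-driven analysis would."""
    print("  " * depth + part.name)
    for child in part.children:
        walk(child, depth + 1)

walk(person)
```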
Article
We present a template-based approach to detecting human silhouettes in a specific walking pose. Our templates consist of short sequences of 2D silhouettes obtained from motion capture data. This lets us incorporate motion information into them and helps distinguish actual people, who move in a predictable way, from static objects whose outlines roughly resemble those of humans. Moreover, during the training phase we use statistical learning techniques to estimate and store the relevance of the different silhouette parts to the recognition task. At run-time, we use these relevance estimates to convert Chamfer distances into meaningful probability estimates. The templates can handle six different camera views, excluding the frontal and back views, as well as different scales. We demonstrate the effectiveness of our technique using both indoor and outdoor sequences of people walking in front of cluttered backgrounds, acquired with a moving camera, which makes techniques such as background subtraction impractical.
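A hedged sketch of the matching step: score integer template points against a distance transform of the edge map, weight them by per-part relevance, and map the weighted distance to a probability. The exponential conversion and the helper names are our stand-ins; the paper learns its distance-to-probability conversion statistically:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(edge_map, template_pts, relevance, offset):
    """Relevance-weighted mean distance from template points (integer
    (row, col) coordinates, shifted by offset) to the nearest image edge."""
    dt = distance_transform_edt(~edge_map)   # distance to nearest edge pixel
    ys = np.clip(template_pts[:, 0] + offset[0], 0, edge_map.shape[0] - 1)
    xs = np.clip(template_pts[:, 1] + offset[1], 0, edge_map.shape[1] - 1)
    return np.sum(relevance * dt[ys, xs]) / np.sum(relevance)

def match_probability(score, scale=2.0):
    # Illustrative conversion from a distance score to a (0, 1] probability.
    return float(np.exp(-score / scale))
```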
Article
The functional method identifies the hip joint centre (HJC) as the centre of rotation of the femur relative to the pelvis during an ad hoc movement normally recorded using stereophotogrammetry. This method may be used for the direct determination of subject-specific HJC coordinates or for creating a database from which regression equations may be derived that allow for the prediction of those coordinates. In order to contribute to the optimization of the functional method, the effects of the following factors were investigated: the algorithm used to estimate the HJC coordinates from marker coordinates, the type and amplitude of the movement of the femur relative to the pelvis, marker cluster location and dimensions, and the number of data samples. This was done using a simulation approach which, in turn, was validated using experiments made on a physical analogue of the pelvis and femur system. The algorithms used in the present context were classified and, in some instances, modified in order to optimize both accuracy and computation time, and submitted to a comparative evaluation. The type of movement that allowed for the most accurate results consisted of several flexion-extension/abduction-adduction movements performed on vertical planes of different orientations, followed by a circumduction movement. The accuracy of the HJC estimate improved, at an increasing rate, as a function of the amplitude of these movements. A sharp improvement was found as the number of the photogrammetric data samples used to describe the movement increased up to 500. For optimal performance with the recommended algorithms, markers were best located as far as possible from each other and with their centroid as close as possible to the HJC. By optimizing the analytical and experimental protocol, HJC location error not caused by soft tissue artefacts may be reduced by a factor of ten, with a maximal expected value for such error of approximately 1 mm.
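One common algebraic formulation of the functional method fits a sphere, in a pelvis-fixed frame, to the trajectory of a femur-mounted marker and takes the fitted centre as the HJC. The generic least-squares fit below stands in for the family of algorithms the paper classifies and compares; it is not any specific one of them:

```python
import numpy as np

def fit_sphere_center(points):
    """Algebraic sphere fit: ||p - c||^2 = r^2 linearizes to
    2 p . c + (r^2 - ||c||^2) = ||p||^2, linear in c and one scalar."""
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    b = np.sum(points**2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3]                       # estimated centre of rotation (HJC)

# Synthetic check: noisy points on a sphere of radius 0.25 m about a known HJC.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(500, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
pts = np.array([0.10, -0.05, 0.90]) + 0.25 * dirs + rng.normal(0, 1e-3, (500, 3))
print(fit_sphere_center(pts))            # close to [0.10, -0.05, 0.90]
```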
Article
The objective of the study was to develop a framework for the accurate identification of joint centers to be used for the calculation of human body kinematics and kinetics. The present work introduces a method for the functional identification of joint centers using markerless motion capture (MMC). The MMC system used 8 color VGA cameras. An automatic segmentation-registration algorithm was developed to identify the optimal joint center in a least-squares sense. The method was applied to the hip joint center with a validation study conducted in a virtual environment. The results had an accuracy (6 mm mean absolute error) below the current MMC system resolution (1 cm voxel resolution). Direct experimental comparison with marker-based methods was carried out, showing mean absolute deviations over the three anatomical directions of 11.9 and 15.3 mm when compared with a full-leg or a thigh-only markers protocol, respectively. Those experimental results were presented only in terms of deviations between the two systems (marker-based and markerless), as no real gold standard was available. The methods presented in this paper provide an important enabling step towards the biomechanical and clinical applications of markerless motion capture.
Conference Paper
Inference in 3D articulated human body tracking is challenging due to the high dimensionality and nonlinearity of the parameter-space. We propose a particle filter with Rao-Blackwellisation which marginalizes part of the state variables by exploiting the correlation between the right-side and the left-side joint Euler angles. The correlation is naturally induced by the symmetric and repetitive patterns in specific human activities. A novel algorithm is proposed to learn the correlation from the training data using partial least squares regression. The learned correlation is then used as a motion prior in designing the Rao-Blackwellised particle filter, which estimates only one group of state variables using the Monte Carlo method, leaving the other group to be computed exactly through an analytical filter that utilizes the learned motion correlation. We evaluate the effectiveness of the motion correlation for 3D articulated human body tracking. The accuracy of the proposed 3D tracker is quantitatively assessed based on the distance between the true and the estimated marker positions. Extensive experiments with multi-camera walking sequences from the HumanEva-I/II data set show that (i) the proposed tracker achieves significantly lower estimation error than both the annealed particle filter and the standard particle filter; and (ii) the learned motion correlation generalizes well to motion performed by subjects other than the training subject.
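The division of labour can be sketched as follows: propagate the sampled group (say, the right-side angles) with particles, then fill in the other group through a learned linear map. Ordinary least squares stands in below for the paper's partial least squares regression, and the analytical filter is collapsed to a deterministic prediction for brevity; all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training": learn the left-from-right correlation from paired angle data.
right_train = rng.normal(size=(1000, 8))          # 8 right-side joint angles
W_true = rng.normal(size=(8, 8))
left_train = right_train @ W_true + 0.05 * rng.normal(size=(1000, 8))
W, *_ = np.linalg.lstsq(right_train, left_train, rcond=None)

def propagate(right_particles, noise_std=0.05):
    """Monte Carlo step for the sampled group; analytic fill-in for the rest."""
    right_particles = right_particles + rng.normal(
        0.0, noise_std, right_particles.shape)
    left_particles = right_particles @ W          # prediction via learned map
    return right_particles, left_particles

right = rng.normal(size=(100, 8))                 # 100 particles
right, left = propagate(right)
```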
Conference Paper
This paper presents the first systematic empirical study of particle filter (PF) algorithms for human figure tracking in video. Our analysis and evaluation follow a modular approach based upon the underlying statistical principles and computational concerns that govern the performance of PF algorithms. Based on our analysis, we propose a novel PF algorithm for figure tracking with superior performance, called the Optimized Unscented PF. We examine the role of edge and template features, introduce computationally-equivalent sample sets, and describe a method for the automatic acquisition of reference data using standard motion capture hardware. The software and test data are made publicly available on our project website.
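The 'unscented' ingredient refers to the unscented transform: 2n+1 deterministic sigma points that carry a Gaussian's mean and covariance through a nonlinearity. The sketch below is the textbook construction under one common choice of scaling parameters; it is background for unscented particle filters generally, not the Optimized Unscented PF the paper proposes:

```python
import numpy as np

def unscented_transform(mean, cov, f, alpha=1.0, beta=2.0, kappa=1.0):
    """Propagate (mean, cov) through f with 2n+1 sigma points."""
    n = len(mean)
    lam = alpha**2 * (n + kappa) - n
    L = np.linalg.cholesky((n + lam) * cov)            # matrix square root
    sigma = np.vstack([mean, mean + L.T, mean - L.T])  # 2n+1 sigma points
    wm = np.full(2 * n + 1, 0.5 / (n + lam))           # mean weights
    wc = wm.copy()                                     # covariance weights
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + (1.0 - alpha**2 + beta)
    Y = np.array([f(s) for s in sigma])                # push through f
    y_mean = wm @ Y
    diff = Y - y_mean
    y_cov = (wc[:, None] * diff).T @ diff
    return y_mean, y_cov

mean, cov = np.zeros(2), np.eye(2)
y_mean, y_cov = unscented_transform(
    mean, cov, lambda x: np.array([x[0]**2, x[1]]))
```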