FIGURE 5 - uploaded by Anargyros Chatzitofis
(Left) Initial MoCap skeleton structure mapped to 3D and 2D pose joint indices. (Right) Animations for the scanned actor were re-targeted to match the skeleton rig of the 3D model.

Source publication
Article
Full-text available
We introduce HUMAN4D, a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap, a volumetric capture and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse...

Contexts in source publication

Context 1
... captured animations of the actor whose body was subsequently scanned underwent a retargeting process by a professional 3D animator. The goal of this process was to adjust the recorded animations where slight differences exist between the captured MoCap skeleton structure and that of the rigged 3D model, as illustrated in Fig. 5 ...
Context 2
... spatio-temporal alignment between the modalities and the highly frequent and precise 3D motion capture enable the extraction of 3D poses accurately mapped onto the RGBD data cues. With a set of J = 33 joints, as depicted in Fig. 5, a 3D pose per frame f and skeleton s is mapped to every mRGBD frame. Then, by applying the inverse transformation of each camera pose and projecting the 3D joint positions onto the RGBD views, the 2D keypoints K are calculated ...
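The projection step described in this context can be sketched in a few lines. The following is a minimal illustration assuming a standard pinhole camera model; the function and variable names are illustrative and not taken from the HUMAN4D toolchain.

```python
import numpy as np

def project_joints(joints_world, world_to_cam, intrinsics):
    """Project (J, 3) world-space joint positions into one RGBD view.

    joints_world : (J, 3) 3D joint positions for one frame.
    world_to_cam : (4, 4) inverse of the camera pose (world-to-camera transform).
    intrinsics   : (3, 3) pinhole camera matrix of the view.
    Returns (J, 2) pixel coordinates, i.e. the 2D keypoints K.
    """
    num_joints = joints_world.shape[0]
    homogeneous = np.hstack([joints_world, np.ones((num_joints, 1))])  # (J, 4)
    cam_coords = (world_to_cam @ homogeneous.T).T[:, :3]               # camera-space XYZ
    pixels = (intrinsics @ cam_coords.T).T                             # perspective projection
    return pixels[:, :2] / pixels[:, 2:3]                              # divide by depth

# Illustrative usage with J = 33 joints and a hypothetical camera:
joints = np.random.rand(33, 3)
joints[:, 2] += 2.0                      # keep the joints in front of the camera
world_to_cam = np.eye(4)                 # inverse of the (identity) camera pose
intrinsics = np.array([[600.0, 0.0, 320.0],
                       [0.0, 600.0, 240.0],
                       [0.0, 0.0, 1.0]])
keypoints_2d = project_joints(joints, world_to_cam, intrinsics)  # (33, 2)
```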

Similar publications

Article
Full-text available
Affected by the electric field generated by power transmission lines, live workers on high-voltage transmission lines operate in a dangerous working environment. In this paper, posture perception, numerical simulation, and risk assessment are adopted to realize the deep fusion of the physical trajectory and the spatial virtual electric field distribution of live wor...

Citations

... Some existing highly domain-specific 3D video datasets (i.e. human bodies, human faces, etc.) have also been manually curated using a large number of sensors (Chatzitofis et al., 2020;Reimat et al., 2021;Yoon et al., 2021;Pagés et al., 2021). However, the use of these dynamic 3D scenes is limited due to their domain-specific nature. ...
Preprint
Full-text available
A recent frontier in computer vision has been the task of 3D video generation, which consists of generating a time-varying 3D representation of a scene. To generate dynamic 3D scenes, current methods explicitly model 3D temporal dynamics by jointly optimizing for consistency across both time and views of the scene. In this paper, we instead investigate whether it is necessary to explicitly enforce multiview consistency over time, as current approaches do, or if it is sufficient for a model to generate 3D representations of each timestep independently. We hence propose a model, Vid3D, that leverages 2D video diffusion to generate 3D videos by first generating a 2D "seed" of the video's temporal dynamics and then independently generating a 3D representation for each timestep in the seed video. We evaluate Vid3D against two state-of-the-art 3D video generation methods and find that Vid3D achieves comparable results despite not explicitly modeling 3D temporal dynamics. We further ablate how the quality of Vid3D depends on the number of views generated per frame. While we observe some degradation with fewer views, it remains minor. Our results thus suggest that 3D temporal knowledge may not be necessary to generate high-quality dynamic 3D scenes, potentially enabling simpler generative algorithms for this task.
... Datasets. For evaluation, we trained DragPoser using the DanceDB dataset [Aristidou et al. 2019] and evaluated with the HUMAN4D [Chatzitofis et al. 2020] and SOMA [Ghorbani and Black 2021] datasets, both part of the AMASS collection [Mahmood et al. 2019]. Our choice of AMASS was guided by its public availability and its use in previous works. ...
Preprint
Full-text available
High-quality motion reconstruction that follows the user's movements can be achieved by high-end mocap systems with many sensors. However, obtaining such animation quality with fewer input devices is gaining popularity as it brings mocap closer to the general public. The main challenges include the loss of end-effector accuracy in learning-based approaches and the lack of naturalness and smoothness in IK-based solutions. In addition, such systems are often finely tuned to a specific number of trackers and are highly sensitive to missing data, e.g., in scenarios where a sensor is occluded or malfunctions. In response to these challenges, we introduce DragPoser, a novel deep-learning-based motion reconstruction system that accurately represents hard and dynamic on-the-fly constraints, attaining high end-effector position accuracy in real time. This is achieved through a pose optimization process within a structured latent space. Our system requires only one-time training on a large human motion dataset, and then constraints can be dynamically defined as losses, while the pose is iteratively refined by computing the gradients of these losses within the latent space. To further enhance our approach, we incorporate a Temporal Predictor network, which employs a Transformer architecture to directly encode temporality within the latent space. This network ensures the pose optimization is confined to the manifold of valid poses and also leverages past pose data to predict temporally coherent poses. Results demonstrate that DragPoser surpasses both IK-based and the latest data-driven methods in achieving precise end-effector positioning, while producing natural poses and temporally coherent motion. In addition, our system showcases robustness against on-the-fly constraint modifications and exhibits exceptional adaptability to various input configurations and changes.
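To make the latent-space optimization idea concrete, here is a rough sketch of such a loop, assuming any decoder that maps a latent code to joint positions; all names, dimensions, and the single end-effector loss term are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

def optimize_pose_in_latent(decoder, z_init, targets, end_effector_ids,
                            steps=100, lr=0.05):
    """Iteratively refine a latent code so the decoded pose meets tracker constraints.

    decoder          : any module mapping a latent vector to (J, 3) joint positions.
    z_init           : initial latent code, e.g. proposed by a temporal predictor.
    targets          : (len(end_effector_ids), 3) desired end-effector positions.
    end_effector_ids : indices of the constrained joints.
    """
    z = z_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pose = decoder(z)                                         # decode a full-body pose
        loss = ((pose[end_effector_ids] - targets) ** 2).mean()   # constraint expressed as a loss
        loss.backward()                                           # gradients w.r.t. the latent code
        optimizer.step()
    return decoder(z).detach()

# Illustrative usage with a toy linear decoder (33 joints, 32-D latent space):
decoder = nn.Sequential(nn.Linear(32, 33 * 3), nn.Unflatten(-1, (33, 3)))
z0 = torch.zeros(32)
targets = torch.tensor([[0.3, 1.4, 0.2], [-0.3, 1.4, 0.2]])       # e.g. two hand trackers
refined = optimize_pose_in_latent(decoder, z0, targets, end_effector_ids=[15, 16])
```

Further constraints (orientations, look-at targets, smoothness) would simply be added as extra loss terms inside the same loop.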
... Studies have even shown that human-body scanning technology has become comparable in accuracy to expert measurers [2,3,4], who are still the gold standard [5]. These remarkable technological improvements and the availability of high-quality scanners to smaller laboratories have resulted in several publicly available, free, and high-quality 3D and 4D datasets of human bodies in tight or minimal clothing [6,7,8,9,10,11] and in everyday clothing [12,13,14,15,16,17,18]. The publicly available datasets are particularly useful for the research community, fostering advancements in anthropometry, textile research, computer vision, fitness, and medicine [1]. ...
... Human body pose estimation [37,38,39,40] is a particularly active research area and is important for several industrial applications such as fitness, sports, medicine, and entertainment [1]. To obtain the corresponding 3D pose ground-truth information during scanning, previous works have used either motion capture systems, such as IMU sensors [14,25], suits [10,11,24], or body markers [15], or markerless motion capture systems that rely on extracting accurate joint locations from multiple views [12,13,16,17,18]. There are many publicly available datasets, both indoor and outdoor (Section 2.3), that contain images and videos of people, as well as corresponding ground-truth 3D human pose information. ...
Conference Paper
Full-text available
With the advancements in high-quality scanning technology, which is becoming more accurate, convenient, and available to smaller labs, the recommendations and protocols should be discussed and established. The protocols are important to make sure that the required list of steps is carried out for the desired set of goals and to minimize the overall time required to carry out the preparation, scanning, and postprocessing steps. In this paper, we propose scanning protocols for clothed people in indoor and outdoor settings. The indoor settings should be more suitable for high-quality 3D scans, which should serve as a reference to the ground-truth human body. The outdoor setting should be more suitable for providing challenging scenarios, closer to the real world. We discuss the postprocessing steps required to align indoor and outdoor datasets for the best quality of ground-truth information. Finally, we overview existing use cases and applications using scanning datasets and recommend the corresponding protocols.
... Quantitative. We test our method using two datasets from AMASS [Mahmood et al. 2019] that have not been used for training in the learning-based methods: HUMAN4D [Chatzitofis et al. 2020] and SOMA [Ghorbani and Black 2021], which contain a variety of human activities captured by commercial marker-based motion capture systems. We chose AMASS as it is a well-known human motion database and is compatible with SMPL [Loper et al. 2015], which is required by the code provided by the authors of AvatarPoser, TransPose, and PIP. ...
... To evaluate the overall pose, Table 1 reports a real-time evaluation on HUMAN4D [Chatzitofis et al. 2020] and SOMA [Ghorbani and Black 2021]. We train our method with the DanceDB [Aristidou et al. 2019] and evaluate it with no added latency (Ours-0) and with access to 7 future frames (Ours-7). ...
Article
Full-text available
Accurate and reliable human motion reconstruction is crucial for creating natural interactions of full-body avatars in Virtual Reality (VR) and entertainment applications. As the Metaverse and social applications gain popularity, users are seeking cost-effective solutions to create full-body animations that are comparable in quality to those produced by commercial motion capture systems. To provide affordable solutions, though, it is important to minimize the number of sensors attached to the subject's body. Unfortunately, reconstructing the full-body pose from sparse data is a heavily under-determined problem. Some studies that use IMU sensors face challenges in reconstructing the pose due to positional drift and pose ambiguity. In recent years, some mainstream VR systems have released 6-degree-of-freedom (6-DoF) tracking devices providing positional and rotational information. Nevertheless, most solutions for reconstructing full-body poses rely on traditional inverse kinematics (IK) solutions, which often produce non-continuous and unnatural poses. In this paper, we introduce SparsePoser, a novel deep learning-based solution for reconstructing a full-body pose from a reduced set of six tracking devices. Our system incorporates a convolutional autoencoder that synthesizes high-quality continuous human poses by learning the human motion manifold from motion capture data. Then, we employ a learned IK component, made of multiple lightweight feed-forward neural networks, to adjust the hands and feet towards the corresponding trackers. We extensively evaluate our method on publicly available motion capture datasets and with real-time live demos. We show that our method outperforms state-of-the-art techniques using IMU sensors or 6-DoF tracking devices, and can be used for users with different body dimensions and proportions.
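As a rough illustration of the two-stage design described in this abstract (a pose autoencoder that learns the motion manifold, followed by a learned IK correction toward the trackers), the sketch below uses placeholder layer sizes, dimensions, and module names; it is not the SparsePoser architecture itself.

```python
import torch
import torch.nn as nn

class PoseAutoencoder(nn.Module):
    """Placeholder convolutional autoencoder over short windows of poses."""
    def __init__(self, pose_dim=66, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(pose_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=3, padding=1))
        self.decoder = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, pose_dim, kernel_size=3, padding=1))

    def forward(self, pose_window):          # pose_window: (batch, pose_dim, frames)
        return self.decoder(self.encoder(pose_window))

class LearnedIK(nn.Module):
    """Placeholder feed-forward network nudging a pose toward the tracker targets."""
    def __init__(self, pose_dim=66, tracker_dim=6 * 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + tracker_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim))

    def forward(self, pose, trackers):       # per-frame flattened pose and tracker signals
        return pose + self.net(torch.cat([pose, trackers], dim=-1))  # residual correction

# Illustrative shapes: batch of 4 windows of 32 frames, 66-D poses, 6 trackers x 7 values each.
autoencoder, ik = PoseAutoencoder(), LearnedIK()
windows = torch.randn(4, 66, 32)
reconstructed = autoencoder(windows)                         # poses kept on the learned manifold
adjusted = ik(reconstructed[:, :, -1], torch.randn(4, 42))   # pull hands/feet toward trackers
```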
... Following the pioneering dataset HumanEva [52], many other datasets have been acquired: Human3.6M [26], Total Capture [56], AMASS [38] or HUMAN4D [14]. The reflective markers allow extracting a good approximation of the pose, which is considered ground truth. ...
Preprint
Full-text available
This work presents 4DHumanOutfit, a new dataset of densely sampled spatio-temporal 4D human motion data of different actors, outfits and motions. The dataset is designed to contain different actors wearing different outfits while performing different motions in each outfit. In this way, the dataset can be seen as a cube of data containing 4D motion sequences along 3 axes with identity, outfit and motion. This rich dataset has numerous potential applications for the processing and creation of digital humans, e.g. augmented reality, avatar creation and virtual try on. 4DHumanOutfit is released for research purposes at https://kinovis.inria.fr/4dhumanoutfit/. In addition to image data and 4D reconstructions, the dataset includes reference solutions for each axis. We present independent baselines along each axis that demonstrate the value of these reference solutions for evaluation tasks.
... Meanwhile, the Human4D [1] dataset contains human activities simultaneously captured using both motion capture devices and depth sensors; the motion capture method utilized 24 cameras to achieve higher accuracy. The method also requires actors to wear a pure black uniform with color dots to locate key skeleton points and obtain reference animations of the actors, which is not practical in real-life scenarios. ...
Preprint
Full-text available
Recent years have witnessed a rapid development of immersive multimedia, which bridges the gap between the real world and virtual space. Volumetric videos, an emerging representative 3D video paradigm that empowers extended reality, stand out by providing an unprecedented immersive and interactive video-watching experience. Despite the tremendous potential, research on 3D volumetric video is still in its infancy and relies on sufficient and complete datasets for further exploration. However, existing volumetric video datasets mostly include only a single object, lacking details about the scene and the interactions within it. In this paper, we focus on the currently most widely used data format, the point cloud, and for the first time release a full-scene volumetric video dataset that includes multiple people and their daily activities interacting with the external environment. A comprehensive dataset description and analysis are provided, along with potential uses of the dataset. The dataset and additional tools can be accessed via the following website: https://cuhksz-inml.github.io/full_scene_volumetric_video_dataset/.
... A 3D dynamic imaging system is fundamental to observing, understanding, and interacting with the world. In recent years, a large body of work [3,6,12,13,14,19,21] has been devoted to indoor 3D imaging systems, and the data obtained by these systems have contributed greatly to research on accurate human pose estimation and mesh generation. If dynamic 3D imaging data of large-scale outdoor scenes can be obtained, it will be extremely beneficial to application scenarios such as security surveillance, sports game analysis, and cooperative vehicle-infrastructure systems (CVIS). ...
... Existing 3D imaging systems have mostly relied on cameras in small indoor scenes [6,12,14,19,22]. Ionescu et al. [12] employed multi-view RGB cameras and a Time-of-Flight (ToF) depth sensor to collect and publish the Human3.6M ...
... However, massive numbers of sensors and complex systems also make reproduction, construction, and free movement difficult. Chatzitofis et al. [6] collected the Human4D dataset, consisting of 50,000 multi-view 3D imaging samples captured in a room equipped with 24 cameras rigidly placed on the walls and 4 Intel RealSense D415 depth sensors. In this capturing system, the four depth sensors were synchronized using the HW-Synced method in order to enable research on multi-view 3D data with high-precision spatio-temporal alignment. ...
Preprint
Multi-view imaging systems enable uniform coverage of 3D space and reduce the impact of occlusion, which is beneficial for 3D object detection and tracking accuracy. However, existing imaging systems built with multi-view cameras or depth sensors are limited by small applicable scenes and complicated composition. In this paper, we propose a wireless multi-view multi-modal 3D imaging system generally applicable to large outdoor scenes, which consists of a master node and several slave nodes. Multiple spatially distributed slave nodes equipped with cameras and LiDARs are connected to form a wireless sensor network. While providing flexibility and scalability, the system applies automatic spatio-temporal calibration techniques to obtain accurate 3D multi-view multi-modal data. Among existing 3D imaging systems, this is the first to integrate multi-view RGB cameras and LiDARs in large outdoor scenes. We perform point cloud-based 3D object detection and long-term tracking using the 3D imaging dataset collected by this system. The experimental results show that multi-view point clouds greatly improve 3D object detection and tracking accuracy regardless of complex and various outdoor environments.
... However, none of the aforementioned datasets provide audio data. Volumetric datasets that also recorded the audio component are HUMAN4D [Chatzitofis et al. 2020], MHAD [Ofli et al. 2013], CMUPanoptic [Joo et al. 2015], and NAVVS [Stenzel et al. 2021], even though only MHAD and NAVVS released the audio recordings. Even in these rare exceptions, the audio sources are recorded with dedicated microphones and not with the intention of capturing the directional cues of the acoustic sound-field of the performance, as would be possible with microphone arrays or binaural microphones. ...
Preprint
Full-text available
3D audio-visual production aims to deliver immersive and interactive experiences to the consumer. Yet, faithfully reproducing real-world 3D scenes remains a challenging task. This is partly due to the lack of available datasets enabling audio-visual research in this direction. In most of the existing multi-view datasets, the accompanying audio is neglected. Similarly, datasets for spatial audio research primarily offer unimodal content, and when visual data is included, the quality is far from meeting the standard production needs. We present "Tragic Talkers", an audio-visual dataset consisting of excerpts from the "Romeo and Juliet" drama captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-people conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays. Additionally, we provide voice activity labels, 2D face bounding boxes for each camera view, 2D pose detection keypoints, 3D tracking data of the mouth of the actors, and dialogue transcriptions. We believe the community will benefit from this dataset as it can assist multidisciplinary research. Possible uses of the dataset are discussed.
... Table 1 compares existing mocap datasets and our motion data. Note that AMASS [49] is a collection of several existing motion capture datasets [26,46,64,36,31,29,28,50,35,33,61,50,54,27,47,66,40,65]. Our dataset contains the largest number of motions with the longest consecutive 3D human motions. Table 2 summarises the included motion categories. ...
Chapter
We present UnrealEgo, i.e., a new large-scale naturalistic dataset for egocentric 3D human pose estimation. UnrealEgo is based on an advanced concept of eyeglasses equipped with two fisheye cameras that can be used in unconstrained environments. We design their virtual prototype and attach them to 3D human models for stereo view capture. We next generate a large corpus of human motions. As a consequence, UnrealEgo is the first dataset to provide in-the-wild stereo images with the largest variety of motions among existing egocentric datasets. Furthermore, we propose a new benchmark method with a simple but effective idea of devising a 2D keypoint estimation module for stereo inputs to improve 3D human pose estimation. The extensive experiments show that our approach outperforms the previous state-of-the-art methods qualitatively and quantitatively. UnrealEgo and our source codes are available on our project web page (https://4dqv.mpi-inf.mpg.de/UnrealEgo/). Keywords: Egocentric 3D human pose estimation; Naturalistic data