Figure 2
The data capture setup. a) 2mm magnetic sensors. The larger rectangular sensors are not used. b) A fingertip sensor 

Source publication
Conference Paper
Full-text available
We investigate a novel global orientation regression approach for articulated objects using a deep convolutional neural network. This is integrated with an in-plane image derotation scheme, DeROT, to tackle the problem of per-frame fingertip detection in depth images. The method reduces the complexity of learning in the space of articulated poses w...
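The derotation idea summarized above (DeROT) amounts to rotating each depth frame by a predicted in-plane angle so the hand appears in a canonical orientation before fingertip detection. A minimal sketch of such a derotation step, using OpenCV and a hypothetical angle predictor, not the paper's implementation:

```python
import cv2
import numpy as np

def derotate(depth_frame: np.ndarray, in_plane_angle_deg: float) -> np.ndarray:
    """Rotate a depth frame so the hand appears in a canonical in-plane
    orientation; the angle would come from an orientation regressor."""
    h, w = depth_frame.shape[:2]
    center = (w / 2.0, h / 2.0)
    # Rotation that undoes the predicted in-plane angle.
    rot = cv2.getRotationMatrix2D(center, -in_plane_angle_deg, 1.0)
    # Nearest-neighbour interpolation avoids inventing intermediate depths.
    return cv2.warpAffine(depth_frame, rot, (w, h), flags=cv2.INTER_NEAREST)

# Usage (hypothetical predictor): canonical = derotate(frame, orientation_cnn(frame))
```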

Contexts in source publication

Context 1
... prevents lateral and medial movement along the finger. This can be seen in Figure 2. The skin-tight elastic loops have a significant additional benefit over gloves: they do not affect the depth profile or hand movements, and thus do not pollute the data. ...
Context 2
... do this by positioning the magnetic sensors on the corners of a checkerboard pattern, thereby creating a physical correspondence between the detected corner locations and the actual sensors. This setup can be seen in Figure 2. We use the extracted 2D locations of the corner points on the calibration board [4] together with the sampled sensor 3D locations to perform EPnP [15] to determine the extrinsic configuration between the devices. ...
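The calibration described in this context can be sketched with standard OpenCV calls: detect the checkerboard corners, pair them with the sampled 3D sensor positions, and solve PnP with the EPnP flag. The board size, intrinsics, and function below are illustrative placeholders, not the authors' code:

```python
import cv2
import numpy as np

def calibrate_extrinsics(gray_image, sensor_points_3d, camera_matrix, dist_coeffs,
                         board_size=(8, 6)):
    """Estimate the rotation/translation between the magnetic tracker frame
    and the camera from corner/sensor correspondences using EPnP."""
    found, corners = cv2.findChessboardCorners(gray_image, board_size)
    if not found:
        raise RuntimeError("checkerboard not detected")
    corners = cv2.cornerSubPix(
        gray_image, corners, (5, 5), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    # sensor_points_3d: (N, 3) magnetic sensor readings placed at the corners,
    # ordered to match the detected corner order.
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(sensor_points_3d, dtype=np.float32),
        corners.reshape(-1, 1, 2).astype(np.float32),
        camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```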

Similar publications

Conference Paper
Full-text available
A print property closely related to color and with a strong dependency on the choices made for the sake of color separation is the smoothness of a printed pattern. While certain obvious choices are simply related to grain, such as the degree of black ink use in colorant-channel imaging pipelines, the domain of all possible printable patterns that H...

Citations

... In addition to a method for generating instance segmentation masks, a suitable dataset is of major importance for both training and testing a model. There are several datasets with segmentation labels for hands [13][14][15][16][17][18][19], but few of these contain objects as well as labels. An overview of the datasets most similar to the dataset presented in this paper is presented in Table 1, and the sizes, modalities, and available label types in these datasets are also listed. ...
... An overview of the datasets most similar to the dataset presented in this paper is presented in Table 1, and the sizes, modalities, and available label types in these datasets are also listed. HandNet [13] and the dataset presented in [14] contain depth and even thermal images but lack labeled objects. In contrast, the WorkingHands dataset [17] contains segmentation labels for objects but no thermal images. ...
... HandNet [13]: 202.9 K samples, 0 objects; Hand-CNN [16]: 40.5 K samples, 0 objects *; WorkingHands [17]: 7.9 K samples, 37 (13) objects; EgoHands [18]: 4.8 K samples, 0 objects; Kim et al. [14]: 401 K samples, 0 objects; ContactPose [19]: 2.5 M samples, 25 (25) objects **; OHO (ours): 5.3 K samples, 43 (32) objects. * contains samples from COCO, and it is not specified how many labels can be reused; ** objects are 3D-printed in blue. ...
Article
Full-text available
In the context of collaborative robotics, handing over hand-held objects to a robot is a safety-critical task. Therefore, a robust distinction between human hands and presented objects in image data is essential to avoid contact with robotic grippers. To be able to develop machine learning methods for solving this problem, we created the OHO (Object Hand-Over) dataset of tools and other everyday objects being held by human hands. Our dataset consists of color, depth, and thermal images with the addition of pose and shape information about the objects in a real-world scenario. Although the focus of this paper is on instance segmentation, our dataset also enables training for different tasks such as 3D pose estimation or shape estimation of objects. For the instance segmentation task, we present a pipeline for automated label generation in point clouds, as well as image data. Through baseline experiments, we show that these labels are suitable for training an instance segmentation to distinguish hands from objects on a per-pixel basis. Moreover, we present qualitative results for applying our trained model in a real-world application.
... Annotating the 3D position of hand joints from a single RGB image is inherently impossible without any prior information or additional sensors due to an ill-posed condition. To assign accurate hand pose labels, hand-marker-based annotation using magnetic sensors (Wetzler et al., 2015; Yuan et al., 2017), motion capture systems (Miyata et al., 2004; Schröder et al., 2015; Taheri et al., 2020), or hand gloves (Bianchi et al., 2013; Glauser et al., 2019; Wang & Popovic, 2009) has been studied. These sensors can provide 6-DoF information (i.e., location and orientation) of attached markers and enable us to calculate the coordinates of full hand joints from the tracked markers. ...
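How a 6-DoF marker reading turns into a joint coordinate can be illustrated as transforming a fixed offset, expressed in the marker's local frame, by the marker's rotation and translation; the offset value and quaternion convention below are assumptions for illustration:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def joint_from_marker(marker_position_m, marker_quat_xyzw, local_offset_m):
    """Map a joint offset, measured once in the marker's local frame, into
    world coordinates using the marker's 6-DoF (position + orientation) reading."""
    rotation = R.from_quat(marker_quat_xyzw)  # scipy expects [x, y, z, w]
    return np.asarray(marker_position_m) + rotation.apply(local_offset_m)

# e.g. a joint 8 mm along the sensor's local x-axis (offset is an assumed value):
# joint_xyz = joint_from_marker(pos, quat, local_offset_m=[0.008, 0.0, 0.0])
```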
... To address this issue, hand-marker-based tracking (Taheri et al., 2020; Wetzler et al., 2015; Yuan et al., 2017) and multi-view camera studios (Chao et al., 2021; Hampali et al., 2020; Moon et al., 2020; Simon et al., 2017; Zimmermann et al., 2019) have been studied. The hand markers offer 6-DoF information during these occlusions, so hand-marker-based annotation is robust to occlusion. ...
... As shown in Fig. 4, generating data using synthetic models (Hasson et al., 2019; Mueller et al., 2017, 2018; Zimmermann & Brox, 2017) is cost-effective, but it creates unrealistic hand texture (Ohkawa et al., 2021). Although hand-marker-based annotation (Taheri et al., 2020; Wetzler et al., 2015; Yuan et al., 2017) can automatically track the hand joints from the information of hand sensors, the sensors distort the hand appearance and hinder natural hand movement. In-lab data acquired by multi-camera setups (Chao et al., 2021; Hampali et al., 2020; Moon et al., 2020; Simon et al., 2017; Zimmermann et al., 2019) make the annotation easier because they can reduce the occlusion effect. ...
Article
Full-text available
In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and learning. 3D hand pose estimation has been an important research area owing to its potential to enable various applications, such as video understanding, AR/VR, and robotics. However, the performance of models is tied to the quality and quantity of annotated 3D hand poses. Under the status quo, acquiring such annotated 3D hand poses is challenging, e.g., due to the difficulty of 3D annotation and the presence of occlusion. To reveal this problem, we review the pros and cons of existing annotation methods classified as manual, synthetic-model-based, hand-sensor-based, and computational approaches. Additionally, we examine methods for learning 3D hand poses when annotated data are scarce, including self-supervised pretraining, semi-supervised learning, and domain adaptation. Based on the study of efficient annotation and learning, we further discuss limitations and possible future directions in this field.
... Another approach to labelling automation is to assume that the hand is the closest object to the camera [16], so that it can be found by applying a color threshold to the image. An alternative approach [17] is to use tracking sensors or infrared markers [18] fastened to the hand and fingers to automatically generate the labels. ...
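The "closest object to the camera" heuristic mentioned above can be sketched as a depth threshold around the nearest valid pixel (the color-threshold variant would operate analogously on a color channel); the margin value is an assumed parameter:

```python
import numpy as np

def closest_object_mask(depth_m: np.ndarray, margin_m: float = 0.15) -> np.ndarray:
    """Label every pixel within `margin_m` of the nearest valid depth as 'hand'.
    Assumes invalid depth pixels are encoded as 0."""
    valid = depth_m > 0
    nearest = depth_m[valid].min()
    return valid & (depth_m <= nearest + margin_m)
```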
Article
Full-text available
In this paper, we focus on the problem of applying domain randomization to produce synthetic datasets for training depth image segmentation models for the task of hand localization. We provide new synthetic datasets for industrial environments suitable for various hand tracking applications, as well as ready-to-use pre-trained models. The presented datasets are analyzed to identify the characteristics that affect the generalizability of the trained models, and recommendations are given for adapting the simulation environment to achieve satisfactory results when creating datasets for specialized applications. Our approach is not limited by the shortcomings of standard analytical methods, such as color, specific gestures, or hand orientation. The models in this paper were trained solely on a synthetic dataset and were never trained on real camera images; nevertheless, we demonstrate that our most diverse datasets allow the models to achieve up to 90% accuracy. The proposed hand localization system is designed for industrial applications where the operator shares the workspace with the robot.
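Domain randomization, as described in this abstract, comes down to re-sampling scene parameters for every rendered frame so that a model cannot overfit to a single appearance; the parameter set and ranges below are illustrative guesses, not the paper's configuration:

```python
import random

def sample_scene_parameters(rng: random.Random) -> dict:
    """Draw one randomized configuration for rendering a synthetic depth frame."""
    return {
        "camera_height_m": rng.uniform(0.5, 2.0),       # camera placement
        "camera_pitch_deg": rng.uniform(-60.0, -10.0),
        "hand_yaw_deg": rng.uniform(0.0, 360.0),        # hand orientation
        "num_clutter_objects": rng.randint(0, 10),      # industrial clutter
        "depth_noise_std_m": rng.uniform(0.0, 0.01),    # sensor noise model
    }

# One fresh configuration per frame keeps the rendered data diverse:
# params = sample_scene_parameters(random.Random(0))
```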
... This challenge was addressed by making use of graphically generated hands [62,63,123], which may introduce a domain gap between real and synthetic data. For real data, an auxiliary input such as magnetic sensors was used. (Dataset table excerpt: [27], 337 subjects, 15 cameras; 3DMM [14], 200 subjects, 3D scanner; BFM [67], 200 subjects, 3D scanner; ICL [17], 10,000 subjects, 3D scanner; FaceScape [103], 938 subjects, 68 DSLR cameras; NYU Hand [95], 2 subjects (81K samples), depth camera; HandNet [101], 10 subjects (213K samples), depth camera and magnetic sensor; BigHand 2.2M [110], 10 subjects (2.2M samples), depth camera and magnetic sensor; RHD [123], 20 subjects (44K samples), N/A (synthesized); STB [112], 1 subject (18K samples), 1 pair of stereo cameras; FreiHand [21], N/A (33K samples), 8 cameras; CMU Mocap, ∼100 subjects, marker-based; CMU Skin Mocap [65], <10 ...) ...
... to precisely measure the joint angle and recover 3D hand pose using forward kinematics [101,110]. Notably, a multi-camera system has been used to annotate hands using 3D bootstrapping [82], which provided the hand annotations for RGB data. FreiHAND [21] leveraged the MANO [72] mesh model to represent dense hand pose. ...
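Recovering joint positions from measured joint angles via forward kinematics, as referenced for [101,110], amounts to chaining per-joint rotations along each finger; the planar chain below is a toy illustration under assumed bone lengths, not the datasets' actual kinematic model:

```python
import numpy as np

def finger_forward_kinematics(base_xyz, bone_lengths_m, flexion_angles_rad):
    """Toy planar forward kinematics: accumulate flexion angles along one
    finger and return the 3D position of every joint (flexion about z only)."""
    joints = [np.asarray(base_xyz, dtype=float)]
    angle = 0.0
    for length, theta in zip(bone_lengths_m, flexion_angles_rad):
        angle += theta
        step = length * np.array([np.cos(angle), np.sin(angle), 0.0])
        joints.append(joints[-1] + step)
    return np.stack(joints)

# Example: a finger with 40/25/20 mm bones flexed 20 degrees at each joint.
# finger_forward_kinematics([0, 0, 0], [0.040, 0.025, 0.020], np.radians([20, 20, 20]))
```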
Preprint
Full-text available
This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling view-specific appearance and geometry of five primary body signals including gaze, face, hand, body, and garment from assorted people. 107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and style. With the multiview image streams, we reconstruct high fidelity body expressions using 3D mesh models, which allows representing view-specific appearance. We demonstrate that HUMBI is highly effective in learning and reconstructing a complete human model and is complementary to the existing datasets of human body expressions with limited views and subjects such as MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio datasets. Based on HUMBI, we formulate a new benchmark challenge of a pose-guided appearance rendering task that aims to substantially extend photorealism in modeling diverse human expressions in 3D, which is the key enabling factor of authentic social tele-presence. HUMBI is publicly available at http://humbi-data.net
... Sensor modality by viewpoint: third person: RGBD [26], [48]-[51]; RGB [8], [9], [36], [47]; stereo [11]. Egocentric: RGBD [42], [46], [48]; RGB [8], [9], [36], [47]; stereo [proposed]. ...
Article
Full-text available
Egocentric hand pose estimation is of particular interest for wearable cameras. Several studies on hand pose estimation have recently been presented based on RGBD or RGB sensors. Although these methods provide accurate hand pose estimation, they have several limitations. For example, RGB-based techniques have intrinsic difficulty in converting relative 3D poses into absolute 3D poses, and RGBD-based techniques only work in indoor environments. Recently, stereo-sensor-based techniques have gained increasing attention owing to their potential to overcome these limitations. However, to the best of our knowledge, there are few techniques and no real datasets available for egocentric stereo vision. In this paper, we propose a top-down pipeline for estimating absolute 3D hand poses using stereo sensors, as well as a novel dataset for training. Our top-down pipeline consists of two steps: hand detection, which detects hand areas, and hand pose estimation, which estimates the positions of the hand joints. In particular, for hand pose estimation with a stereo camera, we propose an attention-based architecture called StereoNet, a geometry-based loss function called StereoLoss, and a novel 2D disparity map called StereoDMap for effective stereo feature learning. To collect the dataset, we proposed a novel annotation method that helps reduce human annotation efforts. Our dataset is publicly available at https://github.com/seo0914/SEH. We conducted comprehensive experiments to demonstrate the effectiveness of our approach compared with the state-of-the-art methods.
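The reason stereo can recover absolute rather than relative 3D pose, as this abstract argues, is that disparity maps directly to metric depth through the standard pinhole relation z = f·b/d; a minimal sketch with placeholder focal length and baseline, unrelated to StereoDMap itself:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation z = f * b / d; non-positive disparities map to NaN."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(disparity_px > 0,
                        focal_px * baseline_m / disparity_px,
                        np.nan)

# Example with assumed camera parameters (focal length in pixels, baseline in metres):
# depth_m = depth_from_disparity(disparity_map, focal_px=600.0, baseline_m=0.06)
```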
... L. Baraldi et al. [10] presented a method for monocular hand gesture recognition in ego-vision scenarios that can work in nearly real time on wearable devices. Wetzler et al. [11] used a CNN-based fingertip detection model with a Kinect camera. A global orientation regression approach was proposed and depth images were used to predict the position of fingertips. ...
Article
Full-text available
This research investigated real-time fingertip detection in frames captured from the increasingly popular wearable device, smart glasses. The egocentric-view fingertip detection and character recognition can be used to create a novel way of inputting texts. We first employed Unity3D to build a synthetic dataset with pointing gestures from the first-person perspective. The obvious benefits of using synthetic data are that they eliminate the need for time-consuming and error-prone manual labeling and they provide a large and high-quality dataset for a wide range of purposes. Following that, a modified Mask Regional Convolutional Neural Network (Mask R-CNN) is proposed, consisting of a region-based CNN for finger detection and a three-layer CNN for fingertip location. The process can be completed in 25 ms per frame for 640×480 RGB images, with an average error of 8.3 pixels. The speed is high enough to enable real-time “air-writing”, where users are able to write characters in the air to input texts or commands while wearing smart glasses. The characters can be recognized by a ResNet-based CNN from the fingertip trajectories. Experimental results demonstrate the feasibility of this novel methodology.
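The two-stage idea in this abstract, detect a finger region and then localize the fingertip inside the crop, can be sketched as a small coordinate-regression CNN on the cropped patch; the layer sizes, crop size, and class below are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FingertipRegressor(nn.Module):
    """Three-layer CNN mapping a cropped finger patch to a normalized (x, y)
    fingertip location inside the crop."""
    def __init__(self, crop_size: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * (crop_size // 8) ** 2, 2),
            nn.Sigmoid(),  # coordinates normalized to [0, 1] within the crop
        )

    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        # crop: (B, 3, crop_size, crop_size) patch from the finger detector
        return self.head(self.features(crop))
```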
... A total of 72,757 depth frames were captured on a single person for the training dataset, whereas 8252 were captured on two people for the test dataset. Three Kinects were used to capture each frame, one from the front and two on the sides [1]. (Table 1, depth datasets: ContactPose [88], Ego3DHands [89], SynHandEgo [90], ObMan [70], FHAD [14], SynHand5M [91], BigHand2.2M [92], EgoDexter [93], SynthHands [93], RHD [15], STB [13], Dexter+Object [94], MSRC [95], MSRA15 [96], HandNet [97], Hands in Action [98], ICVL [99], NYU [1], MSRA14 [4], UCI-EGO [100], ASTAR [101], Dexter1 [102].) ...
... MSRA15, or the MSRA hand gesture database, captures 17 different gestures on the right hands of nine test subjects using a Creative Interactive Gesture Camera. With roughly 500 frames saved for every gesture, the dataset contains 76,375 frames in total, with 21 joints from a third-person perspective [97] . ...
... However, an advantage that is unique to collecting hand data is the availability of data gloves and magnetic sensors, which can collect the positions of joints or fingers. To this end, several existing depth datasets, such as ASTAR [101] , HandNet [97] , FHAD [14] , and BigHand2.2M [92] , use this technology to generate or assist the process of making annotations. ...
Article
The field of vision-based human hand three-dimensional (3D) shape and pose estimation has attracted significant attention recently owing to its key role in various applications, such as natural human-computer interaction. With the availability of large-scale annotated hand datasets and the rapid development of deep neural networks (DNNs), numerous DNN-based data-driven methods have been proposed for accurate and rapid hand shape and pose estimation. Nonetheless, complicated hand articulation, depth and scale ambiguities, occlusions, and finger similarity remain challenging. In this study, we present a comprehensive survey of state-of-the-art 3D hand shape and pose estimation approaches using RGB-D cameras. Related RGB-D cameras, hand datasets, and a performance analysis are also discussed to provide a holistic view of recent achievements. We also discuss the research potential of this rapidly growing field.
... In addition to releasing the database, we will also make available the code for the robust object segmentation method introduced in this paper, as well as the ground-truth region annotation that we manually created for the MURA X-ray image dataset of the human upper extremity, so that researchers can follow up on our work. In addition, using the HandNet dataset (Wetzler et al., 2015), we obtained the heatmap images and IR images of human hands. As this dataset does not contain other components of the upper extremity (i.e. ...
... As this dataset does not contain other components of the upper extremity (i.e. elbow, forearm, humerus, shoulder and wrist), we use the HandNet dataset (Wetzler et al., 2015) to validate our method and to demonstrate its applicability in other modalities. ...
... Using the HandNet (Wetzler et al., 2015) dataset, we obtained IR images and heatmap images along with their ground-truth mask annotations. The HandNet dataset contains 214,971 images in total. ...
Article
We propose a modality-invariant method to obtain high-quality semantic object segmentation of human body parts, for four imaging modalities which consist of visible images, X-ray images, thermal images (heatmaps) and infrared radiation (IR) images. We first consider two modalities (i.e. visible and X-ray images) to develop an architecture suitable for multi-modal semantic segmentation. Due to the intrinsic difference between images from the two modalities, state-of-the-art approaches such as Mask R-CNN do not perform satisfactorily. Insights from analysing how the intermediate layers within Mask R-CNN work on both visible and X-ray modalities have led us to propose a new and efficient network architecture which yields highly accurate semantic segmentation results across both visible and X-ray domains. We design multi-task losses to train the network across different modalities. By conducting multiple experiments across visible and X-ray images of the human upper extremity, we validate the proposed approach, which outperforms the traditional Mask R-CNN method through better exploiting the output features of CNNs. Based on the insights gained on these images from visible and X-ray domains, we extend the proposed multi-modal semantic segmentation method to two additional modalities (viz. heatmap and IR images). Experiments conducted on these two modalities further confirm our architecture’s capacity to improve the segmentation by exploiting the complementary information in the different modalities of the images. Our method can also be applied to include other modalities and can be effectively utilized for several tasks including medical image analysis tasks such as image registration and 3D reconstruction across modalities.
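The multi-task training across modalities mentioned here can be pictured as a weighted sum of per-modality segmentation losses; the weighting scheme and cross-entropy choice below are assumptions for illustration, not the paper's loss design:

```python
import torch.nn.functional as F

def multimodal_segmentation_loss(logits_by_modality, masks_by_modality, weights=None):
    """Weighted sum of per-modality cross-entropy losses (e.g. visible, X-ray,
    heatmap, IR), so a single network is trained across all modalities.
    logits: (B, C, H, W) float tensors; masks: (B, H, W) long tensors."""
    weights = weights or {m: 1.0 for m in logits_by_modality}
    total = 0.0
    for modality, logits in logits_by_modality.items():
        total = total + weights[modality] * F.cross_entropy(
            logits, masks_by_modality[modality])
    return total
```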
... HandNet [138] includes 202 K training and 10 K testing depth images, as well as more than 2.7 K samples for validation, with a resolution of 320 × 240. Samples were recorded with a RealSense RGB-D camera and depict the 6D postures of the hand, as well as the position and orientation of each fingertip. ...
Article
Full-text available
The field of 3D hand pose estimation has been gaining a lot of attention recently, due to its significance in several applications that require human-computer interaction (HCI). The utilization of technological advances, such as cost-efficient depth cameras coupled with the explosive progress of Deep Neural Networks (DNNs), has led to a significant boost in the development of robust markerless 3D hand pose estimation methods. Nonetheless, finger occlusions and rapid motions still pose significant challenges to the accuracy of such methods. In this survey, we provide a comprehensive study of the most representative deep learning-based methods in literature and propose a new taxonomy heavily based on the input data modality, being RGB, depth, or multimodal information. Finally, we demonstrate results on the most popular RGB and depth-based datasets and discuss potential research directions in this rapidly growing field.
... We focus our attention on the hand pose estimation task using only RGB images as input. This is significantly more challenging than hand pose estimation when a depth sensor is also available [58,47,16,38,54]. Hand mesh recovery is a more general task as it aims at reconstructing a relatively dense point cloud and estimates both the pose and subject-specific shape of the target object. ...
Conference Paper
We introduce a simple and effective network architecture for monocular 3D hand pose estimation consisting of an image encoder followed by a mesh convolutional decoder that is trained through a direct 3D hand mesh reconstruction loss. We train our network by gathering a large-scale dataset of hand action in YouTube videos and use it as a source of weak supervision. Our weakly-supervised, mesh-convolution-based system largely outperforms state-of-the-art methods, even halving the errors on the in-the-wild benchmark. The dataset and additional resources are available at https://arielai.com/mesh_hands.
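The direct 3D hand mesh reconstruction loss referred to in this abstract can be sketched as a per-vertex distance between predicted and reference meshes; the L1 form below is a generic choice, not necessarily the authors' exact loss:

```python
import torch

def mesh_reconstruction_loss(pred_vertices: torch.Tensor,
                             gt_vertices: torch.Tensor) -> torch.Tensor:
    """Mean per-vertex L1 distance between predicted and reference hand meshes,
    both shaped (batch, num_vertices, 3) and expressed in the same frame."""
    return (pred_vertices - gt_vertices).abs().mean()
```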