Figure 2
Finger models. a) Hand 3D model used for the dataset generation; b) unique color labels used to identify surface points on the hand; c) DOF for different joints, with joints indexed by assigned numbers (this panel also shows how the skeleton is fitted inside the hand); d) palm coordinate system, on which the finger parameters are computed.

... is solved separately. In the model, the palm pose is detected first. We assume the palm is rigid and refer to the palm pose as the composition, in 3D space, of the wrist and the base joints of all fingers except the thumb (i.e., joints 4, 8, 12, 16 and 20 in Fig. 2(c)). Sun et al. [26] regress the palm pose by iteratively refining an initial pose. Sharp et al. [21] estimate a global viewpoint and iteratively fit a model by generating hypothesis candidates. It has been shown that NN-based approaches perform well in practice [8]. In this work we rely on the extracted nearest shapes both to estimate the palm pose and to segment the hand. Nearest shapes can vary in shape and pose, so they must first be aligned to each other. We use the palm joints of the nearest shapes to align them through Procrustes analysis, which yields a uniform and smooth distribution of palm points in the point cloud of the nearest shapes. Given this point cloud with its corresponding labels l_i, we find an affine transformation A with scaling factor s to the hand point cloud P by applying iterative closest point (ICP) matching [36]. For faster convergence, we modify the ICP process so that closest points are searched only within the group of points sharing the same label; the pixel labels of the test frame were estimated by the RF beforehand. We then obtain the palm joints by transforming the nearest-shape joints given A and s. Although our trained RF can segment the hand, it is not reliable in some situations, especially when distinguishing fingers (see Fig. 4.2 for some samples). Correct hand segmentation is critical for the accuracy of our approach: since we fit a finger model based on the segmented pixels of that finger, an incorrectly segmented finger immediately causes a failed pose. Quadratic discriminant analysis provides a proper way to assign each point in the query point cloud a label from the aligned point cloud of ...
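A minimal sketch of the label-constrained ICP alignment described above, assuming numpy/scipy point clouds with per-point labels; the Umeyama similarity-transform solver and all function names here are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def umeyama(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping src onto dst; src and dst are corresponding (N, 3) points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def label_constrained_icp(model_pts, model_labels, query_pts, query_labels,
                          iters=20):
    """ICP variant in which correspondences are searched only among points
    that share the same (RF-estimated) label, as in the modified ICP above."""
    labels = np.intersect1d(model_labels, query_labels)
    trees = {l: cKDTree(model_pts[model_labels == l]) for l in labels}
    s, R, t = 1.0, np.eye(3), np.zeros(3)
    for _ in range(iters):
        cur = (s * (R @ query_pts.T)).T + t     # apply current estimate
        src, dst = [], []
        for l in labels:
            mask = query_labels == l
            if not mask.any():
                continue
            _, nn = trees[l].query(cur[mask])   # nearest same-label points
            src.append(query_pts[mask])
            dst.append(model_pts[model_labels == l][nn])
        s, R, t = umeyama(np.vstack(src), np.vstack(dst))
    return s, R, t
```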

Source publication
Article
Full-text available
State-of-the-art approaches on hand pose estimation from depth images have reported promising results under quite controlled considerations. In this paper we propose a two-step pipeline for recovering the hand pose from a sequence of depth images. The pipeline has been designed to deal with images taken from any viewpoint and exhibiting a high degr...

Contexts in source publication

Context 1
... order to evaluate our method on highly-variable poses and viewpoints, as well as temporal analysis, we created a rich synthetic dataset mimicking the features of commodity depth cameras (Sec. 4.1). We illustrate some properties of the hand model used to create this dataset in Fig. 2. We created a hand model with 25 semantic segments used as low-level pixel labels in the dataset. At a higher level of semantics, we segmented the hand by assigning each pixel a label from the set L = {l 1 , ..., l 6 }, where L represents fingers and the palm. Next, we detail the main components of the proposed ...
Context 2
... fit a simple finger model for each finger separately to get the finger poses. Each finger model S is composed of three cylinders and half-spheres, except for the thumb, which is composed of an ellipsoid, two cylinders and three half-spheres. Finger model parameters are computed based on the palm coordinate system (see Fig. ...
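The finger model above is a kinematic chain of simple primitives; a hedged sketch of its forward kinematics in the palm coordinate system follows. The rest direction along +y, the angle parameterization, and all dimensions are assumptions for illustration; cylinder and half-sphere surfaces would be swept along these segments.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def finger_joint_positions(base, lengths, flex, abd, bend1, bend2):
    """Forward kinematics of a three-segment finger in the palm coordinate
    system: 2 DOF at the base (flexion `flex`, abduction `abd`) and 1 DOF
    at each remaining joint. All angles are in radians; the DOF split is
    a common convention, assumed here rather than taken from the paper."""
    joints = [np.asarray(base, dtype=float)]
    R = rot_z(abd) @ rot_x(flex)                 # base joint orientation
    for length, extra_flex in zip(lengths, (0.0, bend1, bend2)):
        R = R @ rot_x(extra_flex)                # accumulate flexion down the chain
        joints.append(joints[-1] + R @ np.array([0.0, length, 0.0]))
    return np.stack(joints)                      # (4, 3): base, PIP, DIP, tip

# Example: a finger with 45, 35, 25 mm phalanges, slightly bent.
pts = finger_joint_positions(base=[20.0, 0.0, 0.0], lengths=(45.0, 35.0, 25.0),
                             flex=0.3, abd=0.1, bend1=0.4, bend2=0.2)
```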
Context 3
... generation. Datasets were generated with Blender 2.74 using a detailed, realistic 3D model of a human adult male hand (Fig. 2(a)). The model was rigged using a hand skeleton (Fig. 2(c)) with four bones per finger, reproducing the distal, intermediate, and proximal phalanges, as well as the metacarpals. The thumb had no intermediate phalanx and was controlled with three bones. Additional bones were used to control palm and wrist rotation. Unfeasible ...
Context 4
... generation. Datasets were generated with Blender 2.74 using a detailed, realistic 3D model of a human adult male hand (Fig. 2(a)). The model was rigged using a hand skeleton (Fig. 2(c)) with four bones per finger, reproducing the distal, intermediate, and proximal phalanges, as well as the metacarpals. The thumb had no intermediate phalanx and was controlled with three bones. Additional bones were used to control palm and wrist rotation. Unfeasible hand poses were avoided by defining per-bone rotation ...
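The excerpt above mentions avoiding unfeasible poses via per-bone rotation limits. A minimal Blender Python sketch of that idea, using the standard Limit Rotation constraint; the armature name, bone names, and angle ranges are assumptions, not the paper's values.

```python
import math
import bpy

# Illustrative per-bone flexion limits (degrees) in local bone space.
# Bone names and ranges are assumptions, not the paper's exact values.
LIMITS = {
    "index_proximal":     (-10.0, 90.0),
    "index_intermediate": (0.0, 100.0),
    "index_distal":       (0.0, 80.0),
}

arm = bpy.data.objects["HandArmature"]      # assumed armature name
for name, (lo, hi) in LIMITS.items():
    pbone = arm.pose.bones[name]
    con = pbone.constraints.new(type='LIMIT_ROTATION')
    con.owner_space = 'LOCAL'
    con.use_limit_x = True                  # limit flexion around local X
    con.min_x = math.radians(lo)
    con.max_x = math.radians(hi)
```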
Context 5
... on the hand's surface were assigned a unique color label identifying the underlying skeleton joint, as shown in Fig. 2(b). The palm center was assumed to be roughly at the metacarpals' ...
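A small sketch of how such unique color labels could be decoded from a rendered label image back into integer segment labels; the palette below is hypothetical, assuming one exact RGB value per segment.

```python
import numpy as np

# Hypothetical palette: one exact RGB triple per semantic segment.
PALETTE = {
    (255, 0, 0): 1,    # e.g. thumb
    (0, 255, 0): 2,    # e.g. index
    (0, 0, 255): 3,    # e.g. middle
    # ... remaining segments would follow the same pattern
}

def decode_labels(rgb_image: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 label rendering to an (H, W) integer label
    map; pixels with unknown colors (background, anti-aliasing) get 0."""
    labels = np.zeros(rgb_image.shape[:2], dtype=np.int32)
    for color, lbl in PALETTE.items():
        mask = np.all(rgb_image == np.array(color, dtype=np.uint8), axis=-1)
        labels[mask] = lbl
    return labels
```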
Context 6
... the 1-NN with an average error of 23 mm compared to the 1-NN error of 90% performance obtained by RF. On the other hand, KNNs are good enough to generate an accurate hand segmentation, as can be observed in Fig. 11(b). Therefore, we can argue that the final hand segmentation performance ...tion of pose and viewpoint in the data. We illustrate in Fig. 12 how RF performance improved in a number of frames based on our finger segmentation strategy. We added the ICP MSE of the extracted nearest shapes to Fig. 12, which shows a meaningful relationship between the accuracy of RF and the alignment error among extracted nearest neighbors. We normalized MSE by a maximum threshold of 200 for the sake ...
Context 7
... an accurate hand segmentation as can be observed in Fig. 11(b). Therefore, we can argue that the final hand segmentation performance ...tion of pose and viewpoint in the data. We illustrate in Fig. 12 how RF performance improved in a number of frames based on our finger segmentation strategy. We added the ICP MSE of the extracted nearest shapes to Fig. 12, which shows a meaningful relationship between the accuracy of RF and the alignment error among extracted nearest neighbors. We normalized MSE by a maximum threshold of 200 for the sake of ...
Context 8
... segmentation accuracy is critical for final pose recovery. This relationship is shown in Fig. 11(d). However, one can observe how complex poses (ICP MSE) affect the mean pose error by comparing Fig. 13 to Fig. ...

Citations

... Unlike monocular images, videos provide information on temporal evolution that can improve the accuracy and robustness of 3D hand pose estimation. Therefore, some methods [5,40,41] exploit temporal information, using precursor and successor data in the sequence to predict the 3D pose of the target frame. CNN-based approaches typically rely on dilated temporal convolutions to model global correlations; however, the dilation technique has limited temporal connectivity. ...
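The limited temporal connectivity mentioned here can be illustrated with a short PyTorch sketch: a stack of dilated 1-D convolutions over a joint-coordinate sequence has a fixed receptive field, and each output frame sees only a sparse set of input frames. Layer sizes are arbitrary illustrations.

```python
import torch
import torch.nn as nn

# Receptive field grows as 1 + sum(dilation * (kernel - 1)): with kernel 3
# and dilations 1, 2, 4 it covers only 15 frames, and each output frame
# attends to a fixed, sparse set of inputs -- the limited temporal
# connectivity noted above.
tcn = nn.Sequential(
    nn.Conv1d(63, 128, kernel_size=3, dilation=1, padding=1),
    nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=3, dilation=2, padding=2),
    nn.ReLU(),
    nn.Conv1d(128, 63, kernel_size=3, dilation=4, padding=4),
)

x = torch.randn(1, 63, 100)   # batch, 21 joints * 3 coords, 100 frames
y = tcn(x)                    # same temporal length: (1, 63, 100)
```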
Article
Many biomedical applications require fine motor skill assessments; however, real-time and contactless fine motor skill assessments are not typically implemented. In this study, we followed the 2D-to-3D pipeline principle and proposed a transformer-based spatial-temporal network to accurately regress 3D hand joint locations from infrared thermal video input, eliminating the need for multiple cameras or RGB-D devices. We also developed a dataset composed of infrared thermal videos and ground-truth annotations for training. Each label is a set of 3D joint locations from infrared optical trackers, which are considered the gold standard for clinical applications. To demonstrate its potential, the proposed method was used to measure the finger motion angle, and we investigated its accuracy by comparing it with the Azure Kinect system and the Leap Motion system. On the proposed dataset, the proposed method achieved a 3D hand pose mean error of less than 14 mm and outperformed the other deep learning methods. When the error thresholds were larger than approximately 35 mm, our method was the first to achieve excellent performance (>80%) in terms of the fraction of good frames. For the finger motion angle calculation task, the proposed and commercial systems had comparable inter-system reliability (ICC2,1 ranging from 0.81 to 0.83) and excellent validity (Pearson's r-values ranging from 0.82 to 0.86). We believe that the proposed approaches can capture hand motion and measure finger motion angles, and can be used in different biomedical scenarios as an effective evaluation tool for fine motor skills.
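Two metrics quoted in this abstract, mean 3D joint error and the fraction of good frames under an error threshold, are standard in hand pose evaluation; a sketch of one common formulation (worst-joint criterion), which may differ in detail from the authors' protocol:

```python
import numpy as np

def pose_metrics(pred, gt, thresholds_mm):
    """pred, gt: (F, J, 3) joint locations in mm.
    Returns the mean per-joint error and, per threshold, the fraction of
    frames whose *worst* joint error stays below that threshold."""
    err = np.linalg.norm(pred - gt, axis=-1)       # (F, J) per-joint errors
    mean_err = err.mean()
    worst = err.max(axis=1)                        # (F,) worst joint per frame
    good = [(worst < t).mean() for t in thresholds_mm]
    return mean_err, dict(zip(thresholds_mm, good))
```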
... To analyse the performance of the occluded joints, we first need to compute which joints are occluded. To do so, we build on [50], first segmenting the hand based on the nearest distance of each ground-truth joint to the point cloud. ...
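A minimal sketch of the idea this excerpt borrows from [50]: flag a ground-truth joint as occluded when no point of the visible-surface depth point cloud lies near it. The distance threshold is an assumed parameter, not a value from either paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def occluded_joints(joints, point_cloud, max_dist_mm=15.0):
    """Flag a joint as occluded when no point of the (visible-surface)
    depth point cloud lies within `max_dist_mm` of it. joints: (J, 3),
    point_cloud: (N, 3), both in mm; the threshold is an assumption."""
    tree = cKDTree(point_cloud)
    nearest, _ = tree.query(joints)     # distance to closest surface point
    return nearest > max_dist_mm        # (J,) boolean occlusion mask
```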
Article
Full-text available
Abstract Despite recent advances in 3‐D pose estimation of human hands, thanks to the advent of convolutional neural networks (CNNs) and depth cameras, this task is still far from being solved in uncontrolled setups. This is mainly due to the highly non‐linear dynamics of fingers and self‐occlusions, which make hand model training a challenging task. In this study, a novel hierarchical tree‐like structured CNN is exploited, in which branches are trained to become specialised in predefined subsets of hand joints called local poses. Further, local pose features, extracted from hierarchical CNN branches, are fused to learn higher order dependencies among joints in the final pose by end‐to‐end training. Lastly, the loss function used is also defined to incorporate appearance and physical constraints about doable hand motions and deformations. Finally, a non‐rigid data augmentation approach is introduced to increase the amount of training depth data. Experimental results suggest that feeding a tree‐shaped CNN, specialised in local poses, into a fusion network for modelling joints' correlations and dependencies, helps to increase the precision of final estimations, showing competitive results on NYU, MSRA, Hands17 and SyntheticHand datasets.
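A schematic PyTorch sketch of the hierarchy this abstract describes: branches specialized in local poses, fused to regress the final pose. All sizes and the trunk design are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class TreeLikeHandNet(nn.Module):
    """Schematic: one branch per local pose (a predefined subset of
    joints), with branch features fused end-to-end to regress the full
    pose. Channel sizes and trunk layout are illustrative assumptions."""
    def __init__(self, n_branches=6, feat=128, joints=21):
        super().__init__()
        self.trunk = nn.Sequential(              # shared low-level features
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
        )
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(32 * 64, feat), nn.ReLU())
            for _ in range(n_branches)
        )
        self.fusion = nn.Linear(n_branches * feat, joints * 3)

    def forward(self, depth):                    # depth: (B, 1, H, W)
        z = self.trunk(depth)
        local = [b(z) for b in self.branches]    # local-pose features
        return self.fusion(torch.cat(local, dim=1))   # (B, joints*3)
```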
... Hand pose estimation conventionally struggles with an extensive space of pose articulations and with occlusions, including self-occlusions. Some recent 3D hand pose estimators that take sequential depth image frames as inputs have tried to enhance their performance by considering temporal information of hand motions [14,22,26,43]. Motion context provides temporal features for a narrower search space, hand personalization, robustness to occlusion, and refinement of estimations. ...
... Considering the temporal features of depth maps, hand pose estimators have been trained on sequential hand pose depth images [25,26,46,48]. Temporal features of hand pose variation are used to encode pose changes with recurrent model structures [14,43], to model the hand shape space [17], and to refine current estimations [22,26]. With sequential monocular RGB-D inputs, Taylor et al. [41] optimize surface hand shape models, updating subdivision surfaces on corresponding 3D hand geometric models. ...
Preprint
3D hand pose estimation based on RGB images has been studied for a long time. Most studies, however, have performed frame-by-frame estimation based on independent static images. In this paper, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which leads to the necessity of a large-scale dataset with sequential RGB hand images. We propose a novel method that generates a synthetic dataset mimicking natural human hand movements by re-engineering annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real allows preservation of the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are sequentially estimated consequently produce natural and smooth hand movements, which lead to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimations, outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks.
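The temporal consistency constraint mentioned in this abstract can be sketched as a penalty on frame-to-frame pose change; an illustrative L2 form, not necessarily the authors' exact loss:

```python
import torch

def temporal_consistency_loss(poses: torch.Tensor) -> torch.Tensor:
    """poses: (T, J, 3) sequentially estimated joint locations.
    Penalizes large frame-to-frame displacement, encouraging smooth
    hand movements; an assumed illustrative form of the constraint."""
    velocity = poses[1:] - poses[:-1]        # (T-1, J, 3) per-joint motion
    return (velocity ** 2).sum(dim=-1).mean()
```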
... These methods merge learning-based and model-based approaches into one framework to combine the strengths of both. One class of hybrid methods uses learning-based components in a tracking framework to initialize, update, or otherwise guide the tracker's convergence to the correct pose [18], [23], [27], [29], [30], [12]. These methods are more robust than traditional model-based trackers, but must trade off model and solver efficiency against accuracy at runtime. ...
Preprint
We propose to use a model-based generative loss for training hand pose estimators on depth images based on a volumetric hand model. This additional loss allows training of a hand pose estimator that accurately infers the entire set of 21 hand keypoints while only using supervision for 6 easy-to-annotate keypoints (fingertips and wrist). We show that our partially-supervised method achieves results that are comparable to those of fully-supervised methods which enforce articulation consistency. Moreover, for the first time we demonstrate that such an approach can be used to train on datasets that have erroneous annotations, i.e. "ground truth" with notable measurement errors, while obtaining predictions that explain the depth images better than the given "ground truth".
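A schematic of the objective this preprint describes: direct supervision on only the 6 easy-to-annotate keypoints (fingertips and wrist) plus a model-based generative term comparing the rendered hand model to the input depth. The rendering is treated as given here since it is out of scope, and all names and the weighting are illustrative assumptions.

```python
import torch

def partially_supervised_loss(pred_joints, gt_sparse, sparse_idx,
                              rendered_depth, input_depth, w_gen=1.0):
    """pred_joints: (B, 21, 3); gt_sparse: (B, 6, 3) fingertip + wrist
    annotations; sparse_idx: indices of those 6 joints among the 21.
    `rendered_depth` would come from the volumetric hand model posed by
    pred_joints -- assumed precomputed here."""
    sup = ((pred_joints[:, sparse_idx] - gt_sparse) ** 2).mean()
    gen = ((rendered_depth - input_depth) ** 2).mean()   # generative data term
    return sup + w_gen * gen
```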
... Depth-based hand tracking is the subject of a large number of published works [37], [33], [49], [11], [15], [46], [21], [22] in computer vision and human-computer interaction. Current work can be grouped into discriminative, generative, and hybrid methods. ...
Article
Full-text available
In this paper, we present a novel approach for real-time 3D hand tracking from a set of depth images. In each frame, our approach initializes the hand pose with learning and then jointly optimizes the hand pose and shape. For pose initialization, we propose a gesture classification and root location network (GCRL), which can capture the meaningful topological structure of the hand to estimate the gesture and the root location of the hand. With the per-frame initialization, our approach can rapidly recover from tracking failures. For optimization, unlike most existing methods that use a fixed-size hand model or manual calibration, we propose a hand gesture-guided optimization strategy to estimate pose and shape iteratively, which makes the tracking results more accurate. Experiments on three challenging datasets show that our proposed approach achieves accuracy similar to state-of-the-art approaches while running with low computational resources (without a GPU).
Article
Interacting with computer applications using actions that are designed by end users themselves, instead of pre-defined ones, has advantages such as better memorability in some Human-Computer Interaction (HCI) scenarios. In this paper we propose a method that allows users to employ self-defined mid-air hand gestures as commands for HCI after providing a few training samples for each gesture in front of a depth image sensor. The gesture detection and recognition algorithm is mainly based on pattern matching using 3 separate sets of features, which carry both finger-action and hand-motion information. An experiment in which each subject designed their own set of 8 gestures, provided about 5 samples for each, and then used them to play a game was conducted all in one sitting. During the experiment a recognition rate of 66.7% was achieved with a false positive ratio of 22.2%. Further analyses on the collected dataset show that a higher recognition rate of up to about 85% can be achieved if more wrong detections are allowed.
Article
Depth-based hand pose estimation has received increasing attention in the fields of human-computer interaction and virtual reality. A comprehensive survey and analysis of recent work on depth-based hand pose estimation is conducted. First, the definition and difficulties of the problem are explained, and the widely used sensors and public datasets are introduced. Then, the works in this field are divided into three categories: model-driven, data-driven, and hybrid methods. Model-driven methods perform a model fitting between the model and the depth points; data-driven methods learn a function that maps the depth image to a pose; hybrid methods combine model-driven and data-driven techniques to recover the hand pose. Throughout, we focus on the problems that have been solved and the shortcomings that remain. Finally, the works are compared in terms of accuracy, suitability, and robustness, and future research in this direction is discussed. © 2021, Beijing China Science Journal Publishing Co. Ltd. All rights reserved.