Real-time Hand Pose Recognition Based on a Neural Network Using Microsoft Kinect
Salvatore Sorce, Vito Gentile, Antonio Gentile
Dipartimento di Ingegneria Chimica Gestionale Informatica Meccanica
Università degli Studi di Palermo
90128 Palermo - Italy
salvatore.sorce@unipa.it, vitogentile@live.it, antonio.gentile@unipa.it
Abstract—The Microsoft Kinect sensor is widely used to detect and recognize body gestures and layout with sufficient reliability, accuracy and precision in a fairly simple way. However, the rather low resolution of its optical sensors does not allow the device to detect gestures of smaller body parts, such as the fingers of a hand, with the same ease. Given the clear application of this technology to user interaction within immersive multimedia environments, there is a real need for a reliable and effective method to detect the pose of certain body parts. In this paper we propose a method based on a neural network to detect the hand pose in real time, in order to recognize whether the hand is closed or not. The neural network processes color, depth and skeleton information coming from the Kinect device. This information is preprocessed to extract significant features. The output of the neural network is then filtered with a time average to reduce the noise due to fluctuations of the input data. We analyze and discuss three possible implementations of the proposed method, obtaining an accuracy of 90% under good lighting and background conditions, and even reaching 95% in the best cases, in real time.
Keywords—human-computer interaction; gesture-based
interaction; gesture recognition; Microsoft Kinect.
I. INTRODUCTION
The Microsoft Kinect sensor is widely used for several purposes, far beyond the original one (that is, interacting with games by gestures). Its low price, the availability of software development kits (both proprietary and third-party), and its ease of programming and integration have made the device adaptable to many needs, even in research domains. The Microsoft Kinect basically allows its users to detect people within a scene, along with their body posture, by recognizing the skeleton through the analysis of a color view of the scene together with a synchronized depth map. Both developers and researchers have exploited these features to build several software libraries to detect and recognize body gestures.
One of the main drawbacks of the Kinect sensor is the limited resolution of both color images and depth maps. In fact, the resolution is limited to 640x480 pixels when the device works at 30 fps. This implies that people standing a few feet away from the sensor will be displayed in a relatively small area of the image. This limitation makes the recognition of the posture and gestures of body parts, such as the hands, a very hard task.
In this paper we present an algorithm by which the Kinect sensor recognizes whether people’s hands are closed or not. This pose recognition could be useful within any application that involves a grabbing gesture or similar, and could be integrated into existing solutions for interactive service provision [12] [13]. We assume that there are no obstacles between the sensor and the hands, and that people are standing in front of the sensor at a distance between 1.5 and 2.5 meters. This range also allows for the integration of the presented solution with others designed for the recognition of gestures of the body or its parts.
The solution we present and discuss in this paper is
based on the processing of sensory data (specifically depth
and skeleton information), along with some anthropometric
considerations, for the segmentation of the hand. This
allows us to identify a region of interest containing
information on the hand only, with no need to use markers
or devices to be worn, and with no contrast constraints
against the background. Such a region is thus processed and
used as input to a neural network that acts as a classifier.
The output of the network is then filtered with a time
average to reduce the noise.
We will lastly discuss experimental results showing the
effectiveness of the proposed solution for real-time
applications.
II. METHOD OVERVIEW
Many researchers [1] [2] [3] carried out the hand pose recognition task based on the detection of individual fingers, assuming that the user is suitably close to the Kinect sensor. Zhou Ren et al. [4] also assumed that users wear a black bracelet to better distinguish the hand from the forearm, thus simplifying the segmentation task.
Our proposed method does not set any particular constraint, with the exception of the distance from the sensor, according to the Microsoft guidelines for achieving best results. In fact, to preserve the ability to recognize common body poses, thus allowing for integration with other gesture recognition algorithms, the Kinect sensor must be able to frame the entire user body, even during the execution of the
2013 Eighth International Conference on Broadband, Wireless Computing, Communication and Applications
978-0-7695-5093-0/13 $31.00 © 2013 IEEE
DOI 10.1109/BWCCA.2013.60
gesture. To this end Microsoft suggests that users be at a distance between 1.2 and 3.5 meters from the sensor in order to obtain good recognition performance. However, based on Khoshelham's study of the distance measurement error of the depth sensor [5], the maximum distance allowed for the user is set to 2.5 meters.
At this distance, the user skeleton will be correctly identified, so that all sensory information coming from the Kinect device can be suitably used to feed the proposed algorithm. All dimensions (depth, distance, positions) can be expressed both in meters and in pixels, since the Microsoft DLLs and APIs provide programmers with useful conversion and mapping tools. We will therefore refer to these dimensions with no explicit reference to the unit of measurement.
In the following sections, we will present and discuss the
recognition of one hand pose; however, the proposed
approach can be easily adapted to recognize both hands at
the same time.
A. Starting Recognition Process
The first step of our method determines whether the appropriate conditions occur to trigger the actual hand pose recognition procedure. In fact, it is useless to process the sensory data if the user's arms hang along the body, or if they are behind the back (fig. 1a). In general, the hands have to be clearly visible and sufficiently far from the body (fig. 1b).
Fig. 1. a) the hand is along the body; b) the hand is far enough from the body
We thus first determine whether the hand stands at a suitable distance from the body or not. This distance may be evaluated based on information related to the skeleton scale, and in particular to the forearm length (or the whole arm). For the sake of simplicity, it is also possible to set a value that works well in most cases, based on experimental results. In our method we set the lower limit of the distance between the hand and the center of mass of the user's body at 15 cm¹.
¹ When a human figure is in frame, the Microsoft APIs output a set of points corresponding to the main skeleton joints. The APIs also give another point corresponding to the whole skeleton, which can be identified as the body center of mass.
Let PH(xH, yH, zH) be the 3D coordinates of the hand point, and PS(xS, yS, zS) the center of mass of the body (see fig. 2 for the reference system). To start the hand pose recognition process, the following must hold²:
|zH − zS| > 0.15 m
Fig. 2. the 3D reference system used by the Kinect device
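As a minimal sketch of this trigger condition (function and variable names are hypothetical; joint positions are assumed to be (x, y, z) tuples in meters, as provided by the skeleton stream):

```python
# Sketch of the trigger test that starts the hand pose recognition.
# p_hand and p_spine are (x, y, z) skeleton joint positions in meters;
# the 0.15 m threshold is the lower limit discussed in the text.

START_THRESHOLD_M = 0.15  # minimum hand/center-of-mass distance along z

def should_start_recognition(p_hand, p_spine, threshold=START_THRESHOLD_M):
    """Return True when the hand is far enough in front of the body."""
    return abs(p_hand[2] - p_spine[2]) > threshold
```

For example, a hand at z = 1.8 m with the body center of mass at z = 2.1 m gives a 0.3 m gap, so recognition would start.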
B. Hand Segmentation
Assuming that the frames of the RGB camera are aligned with those of the depth sensor³, we set a square centered on the hand point PH (actually on the 2D point (xH, yH)). This can be done by converting the depth information into gray scale images. We therefore assume the hand is entirely contained within this square. The size L of the hand bounding square can be suitably obtained on an experimental basis, or it can be based on anthropometric considerations, so that L is obtained by comparison with the entire body size (this task can be carried out by measuring the distances between other key points of the skeleton). For our purposes we use a value L = 25 cm, which proved suitable in most cases. Better results can be achieved with less empirical values, as suggested by Cheng et al. [2], who estimated a linear relationship between the size of the hand palm and its depth.
Let now Imdepth and ImGray be the images coming from
the depth sensor and the RGB camera respectively (actually
converted in gray scale mode). We then apply a threshold on
Imdepth thus obtaining a new binary image, Immask (fig. 3),
according to the following rule:
Immask(x, y) = WHITE   if Imdepth(x, y) ≥ zH − Δdepth
               BLACK   otherwise
The Δdepth value represents the hand thickness along the hand-sensor segment (fig. 4). This value must be estimated based on empirical or anthropometric considerations. In our experiments we obtained good results by setting Δdepth = 8 cm, also from the noise tolerance point of view.
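A sketch of this segmentation step follows (function name hypothetical). Here the depth image is assumed to hold metric distances, so hand pixels are those not farther than zH + Δdepth; with a near-is-bright gray-level encoding of depth, the inequality direction flips:

```python
import numpy as np

# Hedged sketch of the depth-threshold segmentation of the hand
# bounding square. depth: 2D array of distances in meters; z_hand:
# depth of the hand joint; pixels within delta_depth of the hand
# become WHITE (1), the rest BLACK (0).

DELTA_DEPTH_M = 0.08  # hand thickness along the hand-sensor segment

def hand_mask(depth, z_hand, delta_depth=DELTA_DEPTH_M):
    """Binary mask: 1 where the pixel plausibly belongs to the hand."""
    return (depth <= z_hand + delta_depth).astype(np.uint8)
```

The mask is a simple per-pixel comparison, which is what makes this step much cheaper than feature-based alternatives.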
² We assume the hand is moved forward from the body in order to be detected, as a natural action (see fig. 1a and 1b).
³ Actually, depth and RGB images are not perfectly aligned due to the non-zero distance between the sensors in the Kinect package. However, the Kinect SDK is equipped with an efficient mapping tool, so we can assume they are aligned.
Fig. 3. a) the gray scale image of the hand; b) the depth data of the hand (Imdepth); c) the binary mask of the hand (Immask)
Fig. 4. Geometric meaning of Δdepth: a) from below; b) from above
C. Inputs for the Neural Classifier
The sensory data related to the hand are then used to feed a neural network. We used three different approaches, detailed below: the hand mask as input, the hand mask with edges as input, and SURF descriptors as input.
1) Hand Depth Mask as Input
In this simple approach, we use only the binary hand image as input for the neural network. Before using the image, we scale it to a fixed size (100x100 px in our case). Rather than inputting 10,000 binary values, we decided to base our approach on the horizontal and vertical histograms (fig. 5). In fact, from an LxL image, we get:
• An L-dimensional array, in which the n-th element represents the number of white pixels in the n-th row of the image;
• An L-dimensional array, in which the n-th element represents the number of white pixels in the n-th column of the image.
By juxtaposing these arrays, we obtain a single array whose dimension is 2L (in our case 200 elements, instead of the 10,000 we would have had with the whole binary image).
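The construction of this 2L-element feature vector can be sketched as follows (function name hypothetical):

```python
import numpy as np

# Sketch of the 2L-element feature vector built from the binary hand
# mask: per-row and per-column white-pixel counts, concatenated.
# mask is an L x L array of 0/1 values (L = 100 in the text).

def projection_features(mask):
    rows = mask.sum(axis=1)  # white pixels per row (horizontal histogram)
    cols = mask.sum(axis=0)  # white pixels per column (vertical histogram)
    return np.concatenate([rows, cols])
```

For a 100x100 mask this yields the 200-element array used as network input.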
Fig. 5. Horizontal and vertical projection histograms of the binary mask
The main drawback of this approach is that it may not be suitable to recognize the hand status in some cases, such as when the hand is nearly parallel to the hand-sensor line (fig. 6). In this case, the binary image alone may not be sufficient for discrimination, and some other processing should be used. The main advantage is that it does not make use of the color image, thus allowing for recognition even in poor lighting conditions or when the user wears gloves.
Fig. 6. The gray scale images show the different hand poses, a) closed and b) open, which are indistinguishable in the binary masks obtained from the depth image.
2) Hand Depth Mask with Edges as Input
To improve the discrimination capabilities of the neural network in cases such as the one described above, more information must be added to the input. A possible way is to extract the edges from the gray scale image, apply a binary threshold to them, and then obtain the horizontal and vertical histograms. Such information can be used as input in addition to that of the hand mask alone. In this case, the input of the neural network is given by a 4L-dimensional array (fig. 7).
Fig. 7. The horizontal and vertical histograms, both from the binary mask
and the binary edge. Together they are the input for the neural network
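A sketch of the 4L-element input follows; a simple gradient-magnitude edge detector stands in here for whatever operator the original implementation used (the discussion later mentions Sobel), and all names and the threshold value are assumptions:

```python
import numpy as np

# Hedged sketch of the 4L-element input: the mask histograms are
# concatenated with the histograms of a thresholded edge map computed
# from the gray-scale image of the hand region.

def edge_mask(gray, threshold=30.0):
    """Binary edge map from forward-difference gradient magnitude."""
    gx = np.zeros_like(gray, dtype=float)
    gy = np.zeros_like(gray, dtype=float)
    gx[:, :-1] = np.diff(gray.astype(float), axis=1)
    gy[:-1, :] = np.diff(gray.astype(float), axis=0)
    return (np.hypot(gx, gy) > threshold).astype(np.uint8)

def mask_and_edge_features(mask, gray, threshold=30.0):
    """Concatenate row/column histograms of the mask and the edge map."""
    edges = edge_mask(gray, threshold)
    feats = [mask.sum(axis=1), mask.sum(axis=0),
             edges.sum(axis=1), edges.sum(axis=0)]
    return np.concatenate(feats)
```

The resulting vector has 4L elements: 2L from the binary mask and 2L from the binary edge map.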
The main disadvantage of this approach is that edge detection is strongly dependent on the image quality, which in turn decreases as the user's distance from the sensor increases. Furthermore, with this approach it is impossible to recognize the hand status if the user wears gloves.
3) SURF Descriptors as Input
Another way to add information, aiming at a better discrimination of similar binary hand masks, is to merge the binary threshold of the depth image with the RGB camera data. Let Immask* be the image obtained by enlarging the Immask image by 2 pixels. Now multiply Immask*⁴ and ImGray element by element and consider the smallest rectangle that includes the non-black area of the result. Let us call this area ImROI (fig. 8).
Fig. 8. The binary image (a) is morphologically enlarged (b), and the result is applied as a mask on the gray scale image, thus eliminating the background (c). The region of interest is obtained by keeping all the non-black pixels (d)
Bagdanov et al. [6] show that it is possible to train a Support Vector Machine (SVM) with a set of five SURF-128 features [7], based on the same number of key points, each of them referring to partially overlapping parts of the ImROI region. Figure 9 shows these parts.
Fig. 9. Key points and image areas
In this way we extract an array of 128 x 5 = 640 elements from the image, which can be used as input for the neural network. This allows us to exploit the invariance properties with respect to scaling and rotation. In this case too, the user should not wear gloves.
D. Neural Network Structure and its Usage
Since our goal is to recognize whether a hand is closed or not, the decisional process can be carried out by a neural network, as often occurs in similar cases [8]. The neural network is trained by means of a MATLAB toolbox that implements a variation of the Widrow-Hoff learning rule with back propagation [9]. The network is composed of NI neurons directly connected to the inputs, one hidden level composed of NH neurons, and NS = 2 output neurons. It can be sketched as in figure 10.
⁴ Immask* is assumed to be composed of real values in [0, 1].
Fig. 10. Sketch of the neural classifier
Each neuron uses a non-linear transfer function for the output, based on the hyperbolic tangent. MATLAB uses a more efficient implementation than tanh(n), that is⁵:
tansig(n) = 2 / (1 + e^(−2n)) − 1
To set the correct number of hidden neurons NH, instead of using rules of thumb or proceeding by trial and error, we choose a number that best approximates the number of neurons needed for exact learning, according to what Elisseeff and Paugam-Moisy have demonstrated [10]. With the transfer function set as above, given NP the number of training set elements, NS the number of outputs of the network, and NI the number of inputs of the network, and assuming that the redundancy degree is null, it can be demonstrated that for exact learning of the training set the following must hold:
NP·NS / (NI + NS) ≤ NH ≤ 2·NP·NS / (NI + NS)
This does not mean that with such a number of neurons we will achieve exact learning, but that it is likely to be approximated. In general, an accurate training process does not aim at exact learning, so as to avoid overfitting problems. All the above considerations led us to choose:
NH = ⌊1.5 · NP·NS / (NI + NS)⌋
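Under the assumptions above, the hidden-layer size follows directly from the training-set and layer dimensions; a sketch (function name hypothetical, floor rounding assumed):

```python
import math

# Sketch of the hidden-layer sizing rule discussed above:
# N_H = floor(1.5 * N_P * N_S / (N_I + N_S)), i.e. 1.5 times the lower
# bound for exact learning, with N_P training cases, N_I inputs and
# N_S outputs.

def hidden_neurons(n_patterns, n_inputs, n_outputs):
    return math.floor(1.5 * n_patterns * n_outputs / (n_inputs + n_outputs))
```

For instance, with the depth-mask input (NI = 200, NS = 2) and 2500 training cases this gives a value close to the 36 hidden neurons used in the experiments; the small discrepancies with the reported 36, 20 and 133 presumably come from slightly different input dimensions or rounding.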
E. Output Filtering
Since the output of the neural network is often noisy, it is useful to implement a noise reduction mechanism. Bagdanov et al. [6] consider the output noise as a zero-mean Gaussian process, and use a Kalman filter to reduce it. In our method we use an EWMA (Exponentially Weighted Moving Average). Given that the difference between the outputs at two consecutive instants is marginal (if the hand status does not change), the EWMA between the current output outNeuralNetwork and the previous one outi-1 allows for the attenuation of unwanted occasional noise effects. When the
⁵ See the documentation on the hyperbolic tangent sigmoid transfer function in the MathWorks MATLAB Documentation Center: http://www.mathworks.it/it/help/nnet/ref/tansig.html
hand status changes, the EWMA-based noise reduction adds a small delay, which however turns out to be acceptable:
outi = (1 − α) × outi-1 + α × outNeuralNetwork
where α = 0.3 is a good compromise between the noise reduction goal and the system responsiveness (fig. 11).
Fig. 11. Comparison between the raw output and the averaged one
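The filtering step above can be sketched as follows (function name hypothetical; α = 0.3 as in the text):

```python
# Sketch of the EWMA output filter: each raw network output is blended
# with the previous filtered value, attenuating occasional noise spikes
# at the cost of a small delay when the hand status actually changes.

def ewma_filter(raw_outputs, alpha=0.3):
    """Return the exponentially weighted moving average of a sequence."""
    filtered = []
    prev = None
    for out in raw_outputs:
        prev = out if prev is None else (1 - alpha) * prev + alpha * out
        filtered.append(prev)
    return filtered
```

A spike such as [1.0, 0.0, 0.0] decays gradually (1.0, 0.7, 0.49, ...) instead of flipping the classification on a single noisy frame.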
III. EXPERIMENTAL RESULTS
To test the effectiveness of our method, we carried out several trials with different inputs to the neural network, according to the three possible hand representations previously described:
• For the hand depth mask as input method, we trained a neural network with a training set composed of 2500 cases, and 36 neurons in the hidden level;
• For the hand depth mask with edges as input method, we trained a neural network with a training set composed of 2500 cases, and 20 neurons in the hidden level;
• For the SURF descriptors as input method, we trained a neural network with the training set used by Bagdanov et al. [6], composed of 28400 cases, and 133 neurons in the hidden level.
In all cases, we calculated the number of hidden neurons
according to the discussion above.
Table I shows the processing times obtained using a MacBook Pro 15” Late 2011⁶, equipped with a 2.4 GHz quad-core processing unit. The simplest approach (based on the hand mask only) turned out to be the fastest one. In any case, all three methods are suitable for real-time applications, since their timings are compatible with a theoretical frame rate of 100 fps. The actual maximum frame rate of the Microsoft Kinect is 30 fps, so the results are largely satisfactory.
TABLE I. A COMPARISON OF PROCESSING TIMES
        Depth Mask Only    Depth Mask and Edges    SURF Features
Time    1.6865 ms          10.1584 ms              7.6415 ms
Ticks   3.8888 · 10³       31.4383 · 10³           13.2570 · 10³
⁶ See the MacBook Pro 15” Late 2011 specifications: http://support.apple.com/kb/sp644
The streams used for the tests included people with sleeves both up and down, captured in three different environments, each with different lighting and reflective conditions, and with different items in the scene.
In more detail, the first environment had uniform lighting; the second one had fairly uniform but less intense lighting; the third one had a big light source (an open window) behind the user (fig. 12). In all cases the sensor was set to Automatic Exposure mode.
Fig. 12. Some examples used in our tests: a) strong lighting behind the user confusing the sensor (open hand wrongly recognized as closed); b) open hand correctly recognized in the same environment; c) closed hand correctly recognized in a constant lighting environment
Tables II, III, IV and figures 13 and 14 show the recognition faults (%) vs. distance and lighting conditions. As a consequence of the discussion above and the data shown in the plots, we can say that the hand mask method gives the best results, in addition to being the simplest and fastest method. This is due to its independence from variations of the lighting conditions. Such variations are the main cause of inefficiency for the hand mask with edges method. The Sobel algorithm is in fact highly vulnerable to lighting variations, so it is not suitable for non-controlled environments. The SURF-based method is the best one in optimal lighting conditions, but its performance worsens when the lighting is not adequately controlled.
TABLE II. HAND DEPTH MASK ONLY AS NEURAL NETWORK INPUT
TABLE III. HAND DEPTH MASK AND EDGES AS NEURAL NETWORK INPUT
TABLE IV. HAND SURF DESCRIPTORS AS NEURAL NETWORK INPUT
Fig. 13. Misclassification error vs. distance
Fig. 14. Misclassification error vs. lighting quality
IV. DISCUSSION
The main differences between the proposed method and other solutions available in the scientific literature concern three aspects: constraints, performance and accuracy. In the following short discussion we will refer to the hand depth mask as input approach (see section II.C.1). This approach produces the most accurate results (see tables II, III, IV and figures 13 and 14), and it also represents the best solution in terms of performance (see table I).
• Constraints: some authors set quite strong constraints on the environment, the lighting, or the possibility to wear gloves and jewelry. Our proposed system is free from lighting constraints, since it is mainly based on depth data. The solution proposed by Bagdanov et al. [6], which is based on SURF features, needs a minimum amount of lighting to recognize the hand, because it uses color information. The same applies to all the studies [2] that are based on the segmentation of color information. Furthermore, the use of depth data poses no constraints on the skin color (which can also be painted or tattooed) or the possibility to wear gloves, rings and bracelets. In [4] it is mandatory for users to wear a bracelet to mark the separation between the hand and the wrist. In our solution, the only constraint is the distance between the user and the Kinect, which should be in the range 1.5–2.5 m, but this is due to device-related considerations, as mentioned above. The authors of [6] set the distance in the range 1–3 m for almost the same reasons.
• Performance: our system works in real time. Bagdanov et al. [6] achieve the same result, but they use SURF descriptors. Their process is quite fast, but not as fast as the depth mask construction, which can be obtained by simple thresholding. Furthermore, in our case every frame is always processed and used for the classification. The system proposed by Ahmed [11], which is based on the extraction of features from the thresholded image, processes a frame only if there is a significant difference from the last processed frame. This solution cuts off data that can be significant for the classification, but it is required to keep the response time acceptable.
• Accuracy: among the systems we referred to for comparison, the only one that allows for the recognition of whether a hand is closed or open is the one proposed in [6]. The other considered systems [2] [4] [11] allow for the recognition of different poses, with several constraints on lighting, worn objects, and distances. Bagdanov et al. reach an accuracy of 98% [6], which is slightly better than our 96.5% in the best case. However, their result is achieved at the cost of a greater computational load and a less general applicability.
V. CONCLUSIONS
In this paper we presented and discussed a method to recognize whether a hand is open or closed, based on a neural network and on sensory data coming from the Kinect device. No further constraints are placed on users or on the environment in terms of background or colors, and the results are achieved in real time.
Our method can be integrated within more complex recognition systems to implement the hand pose recognition task. It also turns out to be a good basis for further developments to improve its already good performance.
Based on our experiments and discussion, we first conclude that the approach that includes the edges is the worst one and has to be left out, unless some lighting-independent algorithm is used for the edge extraction. The remaining two methods are worth further study to refine them. Concerning the SURF-based one, the source images can be improved, for example, by smoothing the mask used for the background removal and by normalizing the contrast, so that the extracted features are independent from insignificant data.
Depth masks can be represented in different ways, to make them independent from scale, rotation, etc. Ren et al. [4] extract a signature from the depth mask to represent it with a fixed-dimension array, regardless of its actual size (e.g., by sampling at each sexagesimal degree, in a 360-dimensional space). To ensure rotational independence, it may be advantageous to apply an algorithm that rotates the image appropriately (as Matos et al. [14] suggest). Ahmed [11] extracts 33 features from a binary image, based on the percentage of white pixels in different overlapping image areas, as well as on the processing of some central moments of the hand position (Biswas et al. [15] also use a similar approach) (fig. 15).
Fig. 15. Some possible ways to represent the binary depth masks: a) a signature representing the edge as a one-dimensional function, in polar coordinates; b) subdivision levels to extract some features [11].
ACKNOWLEDGMENT
This paper has been partially supported under the
research program P.O.N. RICERCA E COMPETITIVITA'
2007-2013, project title SINTESYS - Security INTElligence
SYStem, project code PON 01_01687.
REFERENCES
[1] Frati, V.; Prattichizzo, D., "Using Kinect for hand tracking and
rendering in wearable haptics," IEEE World Haptics Conference
(WHC 2011), pp. 317-321, 21-24 June 2011, doi:
10.1109/WHC.2011.5945505
[2] Cheng Tang; Yongsheng Ou; Guolai Jiang; Qunqun Xie; Yangsheng Xu, "Hand tracking and pose recognition via depth and color information," 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1104-1109, 11-14 Dec. 2012, doi: 10.1109/ROBIO.2012.6491117
[3] La Cascia, M.; Morana, M.; Sorce, S., “Mobile Interface for Content-Based Image Management,” 2010 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp. 718-723, 15-18 Feb. 2010, doi: 10.1109/CISIS.2010.172
[4] Zhou Ren, Junsong Yuan, and Zhengyou Zhang. 2011. Robust hand
gesture recognition based on finger-earth mover's distance with a
commodity depth camera. In Proceedings of the 19th ACM
international conference on Multimedia (MM '11). ACM, New York,
NY, USA, 1093-1096. DOI=10.1145/2072298.2071946
[5] Khoshelham K, “Accuracy analysis of Kinect depth data”. In: ISPRS
Workshop Laser Scanning, vol. XXXVIII (2011), pp. 133-138
[6] Bagdanov, A.D.; Del Bimbo, A.; Seidenari, L.; Usai, L., "Real-time
hand status recognition from RGB-D imagery," 21st International
Conference on Pattern Recognition (ICPR 2012), pp.2456-2459, 11-
15 Nov. 2012
[7] Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, "SURF:
Speeded Up Robust Features", Computer Vision and Image
Understanding (CVIU), Vol. 110, No. 3, pp. 346--359, 2008
[8] Zhang, G.P., "Neural networks for classification: a survey," IEEE Transactions on Systems, Man, and Cybernetics, vol. 30, no. 4, pp. 451-462, Nov 2000, doi: 10.1109/5326.897072
[9] Beale, Mark Hudson, Hagan, Martin T. and Demuth, Howard B.
“Neural Network Toolbox User's Guide”, Mathworks Documentation
Center. [Online] [last accessed on: 17 04 2013.]
http://www.mathworks.it/help/pdf_doc/nnet/nnet_ug.pdf.
[10] André Elisseeff, Hélène Paugam-Moisy, “Size of Multilayer
Networks for Exact Learning: Analytic Approach.”, pp.162-168 In
proceeding of: Advances in Neural Information Processing Systems
9, NIPS, Denver, CO, USA, December 2-5, 1996
[11] Tasnuva Ahmed, “A Neural Network based Real Time Hand Gesture
Recognition System”, International Journal of Computer
Applications, 59[5]:17-22, December 2012. Published by Foundation
of Computer Science, New York, USA, doi: 10.5120/9535-3971
[12] Gentile, A.; Andolina, S.; Massara, A.; Pirrone, D.; Russo, G.; Santangelo, A.; Trumello, E.; Sorce, S., "A Multichannel Information System to Build and Deliver Rich User-Experiences in Exhibits and Museums," 2011 International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA), pp. 57-64, 26-28 Oct. 2011, doi: 10.1109/BWCCA.2011.14
[13] Sorce, S.; Augello, A.; Santangelo, A.; Gentile, A.; Genco, A.; Gaglio, S.; Pilato, G., “Interacting with Augmented Environments,” IEEE Pervasive Computing, vol. 9, no. 2, pp. 56-58, April-June 2010, doi: 10.1109/MPRV.2010.34
[14] Hélder Matos; Hélder P. Oliveira; Filipe Magalhães, “Hand-
Geometry Based Recognition System A Non Restricted Acquisition
Approach,” in 9th International Conference on Image Analysis and
Recognition (ICIAR), Aveiro, Portugal, 2012, pp. 38-45. DOI:
10.1007/978-3-642-31298-4_5
[15] K. K. Biswas; Kumar Basu Saurav, “Gesture Recognition using
Microsoft Kinect®,” in 5th International Conference on Automation,
Robotics and Applications (ICARA), Wellington, New Zealand, 2011,
pp. 100-103. DOI: 10.1109/ICARA.2011.6144864.