Real-time Hand Pose Recognition Based on a Neural Network Using Microsoft Kinect
Salvatore Sorce, Vito Gentile, Antonio Gentile
Dipartimento di Ingegneria Chimica Gestionale Informatica Meccanica
Università degli Studi di Palermo
90128 Palermo - Italy
salvatore.sorce@unipa.it, vitogentile@live.it, antonio.gentile@unipa.it
Abstract—The Microsoft Kinect sensor is widely used to detect and recognize body gestures and posture with sufficient reliability, accuracy and precision in a fairly simple way. However, the relatively low resolution of its optical sensors does not allow the device to detect gestures of body parts, such as the fingers of a hand, with the same straightforwardness. Given the clear application of this technology to the field of user interaction within immersive multimedia environments, there is a real need for a reliable and effective method to detect the pose of some body parts. In this paper we propose a method based on a neural network to detect the hand pose in real time, in order to recognize whether the hand is closed or not. The neural network is used to process color, depth and skeleton information coming from the Kinect device. This information is preprocessed to extract some significant features. The output of the neural network is then filtered with a time average, to reduce the noise due to the fluctuation of the input data. We analyze and discuss three possible implementations of the proposed method, obtaining an accuracy of 90% under good lighting and background conditions, and even reaching 95% in the best cases, in real time.
Keywords—human-computer interaction; gesture-based
interaction; gesture recognition; Microsoft Kinect.
I. INTRODUCTION
The Microsoft Kinect sensor is largely used for several purposes, far beyond the original one (that is, to interact with games by gestures). Its low price, the availability of software development kits (both proprietary and third-party), and the ease of programming and integration have made the device adaptable to many needs, even in research domains. The Microsoft Kinect basically allows its users to detect people within a scene, along with their body posture, through the recognition of the skeleton based on the analysis of a color view of the scene along with a synchronized depth map. Both developers and researchers have exploited such features to build several software libraries to detect and recognize body gestures as well.
One of the main drawbacks of the Kinect sensor is the
resolution limitation both for color images and depth maps.
In fact, the resolution is limited to 640x480 pixels when the
device works at 30 fps. This implies that people standing a
few feet away from the sensor will be displayed in a
relatively small area of the image. This limitation makes the recognition of the posture and gestures of body parts, such as the hands, a very hard task.
In this paper we present an algorithm by which the
Kinect sensor recognizes whether people’s hands are closed
or not. This pose recognition could be useful within any application that involves the grabbing gesture or similar ones, and could be integrated into existing solutions for interactive service provision [12] [13]. We assume that there are no
obstacles between the sensor and the hands, and that people
are standing in front of the sensor at a distance between 1.5
and 2.5 meters. This range also allows for the integration of
the presented solution with other ones designed for the
recognition of gestures of the body or parts of it.
The solution we present and discuss in this paper is
based on the processing of sensory data (specifically depth
and skeleton information), along with some anthropometric
considerations, for the segmentation of the hand. This
allows us to identify a region of interest containing
information on the hand only, with no need to use markers
or devices to be worn, and with no contrast constraints
against the background. Such a region is thus processed and
used as input to a neural network that acts as a classifier.
The output of the network is then filtered with a time
average to reduce the noise.
We will lastly discuss experimental results showing the
effectiveness of the proposed solution for real-time
applications.
II. METHOD OVERVIEW
Many researchers [1] [2] [3] carried out the hand pose
recognition task based on the detection of the individual
fingers, assuming that the user is suitably close to the Kinect
sensor. Zhou Ren et al. [4] also assumed that users wear a
black bracelet to better recognize the hand from the forearm,
thus simplifying the segmentation task.
Our proposed method does not set any particular constraint, with the exception of the distance from the sensor, which is set according to the Microsoft guidelines to achieve the best results. In
fact, to keep the possibility to recognize common body
poses, thus allowing for the integration with other gesture
recognition algorithms, the Kinect sensor must be able to
frame the entire user body, even during the execution of the
gesture. To this end, Microsoft suggests that users be at a distance between 1.2 and 3.5 meters from the sensor in order to obtain good recognition performance. However, based on the study by Khoshelham on the distance measurement error of the depth sensor [5], the maximum distance allowed for the user is set to 2.5 meters.
At this distance, the user skeleton will then be correctly
identified, so that all sensory information coming from the
Kinect device can be suitably used to feed the proposed
algorithm. All dimensions (depth, distance, positions) can
be expressed both in meters and in pixels, since the Microsoft DLLs and APIs provide programmers with useful conversion
and mapping tools. We will then refer to these dimensions
with no explicit reference to the unit of measurement.
In the following sections, we will present and discuss the
recognition of one hand pose; however, the proposed
approach can be easily adapted to recognize both hands at
the same time.
A. Starting Recognition Process
The first step of our method determines whether the
appropriate conditions occur that will trigger the actual hand
pose recognition procedure. In fact, it is useless to process
the sensory data if the user has his arms along his body, or if they are behind his back (fig. 1a). In general, the hands have to be clearly visible and quite far from the body (fig. 1b).
Fig. 1. a) The hand is along the body; b) the hand is far enough from the body
We thus first determine whether the hand stands at a
suitable distance from the body or not. This distance may be
evaluated based on information related to the skeleton scale,
and in particular to the forearm length (or the whole arm).
For the sake of simplicity, it is also possible to set a value that works well in most cases, based on experimental results. In our method we set the lower limit of the distance between the hand and the center of mass of the user's body at 15 cm¹.
¹ The output of the Microsoft APIs, when a human figure is in the frame, is a set of points corresponding to the main skeleton joints. The APIs also provide a further point which corresponds to the whole skeleton; this can be identified as the body center of mass.
Let P_H(x_H, y_H, z_H) be the 3D coordinates of the hand point, and P_S(x_S, y_S, z_S) the center of mass of the body (see fig. 2 for the reference system). To start the hand pose recognition process, it must hold that²:

|z_H − z_S| > 0.15 m
Fig. 2. The 3D reference system used by the Kinect device
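As an illustration, the trigger check can be sketched as follows (Python-like pseudocode; the skeleton data structure and joint names are assumptions for illustration, not the actual Kinect SDK types):

    # Minimal sketch of the trigger condition described above.
    # The skeleton is assumed to be a dict mapping joint names to (x, y, z) in meters;
    # names and structure are illustrative, not the actual Kinect SDK API.

    HAND_BODY_MIN_DISTANCE = 0.15  # meters, lower limit used in our method


    def should_start_recognition(skeleton, hand="HandRight"):
        """Return True when the hand is far enough (along z) from the body center of mass."""
        x_h, y_h, z_h = skeleton[hand]            # hand point P_H
        x_s, y_s, z_s = skeleton["CenterOfMass"]  # whole-skeleton point P_S
        return abs(z_h - z_s) > HAND_BODY_MIN_DISTANCE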
B. Hand Segmentation
Assuming that the frames of the RGB camera are aligned with those of the depth sensor³, we set a square centered on the hand point P_H (actually on the 2D point (x_H, y_H)). This can be done by converting the depth information into gray scale images. We therefore assume the hand is entirely included within this square. The size L of the hand bounding square can be suitably obtained on an experimental basis, or it can be based on anthropometric considerations, so that L is obtained by comparison with the entire body size (this task can be carried out by measuring the distances between other key points of the skeleton). For our purposes we use a value L = 25 cm, which proved suitable in most cases. Better results can be achieved with less empirical values, as suggested by Cheng et al. [2], who estimated a linear relationship between the size of the hand palm and its depth.
Let now Im_depth and Im_Gray be the images coming from the depth sensor and the RGB camera respectively (the latter actually converted to gray scale). We then apply a threshold on Im_depth, thus obtaining a new binary image Im_mask (fig. 3), according to the following rule:

$$\mathrm{Im}_{mask}(x, y) = \begin{cases} \text{WHITE} & \text{if } \left| \mathrm{Im}_{depth}(x, y) - z_H \right| \le \Delta_{depth} \\ \text{BLACK} & \text{otherwise} \end{cases}$$

The Δ_depth value represents the hand thickness along the hand-sensor segment (fig. 4). This value must be estimated based on empirical or anthropometric considerations. In our experiments we obtained good results by setting Δ_depth = 8 cm, even from the noise tolerance point of view.
² We assume that the hand is moved forward from the body in order to be detected, as a natural action (see fig. 1 a) and b)).
³ Actually, depth and RGB images are not perfectly aligned due to the non-zero distance between the sensors in the Kinect package. However, the Kinect SDK is equipped with an efficient mapping tool, so we can assume they are aligned.
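A minimal sketch of this segmentation step is the following (NumPy; it assumes the depth frame has already been mapped onto the color frame and expressed in meters, and that the hand point has been converted to pixel coordinates; variable and function names are illustrative):

    import numpy as np

    DELTA_DEPTH = 0.08  # hand thickness along the hand-sensor segment (8 cm)


    def hand_mask(depth_m, hand_px, z_h, box_px):
        """Crop a square around the hand point and threshold it around the hand depth z_h.

        depth_m : 2D float array, depth frame in meters (assumed aligned with the RGB frame)
        hand_px : (col, row) pixel coordinates of the hand point P_H
        z_h     : depth of the hand point, in meters
        box_px  : side of the bounding square in pixels (the 25 cm square converted to pixels)
        """
        c, r = hand_px
        half = box_px // 2
        crop = depth_m[max(r - half, 0):r + half, max(c - half, 0):c + half]

        # WHITE (1) where the depth is within DELTA_DEPTH of the hand depth, BLACK (0) otherwise
        return (np.abs(crop - z_h) <= DELTA_DEPTH).astype(np.uint8)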
Fig. 3. a) The gray scale image of the hand, b) the depth data of the hand (Im_depth), and c) a binary mask of the hand (Im_mask)
Fig. 4. Geometric meaning of Δ_depth: a) from below; b) from above
C. Inputs for the Neural Classifier
The sensory data related to the hand are then used to feed
a neural network. We used three different approaches that
will be detailed below: hand mask as input, hand mask with
edges as input, SURF descriptors as input.
1) Hand Depth Mask as Input
In this simple approach, we use the binary hand image only as input for the neural network. Prior to using the image, we scale it to a fixed size (100x100 px in our case). Rather than input 10,000 binary values, we decided to base our approach on the horizontal and vertical projection histograms (fig. 5). In fact, from an LxL image, we get:
- an L-dimensional array, in which the n-th element represents the number of white pixels in the n-th row of the image;
- an L-dimensional array, in which the n-th element represents the number of white pixels in the n-th column of the image.
By juxtaposing these arrays, we obtain a single array whose dimension is 2L (in our case 200 elements, instead of the 10,000 we would have had with the whole binary image).
Fig. 5. Horizontal and vertical projection histograms of the binary mask
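A sketch of this feature extraction reduces to two sums over the resized mask (Python with OpenCV and NumPy; the function name and the nearest-neighbour resize are illustrative choices):

    import cv2
    import numpy as np


    def projection_histogram_features(mask, size=100):
        """Build the 2L input vector from a binary hand mask (white = 1, black = 0)."""
        resized = cv2.resize(mask.astype(np.uint8), (size, size),
                             interpolation=cv2.INTER_NEAREST)
        h_hist = resized.sum(axis=1)  # white pixels per row (L values)
        v_hist = resized.sum(axis=0)  # white pixels per column (L values)
        return np.concatenate([h_hist, v_hist])  # 2L values (200 for L = 100)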
The main drawback of this approach is that it may not be suitable to recognize the hand status in some cases, such as when the hand is nearly parallel to the hand-sensor line (fig. 6). In this case, the binary image alone may not be sufficient for discrimination, and some other processing should be used. The main advantage is that it does not make use of the color image, thus allowing for recognition even in poor lighting conditions or when the user wears gloves.
Fig. 6. The gray scale images show the different hand poses, a) closed and b) open, which are in turn indistinguishable in the binary masks obtained from the depth image.
2) Hand Depth Mask with Edges as Input
To improve the discrimination capabilities of the neural network in cases such as the one described above, more information must be added to the input. A possible way is to extract the edges from the gray scale image, apply a binary threshold to them, and then obtain the horizontal and vertical histograms. Such information can be used as input in addition to that of the hand mask alone. In this case, the input of the neural network is given by a 4L-dimensional array (fig. 7).
Fig. 7. The horizontal and vertical histograms, both from the binary mask
and the binary edge. Together they are the input for the neural network
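A sketch of this variant is given below, using the Sobel gradient magnitude as the edge extractor (Section III refers to the Sobel algorithm); the edge threshold value is an assumption made for illustration:

    import cv2
    import numpy as np


    def mask_and_edge_features(mask, gray_roi, size=100, edge_threshold=60):
        """Build the 4L input vector: mask histograms plus binary-edge histograms.

        mask     : binary hand mask (Im_mask), cropped around the hand
        gray_roi : gray scale crop of the same region (from Im_Gray)
        """
        # Edge magnitude from the gray scale image (Sobel), then a binary threshold
        gx = cv2.Sobel(gray_roi, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray_roi, cv2.CV_32F, 0, 1)
        edges = (cv2.magnitude(gx, gy) > edge_threshold).astype(np.uint8)

        feats = []
        for img in (mask.astype(np.uint8), edges):
            resized = cv2.resize(img, (size, size), interpolation=cv2.INTER_NEAREST)
            feats.append(resized.sum(axis=1))  # horizontal projection histogram
            feats.append(resized.sum(axis=0))  # vertical projection histogram
        return np.concatenate(feats)  # 4L values (400 for L = 100)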
In this approach, the main disadvantage is that the edge detection is strongly dependent on the image quality, which in turn decreases as the user's distance from the sensor increases. Furthermore, with this approach it is impossible to recognize the hand status if the user wears gloves.
3) SURF Descriptors as Input
Another way to add information, aiming at a better discrimination of similar binary hand masks, is to merge the binary threshold of the depth image with the RGB camera data. Let Im_mask* be the image obtained by morphologically enlarging the Im_mask image by 2 pixels. Now multiply Im_mask*⁴ and Im_Gray element by element and consider the smallest rectangle that includes the non-black area of the result. Let us call this area Im_ROI (fig. 8).
Fig. 8. The binary image (a) is morphologically enlarged (b), and the result is applied as a mask on the gray scale image, thus eliminating the background (c). The region of interest is obtained by keeping all the non-black pixels (d)
Bagdanov et al. [6] show that it is possible to train a Support Vector Machine (SVM) with a set of five SURF-128 features [7], computed at the same number of key points, each of them referring to partially overlapping parts of the Im_ROI region. Figure 9 shows these parts.
Fig. 9. Key points and image areas
In this way we extract an array of 128 x 5 = 640 elements from the image, which can be used as input for the neural network. This allows us to exploit the invariance properties of SURF with respect to scaling and rotation. Also in this case, the user should not wear gloves.
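A rough sketch of this feature extraction is shown below. It assumes an OpenCV build that still ships the non-free SURF implementation (opencv-contrib), and it places the five key points at the ROI center and at the centers of its four quadrants, which is only a guess at the layout of fig. 9; the descriptor sizes follow the 128 x 5 = 640 figure given above:

    import cv2
    import numpy as np


    def surf_features(gray, mask):
        """Extract a 640-element SURF-128 descriptor array from the masked hand region."""
        # Enlarge the binary mask slightly and remove the background from the gray image
        dilated = cv2.dilate(mask.astype(np.uint8), np.ones((5, 5), np.uint8))  # ~2 px enlargement
        masked = cv2.bitwise_and(gray, gray, mask=dilated)

        # Bounding box of the non-black area -> Im_ROI
        ys, xs = np.nonzero(masked)
        roi = masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

        # Five partially overlapping regions: center plus four quadrant centers (illustrative layout)
        h, w = roi.shape
        size = 0.75 * min(h, w)  # key point "size" controls the SURF support region
        centers = [(w / 2, h / 2), (w / 4, h / 4), (3 * w / 4, h / 4),
                   (w / 4, 3 * h / 4), (3 * w / 4, 3 * h / 4)]
        keypoints = [cv2.KeyPoint(x, y, size) for x, y in centers]

        # SURF-128 (extended descriptors); requires opencv-contrib with non-free modules enabled
        surf = cv2.xfeatures2d.SURF_create(extended=True)
        _, descriptors = surf.compute(roi, keypoints)
        return descriptors.flatten()  # 5 x 128 = 640 values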
D. Neural Network Structure and its Usage
Since our goal is to recognize whether a hand is closed or not, the decisional process can be carried out by a neural network, as often occurs in similar cases [8]. The neural network is trained by means of a MATLAB toolbox that implements a variation of the Widrow-Hoff learning rule with back propagation [9]. The network is composed of N_I neurons directly connected to the inputs, it has only one hidden layer composed of N_H neurons, and N_S = 2 output neurons. It can be sketched as in figure 10.
⁴ Im_mask* is assumed to be composed of real values in [0, 1].
Fig. 10. Sketch of the neural classifier
Each neuron uses a nonlinear transfer function for the output, based on the hyperbolic tangent. MATLAB uses a more efficient implementation than tanh(n), that is⁵:

$$\mathrm{tansig}(n) = \frac{2}{1 + e^{-2n}} - 1$$
To set the correct number of hidden neurons N_H, instead of using a rule of thumb or proceeding by attempts, we choose a number that best approximates the number of neurons needed for exact learning, according to what Elisseeff and Paugam-Moisy have demonstrated [10]. With the transfer function set as above, given N_P the number of training set elements, N_S the number of outputs of the network, and N_I the number of inputs of the network, and assuming that the redundancy degree is null, it can be demonstrated that, to have an exact learning of the training set, it must hold that:

$$\frac{N_P N_S}{N_I + N_S} \le N_H \le \frac{2 N_P N_S}{N_I + N_S}$$

This does not mean that with such a number of neurons we can achieve exact learning, but that it is likely to be approximated. In general, an accurate training process does not aim at exact learning, in order to avoid overfitting problems.
All the above discussion led us to choose:

$$N_H = \left\lfloor 1.5 \, \frac{N_P N_S}{N_I + N_S} \right\rfloor$$
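As a quick illustration, the following helper transcribes this choice directly (Python; depending on rounding and on the exact input sizes used, the result may differ by one unit from the hidden-layer sizes reported in Section III):

    from math import floor


    def hidden_units(n_patterns, n_inputs, n_outputs=2):
        """Hidden-layer size N_H = floor(1.5 * N_P * N_S / (N_I + N_S))."""
        return floor(1.5 * n_patterns * n_outputs / (n_inputs + n_outputs))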
E. Output Filtering
Since the output of the neural network is often noisy, it is useful to implement a noise reduction mechanism. Bagdanov et al. [6] consider the output noise as a zero-mean Gaussian process, and use a Kalman filter to reduce it. In our method we use an EWMA (Exponentially Weighted Moving Average). Given that the difference between the outputs at two consecutive instants is marginal (if the hand status does not change), the EWMA between the current output out_NeuralNetwork and the previous one out_{i-1} allows for the attenuation of unwanted occasional noise effects. When the hand status changes, the noise reduction based on the EWMA adds a little delay, which however turns out to be acceptable:

out_i = (1 − α) × out_{i−1} + α × out_NeuralNetwork

where α = 0.3 is a good compromise between the noise reduction goal and the system responsiveness (fig. 11).

⁵ See the documentation about the hyperbolic tangent sigmoid transfer function in the MathWorks MATLAB Documentation Center: http://www.mathworks.it/it/help/nnet/ref/tansig.html
Fig. 11. Comparison between the raw output and the averaged one
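A minimal sketch of this filtering step is the following (Python; α = 0.3 is the value given above, while the class and method names are illustrative):

    class EwmaFilter:
        """Exponentially weighted moving average applied to the network output."""

        def __init__(self, alpha=0.3):
            self.alpha = alpha
            self.out = None

        def update(self, network_output):
            if self.out is None:
                self.out = network_output  # first frame: no history to average with
            else:
                self.out = (1 - self.alpha) * self.out + self.alpha * network_output
            return self.out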
III. EXPERIMENTAL RESULTS
To test the effectiveness of our method, we carried out several trials with different inputs to the neural network, according to the three possible hand representations previously described:
- for the hand depth mask as input method, we trained a neural network with a training set composed of 2500 cases, and 36 neurons in the hidden layer;
- for the hand depth mask with edges as input method, we trained a neural network with a training set composed of 2500 cases, and 20 neurons in the hidden layer;
- for the SURF descriptors as input method, we trained a neural network with the training set used by Bagdanov et al. [6], composed of 28400 cases, and 133 neurons in the hidden layer.
In all cases, we calculated the number of hidden neurons according to the discussion above.
Table I shows the processing times obtained using a MacBook Pro 15” Late 2011⁶, equipped with a 2.4 GHz quad-core processing unit. The simplest approach (based on the hand mask only) turned out to be the fastest one. Anyway, all three methods are suitable for real-time applications, since their timings are compatible with a theoretical frame rate of 100 fps. The actual maximum frame rate of the Microsoft Kinect is 30 fps, so the results are largely satisfactory.
TABLE I. A COMPARISON OF PROCESSING TIMES

Depth Mask Only       Depth Mask and Edges    SURF Features
1.6865 ms             10.1584 ms              7.6415 ms
3.8888 · 10³ ticks    31.4383 · 10³ ticks     13.2570 · 10³ ticks

⁶ See MacBook Pro 15” Late 2011 specifications: http://support.apple.com/kb/sp644
The streams used for the tests included people both with
sleeves up and down, captured in three different
environments, each with different lighting and reflective
conditions, and with different items in the scene.
In more detail, the first environment had uniform lighting; the second one had quite uniform lighting, less intense than the first; the third one had a big light source (an open window) behind the user (fig. 12). In all cases the sensor was set in Automatic Exposure mode.
Fig. 12. Some examples we used in our tests: a) strong lighting behind the user confusing the sensor (open hand wrongly recognized as closed); b) open hand correctly recognized in the same environment; c) closed hand correctly recognized in a constant lighting environment
Tables II, III, IV and figures 13 and 14 show the
recognition faults (%) vs. distance and lighting conditions.
As a consequence of the discussion above and the data
described in the plots, we can say that the hand mask method
gives the best results, in addition to being the simplest and
the fastest method. This is due to its independence from the
variations of lighting conditions. Such variations are the main cause of inefficiency for the hand mask with edges method. The Sobel algorithm is in fact highly vulnerable to lighting variations, so it is not suitable for non-controlled environments. The SURF-based method is the best one in optimal lighting conditions, but its performance worsens when the lighting is not controlled adequately.
TABLE II. HAND DEPTH MASK ONLY AS NEURAL NETWORK INPUT
TABLE III. HAND DEPTH MASK AND EDGES AS NEURAL NETWORK INPUT
TABLE IV. HAND SURF DESCRIPTORS AS NEURAL NETWORK INPUT
Fig. 13. Misclassification error vs. distance
Fig. 14. Misclassification error vs. lighting quality
IV. DISCUSSION
The main differences between the proposed method and other solutions available in the scientific literature concern three points of view: constraints, performance and accuracy.
In the following short discussion we will refer to the hand
depth mask as input approach (see section II.C.1). This
approach produces the most accurate results (see tables II,
III, IV and figures 13 and 14), and it also represents the best
solution in terms of performance (see table I).
- Constraints: some authors set quite strong constraints about the environment, the lighting, or the possibility to wear gloves and jewelry. Our proposed system is free from lighting constraints, since it is mainly based on depth data. The solution proposed by Bagdanov et al. [6], which is based on SURF features, needs a minimum amount of lighting to recognize the hand, because they use color information. The same applies to all the studies [2] that are based on the segmentation of color information. Furthermore, the use of depth data poses no constraints about the skin color (which can also be painted or tattooed) or the possibility to wear gloves, rings and bracelets. In [4] it is mandatory for users to wear a bracelet to mark the separation between the hand and the wrist. In our solution, the only constraint is the distance between the user and the Kinect, which should be in the range 1.5 ÷ 2.5 m, but this is due to device-related considerations, as mentioned above. The authors of [6] set the distance in the range 1 ÷ 3 m for almost the same reasons.
- Performance: our system works in real time. Bagdanov et al. [6] achieve the same result, but they use SURF descriptors. Their process is quite fast, but it is not as fast as the construction of depth masks, which can be obtained by a simple thresholding. Furthermore, in our case every frame is always processed and used for the classification. The system proposed by Ahmed [11], which is based on the extraction of features from the thresholded image, processes a frame only if there is a significant difference from the last processed frame. This solution cuts off data that can be significant for the classification, but it is required to keep the response time acceptable.
- Accuracy: among the systems we referred to for comparison, the only one that allows for the recognition of whether a hand is closed or open is the one proposed in [6]. The other considered systems [2] [4] [11] allow for the recognition of different poses, with several constraints about lighting, worn objects, and distances. Bagdanov et al. reach an accuracy of 98% [6], which is slightly better than our 96.5% in the best case. Anyway, their result is achieved at the cost of a greater computational load and a less general applicability.
V. CONCLUSIONS
In this paper we presented and discussed a method to recognize whether a hand is open or closed, based on a neural network and on sensory data coming from the Kinect device. No further constraints are set on users or on the environment in terms of background or colors, and the results are achieved in real time.
Our method can be integrated within more complex recognition systems, to implement the hand pose recognition task. It also turns out to be a good basis for further developments to improve its already good performance.
Based on our experiments and discussion, we first conclude that the approach that includes the edges is the worst one and has to be left out, unless some lighting-independent algorithm is used for the edge extraction. The remaining two methods are worth some further study to refine them. Concerning the SURF-based one, the source images can be improved, for example, by smoothing the mask used for the background removal and by normalizing the contrast, so that the extracted features are independent of insignificant data.
Depth masks can be represented in different ways, to make them independent from scale, rotation, etc. Ren et al. [4] extract a signature from the depth mask to represent it with a fixed-dimension array, regardless of its actual size (e.g., by sampling at each sexagesimal degree, in a 360-dimension space). To ensure rotational independence, it may be advantageous to apply an algorithm that rotates the image appropriately (as Matos et al. [14] suggest). Ahmed [11] extracts 33 features from a binary image, based on the percentage of white pixels in different overlapping image areas, as well as on the processing of some central moments of the hand position (Biswas et al. [15] also use a similar approach) (fig. 15).
Fig. 15. Some possible ways to represent the binary depth masks: a) a signature representing the edge as a one-dimensional function, in polar coordinates; b) subdivision levels to extract some features [11].
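As an illustration of the signature representation mentioned above, the following sketch samples the centroid-to-contour distance once per degree, producing a fixed 360-element array regardless of the mask size (Python with OpenCV 4.x; this follows the general idea attributed to Ren et al. [4], not their exact algorithm):

    import cv2
    import numpy as np


    def polar_signature(mask, samples=360):
        """Sample the centroid-to-contour distance once per degree (fixed-size signature)."""
        # External contour of the binary mask (OpenCV >= 4 returns (contours, hierarchy))
        contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_NONE)
        contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)

        cx, cy = contour.mean(axis=0)                    # centroid of the contour
        dx, dy = contour[:, 0] - cx, contour[:, 1] - cy
        angles = np.degrees(np.arctan2(dy, dx)) % 360.0  # angle of each contour point
        radii = np.hypot(dx, dy)

        # For each 1-degree bin keep the largest radius found (0 if the bin is empty)
        signature = np.zeros(samples, dtype=np.float32)
        bins = angles.astype(int) % samples
        np.maximum.at(signature, bins, radii)
        return signature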
ACKNOWLEDGMENT
This paper has been partially supported under the
research program P.O.N. RICERCA E COMPETITIVITA'
2007-2013, project title SINTESYS - Security INTElligence
SYStem, project code PON 01_01687.
REFERENCES
[1] Frati, V.; Prattichizzo, D., "Using Kinect for hand tracking and
rendering in wearable haptics," IEEE World Haptics Conference
(WHC 2011), pp. 317-321, 21-24 June 2011, doi:
10.1109/WHC.2011.5945505
[2] Cheng Tang; Yongsheng Ou; Guolai Jiang; Qunqun Xie; Yangsheng Xu, "Hand tracking and pose recognition via depth and color information," 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1104-1109, 11-14 Dec. 2012, doi: 10.1109/ROBIO.2012.6491117
[3] La Cascia, M.; Morana, M.; Sorce, S., "Mobile Interface for Content-Based Image Management," 2010 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp. 718-723, 15-18 Feb. 2010, doi: 10.1109/CISIS.2010.172
[4] Zhou Ren, Junsong Yuan, and Zhengyou Zhang. 2011. Robust hand
gesture recognition based on finger-earth mover's distance with a
commodity depth camera. In Proceedings of the 19th ACM
international conference on Multimedia (MM '11). ACM, New York,
NY, USA, 1093-1096. DOI=10.1145/2072298.2071946
[5] Khoshelham K, “Accuracy analysis of Kinect depth data”. In: ISPRS
Workshop Laser Scanning, vol. XXXVIII (2011), pp. 133-138
[6] Bagdanov, A.D.; Del Bimbo, A.; Seidenari, L.; Usai, L., "Real-time
hand status recognition from RGB-D imagery," 21st International
Conference on Pattern Recognition (ICPR 2012), pp.2456-2459, 11-
15 Nov. 2012
[7] Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, "SURF:
Speeded Up Robust Features", Computer Vision and Image
Understanding (CVIU), Vol. 110, No. 3, pp. 346--359, 2008
[8] Zhang, G.P., "Neural networks for classification: a survey," IEEE Transactions on Systems, Man, and Cybernetics, vol. 30, no. 4, pp. 451-462, Nov. 2000, doi: 10.1109/5326.897072
[9] Beale, Mark Hudson, Hagan, Martin T. and Demuth, Howard B.
“Neural Network Toolbox User's Guide”, Mathworks Documentation
Center. [Online] [last accessed on: 17 04 2013.]
http://www.mathworks.it/help/pdf_doc/nnet/nnet_ug.pdf.
[10] André Elisseeff, Hélène Paugam-Moisy, “Size of Multilayer
Networks for Exact Learning: Analytic Approach.”, pp.162-168 In
proceeding of: Advances in Neural Information Processing Systems
9, NIPS, Denver, CO, USA, December 2-5, 1996
[11] Tasnuva Ahmed, "A Neural Network based Real Time Hand Gesture Recognition System," International Journal of Computer Applications, vol. 59, no. 5, pp. 17-22, December 2012. Published by Foundation of Computer Science, New York, USA, doi: 10.5120/9535-3971
[12] Gentile, A.; Andolina, S.; Massara, A.; Pirrone, D.; Russo, G.; Santangelo, A.; Trumello, E.; Sorce, S., "A Multichannel Information System to Build and Deliver Rich User-Experiences in Exhibits and Museums," 2011 International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA), pp. 57-64, 26-28 Oct. 2011, doi: 10.1109/BWCCA.2011.14
[13] Sorce, S.; Augello, A.; Santangelo, A.; Gentile, A.; Genco, A.; Gaglio, S.; Pilato, G., "Interacting with Augmented Environments," IEEE Pervasive Computing, vol. 9, no. 2, pp. 56-58, April-June 2010, doi: 10.1109/MPRV.2010.34
[14] Hélder Matos; Hélder P. Oliveira; Filipe Magalhães, "Hand-Geometry Based Recognition System: A Non Restricted Acquisition Approach," in 9th International Conference on Image Analysis and Recognition (ICIAR), Aveiro, Portugal, 2012, pp. 38-45, doi: 10.1007/978-3-642-31298-4_5
[15] K. K. Biswas; Kumar Basu Saurav, “Gesture Recognition using
Microsoft Kinect®,” in 5th International Conference on Automation,
Robotics and Applications (ICARA), Wellington, New Zealand, 2011,
pp. 100-103. DOI: 10.1109/ICARA.2011.6144864.