Real-time Hand Pose Recognition Based on a Neural Network Using Microsoft Kinect
Salvatore Sorce, Vito Gentile, Antonio Gentile
Dipartimento di Ingegneria Chimica Gestionale Informatica Meccanica
Università degli Studi di Palermo
90128 Palermo - Italy
salvatore.sorce@unipa.it, vitogentile@live.it, antonio.gentile@unipa.it
Abstract—The Microsoft Kinect sensor is widely used to detect and recognize body gestures and posture with sufficient reliability, accuracy and precision in a fairly simple way. However, the relatively low resolution of its optical sensors does not allow the device to detect gestures of body parts, such as the fingers of a hand, with the same straightforwardness. Given the clear application of this technology to the field of user interaction within immersive multimedia environments, there is a real need for a reliable and effective method to detect the pose of some body parts. In this paper we propose a method based on a neural network to detect the hand pose in real time, in order to recognize whether the hand is closed or not. The neural network is used to process color, depth and skeleton information coming from the Kinect device. This information is preprocessed to extract some significant features. The output of the neural network is then filtered with a time average, to reduce the noise due to the fluctuation of the input data. We analyze and discuss three possible implementations of the proposed method, obtaining an accuracy of 90% under good lighting and background conditions, and even reaching 95% in the best cases, in real time.
Keywords—human-computer interaction; gesture-based
interaction; gesture recognition; Microsoft Kinect.
I. INTRODUCTION
The Microsoft Kinect sensor is largely used for several purposes, far beyond the original one (that is, to interact with games by gestures). Its low price, the availability of software development kits (both proprietary and third-party), and the ease of programming and integration have made the device adaptable to many needs, even in research domains. The Microsoft Kinect basically allows its users to detect people within a scene, along with their body posture, through the recognition of the skeleton based on the analysis of a color view of the scene along with a synchronized depth map. Both developers and researchers have exploited such features to build several software libraries to detect and recognize body gestures as well.
One of the main drawbacks of the Kinect sensor is the
resolution limitation both for color images and depth maps.
In fact, the resolution is limited to 640x480 pixels when the
device works at 30 fps. This implies that people standing a
few feet away from the sensor will be displayed in a
relatively small area of the image. This limitation makes the recognition of the posture and gestures of body parts, such as the hands, a very hard task.
In this paper we present an algorithm by which the
Kinect sensor recognizes whether people’s hands are closed
or not. This pose recognition could be useful within any application that involves the grabbing gesture or similar ones, and could be integrated into existing solutions for interactive service provision [12] [13]. We assume that there are no
obstacles between the sensor and the hands, and that people
are standing in front of the sensor at a distance between 1.5
and 2.5 meters. This range also allows for the integration of
the presented solution with other ones designed for the
recognition of gestures of the body or parts of it.
The solution we present and discuss in this paper is
based on the processing of sensory data (specifically depth
and skeleton information), along with some anthropometric
considerations, for the segmentation of the hand. This
allows us to identify a region of interest containing
information on the hand only, with no need to use markers
or devices to be worn, and with no contrast constraints
against the background. Such a region is thus processed and
used as input to a neural network that acts as a classifier.
The output of the network is then filtered with a time
average to reduce the noise.
We will lastly discuss experimental results showing the
effectiveness of the proposed solution for real-time
applications.
II. METHOD OVERVIEW
Many researchers [1] [2] [3] carried out the hand pose
recognition task based on the detection of the individual
fingers, assuming that the user is suitably close to the Kinect
sensor. Zhou Ren et al. [4] also assumed that users wear a
black bracelet to better recognize the hand from the forearm,
thus simplifying the segmentation task.
Our proposed method does not set any particular constraint, with the exception of the distance from the sensor, which is set according to the Microsoft guidelines to achieve the best results. In
fact, to keep the possibility to recognize common body
poses, thus allowing for the integration with other gesture
recognition algorithms, the Kinect sensor must be able to
frame the entire user body, even during the execution of the
gesture. To this end, Microsoft suggests that users be at a distance between 1.2 and 3.5 meters from the sensor in order to obtain good recognition performance. However, based on the study by Khoshelham on the distance measurement error of the depth sensor [5], the maximum distance allowed for the user is set to 2.5 meters.
At this distance, the user skeleton will then be correctly
identified, so that all sensory information coming from the
Kinect device can be suitably used to feed the proposed
algorithm. All dimensions (depth, distance, positions) can
be expressed both in meters and in pixels, since the Microsoft DLLs and APIs provide programmers with useful conversion
and mapping tools. We will then refer to these dimensions
with no explicit reference to the unit of measurement.
In the following sections, we will present and discuss the
recognition of one hand pose; however, the proposed
approach can be easily adapted to recognize both hands at
the same time.
A. Starting Recognition Process
The first step of our method determines whether the
appropriate conditions occur that will trigger the actual hand
pose recognition procedure. In fact, it is useless to process
the sensory data if the user has his arms along his body, or if they are behind his back (fig. 1a). In general, the hands have to be clearly visible and quite far from the body (fig. 1b).
Fig. 1. a) The hand is along the body; b) the hand is far enough from the body
We thus first determine whether the hand stands at a
suitable distance from the body or not. This distance may be
evaluated based on information related to the skeleton scale,
and in particular to the forearm length (or the whole arm).
For the sake of simplicity, it is also possible to set a value that works well in most cases, based on experimental results. In our method we set the lower limit of the distance between the hand and the center of mass of the user's body at 15 cm¹.
¹ The output of the Microsoft APIs, when a human figure is in the frame, is a set of points corresponding to the main skeleton joints. The APIs also provide a further point which corresponds to the whole skeleton; this can be identified as the body center of mass.
Let P_H(x_H, y_H, z_H) be the 3D coordinates of the hand point, and P_S(x_S, y_S, z_S) the center of mass of the body (see fig. 2 for the reference system). To start the hand pose recognition process, it must hold that²:

|z_H − z_S| > 0.15 m
Fig. 2. The 3D reference system used by the Kinect device
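As an illustration, the trigger check can be sketched as follows (Python-like pseudocode; the skeleton data structure and joint names are assumptions for illustration, not the actual Kinect SDK types):

    # Minimal sketch of the trigger condition described above.
    # The skeleton is assumed to be a dict mapping joint names to (x, y, z) in meters;
    # names and structure are illustrative, not the actual Kinect SDK API.

    HAND_BODY_MIN_DISTANCE = 0.15  # meters, lower limit used in our method


    def should_start_recognition(skeleton, hand="HandRight"):
        """Return True when the hand is far enough (along z) from the body center of mass."""
        x_h, y_h, z_h = skeleton[hand]            # hand point P_H
        x_s, y_s, z_s = skeleton["CenterOfMass"]  # whole-skeleton point P_S
        return abs(z_h - z_s) > HAND_BODY_MIN_DISTANCE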
B. Hand Segmentation
Assuming that the frames of the RGB camera are aligned with those of the depth sensor³, we set a square centered on the hand point P_H (actually on the 2D point (x_H, y_H)). This can be done by converting the depth information into gray scale images. We therefore assume the hand is entirely included within this square. The size L of the hand bounding square can be suitably obtained on an experimental basis, or it can be based on anthropometric considerations, so that L is obtained by comparison with the entire body size (this task can be carried out by measuring the distances between other key points of the skeleton). For our purposes we use a value L = 25 cm, which proved suitable in most cases. Better results can be achieved with less empirical values, as suggested by Cheng et al. [2], who estimated a linear relationship between the size of the hand palm and its depth.
Let now Im_depth and Im_Gray be the images coming from the depth sensor and the RGB camera respectively (the latter actually converted to gray scale). We then apply a threshold on Im_depth, thus obtaining a new binary image Im_mask (fig. 3), according to the following rule:

$$\mathrm{Im}_{mask}(x, y) = \begin{cases} \text{WHITE} & \text{if } \left| \mathrm{Im}_{depth}(x, y) - z_H \right| \le \Delta_{depth} \\ \text{BLACK} & \text{otherwise} \end{cases}$$

The Δ_depth value represents the hand thickness along the hand-sensor segment (fig. 4). This value must be estimated based on empirical or anthropometric considerations. In our experiments we obtained good results by setting Δ_depth = 8 cm, even from the noise tolerance point of view.
² We assume that the hand is moved forward from the body in order to be detected, as a natural action (see fig. 1 a) and b)).
³ Actually, depth and RGB images are not perfectly aligned due to the non-zero distance between the sensors in the Kinect package. However, the Kinect SDK is equipped with an efficient mapping tool, so we can assume they are aligned.
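A minimal sketch of this segmentation step is the following (NumPy; it assumes the depth frame has already been mapped onto the color frame and expressed in meters, and that the hand point has been converted to pixel coordinates; variable and function names are illustrative):

    import numpy as np

    DELTA_DEPTH = 0.08  # hand thickness along the hand-sensor segment (8 cm)


    def hand_mask(depth_m, hand_px, z_h, box_px):
        """Crop a square around the hand point and threshold it around the hand depth z_h.

        depth_m : 2D float array, depth frame in meters (assumed aligned with the RGB frame)
        hand_px : (col, row) pixel coordinates of the hand point P_H
        z_h     : depth of the hand point, in meters
        box_px  : side of the bounding square in pixels (the 25 cm square converted to pixels)
        """
        c, r = hand_px
        half = box_px // 2
        crop = depth_m[max(r - half, 0):r + half, max(c - half, 0):c + half]

        # WHITE (1) where the depth is within DELTA_DEPTH of the hand depth, BLACK (0) otherwise
        return (np.abs(crop - z_h) <= DELTA_DEPTH).astype(np.uint8)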
Fig. 3. a) The gray scale image of the hand, b) the depth data of the hand (Im_depth), and c) a binary mask of the hand (Im_mask)
Fig. 4. Geometric meaning of Δ_depth: a) from below; b) from above
C. Inputs for the Neural Classifier
The sensory data related to the hand are then used to feed
a neural network. We used three different approaches that
will be detailed below: hand mask as input, hand mask with
edges as input, SURF descriptors as input.
1) Hand Depth Mask as Input
In this simple approach, we use the binary hand image only as input for the neural network. Prior to using the image, we scale it to a fixed size (100x100 px in our case). Rather than input 10,000 binary values, we decided to base our approach on the horizontal and vertical projection histograms (fig. 5). In fact, from an LxL image, we get:
- an L-dimensional array, in which the n-th element represents the number of white pixels in the n-th row of the image;
- an L-dimensional array, in which the n-th element represents the number of white pixels in the n-th column of the image.
By juxtaposing these arrays, we obtain a single array whose dimension is 2L (in our case 200 elements, instead of the 10,000 we would have had with the whole binary image).
Fig. 5. Horizontal and vertical projection histograms of the binary mask
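A sketch of this feature extraction reduces to two sums over the resized mask (Python with OpenCV and NumPy; the function name and the nearest-neighbour resize are illustrative choices):

    import cv2
    import numpy as np


    def projection_histogram_features(mask, size=100):
        """Build the 2L input vector from a binary hand mask (white = 1, black = 0)."""
        resized = cv2.resize(mask.astype(np.uint8), (size, size),
                             interpolation=cv2.INTER_NEAREST)
        h_hist = resized.sum(axis=1)  # white pixels per row (L values)
        v_hist = resized.sum(axis=0)  # white pixels per column (L values)
        return np.concatenate([h_hist, v_hist])  # 2L values (200 for L = 100)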
The main drawback of this approach is that it may not be suitable to recognize the hand status in some cases, such as when the hand is nearly parallel to the hand-sensor line (fig. 6). In this case, the binary image alone may not be sufficient for discrimination, and some other processing should be used. The main advantage is that it does not make use of the color image, thus allowing for recognition even in poor lighting conditions or when the user wears gloves.
Fig. 6. The gray scale images show the different hand poses, a) closed and b) open, which are in turn indistinguishable in the binary masks obtained from the depth image.
2) Hand Depth Mask with Edges as Input
To improve the discrimination capabilities of the neural network in cases such as the one described above, more information must be added to the input. A possible way is to extract the edges from the gray scale image, apply a binary threshold to them, and then obtain the horizontal and vertical histograms. Such information can be used as input in addition to that of the hand mask alone. In this case, the input of the neural network is given by a 4L-dimensional array (fig. 7).
Fig. 7. The horizontal and vertical histograms, both from the binary mask
and the binary edge. Together they are the input for the neural network
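A sketch of this variant is given below, using the Sobel gradient magnitude as the edge extractor (Section III refers to the Sobel algorithm); the edge threshold value is an assumption made for illustration:

    import cv2
    import numpy as np


    def mask_and_edge_features(mask, gray_roi, size=100, edge_threshold=60):
        """Build the 4L input vector: mask histograms plus binary-edge histograms.

        mask     : binary hand mask (Im_mask), cropped around the hand
        gray_roi : gray scale crop of the same region (from Im_Gray)
        """
        # Edge magnitude from the gray scale image (Sobel), then a binary threshold
        gx = cv2.Sobel(gray_roi, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray_roi, cv2.CV_32F, 0, 1)
        edges = (cv2.magnitude(gx, gy) > edge_threshold).astype(np.uint8)

        feats = []
        for img in (mask.astype(np.uint8), edges):
            resized = cv2.resize(img, (size, size), interpolation=cv2.INTER_NEAREST)
            feats.append(resized.sum(axis=1))  # horizontal projection histogram
            feats.append(resized.sum(axis=0))  # vertical projection histogram
        return np.concatenate(feats)  # 4L values (400 for L = 100)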
In this approach, the main disadvantage is that the edge detection is strongly dependent on the image quality, which in turn decreases as the user's distance from the sensor increases. Furthermore, with this approach it is impossible to recognize the hand status if the user wears gloves.
3) SURF Descriptors as Input
Another way to add information, aiming at a better discrimination of similar binary hand masks, is to merge the binary threshold of the depth image with the RGB camera data. Let Im_mask* be the image obtained by morphologically enlarging the Im_mask image by 2 pixels. Now multiply Im_mask*⁴ and Im_Gray element by element and consider the smallest rectangle that includes the non-black area of the result. Let us call this area Im_ROI (fig. 8).
Fig. 8. The binary image (a) is morphologically enlarged (b), and the result is applied as a mask on the gray scale image, thus eliminating the background (c). The region of interest is obtained by keeping all the non-black pixels (d)
Bagdanov et al. [6] show that it is possible to train a Support Vector Machine (SVM) with a set of five SURF-128 features [7], computed at the same number of key points, each of them referring to partially overlapping parts of the Im_ROI region. Figure 9 shows these parts.
Fig. 9. Key points and image areas
In this way we extract an array of 128 x 5 = 640 elements from the image, which can be used as input for the neural network. This allows us to exploit the invariance properties of SURF with respect to scaling and rotation. Also in this case, the user should not wear gloves.
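A rough sketch of this feature extraction is shown below. It assumes an OpenCV build that still ships the non-free SURF implementation (opencv-contrib), and it places the five key points at the ROI center and at the centers of its four quadrants, which is only a guess at the layout of fig. 9; the descriptor sizes follow the 128 x 5 = 640 figure given above:

    import cv2
    import numpy as np


    def surf_features(gray, mask):
        """Extract a 640-element SURF-128 descriptor array from the masked hand region."""
        # Enlarge the binary mask slightly and remove the background from the gray image
        dilated = cv2.dilate(mask.astype(np.uint8), np.ones((5, 5), np.uint8))  # ~2 px enlargement
        masked = cv2.bitwise_and(gray, gray, mask=dilated)

        # Bounding box of the non-black area -> Im_ROI
        ys, xs = np.nonzero(masked)
        roi = masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

        # Five partially overlapping regions: center plus four quadrant centers (illustrative layout)
        h, w = roi.shape
        size = 0.75 * min(h, w)  # key point "size" controls the SURF support region
        centers = [(w / 2, h / 2), (w / 4, h / 4), (3 * w / 4, h / 4),
                   (w / 4, 3 * h / 4), (3 * w / 4, 3 * h / 4)]
        keypoints = [cv2.KeyPoint(x, y, size) for x, y in centers]

        # SURF-128 (extended descriptors); requires opencv-contrib with non-free modules enabled
        surf = cv2.xfeatures2d.SURF_create(extended=True)
        _, descriptors = surf.compute(roi, keypoints)
        return descriptors.flatten()  # 5 x 128 = 640 values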
D. Neural Network Structure and its Usage
Since our goal is to recognize whether a hand is closed or not, the decisional process can be carried out by a neural network, as often occurs in similar cases [8]. The neural network is trained by means of a MATLAB toolbox that implements a variation of the Widrow-Hoff learning rule with back propagation [9]. The network is composed of N_I neurons directly connected to the inputs, it has only one hidden layer composed of N_H neurons, and N_S = 2 output neurons. It can be sketched as in figure 10.
⁴ Im_mask* is assumed to be composed of real values in [0, 1].
Fig. 10. Sketch of the neural classifier
Each neuron uses a nonlinear transfer function for the output, based on the hyperbolic tangent. MATLAB uses a more efficient implementation than tanh(n), that is⁵:

$$\mathrm{tansig}(n) = \frac{2}{1 + e^{-2n}} - 1$$
To set the correct number of hidden neurons N_H, instead of using a rule of thumb or proceeding by attempts, we choose a number that best approximates the number of neurons needed for exact learning, according to what Elisseeff and Paugam-Moisy have demonstrated [10]. With the transfer function set as above, given N_P the number of training set elements, N_S the number of outputs of the network, and N_I the number of inputs of the network, and assuming that the redundancy degree is null, it can be demonstrated that, to have an exact learning of the training set, it must hold that:

$$\frac{N_P N_S}{N_I + N_S} \le N_H \le \frac{2 N_P N_S}{N_I + N_S}$$

This does not mean that with such a number of neurons we can achieve exact learning, but that it is likely to be approximated. In general, an accurate training process does not aim at exact learning, in order to avoid overfitting problems.
All the above discussion led us to choose:

$$N_H = \left\lfloor 1.5 \, \frac{N_P N_S}{N_I + N_S} \right\rfloor$$
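As a quick illustration, the following helper transcribes this choice directly (Python; depending on rounding and on the exact input sizes used, the result may differ by one unit from the hidden-layer sizes reported in Section III):

    from math import floor


    def hidden_units(n_patterns, n_inputs, n_outputs=2):
        """Hidden-layer size N_H = floor(1.5 * N_P * N_S / (N_I + N_S))."""
        return floor(1.5 * n_patterns * n_outputs / (n_inputs + n_outputs))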
E. Output Filtering
Since the output of the neural network is often noisy, it is useful to implement a noise reduction mechanism. Bagdanov et al. [6] consider the output noise as a zero-mean Gaussian process, and use a Kalman filter to reduce it. In our method we use an EWMA (Exponentially Weighted Moving Average). Given that the difference between the outputs at two consecutive instants is marginal (if the hand status does not change), the EWMA between the current output out_NeuralNetwork and the previous one out_{i-1} allows for the attenuation of unwanted occasional noise effects. When the hand status changes, the noise reduction based on the EWMA adds a little delay, which however turns out to be acceptable:

out_i = (1 − α) × out_{i−1} + α × out_NeuralNetwork

where α = 0.3 is a good compromise between the noise reduction goal and the system responsiveness (fig. 11).

⁵ See the documentation about the hyperbolic tangent sigmoid transfer function in the MathWorks MATLAB Documentation Center: http://www.mathworks.it/it/help/nnet/ref/tansig.html
Fig. 11. Comparison between the raw output and the averaged one
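A minimal sketch of this filtering step is the following (Python; α = 0.3 is the value given above, while the class and method names are illustrative):

    class EwmaFilter:
        """Exponentially weighted moving average applied to the network output."""

        def __init__(self, alpha=0.3):
            self.alpha = alpha
            self.out = None

        def update(self, network_output):
            if self.out is None:
                self.out = network_output  # first frame: no history to average with
            else:
                self.out = (1 - self.alpha) * self.out + self.alpha * network_output
            return self.out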
III. EXPERIMENTAL RESULTS
To test the effectiveness of our method, we carried out several trials with different inputs to the neural network, according to the three possible hand representations previously described:
- for the hand depth mask as input method, we trained a neural network with a training set composed of 2500 cases, and 36 neurons in the hidden layer;
- for the hand depth mask with edges as input method, we trained a neural network with a training set composed of 2500 cases, and 20 neurons in the hidden layer;
- for the SURF descriptors as input method, we trained a neural network with the training set used by Bagdanov et al. [6], composed of 28400 cases, and 133 neurons in the hidden layer.
In all cases, we calculated the number of hidden neurons according to the discussion above.
Table I shows the processing times obtained using a MacBook Pro 15” Late 2011⁶, equipped with a 2.4 GHz quad-core processing unit. The simplest approach (based on the hand mask only) turned out to be the fastest one. Anyway, all three methods are suitable for real-time applications, since their timings are compatible with a theoretical frame rate of 100 fps. The actual maximum frame rate of the Microsoft Kinect is 30 fps, so the results are largely satisfactory.
TABLE I. A COMPARISON OF PROCESSING TIMES

Depth Mask Only       Depth Mask and Edges    SURF Features
1.6865 ms             10.1584 ms              7.6415 ms
3.8888 · 10³ ticks    31.4383 · 10³ ticks     13.2570 · 10³ ticks

⁶ See MacBook Pro 15” Late 2011 specifications: http://support.apple.com/kb/sp644
The streams used for the tests included people both with
sleeves up and down, captured in three different
environments, each with different lighting and reflective
conditions, and with different items in the scene.
In more detail, the first environment had uniform lighting; the second one had quite uniform lighting, less intense than the first; the third one had a big light source (an open window) behind the user (fig. 12). In all cases the sensor was set in Automatic Exposure mode.
Fig. 12. Some examples we used in our tests: a) strong lighting behind the user confusing the sensor (open hand wrongly recognized as closed); b) open hand correctly recognized in the same environment; c) closed hand correctly recognized in a constant lighting environment
Tables II, III, IV and figures 13 and 14 show the
recognition faults (%) vs. distance and lighting conditions.
As a consequence of the discussion above and the data
described in the plots, we can say that the hand mask method
gives the best results, in addition to being the simplest and
the fastest method. This is due to its independence from the
variations of lighting conditions. Such variations are the main cause of inefficiency for the hand mask with edges method. The Sobel algorithm is in fact highly vulnerable to lighting variations, so it is not suitable for non-controlled environments. The SURF-based method is the best one in optimal lighting conditions, but its performance worsens when the lighting is not controlled adequately.
TABLE II. HAND DEPTH MASK ONLY AS NEURAL NETWORK INPUT
TABLE III. HAND DEPTH MASK AND EDGES AS NEURAL NETWORK INPUT
TABLE IV. HAND SURF DESCRIPTORS AS NEURAL NETWORK INPUT
Fig. 13. Misclassification error vs. distance
Fig. 14. Misclassification error vs. lighting quality
IV. DISCUSSION
The main differences between the proposed method and other solutions available in the scientific literature concern three points of view: constraints, performance and accuracy.
In the following short discussion we will refer to the hand
depth mask as input approach (see section II.C.1). This
approach produces the most accurate results (see tables II,
III, IV and figures 13 and 14), and it also represents the best
solution in terms of performance (see table I).
- Constraints: some authors set quite strong constraints about the environment, the lighting, or the possibility to wear gloves and jewelry. Our proposed system is free from lighting constraints, since it is mainly based on depth data. The solution proposed by Bagdanov et al. [6], which is based on SURF features, needs a minimum amount of lighting to recognize the hand, because they use color information. The same applies to all the studies [2] that are based on the segmentation of color information. Furthermore, the use of depth data poses no constraints about the skin color (which can also be painted or tattooed) or the possibility to wear gloves, rings and bracelets. In [4] it is mandatory for users to wear a bracelet to mark the separation between the hand and the wrist. In our solution, the only constraint is the distance between the user and the Kinect, which should be in the range 1.5 ÷ 2.5 m, but this is due to device-related considerations, as mentioned above. The authors of [6] set the distance in the range 1 ÷ 3 m for almost the same reasons.
- Performance: our system works in real time. Bagdanov et al. [6] achieve the same result, but they use SURF descriptors. Their process is quite fast, but it is not as fast as the construction of depth masks, which can be obtained by a simple thresholding. Furthermore, in our case every frame is always processed and used for the classification. The system proposed by Ahmed [11], which is based on the extraction of features from the thresholded image, processes a frame only if there is a significant difference from the last processed frame. This solution cuts off data that can be significant for the classification, but it is required to keep the response time acceptable.
- Accuracy: among the systems we referred to for comparison, the only one that allows for the recognition of whether a hand is closed or open is the one proposed in [6]. The other considered systems [2] [4] [11] allow for the recognition of different poses, with several constraints about lighting, worn objects, and distances. Bagdanov et al. reach an accuracy of 98% [6], which is slightly better than our 96.5% in the best case. Anyway, their result is achieved at the cost of a greater computational load and a less general applicability.
V. CONCLUSIONS
In this paper we presented and discussed a method to recognize whether a hand is open or closed, based on a neural network and on sensory data coming from the Kinect device. No further constraints are set on users or on the environment in terms of background or colors, and the results are achieved in real time.
Our method can be integrated within more complex recognition systems, to implement the hand pose recognition task. It also turns out to be a good basis for further developments to improve its already good performance.
Based on our experiments and discussion, we first conclude that the approach that includes the edges is the worst one and has to be left out, unless some lighting-independent algorithm is used for the edge extraction. The remaining two methods are worth some further study to refine them. Concerning the SURF-based one, the source images can be improved, for example, by smoothing the mask used for the background removal and by normalizing the contrast, so that the extracted features are independent of insignificant data.
Depth masks can be represented in different ways, to make them independent from scale, rotation, etc. Ren et al. [4] extract a signature from the depth mask to represent it with a fixed-dimension array, regardless of its actual size (e.g., by sampling at each sexagesimal degree, in a 360-dimension space). To ensure rotational independence, it may be advantageous to apply an algorithm that rotates the image appropriately (as Matos et al. [14] suggest). Ahmed [11] extracts 33 features from a binary image, based on the percentage of white pixels in different overlapping image areas, as well as on the processing of some central moments of the hand position (Biswas et al. [15] also use a similar approach) (fig. 15).
Fig. 15. Some possible ways to represent the binary depth masks: a) a signature representing the edge as a one-dimensional function, in polar coordinates; b) subdivision levels to extract some features [11].
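As an illustration of the signature representation mentioned above, the following sketch samples the centroid-to-contour distance once per degree, producing a fixed 360-element array regardless of the mask size (Python with OpenCV 4.x; this follows the general idea attributed to Ren et al. [4], not their exact algorithm):

    import cv2
    import numpy as np


    def polar_signature(mask, samples=360):
        """Sample the centroid-to-contour distance once per degree (fixed-size signature)."""
        # External contour of the binary mask (OpenCV >= 4 returns (contours, hierarchy))
        contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_NONE)
        contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)

        cx, cy = contour.mean(axis=0)                    # centroid of the contour
        dx, dy = contour[:, 0] - cx, contour[:, 1] - cy
        angles = np.degrees(np.arctan2(dy, dx)) % 360.0  # angle of each contour point
        radii = np.hypot(dx, dy)

        # For each 1-degree bin keep the largest radius found (0 if the bin is empty)
        signature = np.zeros(samples, dtype=np.float32)
        bins = angles.astype(int) % samples
        np.maximum.at(signature, bins, radii)
        return signature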
ACKNOWLEDGMENT
This paper has been partially supported under the
research program P.O.N. RICERCA E COMPETITIVITA'
2007-2013, project title SINTESYS - Security INTElligence
SYStem, project code PON 01_01687.
REFERENCES
[1] Frati, V.; Prattichizzo, D., "Using Kinect for hand tracking and
rendering in wearable haptics," IEEE World Haptics Conference
(WHC 2011), pp. 317-321, 21-24 June 2011, doi:
10.1109/WHC.2011.5945505
[2] Cheng Tang; Yongsheng Ou; Guolai Jiang; Qunqun Xie; Yangsheng Xu, "Hand tracking and pose recognition via depth and color information," 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1104-1109, 11-14 Dec. 2012, doi: 10.1109/ROBIO.2012.6491117
[3] La Cascia, M.; Morana, M.; Sorce, S., "Mobile Interface for Content-Based Image Management," 2010 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp. 718-723, 15-18 Feb. 2010, doi: 10.1109/CISIS.2010.172
[4] Zhou Ren, Junsong Yuan, and Zhengyou Zhang. 2011. Robust hand
gesture recognition based on finger-earth mover's distance with a
commodity depth camera. In Proceedings of the 19th ACM
international conference on Multimedia (MM '11). ACM, New York,
NY, USA, 1093-1096. DOI=10.1145/2072298.2071946
[5] Khoshelham K, “Accuracy analysis of Kinect depth data”. In: ISPRS
Workshop Laser Scanning, vol. XXXVIII (2011), pp. 133-138
[6] Bagdanov, A.D.; Del Bimbo, A.; Seidenari, L.; Usai, L., "Real-time
hand status recognition from RGB-D imagery," 21st International
Conference on Pattern Recognition (ICPR 2012), pp.2456-2459, 11-
15 Nov. 2012
[7] Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, "SURF:
Speeded Up Robust Features", Computer Vision and Image
Understanding (CVIU), Vol. 110, No. 3, pp. 346--359, 2008
[8] Zhang, G.P., "Neural networks for classification: a survey," IEEE Transactions on Systems, Man, and Cybernetics, vol. 30, no. 4, pp. 451-462, Nov. 2000, doi: 10.1109/5326.897072
[9] Beale, Mark Hudson, Hagan, Martin T. and Demuth, Howard B.
“Neural Network Toolbox User's Guide”, Mathworks Documentation
Center. [Online] [last accessed on: 17 04 2013.]
http://www.mathworks.it/help/pdf_doc/nnet/nnet_ug.pdf.
[10] André Elisseeff, Hélène Paugam-Moisy, “Size of Multilayer
Networks for Exact Learning: Analytic Approach.”, pp.162-168 In
proceeding of: Advances in Neural Information Processing Systems
9, NIPS, Denver, CO, USA, December 2-5, 1996
[11] Tasnuva Ahmed, "A Neural Network based Real Time Hand Gesture Recognition System," International Journal of Computer Applications, vol. 59, no. 5, pp. 17-22, December 2012. Published by Foundation of Computer Science, New York, USA, doi: 10.5120/9535-3971
[12] Gentile, A.; Andolina, S.; Massara, A.; Pirrone, D.; Russo, G.; Santangelo, A.; Trumello, E.; Sorce, S., "A Multichannel Information System to Build and Deliver Rich User-Experiences in Exhibits and Museums," 2011 International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA), pp. 57-64, 26-28 Oct. 2011, doi: 10.1109/BWCCA.2011.14
[13] Sorce, S.; Augello, A.; Santangelo, A.; Gentile, A.; Genco, A.; Gaglio, S.; Pilato, G., "Interacting with Augmented Environments," IEEE Pervasive Computing, vol. 9, no. 2, pp. 56-58, April-June 2010, doi: 10.1109/MPRV.2010.34
[14] Hélder Matos; Hélder P. Oliveira; Filipe Magalhães, "Hand-Geometry Based Recognition System: A Non Restricted Acquisition Approach," in 9th International Conference on Image Analysis and Recognition (ICIAR), Aveiro, Portugal, 2012, pp. 38-45, doi: 10.1007/978-3-642-31298-4_5
[15] K. K. Biswas; Kumar Basu Saurav, “Gesture Recognition using
Microsoft Kinect®,” in 5th International Conference on Automation,
Robotics and Applications (ICARA), Wellington, New Zealand, 2011,
pp. 100-103. DOI: 10.1109/ICARA.2011.6144864.