Recognition of JSL Fingerspelling Using Deep
Convolutional Neural Networks
Bogdan Kwolek (a), Wojciech Baczynski (a), Shinji Sako (b)
(a) AGH University of Science and Technology, 30 Mickiewicza Av., 30-059 Krakow, Poland
(b) Nagoya Institute of Technology, Nagoya, Japan
Abstract
In this paper, we present an approach for recognition of static fingerspelling in Japanese Sign Language on RGB images. Two 3D articulated hand models have been developed to generate synthetic fingerspellings and to extend a dataset consisting of real hand gestures. In the first approach, advanced graphics techniques were employed to rasterize photorealistic gestures using a skinned hand model. In the second approach, gestures rendered using simpler lighting techniques were post-processed by a modified Generative Adversarial Network. In order to avoid generation of unrealistic fingerspellings, a hand segmentation term has been added to the loss function of the GAN. The segmentation of the hand in images with complex background was done by the proposed ResNet34-based segmentation network. The finger-spelled signs were recognized by an ensemble of both fine-tuned and trained-from-scratch neural networks. Experimental results demonstrate that, owing to a sufficient amount of training data, a high recognition rate can be attained on RGB images. The JSL dataset with pixel-level hand segmentations is available for download.
Keywords: Fingerspelling recognition, Generative Adversarial Networks,
semantic segmentation, U-Net, residual networks (ResNets)
Correspondence to: Department of Computer Science, AGH University of Science and
Technology, 30 Mickiewicza Av., Building D-17, 30-059 Krakow, Poland
Email address: bkw@agh.edu.pl (Bogdan Kwolek )
3.01.2021
1. Introduction
Hand detection, tracking and recognition of fingerspellings are significant research areas due to their high application potential in human-machine communication, virtual reality [1], entertainment, robotics [2, 3, 4], medicine, and assistive technologies for the handicapped and the elderly [5]. Communication by gesture is one of the most intuitive and flexible ways to attain user-friendly man-machine interaction. Although huge efforts have been undertaken by many research teams in the last decade [6, 7], there are still challenges to be addressed in attaining the recognition performance required by real-life applications, e.g. [8, 9]. Gesture recognition on images acquired by a single color camera is a very useful, yet complex task because of several difficulties, including occlusions, variations in gesture expressions, and differences in hand anatomy and appearance.
In recent years, a number of approaches to recognition of static gestures on RGB images have been proposed [10, 11]. In a recently published work [12], an end-to-end network for human skin detection on color images, integrating recurrent neural layers into Fully Convolutional Neural Networks (FCNs), has been proposed. Despite significant developments in learning deep Convolutional Neural Networks (CNNs), a recent review on gesture recognition [10] evokes only one noteworthy work, done by Tompson in collaboration with LeCun et al. [13]. In an earlier work [14], a CNN capable of classifying six hand gestures and controlling robots on the basis of colored gloves has been proposed. More recently, in [15] a CNN implemented in Theano and applied on the Nao humanoid robot has been discussed. In [16], a CNN trained on one million data samples to classify sign characters has been proposed. However, only a subset of the dataset, i.e. 3361 frames manually labeled into 45 classes, has been made publicly available. Recently, a method employing Gabor features, Zernike moments, Hu moments, and contour-based descriptors to select features for further combination by a fusion-based convolutional neural network (FFCNN) has been introduced [17].
One of the obstacles that researchers and practitioners face in utilizing deep CNNs on a larger scale is the lack of properly aligned datasets of sufficient size [6] as well as the shortage of robust real-time hand detectors. A dataset introduced in [18] consists of 65000 samples representing 24 classes. However, the gestures were performed by only nine subjects. A dataset utilized in [19] contains 2750 samples with complex background, which were performed by 40 subjects. However, this dataset has only ten classes. In a dataset [13] targeted for hand pose recovery there are 72757 and 8252 frames in the training and test sets, respectively. However, only two performers participated in the recordings, in a scenario with three Kinect sensors (one frontal and two side views). The EgoHands dataset [20], which has pixel-level annotations for hands, with two participants interacting with each other in each video, has recently been used in a work devoted to hand segmentation [21]. From the above literature review it follows that, in general, there is a lack of datasets with pixel-level annotations. Moreover, almost all datasets lack separate training and testing subsets. As shown in our work [22], the recognition rate in person-independent evaluation protocols drops significantly in comparison to evaluation protocols in which data of the same subjects appear in both the training and testing subsets. Moreover, as demonstrated in [23, 24], synthetically generated hand images can improve the classification performance. However, currently available datasets do not contain data that would permit further development of methods allowing more effective use of synthetically generated data.
Fingerspelling is a loanword system for borrowing orthographic representations of words from a spoken language into a sign language, for instance from English into American Sign Language (ASL), or, as in this work, from Japanese into Japanese Sign Language. It is a form, and frequently an integral part, of sign language, where each sign corresponds to a letter of the alphabet. It is very often used for proper names, brand names, place names as well as digits, which do not have conventional lexical signs. Individuals generally spell their name when they introduce themselves. There are many compelling reasons that make fingerspelling an appealing area of research. Tabata and Kuroda [25] developed
the Stringlove system for recognizing hand shapes and fingerspelling in Japanese Sign Language (JSL). It is built on a custom-made glove equipped with sensors capturing finger features. The glove uses nine contact sensors and 24 inductcoders that jointly estimate features such as the adduction/abduction angles of the fingers, thumb and wrist rotations, the joint flexion/extension of the fingers, as well as contact positions among the fingertips. More recently, in [26] a modified kind of shape matrix for capturing the salience of fingerspelling postures through precise sampling of contours and regions has been proposed. In [22], recognition of JSL fingerspellings was done on the basis of embeddings determined by multiple Siamese CNNs. A dataset consisting of real gestures has been introduced in the discussed work. It contains 5311 training images and 579 test images of size 64 × 64.
Japanese Sign Language, also known under the acronym JSL, is a visual sign language used in Japan. Like other sign languages, JSL comprises words, or signs, and the grammar with which they are combined. The Japanese Sign Language syllabary is a system of manual kana utilized as part of JSL. In general, fingerspelling is used mostly for foreign words and last names. The JSL fingerspellings are performed with the five fingers of the hand and the direction in which it points. For example, the signs 'na', 'ni', 'ha' are all expressed with the first two fingers of the hand extended straight, but for the sign 'na' the fingers point down, for 'ni' across the body, and for 'ha' toward the partner or audience. The signs for 'te' and 'ho' are both made with an open flat hand, but in 'te' the palm faces the viewer, and in 'ho' it faces away. These and many other aspects make recognition of JSL fingerspelling on the basis of a single RGB camera a difficult task. Most fingerspellings are expressed through static postures, but some of them are dynamic. In addition, voiced sounds, semi-voiced sounds, and long vowels are represented by dynamic postures. In this study, we focused only on static fingerspellings. There are 41 static fingerspellings in JSL.
In this work, we present a framework for recognition of static fingerspelling in Japanese Sign Language on RGB images. Two 3D articulated hand models have been developed to generate synthetic fingerspellings and to extend the real hand gestures of the JSL dataset [22] with photorealistic renderings of the hand. In the first approach, advanced graphics techniques were employed to create photorealistic gestures using a skinned hand model. In the second approach, gestures rendered in advance using simple lighting techniques were further post-processed by a modified Generative Adversarial Network (GAN). To avoid generation of unrealistic fingerspellings, a hand segmentation term has been added to the loss function of the GAN. The segmentation of the hand in images with complex background was done by the proposed ResNet34-based segmentation network. The finger-spelled signs were recognized by an ensemble consisting of a VGG-based neural network and two ResNet quaternion convolutional neural networks. The contribution of this work is a framework for improving recognition of fingerspellings on RGB images by employing advanced techniques for photorealistic hand image synthesis, including rendering techniques and modifications of GANs for enhancing photorealism of hand gestures. A large dataset (eleven thousand images) with both pixel-level hand segmentations and synthetically generated hand postures, suitable for learning deep segmentation models as well as deep neural networks for fingerspelling recognition, is proposed.
2. Relevant work
In [15], a multichannel CNN for hand posture recognition has been proposed. A cubic kernel has been used to enhance features for posture classification. The system has been evaluated on the Nao robot. In [27], a glove has been utilized in order to provide a contour representation of gestures. A neural network trained on a dataset consisting of one hundred images per gesture achieves 90% classification accuracy. In [28], a staged probabilistic regressor (SPORE) algorithm for estimation of hand orientation from 2D monocular images has been proposed. In the discussed approach, simultaneously learning hand orientation and pose significantly increased the performance of pose classification on 2D monocular images. In a recently published work [29] focusing on the ASL language, an approach for detection and extraction of shape for static fingerspelling recognition on the basis of boundary tracking and chain code has been proposed. On images of size 320 × 240 acquired by a webcam and images collected from freely available resources the recognition accuracy was 97.75% and 96.48% for alphabet characters and numbers, respectively. Kim et al. [30] determine a signer-dependent skin color model using manually annotated hand regions for fingerspelling recognition. In a signer-dependent setting they achieve up to about 92% letter accuracy, whereas in a multi-signer setting they achieve up to 83% letter accuracy. Huang et al. [31] utilize a Faster R-CNN based hand detector, trained on manually annotated hand bounding boxes, and apply it to general sign language recognition. Convolutional neural network-based features have demonstrated high potential in recent approaches [16, 32]. A survey of recent achievements in vision-based continuous sign language recognition is presented in [33]. The recently introduced Chicago Fingerspelling in the Wild (ChicagoFSWild) dataset [34] contains 7304 fingerspelling sequences from online videos.
3. Fingerspelling Modeling and Rendering
Rendering a realistic and accurate hand shape is not an easy task because of the large variety of poses that the human hand can assume as well as difficulties in modeling skin appearance [35]. A commonly used approach for rendering the hand shape in the requested poses is linear blend skinning (LBS). Starting from the open source library LibHand v. 0.9 [36] and a 42-DOF skeleton, we developed a new rendering API for LibHand for configuring the hand in the requested poses by a set of sliders, which can be manually moved using the mouse. It uses the textured 3D model of LibHand and permits exporting the resulting models into the md5 graphics format for further OpenGL-based rendering. For each gesture we considered several allowable hand postures and orientations, and different finger inter-distances, in order to synthesize a large number of training images. Thirty-three students posed and articulated the 3D hand model to render the gestures. As a result, four thousand (4018) synthetic images were generated in order to balance the JSL dataset [22] and, after additional post-processing, they were stored as a subset of the mentioned JSL dataset. The resulting synthetic hand images are a quite faithful representation of the JSL signs, see Fig. 1. However, due to the insufficient quality of lighting algorithms for human skin rendering that are available in standard OpenGL, and particularly in order to increase the photorealism of the rendered gestures, the images were further post-processed by Generative Adversarial Networks (GANs), outlined in Sect. 6. Finally, a sub-dataset containing 4018 samples has been generated and stored as another subset (called RHM) of the whole JSL dataset. There are roughly 98 different samples for each of the 41 JSL gestures, with different orientations and small variations in the 3D positions of the fingers with respect to the location of the wrist.
Figure 1: Examples of rendered Hiragana signs using LibHand with our API for hand articulation and gesture modeling.
In order to employ recent rendering and lighting developments, a custom 3D hand model has been prepared in Blender, which is a 3D computer graphics software toolset [37]. The models have been designed using software version 2.80, which adds the Eevee physically-based real-time render engine alongside the Cycles rendering engine. The Cycles engine works by casting rays of light from each pixel of the camera into the scene. They refract and reflect, or get absorbed, until they either hit the light source or reach a predefined bounce limit. Eevee is a real-time render engine with advanced features, including Non-photorealistic Rendering (NPR). NPR is an active area of computer graphics which focuses on enabling a wide variety of expressive styles for digital entertainment. The Eevee renderer has support for baked indirect lighting, screen space ambient occlusion, screen space reflections, and other modern commodities provided by current generation graphics hardware. Given that renderings with the Eevee engine can be performed in about five times less time than Cycles renderings on the same hardware, the initial renderings were done in the Eevee engine, whereas the final ones were done in the Cycles engine. A hand model [24] consisting of 21 bones and five control bones has been extended in this work to permit photorealistic hand gesture modeling and animation by Python scripts. The bones in the skeleton construct a structure of rigid bodies that are connected together by joints with one or more degrees of freedom. The articulated hand model has 37 degrees of freedom (DOF). The 3D mesh consists of 528 vertices and is composed of 1036 triangles [24]. The 26-element skeleton (armature) is bound to this 3D mesh. The root joint is located in the hand's wrist.
Generation of synthetic training images is done by executing Python scripts that employ the Blender graphics engine [38]. Basic configuration options of the scripts include different camera viewpoints and different lighting. The model can also be exported to the md5 data format to perform animations in external programs. Figure 2 depicts the JSL sign 'ka', expressed by the 3D hand observed from different camera views.
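To illustrate how such scripted rendering can be driven, the following sketch shows a Blender Python (bpy) batch loop that varies the camera viewpoint and writes one image per configuration. The object names, camera distance, angle range and output paths are illustrative assumptions and not the exact values used to produce the dataset.

```python
# Hedged sketch of a Blender batch-rendering loop in the spirit of this section.
# "Camera", the radius and the angle range are illustrative assumptions.
import math
import random
import bpy

scene = bpy.context.scene
camera = bpy.data.objects["Camera"]   # assumed camera object name in the .blend file
radius = 0.6                          # assumed distance of the camera from the wrist

for idx in range(10):
    # place the camera on an arc around the hand and roughly orient it towards the origin
    angle = random.uniform(-math.pi / 6, math.pi / 6)
    camera.location = (radius * math.sin(angle), -radius * math.cos(angle), 0.1)
    camera.rotation_euler = (math.radians(80), 0.0, angle)

    scene.render.resolution_x = 64
    scene.render.resolution_y = 64
    scene.render.filepath = f"//renders/ka_{idx:03d}.png"
    bpy.ops.render.render(write_still=True)
```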
Figure 2: Selected shots of the JSL sign 'ka' for various camera views.
At the beginning, we investigated a widely used three-point lighting technique for rendering realistic fingerspelling. However, it quickly turned out that such a three-point technique is not the main bottleneck in achieving photorealistic skin rendering, and that a significant improvement in the realism of the rendered hands can be achieved by the use of Subsurface Scattering (SSS), which simulates the transport of light through a translucent surface. In the discussed technique the light penetrates a material, in our case the human skin, and internally scatters at irregular angles, resulting in more photorealistic human skin. This is a very important issue in photorealistic hand rendering. Recently, a deep learning approach to subsurface scattering has been proposed in [39]. Further improvement of hand realism has been achieved by the use of the Filmic Blender color palette. Figure 3 presents the effect of using such a color palette in photorealistic hand rendering.
Figure 3: Photorealistic hand rendering: regular texture (left), Filmic Blender color palette (right). The illustrated images are stored in a vectorized graphics format and can be viewed in more detail by zooming this figure.
Finally, a simpler two-point lighting setup has been employed in the rendering of the sub-dataset with realistic hand renderings. The main light source was an Area-type lamp providing the surface light, which can be rotated around the hand according to the user's needs. The second light source is a back light providing chiaroscuro effects, which can be turned on or off through the Python scripts. Figure 4 depicts sample images which were rendered using the discussed techniques.
Figure 4: Selected shots of the JSL sign 'ka' for various lighting.
The 3D articulated hand model has been used to generate synthetic fingerspellings and to extend our dataset consisting of real hand gestures. Twelve different gesture realizations were prepared for each of the 41 signs. Ten images have been rendered for each realization through interpolations between the starting and end poses. Figure 5 depicts samples of the rendered images for the sign 'ka' from JSL. For each starting gesture a final gesture has been created and ten interpolated images were rendered between them. This means that gestures differ in hand postures to express various realizations of the gesture by different persons. For each realization of the gesture we modeled the starting and final posture and then interpolated hand postures between them. The number of images generated on the basis of the interpolation is equal to eight. In total, 5892 realistic hand gestures were selected from the rendered dataset and then stored as a subset in the JSL dataset.
Figure 5: Example realizations of the JSL gesture 'ka' that were obtained in the photorealistic hand rendering.
3.1. 3D-Model Based Dataset for JSL Recognition
The JSL dataset consists of 18029 images, where the training subset consists of 16343 images and the testing subset contains 1686 images. The test subset contains only real images with gestures performed by four persons, including three Japanese performers, who did not participate in the recording of the training images. The training subset contains both real and synthesized images. The real images are taken from the training subset of the former JSL dataset [22, 24]. The images are of size 64 × 64 with uniform background. Thanks to the uniform background the hands can be delineated easily and then used to train models for hand segmentation or gesture classification on images with artificially inserted complex backgrounds. As far as we know, this is currently the largest dataset with pixel-level delineated hands. Moreover, rather simple rendering techniques were applied until now in generating synthetic hands for training deep learning models. The dataset has been stored in .mat files and can easily be imported into Matlab and Python. The whole JSL dataset is freely available at: http://home.agh.edu.pl/~bkw/data/neu2020.
4. Neural Networks for Gesture Modeling and Recognition
At the beginning of this section we present Residual Neural Networks. Afterwards, we outline Quaternion Convolutional Neural Networks.
4.1. ResNet Convolutional Neural Networks
In [40], He et al. introduced residual networks (ResNets), which provide an important contribution to training very deep neural networks. The residual learning framework simplifies the training of neural networks and enables them to be substantially deeper, which leads to improved performance. The residual networks are much deeper in comparison to their ordinary counterparts, yet they require a similar number of parameters. The main idea is to utilize blocks that re-route the input and add it to the representation learned by the stacked layers. The constituent building block of the discussed architecture is the ResNet unit. A deeper network can be built by simply repeating such a block, i.e. the smaller sub-network. A desired underlying mapping H(x) can be approximated by a few stacked nonlinear layers, so it can also be obtained through the underlying mapping F(x) = H(x) - x. As a result, it is possible to reformulate it as H(x) = F(x) + x, which comprises the residual function F(x) and the input x. The connection of the input to the output is called a skip connection or identity mapping. The central idea is that if multiple nonlinear layers can approximate the complex function H(x), then they can also approximate the residual function F(x). Thus, the stacked layers are not employed to fit H(x); instead, these layers approximate the residual function F(x).
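As a minimal illustration of the identity mapping H(x) = F(x) + x, the sketch below builds a single residual block with Keras layers; the filter count and kernel size are illustrative assumptions and do not reproduce the exact blocks used later in this work.

```python
# Minimal sketch of a residual block implementing H(x) = F(x) + x.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # assumes the input tensor x already has `filters` channels
    shortcut = x                                    # identity (skip) connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                 # H(x) = F(x) + x
    return layers.ReLU()(y)
```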
4.2. Quaternion Convolutional Neural Network
Recently, in order to exploit internal dependencies within the features, a quaternion convolutional neural network (QCNN) has been proposed [41]. Let $\gamma^{l}_{ab}$ and $S^{l}_{ab}$ denote the quaternion output and the pre-activation quaternion output at layer $l$ and at the indexes $(a, b)$ of the feature map, and let $w$ be a quaternion-valued weight filter map of size $K \times K$. The convolution can be expressed in the following manner:

$$\gamma^{l}_{ab} = \alpha(S^{l}_{ab}) \qquad (1)$$

where $S^{l}_{ab}$ is equal to:

$$S^{l}_{ab} = \sum_{c=0}^{K-1} \sum_{d=0}^{K-1} w^{l} \otimes \gamma^{l-1}_{(a+c)(b+d)} \qquad (2)$$

and $\alpha$ stands for the quaternion split activation function [42], defined as follows:

$$\alpha(Q) = f(r) + f(x)i + f(y)j + f(z)k \qquad (3)$$

where $f$ is any standard activation function. A derivation of the backpropagation algorithm for quaternion neural networks can be found in [43]. Recently, in [44] a QCNN for color image processing has been proposed. In the discussed approach the image is represented in the quaternion domain as a quaternion matrix. The quaternion convolution provides scaling and rotation of the input in the color space, which yields a more structural representation of color information [44], whereas the conventional real-valued convolution is only capable of executing scaling transformations on the input. Because QCNNs enforce an implicit regularizer on the network architecture, modeling the more complicated relationships across different channels can improve the training of such neural networks.
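The following sketch spells out the two ingredients of Eqs. (1)-(3) on plain NumPy quaternions stored as (r, x, y, z) components: the split activation of Eq. (3) and the Hamilton product that replaces the real-valued multiplication inside the quaternion convolution of Eq. (2). It is illustrative only and not the implementation of [41] or [44].

```python
# Illustrative quaternion building blocks for a QCNN (not the authors' code).
import numpy as np

def split_activation(q, f=np.tanh):
    """Apply a real activation f component-wise to a quaternion (r, x, y, z), as in Eq. (3)."""
    r, x, y, z = q
    return np.array([f(r), f(x), f(y), f(z)])

def hamilton_product(w, q):
    """Hamilton product w ⊗ q of two quaternions given as (r, x, y, z) arrays."""
    wr, wx, wy, wz = w
    qr, qx, qy, qz = q
    return np.array([
        wr * qr - wx * qx - wy * qy - wz * qz,
        wr * qx + wx * qr + wy * qz - wz * qy,
        wr * qy - wx * qz + wy * qr + wz * qx,
        wr * qz + wx * qy - wy * qx + wz * qr,
    ])
```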
5. Hand Segmentation
In recent years considerable progress in object detection has been achieved [45]. However, little work has been done in the area of hand detection. In [46] a dataset consisting of 600 images acquired in various lighting conditions and backgrounds has been proposed to highlight the advantages and shortcomings of different methods for ego-centric hand detection. Later, in [47] another approach to detect hands in social interactions in egocentric videos has been demonstrated. However, only interactions in laboratory settings were considered. In the already mentioned work [20], Bambach et al. introduced a skin-based approach that first determines a set of bounding boxes that may surround hand regions, afterwards utilizes CNNs to detect hands, and finally executes GrabCut to segment them. They also introduced the EgoHands dataset consisting of 48 first-person videos of people interacting in realistic environments, with pixel-level ground truth for over 15000 hand instances. Our dataset contains 14875 images with pixel-level ground truth and has the potential to fill a gap for hand detection and segmentation in third-person images. In such third-person settings, [48] used deformable part models and skin heuristics to detect hands. Recently, a large dataset suitable for deep learning has been introduced in [49]. However, the dataset mentioned above does not contain pixel-level annotations.
In order to reliably segment the hand in RGB images with complex background we designed an encoder-decoder neural network. In the proposed neural network for hand segmentation on images with complex background we employ a deep CNN and add skip connections between the layers in the encoder and the decoder. The encoder path is based on the ResNet 34-layer (ResNet34) neural network, whereas the decoder path uses transposed 2D convolution blocks to perform the 2D upsampling. The parameters of each transposed 2D block are such that the height and width are doubled, whereas the number of channels is halved, see Fig. 6. There are three skip connections, where the first connection is made after the (3 × 3, 64; 3 × 3, 64) × 3 ResNet blocks, the second one after the (3 × 3, 128; 3 × 3, 128) × 4 blocks, and the last one after the (3 × 3, 256; 3 × 3, 256) × 6 blocks of the 34-layer ResNet. The feature maps delivered by such skip connections from the encoder, i.e. the ResNet34 neural network, are summed with the feature maps extracted in the decoder path, which uses the transposed 2D blocks to expand the dimensions of the convolved feature outputs. Such skip connections between encoder layers and decoder layers were introduced in the U-Net neural network [50], which is a symmetrical neural network with a 'U'-like shape. Our segmentation network is not symmetrical since the encoder path is based on the deep ResNet34 neural network, whereas in the decoder path no residual blocks are employed, see Fig. 6. In U-Net neural networks a down-sampling (contraction) path is utilized to extract and interpret the context (what), while an up-sampling (expansion) path is used to enable precise localization (where). Furthermore, in order to recover the fine-grained spatial information lost in the pooling or down-sampling layers, skip connections between symmetrical layers are employed in such encoder-decoder networks. By combining the location information from the down-sampling path with the contextual information in the up-sampling path, such networks permit obtaining general maps that combine localization and context. Our ResNet34-based network has all the features mentioned above, and additionally it possesses extended capabilities for feature extraction.
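A minimal sketch of one decoder stage of the network in Fig. 6 is given below: a transposed 2D convolution doubles the spatial resolution and reduces the channel count, and the result is summed with the skip connection coming from the ResNet34 encoder. Since Keras ships no built-in ResNet34, the encoder feature maps are assumed to be given; all layer sizes are illustrative.

```python
# Hedged sketch of a single decoder stage with an encoder skip connection (cf. Fig. 6).
from tensorflow.keras import layers

def decoder_stage(x, skip, out_channels):
    # transposed convolution: doubles height/width, sets the (reduced) channel count
    x = layers.Conv2DTranspose(out_channels, kernel_size=2, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # element-wise sum with the encoder feature map (assumed to have matching shape)
    return layers.Add()([x, skip])
```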
Figure 6: ResNet34 based network for hand segmentation.
6. Generative Adversarial Network for Photorealistic Fingerspelling Synthesis
Generative Adversarial Networks (GANs) utilize an adversarial discriminator to align the distributions of real and generated data samples. In a two-player minimax game the generator G tries to generate samples on the basis of noise z that fool the discriminator D, while D learns to maximize the probability of assigning the correct class label to both the real data and the fake data G(z) [51]. In the optimal case, the generated samples would be indistinguishable from real samples. Conventional GANs require paired training data. The recently proposed CycleGAN [52] utilizes unpaired training data thanks to a cycle consistency loss function. CycleGAN is a general framework for learning, from unaligned examples, the mapping functions between two domains X and Y. The goal is to learn a mapping G: X → Y such that the distribution of data from G(X) would be indistinguishable from the distribution of data in Y according to an adversarial loss. To achieve this, they proposed to also consider an inverse mapping F: Y → X and to employ so-called cycle consistency to prevent the learned mappings G and F from contradicting each other [52].
Given training samples $\{x_i\}_{i=1}^{N}$, $\{y_i\}_{i=1}^{M}$, where $x_i \in X$ and $y_i \in Y$ with data distributions $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$, for the mapping $G: X \rightarrow Y$ and discriminator $D_Y$ the objective function can be expressed in the following manner:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log(D_Y(y))] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \qquad (4)$$

The generator $G$ minimizes it against the adversary $D_Y$ that tries to maximize it: $\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y)$. For the mapping $F: Y \rightarrow X$ and the discriminator $D_X$, the generator $F$ minimizes the objective $\mathcal{L}_{GAN}(F, D_X, Y, X)$ against the adversary $D_X$ that tries to maximize it: $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X)$. The cycle consistency loss takes the following form [52]:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1] \qquad (5)$$

The full loss function has the following form:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \gamma \mathcal{L}_{cyc}(G, F) \qquad (6)$$

where $\gamma$ balances the objectives. Cycle consistency means that the composition of the mappings is the identity mapping. The aim of the CycleGAN is to find the generators:

$$G^{*}, F^{*} = \arg \min_{G,F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) \qquad (7)$$
In our work we learned mappings from synthetic images to real images and from real to synthetic images. The inputs were the cropped synthetic and real images of the hand with uniform background.
CycleGAN has been successfully applied in a few image-to-image applications. However, CycleGAN has not been designed to maintain object shapes well. In this work, we extend CycleGAN by adding a segmentation consistency loss to encourage shape alignment between images in the two domains and to improve the accuracy at the hand boundaries. By incorporating an additional geometric consistency loss that incorporates information about hand shapes, we better maintain the hand shape and its pose during image post-processing. The proposed loss term has the following form:

$$\mathcal{L}_{cycu}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|U(F(G(x))) - x_b\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|U(G(F(y))) - y_b\|_1] \qquad (8)$$

where $U$ is the segmentation mapping performed by the ResNet34-based segmentation unit trained in advance, whereas $x_b$ and $y_b$ denote the binary masks of the hands. This means that the inputs to our network are the (cropped) synthetic and real images of the hand together with their respective silhouettes, i.e. binary foreground masks. The discussed term has been multiplied by $\gamma$ and included as an additional term in the loss function (6), i.e. in calculating $\mathcal{L}$.
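A hedged sketch of the term in Eq. (8) is shown below, assuming a CycleGAN training loop in TensorFlow with generators G and F, a frozen segmentation network seg_net (the ResNet34-based unit of Sect. 5), and binary hand masks x_mask and y_mask; the names and the mean-absolute-error reduction are illustrative, not the authors' code.

```python
# Illustrative segmentation-consistency term of Eq. (8), assuming Keras models
# G, F (generators) and seg_net (frozen, pretrained hand segmentation network).
import tensorflow as tf

def segmentation_consistency_loss(G, F, seg_net, x, y, x_mask, y_mask):
    # U(F(G(x))) should reproduce the binary mask of x, and symmetrically for y
    loss_x = tf.reduce_mean(tf.abs(seg_net(F(G(x))) - x_mask))
    loss_y = tf.reduce_mean(tf.abs(seg_net(G(F(y))) - y_mask))
    return loss_x + loss_y

# In a training step this term would be scaled by gamma and added to the usual
# adversarial and cycle-consistency losses of Eq. (6).
```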
7. Fingerspelling Recognition
7.1. Fingerspelling Recognition Using Neural Networks Trained from Scratch
We implemented a ResNet-based convolutional neural network consisting of three ResNet blocks, see Fig. 7. Afterwards, we implemented a QCNN, and then, after extending the ResNet with a spatial pyramid pooling (SPP) layer [53], we substituted the convolutional blocks with quaternion-based convolutional blocks. The motivation behind using the SPP layer is its ability to better represent the object at multiple scales and input sizes.
Figure 7: Flowchart of the ResNet used for JSL fingerspelling classification.
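The sketch below illustrates the idea of spatial pyramid pooling used above: the feature map is max-pooled over a few coarse grids and the pooled responses are concatenated into a single vector, so that the descriptor length is approximately independent of the input resolution. The bin sizes and the pooling operator are illustrative assumptions.

```python
# Illustrative spatial pyramid pooling over 1x1, 2x2 and 4x4 grids.
import math
import tensorflow as tf

def spatial_pyramid_pool(feature_map, bins=(1, 2, 4)):
    """feature_map: tensor of shape (batch, H, W, C) with statically known H and W."""
    _, h, w, _ = feature_map.shape
    pooled = []
    for b in bins:
        # window size chosen so that roughly b x b responses remain per channel
        win_h, win_w = math.ceil(h / b), math.ceil(w / b)
        p = tf.nn.max_pool2d(feature_map, ksize=[win_h, win_w],
                             strides=[win_h, win_w], padding="SAME")
        pooled.append(tf.reshape(p, [tf.shape(feature_map)[0], -1]))
    return tf.concat(pooled, axis=1)
```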
7.2. Fingerspelling Recognition Using Pre-trained Neural Networks
The output of the base VGG-19 CNN has been flattened, and then a dense layer consisting of 512 neurons with dropout=0.5, followed by an output layer with soft-max activation, has been added to this base network. The weights in the base model were frozen for the initial training of the network. Afterwards, the layers starting from the seventeenth one (block5) were set as trainable for fine-tuning the network. The output of the ResNet50 base model has been fed to a global average-pooling and a global max-pooling layer. The outputs have been concatenated and then fed to a batch normalization layer. The outputs were next fed to a dense layer with 1024 neurons followed by batch normalization and dropout layers. Afterwards, a dense layer consisting of 512 neurons with dropout=0.5, followed by the output layer with soft-max activation, has been utilized in the model, similarly as in the VGG-19 network.
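A sketch of the VGG-19 based classifier described above is given below, with a frozen ImageNet-pretrained base, a flattened output, a 512-neuron dense layer with dropout 0.5, and a 41-way soft-max; unfreezing block5 for fine-tuning follows the text, while the remaining details (optimizer, data pipeline) are omitted.

```python
# Hedged sketch of the VGG-19 head; 41 classes as in the JSL dataset.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                           # frozen during the initial training

x = layers.Flatten()(base.output)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(41, activation="softmax")(x)
model = models.Model(base.input, outputs)

# fine-tuning stage: unfreeze the layers of block5 (from the seventeenth layer on)
for layer in base.layers[17:]:
    layer.trainable = True
```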
7.3. Ensemble of CNNs
Models obtained on the basis of convolutional neural networks are nonlinear. They are learned via optimization using stochastic training algorithms and they are sensitive to the distribution of the training data. Thus, the optimizers find a different set of weights each time they are executed, which in turn leads to differing predictions. This means that predictions of neural networks usually have a high variance. One of the successful approaches to reducing the variance of such predictions is to learn multiple neural network models instead of a single model and to combine the predictions of these models. An ensemble of such independently trained models not only reduces the variance of predictions but also produces final outputs that are better than the predictions of any single model. Every ensemble member contributes to the final output and individual weaknesses are offset by the contributions of the other members. Essentially, ensembles tend to yield better results when there is a significant diversity among the members [54] (also called base-learners). There are many different types of ensembles. In a weighted average ensemble the decisions of ensemble members are weighted on the basis of their performance on a hold-out validation dataset. In a stacking-based ensemble the decisions of base-learners are taken as input for training a meta-learner, which learns how to optimally combine the predictions of the base-learners. At the beginning the selected neural networks are learned using the available training data. Afterwards, a meta-learner is trained to make a final prediction using the predictions of the trained networks. The main difference between both methods is that in the weighted average ensemble the weights are optimized and then used for weighting all outputs of the base-learners, and finally are taken to calculate the weighted average. This means that no meta-learner is employed in such ensembles. In a stacking-based ensemble, the meta-learner takes every single output of the base-learners as a training instance and learns how to optimally map the base-learner decisions into a better output decision. The meta-learner can be any classic machine learning algorithm.
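The two combination schemes can be summarized with the sketch below: hard voting over the class predictions of independently trained members, and a stacking variant where a simple meta-learner is fitted on the concatenated class probabilities produced on a hold-out set. The logistic-regression meta-learner is an illustrative choice, not necessarily the one used in this work.

```python
# Illustrative voting and stacking ensembles over trained Keras classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression

def vote(models, x):
    """Majority vote over the class predictions of the ensemble members."""
    preds = np.stack([np.argmax(m.predict(x), axis=1) for m in models], axis=1)
    return np.array([np.bincount(row).argmax() for row in preds])

def fit_stacking(models, x_val, y_val):
    """Train a meta-learner on the concatenated class probabilities of the members."""
    features = np.concatenate([m.predict(x_val) for m in models], axis=1)
    return LogisticRegression(max_iter=1000).fit(features, y_val)
```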
8. Experimental Results and Discussion
Experimental evaluations have been performed on our JSL fingerspelling dataset, which was discussed in Subsection 3.1. All experiments were carried out on color RGB images of size 64 × 64. Altogether, 16343 training images and 1686 test images were employed in the evaluations, consisting of the recognition of 41 JSL static hand gestures. The recognition performance has been assessed in a person-independent (cross-person) scenario, wherein persons appearing in the recordings of the test data did not appear in the recordings of the training data. In training the GANs as well as the neural networks for hand segmentation, the synthetic images were used both in the training and the testing of the models, whereas in the gesture recognition only real images were used in the evaluations of the learned models.
8.1. Evaluation of Hand Segmentation
In the first phase of experiments, we selected 1500 real images from the training subset and 500 synthetic images from the RHM sub-dataset. Given the binary hand shapes, we introduced complex backgrounds into the hand images. We used randomly sampled patches from the Office-Caltech dataset [55]. The Office-Caltech dataset contains images of office objects from ten common categories shared by the Office-31 and Caltech-256 datasets. It is composed of ten classes: backpack, bike, calculator, headphones, keyboard, laptop, monitor, mouse, mug, and video-projector. In the next step of this phase of experiments, in order to show the potential of our ResNet34-based network for hand segmentation, we trained an ordinary U-Net. The contraction path of the U-Net is made of four contraction blocks, where each block takes an input map and then applies two 3 × 3 convolution layers followed by a 2 × 2 max pooling. The number of kernels or feature maps after each block doubles so that this architecture can learn complex structures effectively. A bottleneck part, which lies between the contracting and expanding paths, is built from just two convolutional layers (with batch normalization) and dropout. It uses two 3 × 3 CNN layers followed by a 2 × 2 up-convolution layer. Similar to the contraction path, the expansion section also consists of four expansion blocks. Each block passes the input to two 3 × 3 convolutional layers followed by a 2 × 2 upsampling layer. Also, after each block the number of feature maps utilized by the convolutional layer is halved to obtain a symmetrical encoder-decoder network. The input of each block is concatenated with the feature maps of the corresponding contraction layer. After passing through the expansion blocks, the resultant mapping passes through another 3 × 3 convolutional layer with the number of feature maps equal to the number of desired segments. Figure 8 depicts selected images which were segmented by our segmentation network. As we can observe, our neural network segments the hands quite reliably on images with complex background. Our experiments demonstrated that the ordinary U-Net is capable of extracting the hands properly only in the case of both training and evaluation on images with uniform background.
Figure 8: Masks of the segmented hands in images with complex background. Input images
(top row), masks of segmented hands (bottom row).
Table 1 presents the Dice scores that were obtained on a dataset consisting of 200 randomly selected images from the test subset of the JSL dataset with the added complex background. The neural networks were trained on 2000 images with complex background for 50 epochs using the RMSprop optimizer. Afterwards, we created a training dataset consisting of 2000 images comprising both real and synthetic images and trained our network to segment the hand during training of the GANs with the proposed segmentation term in the loss function.
Table 1: Dice scores on the test sub-dataset.
network       U-Net    ours
Dice score    0.954    0.984
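For reference, the sketch below computes the Dice score reported in Table 1 for a pair of binary hand masks; the smoothing constant is an assumption added to avoid division by zero.

```python
# Illustrative Dice score between a predicted and a ground-truth binary mask.
import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-7):
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```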
8.2. Rendering Gestures and GAN-based Post-processing
At the beginning of the experiments we investigated various approaches to rendering the images representing the JSL gestures. Using our API for LibHand we modeled the gestures, exported the models representing the gestures to the md5 data format, and then used our parser in order to import the mesh and animation data for OpenGL-based rendering. The mesh data and animation data in the discussed format are separated into distinct files. One of the advantages of the md5 data format is that the data are stored in ASCII files and are human readable. The rotations are represented by quaternions. By modifying the values of the parameters stored in the plain text files it is possible to configure the skeleton of the model into the required poses and then render the model. After parsing the skeleton and animation data, in every frame the 3D hand has been rotated by randomly generated angles to simulate observing the hand from different camera views. Then, we prepared a dataset consisting of real and synthetic images for training GANs in order to enhance the photorealism of images rendered in such a way. Figure 9 depicts example images that were post-processed by our GAN to improve the photorealism of synthetically generated gestures. As we can observe, the photorealism of the images generated synthetically has been improved. In particular, light reflections, which were difficult to model, were added to the images. The results presented above were obtained on the basis of our GAN, which has been trained on 1080 synthetic and 1226 real images, respectively, for 300 epochs, with the batch size set to 14, using the Adam optimizer with lr=0.0002 and beta_1=0.5. On a TitanX GPU, training the GAN on images of size 128 × 128 took about twenty-four hours. The GAN generator trained in such a way has then been used to post-process the synthetic images, which were generated on the basis of models rendered by OpenGL.
Figure 9: Examples of GAN-based post-processed images to improve photorealism of syn-
thetically generated gestures. Synthetic images (upper row), post-processed images (bottom
row).
It is worth noting that ordinary CycleGANs, i.e. without the segmentation component in the loss function, were unable to generate hands without artifacts and unrealistic deformations of the hand, see Fig. 10. Although some images with introduced modifications of the hand shape could potentially be useful, see also the 1st image from the left, a considerable percentage of images is not rendered properly. Moreover, subtle shape differences are rendered properly by our network, compare post-processed images #2 - #4 in Fig. 9 and Fig. 10. One of the disadvantages of GAN-based data augmentation for visual fingerspelling recognition [23] is that a visual inspection of the data by a human is needed to eliminate poor gesture realizations. In contrast, the approach to data augmentation for visual fingerspelling recognition that is presented in this work is fully automatic, i.e. no human-in-the-loop is needed in the process of fingerspelling recognition. The discussed examples were obtained with the same number of epochs as used by the network achieving the results shown in Fig. 9. Somewhat better results can be achieved at the cost of a significantly larger number of epochs. The proposed modification of CycleGAN stabilizes the training of the GAN and permits achieving better hand shapes in the post-processing for adding more photorealism to 3D model-based rendered fingerspellings.
Figure 10: Examples of GAN-based post-processed images without the segmentation term in the loss function.
Afterwards, we rendered the hands in Blender without light effects. Finally, in order to obtain more photorealistic images, we illuminated the hands using virtual lights, together with the techniques discussed in Section 3, see Fig. 11. Gestures rendered in such a way were included as a subset in the JSL dataset.
Figure 11: Example images of the '01 a' sign: no lighting (top row), with lighting (bottom row).
8.3. Fingerspelling Recognition
We experimented with various neural networks, both trained from scratch and fine-tuned deep CNNs. In the first stage of experiments we trained convolutional neural networks from scratch. The networks were initially pre-trained on the ImageNet dataset downsampled to 64 × 64 × 3. We implemented a ResNet-based convolutional neural network consisting of three ResNet blocks, which has been outlined in Subsection 7.1. The neural network has been trained on RGB images of size 64 × 64 × 3. Each model was trained using the Adam optimizer (lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08) and categorical cross-entropy loss, with a small learning rate. The learning rate was scheduled to be reduced after 20, 30, 40, and 50 epochs. The values of the hyper-parameters were selected empirically. Afterwards, we trained the ResNet with the convolutional blocks substituted by quaternion-based convolutional blocks. This neural network has been trained on RGB images of size 64 × 64 × 3 using the same Adam optimizer and parameters.
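The training configuration described above can be sketched as follows in Keras, with the listed Adam hyper-parameters and a step-wise learning-rate reduction after 20, 30, 40 and 50 epochs; the decay factor of 0.5 is an assumption, as the text does not state it.

```python
# Hedged sketch of the optimizer and learning-rate schedule described above.
from tensorflow.keras import callbacks, optimizers

optimizer = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

def schedule(epoch, lr):
    # reduce the learning rate when reaching epochs 20, 30, 40 and 50
    return lr * 0.5 if epoch in (20, 30, 40, 50) else lr

lr_callback = callbacks.LearningRateScheduler(schedule)
# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=60, callbacks=[lr_callback])
```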
In the next stage of experiments we focused on fingerspelling recognition using pre-trained neural networks. We trained the neural networks discussed in Subsection 7.2. As the VGG model expects input images of size 224 × 224 × 3, the images were resized to the above-mentioned size. We initially trained the networks for 30 epochs using the SGD optimizer with lr=1e-4 and momentum=0.9. Next, the neural networks were fine-tuned for 30 epochs using the SGD optimizer with lr=1e-4 and momentum=0.9. We also experimented with other pre-trained CNNs, including ResNet34, MobileNet and Inception-ResNetV2, fine-tuned for fingerspelling recognition. However, the results were worse in comparison to the results achieved by the above-mentioned networks.
Finally, an ensemble consisting of VGG-19, ResNet50 and the ResNet with convolutional blocks substituted by quaternion-based convolutional blocks has been constructed. The models of the neural networks, which were trained in advance, have been loaded and then used to construct an ensemble of deep networks. The output of the ensemble is determined by voting. An MLP-based ensemble has also been trained and evaluated.
Table 2 presents the classification performance obtained by the neural networks on the test subset of the JSL dataset. During learning of the neural networks, online data augmentation has been executed. As we can observe, the synthetic images allow achieving far better results. Considerable improvement in classification accuracy has been obtained thanks to the use of the synthetic images. The multi-model, voting-based ensemble improves the classification performance by about 1.5% and the results are slightly better in comparison to the results achieved by the stacking ensemble. Figure 12 depicts the confusion matrix obtained by the best single classifier, i.e. the ResNet50-based classifier.
Table 2: Classification performance in the performer-independent experiment.
             Accuracy  Precision  Recall  F1-score
no. rend.    0.671     0.678      0.671   0.670
ResNet18     0.815     0.829      0.815   0.814
VGG19        0.863     0.869      0.860   0.858
ResNet50     0.877     0.895      0.876   0.875
ensemble     0.892     0.906      0.904   0.904
Figure 12: Confusion matrix - each row represents the real class while each column represents the predicted class of gestures.
One of the major reasons for the insufficient classification performance on a few classes is strong inter-class similarity. The hand shapes of several JSL gestures are quite similar, which may explain the incorrect predictions of the classifier. As we can notice in the images shown in Fig. 13, the hand shapes in classes '06 ka' and '41 ra', as well as '04 e' and '11 sa', are quite alike. One of the reasons is that this is an ill-posed problem with inherent ambiguities. We investigated several approaches to improve the recognition performance, including rendering additional images for the classes with lower classification ratios, synthesis of additional images on the basis of Generative Adversarial Networks (GANs), and evaluations using fine-tuned deep neural networks, e.g. MobileNet. However, none of the above-mentioned approaches was able to improve the experimental results presented above.
Figure 13: Example inter-class similarities, 06 ka – 41 ra and 04 e – 11 sa.
To validate the usefulness of the JSL dataset as well as the potential of the trained models in real scenarios, we performed experiments on image sequences. The training and testing data were created on the basis of image sequences that were acquired during recording of the JSL dataset. All corresponding test images from the JSL dataset were included in the discussed dataset, and additionally we included the original images so that each sequence contained at least ten images. The total number of test images is equal to 7250. In the same way, we prepared the training subset, which consisted of the same number of image sequences as the test subset, but the total number of images was equal to 7300. Additionally, we assumed that if an original image sequence was included in the training set, we did not omit from it the images already selected as part of the JSL dataset. The experiments were conducted on original RGB images of size 320 × 240. We compared the recognition accuracy achieved by the best performing neural network with the accuracies achieved by recent algorithms [56, 57, 58, 59, 60]. The hands were detected using OpenPose [61] and we processed only the hand performing the gesture. If, for a compared algorithm, OpenPose gave a better result than the original hand detector, we employed it instead of the authors' hand detector. Table 3 compares the results achieved by our best performing model with the results achieved by recent algorithms for isolated fingerspelling recognition on sequences of RGB images. The input shape of the CNN from [58] has been changed to size 64 × 64 and we extended the network with an additional convolutional layer and a following pooling layer. As we can notice, our recognizer achieved superior results.
Table 3: Comparison of the recognition performance with recent algorithms for isolated fingerspelling recognition on sequences of RGB images.
Method                          Accuracy [%]
Keyframes [56]                  64.4
GEI CNN 64 × 64 [58]            39.6
CNN 64 × 64 [57]                74.6
Multi-scale descriptor [59]     64.8
VGG16 HOG [60]                  89.2
Our approach                    92.1
The neural networks for gesture recognition have been trained on a TitanX GPU with the batch size set to 64 and the number of epochs set to 100. The neural networks were implemented in Python using the TensorFlow/Keras frameworks.
9. Conclusions
In this paper we presented a framework for recognition of static fingerspellings on RGB images. The recognition of hand gestures is performed by convolutional neural networks, which have been trained using both real and synthetic images. A few thousand synthetic images for training were generated on the basis of two skinned hand models. In the first approach, advanced graphics techniques were used to create photorealistic gestures, whereas in the second one the gestures rendered using simpler lighting techniques were post-processed by a modified Generative Adversarial Network. In order to avoid unrealistic modifications of the hand shape, a hand segmentation term has been added to the loss function of the GAN. The segmentation of the hand in images with complex background was done by the proposed ResNet34-based segmentation network. The finger-spelled signs were recognized by an ensemble consisting of fine-tuned VGG19 and ResNet50 neural networks and a ResNet convolutional neural network trained from scratch. Experimental results demonstrate that, thanks to a sufficient amount of training data, a high recognition rate can be attained on RGB images. We demonstrated experimentally that in a person-independent scenario, on a test subset with gestures expressed by four performers, a recognition rate close to 90% can be achieved using the proposed approach. Future work will include investigations on using the rendering techniques for data augmentation while training neural networks.
Acknowledgements
This work was supported by the Polish National Science Center (NCN) under research grant 2017/27/B/ST6/01743.
References
[1] M. Sagayam and J. Hemanth, “Hand posture and gesture recognition tech-
niques for virtual reality applications: A survey,” Virtual Reality, vol. 21,
no. 2, pp. 91–107, 2017.
[2] F. Chen, Q. Zhong, F. Cannella, K. Sekiyama, and T. Fukuda, "Hand gesture modeling and recognition for human and robot interactive assembly using Hidden Markov Models," Int. J. of Advanced Robotic Systems, vol. 12, no. 4, p. 48, 2015.
[3] D. Raj, I. Gogul, M. Thangaraja, and V. Kumar, "Static gesture recognition based precise positioning of 5-DOF robotic arm using FPGA," in Trends in Industrial Measurement and Automation (TIMA), 2017, pp. 1–6.
[4] H. Liu and L. Wang, "Gesture recognition for human-robot collaboration: A review," Int. J. of Industrial Ergonomics, vol. 68, pp. 355–367, 2018.
[5] S. Patil, D. K. Dennis, C. Pabbaraju, R. Deshmukh, H. Simhadri, M. Varma, and P. Jain, "GesturePod: Programmable gesture recognition for augmenting assistive devices," Microsoft, Tech. Rep., May 2018.
[6] P. Pisharady and M. Saerbeck, “Recent methods and databases in vision-
based hand gesture recognition,” Comput. Vis. Image Underst., vol. 141,
pp. 152–165, 2015.
[7] A. S. Al-Shamayleh, R. Ahmad, M. Abushariah, K. A. Alam, and N. Jomhari, "A systematic literature review on vision based gesture recognition techniques," Multimedia Tools and Applications, vol. 77, no. 21, pp. 28121–28184, 2018.
[8] O. Matei, P. C. Pop, and H. Vălean, "Optical character recognition in real environments using neural networks and k-nearest neighbor," Applied Intelligence, vol. 39, no. 4, pp. 739–748, 2013.
[9] O. Kopuklu, A. Gunduz, N. Kose, and G. Rigoll, “Online dynamic hand
gesture recognition including efficiency analysis,” IEEE Trans. on Biomet-
rics, Behavior, and Identity Science, vol. 2, no. 2, pp. 85–97, 2020.
[10] O. Oyedotun and A. Khashman, "Deep learning in vision-based static hand gesture recognition," Neural Computing and Applications, pp. 1–11, 2016.
[11] A. Wadhawan and P. Kumar, “Sign language recognition systems: A
decade systematic literature review,” Archives of Computational Methods
in Engineering, Dec. 2019.
[12] H. Zuo, H. Fan, E. Blasch, and H. Ling, "Combining convolutional and recurrent neural networks for human skin detection," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 289–293, 2017.
[13] J. Tompson, M. Stein, Y. LeCun, and K. Perlin, "Real-time continuous pose recovery of human hands using convolutional networks," ACM Trans. Graph., vol. 33, no. 5, 2014.
[14] J. Nagi and F. Ducatelle, et al., “Max-pooling convolutional neural net-
works for vision-based hand gesture recognition,” in IEEE ICSIP, 2011,
pp. 342–347.
[15] P. Barros, S. Magg, C. Weber, and S. Wermter, A Multichannel Convolutional Neural Network for Hand Posture Recognition. Springer, 2014, pp. 403–410.
[16] O. Koller, H. Ney, and R. Bowden, “Deep hand: How to train a CNN on
1 million hand images when your data is continuous and weakly labelled,”
in IEEE Conf. on Comp. Vision and Pattern Rec., 2016, pp. 3793–3802.
[17] S. F. Chevtchenko, R. F. Vale, V. Macario, and F. R. Cordeiro, "A convolutional neural network with feature fusion for real-time hand posture recognition," Applied Soft Computing, vol. 73, pp. 748–766, 2018.
[18] N. Pugeault and R. Bowden, "Spelling it out: Real-time ASL fingerspelling recognition," in IEEE Int. Conf. on Computer Vision Workshops, 2011, pp. 1114–1119.
[19] Y. Chuang, L. Chen, and G. Chen, “Saliency-guided improvement for hand
posture detection and recognition,” Neurocomputing, vol. 133, pp. 404 –
415, 2014.
[20] S. Bambach, S. Lee, D. J. Crandall, and C. Yu, "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions," in IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 1949–1957.
[21] A. U. Khan and A. Borji, “Analysis of hand segmentation in the wild,” in
IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2018, pp.
4710–4719.
[22] B. Kwolek and S. Sako, "Learning Siamese features for finger spelling recognition," in Advanced Concepts for Intelligent Vision Systems. Lecture Notes in Computer Science, vol. 10617, Springer, 2017, pp. 225–236.
[23] B. Kwolek, "GAN-based data augmentation for visual finger spelling recognition," in Eleventh Int. Conf. on Machine Vision (ICMV 2018), vol. 11041. SPIE, 2019, pp. 493–500.
31
[24] N. T. Nguen, S. Sako, and B. Kwolek, “Deep CNN-based recognition of JSL finger spelling,” in Proc. Int. Conf. on Hybrid Artificial Intelligent Systems (HAIS), LNCS, vol. 11734. Springer, 2019, pp. 602–613.
[25] Y. Tabata and T. Kuroda, “Finger spelling recognition using distinctive features of hand shape,” in Int. Conf. on Disability, Virtual Reality and Associated Technologies with Art Abilitation, 2008, pp. 287–292.
[26] L. Kane and P. Khanna, “A framework for live and cross platform fingerspelling recognition using modified shape matrix variants on depth silhouettes,” Comput. Vis. Image Underst., vol. 141, pp. 138–151, 2015.
[27] Rosalina, L. Yusnita, N. Hadisukmana, R. B. Wahyu, R. Roestam, and Y. Wahyu, “Implementation of real-time static hand gesture recognition using artificial neural network,” in Int. Conf. on Computer Appl. and Inf. Proc. Techn. (CAIPT), 2017, pp. 1–6.
[28] M. Asad and G. Slabaugh, “SPORE: Staged probabilistic regression for hand orientation inference,” Computer Vision and Image Understanding, vol. 161, pp. 114–129, 2017.
[29] A. Y. Dawod, M. J. Nordin, and J. Abdullah, “Static fingerspelling recognition based on boundary tracing algorithm and chain code,” in Int. Conf. on Intell. Systems, Metaheuristics & Swarm Intell. ACM, 2018, pp. 104–109.
[30] T. Kim, J. Keane, W. Wang, H. Tang, J. Riggle, G. Shakhnarovich, D. Brentari, and K. Livescu, “Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation,” Computer Speech & Language, vol. 46, pp. 209–232, 2017.
[31] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, “Video-based sign language recognition without temporal segmentation,” in AAAI, 2018.
[32] O. Koller, S. Zargaran, H. Ney, and R. Bowden, “Deep Sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs,” Int. J. Comput. Vision, vol. 126, no. 12, pp. 1311–1325, 2018.
[33] N. Aloysius and M. Geetha, “Understanding vision-based continuous sign language recognition,” Multimedia Tools and Applications, vol. 79, pp. 22177–22209, 2020.
[34] B. Shi, A. M. Del Rio, J. Keane, J. Michaux, D. Brentari, G. Shakhnarovich, and K. Livescu, “American Sign Language fingerspelling recognition in the wild,” in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 145–152.
[35] T. Igarashi, K. Nishino, and S. K. Nayar, “The appearance of human skin: A survey,” Found. Trends. Comput. Graph. Vis., vol. 3, no. 1, pp. 1–95, 2007.
[36] M. Šarić, “LibHand: A library for hand articulation,” 2011, version 0.9. [Online]. Available: http://www.libhand.org/
[37] Blender Online Community, Blender - a 3D modelling and rendering package, Blender Foundation, Stichting Blender Foundation, Amsterdam, 2020. [Online]. Available: http://www.blender.org
[38] W. Baczynski, “Hand pose recognition using 3D hand models,” Master's Thesis, AGH Univ. of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Krakow, Poland, 2019.
[39] D. Vicini, V. Koltun, and W. Jakob, “A learned shape-adaptive subsurface scattering model,” ACM Trans. Graph., vol. 38, no. 4, 2019.
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. on Computer Vision and Pattern Rec. (CVPR), 2016, pp. 770–778.
[41] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. de Mori, and Y. Bengio, “Quaternion convolutional neural networks for end-to-end automatic speech recognition,” in Interspeech. ISCA, 2018, pp. 22–26.
[42] C.-A. Popa, “Learning algorithms for quaternion-valued neural networks,” Neural Process. Lett., vol. 47, no. 3, pp. 949–973, 2018.
[43] T. Nitta, “A quaternary version of the back-propagation algorithm,” in Proc. of Int. Conf. on Neural Networks, vol. 5, 1995, pp. 2753–2756.
[44] X. Zhu, Y. Xu, H. Xu, and C. Chen, “Quaternion convolutional neural networks,” in European Conf. on Computer Vision (ECCV). Springer, 2018, pp. 645–661.
[45] Z. Zhao, P. Zheng, S. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE Trans. on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[46] C. Li and K. M. Kitani, “Pixel-level hand detection in ego-centric videos,” in IEEE Conf. on Computer Vision and Pattern Recognition, 2013, pp. 3570–3577.
[47] S. Lee, S. Bambach, D. J. Crandall, J. M. Franchak, and C. Yu, “This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video,” in IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 2014, pp. 557–564.
[48] A. Mittal, A. Zisserman, and P. H. S. Torr, “Hand detection using multiple proposals,” in Proc. of the British Machine Vision Conf. BMVA Press, 2011, pp. 75.1–75.11.
[49] S. Narasimhaswamy, Z. Wei, Y. Wang, J. Zhang, and M. Hoai, “Contextual attention for hand detection in the wild,” in Int. Conf. on Computer Vision (ICCV), 2019, pp. 9567–9576.
[50] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
[51] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. of the 27th Int. Conf. on Neural Information Processing Systems - Vol. 2. Cambridge, USA: MIT Press, 2014, pp. 2672–2680.
[52] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 2242–2251.
[53] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV. Springer, 2014, pp. 346–361.
[54] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” J. Artif. Int. Res., vol. 11, no. 1, pp. 169–198, 1999.
[55] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). USA: IEEE Computer Society, 2012, pp. 2066–2073.
[56] H. Tang, H. Liu, W. Xiao, and N. Sebe, “Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion,” Neurocomputing, vol. 331, pp. 424–433, 2019.
[57] P. Nakjai and T. Katanyukul, “Hand sign recognition for Thai Finger Spelling: An application of convolution neural network,” J. of Signal Processing Systems, vol. 91, no. 2, pp. 131–146, 2019.
[58] K. M. Lim, A. W. C. Tan, C. P. Lee, and S. C. Tan, “Isolated sign language recognition using convolutional neural network hand modelling and hand energy image,” Multimedia Tools and Applications, vol. 78, no. 14, pp. 19917–19944, 2019.
[59] Y. Huang and J. Yang, “A multi-scale descriptor for real time RGB-D hand gesture recognition,” Pattern Recognition Letters, 2020.
[60] A. Sharma, N. Sharma, Y. Saxena, A. Singh, and D. Sadhya, “Benchmarking deep neural network approaches for Indian Sign Language recognition,” Neural Computing and Applications, Oct. 2020.
[61] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, 2021.