Recognition of JSL Fingerspelling Using Deep
Convolutional Neural Networks
Bogdan Kwolek a,∗, Wojciech Baczynski a, Shinji Sako b
aAGH University of Science and Technology, 30 Mickiewicza Av., 30-059 Krakow, Poland
bNagoya Institute of Technology, Nagoya, Japan
Abstract
In this paper, we present an approach for the recognition of static fingerspelling in
Japanese Sign Language on RGB images. Two 3D articulated hand models have
been developed to generate synthetic fingerspellings and to extend a dataset
consisting of real hand gestures. In the first approach, advanced graphics
techniques were employed to rasterize photorealistic gestures using a skinned hand
model. In the second approach, gestures rendered using simpler lighting techniques
were post-processed by a modified Generative Adversarial Network. In
order to avoid the generation of unrealistic fingerspellings, a hand segmentation term
has been added to the loss function of the GAN. The segmentation of the hand
in images with complex background was done by the proposed ResNet34-based
segmentation network. The finger-spelled signs were recognized by an ensemble
of both fine-tuned and trained-from-scratch neural networks. Experimental
results demonstrate that, owing to a sufficient amount of training data, a high
recognition rate can be attained on RGB images. The JSL dataset with pixel-level
hand segmentations is available for download.
Keywords: Fingerspelling recognition, Generative Adversarial Networks,
semantic segmentation, U-Net, residual networks (ResNets)
∗Correspondence to: Department of Computer Science, AGH University of Science and
Technology, 30 Mickiewicza Av., Building D-17, 30-059 Krakow, Poland
Email address: bkw@agh.edu.pl (Bogdan Kwolek )
3.01.2021
1. Introduction
Hand detection, tracking, and recognition of fingerspellings are significant research
areas due to their high application potential in human-machine communication,
virtual reality [1], entertainment, robotics [2, 3, 4], medicine, and assistive
technologies for the handicapped and the elderly [5]. Communication by gesture is
one of the most intuitive and flexible ways to attain user-friendly man-machine
interaction. Although huge efforts have been undertaken by many research teams in
the last decade [6, 7], there are still challenges to be addressed in attaining
the recognition performance required by real-life applications, e.g. [8, 9].
Gesture recognition on images acquired by a single color camera is a very useful
yet complex task because of several difficulties, including occlusions, variations
in gesture expressions, and differences in hand anatomy and appearance.
In recent years, a number of approaches to the recognition of static gestures
on RGB images have been proposed [10, 11]. In a recently published work [12],
an end-to-end network for human skin detection on color images, obtained by
integrating recurrent neural layers into Fully Convolutional Neural Networks (FCNs),
has been proposed. Despite significant developments in learning deep Convolutional
Neural Networks (CNNs), a recent review on gesture recognition [10]
evokes only one noteworthy work, done by Tompson in collaboration with
LeCun et al. [13]. In an earlier work [14], a CNN capable of classifying six hand
gestures and controlling robots on the basis of colored gloves was proposed.
More recently, in [15] a CNN implemented in Theano and applied on the Nao
humanoid robot has been discussed. In [16], a CNN learned on one million
data samples to classify sign characters has been proposed. However, only a
subset of the dataset, i.e. 3361 frames manually labeled into 45 classes, has been
made publicly available. Recently, a method employing Gabor features, Zernike
moments, Hu moments, and contour-based descriptors to select features for further
combination by a fusion-based convolutional neural network (FFCNN) has
been introduced [17].
One of the obstacles that researchers and practitioners face in utilizing
deep CNNs on a larger scale is the lack of properly aligned datasets of sufficient
size [6], as well as a shortage of robust real-time hand detectors. A dataset
introduced in [18] consists of 65000 samples representing 24 classes. However,
the gestures were performed by only nine subjects. A dataset utilized in [19]
contains 2750 samples with complex background, which were performed by 40
subjects. However, this dataset has only ten classes. In a dataset [13] targeted
at hand pose recovery there are 72757 and 8252 frames in the training and
test sets, respectively. However, only two performers participated in the recordings,
in a scenario with three Kinect sensors (one frontal and two side). The EgoHands
dataset [20], which has pixel-level annotations for hands, with two participants in
each video interacting with each other, has recently been used in a work devoted
to hand segmentation [21]. From the above literature review it follows that, in
general, there is a lack of datasets with pixel-level annotations. Moreover, almost
all datasets lack separate training and testing subsets. As
shown in our work [22], the recognition rate in person-independent evaluation
protocols drops significantly in comparison to protocols in which data
of the same subjects appear in both the training and testing subsets. Moreover, as
demonstrated in [23, 24], synthetically generated hand images can improve the
classification performance. However, currently available datasets do not contain
data that would permit further development of methods for a more effective
use of synthetically generated data.
Fingerspelling is a loanword system for borrowing orthographic representations
of words from a spoken language into a sign language, for instance from English into
American Sign Language (ASL), or, as in this work, from Japanese into Japanese
Sign Language. It is a form, and frequently an integral part, of sign language,
where each sign corresponds to a letter of the alphabet. It is very often used
for proper names, brand names, and place names, as well as digits, which do not
have conventional lexical signs. Individuals generally spell their name when
they introduce themselves. There are many compelling reasons that make
fingerspelling an appealing area of research. Tabata and Kuroda [25] developed
a Stringlove system for recognizing hand shapes and fingerspelling in Japanese
Sign Language (JSL). It is built on a custom-made glove equipped with
sensors capturing finger features. The glove uses nine contact sensors and 24
inductcoders that jointly estimate features such as the adduction/abduction angles
of the fingers, thumb and wrist rotations, the joint flexion/extension of the fingers,
as well as the contact positions among the fingertips. More recently, in [26] a
modified kind of shape matrix for capturing the salience of fingerspelling postures
through precise sampling of contours and regions has been proposed. In
[22], recognition of JSL fingerspellings was done on the basis of embeddings
determined by multiple Siamese CNNs. A dataset consisting of real gestures has
been introduced in the discussed work. It contains 5311 training images and 579
test images of size 64×64.
Japanese Sign Language, also known under the acronym JSL, is the visual
sign language used in Japan. Like other sign languages, JSL comprises words, or
signs, and the grammar with which they are bound together. The Japanese
Sign Language syllabary is a system of manual kana utilized as part of JSL.
In general, fingerspelling is used mostly for foreign words and last names. The
JSL fingerspellings are performed with the five fingers of the hand and the direction
in which it points. For example, the signs 'na', 'ni', and 'ha' are all expressed with
the first two fingers of the hand extended straight, but for the sign 'na' the fingers
point down, for 'ni' across the body, and for 'ha' toward the partner or audience.
The signs for 'te' and 'ho' are both made with an open flat hand, but in 'te' the palm
faces the viewer, and in 'ho' it faces away. These and many other aspects make
recognition of JSL fingerspelling on the basis of a single RGB camera a difficult
task. Most fingerspellings are expressed through static postures, but some of
them are dynamic. In addition, dullness, half dullness, and long sounds
are represented by dynamic postures. In this study, we focused only on static
fingerspellings. There are 41 static fingerspellings in JSL.
In this work, we present a framework for the recognition of static fingerspelling in
Japanese Sign Language on RGB images. Two 3D articulated hand models have
been developed to generate synthetic fingerspellings and to extend the real hand
gestures of the JSL dataset [22] with photorealistic renderings of the
hand. In the first approach, advanced graphics techniques were employed to create
photorealistic gestures using a skinned hand model. In the second approach,
gestures rendered in advance using simple lighting techniques were further
post-processed by a modified Generative Adversarial Network (GAN). To avoid the
generation of unrealistic fingerspellings, a hand segmentation term has been added
to the loss function of the GAN. The segmentation of the hand in images with
complex background was done by the proposed ResNet34-based segmentation
network. The finger-spelled signs were recognized by an ensemble consisting of
a VGG-based neural network and two ResNet quaternion convolutional neural
networks. The contribution of this work is a framework for improving the recognition
of fingerspellings on RGB images by employing advanced techniques for
photorealistic hand image synthesis, including rendering techniques and
modifications of GANs for enhancing the photorealism of hand gestures. A large
dataset (eleven thousand images) with both pixel-level hand segmentations and
synthetically generated hand postures, for learning deep segmentation models as
well as deep neural networks for fingerspelling recognition, is proposed.
2. Relevant work
In [15], a multichannel CNN for hand posture recognition has been proposed.
A cubic kernel has been used to enhance features for posture classification. The
system has been evaluated on the Nao robot. In [27], a glove has been utilized in
order to provide a contour representation of gestures. A neural network trained
on a dataset consisting of one hundred images per gesture permits
achieving 90% classification accuracy. In [28], a staged probabilistic regressor
(SPORE) algorithm for the estimation of hand orientation from 2D monocular
images has been proposed. In the discussed approach, simultaneously learning hand
orientation and pose significantly increased the performance of pose classification
on 2D monocular images. In a recently published work [29] focusing on
the ASL, an approach for the detection and extraction of shape for static
fingerspelling recognition on the basis of boundary tracking and chain codes has
been proposed. On images of size 320×240 acquired by a webcam and images
collected from freely available resources, the recognition accuracy was 97.75%
and 96.48% for alphabet characters and numbers, respectively. Kim et al. [30]
determine a signer-dependent skin color model using manually annotated hand
regions for fingerspelling recognition. In a signer-dependent setting they achieve
up to about 92% letter accuracy, whereas in a multi-signer setting they achieve up
to 83% letter recognition accuracy. Huang et al. [31] utilize a Faster R-CNN-based
hand detector, trained on manually annotated hand bounding boxes, and
apply it to general sign language recognition. Convolutional neural network-based
features have demonstrated high potential in recent approaches [16, 32]. A
survey of recent achievements in vision-based continuous sign language recognition
is presented in [33]. The recently introduced Chicago Fingerspelling in the
Wild (ChicagoFSWild) dataset [34] contains 7304 fingerspelling sequences from
online videos.
3. Fingerspelling Modeling and Rendering
Rendering a realistic and accurate hand shape is not an easy task because
of the large variety of poses that the human hand can assume, as well as difficulties
in modeling skin appearance [35]. A commonly used approach for rendering the
hand shape in the requested poses is linear blend skinning (LBS). Starting from
the open source library LibHand v. 0.9 [36] and a 42-DOF skeleton, we developed
a new rendering API for LibHand that configures the hand in the requested
poses by a set of sliders, which can be manually moved using the mouse. It
uses the textured 3D model of LibHand and permits exporting the resulting
models into the md5 graphics format for further OpenGL-based rendering. For
each gesture we considered several allowable hand postures and orientations,
and different finger inter-distances, in order to synthesize a large number of training
images. Thirty-three students posed and articulated the 3D hand model to
render the gestures. As a result, four thousand (4018) synthetic images were
generated in order to balance the JSL dataset [22], and after additional
post-processing they were stored as a subset of the mentioned JSL dataset.
The resulting synthetic hand images are a quite faithful representation of the JSL
signs, see Fig. 1. However, due to the insufficient quality of the lighting algorithms
for human skin rendering available in standard OpenGL, and particularly
in order to increase the photorealism of the rendered gestures, the images were
further post-processed by Generative Adversarial Networks (GANs), outlined
in Sect. 6. Finally, a sub-dataset containing 4018 samples has been generated
and stored as another subset (called RHM) of the whole JSL dataset. There are
roughly 98 different samples for each of the 41 JSL gestures, with different
orientations and small variations in the 3D positions of the fingers with respect
to the location of the wrist.
Figure 1: Examples of rendered Hiragana signs using LibHand with our API for hand articulation
and gesture modeling.
In order to employ recent rendering and lighting developments, a custom
3D hand model has been prepared in Blender, a 3D computer graphics
software toolset [37]. The models have been designed using software version 2.80,
which introduced Eevee, a physically-based real-time renderer that complements
the Cycles rendering engine. The Cycles engine works by casting
rays of light from each pixel of the camera into the scene. They refract and
reflect, or get absorbed, until they either hit a light source or reach a predefined
bounce limit. Eevee is a real-time render engine with advanced features,
including Non-photorealistic Rendering (NPR). NPR is an active area of
computer graphics, which focuses on enabling a wide variety of expressive styles
for digital entertainment. The Eevee renderer has support for baked indirect
lighting, screen space ambient occlusion, screen space reflections, and other modern
commodities provided by current-generation graphics hardware. Given that
renderings with the Eevee engine can be performed in about one fifth of the
Cycles rendering time on the same hardware, the initial renderings were done
in the Eevee engine, whereas the final ones were done in the Cycles engine. A
hand model [24] consisting of 21 bones and five control bones has been extended
in this work to permit photorealistic hand gesture modeling and animation by
Python scripts. The bones in the skeleton form a structure of rigid bodies
connected by joints with one or more degrees of freedom. The articulated hand
model has 37 degrees of freedom (DOF). The 3D mesh consists of 528 vertices and
is composed of 1036 triangles [24]. The 26-element skeleton (armature) is bound
to this 3D mesh. The root joint is located in the hand's wrist.
Generation of synthetic training images is done by executing Python scripts
that employ the Blender graphics engine [38]. Basic configuration options of the
scripts include different camera viewpoints and different lighting. The model
can also be exported to the md5 data format to perform animations in external
programs. Figure 2 depicts the JSL sign 'ka' expressed by the
3D hand observed from different camera views.
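The enumeration of camera viewpoints around the hand can be sketched as follows. This is a hedged illustration in plain Python, not the authors' Blender scripts: the function name and the angle sets are our own, and the positions would be assigned to a Blender camera object in the actual pipeline.

```python
import math

def camera_viewpoints(radius, azimuths_deg, elevations_deg):
    """Generate camera positions on a sphere centered on the hand (placed
    at the origin) from lists of azimuth/elevation angles in degrees."""
    positions = []
    for el in elevations_deg:
        for az in azimuths_deg:
            el_r, az_r = math.radians(el), math.radians(az)
            x = radius * math.cos(el_r) * math.cos(az_r)
            y = radius * math.cos(el_r) * math.sin(az_r)
            z = radius * math.sin(el_r)
            positions.append((x, y, z))
    return positions

# e.g. three azimuths and two elevations give six camera views of the sign
views = camera_viewpoints(2.0, [0, 30, 60], [0, 20])
```

Each generated position would then be looped over in the rendering script, pointing the camera back at the origin before rendering a frame.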
Figure 2: Selected shots of JSL sign 'ka' for various camera views.
At the beginning, we investigated the widely used three-point lighting technique
for rendering realistic fingerspelling. However, it quickly turned out
that the three-point setup is not the main obstacle to achieving photorealistic
skin rendering, and that a significant improvement in the realism of the
rendered hands can be achieved by the use of Subsurface Scattering (SSS), which
simulates the transport of light through a translucent surface. In the discussed
technique, the light penetrates a material, in our case the human skin, and
internally scatters at irregular angles, resulting in more photorealistic human skin.
This is a very important issue in photorealistic hand rendering. Recently, a deep
learning approach to subsurface scattering has been proposed in [39]. Further
improvement of hand realism has been achieved by the use of the Filmic Blender
color palette. Figure 3 presents the effect of using this color palette in photorealistic
hand rendering.
Figure 3: Photorealistic hand rendering: regular texture (left), Filmic Blender color palette
(right). The images are stored in vector graphics format; better viewing can be obtained by
zooming in on this figure.
Finally, a simpler two-point lighting setup has been employed in the rendering of
the sub-dataset with realistic hand renderings. The main light source was an
Area-type lamp providing the surface light, which can be rotated around the
hand according to user needs. The second light source is a back light providing
chiaroscuro effects, which can be turned on or off through the Python
scripts. Figure 4 depicts sample images rendered using the discussed
techniques.
Figure 4: Selected shots of JSL sign ’ka’ for various lighting.
The 3D articulated hand model has been used to generate synthetic
fingerspellings and to extend our dataset consisting of real hand gestures. Twelve
different gesture realizations were prepared for each of the 41 signs. For each
realization we modeled a starting and a final hand posture, and rendered ten
images per realization: the two modeled postures and eight postures interpolated
between them. In this way the gestures differ in hand postures, expressing various
realizations of the gesture by different persons. Figure 5 depicts samples of the
rendered images for the sign 'ka' from JSL. In total, 5892 realistic hand gestures
were selected from the rendered dataset and then stored as a subset of the JSL
dataset.
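The interpolation between the starting and final postures can be sketched as a linear blend of joint-angle vectors. This is an illustrative reconstruction, not the authors' rendering script; the 37-element pose vector is a stand-in for the model's DOF values.

```python
def interpolate_poses(start, end, n_between=8):
    """Linearly interpolate joint-angle vectors between a starting and a
    final hand pose.  Returns the start pose, n_between intermediate poses,
    and the end pose (n_between + 2 frames in total)."""
    n_frames = n_between + 2
    ts = [i / (n_frames - 1) for i in range(n_frames)]
    return [[(1.0 - t) * s + t * e for s, e in zip(start, end)] for t in ts]

# a toy 37-DOF pose vector; with n_between=8 this yields the ten frames
# rendered per gesture realization
frames = interpolate_poses([0.0] * 37, [1.0] * 37, n_between=8)
```

In the actual pipeline each interpolated vector would be applied to the armature's bone rotations before rendering a frame.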
Figure 5: Example realizations of JSL gesture ’ka’ that were obtained in the photorealistic
hand rendering.
3.1. 3D-Model Based Dataset for JSL Recognition
The JSL dataset consists of 18029 images, of which the training subset
contains 16343 images and the testing subset 1686 images. The test subset
contains only real images, with gestures performed by four persons, including
three Japanese performers, who did not participate in the recording of the training
images. The training subset contains both real and synthesized images. The real
images are taken from the training subset of the former JSL dataset [22, 24]. The
images are of size 64×64 with uniform background. Thanks to the uniform
background, the hands can be delineated easily and then used to train models for
hand segmentation or for gesture classification on images with artificially inserted
complex backgrounds. As far as we know, this is currently the largest dataset
with pixel-level delineated hands. Moreover, until now rather simple rendering
techniques were applied in generating synthetic hands for training deep learning
models. The dataset has been stored in .mat files and can easily be imported
into Matlab and Python. The whole JSL dataset is freely available at:
http://home.agh.edu.pl/~bkw/data/neu2020.
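A minimal sketch of working with such .mat files from Python is given below. The variable keys used here (`X_train`, `y_train`) are hypothetical — the actual key names should be checked with `loadmat(...).keys()` after downloading the dataset; the demo writes and reads a tiny synthetic file only to illustrate the round trip.

```python
import numpy as np
from scipy.io import savemat, loadmat

# Hypothetical layout: uint8 RGB images of size 64x64 and labels in 0..40
# (41 static JSL fingerspellings).  Replace with the dataset's real keys.
demo = {
    "X_train": np.zeros((16, 64, 64, 3), dtype=np.uint8),
    "y_train": np.arange(16) % 41,
}
savemat("jsl_demo.mat", demo)

data = loadmat("jsl_demo.mat")
X = data["X_train"]
y = data["y_train"].ravel()   # loadmat returns 2D arrays; flatten labels
```

Note that `loadmat` always returns at least 2D arrays, hence the `ravel()` on the label vector.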
4. Neural Networks for Gesture Modeling and Recognition
At the beginning of this section we present residual neural networks. Afterwards,
we outline quaternion convolutional neural networks.
4.1. ResNet Convolutional Neural Networks
In [40], He et al. introduced residual networks (ResNets), which provide
an important contribution to training very deep neural networks. The residual
learning framework simplifies the training of neural networks and enables them
to be substantially deeper, which leads to improved performance. Residual
networks are much deeper in comparison to their ordinary counterparts, yet
they require a similar number of parameters. The main idea is to utilize blocks
that re-route the input and add it to the representation learned by the previous
layers. The constituent building block of the discussed architecture is the ResNet
unit. A deeper network can be built by simply repeating such a block, i.e. the
smaller sub-network. A desired underlying mapping H(x) can be approximated
by a few stacked nonlinear layers, so it can also be obtained through the underlying
mapping F(x) = H(x) − x. As a result, it is possible to reformulate it as
H(x) = F(x) + x, which comprises the residual function F(x) and the input x.
The connection of the input to the output is called a skip connection or identity
mapping. The central idea is that if multiple nonlinear layers can approximate
the complex function H(x), then they can equally well approximate the residual
function F(x). Thus, the stacked layers are not employed to fit H(x); instead,
these layers approximate the residual function F(x).
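The identity-mapping idea can be illustrated with a toy fully-connected residual unit in NumPy. This is a didactic sketch of H(x) = F(x) + x, not the architecture used in the paper.

```python
import numpy as np

def residual_block(x, W1, W2):
    """A minimal fully-connected residual unit: the stacked layers fit the
    residual F(x), and the skip connection restores H(x) = F(x) + x."""
    relu = lambda v: np.maximum(v, 0.0)
    f = W2 @ relu(W1 @ x)   # the residual function F(x)
    return relu(f + x)      # identity skip connection, then nonlinearity

# With zero weights the block reduces to the identity for non-negative x,
# which is what makes very deep stacks of such blocks easy to optimize:
# the layers only need to learn deviations from the identity.
x = np.array([1.0, 2.0, 3.0])
W = np.zeros((3, 3))
out = residual_block(x, W, W)
```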
4.2. Quaternion Convolutional Neural Network
Recently, in order to exploit internal dependencies within the features, a
quaternion convolutional neural network (QCNN) has been proposed [41]. Let
$\gamma^{l}_{ab}$ and $S^{l}_{ab}$ denote the quaternion output and the pre-activation
quaternion output at layer $l$ and at the indexes $(a, b)$ of the feature map, and
let $w$ be a quaternion-valued weight filter map of size $K \times K$. The convolution
can be expressed in the following manner:

$$\gamma^{l}_{ab} = \alpha(S^{l}_{ab}) \qquad (1)$$

where $S^{l}_{ab}$ is equal to:

$$S^{l}_{ab} = \sum_{c=0}^{K-1} \sum_{d=0}^{K-1} w^{l} \otimes \gamma^{l-1}_{(a+c)(b+d)} \qquad (2)$$

and $\alpha$ stands for the quaternion split activation function [42], defined as follows:

$$\alpha(Q) = f(r) + f(x)\,i + f(y)\,j + f(z)\,k \qquad (3)$$

where $f$ is any standard activation function applied component-wise to the
quaternion $Q = r + xi + yj + zk$. A derivation of the
backpropagation algorithm for quaternion neural networks can be found in [43].
Recently, in [44] a QCNN for color image processing has been proposed. In
the discussed approach the image is represented in the quaternion domain as a
quaternion matrix. The quaternion convolution provides scaling and rotation
of the input in color space, which yields a more structural representation of
color information [44], whereas the conventional real-valued convolution is only
capable of executing scaling transformations on the input. Because QCNNs
enforce an implicit regularizer on the network architecture, more complicated
relationships across the channels can be captured, which improves the training
of this kind of neural network.
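The Hamilton product ⊗ and the split activation of Eq. (3) can be sketched in NumPy as follows. This is an illustration of the underlying quaternion algebra only, not the QCNN implementation itself; here f is taken to be the ReLU.

```python
import numpy as np

def hamilton(p, q):
    """Hamilton product of quaternions p = (r, x, y, z) and q, the operation
    applied between weight and feature quaternions in Eq. (2)."""
    r1, x1, y1, z1 = p
    r2, x2, y2, z2 = q
    return np.array([
        r1*r2 - x1*x2 - y1*y2 - z1*z2,
        r1*x2 + x1*r2 + y1*z2 - z1*y2,
        r1*y2 - x1*z2 + y1*r2 + z1*x2,
        r1*z2 + x1*y2 - y1*x2 + z1*r2,
    ])

def split_relu(q):
    """Quaternion split activation: apply f (here ReLU) to each component."""
    return np.maximum(np.asarray(q, float), 0.0)

# sanity check of the algebra: i * j = k for the quaternion imaginary units
k = hamilton((0, 1, 0, 0), (0, 0, 1, 0))
```

Note that, unlike a real-valued product of channels, the Hamilton product mixes all four components, which is the source of the cross-channel coupling discussed above.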
5. Hand Segmentation
In recent years, considerable progress in object detection has been achieved
[45]. However, little work has been done in the area of hand detection. In [46],
a dataset consisting of 600 images acquired under various lighting conditions and
backgrounds has been proposed to highlight the advantages and shortcomings of
different methods for egocentric hand detection. Later, in [47] another approach
to detecting hands in social interactions in egocentric videos has been
demonstrated. However, only interactions in laboratory settings were considered. In
the already evoked work [20], Bambach et al. introduced a skin-based approach that
first determines a set of bounding boxes that may surround hand regions,
afterwards utilizes CNNs to detect hands, and finally executes GrabCut to segment
them. They also introduced the EgoHands dataset, consisting of 48 first-person
videos of people interacting in realistic environments, with pixel-level ground
truth for over 15000 hand instances. Our dataset contains 14875 images with
pixel-level ground truth and has the potential to fill a gap in hand detection
and segmentation in third-person images. In such third-person settings, [48]
used deformable part models and skin heuristics to detect hands. Recently, a
large dataset suitable for deep learning has been introduced in [49]. However,
the dataset mentioned above does not contain pixel-level annotations.
In order to reliably segment the hand on RGB images with complex background,
we designed an encoder-decoder neural network. In the proposed neural
network for hand segmentation on images with complex background, we employ
a deep CNN and add skip connections between the layers in the encoder and
the decoder. The encoder path is based on the 34-layer ResNet (ResNet34),
whereas the decoder path uses transpose 2D blocks to perform
2D upsampling. The parameters of each transpose 2D block are such that
the height and width are doubled, whereas the number of channels is halved,
see Fig. 6. There are three skip connections: the first connection is made
after the (3×3, 64; 3×3, 64)×3 ResNet blocks, the second one after the
(3×3, 128; 3×3, 128)×4 blocks, and the last one after the (3×3, 256; 3×3, 256)×6
blocks of the 34-layer ResNet. The feature maps delivered by these skip
connections from the encoder, i.e. the ResNet34 network, are summed with
feature maps extracted in the decoder path, which uses the transpose 2D blocks
to expand the dimensions of the convolved feature outputs. Such skip connections
between encoder layers and decoder layers were introduced in the U-Net neural
network [50], which is a symmetrical neural network with a 'U'-like shape.
Our segmentation network is not symmetrical, since the encoder path is based
on the deep ResNet34 network, whereas in the decoder path no residual
blocks are employed, see Fig. 6. In U-Net neural networks, a down-sampling
(contraction) path is utilized to extract and interpret the context (what), while
an up-sampling (expansion) path is used to enable precise localization (where).
Furthermore, in order to recover the fine-grained spatial information lost
in the pooling or down-sampling layers, skip connections between symmetrical
layers are employed in such encoder-decoder networks. By combining the location
information from the down-sampling path with the contextual information
from the up-sampling path, such networks permit obtaining general maps that
combine localization and context. Our ResNet34-based network has all the
features mentioned above, and additionally possesses extended capabilities for
feature extraction.
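The decoder step — doubling the spatial resolution, halving the channel count, and summing with the encoder skip-connection feature map — can be illustrated at shape level as follows. This sketch stands in for a learned transpose-convolution block: the random 1×1 projection is a placeholder for trained weights, and the layout assumed is channels-first.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample_block(x, n_out):
    """Stand-in for a transpose 2D block: double H and W, then project the
    channel dimension to n_out (here with a random 1x1 projection)."""
    c, h, w = x.shape
    up = x.repeat(2, axis=1).repeat(2, axis=2)   # nearest-neighbour x2
    proj = rng.standard_normal((n_out, c))       # placeholder for weights
    return np.einsum("oc,chw->ohw", proj, up)

# decoder step: upsample deep features, halve the channels, and add the
# encoder skip-connection feature map of matching shape
deep = rng.standard_normal((256, 8, 8))
skip = rng.standard_normal((128, 16, 16))
merged = upsample_block(deep, 128) + skip
```

The shape constraint is the point: the transpose block must produce exactly the resolution and channel count of the corresponding encoder map so the element-wise sum is defined.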
Figure 6: ResNet34-based network for hand segmentation.
6. Generative Adversarial Network for Photorealistic Fingerspelling Synthesis
Generative Adversarial Networks (GANs) utilize an adversarial discriminator
to align the distributions of real and generated data samples. In a two-player
minimax game, the generator G tries to generate samples on the basis of noise z
that fool the discriminator D, while D learns to maximize the probability of
assigning the correct class label to both the real data and the fake data G(z) [51].
In the optimal case, the generated samples would be indistinguishable from real
samples. Conventional image-to-image GANs require paired training data. The
recently proposed CycleGAN [52] utilizes unpaired training data thanks to a cycle
consistency loss function. CycleGAN is a general framework for learning, from
unaligned examples, the mapping functions between two domains X and Y. The
goal is to learn a mapping G: X → Y such that the distribution of data from
G(X) is indistinguishable from the distribution of data in Y according
to an adversarial loss. To achieve this, the authors proposed to also consider an
inverse mapping F: Y → X, as well as to employ a so-called cycle consistency
loss to prevent the learned mappings G and F from contradicting each other [52].
Given training samples $\{x_i\}_{i=1}^{N}$, $\{y_i\}_{i=1}^{M}$, where $x_i \in X$ and
$y_i \in Y$, with data distributions $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$, for
the mapping $G: X \rightarrow Y$ and discriminator $D_Y$ the objective function can
be expressed in the following manner:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)]
+ \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \qquad (4)$$

The generator $G$ minimizes it against the adversary $D_Y$ that tries to maximize it:
$\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y)$. For the mapping
$F: Y \rightarrow X$ and the discriminator $D_X$, the generator $F$ minimizes the
objective $\mathcal{L}_{GAN}(F, D_X, X, Y)$ against the adversary $D_X$ that tries
to maximize it: $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, X, Y)$.

The cycle consistency loss takes the following form [52]:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1]
+ \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1] \qquad (5)$$

The full loss function has the following form:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y)
+ \mathcal{L}_{GAN}(F, D_X, X, Y) + \gamma \mathcal{L}_{cyc}(G, F) \qquad (6)$$

where $\gamma$ balances the objectives. Cycle consistency means that the composition
of the two mappings is the identity mapping. The aim of the CycleGAN is
to find the generators:

$$G^{*}, F^{*} = \arg\min_{G,F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) \qquad (7)$$
In our work, we learned mappings from synthetic images to real images and from
real to synthetic images. The inputs were the cropped synthetic and real images
of the hand with uniform background.
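A Monte-Carlo estimate of the cycle consistency loss of Eq. (5) can be sketched as follows. The generators in the demonstration are toy invertible functions, not trained networks; with real generators the loss is nonzero and drives them toward mutual inverses.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, the ||.||_1 expectation term of Eq. (5)."""
    return np.mean(np.abs(a - b))

def cycle_loss(G, F, xs, ys):
    """Monte-Carlo estimate of L_cyc(G, F) over batches xs ~ X, ys ~ Y."""
    return (np.mean([l1(F(G(x)), x) for x in xs])
            + np.mean([l1(G(F(y)), y) for y in ys]))

# toy mappings: when G and F are exact inverses the cycle loss vanishes
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0
xs = [np.ones((4, 4)), np.zeros((4, 4))]
ys = [np.full((4, 4), 3.0)]
loss = cycle_loss(G, F, xs, ys)
```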
CycleGAN has been successfully applied in several image-to-image applications.
However, CycleGAN was not designed to maintain object
shapes well. In this work, we extend the CycleGAN by adding a segmentation
consistency loss to encourage shape alignment between images in the two domains
and to improve the accuracy at the hand boundaries. By incorporating an
additional geometric consistency loss that carries information about hand
shapes, we better maintain the hand shape and its pose during image
post-processing. The proposed loss term has the following form:
$$\mathcal{L}_{cycu}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|U(F(G(x))) - x_b\|_1]
+ \mathbb{E}_{y \sim p_{data}(y)}[\|U(G(F(y))) - y_b\|_1] \qquad (8)$$
where $U$ is the segmentation mapping performed by the ResNet34-based
segmentation unit trained in advance, whereas $x_b$ and $y_b$ denote the binary masks
of the hands. This means that the inputs to our network are the (cropped) synthetic
and real images of the hand with their respective silhouettes, i.e. binary
foreground masks. The discussed term has been multiplied by $\gamma$ and included
as an additional term in the loss function (6), i.e. in calculating $\mathcal{L}$.
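The segmentation consistency term of Eq. (8) can be sketched in the same Monte-Carlo spirit. The identity generators and the thresholding "segmenter" below are placeholders for the trained CycleGAN generators and the ResNet34-based unit U; they only serve to make the term computable in the demonstration.

```python
import numpy as np

def seg_consistency(U, G, F, xs, xbs, ys, ybs):
    """L_cycu of Eq. (8): the cycle-reconstructed images are segmented by U
    and compared (L1) against the ground-truth binary hand masks."""
    term_x = np.mean([np.mean(np.abs(U(F(G(x))) - xb))
                      for x, xb in zip(xs, xbs)])
    term_y = np.mean([np.mean(np.abs(U(G(F(y))) - yb))
                      for y, yb in zip(ys, ybs)])
    return term_x + term_y

# toy stand-ins: identity generators and a thresholding "segmenter"
G = F = lambda im: im
U = lambda im: (im > 0.5).astype(float)
xs = [np.array([[0.9, 0.1], [0.8, 0.2]])]
xbs = [np.array([[1.0, 0.0], [1.0, 0.0]])]   # matching binary hand masks
loss = seg_consistency(U, G, F, xs, xbs, xs, xbs)
```

Because U is fixed (trained in advance), the term penalizes the generators whenever the cycle reconstruction changes the hand silhouette, which is exactly the shape-preservation behaviour described above.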
7. Fingerspelling Recognition
7.1. Fingerspelling Recognition Using Neural Networks Trained from Scratch
We implemented a ResNet-based convolutional neural network consisting of
three ResNet blocks, see Fig. 7. Afterwards, we implemented a QCNN: after
extending the ResNet with a spatial pyramid pooling (SPP) layer [53],
we substituted the convolutional blocks with quaternion-based convolutional
blocks. The motivation behind using the SPP layer is its ability to
better represent the object at multiple scales and input sizes.
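The fixed-length property that motivates the SPP layer can be illustrated with a NumPy max-pooling pyramid. This is a sketch of the general SPP idea [53]; the pyramid levels {1, 2, 4} are chosen here for illustration and need not match the paper's configuration.

```python
import numpy as np

def spp(x, levels=(1, 2, 4)):
    """Spatial pyramid pooling: max-pool each channel over 1x1, 2x2 and 4x4
    grids and concatenate, giving a fixed-length vector for any input size."""
    c, h, w = x.shape
    feats = []
    for n in levels:
        row_bins = np.array_split(np.arange(h), n)
        col_bins = np.array_split(np.arange(w), n)
        for rows in row_bins:
            for cols in col_bins:
                feats.append(x[:, rows][:, :, cols].max(axis=(1, 2)))
    return np.concatenate(feats)

# output length is c * (1 + 4 + 16) = 21c regardless of the spatial size
a = spp(np.random.default_rng(0).standard_normal((8, 13, 17)))
b = spp(np.random.default_rng(1).standard_normal((8, 32, 32)))
```

Both calls above produce vectors of the same length even though the input resolutions differ, which is what allows the subsequent dense layers to accept variable-sized inputs.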
7.2. Fingerspelling Recognition Using Pre-trained Neural Networks
Figure 7: Flowchart of the ResNet used for JSL fingerspelling classification.

The output of the VGG-19 base CNN has been flattened, and then a dense layer consisting of 512 neurons with dropout=0.5, followed by an output layer with soft-max activation, has been added to this base network. The weights in the base model were frozen for the initial training of the network. Afterwards, the layers starting from the seventeenth one (block5) were set as trainable for fine-tuning the network. The output of the ResNet50 base model has been fed to a global average-pooling and a global max-pooling layer. The outputs have been concatenated and then fed to a batch normalization layer. They were next fed to a dense layer with 1024 neurons followed by batch normalization and dropout layers. Afterwards, a dense layer consisting of 512 neurons with dropout=0.5, followed by the output layer with soft-max activation, has been utilized in the model, similarly as in the VGG-19 network.
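The classification head described above can be sketched as a plain numpy forward pass: Dense(512, ReLU), inverted dropout with rate 0.5, and a softmax output. All weight shapes below are illustrative; the actual model is assembled in Keras on top of the frozen base network:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_forward(features, w1, b1, w2, b2, drop_rate=0.5, training=False):
    """Dense(512, relu) -> dropout(0.5) -> Dense(n_classes, softmax),
    applied to (flattened) features produced by the frozen base CNN."""
    h = np.maximum(features @ w1 + b1, 0.0)      # dense + ReLU
    if training:
        # inverted dropout: zero units at random, rescale the rest
        h *= rng.binomial(1, 1 - drop_rate, h.shape) / (1 - drop_rate)
    return softmax(h @ w2 + b2)                  # class probabilities
```

At inference time dropout is disabled (`training=False`), so the head is deterministic.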
7.3. Ensemble of CNNs
Models obtained on the basis of convolutional neural networks are nonlin-
ear. They are learned via optimization using stochastic training algorithms and
they are sensitive to the distribution of the training data. Thus, the optimizers375
find a different set of weights each time they are executed, which in turn leads
to unlike predictions. This means that predictions of neural networks usually
have a high variance. One of the successful approaches to reducing the vari-
ance of such predictions is to learn multiple neural network models instead of a
single model and to combine the predictions of these models. The ensemble of380
such independently trained models not only reduces the variance of predictions
but also produces final outputs that are better than predictions of any single
model. Every ensemble member contributes to the final output and individual
weaknesses are offset by the contribution the other members. Essentially, en-
sembles tend to yield better results when there is a significant diversity among385
18
the members [54], (called also base-learners). There are many different types
of ensembles. In a weighted average ensemble the decisions of ensemble mem-
bers are weighted on the basis of their performance on a hold-out validation
dataset. In a stacking-based ensemble the decisions of base-learners are taken
as input for training a meta-learner, that learns how to optimally combine the390
predictions of base-learners. At the beginning the selected neural networks are
learned using the available training data. Afterwards, a meta-learner is trained
to make a final prediction using the predictions of the trained networks. The
main difference between both methods is that in the weighted average ensem-
ble the weights are optimized and then used for weighting all outputs of the395
base-learners, and finally are taken to calculate the weighted average. This
means that no meta-learner is employed in such ensembles. In a stacking-based
ensemble, the meta-learner takes every single output of the base-learners as a
training instance and learns how to optimally map the base-learner decisions
into a better output decision. The meta-learner can be any classic machine400
learning algorithm.
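The weighted average ensemble described above can be sketched as follows; the coarse grid search over the hold-out set is one simple way to optimize the weights (our assumption, not necessarily the procedure used in the experiments):

```python
import numpy as np
from itertools import product

def weighted_average_ensemble(probs, weights):
    """Combine per-model class probabilities with normalized weights.
    probs: array of shape (n_models, n_samples, n_classes)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, probs, axes=1)   # shape (n_samples, n_classes)

def grid_search_weights(probs, labels, steps=5):
    """Pick the weight vector maximizing hold-out accuracy on a coarse grid."""
    best_w, best_acc = None, -1.0
    grid = np.linspace(0.0, 1.0, steps)
    for w in product(grid, repeat=probs.shape[0]):
        if sum(w) == 0.0:
            continue
        acc = (weighted_average_ensemble(probs, w).argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_acc, best_w = acc, np.asarray(w) / sum(w)
    return best_w, best_acc
```

A stacking ensemble would instead feed the stacked base-learner outputs into a trainable meta-learner rather than a fixed weighted average.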
8. Experimental Results and Discussion
Experimental evaluations have been performed on our JSL fingerspelling dataset, which was discussed in Subsection 3.1. All experiments were carried out on color RGB images of size 64 × 64. Altogether, 16343 images for training and 1686 test images were employed in the evaluations, which consisted in the recognition of 41 JSL static hand gestures. The recognition performance has been assessed in a person-independent (cross-person) scenario, wherein persons attending the recordings of the test data did not take part in the recordings of the training data. In training the GANs as well as training the neural networks for hand segmentation, the synthetic images were used both in the training and the testing of the models, whereas in the gesture recognition only real images were used in the evaluations of the learned models.
8.1. Evaluation of Hand Segmentation
In the first phase of experiments, we selected 1500 real images from the training subset and 500 synthetic images from the RHM sub-dataset. Given the binary hand shapes, we introduced a complex background into the hand images. We used randomly sampled patches from the Office-Caltech dataset [55]. The Office-Caltech dataset contains images of office objects from ten common categories shared by the Office-31 and Caltech-256 datasets. It is composed of ten classes: backpack, bike, calculator, headphones, keyboard, laptop, monitor, mouse, mug, and video-projector. In the next step of this phase of experiments, in order to show the potential of our ResNet34-based network for hand segmentation, we trained an ordinary U-Net. The contraction path of the U-Net is made of four contraction blocks, where each block takes an input map and applies two 3×3 convolution layers followed by a 2×2 max pooling. The number of kernels (feature maps) after each block doubles so that the architecture can learn complex structures effectively. The bottleneck part, which lies between the contracting and expanding paths, is simply built from two convolutional layers (with batch normalization) and dropout. It uses two 3×3 convolutional layers followed by a 2×2 up-convolution layer. Similarly to the contraction path, the expansion section also consists of four expansion blocks. Each block passes the input through two 3×3 convolutional layers followed by a 2×2 upsampling layer. After each block, the number of feature maps utilized by the convolutional layers is halved in order to obtain a symmetrical encoder-decoder network. The input of each block is concatenated with the feature maps of the corresponding contraction layer. After passing through the expansion blocks, the resulting mapping passes through another 3×3 convolutional layer with the number of feature maps equal to the number of desired segments. Figure 8 depicts selected images segmented by our segmentation network. As we can observe, our neural network segments the hands quite reliably in images with complex background. Our experiments demonstrated that the ordinary U-Net is capable of extracting the hands properly only in the case of both training and evaluation on images with uniform background.
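The shape bookkeeping of the contraction path described above (two 3×3 convolutions, a 2×2 max pooling, and feature maps doubling per block) can be sketched as follows; 'same' convolution padding is our assumption:

```python
def unet_encoder_shapes(size, base_filters, blocks=4):
    """Spatial size and channel count after each contraction block,
    assuming two 3x3 'same'-padded convolutions followed by a 2x2
    max pooling, with the number of feature maps doubling per block."""
    shapes, f, s = [], base_filters, size
    for _ in range(blocks):
        s //= 2                       # 2x2 max pooling halves spatial dims
        shapes.append((s, s, f))
        f *= 2                        # feature maps double in the next block
    return shapes
```

The expansion path mirrors this bookkeeping in reverse: spatial dimensions double and feature maps are halved at each expansion block.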
Figure 8: Masks of the segmented hands in images with complex background. Input images
(top row), masks of segmented hands (bottom row).
Table 1 presents Dice scores obtained on a dataset consisting of 200 randomly selected images from the test subset of the JSL dataset with the added complex background. The neural networks were trained on 2000 images with complex background for 50 epochs using the RMSprop optimizer. Afterwards, we created a training dataset of 2000 images, comprising both real and synthetic images, and trained our network to segment the hand during the training of the GANs with the proposed segmentation term in the loss function.
Table 1: Dice scores on the test sub-dataset.

network       U-Net   our
Dice score    0.954   0.984
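The Dice score reported in Table 1 measures the overlap between the predicted and ground-truth binary hand masks; it can be computed as below (the small epsilon guarding against empty masks is our addition):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient 2|A∩B| / (|A|+|B|) for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```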
8.2. Rendering Gestures and GAN-based Post-processing
At the beginning of the experiments we investigated various approaches to rendering the images representing the JSL gestures. Using our API for LibHand we modeled the gestures, exported the models representing the gestures to the md5 data format, and then used our parser to import the mesh and animation data for OpenGL-based rendering. The mesh data and animation data in this format are stored in separate files. One of the advantages of the md5 format is that data is stored in ASCII files and is human readable. Rotations are represented by quaternions. By modifying the values of the parameters stored in the plain text files, it is possible to configure the skeleton of the model into the required poses and then render the model. After parsing the skeleton and animation data, in every frame the 3D hand has been rotated by randomly generated angles to simulate observing the hand from different camera views. Then, we prepared a dataset consisting of real and synthetic images for training GANs in order to enhance the photorealism of the images rendered in this way. Figure 9 depicts example images post-processed by our GAN to improve the photorealism of the synthetically generated gestures. As we can observe, the photorealism of the synthetically generated images has been improved. In particular, light reflections, which were difficult to model, were added to the images. The results presented above were obtained on the basis of our GAN, which has been trained on 1080 synthetic and 1226 real images for 300 epochs, with batch size set to 14, using the Adam optimizer with lr=0.0002 and beta_1=0.5. On a TitanX GPU, the training of the GAN on images of size 128 × 128 took about twenty-four hours. The GAN generator trained in this way has then been used to post-process the synthetic images, which were generated on the basis of models rendered by OpenGL.
Figure 9: Examples of GAN-based post-processed images to improve photorealism of syn-
thetically generated gestures. Synthetic images (upper row), post-processed images (bottom
row).
It is worth noting that ordinary CycleGANs, i.e. without the segmentation component in the loss function, were unable to generate hands without artifacts and unrealistic deformations of the hand, see Fig. 10. Although some images with introduced modifications of the hand shape could potentially be useful, see the 1st image from the left, a considerable percentage of images is not rendered properly. Moreover, subtle shape differences are rendered properly by our network, compare post-processed images #2 - #4 in Fig. 9 and Fig. 10. One of the disadvantages of GAN-based data augmentation for visual fingerspelling recognition [23] is that a visual inspection of the data by a human is needed to eliminate poor gesture realizations. In contrast, the approach to data augmentation for visual fingerspelling recognition presented in this work is fully automatic, i.e. no human-in-the-loop is needed in the process of fingerspelling recognition. The discussed examples were obtained in the same number of epochs as used by the network achieving the results shown in Fig. 9. Somewhat better results can be achieved at the cost of a significantly larger number of epochs. The proposed modification of CycleGAN stabilizes the training of the GAN and permits achieving better hand shapes in the post-processing used for adding more photorealism to 3D model-based rendered fingerspellings.
Figure 10: Examples of GAN-based post-processed images without the segmentation term in
the loss function.
Afterwards, we rendered the hands in Blender without light effects. Finally, in order to obtain more photorealistic images, we illuminated the hands using virtual lights, together with the techniques discussed in Section 3, see Fig. 11. Gestures rendered in this way were included as a subset in the JSL dataset.
Figure 11: Example images of the '01 a' sign: no lighting (top row), with lighting (bottom row).
8.3. Fingerspelling Recognition
We experimented with various neural networks, both trained from scratch and fine-tuned deep CNNs. In the first stage of experiments we trained convolutional neural networks from scratch. The networks were initially pre-trained on the ImageNet dataset downsampled to 64 × 64 × 3. We implemented a ResNet-based convolutional neural network consisting of three ResNet blocks, which has been outlined in Subsection 7.1. The neural network has been trained on RGB images of size 64 × 64 × 3. Each model was trained using the Adam optimizer (lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08) and the categorical cross-entropy loss, with a small learning rate. The learning rate was scheduled to be reduced after 20, 30, 40 and 50 epochs. The values of the hyper-parameters were selected empirically. Afterwards, we trained the ResNet with the convolutional blocks substituted by the quaternion-based convolutional blocks. This neural network has been trained on RGB images of size 64 × 64 × 3 using the same Adam optimizer and parameters.
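The milestone-based learning rate schedule mentioned above can be sketched as a plain function of the epoch index, suitable for use with a Keras `LearningRateScheduler` callback; only the milestone epochs come from the text, while the reduction factor of 0.5 is our assumption:

```python
def scheduled_lr(epoch, base_lr=1e-3, milestones=(20, 30, 40, 50), factor=0.5):
    """Return the learning rate for the given epoch: the base rate is
    multiplied by `factor` once for every milestone already passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```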
In the next stage of experiments we focused on fingerspelling recognition using pre-trained neural networks. We trained the neural networks discussed in Subsection 7.2. As the VGG model expects input images of size 224 × 224 × 3, the images were resized to this size. We initially trained the networks for 30 epochs using the SGD optimizer with lr=1e-4 and momentum=0.9. Next, the neural networks have been fine-tuned for 30 epochs using the SGD optimizer with lr=1e-4 and momentum=0.9. We also experimented with other pre-trained CNNs, including ResNet34, MobileNet and Inception-ResNetV2, fine-tuned for fingerspelling recognition. However, their results were worse in comparison to the results achieved by the above-mentioned networks.
Finally, an ensemble consisting of VGG-19, ResNet50 and the ResNet with convolutional blocks substituted by quaternion-based convolutional blocks has been constructed. The models of the neural networks trained in advance have been loaded and then used to construct an ensemble of deep networks. The output of the ensemble is determined by voting. An MLP-based ensemble has also been trained and evaluated.
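Hard majority voting over the member predictions can be sketched as follows (ties resolve to the lowest class index, a detail of this sketch rather than of the paper):

```python
import numpy as np

def vote(predictions):
    """Hard majority vote over per-model class predictions.
    predictions: (n_models, n_samples) integer array of class indices."""
    n_classes = predictions.max() + 1
    # per-sample vote counts: shape (n_classes, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, predictions,
                                 minlength=n_classes)
    return counts.argmax(axis=0)   # winning class per sample
```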
Table 2 presents the classification performance obtained by the neural networks on the test subset of the JSL dataset. During the training of the neural networks, online data augmentation was performed. As we can observe, the synthetic images allow achieving far better results; a considerable improvement in classification accuracy has been obtained thanks to their use. The multi-model, voting-based ensemble improves the classification performance by about 1.5%, and its results are slightly better than those achieved by the stacking ensemble. Figure 12 depicts the confusion matrix obtained by the best single classifier, i.e. the ResNet50-based classifier.
Table 2: Classification performance in the performer-independent experiment.

            Accuracy  Precision  Recall  F1-score
no. rend.   0.671     0.678      0.671   0.670
ResNet18    0.815     0.829      0.815   0.814
VGG19       0.863     0.869      0.860   0.858
ResNet50    0.877     0.895      0.876   0.875
ensemble    0.892     0.906      0.904   0.904
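A minimal sketch of the online augmentation applied during training is shown below; the concrete transforms (small random shifts and brightness jitter) are our assumption, as the paper does not list them:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, max_shift=4):
    """Online augmentation sketch for a float image in [0, 1]:
    random circular shift of up to max_shift pixels plus brightness
    jitter. Applied independently on every training sample."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(image, (dy, dx), axis=(0, 1))          # spatial shift
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness
    return out
```

Note that horizontal flipping is deliberately omitted here, since mirroring a hand would change the semantics of a fingerspelled sign.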
Figure 12: Confusion matrix: each row represents the real class while each column represents the predicted class of gestures.

One of the major reasons for the insufficient classification performance on a few classes is strong inter-class similarity. The hand shapes of several JSL gestures are quite similar, which may explain the incorrect predictions of the classifier. As we can notice in the images shown in Fig. 13, the hand shapes in classes '06 ka' and '41 ra', as well as '04 e' and '11 sa', are quite alike. One of the reasons is that this is an ill-posed problem with inherent ambiguities. We investigated several approaches to improving the recognition performance, including rendering additional images for the classes with lower classification rates, synthesis of additional images on the basis of Generative Adversarial Networks (GANs), and evaluations using fine-tuned deep neural networks, e.g. MobileNet. However, none of the above-mentioned approaches was able to improve the experimental results presented above.
Figure 13: Example inter-class similarities, 06 ka – 41 ra and 04 e – 11 sa.
To validate the usefulness of the JSL dataset as well as the potential of the trained models in real scenarios, we performed experiments on image sequences. The training and testing data were created on the basis of image sequences acquired during the recording of the JSL dataset. All corresponding test images from the JSL dataset were included in the discussed dataset, and additionally we included the original images so that each sequence contained a minimum of ten images. The total number of test images is equal to 7250. In the same way, we prepared the training subset, which consisted of the same number of image sequences as the test subset, but the total number of images was equal to 7300. Additionally, we assumed that if an original image sequence was included in the training set, we did not omit from it the images already selected as part of the JSL dataset. The experiments were conducted on original RGB images of size 320 × 240. We compared the recognition accuracy achieved by the best performing neural network with the accuracies achieved by recent algorithms [56, 57, 58, 59, 60]. The hands were detected using OpenPose [61] and we processed only the hand performing the gesture. If in a compared algorithm OpenPose gave a better result than the original hand detector, we employed it instead of the authors' hand detector. Table 3 compares the results achieved by our best performing model with the results achieved by recent algorithms for isolated fingerspelling recognition on sequences of RGB images. The input shape of the CNN from [58] has been changed to 64 × 64 and we extended the network with an additional convolutional layer and a following pooling layer. As we can notice, our recognizer achieved superior results.
Table 3: Comparative recognition performance with the performance of recent algorithms for isolated fingerspelling recognition on sequences of RGB images.

Method                         Accuracy [%]
Keyframes [56]                 64.4
GEI CNN 64 × 64 [58]           39.6
CNN 64 × 64 [57]               74.6
Multi-scale descriptor [59]    64.8
VGG16 HOG [60]                 89.2
Our approach                   92.1
The neural networks for gesture recognition have been trained on a TitanX GPU with the batch size set to 64 and the number of epochs set to 100. The neural networks were implemented in Python using the TensorFlow/Keras frameworks.
9. Conclusions
In this paper we presented a framework for the recognition of static fingerspellings on RGB images. The recognition of hand gestures is performed by convolutional neural networks, which have been trained using both real and synthetic images. A few thousand synthetic images for training were generated on the basis of two skinned hand models. In the first approach, advanced graphics techniques were used to create photorealistic gestures, whereas in the second one the gestures rendered using simpler lighting techniques were post-processed by a modified Generative Adversarial Network. In order to avoid unrealistic modifications of the hand shape, a hand segmentation term has been added to the loss function of the GAN. The segmentation of the hand in images with complex background was done by the proposed ResNet34-based segmentation network. The finger-spelled signs were recognized by an ensemble consisting of fine-tuned VGG19 and ResNet50 neural networks and a ResNet convolutional neural network trained from scratch. Experimental results demonstrate that, thanks to a sufficient amount of training data, a high recognition rate can be attained on RGB images. We demonstrated experimentally that in a person-independent scenario, on a test subset with gestures expressed by four performers, a recognition rate close to 90% can be achieved using the proposed approach. Future work will include investigations on using the rendering techniques for data augmentation while training neural networks.
Acknowledgements
This work was supported by the Polish National Science Center (NCN) under research grant 2017/27/B/ST6/01743.
References

[1] M. Sagayam and J. Hemanth, "Hand posture and gesture recognition techniques for virtual reality applications: A survey," Virtual Reality, vol. 21, no. 2, pp. 91–107, 2017.
[2] F. Chen, Q. Zhong, F. Cannella, K. Sekiyama, and T. Fukuda, "Hand gesture modeling and recognition for human and robot interactive assembly using Hidden Markov Models," Int. J. of Advanced Robotic Systems, vol. 12, no. 4, p. 48, 2015.
[3] D. Raj, I. Gogul, M. Thangaraja, and V. Kumar, "Static gesture recognition based precise positioning of 5-DOF robotic arm using FPGA," in Trends in Industrial Measurement and Automation (TIMA), 2017, pp. 1–6.
[4] H. Liu and L. Wang, "Gesture recognition for human-robot collaboration: A review," Int. J. of Industrial Ergonomics, vol. 68, pp. 355–367, 2018.
[5] S. Patil, D. K. Dennis, C. Pabbaraju, R. Deshmukh, H. Simhadri, M. Varma, and P. Jain, "GesturePod: Programmable gesture recognition for augmenting assistive devices," Microsoft, Tech. Rep., May 2018.
[6] P. Pisharady and M. Saerbeck, "Recent methods and databases in vision-based hand gesture recognition," Comput. Vis. Image Underst., vol. 141, pp. 152–165, 2015.
[7] A. S. Al-Shamayleh, R. Ahmad, M. Abushariah, K. A. Alam, and N. Jomhari, "A systematic literature review on vision based gesture recognition techniques," Multimedia Tools and Applications, vol. 77, no. 21, pp. 28121–28184, 2018.
[8] O. Matei, P. C. Pop, and H. Vălean, "Optical character recognition in real environments using neural networks and k-nearest neighbor," Applied Intelligence, vol. 39, no. 4, pp. 739–748, 2013.
[9] O. Kopuklu, A. Gunduz, N. Kose, and G. Rigoll, "Online dynamic hand gesture recognition including efficiency analysis," IEEE Trans. on Biometrics, Behavior, and Identity Science, vol. 2, no. 2, pp. 85–97, 2020.
[10] O. Oyedotun and A. Khashman, "Deep learning in vision-based static hand gesture recognition," Neural Computing and Applications, pp. 1–11, 2016.
[11] A. Wadhawan and P. Kumar, "Sign language recognition systems: A decade systematic literature review," Archives of Computational Methods in Engineering, Dec. 2019.
[12] H. Zuo, H. Fan, E. Blasch, and H. Ling, "Combining convolutional and recurrent neural networks for human skin detection," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 289–293, 2017.
[13] J. Tompson, M. Stein, Y. LeCun, and K. Perlin, "Real-time continuous pose recovery of human hands using convolutional networks," ACM Trans. Graph., vol. 33, no. 5, 2014.
[14] J. Nagi, F. Ducatelle, et al., "Max-pooling convolutional neural networks for vision-based hand gesture recognition," in IEEE ICSIP, 2011, pp. 342–347.
[15] P. Barros, S. Magg, C. Weber, and S. Wermter, A Multichannel Convolutional Neural Network for Hand Posture Recognition. Springer, 2014, pp. 403–410.
[16] O. Koller, H. Ney, and R. Bowden, "Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled," in IEEE Conf. on Comp. Vision and Pattern Rec., 2016, pp. 3793–3802.
[17] S. F. Chevtchenko, R. F. Vale, V. Macario, and F. R. Cordeiro, "A convolutional neural network with feature fusion for real-time hand posture recognition," Applied Soft Computing, vol. 73, pp. 748–766, 2018.
[18] N. Pugeault and R. Bowden, "Spelling it out: Real-time ASL fingerspelling recognition," in IEEE Int. Conf. on Computer Vision Workshops, 2011, pp. 1114–1119.
[19] Y. Chuang, L. Chen, and G. Chen, "Saliency-guided improvement for hand posture detection and recognition," Neurocomputing, vol. 133, pp. 404–415, 2014.
[20] S. Bambach, S. Lee, D. J. Crandall, and C. Yu, "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions," in IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 1949–1957.
[21] A. U. Khan and A. Borji, "Analysis of hand segmentation in the wild," in IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2018, pp. 4710–4719.
[22] B. Kwolek and S. Sako, "Learning Siamese features for finger spelling recognition," in Advanced Concepts for Intelligent Vision Systems. Lecture Notes in Computer Science, vol. 10617, Springer, 2017, pp. 225–236.
[23] B. Kwolek, "GAN-based data augmentation for visual finger spelling recognition," in Eleventh Int. Conf. on Machine Vision (ICMV 2018), vol. 11041. SPIE, 2019, pp. 493–500.
[24] N. T. Nguen, S. Sako, and B. Kwolek, "Deep CNN-based recognition of JSL finger spelling," in Proc. Int. Conf. on Hybrid Artificial Intelligent Systems (HAIS), LNCS, vol. 11734. Springer, 2019, pp. 602–613.
[25] Y. Tabata and T. Kuroda, "Finger spelling recognition using distinctive features of hand shape," in Int. Conf. on Disability, Virtual Reality and Associated Technologies with Art Abilitation, 2008, pp. 287–292.
[26] L. Kane and P. Khanna, "A framework for live and cross platform fingerspelling recognition using modified shape matrix variants on depth silhouettes," Comput. Vis. Image Underst., vol. 141, pp. 138–151, 2015.
[27] Rosalina, L. Yusnita, N. Hadisukmana, R. B. Wahyu, R. Roestam, and Y. Wahyu, "Implementation of real-time static hand gesture recognition using artificial neural network," in Int. Conf. on Computer Appl. and Inf. Proc. Techn. (CAIPT), 2017, pp. 1–6.
[28] M. Asad and G. Slabaugh, "SPORE: Staged probabilistic regression for hand orientation inference," Computer Vision and Image Understanding, vol. 161, pp. 114–129, 2017.
[29] A. Y. Dawod, M. J. Nordin, and J. Abdullah, "Static fingerspelling recognition based on boundary tracing algorithm and chain code," in Int. Conf. on Intell. Systems, Metaheuristics & Swarm Intell. ACM, 2018, pp. 104–109.
[30] T. Kim, J. Keane, W. Wang, H. Tang, J. Riggle, G. Shakhnarovich, D. Brentari, and K. Livescu, "Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation," Computer Speech & Language, vol. 46, pp. 209–232, 2017.
[31] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, "Video-based sign language recognition without temporal segmentation," in AAAI, 2018.
[32] O. Koller, S. Zargaran, H. Ney, and R. Bowden, "Deep Sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs," Int. J. Comput. Vision, vol. 126, no. 12, pp. 1311–1325, 2018.
[33] N. Aloysius and M. Geetha, "Understanding vision-based continuous sign language recognition," Multimedia Tools and Applications, vol. 79, pp. 22177–22209, 2020.
[34] B. Shi, A. M. Del Rio, J. Keane, J. Michaux, D. Brentari, G. Shakhnarovich, and K. Livescu, "American Sign Language fingerspelling recognition in the wild," IEEE Spoken Language Technology Workshop (SLT), pp. 145–152, 2018.
[35] T. Igarashi, K. Nishino, and S. K. Nayar, "The appearance of human skin: A survey," Found. Trends. Comput. Graph. Vis., vol. 3, no. 1, pp. 1–95, 2007.
[36] M. Šarić, "LibHand: A library for hand articulation," 2011, version 0.9. [Online]. Available: http://www.libhand.org/
[37] Blender Online Community, Blender - a 3D modelling and rendering package, Blender Foundation, Stichting Blender Foundation, Amsterdam, 2020. [Online]. Available: http://www.blender.org
[38] W. Baczynski, "Hand pose recognition using 3D hand models," Master's Thesis, AGH Univ. of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Krakow, Poland, 2019.
[39] D. Vicini, V. Koltun, and W. Jakob, "A learned shape-adaptive subsurface scattering model," ACM Trans. Graph., vol. 38, no. 4, 2019.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. on Computer Vision and Pattern Rec. (CVPR), 2016, pp. 770–778.
[41] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. de Mori, and Y. Bengio, "Quaternion convolutional neural networks for end-to-end automatic speech recognition," in Interspeech. ISCA, 2018, pp. 22–26.
[42] C.-A. Popa, "Learning algorithms for quaternion-valued neural networks," Neural Process. Lett., vol. 47, no. 3, pp. 949–973, 2018.
[43] T. Nitta, "A quaternary version of the back-propagation algorithm," in Proc. of Int. Conf. on Neural Networks, vol. 5, 1995, pp. 2753–2756.
[44] X. Zhu, Y. Xu, H. Xu, and C. Chen, "Quaternion convolutional neural networks," in European Conf. on Computer Vision (ECCV). Springer, 2018, pp. 645–661.
[45] Z. Zhao, P. Zheng, S. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[46] C. Li and K. M. Kitani, "Pixel-level hand detection in ego-centric videos," in IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2013, pp. 3570–3577.
[47] S. Lee, S. Bambach, D. J. Crandall, J. M. Franchak, and C. Yu, "This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video," in IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 2014, pp. 557–564.
[48] A. Mittal, A. Zisserman, and P. Torr, "Hand detection using multiple proposals," in Proc. of the British Machine Vision Conf. BMVA Press, 2011, pp. 75.1–75.11.
[49] S. Narasimhaswamy, Z. Wei, Y. Wang, J. Zhang, and M. Hoai, "Contextual attention for hand detection in the wild," in Int. Conf. on Computer Vision (ICCV), 2019, pp. 9567–9576.
[50] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI. Springer, 2015, pp. 234–241.
[51] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. of the 27th Int. Conf. on Neural Information Processing Systems - Vol. 2. Cambridge, USA: MIT Press, 2014, pp. 2672–2680.
[52] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 2242–2251.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in ECCV. Springer, 2014, pp. 346–361.
[54] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study," J. Artif. Int. Res., vol. 11, no. 1, pp. 169–198, 1999.
[55] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). USA: IEEE Computer Society, 2012, pp. 2066–2073.
[56] H. Tang, H. Liu, W. Xiao, and N. Sebe, "Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion," Neurocomputing, vol. 331, pp. 424–433, 2019.
[57] P. Nakjai and T. Katanyukul, "Hand sign recognition for Thai Finger Spelling: an application of convolution neural network," J. of Signal Processing Systems, vol. 91, no. 2, pp. 131–146, 2019.
[58] K. M. Lim, A. W. C. Tan, C. P. Lee, and S. C. Tan, "Isolated sign language recognition using convolutional neural network hand modelling and hand energy image," Multimedia Tools and Applications, vol. 78, no. 14, pp. 19917–19944, 2019.
[59] Y. Huang and J. Yang, "A multi-scale descriptor for real time RGB-D hand gesture recognition," Pattern Recognition Letters, 2020.
[60] A. Sharma, N. Sharma, Y. Saxena, A. Singh, and D. Sadhya, "Benchmarking deep neural network approaches for Indian Sign Language recognition," Neural Computing and Applications, Oct. 2020.
[61] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, 2021.