Recognition of JSL Fingerspelling Using Deep
Convolutional Neural Networks
Bogdan Kwolek a,∗, Wojciech Baczynski a, Shinji Sako b
aAGH University of Science and Technology, 30 Mickiewicza Av., 30-059 Krakow, Poland
bNagoya Institute of Technology, Nagoya, Japan
Abstract
In this paper, we present an approach for the recognition of static fingerspelling in
Japanese Sign Language on RGB images. Two 3D articulated hand models have
been developed to generate synthetic fingerspellings and to extend a dataset
consisting of real hand gestures. In the first approach, advanced graphics
techniques were employed to rasterize photorealistic gestures using a skinned hand
model. In the second approach, gestures rendered using simpler lighting techniques
were post-processed by a modified Generative Adversarial Network. In
order to avoid the generation of unrealistic fingerspellings, a hand segmentation term
has been added to the loss function of the GAN. The segmentation of the hand
in images with complex background was done by the proposed ResNet34-based
segmentation network. The finger-spelled signs were recognized by an ensemble
of both fine-tuned and trained-from-scratch neural networks. Experimental
results demonstrate that, owing to a sufficient amount of training data, a high
recognition rate can be attained on RGB images. The JSL dataset with pixel-level
hand segmentations is available for download.
Keywords: Fingerspelling recognition, Generative Adversarial Networks,
semantic segmentation, U-Net, residual networks (ResNets)
∗Correspondence to: Department of Computer Science, AGH University of Science and
Technology, 30 Mickiewicza Av., Building D-17, 30-059 Krakow, Poland
Email address: bkw@agh.edu.pl (Bogdan Kwolek )
3.01.2021
1. Introduction
Hand detection, tracking, and recognition of fingerspellings are significant research
areas due to their high application potential in human-machine communication,
virtual reality [1], entertainment, robotics [2, 3, 4], medicine, and assistive
technologies for the handicapped and the elderly [5]. Communication by gesture is
one of the most intuitive and flexible ways to attain user-friendly man-machine
interaction. Although huge efforts have been undertaken by many research teams in
the last decade [6, 7], there are still challenges to be addressed in attaining
the recognition performance required by real-life applications, e.g. [8, 9].
Gesture recognition on images acquired by a single color camera is a very useful
yet complex task because of several difficulties, including occlusions, variations
in gesture expressions, and differences in hand anatomy and appearance.
In recent years, a number of approaches to the recognition of static gestures
on RGB images have been proposed [10, 11]. In a recently published work [12],
an end-to-end network for human skin detection on color images, obtained by
integrating recurrent neural layers into Fully Convolutional Neural Networks (FCNs),
has been proposed. Despite significant developments in learning deep Convolutional
Neural Networks (CNNs), a recent review on gesture recognition [10]
evokes only one noteworthy work, done by Tompson in collaboration with
LeCun et al. [13]. In an earlier work [14], a CNN capable of classifying six hand
gestures and controlling robots on the basis of colored gloves was proposed.
More recently, in [15] a CNN implemented in Theano and applied on the Nao
humanoid robot has been discussed. In [16], a CNN learned on one million
data samples to classify sign characters has been proposed. However, only a
subset of the dataset, i.e. 3361 frames manually labeled into 45 classes, has been
made publicly available. Recently, a method employing Gabor features, Zernike
moments, Hu moments, and contour-based descriptors to select features for further
combination by a fusion-based convolutional neural network (FFCNN) has
been introduced [17].
One of the obstacles that researchers and practitioners face in utilizing
deep CNNs on a larger scale is the lack of properly aligned datasets of sufficient
size [6], as well as a shortage of robust real-time hand detectors. A dataset
introduced in [18] consists of 65000 samples representing 24 classes. However,
the gestures were performed by only nine subjects. A dataset utilized in [19]
contains 2750 samples with complex background, which were performed by 40
subjects. However, this dataset has only ten classes. In a dataset [13] targeted
at hand pose recovery there are 72757 and 8252 frames in the training and
test sets, respectively. However, only two performers participated in the recordings,
in a scenario with three Kinect sensors (one frontal and two side). The EgoHands
dataset [20], which has pixel-level annotations for hands, with two participants in
each video interacting with each other, has recently been used in a work devoted
to hand segmentation [21]. From the above literature review it follows that, in
general, there is a lack of datasets with pixel-level annotations. Moreover, almost
all datasets lack separate training and testing subsets. As
shown in our work [22], the recognition rate in person-independent evaluation
protocols drops significantly in comparison to protocols in which data
of the same subjects appear in both the training and testing subsets. Moreover, as
demonstrated in [23, 24], synthetically generated hand images can improve the
classification performance. However, currently available datasets do not contain
data that would permit further development of methods for a more effective
use of synthetically generated data.
Fingerspelling is a loanword system for borrowing orthographic representations
of words from a spoken language into a sign language, for instance from English into
American Sign Language (ASL), or, as in this work, from Japanese into Japanese
Sign Language. It is a form, and frequently an integral part, of sign language,
where each sign corresponds to a letter of the alphabet. It is very often used
for proper names, brand names, and place names, as well as digits, which do not
have conventional lexical signs. Individuals generally spell their name when
they introduce themselves. There are many compelling reasons that make
fingerspelling an appealing area of research. Tabata and Kuroda [25] developed
a Stringlove system for recognizing hand shapes and fingerspelling in Japanese
Sign Language (JSL). It is built on a custom-made glove equipped with
sensors capturing finger features. The glove uses nine contact sensors and 24
inductcoders that jointly estimate features such as the adduction/abduction angles
of the fingers, thumb and wrist rotations, the joint flexion/extension of the fingers,
as well as the contact positions among the fingertips. More recently, in [26] a
modified kind of shape matrix for capturing the salience of fingerspelling postures
through precise sampling of contours and regions has been proposed. In
[22], recognition of JSL fingerspellings was done on the basis of embeddings
determined by multiple Siamese CNNs. A dataset consisting of real gestures has
been introduced in the discussed work. It contains 5311 training images and 579
test images of size 64×64.
Japanese Sign Language, also known under the acronym JSL, is the visual
sign language used in Japan. Like other sign languages, JSL comprises words, or
signs, and the grammar with which they are bound together. The Japanese
Sign Language syllabary is a system of manual kana utilized as part of JSL.
In general, fingerspelling is used mostly for foreign words and last names. The
JSL fingerspellings are performed with the five fingers of the hand and the direction
in which it points. For example, the signs 'na', 'ni', and 'ha' are all expressed with
the first two fingers of the hand extended straight, but for the sign 'na' the fingers
point down, for 'ni' across the body, and for 'ha' toward the partner or audience.
The signs for 'te' and 'ho' are both made with an open flat hand, but in 'te' the palm
faces the viewer, and in 'ho' it faces away. These and many other aspects make
recognition of JSL fingerspelling on the basis of a single RGB camera a difficult
task. Most fingerspellings are expressed through static postures, but some of
them are dynamic. In addition, dullness, half dullness, and long sounds
are represented by dynamic postures. In this study, we focused only on static
fingerspellings. There are 41 static fingerspellings in JSL.
In this work, we present a framework for the recognition of static fingerspelling in
Japanese Sign Language on RGB images. Two 3D articulated hand models have
been developed to generate synthetic fingerspellings and to extend the real hand
gestures of the JSL dataset [22] with photorealistic renderings of the
hand. In the first approach, advanced graphics techniques were employed to create
photorealistic gestures using a skinned hand model. In the second approach,
gestures rendered in advance using simple lighting techniques were further
post-processed by a modified Generative Adversarial Network (GAN). To avoid the
generation of unrealistic fingerspellings, a hand segmentation term has been added
to the loss function of the GAN. The segmentation of the hand in images with
complex background was done by the proposed ResNet34-based segmentation
network. The finger-spelled signs were recognized by an ensemble consisting of
a VGG-based neural network and two ResNet quaternion convolutional neural
networks. The contribution of this work is a framework for improving the recognition
of fingerspellings on RGB images by employing advanced techniques for
photorealistic hand image synthesis, including rendering techniques and
modifications of GANs for enhancing the photorealism of hand gestures. A large
dataset (eleven thousand images) with both pixel-level hand segmentations and
synthetically generated hand postures, for learning deep segmentation models as
well as deep neural networks for fingerspelling recognition, is proposed.
2. Relevant work
In [15], a multichannel CNN for hand posture recognition has been proposed.
A cubic kernel has been used to enhance features for posture classification. The
system has been evaluated on the Nao robot. In [27], a glove has been utilized in
order to provide a contour representation of gestures. A neural network trained
on a dataset consisting of one hundred images per gesture permits
achieving 90% classification accuracy. In [28], a staged probabilistic regressor
(SPORE) algorithm for the estimation of hand orientation from 2D monocular
images has been proposed. In the discussed approach, simultaneously learning hand
orientation and pose significantly increased the performance of pose classification
on 2D monocular images. In a recently published work [29] focusing on
the ASL, an approach for the detection and extraction of shape for static
fingerspelling recognition on the basis of boundary tracking and chain codes has
been proposed. On images of size 320×240 acquired by a webcam and images
collected from freely available resources, the recognition accuracy was 97.75%
and 96.48% for alphabet characters and numbers, respectively. Kim et al. [30]
determine a signer-dependent skin color model using manually annotated hand
regions for fingerspelling recognition. In a signer-dependent setting they achieve
up to about 92% letter accuracy, whereas in a multi-signer setting they achieve up
to 83% letter recognition accuracy. Huang et al. [31] utilize a Faster R-CNN-based
hand detector, trained on manually annotated hand bounding boxes, and
apply it to general sign language recognition. Convolutional neural network-based
features have demonstrated high potential in recent approaches [16, 32]. A
survey of recent achievements in vision-based continuous sign language recognition
is presented in [33]. The recently introduced Chicago Fingerspelling in the
Wild (ChicagoFSWild) dataset [34] contains 7304 fingerspelling sequences from
online videos.
3. Fingerspelling Modeling and Rendering
Rendering a realistic and accurate hand shape is not an easy task because
of the large variety of poses that the human hand can assume, as well as difficulties
in modeling skin appearance [35]. A commonly used approach for rendering the
hand shape in the requested poses is linear blend skinning (LBS). Starting from
the open source library LibHand v. 0.9 [36] and a 42-DOF skeleton, we developed
a new rendering API for LibHand that configures the hand in the requested
poses by a set of sliders, which can be manually moved using the mouse. It
uses the textured 3D model of LibHand and permits exporting the resulting
models into the md5 graphics format for further OpenGL-based rendering. For
each gesture we considered several allowable hand postures and orientations,
and different finger inter-distances, in order to synthesize a large number of training
images. Thirty-three students posed and articulated the 3D hand model to
render the gestures. As a result, four thousand (4018) synthetic images were
generated in order to balance the JSL dataset [22], and after additional
post-processing they were stored as a subset of the mentioned JSL dataset.
The resulting synthetic hand images are a quite faithful representation of the JSL
signs, see Fig. 1. However, due to the insufficient quality of the lighting algorithms
for human skin rendering available in standard OpenGL, and particularly
in order to increase the photorealism of the rendered gestures, the images were
further post-processed by Generative Adversarial Networks (GANs), outlined
in Sect. 6. Finally, a sub-dataset containing 4018 samples has been generated
and stored as another subset (called RHM) of the whole JSL dataset. There are
roughly 98 different samples for each of the 41 JSL gestures, with different
orientations and small variations in the 3D positions of the fingers with respect
to the location of the wrist.
Figure 1: Examples of rendered Hiragana signs using LibHand with our API for hand articulation
and gesture modeling.
In order to employ recent rendering and lighting developments, a custom
3D hand model has been prepared in Blender, a 3D computer graphics
software toolset [37]. The models have been designed using software version 2.80,
which introduced Eevee, a physically-based real-time renderer that complements
the Cycles rendering engine. The Cycles engine works by casting
rays of light from each pixel of the camera into the scene. They refract and
reflect, or get absorbed, until they either hit a light source or reach a predefined
bounce limit. Eevee is a real-time render engine with advanced features,
including Non-photorealistic Rendering (NPR). NPR is an active area of
computer graphics, which focuses on enabling a wide variety of expressive styles
for digital entertainment. The Eevee renderer has support for baked indirect
lighting, screen space ambient occlusion, screen space reflections, and other modern
commodities provided by current-generation graphics hardware. Given that
renderings with the Eevee engine can be performed in about one fifth of the
Cycles rendering time on the same hardware, the initial renderings were done
in the Eevee engine, whereas the final ones were done in the Cycles engine. A
hand model [24] consisting of 21 bones and five control bones has been extended
in this work to permit photorealistic hand gesture modeling and animation by
Python scripts. The bones in the skeleton form a structure of rigid bodies
connected by joints with one or more degrees of freedom. The articulated hand
model has 37 degrees of freedom (DOF). The 3D mesh consists of 528 vertices and
is composed of 1036 triangles [24]. The 26-element skeleton (armature) is bound
to this 3D mesh. The root joint is located in the hand's wrist.
Generation of synthetic training images is done by executing Python scripts
that employ the Blender graphics engine [38]. Basic configuration options of the
scripts include different camera viewpoints and different lighting. The model
can also be exported to the md5 data format to perform animations in external
programs. Figure 2 depicts the JSL sign 'ka' expressed by the
3D hand observed from different camera views.
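The enumeration of camera viewpoints around the hand can be sketched as follows. This is a hedged illustration in plain Python, not the authors' Blender scripts: the function name and the angle sets are our own, and the positions would be assigned to a Blender camera object in the actual pipeline.

```python
import math

def camera_viewpoints(radius, azimuths_deg, elevations_deg):
    """Generate camera positions on a sphere centered on the hand (placed
    at the origin) from lists of azimuth/elevation angles in degrees."""
    positions = []
    for el in elevations_deg:
        for az in azimuths_deg:
            el_r, az_r = math.radians(el), math.radians(az)
            x = radius * math.cos(el_r) * math.cos(az_r)
            y = radius * math.cos(el_r) * math.sin(az_r)
            z = radius * math.sin(el_r)
            positions.append((x, y, z))
    return positions

# e.g. three azimuths and two elevations give six camera views of the sign
views = camera_viewpoints(2.0, [0, 30, 60], [0, 20])
```

Each generated position would then be looped over in the rendering script, pointing the camera back at the origin before rendering a frame.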
Figure 2: Selected shots of JSL sign 'ka' for various camera views.
At the beginning, we investigated the widely used three-point lighting technique
for rendering realistic fingerspelling. However, it quickly turned out
that the three-point setup is not the main obstacle to achieving photorealistic
skin rendering, and that a significant improvement in the realism of the
rendered hands can be achieved by the use of Subsurface Scattering (SSS), which
simulates the transport of light through a translucent surface. In the discussed
technique, the light penetrates a material, in our case the human skin, and
internally scatters at irregular angles, resulting in more photorealistic human skin.
This is a very important issue in photorealistic hand rendering. Recently, a deep
learning approach to subsurface scattering has been proposed in [39]. Further
improvement of hand realism has been achieved by the use of the Filmic Blender
color palette. Figure 3 presents the effect of using this color palette in photorealistic
hand rendering.
Figure 3: Photorealistic hand rendering: regular texture (left), Filmic Blender color palette
(right). The images are stored in vector graphics format; better viewing can be obtained by
zooming in on this figure.
Finally, a simpler two-point lighting setup has been employed in the rendering of
the sub-dataset with realistic hand renderings. The main light source was an
Area-type lamp providing the surface light, which can be rotated around the
hand according to user needs. The second light source is a back light providing
chiaroscuro effects, which can be turned on or off through the Python
scripts. Figure 4 depicts sample images rendered using the discussed
techniques.
Figure 4: Selected shots of JSL sign ’ka’ for various lighting.
The 3D articulated hand model has been used to generate synthetic
fingerspellings and to extend our dataset consisting of real hand gestures. Twelve
different gesture realizations were prepared for each of the 41 signs. For each
realization we modeled a starting and a final hand posture, and rendered ten
images per realization: the two modeled postures and eight postures interpolated
between them. In this way the gestures differ in hand postures, expressing various
realizations of the gesture by different persons. Figure 5 depicts samples of the
rendered images for the sign 'ka' from JSL. In total, 5892 realistic hand gestures
were selected from the rendered dataset and then stored as a subset of the JSL
dataset.
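The interpolation between the starting and final postures can be sketched as a linear blend of joint-angle vectors. This is an illustrative reconstruction, not the authors' rendering script; the 37-element pose vector is a stand-in for the model's DOF values.

```python
def interpolate_poses(start, end, n_between=8):
    """Linearly interpolate joint-angle vectors between a starting and a
    final hand pose.  Returns the start pose, n_between intermediate poses,
    and the end pose (n_between + 2 frames in total)."""
    n_frames = n_between + 2
    ts = [i / (n_frames - 1) for i in range(n_frames)]
    return [[(1.0 - t) * s + t * e for s, e in zip(start, end)] for t in ts]

# a toy 37-DOF pose vector; with n_between=8 this yields the ten frames
# rendered per gesture realization
frames = interpolate_poses([0.0] * 37, [1.0] * 37, n_between=8)
```

In the actual pipeline each interpolated vector would be applied to the armature's bone rotations before rendering a frame.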
Figure 5: Example realizations of JSL gesture ’ka’ that were obtained in the photorealistic
hand rendering.
3.1. 3D-Model Based Dataset for JSL Recognition
The JSL dataset consists of 18029 images, of which the training subset
contains 16343 images and the testing subset 1686 images. The test subset
contains only real images, with gestures performed by four persons, including
three Japanese performers, who did not participate in the recording of the training
images. The training subset contains both real and synthesized images. The real
images are taken from the training subset of the former JSL dataset [22, 24]. The
images are of size 64×64 with uniform background. Thanks to the uniform
background, the hands can be delineated easily and then used to train models for
hand segmentation or for gesture classification on images with artificially inserted
complex backgrounds. As far as we know, this is currently the largest dataset
with pixel-level delineated hands. Moreover, until now rather simple rendering
techniques were applied in generating synthetic hands for training deep learning
models. The dataset has been stored in .mat files and can easily be imported
into Matlab and Python. The whole JSL dataset is freely available at:
http://home.agh.edu.pl/~bkw/data/neu2020.
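A minimal sketch of working with such .mat files from Python is given below. The variable keys used here (`X_train`, `y_train`) are hypothetical — the actual key names should be checked with `loadmat(...).keys()` after downloading the dataset; the demo writes and reads a tiny synthetic file only to illustrate the round trip.

```python
import numpy as np
from scipy.io import savemat, loadmat

# Hypothetical layout: uint8 RGB images of size 64x64 and labels in 0..40
# (41 static JSL fingerspellings).  Replace with the dataset's real keys.
demo = {
    "X_train": np.zeros((16, 64, 64, 3), dtype=np.uint8),
    "y_train": np.arange(16) % 41,
}
savemat("jsl_demo.mat", demo)

data = loadmat("jsl_demo.mat")
X = data["X_train"]
y = data["y_train"].ravel()   # loadmat returns 2D arrays; flatten labels
```

Note that `loadmat` always returns at least 2D arrays, hence the `ravel()` on the label vector.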
4. Neural Networks for Gesture Modeling and Recognition
At the beginning of this section we present residual neural networks. Afterwards,
we outline quaternion convolutional neural networks.
4.1. ResNet Convolutional Neural Networks
In [40], He et al. introduced residual networks (ResNets), which provide
an important contribution to training very deep neural networks. The residual
learning framework simplifies the training of neural networks and enables them
to be substantially deeper, which leads to improved performance. Residual
networks are much deeper in comparison to their ordinary counterparts, yet
they require a similar number of parameters. The main idea is to utilize blocks
that re-route the input and add it to the representation learned by the previous
layers. The constituent building block of the discussed architecture is the ResNet
unit. A deeper network can be built by simply repeating such a block, i.e. the
smaller sub-network. A desired underlying mapping H(x) can be approximated
by a few stacked nonlinear layers, so it can also be obtained through the underlying
mapping F(x) = H(x) − x. As a result, it is possible to reformulate it as
H(x) = F(x) + x, which comprises the residual function F(x) and the input x.
The connection of the input to the output is called a skip connection or identity
mapping. The central idea is that if multiple nonlinear layers can approximate
the complex function H(x), then they can equally well approximate the residual
function F(x). Thus, the stacked layers are not employed to fit H(x); instead,
these layers approximate the residual function F(x).
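The identity-mapping idea can be illustrated with a toy fully-connected residual unit in NumPy. This is a didactic sketch of H(x) = F(x) + x, not the architecture used in the paper.

```python
import numpy as np

def residual_block(x, W1, W2):
    """A minimal fully-connected residual unit: the stacked layers fit the
    residual F(x), and the skip connection restores H(x) = F(x) + x."""
    relu = lambda v: np.maximum(v, 0.0)
    f = W2 @ relu(W1 @ x)   # the residual function F(x)
    return relu(f + x)      # identity skip connection, then nonlinearity

# With zero weights the block reduces to the identity for non-negative x,
# which is what makes very deep stacks of such blocks easy to optimize:
# the layers only need to learn deviations from the identity.
x = np.array([1.0, 2.0, 3.0])
W = np.zeros((3, 3))
out = residual_block(x, W, W)
```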
4.2. Quaternion Convolutional Neural Network
Recently, in order to exploit internal dependencies within the features, a
quaternion convolutional neural network (QCNN) has been proposed [41]. Let
$\gamma^{l}_{ab}$ and $S^{l}_{ab}$ denote the quaternion output and the pre-activation
quaternion output at layer $l$ and at the indexes $(a, b)$ of the feature map, and
let $w$ be a quaternion-valued weight filter map of size $K \times K$. The convolution
can be expressed in the following manner:

$$\gamma^{l}_{ab} = \alpha(S^{l}_{ab}) \qquad (1)$$

where $S^{l}_{ab}$ is equal to:

$$S^{l}_{ab} = \sum_{c=0}^{K-1} \sum_{d=0}^{K-1} w^{l} \otimes \gamma^{l-1}_{(a+c)(b+d)} \qquad (2)$$

and $\alpha$ stands for the quaternion split activation function [42], defined as follows:

$$\alpha(Q) = f(r) + f(x)\,i + f(y)\,j + f(z)\,k \qquad (3)$$

where $f$ is any standard activation function applied component-wise to the
quaternion $Q = r + xi + yj + zk$. A derivation of the
backpropagation algorithm for quaternion neural networks can be found in [43].
Recently, in [44] a QCNN for color image processing has been proposed. In
the discussed approach the image is represented in the quaternion domain as a
quaternion matrix. The quaternion convolution provides scaling and rotation
of the input in color space, which yields a more structural representation of
color information [44], whereas the conventional real-valued convolution is only
capable of executing scaling transformations on the input. Because QCNNs
enforce an implicit regularizer on the network architecture, more complicated
relationships across the channels can be captured, which improves the training
of this kind of neural network.
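The Hamilton product ⊗ and the split activation of Eq. (3) can be sketched in NumPy as follows. This is an illustration of the underlying quaternion algebra only, not the QCNN implementation itself; here f is taken to be the ReLU.

```python
import numpy as np

def hamilton(p, q):
    """Hamilton product of quaternions p = (r, x, y, z) and q, the operation
    applied between weight and feature quaternions in Eq. (2)."""
    r1, x1, y1, z1 = p
    r2, x2, y2, z2 = q
    return np.array([
        r1*r2 - x1*x2 - y1*y2 - z1*z2,
        r1*x2 + x1*r2 + y1*z2 - z1*y2,
        r1*y2 - x1*z2 + y1*r2 + z1*x2,
        r1*z2 + x1*y2 - y1*x2 + z1*r2,
    ])

def split_relu(q):
    """Quaternion split activation: apply f (here ReLU) to each component."""
    return np.maximum(np.asarray(q, float), 0.0)

# sanity check of the algebra: i * j = k for the quaternion imaginary units
k = hamilton((0, 1, 0, 0), (0, 0, 1, 0))
```

Note that, unlike a real-valued product of channels, the Hamilton product mixes all four components, which is the source of the cross-channel coupling discussed above.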
5. Hand Segmentation
In recent years, considerable progress in object detection has been achieved
[45]. However, little work has been done in the area of hand detection. In [46],
a dataset consisting of 600 images acquired under various lighting conditions and
backgrounds has been proposed to highlight the advantages and shortcomings of
different methods for egocentric hand detection. Later, in [47] another approach
to detecting hands in social interactions in egocentric videos has been
demonstrated. However, only interactions in laboratory settings were considered. In
the already evoked work [20], Bambach et al. introduced a skin-based approach that
first determines a set of bounding boxes that may surround hand regions,
afterwards utilizes CNNs to detect hands, and finally executes GrabCut to segment
them. They also introduced the EgoHands dataset, consisting of 48 first-person
videos of people interacting in realistic environments, with pixel-level ground
truth for over 15000 hand instances. Our dataset contains 14875 images with
pixel-level ground truth and has the potential to fill a gap in hand detection
and segmentation in third-person images. In such third-person settings, [48]
used deformable part models and skin heuristics to detect hands. Recently, a
large dataset suitable for deep learning has been introduced in [49]. However,
the dataset mentioned above does not contain pixel-level annotations.
In order to reliably segment the hand on RGB images with complex background,
we designed an encoder-decoder neural network. In the proposed neural
network for hand segmentation on images with complex background, we employ
a deep CNN and add skip connections between the layers in the encoder and
the decoder. The encoder path is based on the 34-layer ResNet (ResNet34),
whereas the decoder path uses transpose 2D blocks to perform
2D upsampling. The parameters of each transpose 2D block are such that
the height and width are doubled, whereas the number of channels is halved,
see Fig. 6. There are three skip connections: the first connection is made
after the (3×3, 64; 3×3, 64)×3 ResNet blocks, the second one after the
(3×3, 128; 3×3, 128)×4 blocks, and the last one after the (3×3, 256; 3×3, 256)×6
blocks of the 34-layer ResNet. The feature maps delivered by these skip
connections from the encoder, i.e. the ResNet34 network, are summed with
feature maps extracted in the decoder path, which uses the transpose 2D blocks
to expand the dimensions of the convolved feature outputs. Such skip connections
between encoder layers and decoder layers were introduced in the U-Net neural
network [50], which is a symmetrical neural network with a 'U'-like shape.
Our segmentation network is not symmetrical, since the encoder path is based
on the deep ResNet34 network, whereas in the decoder path no residual
blocks are employed, see Fig. 6. In U-Net neural networks, a down-sampling
(contraction) path is utilized to extract and interpret the context (what), while
an up-sampling (expansion) path is used to enable precise localization (where).
Furthermore, in order to recover the fine-grained spatial information lost
in the pooling or down-sampling layers, skip connections between symmetrical
layers are employed in such encoder-decoder networks. By combining the location
information from the down-sampling path with the contextual information
from the up-sampling path, such networks permit obtaining general maps that
combine localization and context. Our ResNet34-based network has all the
features mentioned above, and additionally possesses extended capabilities for
feature extraction.
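The decoder step — doubling the spatial resolution, halving the channel count, and summing with the encoder skip-connection feature map — can be illustrated at shape level as follows. This sketch stands in for a learned transpose-convolution block: the random 1×1 projection is a placeholder for trained weights, and the layout assumed is channels-first.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample_block(x, n_out):
    """Stand-in for a transpose 2D block: double H and W, then project the
    channel dimension to n_out (here with a random 1x1 projection)."""
    c, h, w = x.shape
    up = x.repeat(2, axis=1).repeat(2, axis=2)   # nearest-neighbour x2
    proj = rng.standard_normal((n_out, c))       # placeholder for weights
    return np.einsum("oc,chw->ohw", proj, up)

# decoder step: upsample deep features, halve the channels, and add the
# encoder skip-connection feature map of matching shape
deep = rng.standard_normal((256, 8, 8))
skip = rng.standard_normal((128, 16, 16))
merged = upsample_block(deep, 128) + skip
```

The shape constraint is the point: the transpose block must produce exactly the resolution and channel count of the corresponding encoder map so the element-wise sum is defined.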
Figure 6: ResNet34-based network for hand segmentation.
6. Generative Adversarial Network for Photorealistic Fingerspelling Synthesis
Generative Adversarial Networks (GANs) utilize an adversarial discriminator
to align the distributions of real and generated data samples. In a two-player
minimax game, the generator G tries to generate samples on the basis of noise z
that fool the discriminator D, while D learns to maximize the probability of
assigning the correct class label to both the real data and the fake data G(z) [51].
In the optimal case, the generated samples would be indistinguishable from real
samples. Conventional image-to-image GANs require paired training data. The
recently proposed CycleGAN [52] utilizes unpaired training data thanks to a cycle
consistency loss function. CycleGAN is a general framework for learning, from
unaligned examples, the mapping functions between two domains X and Y. The
goal is to learn a mapping G: X → Y such that the distribution of data from
G(X) is indistinguishable from the distribution of data in Y according
to an adversarial loss. To achieve this, the authors proposed to also consider an
inverse mapping F: Y → X, as well as to employ a so-called cycle consistency
loss to prevent the learned mappings G and F from contradicting each other [52].
Given training samples $\{x_i\}_{i=1}^{N}$, $\{y_i\}_{i=1}^{M}$, where $x_i \in X$ and
$y_i \in Y$, with data distributions $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$, for
the mapping $G: X \rightarrow Y$ and discriminator $D_Y$ the objective function can
be expressed in the following manner:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)]
+ \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \qquad (4)$$

The generator $G$ minimizes it against the adversary $D_Y$ that tries to maximize it:
$\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y)$. For the mapping
$F: Y \rightarrow X$ and the discriminator $D_X$, the generator $F$ minimizes the
objective $\mathcal{L}_{GAN}(F, D_X, X, Y)$ against the adversary $D_X$ that tries
to maximize it: $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, X, Y)$.

The cycle consistency loss takes the following form [52]:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1]
+ \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1] \qquad (5)$$

The full loss function has the following form:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y)
+ \mathcal{L}_{GAN}(F, D_X, X, Y) + \gamma \mathcal{L}_{cyc}(G, F) \qquad (6)$$

where $\gamma$ balances the objectives. Cycle consistency means that the composition
of the two mappings is the identity mapping. The aim of the CycleGAN is
to find the generators:

$$G^{*}, F^{*} = \arg\min_{G,F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) \qquad (7)$$
In our work, we learned mappings from synthetic images to real images and from
real to synthetic images. The inputs were the cropped synthetic and real images
of the hand with uniform background.
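A Monte-Carlo estimate of the cycle consistency loss of Eq. (5) can be sketched as follows. The generators in the demonstration are toy invertible functions, not trained networks; with real generators the loss is nonzero and drives them toward mutual inverses.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, the ||.||_1 expectation term of Eq. (5)."""
    return np.mean(np.abs(a - b))

def cycle_loss(G, F, xs, ys):
    """Monte-Carlo estimate of L_cyc(G, F) over batches xs ~ X, ys ~ Y."""
    return (np.mean([l1(F(G(x)), x) for x in xs])
            + np.mean([l1(G(F(y)), y) for y in ys]))

# toy mappings: when G and F are exact inverses the cycle loss vanishes
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0
xs = [np.ones((4, 4)), np.zeros((4, 4))]
ys = [np.full((4, 4), 3.0)]
loss = cycle_loss(G, F, xs, ys)
```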
CycleGAN has been successfully applied in several image-to-image applications.
However, CycleGAN was not designed to maintain object
shapes well. In this work, we extend the CycleGAN by adding a segmentation
consistency loss to encourage shape alignment between images in the two domains
and to improve the accuracy at the hand boundaries. By incorporating an
additional geometric consistency loss that carries information about hand
shapes, we better maintain the hand shape and its pose during image
post-processing. The proposed loss term has the following form:
$$\mathcal{L}_{cycu}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|U(F(G(x))) - x_b\|_1]
+ \mathbb{E}_{y \sim p_{data}(y)}[\|U(G(F(y))) - y_b\|_1] \qquad (8)$$
where $U$ is the segmentation mapping performed by the ResNet34-based
segmentation unit trained in advance, whereas $x_b$ and $y_b$ denote the binary masks
of the hands. This means that the inputs to our network are the (cropped) synthetic
and real images of the hand with their respective silhouettes, i.e. binary
foreground masks. The discussed term has been multiplied by $\gamma$ and included
as an additional term in the loss function (6), i.e. in calculating $\mathcal{L}$.
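The segmentation consistency term of Eq. (8) can be sketched in the same Monte-Carlo spirit. The identity generators and the thresholding "segmenter" below are placeholders for the trained CycleGAN generators and the ResNet34-based unit U; they only serve to make the term computable in the demonstration.

```python
import numpy as np

def seg_consistency(U, G, F, xs, xbs, ys, ybs):
    """L_cycu of Eq. (8): the cycle-reconstructed images are segmented by U
    and compared (L1) against the ground-truth binary hand masks."""
    term_x = np.mean([np.mean(np.abs(U(F(G(x))) - xb))
                      for x, xb in zip(xs, xbs)])
    term_y = np.mean([np.mean(np.abs(U(G(F(y))) - yb))
                      for y, yb in zip(ys, ybs)])
    return term_x + term_y

# toy stand-ins: identity generators and a thresholding "segmenter"
G = F = lambda im: im
U = lambda im: (im > 0.5).astype(float)
xs = [np.array([[0.9, 0.1], [0.8, 0.2]])]
xbs = [np.array([[1.0, 0.0], [1.0, 0.0]])]   # matching binary hand masks
loss = seg_consistency(U, G, F, xs, xbs, xs, xbs)
```

Because U is fixed (trained in advance), the term penalizes the generators whenever the cycle reconstruction changes the hand silhouette, which is exactly the shape-preservation behaviour described above.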
7. Fingerspelling Recognition
7.1. Fingerspelling Recognition Using Neural Networks Trained from Scratch
We implemented a ResNet-based convolutional neural network consisting of
three ResNet blocks, see Fig. 7. Afterwards, we implemented a QCNN: after
extending the ResNet with a spatial pyramid pooling (SPP) layer [53],
we substituted the convolutional blocks with quaternion-based convolutional
blocks. The motivation behind using the SPP layer is its ability to
better represent the object at multiple scales and input sizes.
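The fixed-length property that motivates the SPP layer can be illustrated with a NumPy max-pooling pyramid. This is a sketch of the general SPP idea [53]; the pyramid levels {1, 2, 4} are chosen here for illustration and need not match the paper's configuration.

```python
import numpy as np

def spp(x, levels=(1, 2, 4)):
    """Spatial pyramid pooling: max-pool each channel over 1x1, 2x2 and 4x4
    grids and concatenate, giving a fixed-length vector for any input size."""
    c, h, w = x.shape
    feats = []
    for n in levels:
        row_bins = np.array_split(np.arange(h), n)
        col_bins = np.array_split(np.arange(w), n)
        for rows in row_bins:
            for cols in col_bins:
                feats.append(x[:, rows][:, :, cols].max(axis=(1, 2)))
    return np.concatenate(feats)

# output length is c * (1 + 4 + 16) = 21c regardless of the spatial size
a = spp(np.random.default_rng(0).standard_normal((8, 13, 17)))
b = spp(np.random.default_rng(1).standard_normal((8, 32, 32)))
```

Both calls above produce vectors of the same length even though the input resolutions differ, which is what allows the subsequent dense layers to accept variable-sized inputs.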
7.2. Fingerspelling Recognition Using Pre-trained Neural Networks
Figure 7: Flowchart of the ResNet used for JSL fingerspelling classification.

The output of the VGG-19 base CNN has been flattened, and then a dense layer consisting of 512 neurons with dropout=0.5, followed by an output layer with soft-max activation, has been added to this base network. The weights in the base model were frozen for the initial training of the network. Afterwards, the layers starting from the seventeenth one (block5) were set as trainable for fine-tuning the network. The output of the ResNet50 base model has been fed to a global average-pooling and a global max-pooling layer. The outputs have been concatenated and then fed to a batch normalization layer. They were next fed to a dense layer with 1024 neurons followed by batch normalization and dropout layers. Afterwards, a dense layer consisting of 512 neurons with dropout=0.5, followed by the output layer with soft-max activation, has been utilized in the model, similarly as in the VGG-19 network.
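The classification head described above can be sketched as a plain numpy forward pass: Dense(512, ReLU), inverted dropout with rate 0.5, and a softmax output. All weight shapes below are illustrative; the actual model is assembled in Keras on top of the frozen base network:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_forward(features, w1, b1, w2, b2, drop_rate=0.5, training=False):
    """Dense(512, relu) -> dropout(0.5) -> Dense(n_classes, softmax),
    applied to (flattened) features produced by the frozen base CNN."""
    h = np.maximum(features @ w1 + b1, 0.0)      # dense + ReLU
    if training:
        # inverted dropout: zero units at random, rescale the rest
        h *= rng.binomial(1, 1 - drop_rate, h.shape) / (1 - drop_rate)
    return softmax(h @ w2 + b2)                  # class probabilities
```

At inference time dropout is disabled (`training=False`), so the head is deterministic.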
7.3. Ensemble of CNNs
Models obtained on the basis of convolutional neural networks are nonlin-
ear. They are learned via optimization using stochastic training algorithms and
they are sensitive to the distribution of the training data. Thus, the optimizers375
find a different set of weights each time they are executed, which in turn leads
to unlike predictions. This means that predictions of neural networks usually
have a high variance. One of the successful approaches to reducing the vari-
ance of such predictions is to learn multiple neural network models instead of a
single model and to combine the predictions of these models. The ensemble of380
such independently trained models not only reduces the variance of predictions
but also produces final outputs that are better than predictions of any single
model. Every ensemble member contributes to the final output and individual
weaknesses are offset by the contribution the other members. Essentially, en-
sembles tend to yield better results when there is a significant diversity among385
18
the members [54], (called also base-learners). There are many different types
of ensembles. In a weighted average ensemble the decisions of ensemble mem-
bers are weighted on the basis of their performance on a hold-out validation
dataset. In a stacking-based ensemble the decisions of base-learners are taken
as input for training a meta-learner, that learns how to optimally combine the390
predictions of base-learners. At the beginning the selected neural networks are
learned using the available training data. Afterwards, a meta-learner is trained
to make a final prediction using the predictions of the trained networks. The
main difference between both methods is that in the weighted average ensem-
ble the weights are optimized and then used for weighting all outputs of the395
base-learners, and finally are taken to calculate the weighted average. This
means that no meta-learner is employed in such ensembles. In a stacking-based
ensemble, the meta-learner takes every single output of the base-learners as a
training instance and learns how to optimally map the base-learner decisions
into a better output decision. The meta-learner can be any classic machine400
learning algorithm.
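The weighted average ensemble described above can be sketched as follows; the coarse grid search over the hold-out set is one simple way to optimize the weights (our assumption, not necessarily the procedure used in the experiments):

```python
import numpy as np
from itertools import product

def weighted_average_ensemble(probs, weights):
    """Combine per-model class probabilities with normalized weights.
    probs: array of shape (n_models, n_samples, n_classes)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, probs, axes=1)   # shape (n_samples, n_classes)

def grid_search_weights(probs, labels, steps=5):
    """Pick the weight vector maximizing hold-out accuracy on a coarse grid."""
    best_w, best_acc = None, -1.0
    grid = np.linspace(0.0, 1.0, steps)
    for w in product(grid, repeat=probs.shape[0]):
        if sum(w) == 0.0:
            continue
        acc = (weighted_average_ensemble(probs, w).argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_acc, best_w = acc, np.asarray(w) / sum(w)
    return best_w, best_acc
```

A stacking ensemble would instead feed the stacked base-learner outputs into a trainable meta-learner rather than a fixed weighted average.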
8. Experimental Results and Discussion
Experimental evaluations have been performed on our JSL fingerspelling dataset, which was discussed in Subsection 3.1. All experiments were carried out on color RGB images of size 64 × 64. Altogether, 16343 images for training and 1686 test images were employed in the evaluations, which consisted in the recognition of 41 JSL static hand gestures. The recognition performance has been assessed in a person-independent (cross-person) scenario, wherein persons attending the recordings of the test data did not take part in the recordings of the training data. In training the GANs as well as training the neural networks for hand segmentation, the synthetic images were used both in the training and the testing of the models, whereas in the gesture recognition only real images were used in the evaluations of the learned models.
8.1. Evaluation of Hand Segmentation
In the first phase of experiments, we selected 1500 real images from the training subset and 500 synthetic images from the RHM sub-dataset. Given the binary hand shapes, we introduced a complex background into the hand images. We used randomly sampled patches from the Office-Caltech dataset [55]. The Office-Caltech dataset contains images of office objects from ten common categories shared by the Office-31 and Caltech-256 datasets. It is composed of ten classes: backpack, bike, calculator, headphones, keyboard, laptop, monitor, mouse, mug, and video-projector. In the next step of this phase of experiments, in order to show the potential of our ResNet34-based network for hand segmentation, we trained an ordinary U-Net. The contraction path of the U-Net is made of four contraction blocks, where each block takes an input map and applies two 3×3 convolution layers followed by a 2×2 max pooling. The number of kernels (feature maps) after each block doubles so that the architecture can learn complex structures effectively. The bottleneck part, which lies between the contracting and expanding paths, is simply built from two convolutional layers (with batch normalization) and dropout. It uses two 3×3 convolutional layers followed by a 2×2 up-convolution layer. Similarly to the contraction path, the expansion section also consists of four expansion blocks. Each block passes the input through two 3×3 convolutional layers followed by a 2×2 upsampling layer. After each block, the number of feature maps utilized by the convolutional layers is halved in order to obtain a symmetrical encoder-decoder network. The input of each block is concatenated with the feature maps of the corresponding contraction layer. After passing through the expansion blocks, the resulting mapping passes through another 3×3 convolutional layer with the number of feature maps equal to the number of desired segments. Figure 8 depicts selected images segmented by our segmentation network. As we can observe, our neural network segments the hands quite reliably in images with complex background. Our experiments demonstrated that the ordinary U-Net is capable of extracting the hands properly only in the case of both training and evaluation on images with uniform background.
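The shape bookkeeping of the contraction path described above (two 3×3 convolutions, a 2×2 max pooling, and feature maps doubling per block) can be sketched as follows; 'same' convolution padding is our assumption:

```python
def unet_encoder_shapes(size, base_filters, blocks=4):
    """Spatial size and channel count after each contraction block,
    assuming two 3x3 'same'-padded convolutions followed by a 2x2
    max pooling, with the number of feature maps doubling per block."""
    shapes, f, s = [], base_filters, size
    for _ in range(blocks):
        s //= 2                       # 2x2 max pooling halves spatial dims
        shapes.append((s, s, f))
        f *= 2                        # feature maps double in the next block
    return shapes
```

The expansion path mirrors this bookkeeping in reverse: spatial dimensions double and feature maps are halved at each expansion block.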
Figure 8: Masks of the segmented hands in images with complex background. Input images
(top row), masks of segmented hands (bottom row).
Table 1 presents Dice scores obtained on a dataset consisting of 200 randomly selected images from the test subset of the JSL dataset with the added complex background. The neural networks were trained on 2000 images with complex background for 50 epochs using the RMSprop optimizer. Afterwards, we created a training dataset of 2000 images, comprising both real and synthetic images, and trained our network to segment the hand during the training of the GANs with the proposed segmentation term in the loss function.
Table 1: Dice scores on the test sub-dataset.

network       U-Net   our
Dice score    0.954   0.984
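The Dice score reported in Table 1 measures the overlap between the predicted and ground-truth binary hand masks; it can be computed as below (the small epsilon guarding against empty masks is our addition):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient 2|A∩B| / (|A|+|B|) for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```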
8.2. Rendering Gestures and GAN-based Post-processing
At the beginning of the experiments we investigated various approaches to rendering the images representing the JSL gestures. Using our API for LibHand we modeled the gestures, exported the models representing the gestures to the md5 data format, and then used our parser to import the mesh and animation data for OpenGL-based rendering. The mesh data and animation data in this format are stored in separate files. One of the advantages of the md5 format is that data is stored in ASCII files and is human readable. Rotations are represented by quaternions. By modifying the values of the parameters stored in the plain text files, it is possible to configure the skeleton of the model into the required poses and then render the model. After parsing the skeleton and animation data, in every frame the 3D hand has been rotated by randomly generated angles to simulate observing the hand from different camera views. Then, we prepared a dataset consisting of real and synthetic images for training GANs in order to enhance the photorealism of the images rendered in this way. Figure 9 depicts example images post-processed by our GAN to improve the photorealism of the synthetically generated gestures. As we can observe, the photorealism of the synthetically generated images has been improved. In particular, light reflections, which were difficult to model, were added to the images. The results presented above were obtained on the basis of our GAN, which has been trained on 1080 synthetic and 1226 real images for 300 epochs, with batch size set to 14, using the Adam optimizer with lr=0.0002 and beta_1=0.5. On a TitanX GPU, the training of the GAN on images of size 128 × 128 took about twenty-four hours. The GAN generator trained in this way has then been used to post-process the synthetic images, which were generated on the basis of models rendered by OpenGL.
Figure 9: Examples of GAN-based post-processed images to improve photorealism of syn-
thetically generated gestures. Synthetic images (upper row), post-processed images (bottom
row).
It is worth noting that ordinary CycleGANs, i.e. without the segmentation component in the loss function, were unable to generate hands without artifacts and unrealistic deformations of the hand, see Fig. 10. Although some images with introduced modifications of the hand shape could potentially be useful, see the 1st image from the left, a considerable percentage of images is not rendered properly. Moreover, subtle shape differences are rendered properly by our network, compare post-processed images #2 - #4 in Fig. 9 and Fig. 10. One of the disadvantages of GAN-based data augmentation for visual fingerspelling recognition [23] is that a visual inspection of the data by a human is needed to eliminate poor gesture realizations. In contrast, the approach to data augmentation for visual fingerspelling recognition presented in this work is fully automatic, i.e. no human-in-the-loop is needed in the process of fingerspelling recognition. The discussed examples were obtained in the same number of epochs as used by the network achieving the results shown in Fig. 9. Somewhat better results can be achieved at the cost of a significantly larger number of epochs. The proposed modification of CycleGAN stabilizes the training of the GAN and permits achieving better hand shapes in the post-processing used for adding more photorealism to 3D model-based rendered fingerspellings.
Figure 10: Examples of GAN-based post-processed images without the segmentation term in
the loss function.
Afterwards, we rendered the hands in Blender without light effects. Finally, in order to obtain more photorealistic images, we illuminated the hands using virtual lights, together with the techniques discussed in Section 3, see Fig. 11. Gestures rendered in this way were included as a subset in the JSL dataset.
Figure 11: Example images of the '01 a' sign: no lighting (top row), with lighting (bottom row).
8.3. Fingerspelling Recognition
We experimented with various neural networks, both trained from scratch and fine-tuned deep CNNs. In the first stage of experiments we trained convolutional neural networks from scratch. The networks were initially pre-trained on the ImageNet dataset downsampled to 64 × 64 × 3. We implemented a ResNet-based convolutional neural network consisting of three ResNet blocks, which has been outlined in Subsection 7.1. The neural network has been trained on RGB images of size 64 × 64 × 3. Each model was trained using the Adam optimizer (lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08) and the categorical cross-entropy loss, with a small learning rate. The learning rate was scheduled to be reduced after 20, 30, 40 and 50 epochs. The values of the hyper-parameters were selected empirically. Afterwards, we trained the ResNet with the convolutional blocks substituted by the quaternion-based convolutional blocks. This neural network has been trained on RGB images of size 64 × 64 × 3 using the same Adam optimizer and parameters.
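The milestone-based learning rate schedule mentioned above can be sketched as a plain function of the epoch index, suitable for use with a Keras `LearningRateScheduler` callback; only the milestone epochs come from the text, while the reduction factor of 0.5 is our assumption:

```python
def scheduled_lr(epoch, base_lr=1e-3, milestones=(20, 30, 40, 50), factor=0.5):
    """Return the learning rate for the given epoch: the base rate is
    multiplied by `factor` once for every milestone already passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```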
In the next stage of experiments we focused on fingerspelling recognition using pre-trained neural networks. We trained the neural networks discussed in Subsection 7.2. As the VGG model expects input images of size 224 × 224 × 3, the images were resized to this size. We initially trained the networks for 30 epochs using the SGD optimizer with lr=1e-4 and momentum=0.9. Next, the neural networks have been fine-tuned for 30 epochs using the SGD optimizer with lr=1e-4 and momentum=0.9. We also experimented with other pre-trained CNNs, including ResNet34, MobileNet and Inception-ResNetV2, fine-tuned for fingerspelling recognition. However, their results were worse in comparison to the results achieved by the above-mentioned networks.
Finally, an ensemble consisting of VGG-19, ResNet50 and the ResNet with convolutional blocks substituted by quaternion-based convolutional blocks has been constructed. The models of the neural networks trained in advance have been loaded and then used to construct an ensemble of deep networks. The output of the ensemble is determined by voting. An MLP-based ensemble has also been trained and evaluated.
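Hard majority voting over the member predictions can be sketched as follows (ties resolve to the lowest class index, a detail of this sketch rather than of the paper):

```python
import numpy as np

def vote(predictions):
    """Hard majority vote over per-model class predictions.
    predictions: (n_models, n_samples) integer array of class indices."""
    n_classes = predictions.max() + 1
    # per-sample vote counts: shape (n_classes, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, predictions,
                                 minlength=n_classes)
    return counts.argmax(axis=0)   # winning class per sample
```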
Table 2 presents the classification performance obtained by the neural networks on the test subset of the JSL dataset. During the training of the neural networks, online data augmentation was performed. As we can observe, the synthetic images allow achieving far better results; a considerable improvement in classification accuracy has been obtained thanks to their use. The multi-model, voting-based ensemble improves the classification performance by about 1.5%, and its results are slightly better than those achieved by the stacking ensemble. Figure 12 depicts the confusion matrix obtained by the best single classifier, i.e. the ResNet50-based classifier.
Table 2: Classification performance in the performer-independent experiment.

            Accuracy  Precision  Recall  F1-score
no. rend.   0.671     0.678      0.671   0.670
ResNet18    0.815     0.829      0.815   0.814
VGG19       0.863     0.869      0.860   0.858
ResNet50    0.877     0.895      0.876   0.875
ensemble    0.892     0.906      0.904   0.904
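A minimal sketch of the online augmentation applied during training is shown below; the concrete transforms (small random shifts and brightness jitter) are our assumption, as the paper does not list them:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, max_shift=4):
    """Online augmentation sketch for a float image in [0, 1]:
    random circular shift of up to max_shift pixels plus brightness
    jitter. Applied independently on every training sample."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(image, (dy, dx), axis=(0, 1))          # spatial shift
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness
    return out
```

Note that horizontal flipping is deliberately omitted here, since mirroring a hand would change the semantics of a fingerspelled sign.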
Figure 12: Confusion matrix: each row represents the real class while each column represents the predicted class of gestures.

One of the major reasons for the insufficient classification performance on a few classes is strong inter-class similarity. The hand shapes of several JSL gestures are quite similar, which may explain the incorrect predictions of the classifier. As we can notice in the images shown in Fig. 13, the hand shapes in classes '06 ka' and '41 ra', as well as '04 e' and '11 sa', are quite alike. One of the reasons is that this is an ill-posed problem with inherent ambiguities. We investigated several approaches to improving the recognition performance, including rendering additional images for the classes with lower classification rates, synthesis of additional images on the basis of Generative Adversarial Networks (GANs), and evaluations using fine-tuned deep neural networks, e.g. MobileNet. However, none of the above-mentioned approaches was able to improve the experimental results presented above.
Figure 13: Example inter-class similarities, 06 ka – 41 ra and 04 e – 11 sa.
To validate the usefulness of the JSL dataset as well as the potential of the trained models in real scenarios, we performed experiments on image sequences. The training and testing data were created on the basis of image sequences acquired during the recording of the JSL dataset. All corresponding test images from the JSL dataset were included in the discussed dataset, and additionally we included the original images so that each sequence contained a minimum of ten images. The total number of test images is equal to 7250. In the same way, we prepared the training subset, which consisted of the same number of image sequences as the test subset, but the total number of images was equal to 7300. Additionally, we assumed that if an original image sequence was included in the training set, we did not omit from it the images already selected as part of the JSL dataset. The experiments were conducted on original RGB images of size 320 × 240. We compared the recognition accuracy achieved by the best performing neural network with the accuracies achieved by recent algorithms [56, 57, 58, 59, 60]. The hands were detected using OpenPose [61] and we processed only the hand performing the gesture. If in a compared algorithm OpenPose gave a better result than the original hand detector, we employed it instead of the authors' hand detector. Table 3 compares the results achieved by our best performing model with the results achieved by recent algorithms for isolated fingerspelling recognition on sequences of RGB images. The input shape of the CNN from [58] has been changed to 64 × 64 and we extended the network with an additional convolutional layer and a following pooling layer. As we can notice, our recognizer achieved superior results.
Table 3: Comparative recognition performance with the performance of recent algorithms for isolated fingerspelling recognition on sequences of RGB images.

Method                         Accuracy [%]
Keyframes [56]                 64.4
GEI CNN 64 × 64 [58]           39.6
CNN 64 × 64 [57]               74.6
Multi-scale descriptor [59]    64.8
VGG16 HOG [60]                 89.2
Our approach                   92.1
The neural networks for gesture recognition have been trained on a TitanX GPU with the batch size set to 64 and the number of epochs set to 100. The neural networks were implemented in Python using the TensorFlow/Keras frameworks.
9. Conclusions
In this paper we presented a framework for the recognition of static fingerspellings on RGB images. The recognition of hand gestures is performed by convolutional neural networks, which have been trained using both real and synthetic images. A few thousand synthetic images for training were generated on the basis of two skinned hand models. In the first approach, advanced graphics techniques were used to create photorealistic gestures, whereas in the second one the gestures rendered using simpler lighting techniques were post-processed by a modified Generative Adversarial Network. In order to avoid unrealistic modifications of the hand shape, a hand segmentation term has been added to the loss function of the GAN. The segmentation of the hand in images with complex background was done by the proposed ResNet34-based segmentation network. The finger-spelled signs were recognized by an ensemble consisting of fine-tuned VGG19 and ResNet50 neural networks and a ResNet convolutional neural network trained from scratch. Experimental results demonstrate that, thanks to a sufficient amount of training data, a high recognition rate can be attained on RGB images. We demonstrated experimentally that in a person-independent scenario, on a test subset with gestures expressed by four performers, a recognition rate close to 90% can be achieved using the proposed approach. Future work will include investigations on using the rendering techniques for data augmentation while training neural networks.
Acknowledgements
This work was supported by the Polish National Science Center (NCN) under research grant 2017/27/B/ST6/01743.
References

[1] M. Sagayam and J. Hemanth, "Hand posture and gesture recognition techniques for virtual reality applications: A survey," Virtual Reality, vol. 21, no. 2, pp. 91–107, 2017.
[2] F. Chen, Q. Zhong, F. Cannella, K. Sekiyama, and T. Fukuda, "Hand gesture modeling and recognition for human and robot interactive assembly using Hidden Markov Models," Int. J. of Advanced Robotic Systems, vol. 12, no. 4, p. 48, 2015.
[3] D. Raj, I. Gogul, M. Thangaraja, and V. Kumar, "Static gesture recognition based precise positioning of 5-DOF robotic arm using FPGA," in Trends in Industrial Measurement and Automation (TIMA), 2017, pp. 1–6.
[4] H. Liu and L. Wang, "Gesture recognition for human-robot collaboration: A review," Int. J. of Industrial Ergonomics, vol. 68, pp. 355–367, 2018.
[5] S. Patil, D. K. Dennis, C. Pabbaraju, R. Deshmukh, H. Simhadri, M. Varma, and P. Jain, "GesturePod: Programmable gesture recognition for augmenting assistive devices," Microsoft, Tech. Rep., May 2018.
[6] P. Pisharady and M. Saerbeck, "Recent methods and databases in vision-based hand gesture recognition," Comput. Vis. Image Underst., vol. 141, pp. 152–165, 2015.
[7] A. S. Al-Shamayleh, R. Ahmad, M. Abushariah, K. A. Alam, and N. Jomhari, "A systematic literature review on vision based gesture recognition techniques," Multimedia Tools and Applications, vol. 77, no. 21, pp. 28121–28184, 2018.
[8] O. Matei, P. C. Pop, and H. Vălean, "Optical character recognition in real environments using neural networks and k-nearest neighbor," Applied Intelligence, vol. 39, no. 4, pp. 739–748, 2013.
[9] O. Kopuklu, A. Gunduz, N. Kose, and G. Rigoll, "Online dynamic hand gesture recognition including efficiency analysis," IEEE Trans. on Biometrics, Behavior, and Identity Science, vol. 2, no. 2, pp. 85–97, 2020.
[10] O. Oyedotun and A. Khashman, "Deep learning in vision-based static hand gesture recognition," Neural Computing and Applications, pp. 1–11, 2016.
[11] A. Wadhawan and P. Kumar, "Sign language recognition systems: A decade systematic literature review," Archives of Computational Methods in Engineering, Dec. 2019.
[12] H. Zuo, H. Fan, E. Blasch, and H. Ling, "Combining convolutional and recurrent neural networks for human skin detection," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 289–293, 2017.
[13] J. Tompson, M. Stein, Y. LeCun, and K. Perlin, "Real-time continuous pose recovery of human hands using convolutional networks," ACM Trans. Graph., vol. 33, no. 5, 2014.
[14] J. Nagi, F. Ducatelle, et al., "Max-pooling convolutional neural networks for vision-based hand gesture recognition," in IEEE ICSIP, 2011, pp. 342–347.
[15] P. Barros, S. Magg, C. Weber, and S. Wermter, A Multichannel Convolutional Neural Network for Hand Posture Recognition. Springer, 2014, pp. 403–410.
[16] O. Koller, H. Ney, and R. Bowden, "Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled," in IEEE Conf. on Comp. Vision and Pattern Rec., 2016, pp. 3793–3802.
[17] S. F. Chevtchenko, R. F. Vale, V. Macario, and F. R. Cordeiro, "A convolutional neural network with feature fusion for real-time hand posture recognition," Applied Soft Computing, vol. 73, pp. 748–766, 2018.
[18] N. Pugeault and R. Bowden, "Spelling it out: Real-time ASL fingerspelling recognition," in IEEE Int. Conf. on Computer Vision Workshops, 2011, pp. 1114–1119.
[19] Y. Chuang, L. Chen, and G. Chen, "Saliency-guided improvement for hand posture detection and recognition," Neurocomputing, vol. 133, pp. 404–415, 2014.
[20] S. Bambach, S. Lee, D. J. Crandall, and C. Yu, "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions," in IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 1949–1957.
[21] A. U. Khan and A. Borji, "Analysis of hand segmentation in the wild," in IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2018, pp. 4710–4719.
[22] B. Kwolek and S. Sako, "Learning Siamese features for finger spelling recognition," in Advanced Concepts for Intelligent Vision Systems. Lecture Notes in Computer Science, vol. 10617, Springer, 2017, pp. 225–236.
[23] B. Kwolek, "GAN-based data augmentation for visual finger spelling recognition," in Eleventh Int. Conf. on Machine Vision (ICMV 2018), vol. 11041. SPIE, 2019, pp. 493–500.
[24] N. T. Nguen, S. Sako, and B. Kwolek, "Deep CNN-based recognition of JSL finger spelling," in Proc. Int. Conf. on Hybrid Artificial Intelligent Systems (HAIS), LNCS, vol. 11734. Springer, 2019, pp. 602–613.
[25] Y. Tabata and T. Kuroda, "Finger spelling recognition using distinctive features of hand shape," in Int. Conf. on Disability, Virtual Reality and Associated Technologies with Art Abilitation, 2008, pp. 287–292.
[26] L. Kane and P. Khanna, "A framework for live and cross platform fingerspelling recognition using modified shape matrix variants on depth silhouettes," Comput. Vis. Image Underst., vol. 141, pp. 138–151, 2015.
[27] Rosalina, L. Yusnita, N. Hadisukmana, R. B. Wahyu, R. Roestam, and Y. Wahyu, "Implementation of real-time static hand gesture recognition using artificial neural network," in Int. Conf. on Computer Appl. and Inf. Proc. Techn. (CAIPT), 2017, pp. 1–6.
[28] M. Asad and G. Slabaugh, "SPORE: Staged probabilistic regression for hand orientation inference," Computer Vision and Image Understanding, vol. 161, pp. 114–129, 2017.
[29] A. Y. Dawod, M. J. Nordin, and J. Abdullah, "Static fingerspelling recognition based on boundary tracing algorithm and chain code," in Int. Conf. on Intell. Systems, Metaheuristics & Swarm Intell. ACM, 2018, pp. 104–109.
[30] T. Kim, J. Keane, W. Wang, H. Tang, J. Riggle, G. Shakhnarovich, D. Brentari, and K. Livescu, "Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation," Computer Speech & Language, vol. 46, pp. 209–232, 2017.
[31] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, "Video-based sign language recognition without temporal segmentation," in AAAI, 2018.
[32] O. Koller, S. Zargaran, H. Ney, and R. Bowden, "Deep Sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs," Int. J. Comput. Vision, vol. 126, no. 12, pp. 1311–1325, 2018.
[33] N. Aloysius and M. Geetha, "Understanding vision-based continuous sign language recognition," Multimedia Tools and Applications, vol. 79, pp. 22177–22209, 2020.
[34] B. Shi, A. M. Del Rio, J. Keane, J. Michaux, D. Brentari, G. Shakhnarovich, and K. Livescu, "American Sign Language fingerspelling recognition in the wild," IEEE Spoken Language Technology Workshop (SLT), pp. 145–152, 2018.
[35] T. Igarashi, K. Nishino, and S. K. Nayar, "The appearance of human skin: A survey," Found. Trends. Comput. Graph. Vis., vol. 3, no. 1, pp. 1–95, 2007.
[36] M. Šarić, "LibHand: A library for hand articulation," 2011, version 0.9. [Online]. Available: http://www.libhand.org/
[37] Blender Online Community, Blender - a 3D modelling and rendering package, Blender Foundation, Stichting Blender Foundation, Amsterdam, 2020. [Online]. Available: http://www.blender.org
[38] W. Baczynski, "Hand pose recognition using 3D hand models," Master's Thesis, AGH Univ. of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Krakow, Poland, 2019.
[39] D. Vicini, V. Koltun, and W. Jakob, "A learned shape-adaptive subsurface scattering model," ACM Trans. Graph., vol. 38, no. 4, 2019.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. on Computer Vision and Pattern Rec. (CVPR), 2016, pp. 770–778.
[41] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. de Mori, and Y. Bengio, "Quaternion convolutional neural networks for end-to-end automatic speech recognition," in Interspeech. ISCA, 2018, pp. 22–26.
[42] C.-A. Popa, "Learning algorithms for quaternion-valued neural networks," Neural Process. Lett., vol. 47, no. 3, pp. 949–973, 2018.
[43] T. Nitta, "A quaternary version of the back-propagation algorithm," in Proc. of Int. Conf. on Neural Networks, vol. 5, 1995, pp. 2753–2756.
[44] X. Zhu, Y. Xu, H. Xu, and C. Chen, "Quaternion convolutional neural networks," in European Conf. on Computer Vision (ECCV). Springer, 2018, pp. 645–661.
[45] Z. Zhao, P. Zheng, S. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[46] C. Li and K. M. Kitani, "Pixel-level hand detection in ego-centric videos," in IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2013, pp. 3570–3577.
[47] S. Lee, S. Bambach, D. J. Crandall, J. M. Franchak, and C. Yu, "This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video," in IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 2014, pp. 557–564.
[48] A. Mittal, A. Zisserman, and P. Torr, "Hand detection using multiple proposals," in Proc. of the British Machine Vision Conf. BMVA Press, 2011, pp. 75.1–75.11.
[49] S. Narasimhaswamy, Z. Wei, Y. Wang, J. Zhang, and M. Hoai, "Contextual attention for hand detection in the wild," in Int. Conf. on Computer Vision (ICCV), 2019, pp. 9567–9576.
[50] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI. Springer, 2015, pp. 234–241.
[51] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. of the 27th Int. Conf. on Neural Information Processing Systems - Vol. 2. Cambridge, USA: MIT Press, 2014, pp. 2672–2680.
[52] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 2242–2251.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in ECCV. Springer, 2014, pp. 346–361.
[54] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study," J. Artif. Int. Res., vol. 11, no. 1, pp. 169–198, 1999.
[55] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). USA: IEEE Computer Society, 2012, pp. 2066–2073.
[56] H. Tang, H. Liu, W. Xiao, and N. Sebe, "Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion," Neurocomputing, vol. 331, pp. 424–433, 2019.
[57] P. Nakjai and T. Katanyukul, "Hand sign recognition for Thai Finger Spelling: an application of convolution neural network," J. of Signal Processing Systems, vol. 91, no. 2, pp. 131–146, 2019.
[58] K. M. Lim, A. W. C. Tan, C. P. Lee, and S. C. Tan, "Isolated sign language recognition using convolutional neural network hand modelling and hand energy image," Multimedia Tools and Applications, vol. 78, no. 14, pp. 19917–19944, 2019.
[59] Y. Huang and J. Yang, "A multi-scale descriptor for real time RGB-D hand gesture recognition," Pattern Recognition Letters, 2020.
[60] A. Sharma, N. Sharma, Y. Saxena, A. Singh, and D. Sadhya, "Benchmarking deep neural network approaches for Indian Sign Language recognition," Neural Computing and Applications, Oct. 2020.
[61] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, 2021.