Recognizing American Sign Language Manual Signs from RGB-D
Videos
Longlong Jing · Elahe Vahdani · Matt Huenerfauth · Yingli Tian
Abstract In this paper, we propose a 3D Convolutional Neu-
ral Network (3DCNN) based multi-stream framework to rec-
ognize American Sign Language (ASL) manual signs (con-
sisting of movements of the hands, as well as non-manual
face movements in some cases) in real-time from RGB-D
videos, by fusing multimodality features including hand ges-
tures, facial expressions, and body poses from multi-channels
(RGB, depth, motion, and skeleton joints). To learn the over-
all temporal dynamics in a video, a proxy video is generated
by selecting a subset of frames for each video which are then
used to train the proposed 3DCNN model. We collect a new
ASL dataset, ASL-100-RGBD, which contains 42 RGB-
D videos captured by a Microsoft Kinect V2 camera, each
of 100 ASL manual signs, including RGB channel, depth
maps, skeleton joints, face features, and HDface. The dataset
is fully annotated for each semantic region (i.e. the time du-
ration of each word that the human signer performs). Our
proposed method achieves 92 .88 %accuracy for recogniz-
ing 100 ASL words in our newly collected ASL-100-RGBD
dataset. The effectiveness of our framework for recogniz-
ing hand gestures from RGB-D videos is further demon-
strated on the Chalearn IsoGD dataset and achieves 76% accuracy, which is 5.51% higher than the state-of-the-art work in terms of average fusion, by using only 5 channels instead of 12 channels in the previous work.

L. Jing and E. Vahdani
Department of Computer Science, The Graduate Center, The City University of New York, NY, 10016.
E-mail: {ljing, evahdani}@gradcenter.cuny.edu
Equal Contribution

M. Huenerfauth
Golisano College of Computing and Information Sciences, the Rochester Institute of Technology (RIT), Rochester, NY, USA.
E-mail: matt.huenerfauth@rit.edu

Y. Tian
Department of Electrical Engineering, The City College, and the Department of Computer Science, the Graduate Center, the City University of New York, NY, 10031.
E-mail: ytian@ccny.cuny.edu
Corresponding Author
Keywords American Sign Language Recognition ·
Hand Gesture Recognition ·RGB-D Video Analysis ·
Multimodality ·3D Convolutional Neural Networks ·Proxy
Video.
1 Introduction
The focus of our research is to develop a real-time system
that can automatically identify ASL manual signs (individ-
ual words, which consist of movements of the hands, as well
as facial expression changes) from RGB-D videos. How-
ever, our broader goal is to create useful technologies that
would support ASL education, which would utilize this tech-
nology for identifying ASL signs and provide ASL students
immediate feedback about whether their signing is fluent or
not.
There are more than one hundred sign languages used
around the world, and ASL is used throughout the U.S. and
Canada, as well as other regions of the world, including
West Africa and Southeast Asia. Within the U.S.A., about
28 million people today are Deaf or Hard-of-Hearing (DHH)
[1]. There are approximately 500,000 people who use ASL
as a primary language [59], and since there are significant
linguistic differences between English and ASL, it is possi-
ble to be fluent in one language but not in the other.
In addition to the many members of the Deaf community
who may prefer to communicate in ASL, there are many
individuals who seek to learn the language. Due to a va-
riety of educational factors and childhood language expo-
sure, researchers have measured lower levels of English lit-
eracy among many deaf adults in the U.S. [79]. Studies have
shown that deaf children raised in homes with exposure to
ASL have better literacy as adults, but it can be challenging
for parents, teachers, and other adults in the life of a deaf
child to rapidly gain fluency in ASL. The study of ASL as a
foreign language in universities has significantly increased
by 16.4% from 2006 to 2009, which ranked ASL as the
4th most studied language at colleges [23]. Thus, there are many individuals who would benefit from a flexible way to practice their ASL signing skills, and our research investigates
technologies for recognizing signs performed in color and
depth videos, as discussed in [30].
While the development of user-interfaces for educational
software was described in our prior work [30], this article
instead focuses on the development and evaluation of our
ASL recognition technologies, which underlie our educa-
tional tool. Beyond this specific application, technology to
automatically recognize ASL signs from videos could en-
able new communication and accessibility technologies for
people who are DHH, which may allow these users to in-
put information into computing systems by performing sign
language or may serve as a foundation for future research on
machine translation technologies for sign languages.
The rest of this article is structured as follows: Section
1.1 provides a summary of relevant ASL linguistic details,
and Section 1.2 motivates and defines the scope of our con-
tributions. Section 2 surveys related work in ASL recogni-
tion, gesture recognition in videos, and some video-based
ASL corpora (collections of linguistically labeled video record-
ings). Section 3 describes our framework for ASL recogni-
tion, Section 4 describes the new dataset of 100 ASL words
captured by a RGBD camera which is used in this work,
and Section 5 presents the experiments to evaluate our ASL
recognition model and the extension of our framework for
Chalearn IsoGD dataset. Finally, Section 6 summarizes the
proposed work.
1.1 ASL Linguistics Background
ASL is a natural language conveyed through movements and
poses of the hands, body, head, eyes, and face [80]. Most
ASL signs consist of the hands moving, pausing, and chang-
ing orientation in space. Individual ASL signs (words) con-
sist of a sequence of several phonological segments, which
include:
– An essential parameter of a sign is the configuration of the hand, i.e. the degree to which each of the finger joints is bent, which is commonly referred to as the “handshape.” In ASL, there are approximately 86 handshapes which are commonly used [62], and the hand may transit between handshapes during the production of a single sign.
– During an ASL sign, the signer’s hands will occupy specific locations and will perform movement through space. Some signs are performed by a single hand, but most are performed using both of the signer’s hands, which move through the space in front of their head and torso. During two-handed signs, the two hands may have symmetrical movements, or the signer’s dominant hand (e.g. the right hand of a right-handed person) will have greater movements than the non-dominant hand.
– The orientation of the palm of the hand in 3D space is also a meaningful aspect of an ASL sign, and this parameter may differentiate pairs of otherwise identical signs.
– Some signs co-occur with specific “non-manual signals,” which are generally facial expressions that are characterized by specific eyebrow movement, head tilt/turn, or head movement (e.g., forward-backward relative to the torso).

Fig. 1 Example images of lexical facial expressions along with hand gestures for the signs NEVER, WHO, and WHAT. For NEVER, the signer shakes her head side-to-side slightly, which is a Negative facial expression in ASL. For WHO and WHAT, the signer is furrowing the brows and slightly tilting the head forward, which is a WH-word Question facial expression in ASL.
As discussed in [60], facial expressions in ASL are most
commonly utilized to convey information about entire sen-
tences or phrases, and these classes of facial expressions
are commonly referred to as “syntactic facial expressions.”
While some researchers, e.g. [57], have investigated the iden-
tification of facial expressions that extend across multiple
words to indicate grammatical information, in this paper,
we describe our work on recognizing manual signs which
consist of movements of the hands and facial expression
changes.
In addition to “syntactic” facial expressions that extend
across multiple words in an ASL sentence, there exists an-
other category of facial expressions, which is specifically
relevant to the task of recognizing individual signs: “lexi-
cal facial expressions,” which are considered as a part of
the production of an individual ASL word (see examples in
Fig. 1). Such facial expressions are therefore essential for
the task of sign recognition. For instance, words with neg-
ative semantic polarity, e.g. NONE or NEVER, tend to oc-
cur with a negative facial expression consisting of a slight
head shake and nose wrinkle. In addition, there are specific
ASL signs that almost always occur in a context in which
a specific ASL syntactic facial expression occurs. For in-
stance, some question words, e.g. WHO or WHAT, tend to
co-occur with a syntactic facial expression (brows furrowed,
head tilted forward), which indicates that an entire sentence
is a WH-word Question. Thus, the occurrence of such a fa-
cial expression may be useful evidence to consider when
building a sign-recognition system for such words.
1.2 Motivations and Scope of Contributions
As discussed in Section 2.1, most prior ASL recognition re-
search typically focuses on isolated hand gestures of a re-
stricted vocabulary. In this paper, we propose a 3D multi-
stream framework to recognize a set of grammatical ASL
words in real-time from RGB-D videos, by fusing multi-
modality features including hand gestures, facial expressions,
and body poses from multi-channels (RGB, depth, motion,
and skeleton joints). In an extension to our previous work
[89] and [86], the main contributions of the proposed frame-
work can be summarized as follows:
– We propose a 3D multi-stream framework, using 3D convolutional neural networks, for ASL recognition in RGB-D videos by fusing multiple channels including RGB, depth, motion, and skeleton joints.
– We propose a random temporal augmentation strategy to augment the training data, in order to handle widely diverse videos in relatively small datasets.
– We create a new ASL dataset, ASL-100-RGBD, including multiple modalities (facial movements, hand gestures, and body pose) and multiple channels (RGB, depth, skeleton joints, and HDface), by collaborating with ASL linguistic researchers; this dataset contains annotation of the time duration when each ASL word is performed by the human in the video. The dataset will be released to the public with the publication of this article.
– We further evaluate the proposed framework on recognizing hand gestures on the Chalearn LAP IsoGD dataset [81], which consists of 249 gesture classes in RGB-D videos. The accuracy of our framework is 5.51% higher than the state-of-the-art work in terms of average fusion, using fewer channels (5 channels instead of 12).
2 Related Work
2.1 RGB-D Based ASL Recognition
Sign language (SL) recognition has been studied for three
decades since the first attempt to recognize Japanese SL by
Tamura and Kawasaki in 1988 [77]. The existing SL recog-
nition research can be classified as sensor-based methods in-
cluding data gloves and body trackers to capture and track
hand and body motions [20, 35, 45, 50] and non-intrusive
camera-based methods by applying computer vision tech-
nologies [9, 10, 13, 14, 24, 29, 39, 42–44, 53, 66–69, 75, 83,
85]. While much research in this area focuses on the hands,
there is also some research focusing on linguistic informa-
tion conveyed by the face and head of a human performing
sign language, such as [5, 48, 51, 57]. More details about SL
recognition can be found in these survey papers [19,64].
As cost-effective consumer depth cameras have become
available in recent years, such as RGB-D cameras of Mi-
crosoft Kinect V2 [4], Intel Realsense [2], Orbbec Astra
[3], it has become practical to capture high resolution RGB
videos and depth maps as well as to track a set of skeleton
joints in real time. Compared to traditional 2D RGB images,
RGB-D images provide both photometric and geometric in-
formation. Therefore, recent research work has been moti-
vated to investigate ASL recognition using both RGB and
depth information [6,8,12,32,70,72,84,86, 88,89]. In this ar-
ticle, we briefly summarize ASL recognition methods based
on RGB-D images or videos.
Some early work of SL recognition based on RGB-D
cameras only focused on a very small number of signs from
static images [40, 70, 72]. Pugeault and Bowden proposed a multi-class random forest classification method to recognize 24 static ASL fingerspelling alphabet letters, ignoring the letters j and z (as they involve motion), by combining
both appearance and depth information of handshapes cap-
tured by a Kinect camera [70]. Keskin et al. [40] recognized
24 static handshapes of the ASL alphabet, based on scale in-
variant features extracted from depth images, and then fed
to a Randomized Decision Forest for classification at the
pixel level, where the final recognition label was voted based
on a majority. Ren et al. proposed a modified Finger-Earth Mover's Distance metric to recognize static handshapes for
10 digits captured using a Kinect camera [72].
While these systems only used the static RGB and depth
images, some studies employed the RGB-D videos for ASL
recognition. Zafrulla et al. developed a hidden Markov model
(HMM) to recognize 19 ASL signs collected by a Kinect and
compared the performance with that from colored-glove and
accelerometer sensors [88]. For the Kinect data, they also
compared the system performance between the signer seated
and standing and found that higher accuracy resulted when
the users were standing. Yang developed a hierarchical con-
ditional random field method to recognize 24 manual ASL
signs (seven one-handed and 17 two-handed) from the hand-
shape and motion in RGB-D videos [84]. Lang et al. [49]
presented a HMM framework to recognize 25 signs of Ger-
man Sign Language using depth-camera specific features.
Mehrotra et al. [56] employed a support vector machine
(SVM) classifier to recognize 37 signs of Indian Sign Lan-
guage based on 3D skeleton points captured using a Kinect
camera. Almeida et al. [6] also employed a SVM classifier to
recognize 34 signs of Brazilian Sign Language using hand-
shape, movement and position captured by a Kinect. Jiang
et al. proposed to recognize 34 signs of Chinese Sign Lan-
guage based on the color images and the skeleton joints cap-
tured by a Kinect camera [32]. Recently, Kumar et al. [47]
combined a Kinect camera with a Leap Motion sensor to
recognize 50 signs of Indian Sign Language.
As discussed above, SL consists of hand gestures, facial
expressions, and body poses. However, most existing work
has focused only on hand gestures without combining with
facial expressions and body poses. While a few attempted
to combine hand and face [5, 41, 48, 57, 68, 83], they only
use RGB videos. To the best of our knowledge, we believe
that this is the first work that combines multi-channel RGB-
D videos (RGB and depth) with fusion of multi-modality
features (hand, face, and body) for ASL recognition.
2.2 CNN for Action and Hand Gesture Recognition
Since the work of AlexNet [46] which makes use of the pow-
erful computation ability of GPUs, deep neural networks
(DNNs) have enjoyed a renaissance in various areas of com-
puter vision, such as image classification [17,76], object de-
tection [25,28], image description [16,36], and others. Many
efforts have been made to extend CNNs from the image
to the video domain [21], which is more challenging since
video data are much larger than images; therefore, handling
video data in the limited GPU memory is not tractable. An
intuitive way to extend image-based CNN structures to the
video domain is to perform the fine-tuning and classifica-
tion process on each frame independently, and then conduct
a later fusion, such as average scoring, to predict the action
class of the video [37]. To incorporate temporal information
in the video, [73] introduced a two-stream framework. One
stream was based on RGB images, and the other, on stacked
optical flows. Although that work proposed an innovative
way to learn temporal information using a CNN structure,
in essence, it was still image-based, since the third dimen-
sion of stacked optical flows collapsed immediately after the
first convolutional layer.
To model the sequential information of extracted fea-
tures from different segments of a video, [16] and [87] pro-
posed to input features into Recurrent Neural Network (RNN)
structures, and they achieved good results for action recog-
nition. The former emphasized pooling strategies and how
to fuse different features, while the latter focused on how
to train an end-to-end DNN structure that integrates CNNs
with RNNs. These networks mainly use CNN to extract spa-
tial features, then RNN is applied to extract the temporal in-
formation of the spatial features. 3DCNN was recently pro-
posed to learn the spatio-temporal features with 3D convolu-
tion operations [15,27,31,33,34,71,78], and has been widely
used in video analysis tasks such as video caption and ac-
tion detection. 3DCNN is usually trained with fixed-length
clips (usually 16 frames [27, 78]), and later fusion is per-
formed to obtain the final category of the entire video. Hara
et al. [27] proposed the 3D-ResNet by replacing all the 2D
kernels in 2D-ResNet with 3D convolution operations. With
its advantage of avoiding gradient vanishing and explosion,
the 3D-ResNet outperforms many complex networks.
ASL recognition shares properties with video action recognition; therefore, many networks designed for video action recognition have been applied to this task. Pigou et al. proposed temporal resid-
ual networks for gesture and sign language recognition [68]
and temporal convolutions on top of the features extracted
by 2DCNN for gesture recognition [67]. Huang et al. pro-
posed a Hierarchical Attention Network with Latent Space
(LS-HAN) which eliminates the pre-processing of the tem-
poral segmentation [29]. Pu et al. proposed to employ 3D
residual convolutional network (3D-ResNet) to extract vi-
sual features which are then fed to a stacked dilated convo-
lution network with connectionist temporal classification to
map the visual features into text sentence [69]. Camgoz et
al. attempted to generate spoken language translations from
sign language video [10]. Camgoz et al. proposed SubUNets
for simultaneous hand shape and continuous sign language
recognition [9]. Cui et al. proposed a weakly supervised
framework to train the network from videos with ordered
gloss labels but no exact temporal locations for continuous
sign language recognition [14]. In prior work, our research
team proposed a 3D-FCRNN for ASL recognition by com-
bining the 3DCNN and a fully connected RNN [86].
2.3 Public Camera-based ASL Datasets
As discussed in Section 2.1, technology to recognize ASL
signs from videos could enable new educational tools or as-
sistive technologies for people who are DHH, and there has
been significant prior research on sign language recognition.
However, a limiting factor for much of this research has been
the scarcity of video recordings of sign language that have
been annotated with time interval labels of the words that the
human has performed in the video: For ASL, there have been
some annotated video-based datasets [63] or collections of
motion capture recordings of humans wearing special sen-
sors [54]. Most publicly available datasets, e.g. [22,41], con-
tain general ASL vocabularies from RGB videos and a few
with RGB-D channels.
2D Camera-based ASL databases: The American Sign
Language Linguistic Research Project (ASLLRP) dataset con-
tains video clips of signing from the front and side and in-
cludes a close-up view of the face [63], with annotations
for 19 short narratives (1,002 utterances) and 885 additional
elicited utterances from four Deaf native ASL signers; an-
notation includes: the start and endpoints of each sign, a
unique gloss label for each sign, part of speech, and start
and end points of a range of non-manual behaviors (e.g.,
raised/lowered eyebrows, head position and periodic head
movements, expressions of the nose and mouth) also labeled
with respect to the linguistic information that they convey
Recognizing American Sign Language Manual Signs from RGB-D Videos 5
(b) Randomly sampled eight frames from the video clip of the same ASL sign
(a) Eight Consecutive frames from a video clip of an ASL sign
Fig. 2 Generating representative proxy video by our proposed random temporal augmentation. (a) Eight consecutive frames from a video clip of
an ASL sign. (b) Randomly sampled eight frames from the video clip of the same ASL sign. With the same number of frames, the proxy video
reserves more temporal dynamics of the ASL sign.
(serving to mark, e.g., different sentence types, topics, nega-
tion, etc.). Dreuw et al. [18] produced several subsets from
the ASLLRP dataset as benchmark databases for automatic
recognition of isolated and continuous sign language.
The American Sign Language Lexicon Video Dataset
(ASLLVD) [7] is a large dataset of videos of isolated signs
from ASL. It contains video sequences of about 3,000 dis-
tinct signs, each produced by 1 to 6 native ASL signers
recorded by four cameras under three views: front, side, and
face region, along with annotations of those sequences, in-
cluding start/end frames and class label (i.e., gloss-based
identification) of every sign, as well as hand and face lo-
cations at every frame.
The RVL-SLLL ASL Database [55] consists of three sets
of ASL videos with distinct motion patterns, distinct hand-
shapes, and structured sentences respectively. These videos
were captured from 14 native ASL signers (184 videos per
signer) under different lighting conditions. For annotation,
the videos with distinct motion patterns or distinct hand-
shapes are saved as separate clips. However, there are no detailed annotations for the videos of structured sentences, which limits the usefulness of the database.
RGB-D Camera-based ASL and Gesture Databases:
Recently, some RGB-D databases have been collected for
hand gesture and SL recognition, for ASL or other sign lan-
guages [12,22,66]. Here we only briefly summarize RGB-D
databases for ASL.
The ”Spelling-It-Out” dataset consists of 24 static hand-
shapes of the ASL fingerspelling alphabet, ignoring the let-
ters j and z as they involve motion, from four signers; each
signer repeats 500 samples for each letter in front of a Kinect
camera [70]. The NTU dataset consists of 10 static hand ges-
tures for digits 1 to 10 and was collected from 10 subjects by
a Kinect camera. Each subject performs 10 different poses
with variations in hand orientation, scale, articulation for the
same gesture, and there is a color image and the correspond-
ing depth map for each [72].
The Chalearn LAP IsoGD dataset [81] is a large-scale
hand gesture RGB-D dataset, which is derived from Chalearn
Gesture dataset (CGD 2011) [26]. This dataset consists of
47,933 RGB-D video clips falling into 249 classes of hand
gestures including mudras (Hindu/ Buddhist hand gestures),
Chinese numbers, and diving signals. Although it is not about
ASL recognition, it can be used to learn RGB-D features
from different environment settings. Using the learned fea-
tures as a pretrained model, the fine-tuned ASL recognition
model will be more robust to handle different backgrounds
and scales (e.g. distance variations between Kinect camera
and the signer).
To support our research, we have collected and anno-
tated a novel RGB-D ASL dataset, ASL-100-RGBD, de-
scribed in Section 4, with the following properties:
– 100 ASL signs have been collected, performed by 15 individual signers (often with multiple recordings from each signer).
– The ASL-100-RGBD dataset has been captured using a Kinect V2 camera and contains multiple channels including RGB, depth, skeleton joints, and HDface.
– Each video consists of the 100 ASL words with time-coded annotations produced in collaboration with ASL computational linguistic researchers.
– The 100 ASL words have been strategically selected to support sign recognition technology for ASL education tools (many of these words consist of hand gestures and facial expression changes), with the detailed vocabulary composition described in Section 4.
[Fig. 3 diagram: after proxy video generation, four 3DCNN streams (RGB Body Network, Depth Body Network, RGB Hands Network, RGB Face Network), each composed of Conv 1, Blocks 1–5, GAP, and a Dense (100) layer, are fused to produce the final prediction.]
Fig. 3 The pipeline of the proposed multi-channel multi-modality 3DCNN framework for ASL recognition. The multiple channels contain RGB,
Depth, and Optical flow while the multiple modalities include hand gestures, facial expressions and body poses. While the full size image is used
to represent body pose, to better model hand gestures and the facial expression changes, the regions of hands and face are obtained from the
RGB image based on the location guided by skeleton joints. The whole framework consists of two main components: proxy video generation and
3DCNN modeling. First, proxy videos are generated for each ASL sign by selecting a subset of frames spanning the whole video clip of each ASL
sign, to represent the overall temporal dynamics. Then the generated proxy videos of RGB, Depth, Optical flow, RGB of hands, and RGB of face
are fed into the multi-stream 3DCNN component. The predictions of these networks are weighted to obtain the final results of ASL recognition.
3 The Proposed Method for ASL Recognition
The pipeline of our proposed method is illustrated in Fig. 3.
There are two main components in the framework: random
temporal augmentation to generate proxy videos (which are
representative of the overall temporal dynamics of the video clip
of an ASL sign) and 3DCNN to recognize the class label of
the sign.
3.1 Random Temporal Augmentation for Proxy Video
Generation
The performance of the deep neural network greatly depends
on the amount of the training data. Large-scale training data
and different data augmentation techniques usually are needed
for deep networks to avoid over-fitting. During training, dif-
ferent kinds of data augmentation techniques, such as ran-
dom resizing and random cropping of images, are already
widely applied in 3DCNN training. In order to capture the
overall temporal dynamics, we apply a random temporal
augmentation, to generate a proxy video for each sign video
clip channel, by selecting a subset of frames, which has proved
to be very effective for our proposed framework.
Videos are often redundant in the temporal dimension,
and some consecutive frames are very similar without ob-
servable difference, as shown in Fig. 2 (a), which displays 8 consecutive frames of a video clip of an ASL sign, while the proxy video in Fig. 2 (b) displays the 8 frames selected from the same video clip by random temporal augmentation. With
temporal dynamics. Thus, proxy videos are generated to rep-
resent the overall temporal dynamics for each ASL word.
The process of proxy video generation by randomly sam-
pling is formulated in Eq. (1) below:
S_i = random(⌊N/T⌋) + ⌊N/T⌋ · i,    (1)

where N is the number of frames of a sign video, T is the number of frames sampled from the video, S_i is the i-th sampled frame, and random(⌊N/T⌋) generates one random number in the range [0, ⌊N/T⌋] for every i. To generate the proxy video, each video is uniformly divided into T intervals, and one frame is randomly sampled from every interval. If the total number of frames in a video is less than T, it is padded with the last frame to the length of T. These proxy videos make it feasible to train a deep neural network on the proposed dataset.

Fig. 4 The full list of the 100 ASL words in our “ASL-100-RGBD” dataset under 6 semantic categories. These ASL words have been strategically selected to support sign recognition technology for ASL education tools (many of these words consist of both hand gestures and facial expression changes).
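To make Eq. (1) concrete, the following is a minimal Python/NumPy sketch of the random temporal sampling; the function name and the default of T = 16 frames are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def sample_proxy_indices(num_frames: int, target_len: int = 16) -> np.ndarray:
    """Sample one random frame index from each of `target_len` uniform intervals
    of a video (Eq. 1); pad with the last frame if the video is too short."""
    if num_frames < target_len:
        indices = np.arange(num_frames)
        padding = np.full(target_len - num_frames, num_frames - 1)
        return np.concatenate([indices, padding])
    interval = num_frames // target_len                        # floor(N / T)
    offsets = np.random.randint(0, interval, size=target_len)  # random offset per interval
    return np.arange(target_len) * interval + offsets          # S_i = offset_i + floor(N/T) * i
```

Because a new set of indices is drawn every time the function is called, a different proxy video can be generated for the same clip at every training epoch.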
3.2 3D Convolutional Neural Network
3DCNN was first proposed for video action recognition [31],
and was improved in C3D [78] by using a similar archi-
tecture to VGG [74]. It obtained the state-of-the-art perfor-
mance for several video recognition tasks. The difference
between the 2DCNN and 3DCNN operation is that 3DCNN
has an extra temporal dimension, which can capture the spa-
tial and temporal information between video frames more
effectively.
After the emergence of C3D, many 3DCNN models were
proposed for video action recognition [11], [15], [71]. 3D-ResNet is the 3D version of ResNet, which introduced identity mappings to avoid gradient vanishing and explosion, making the training of very deep convolutional neural networks feasible. Compared to 2D-ResNet, the size of the convolution kernel in 3D-ResNet is w × h × t (where w is the width, h is the height, and t is the temporal dimension of the kernel), while it is w × h in 2D-ResNet. In this paper, 3D-ResNet is chosen as the base network for ASL recognition.
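As a concrete illustration (a minimal sketch assuming a PyTorch implementation, not the exact architecture used in our experiments), the snippet below shows a basic 3D residual block; the only structural difference from its 2D counterpart is the extra temporal dimension t of each convolution kernel:

```python
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """A basic 3D residual block: two 3x3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, t, h, w); each kernel spans w x h x t.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity mapping (shortcut connection)
```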
In order to handle the three important elements of ASL
recognition (hand gesture, facial expression, and body pose),
a hybrid framework is designed including two 3DCNN net-
works: one for full body, to capture the full body move-
ments including hands and face with the inputs of the multi-
channel proxy videos generated from the full images includ-
ing RGB, depth, and optical flow; and another for hand and face, to capture the details of the hands and face, with the inputs of the multi-channel proxy videos generated from the cropped regions of the left hand, right hand, and face. Note that for the Hand-Face network, RGB and depth channels are employed for the hand regions. The optical flow is not employed
since it cannot accurately track the quick and large hand
motions. For the face regions, only the RGB channel is employed, since facial expressions produce relatively little change in depth. The prediction results of the networks are weighted
to obtain the final prediction of each ASL sign.
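A minimal sketch of this late-fusion step is shown below; the stream names and the equal weights are illustrative assumptions rather than the exact values used in our experiments:

```python
import numpy as np

def fuse_predictions(stream_scores: dict, stream_weights: dict) -> int:
    """Weighted average of per-stream class scores (e.g., softmax outputs),
    returning the index of the predicted ASL sign."""
    total = sum(stream_weights.values())
    fused = sum(stream_weights[name] * np.asarray(scores)
                for name, scores in stream_scores.items()) / total
    return int(np.argmax(fused))

# Illustrative usage with hypothetical stream names and equal weights.
scores = {"rgb_body": np.random.rand(100),
          "depth_body": np.random.rand(100),
          "rgb_hands": np.random.rand(100),
          "rgb_face": np.random.rand(100)}
weights = {name: 1.0 for name in scores}
predicted_class = fuse_predictions(scores, weights)
```

In practice, the per-stream weights can be tuned on a validation set so that more reliable streams contribute more to the final decision.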
The optical flow images are calculated by stacking the
x-component, the y-component, and the magnitude of the
flow. Each value in the image is then rescaled to the range [0, 255].
This practice has yielded good performance in other studies
[16, 87]. As observed in the experimental results, by fusing
all the features generated by RGB, optical flow, and depth
images, the performance can be improved, which indicates
that complementary information is provided by different
channels in training deep neural networks.
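A minimal sketch of this flow-to-image conversion is given below, assuming OpenCV's Farnebäck dense optical flow and per-channel min-max rescaling (both assumptions for illustration):

```python
import cv2
import numpy as np

def flow_to_image(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Stack the x-component, y-component, and magnitude of dense optical flow
    into a 3-band uint8 image rescaled to [0, 255]."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)
    bands = []
    for channel in (fx, fy, mag):
        c_min, c_max = channel.min(), channel.max()
        scaled = (channel - c_min) / (c_max - c_min + 1e-8) * 255.0
        bands.append(scaled.astype(np.uint8))
    return np.stack(bands, axis=-1)  # shape (H, W, 3)
```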
4 Proposed ASL Dataset: “ASL-100-RGBD”
As mentioned in Section 2.3, a new dataset has been col-
lected for this research in collaboration with ASL computa-
tional linguistic researchers, from native ASL signers (indi-
viduals who have been using the language since very early
childhood) who performed a word list of 100 ASL signs
(See the full list of ASL words in Fig. 4) by using a Kinect
V2 camera. Participants responded affirmatively to the fol-
lowing screening question: Did you use ASL at home grow-
ing up, or did you attend a school as a very young child
where you used ASL? Participants were provided with a
slide-show presentation that asked them to perform a se-
quence of 100 individual ASL signs, without lowering their
hands between words. Since this new dataset includes 100
signs with RGB and depth data, we refer to it as the “ASL-
100-RGBD” dataset.
During the recording session, a native ASL signer met
the participant and conducted the session: prior research in
ASL computational linguistics has emphasized the impor-
tance of having only native signers present when recording
ASL videos so that the signer does not produce English-
influenced signing [54]. Several videos were recorded of
each of the 15 people, while they signed the 100 ASL signs.
Typically three videos were recorded from each person, to
produce a total collection of 42 videos (each video contains
all the 100 signs) and 4,200 samples of ASL signs.
To facilitate this collection process, we have developed
a recording system based on Kinect 2.0 RGB-D camera to
capture multimodality (facial expressions, hand gestures, and
body poses) from multiple channels of information (RGB
video and depth video) for ASL recognition. The recordings
also include skeleton and HDface information. The video
resolution is 1920 × 1080 pixels for the RGB channel and 512 × 424 pixels for the depth channel, respectively.
The 100 ASL signs in this collection were selected to
strategically support research on sign recognition for ASL
education applications, and the words were chosen based on
vocabulary that is traditionally included in introductory ASL
courses. Specifically, as discussed in [30], our recognition
system must identify a subset of ASL words that relate to a
list of errors often made by students who are learning ASL.
Our proposed educational tool [30] would receive as input a
video of a student who is performing ASL sentences, and
the system would automatically identify whether the stu-
dent’s performance may include one of several dozen errors,
which are common among students learning ASL. As part of
the operation of this system, we require a sign-recognition
component that can identify when a video of a person in-
cludes any of these 100 words during the video (and the
time period of the video when this word occurs). When one of these 100 key words is identified, the system will
consider other properties of the signer’s movements [30], to
determine whether the signer may have made a mistake in
their signing.
For instance, the 100 ASL signs include words related
to questions (e.g. WHO, WHAT), time-phrases (e.g. TO-
DAY, YESTERDAY), negation (e.g. NOT, NEVER), and
other categories of words that relate to key grammar rules
of ASL. A full listing of the words included in this dataset
appears in Fig. 4. Note that there is no one-to-one map-
ping between English words and ASL signs, and some ASL
signs have variations in their performance, e.g. due to ge-
ographic/regional differences or other factors. For this rea-
son, some words in Fig. 4 appear with integers after their
name, e.g. THURSDAY and THURSDAY2, to reflect more
than one variation in how the ASL word may be produced.
For instance, THURSDAY indicates a sign produced by the
signer’s dominant hand in the ”H” alphabet-letter handshape gently circling in space, whereas THURSDAY2 indicates
a sign produced with the signer’s dominant hand quickly
switching from the alphabet-letter handshape of ”T” to ”H”
while held in space in front of the torso. Both are commonly
used ASL signs for the concept of ”Thursday”; they simply
represent two different ASL words that could be used for the
same concept.
As shown in Fig. 4, the words are grouped into 6 semantic categories (Negative, WH Questions, Yes/No Questions,
Time, Pointing, and Conditional), which in some cases sug-
gest particular facial expressions that are likely to co-occur
with these words when they are used in ASL sentences. For
instance, time-related phrases that appear at the beginning
of ASL sentences tend to co-occur with a specific facial ex-
pression (head tilted back slightly and to the side, with eye-
brows raised). Additional details about how detecting words
in these various categories would be useful in the context of
educational software appear in [30].
After the videos were collected from participants, the
videos were analyzed by a team of ASL linguists, who pro-
duced time-coded annotations for each video. The linguists
used a coding scheme in which an English identifier label was used to correspond to each of the ASL words used in the videos, in a consistent manner across the videos. For example, all of the time spans in the videos when the human performed the ASL word “NOT” were labeled with the English string ”NOT” in our linguistic annotation.

Fig. 5 Four sample frames of each channel of an ASL sign from our dataset including RGB, skeleton joints (25 joints for every frame), depth map, basic face features (5 main face components), and HDFace (1,347 points).
Fig. 5 demonstrates several frames of each channel of an
ASL sign from our dataset including RGB, skeleton joints
(25 joints for every frame), depth map, basic face features
(5 main face components), and HDFace (1,347 points). With
the publication of this article, the ASL-100-RGBD dataset (see footnote 1) will
be released to the research community.
5 Experiments and Discussions
In this section, extensive experiments are conducted to eval-
uate the proposed approach on the newly collected “ASL-
100-RGBD” dataset and the Chalearn LAP IsoGD dataset
[81].
1 Some example videos can be found at our research website
http://media-lab.ccny.cuny.edu/wordpress/datecode/
5.1 Implementation Details
The same 3D-ResNet architecture is employed for all experi-
ments. Different channels and modalities are fed to the net-
work as input. The input channels are RGB, Depth, RG-
Bflow (i.e. Optical flow of RGB images), and Depthflow
(i.e. Optical flow of depth images) of modalities including
hands, face, and full body. The fusion of different channels is studied and compared.
Our proposed model is trained in PyTorch on four Titan
X GPUs. To avoid over-fitting, the pretrained model from
Kinetics or Chalearn dataset is employed and the follow-
ing data augmentation techniques were used: random cropping (using a patch size of 112 × 112) and random rotation (with a random number of degrees in the range [−10, 10]). The models are then fine-tuned for 50 epochs with an initial learning rate of λ = 3 × 10⁻³, reduced by a factor of 10 after every 25 epochs.
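A minimal PyTorch sketch of this fine-tuning schedule follows; the backbone definition, the use of SGD, and the momentum value are illustrative assumptions:

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical 3D backbone; in practice, load the 3D-ResNet pretrained on
# Kinetics or Chalearn and replace its final layer with a 100-way classifier.
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(64, 100),
)

optimizer = optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
# Reduce the learning rate by a factor of 10 every 25 epochs, for 50 epochs total.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    # ... iterate over batches of proxy videos, compute criterion(model(x), y),
    #     backpropagate, and call optimizer.step() ...
    scheduler.step()
```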
To apply the 3D-ResNet model, pre-trained on 3-band RGB images, to one-channel depth images or optical flow images, the depth images are simply converted to 3 bands in the RGB image format. For the optical flow images, the pre-trained 3D-ResNet model takes the x-component, the y-component, and the magnitude of the flow as the R, G, and B bands of the RGB format.
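A minimal sketch of the depth-map conversion described above (the min-max normalization to [0, 255] is an assumption for illustration):

```python
import numpy as np

def depth_to_rgb_format(depth: np.ndarray) -> np.ndarray:
    """Replicate a single-channel depth map into 3 bands so it matches the
    RGB input format expected by the pretrained 3D-ResNet."""
    d_min, d_max = depth.min(), depth.max()
    scaled = (depth - d_min) / (d_max - d_min + 1e-8) * 255.0
    band = scaled.astype(np.uint8)
    return np.stack([band, band, band], axis=-1)  # shape (H, W, 3)
```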
5.2 Experiments on ASL-100-RGBD
To prepare the training and testing for evaluation of the pro-
posed method on “ASL-100-RGBD” dataset, we first ex-
tracted the video clips for each ASL sign. We use 3,250 ASL clips for training (75% of the data) and the remaining 25% of the clips for testing. To ensure a subject-independent evaluation, no signer appears in both the training and testing sets. To augment the data, a new 16-frame proxy video is generated from each video by selecting a different subset of frames for each epoch during the training
phase.
5.2.1 Effects of Data Augmentations
The training dataset which contains 3,250 ASL video clips
of 100 ASL manual signs is relatively small for 3DCNN
training and could easily cause an over-fitting problem. In
order to extract more representative temporal dynamics as
well as avoid over-fitting, a random temporal augmentation
technique is applied to generate proxy videos (a new proxy
video for each epoch) for each ASL clip. The ASL recogni-
tion results of using the proposed proxy video (16 frames
per video) are compared with the traditional method (us-
ing the same number of consecutive frames). The network,
3DResNet-34, does not converge when trained with 16 consecutive frames, while the network trained with proxy videos obtained 68.4% on the testing dataset. This is likely because most of the movement in these videos comes from the hands, and consecutive frames cannot effectively represent the temporal and spatial information. Therefore, the network could not distinguish the clips based on only 16 consecutive
frames. We also evaluate the effect of random cropping (us-
ing a patch size of 112 × 112) and random rotation (with a random number of degrees in the range [−10, 10]).
Table 1 lists the effects of different data augmentation techniques on the performance of recognizing 100 ASL words using only the RGB channel. With the proxy videos, the
3DCNN model obtains 68.4% accuracy on the testing data
for recognizing 100 ASL signs. By adding the random crop,
the performance is improved by 4.4% and adding the ran-
dom rotation further improved the performance to 75.9%. In
the following experiments, proxy videos together with ran-
dom crop and random rotation are employed to augment the
data.
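A minimal sketch of the two spatial augmentations, applied here to a single frame with OpenCV (in practice the same random crop position and rotation angle would be shared across all frames of a proxy video; the border handling and the assumption that frames are pre-resized larger than the crop are illustrative):

```python
import numpy as np
import cv2

def augment_frame(frame: np.ndarray, crop_size: int = 112, max_deg: float = 10.0) -> np.ndarray:
    """Apply a random rotation in [-max_deg, max_deg] followed by a random
    crop of crop_size x crop_size to one video frame."""
    h, w = frame.shape[:2]
    angle = np.random.uniform(-max_deg, max_deg)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(frame, matrix, (w, h))
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    return rotated[top:top + crop_size, left:left + crop_size]
```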
Table 1 The comparison of the performance of different data augmen-
tation methods on only RGB channel with 16 frames for recogniz-
ing 100 ASL manual signs. All the models are pretrained on Kinet-
ics and finetuned on ASL-100-RGBD dataset. The best performance
is achieved with the random proxy video, random crop, and random
rotation.
Random Proxy Video      ✗                 ✓        ✓        ✓
Random Crop             ✗                 ✗        ✓        ✓
Random Rotation         ✗                 ✗        ✗        ✓
Performance             Not converging    68.4%    72.8%    75.9%
5.2.2 Effects of Network Architectures
In this experiment, the ASL recognition results of different numbers of layers (18, 34, 50, and 101) for 3DResNet are compared on full RGB, optical flow, and depth im-
ages. As shown in Table 2, the performance of 3DResNet-
18, 3DResNet-50, and 3DResNet-101 achieve comparable
results on the RGB channel. However, the performance on the optical flow and depth channels is much lower than that of the RGB channel because the network has been pre-trained on the Kinetics dataset, which contains only RGB images. As
shown in Table 2, 3DResNet-34 obtained the best perfor-
mance for all RGB, optical flow, and depth channels. Hence,
3DResNet-34 is chosen for all the subsequent experiments.
Table 2 The effects of number of layers for 3DResNet with 16 frames
on RGB, optical flow, and depth channels. All the models are pre-
trained on Kinetics and finetuned on ASL-100-RGBD dataset.
Network RGB (%) Optical Flow (%) Depth (%)
3DResNet-18 73.2 61.9 65.0
3DResNet-34 75.9 62.8 66.5
3DResNet-50 72.3 55.4 62.0
3DResNet-101 72.5 55.0 61.5
Fig. 6 Example images of three datasets. ASL-100-RGBD: various
ASL signs. Kinetics dataset: consisting of diverse human actions, involving different parts of the body. Chalearn IsoGD: various hand gestures including mudras (Hindu/Buddhist hand gestures) and diving
signals.
5.2.3 Effects of Pre-trained Models
To evaluate the effects of pre-trained models, we fine-tune
3DResNet-34 with pretrained models from the Kinetics [38] and the Chalearn LAP IsoGD [81] datasets, respectively. The Kinetics dataset consists of RGB videos of diverse human actions which involve different parts of the body, while the Chalearn LAP IsoGD dataset contains both RGB and depth videos of
various hand gestures including mudras (Hindu/ Buddhist
hand gestures), Chinese numbers and diving signals, as shown
in Fig. 6.
The results are shown in Table 3. The temporal duration
is fixed to 16 and the channels are RGB, Depth, and RG-
Bflow. In all channels, the performance using the pretrained
models from Chalearn dataset is better than pretrained mod-
els from Kinetics dataset. This is probably because all the
videos in Chalearn dataset are focused on hand gestures and
the network trained on this dataset can learn prior knowl-
edge of hand gestures. The Kinetics dataset consists of gen-
eral videos from YouTube and the network focuses on the
prior knowledge of motions. Therefore, for each channel the
pretrained model on the same channel of Chalearn dataset is
used in the subsequent experiments.
Table 3 The comparison of the performance of recognizing 100 ASL
words on 3DResNet-34 with different pretrained models.
Channels Kinetics (%) Chalearn (%)
RGB 75.9 76.38
Depth 66.5 68.18
RGB Flow 62.8 66.79
5.2.4 Effects of Temporal Duration of Proxy Videos
We study the effects of temporal duration (i.e. the number of frames used in proxy videos) by finetuning 3DResNet-34 on the ASL-100-RGBD dataset with proxy videos of 16, 32, and 64 frames, respectively. Note that the same temporal duration is also used to train the corresponding pre-trained model on the Chalearn dataset. Results are shown in Table 4. The network with 64 frames achieves the best performance. Therefore,
3D-ResNet-34 with 64 frames is used in all the following
experiments.
Table 4 The comparison of the performance of networks with different
temporal duration (i.e. number of frames used in proxy videos). All the
models are pretrained on Chalearn dataset and finetuned on ASL-100-
RGBD dataset by using same temporal duration.
Channel 16 frames (%) 32 frames (%) 64 frames (%)
RGB 76.38 80.73 87.83
Depth 68.18 74.21 81.93
RGB Flow 66.79 71.74 80.51
5.2.5 Effects of Different Input Channels
In this section, we examine the fusion results of different in-
put channels. The RGB channel provides global spatial and
temporal appearance information, the depth channel provides
the distance information, and the optical flow channel cap-
tures the motion information. The network is finetuned on
the three input channels respectively. The average fusion is
obtained by weighting the predicted results.
Table 5 shows the performance of ASL recognition on
ASL-100-RGBD dataset for each input channel and differ-
ent fusions. While the RGB channel alone achieves 87.83%, fusing it with optical flow boosts the performance to 89.02%. With the fusion of all three channels (RGB, Optical flow, and Depth), the performance is further improved to 89.91%. This indicates that the depth and optical flow
channels contain complementary information to RGB chan-
nel for ASL recognition.
Table 5 The comparison of the performance of networks with differ-
ent input channels and their fusions. All the models are pretrained on
Chalearn dataset and finetuned on ASL-100-RGBD dataset with 64
frames.
Channel(s)                        Performance
RGB                               87.83%
Depth                             81.93%
Optical Flow                      80.51%
RGB + Depth + Optical Flow        89.91%
RGB + Optical Flow                89.02%
Further two-channel fusion        89.71%
5.2.6 Effects of Different Modalities
To attain further insight into the features learned by the 3DCNN model for the RGB channel, in Fig. 7 we visualize examples of attention maps from the fifth convolution layer, generated on our test dataset by the trained RGB 3DCNN model for ASL recognition. These attention maps are computed by averaging the magnitude of the activations of the convolution layer, which reflects the attention of the network. The attention maps show that the model mostly focuses on the hands and face of the signer during the ASL recognition process.
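A minimal sketch of how such an attention map can be obtained from a 3DCNN feature map (the upsampling and normalization steps are assumptions for visualization):

```python
import torch
import torch.nn.functional as F

def attention_map(features: torch.Tensor, out_size: tuple) -> torch.Tensor:
    """Average the magnitude of activations over the channel dimension of a
    3DCNN feature map (batch, channels, t, h, w) and upsample to frame size."""
    attn = features.abs().mean(dim=1, keepdim=True)  # (batch, 1, t, h, w)
    attn = F.interpolate(attn, size=out_size, mode="trilinear", align_corners=False)
    # Normalize to [0, 1] for visualization as a heat map.
    attn = (attn - attn.amin()) / (attn.amax() - attn.amin() + 1e-8)
    return attn.squeeze(1)
```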
Fig. 7 Example RGB images and their corresponding attention maps from the fifth convolution layer of the 3DResNet-34 on our ASL recognition test dataset, in which the hands and face receive most of the attention.
Hence, we conduct experiments to analyze the effects of
each modality (hand gestures, facial expression, and body
poses) with the RGB channel. As shown in Fig. 3, the hand
regions and the face regions are obtained from the RGB im-
age based on the location guided by skeleton joints. The per-
formance of each modality and their fusions are summarized
in Table 6.
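A minimal sketch of this skeleton-guided cropping (the joint used and the crop size are illustrative assumptions):

```python
import numpy as np

def crop_region(frame: np.ndarray, joint_xy: tuple, size: int = 128) -> np.ndarray:
    """Crop a size x size patch centered on a skeleton joint (e.g., a wrist
    joint for a hand region), clamping the window to the image borders."""
    h, w = frame.shape[:2]
    x, y = int(joint_xy[0]), int(joint_xy[1])
    half = size // 2
    left = min(max(x - half, 0), max(w - size, 0))
    top = min(max(y - half, 0), max(h - size, 0))
    return frame[top:top + size, left:left + size]
```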
Table 6 The comparison of the performance of different modalities
and their fusions. All the models are pretrained on Chalearn dataset
and finetuned on ASL-100-RGBD dataset with 64 frames.
Modalities               Performance
Body                     87.83%
Hand                     80.9%
Body + Hand              89.81%
Body + Hand + Face       91.5%
In addition to the accuracy of ASL sign recognition, we
further analyzed the accuracy of the six categories (see Fig.
4 for details) for each modality and their combinations in
Table 7. For the categories that involve many facial expres-
sions, such as Question (Yes/No) and Negative, the accuracy of the hand modality is improved by more than 15% after fusion with the face modality. For the Conditional category, which utilizes more subtle facial expressions, the accuracy of the hand modality is not improved after fusion with the face modality.
Table 7 The performance (%) of different modalities and their fusions
on six categories listed in Fig. 4: Conditional (Cond), Negative (Neg),
Pointing (Point), Question (WH), Yes/No Question (Y/N) and Time.
The last column is the accuracy (%) for ASL signs.
Modalities Cond Neg Point WH Y/N Time Acc
Hand 90.0 78.1 68.4 84.3 68.4 81.4 80.9
Body 100.0 87.4 84.2 88.0 89.5 87.6 87.83
Body+Hand 90.9 86.6 89.5 88.7 94.7 90.2 89.81
Body+Hand+Face 90.9 93.3 84.2 90.6 84.2 91.8 91.5
5.2.7 Fusions of Different Channels and Modalities
The fusion results of different input channels and modali-
ties on ASL-100-RGBD dataset are shown in Table 8. The
experiments are based on 3DResNet-34 with 64 frames, pre-
trained on the Chalearn dataset. Among all the models, the fusion of RGB, Depth, RGB of Hands, and RGB of Face achieves the best performance with 92.88% accuracy. Adding RGBflow to this combination results in 92.48% accuracy, which is comparable but not improved, since the channels contain redundant information.
Table 8 Performance of 3DResNet-34 with 64 frames for the fusion of different channels and modalities. The channels considered are RGB, Depth, RGBflow, RGB of Hands, and RGB of Face; the four evaluated fusions achieve 91.19%, 92.48%, 92.48%, and 92.88%, where the best result (92.88%) corresponds to fusing RGB, Depth, RGB of Hands, and RGB of Face.
5.3 Experiments on Chalearn LAP IsoGD dataset
5.3.1 Effects of Network Architectures
The 3D-ResNet is pre-trained on Kinetics [38] for all the
experiments in this section. To find the best network archi-
tecture for Chalearn dataset, the parameters of 3D-ResNet
are studied on RGB videos. The results are shown in Table
9. By changing the number of layers to 18, 34, and 50 while fixing the temporal duration to 32, ResNet-34 achieved the best accuracy.
Table 9 Ablation study of number of layers of the network on RGB
videos of Chalearn Dataset.
Network Temporal Duration Accuracy
ResNet-18 32 52.69%
ResNet-34 32 56.28%
ResNet-50 32 54.57%
We also examined the performance of ResNet-34 by changing the temporal duration to 16, 32, and 64 frames. Our results indicate that ResNet-34 with 64 frames is the best architecture for the Chalearn dataset, as shown in Table 10.
Table 10 Ablation study of temporal duration on RGB videos of
Chalearn Dataset.
Network Temporal Duration Accuracy
ResNet-34 16 45.00%
ResNet-34 32 56.28%
ResNet-34 64 58.32%
5.3.2 Effects of Different Channels and Modalities
We evaluate the effects of different channels including RGB,
RGB flow, Depth, and Depth flow. Because the Chalearn
dataset is designed for hand gesture recognition, we fur-
ther analyze the effects of different hands (left and right), as
well as the whole body. We develop a method to distinguish
left and right hands in Chalearn Isolated Gesture dataset,
and will release the coordinates of hands (distinguished be-
tween right and left hands) with the publication of this arti-
cle. Since the Chalearn dataset is collected for recognizing
hand gestures, here, the face channel is not employed.
We train 12 3D-ResNet-34 networks with 64 frames using different combinations of channels and modalities, and show the results in Table 11. The accuracy of the right hand is significantly higher than that of the left hand. The reason is that for most of the gestures in the Chalearn dataset, the right hand is dominant and the left hand does not move much.
Table 11 Performance of 3D-ResNet-34 with 64 frames on Chalearn
Dataset for different channels and modalities.
Channel Global Channel (%) Left Hand (%) Right Hand (%)
RGB 58.32 18.01 48.58
Depth 63.16 19.43 54.15
RGB Flow 60.26 21.97 48.79
Depth Flow 55.37 20.28 47.07
5.3.3 Effects of Fusions on different channels and
Modalities
Here we analyze the effects of average fusion on different
channels and modalities. The results are shown in Table 12.
Using only the RGB and Depth channels, the accuracy is 67.58%, which is improved to 69.97% by adding RGB flow. We ob-
serve that among all different triplets of channels, Right Hand
RGB + Depth + RGBflow has the highest accuracy at 73.32%.
By applying the average fusion on four channels (RGB + RGBflow + Right Hand RGB + Right Hand Depth), our model achieves an accuracy of about 75.88%, which outperforms the average fusion results of all previous work on the Chalearn dataset. In the state-of-the-art work of [61], the accuracy of average fusion is 71.93% for 7 channels and 70.37% for 12 channels, respectively.
Finally, the average fusion of all global channels (RGB,
RGB flow, Depth, Depth flow) and Right hand channels (
Right hand RGB, Right hand RGB flow, Right hand Depth,
Right hand Depth flow) resulted in 76.04% accuracy and the
accuracy of 12 channels together resulted in 75.68%. This
means that the 12 channels contain redundant information,
and adding more channels does not necessarily improve the
results.
Table 12 Performance of 3DResNet-34 with 64 frames for the fusion of different channels and modalities on the Chalearn dataset.
Fused channels                                                Performance
RGB + Depth                                                   67.58%
RGB + Depth + RGBflow                                         69.97%
Depth + RGBflow + RGB of Right Hand                           73.32%
Further fusion of the listed channels                         75.53%
RGB + RGBflow + RGB of Right Hand + Depth of Right Hand       75.88%
5.3.4 Comparison with the State of the Art
Our framework achieves accuracies of 75.88% and 76.04% from the fusion of 5 and 8 channels, respectively, on the Chalearn IsoGD dataset. Table 13 lists the state-of-the-art results from the Chalearn IsoGD competition 2017 as well as a recent paper, FOANet [61]. As shown in the table, in terms of Average Fusion, our framework achieves around 6% higher accuracy than the state-of-the-art methods.
Table 13 Comparison with State-of-the-art Results on Chalearn
IsoGD Dataset.
Framework Accuracy on Test Set (%)
Our Results 76.04
FOANet (Average Fusion) [61] 70.37
Miao et al. (ASU) [58] 67.71
SYSU-IEEE 67.02
Lostoy 65.97
Wang et al. (AMRL) [82] 65.59
Zhang et al. (XDETVP) [90] 60.47
It is worth noting that FOANet [61] reported the accu-
racy of 82.07% by applying Sparse Fusion on the softmax
scores of 12 channels (combinations of right hand, left hand,
and whole body while each has 4 channels of RGB, Depth,
RGBflow and Depthflow). The purpose of using sparse fu-
sion is to learn which channels are important for each ges-
ture. The accuracy of the FOANet framework using average fusion is 70.37%, which is around 6% lower than our results and nearly 12% lower than the accuracy of sparse fusion.
While the authors of FOANet [61] had reported a 12% boost
from using sparse fusion in their original experiments, our
experiments do not reveal such a boost when implementing
a system following the technical details provided in [61].
Table 14 lists the accuracy on individual channels of our
network and FOANet [61]. In this table, the values inside the parentheses represent the accuracy of FOANet. As shown in
the table, in the Global channel, our framework outperforms
FOANet in all the four channels by 10% to 25%. Also, for
the RGB of the Right Hand, we obtain a comparable accuracy (48%) to FOANet. However, FOANet outperforms our results in the Right Hand for Depth, RGBflow, and Depthflow by nearly 10%. From our experiments, the performance
of ”Global” channels (whole body) in general is superior to
the Local channels (Right/ Left Hand) because the Global
channels include more information. By using the similar ar-
chitecture, FOANet reported 64% accuracy from Depth of
Right Hand and 38% from Depth of the entire frame. In-
stead, our framework achieves more consistent results. For
example, in our framework the accuracy of Depth channel
is higher than RGB and RGBflow for both Global and Right
Hand, while the accuracy in FOANet for Depth and RGB
are almost the same in the Global channel (around 40%) but
very different in the Right Hand channel (17% difference.)
Table 14 The accuracy (%) of 12 channels on the test set of the Chalearn IsoGD Dataset: comparison between our framework and FOANet [61]. The values inside the parentheses belong to FOANet.
Channel      Global (%)      Left Hand (%)   Right Hand (%)
RGB          58.32 (41.27)   18.01 (16.63)   48.58 (47.41)
Depth        63.16 (38.50)   19.43 (24.06)   54.15 (64.44)
RGB Flow     60.26 (50.96)   21.97 (24.02)   48.79 (59.69)
Depth Flow   55.37 (42.02)   20.28 (22.71)   47.07 (58.79)
6 Conclusion
In this paper, we have proposed a 3DCNN-based multi-channel and multi-modality framework, which learns complementary information and embeds the temporal dynamics of videos to recognize ASL manual signs from RGB-D videos. To validate our proposed method, we collaborated with ASL experts to collect an ASL dataset of 100 manual signs including both hand gestures and facial expressions, with full annotation of the word labels and temporal boundaries (starting and ending points). The experimental results have demonstrated that fusing multiple channels in our proposed framework improves the accuracy of recognizing ASL signs. This technology for identifying the appearance of specific ASL words has valuable applications for technologies that can benefit people who are DHH [9, 42, 43, 48, 52, 65, 68]. As an additional contribution, our “ASL-100-RGBD” dataset will be released to enable other members of
the research community to use this resource for training or evaluation of models for ASL recognition. The effectiveness of the proposed framework is also evaluated on the Chalearn IsoGD Dataset. Our method achieves 75.88% accuracy using only 5 channels, which is 5.51% higher than the state-of-the-art work using 12 channels in terms of average fusion.
Acknowledgements This material is based upon work supported by
the National Science Foundation under award numbers 1400802, 1400810,
and 1462280.
References
1. American deaf and hard of hearing statistics.
https://www.nidcd.nih.gov/health/statistics/quick-statistics-
hearing
2. Intel realsense technology: Observe the world in 3d.
https://www.intel.com/content/www/us/en/architecture-and-
technology/realsense-overview.html (2018)
3. Orbbec astra. https://orbbec3d.com/product-astra/ (2018)
4. Set up kinect for windows v2 or an xbox kinect sensor with kinect
adapter for windows. https://support.xbox.com/en-US/xbox-on-
windows/accessories/kinect-for-windows-v2-setup (2018)
5. von Agris, U., Knorr, M., Kraiss, K.F.: The significance of facial
features for automatic sign language recognition. In: Proceedings
of IEEE International Conference on Automatic Face & Gesture
Recognition (2008)
6. Almeida, S.G.M., Guimarães, F.G., Ramírez, J.: Feature extrac-
tion in brazilian sign language recognition based on phonological
structure and using rgb-d sensors. Expert Systems with Applica-
tions 41(16), 7259–7271 (2014)
7. Athitsos, V., Neidle, C., Sclaroff, S., Nash, J., Stefan, A., Yuan,
Q., Thangali, A.: The asl lexicon video dataset. In: Proceedings
of CVPR 2008 Workshop on Human Communicative Behaviour
Analysis. IEEE (2008)
8. Buehler, P., Everingham, M., Huttenlocher, D.P., Zisserman, A.:
Upper body detection and tracking in extended signing sequences.
International journal of computer vision 95(2), 180 (2011)
9. Camgöz, N.C., Hadfield, S., Koller, O., Bowden, R.: Subunets:
End-to-end hand shape and continuous sign language recognition.
In: ICCV, vol. 1 (2017)
10. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neu-
ral sign language translation. CVPR 2018 Proceedings (2018)
11. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new
model and the kinetics dataset. In: Computer Vision and Pattern
Recognition (CVPR), 2017 IEEE Conference on, pp. 4724–4733.
IEEE (2017)
12. Chai, X., Li, G., Lin, Y., Xu, Z., Tang, Y., Chen, X., Zhou, M.: Sign
language recognition and translation with kinect. In: Proceedings
of IEEE International Conference on Automatic Face and Gesture
Recognition (2013)
13. Charles, J., Pfister, T., Everingham, M., Zisserman, A.: Automatic
and efficient human pose estimation for sign language videos. In-
ternational Journal of Computer Vision 110(1), 70–90 (2014)
14. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural net-
works for continuous sign language recognition by staged opti-
mization. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2017)
15. Diba, A., Fayyaz, M., Sharma, V., Karami, A.H., Mahdi Arzani,
M., Yousefzadeh, R., Van Gool, L.: Temporal 3D ConvNets:
New Architecture and Transfer Learning for Video Classification.
ArXiv e-prints (2017)
16. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M.,
Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent con-
volutional networks for visual recognition and description. In:
Proceedings of the IEEE conference on Computer Vision and Pat-
tern Recognition, pp. 2625–2634 (2015)
17. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng,
E., Darrell, T.: Decaf: A deep convolutional activation feature for
generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
18. Dreuw, P., Forster, J., Ney, H.: Tracking benchmark databases for
video-based sign language recognition. In: Proc. ECCV Interna-
tional Workshop on Sign, Gesture, and Activity (2010)
19. Er-Rady, A., Thami, R.O.H., Faizi, R., Housni, H.: Automatic sign
language recognition: A survey. In: Proceedings of the 3rd In-
ternational Conference on Advanced Technologies for Signal and
Image Processing (2017)
20. Fang, G., Gao, W., Zhao, D.: Large-vocabulary continuous sign
language recognition based on transition-movement models. IEEE
Transactions on Systems, Man, and Cybernetics - Part A: Systems
and Humans 37(1) (2007)
21. Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., Tuytelaars, T.:
Rank pooling for action recognition. IEEE transactions on Pattern
Analysis and Machine Intelligence 39(4), 773–787 (2017)
22. Forster, J., Schmidt, C., Hoyoux, T., Koller, O., Zelle, U., Piater,
J.H., Ney, H.: Rwth-phoenix-weather: A large vocabulary sign
language recognition and translation corpus. In: LREC, pp. 3785–
3789 (2012)
23. Furman, N., Goldberg, D., Lusin, N.: Enrollments in
languages other than english in united states institu-
tions of higher education, fall 2010. Retrieved from
http://www.mla.org/2009 enrollmentsurvey (2010)
24. Gattupalli, S., Ghaderi, A., Athitsos, V.: Evaluation of deep learn-
ing based pose estimation for sign language recognition. In:
Proceedings of the 9th ACM International Conference on Perva-
sive Technologies Related to Assistive Environments, p. 12. ACM
(2016)
25. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hier-
archies for accurate object detection and semantic segmentation.
In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE
Conference on, pp. 580–587. IEEE (2014)
26. Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H.J.: The
chalearn gesture dataset (cgd 2011). Machine Vision and Appli-
cations 25(8), 1929–1951 (2014)
27. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3d cnns re-
trace the history of 2d cnns and imagenet? In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 6546–6555 (2018)
28. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in
deep convolutional networks for visual recognition. In: Computer
Vision–ECCV 2014, pp. 346–361. Springer (2014)
29. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based
sign language recognition without temporal segmentation. arXiv
preprint arXiv:1801.10111 (2018)
30. Huenerfauth, M., Gale, E., Penly, B., Pillutla, S., Willard, M., Har-
iharan, D.: Evaluation of language feedback methods for student
videos of american sign language. ACM Transactions on Acces-
sible Computing (TACCESS) 10(1), 2 (2017)
31. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks
for human action recognition. IEEE transactions on pattern anal-
ysis and machine intelligence 35(1), 221–231 (2013)
32. Jiang, Y., Tao, J., Ye, W., Wang, W., Ye, Z.: An isolated sign lan-
guage recognition system using rgb-d sensor with sparse coding.
In: Proceedings of IEEE 17th International Conference on Com-
putational Science and Engineering (2014)
33. Jing, L., Yang, X., Tian, Y.: Video you only look once: Overall
temporal convolutions for action recognition. Journal of Visual
Communication and Image Representation 52, 58–65 (2018)
34. Jing, L., Ye, Y., Yang, X., Tian, Y.: 3d convolutional neural net-
work with multi-model framework for action recognition. In: 2017
IEEE International Conference on Image Processing (ICIP), pp.
1837–1841. IEEE (2017)
35. Kadous, M.: Machine recognition of auslan signs using power-
gloves: towards large-lexicon recognition of sign language. In:
Proceedings of the Workshop on the Integration of Gesture in Lan-
guage and Speech, pp. 165–174 (1996)
36. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for
generating image descriptions. arXiv preprint arXiv:1412.2306
(2014)
37. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R.,
Fei-Fei, L.: Large-scale video classification with convolutional
neural networks. In: CVPR (2014)
38. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vi-
jayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P.,
et al.: The kinetics human action video dataset. arXiv preprint
arXiv:1705.06950 (2017)
39. Kelly, D., McDonald, J., Markham, C.: A person independent sys-
tem for recognition of hand postures used in sign language. Pattern
Recognition Letters 31(11), 1359–1368 (2010)
40. Keskin, C., Kra, F., Kara, Y., Akarun, L.: Hand pose estimation
and hand shape classification using multi-layered randomized de-
cision forests. In: In Proceedings of the European Conference on
Computer Vision, pp. 852–863 (2012)
41. Koller, O., Forster, J., Ney, H.: Continuous sign language recog-
nition: Towards large vocabulary statistical recognition systems
handling multiple signers. Computer Vision and Image Under-
standing 141, 108–125 (2015)
42. Koller, O., Ney, H., Bowden, R.: Deep learning of mouth shapes
for sign language. In: Proceedings of the IEEE International Con-
ference on Computer Vision Workshops, pp. 85–91 (2015)
43. Koller, O., Ney, H., Bowden, R.: Deep hand: How to train a cnn on
1 million hand images when your data is continuous and weakly
labelled. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3793–3802 (2016)
44. Koller, O., Zargaran, S., Ney, H., Bowden, R.: Deep sign: Enabling
robust statistical continuous sign language recognition via hybrid
cnn-hmms. International Journal of Computer Vision 126(12),
1311–1325 (2018)
45. Kong, W., Ranganath, S.: Towards subject independent continuous
sign language recognition: A segment and merge approach. Pat-
tern Recognition 47(3), 1294–1308 (2014)
46. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classifica-
tion with deep convolutional neural networks. In: Advances in
Neural Information Processing Systems, pp. 1097–1105 (2012)
47. Kumar, P., Gauba, H., Roy, P.P., Dogra, D.P.: A multimodal frame-
work for sensor based sign language recognition. Neurocomputing
259, 21–38 (2017)
48. Kumar, P., Roy, P.P., Dogra, D.P.: Independent bayesian classifier
combination based sign language recognition using facial expres-
sion. Information Sciences 428, 30–48 (2018)
49. Lang, S., Block, M., Rojas, R.: Sign language recognition using
kinect. In: In Proceedings of International Conference on Artificial
Intelligence and Soft Computing, pp. 394–402 (2012)
50. Liang, R.H., Ouhyoung, M.: A real-time continuous gesture
recognition system for sign language. In: Proceedings of the Third
IEEE International Conference on Automatic Face and Gesture
Recognition, pp. 558–567 (1998)
51. Liu, J., Liu, B., Zhang, S., Yang, F., Yang, P., Metaxas, D.N., Nei-
dle, C.: Recognizing eyebrow and periodic head gestures using
crfs for non-manual grammatical marker detection in asl. In: Proc.
of the 10th IEEE International Conference and Workshops on Au-
tomatic Face and Gesture Recognition (FG) (2013)
52. Liu, W., Fan, Y., Li, Z., Zhang, Z.: Rgbd video based human hand
trajectory tracking and gesture recognition system. Mathematical
Problems in Engineering 2015 (2015)
53. Liu, Z., Huang, F., Tang, G.W.L., Sze, F.Y.B., Qin, J., Wang, X.,
Xu, Q.: Real-time sign language recognition with guided deep
convolutional neural networks. In: Proceedings of the 2016 Sym-
posium on Spatial User Interaction, pp. 187–187. ACM (2016)
54. Lu, P., Huenerfauth, M.: Cuny american sign language motion-
capture corpus: first release. In: Proceedings of the 5th Workshop
on the Representation and Processing of Sign Languages: Interac-
tions between Corpus and Lexicon, The 8th International Confer-
ence on Language Resources and Evaluation (LREC 2012), Istan-
bul, Turkey (2012)
55. Martínez, A.M., Wilbur, R.B., Shay, R., Kak, A.C.: The rvl-slll asl
database. In: Proc. of IEEE International Conference Multimodal
Interfaces (2002)
56. Mehrotra, K., Godbole, A., Belhe, S.: Indian sign language recog-
nition using kinect sensor. In: In Proceedings of the International
Conference Image Analysis and Recognition, pp. 528–535 (2015)
57. Metaxas, D., Liu, B., Yang, F., Yang, P., Michael, N., Neidle, C.:
Recognition of nonmanual markers in asl using non-parametric
adaptive 2d-3d face tracking. In: Proc. of the Int. Conf. on Lan-
guage Resources and Evaluation (LREC), European Language Re-
sources Association (2012)
58. Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., Cao, X.,
Liu, Z., Chai, X., Liu, Z., et al.: Multimodal gesture recognition
based on the resc3d network. In: ICCV Workshops, pp. 3047–
3055 (2017)
59. Mitchell, R.E., Young, T.A., Bachleda, B., Karchmer, M.A.: How
many people use asl in the united states? why estimates need up-
dating. Sign Language Studies 6(3), 306–335 (2006)
60. Mulrooney, K.: American Sign Language Demystified, Hard Stuff
Made Easy. McGraw Hill (2010)
61. Narayana, P., Beveridge, J.R., Draper, B.A.: Gesture recognition:
Focus on the hands. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5235–5244 (2018)
62. Neidle, C., Thangali, A., Sclaroff, S.: Challenges in development
of the american sign language lexicon video dataset (asllvd) cor-
pus. In: Proceedings of the Language Resources and Evaluation
Conference (LREC) (2012)
63. Neidle, C., Vogler, C.: A new web interface to facilitate access
to corpora: Development of the asllrp data access interface (dai).
In: Proc. 5th Workshop on the Representation and Processing of
Sign Languages: Interactions between Corpus and Lexicon, LREC
(2012)
64. Ong, S.C., Ranganath, S.: Automatic sign language analysis:
A survey and the future beyond lexical meaning. IEEE Pattern
Analysis and Machine Intelligence 27(6), 873–891 (2005)
65. Palmeri, M., Vella, F., Infantino, I., Gaglio, S.: Sign languages
recognition based on neural network architecture. In: Interna-
tional Conference on Intelligent Interactive Multimedia Systems
and Services, pp. 109–118. Springer (2017)
66. Pigou, L., Dieleman, S., Kindermans, P.J., Schrauwen, B.: Sign
language recognition using convolutional neural networks. In:
Proceedings of European Conference on Computer Vision Work-
shops, pp. 572–578 (2014)
67. Pigou, L., Van Den Oord, A., Dieleman, S., Van Herreweghe, M.,
Dambre, J.: Beyond temporal pooling: Recurrence and temporal
convolutions for gesture recognition in video. International Jour-
nal of Computer Vision 126(2-4), 430–439 (2018)
68. Pigou, L., Van Herreweghe, M., Dambre, J.: Gesture and sign
language recognition with temporal residual networks. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3086–3093 (2017)
69. Pu, J., Zhou, W., Li, H.: Dilated convolutional network with iter-
ative optimization for continuous sign language recognition. In:
IJCAI, pp. 885–891 (2018)
70. Pugeault, N., Bowden, R.: Spelling it out: Real-time asl finger-
spelling recognition. In: Proc. of IEEE International Conference
on Computer Vision Workshops, pp. 1114–1119 (2011)
71. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation
with pseudo-3d residual networks. In: The IEEE International
Conference on Computer Vision (ICCV) (2017)
72. Ren, Z., Yuan, J., Meng, J., Zhang, Z.: Robust part-based hand
gesture recognition using kinect sensor. IEEE Trans. on Multime-
dia 15, 1110–1120 (2013)
73. Simonyan, K., Zisserman, A.: Two-stream convolutional networks
for action recognition in videos. In: Advances in Neural Informa-
tion Processing Systems, pp. 568–576 (2014)
74. Simonyan, K., Zisserman, A.: Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556
(2014)
75. Starner, T., Weaver, J., Pentland, A.: Real-time american sign lan-
guage recognition using desk and wearable computer based video.
IEEE Pattern Analysis and Machine Intelligence 20(12), 1371–
1375 (1998)
76. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov,
D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with
convolutions. arXiv preprint arXiv:1409.4842 (2014)
77. Tamura, S., Kawasaki, S.: Recognition of sign language motion
images. Pattern Recognition 21(4), 343–353 (1988)
78. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learn-
ing spatiotemporal features with 3d convolutional networks. In:
Proceedings of the IEEE International Conference on Computer
Vision, pp. 4489–4497 (2015)
79. Traxler, C.B.: The stanford achievement test: National norming
and performance standards for deaf and hard-of-hearing students.
Journal of deaf studies and deaf education 5(4), 337–348 (2000)
80. Valli, C., Lucas, C., Mulrooney, K.J., Villanueva, M.: Linguistics
of American Sign Language: An Introduction. Gallaudet Univer-
sity Press (2011)
81. Wan, J., Li, S., Zhao, Y., Zhou, S., Guyon, I., Escalera, S.:
Chalearn looking at people rgb-d isolated and continuous datasets
for gesture recognition. In: Proceedings of CVPR 2008 Work-
shops. IEEE (2016)
82. Wang, H., Wang, P., Song, Z., Li, W.: Large-scale multimodal ges-
ture recognition using heterogeneous networks. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pp. 3129–3137 (2017)
83. Yang, H., Sclaroff, S., Lee, S.: Sign language spotting with a
threshold model based on conditional random fields. IEEE Pat-
tern Analysis and Machine Intelligence 31(7), 1264–1277 (2009)
84. Yang, H.D.: Sign language recognition with the kinect sensor
based on conditional random fields. Sensors 15, 135–147 (2015)
85. Yang, R., Sarkar, S., Loeding, B.: Handling movement epenthesis
and hand segmentation ambiguities in continuous sign language
recognition using nested dynamic programming. IEEE Pattern
Analysis and Machine Intelligence 32(3), 462–477 (2010)
86. Ye, Y., Tian, Y., Huenerfauth, M.: Recognizing american sign lan-
guage gestures from within continuous videos. The 8th IEEE
Workshop on Analysis and Modeling of Faces and Gestures
(AMFG) in conjunction with CVPR 2018 (2017)
87. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals,
O., Monga, R., Toderici, G.: Beyond short snippets: Deep net-
works for video classification. In: Proceedings of the IEEE confer-
ence on Computer Vision and Pattern Recognition, pp. 4694–4702
(2015)
88. Zafrulla, Z., Brashear, H., Starner, T., Hamilton, H., Presti, P.:
American sign language recognition with the kinect. In: In Pro-
ceedings of the International Conference on Multimodal Inter-
faces, pp. 279–286 (2011)
89. Zhang, C., Tian, Y., Huenerfauth, M.: Multi-modality american
sign language recognition. In: Proceedings of IEEE International
Conference on Image Processing (ICIP) (2016)
90. Zhang, L., Zhu, G., Shen, P., Song, J., Shah, S.A., Bennamoun, M.:
Learning spatiotemporal features using 3dcnn and convolutional
lstm for gesture recognition. In: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 3120–3128
(2017)