Recognizing American Sign Language Manual Signs from RGB-D
Videos
Longlong Jing · Elahe Vahdani · Matt Huenerfauth · Yingli Tian
Abstract In this paper, we propose a 3D Convolutional Neu-
ral Network (3DCNN) based multi-stream framework to rec-
ognize American Sign Language (ASL) manual signs (con-
sisting of movements of the hands, as well as non-manual
face movements in some cases) in real-time from RGB-D
videos, by fusing multimodality features including hand ges-
tures, facial expressions, and body poses from multi-channels
(RGB, depth, motion, and skeleton joints). To learn the over-
all temporal dynamics in a video, a proxy video is generated
by selecting a subset of frames for each video which are then
used to train the proposed 3DCNN model. We collect a new
ASL dataset, ASL-100-RGBD, which contains 42 RGB-
D videos captured by a Microsoft Kinect V2 camera, each
of 100 ASL manual signs, including RGB channel, depth
maps, skeleton joints, face features, and HDface. The dataset
is fully annotated for each semantic region (i.e. the time du-
ration of each word that the human signer performs). Our
proposed method achieves 92 .88 %accuracy for recogniz-
ing 100 ASL words in our newly collected ASL-100-RGBD
dataset. The effectiveness of our framework for recogniz-
ing hand gestures from RGB-D videos is further demon-
strated on the Chalearn IsoGD dataset and achieves 76% accuracy, which is 5.51% higher than the state-of-the-art work in terms of average fusion, by using only 5 channels instead of 12 channels in the previous work.

L. Jing and E. Vahdani
Department of Computer Science, The Graduate Center, The City University of New York, NY, 10016.
E-mail: {ljing, evahdani}@gradcenter.cuny.edu
Equal Contribution

M. Huenerfauth
Golisano College of Computing and Information Sciences, the Rochester Institute of Technology (RIT), Rochester, NY, USA.
E-mail: matt.huenerfauth@rit.edu

Y. Tian
Department of Electrical Engineering, The City College, and the Department of Computer Science, the Graduate Center, the City University of New York, NY, 10031.
E-mail: ytian@ccny.cuny.edu
Corresponding Author
Keywords American Sign Language Recognition ·
Hand Gesture Recognition ·RGB-D Video Analysis ·
Multimodality ·3D Convolutional Neural Networks ·Proxy
Video.
1 Introduction
The focus of our research is to develop a real-time system
that can automatically identify ASL manual signs (individ-
ual words, which consist of movements of the hands, as well
as facial expression changes) from RGB-D videos. How-
ever, our broader goal is to create useful technologies that
would support ASL education, which would utilize this tech-
nology for identifying ASL signs and provide ASL students
immediate feedback about whether their signing is fluent or
not.
There are more than one hundred sign languages used
around the world, and ASL is used throughout the U.S. and
Canada, as well as other regions of the world, including
West Africa and Southeast Asia. Within the U.S.A., about
28 million people today are Deaf or Hard-of-Hearing (DHH)
[1]. There are approximately 500,000 people who use ASL
as a primary language [59], and since there are significant
linguistic differences between English and ASL, it is possi-
ble to be fluent in one language but not in the other.
In addition to the many members of the Deaf community
who may prefer to communicate in ASL, there are many
individuals who seek to learn the language. Due to a va-
riety of educational factors and childhood language expo-
sure, researchers have measured lower levels of English lit-
eracy among many deaf adults in the U.S. [79]. Studies have
shown that deaf children raised in homes with exposure to
ASL have better literacy as adults, but it can be challenging
for parents, teachers, and other adults in the life of a deaf
child to rapidly gain fluency in ASL. The study of ASL as a
foreign language in universities has significantly increased
by 16.4% from 2006 to 2009, which ranked ASL as the
4th most studied language at colleges [23]. Thus, there are many individuals who would benefit from a flexible way to practice their ASL signing skills, and our research investigates
technologies for recognizing signs performed in color and
depth videos, as discussed in [30].
While the development of user-interfaces for educational
software was described in our prior work [30], this article
instead focuses on the development and evaluation of our
ASL recognition technologies, which underlie our educa-
tional tool. Beyond this specific application, technology to
automatically recognize ASL signs from videos could en-
able new communication and accessibility technologies for
people who are DHH, which may allow these users to in-
put information into computing systems by performing sign
language or may serve as a foundation for future research on
machine translation technologies for sign languages.
The rest of this article is structured as follows: Section
1.1 provides a summary of relevant ASL linguistic details,
and Section 1.2 motivates and defines the scope of our con-
tributions. Section 2 surveys related work in ASL recogni-
tion, gesture recognition in videos, and some video-based
ASL corpora (collections of linguistically labeled video record-
ings). Section 3 describes our framework for ASL recogni-
tion, Section 4 describes the new dataset of 100 ASL words
captured by a RGBD camera which is used in this work,
and Section 5 presents the experiments to evaluate our ASL
recognition model and the extension of our framework for
Chalearn IsoGD dataset. Finally, Section 6 summarizes the
proposed work.
1.1 ASL Linguistics Background
ASL is a natural language conveyed through movements and
poses of the hands, body, head, eyes, and face [80]. Most
ASL signs consist of the hands moving, pausing, and chang-
ing orientation in space. Individual ASL signs (words) con-
sist of a sequence of several phonological segments, which
include:
– An essential parameter of a sign is the configuration of the hand, i.e. the degree to which each of the finger joints is bent, which is commonly referred to as the “handshape.” In ASL, there are approximately 86 handshapes which are commonly used [62], and the hand may transit between handshapes during the production of a single sign.
– During an ASL sign, the signer’s hands will occupy specific locations and will perform movement through space. Some signs are performed by a single hand, but most are performed using both of the signer’s hands, which move through the space in front of their head and torso. During two-handed signs, the two hands may have symmetrical movements, or the signer’s dominant hand (e.g. the right hand of a right-handed person) will have greater movements than the non-dominant hand.
– The orientation of the palm of the hand in 3D space is also a meaningful aspect of an ASL sign, and this parameter may differentiate pairs of otherwise identical signs.
– Some signs co-occur with specific “non-manual signals,” which are generally facial expressions that are characterized by specific eyebrow movement, head tilt/turn, or head movement (e.g., forward-backward relative to the torso).

Fig. 1 Example images of lexical facial expressions along with hand gestures for the signs NEVER, WHO, and WHAT. For NEVER, the signer shakes her head side-to-side slightly, which is a Negative facial expression in ASL. For WHO and WHAT, the signer is furrowing the brows and slightly tilting the head forward, which is a WH-word Question facial expression in ASL.
As discussed in [60], facial expressions in ASL are most
commonly utilized to convey information about entire sen-
tences or phrases, and these classes of facial expressions
are commonly referred to as “syntactic facial expressions.”
While some researchers, e.g. [57], have investigated the iden-
tification of facial expressions that extend across multiple
words to indicate grammatical information, in this paper,
we describe our work on recognizing manual signs which
consist of movements of the hands and facial expression
changes.
In addition to “syntactic” facial expressions that extend
across multiple words in an ASL sentence, there exists an-
other category of facial expressions, which is specifically
relevant to the task of recognizing individual signs: “lexi-
cal facial expressions,” which are considered as a part of
the production of an individual ASL word (see examples in
Fig. 1). Such facial expressions are therefore essential for
the task of sign recognition. For instance, words with neg-
ative semantic polarity, e.g. NONE or NEVER, tend to oc-
cur with a negative facial expression consisting of a slight
head shake and nose wrinkle. In addition, there are specific
ASL signs that almost always occur in a context in which
a specific ASL syntactic facial expression occurs. For in-
stance, some question words, e.g. WHO or WHAT, tend to
co-occur with a syntactic facial expression (brows furrowed,
head tilted forward), which indicates that an entire sentence
is a WH-word Question. Thus, the occurrence of such a fa-
cial expression may be useful evidence to consider when
building a sign-recognition system for such words.
1.2 Motivations and Scope of Contributions
As discussed in Section 2.1, most prior ASL recognition re-
search typically focuses on isolated hand gestures of a re-
stricted vocabulary. In this paper, we propose a 3D multi-
stream framework to recognize a set of grammatical ASL
words in real-time from RGB-D videos, by fusing multi-
modality features including hand gestures, facial expressions,
and body poses from multi-channels (RGB, depth, motion,
and skeleton joints). In an extension to our previous work
[89] and [86], the main contributions of the proposed frame-
work can be summarized as follows:
– We propose a 3D multi-stream framework, using 3D convolutional neural networks, for ASL recognition in RGB-D videos by fusing multiple channels including RGB, depth, motion, and skeleton joints.
– We propose a random temporal augmentation strategy to augment the training data, in order to handle widely diverse videos in relatively small datasets.
– We create a new ASL dataset, ASL-100-RGBD, including multiple modalities (facial movements, hand gestures, and body pose) and multiple channels (RGB, depth, skeleton joints, and HDface), by collaborating with ASL linguistic researchers; this dataset contains annotation of the time duration when each ASL word is performed by the human in the video. The dataset will be released to the public with the publication of this article.
– We further evaluate the proposed framework on recognizing hand gestures on the Chalearn LAP IsoGD dataset [81], which consists of 249 gesture classes in RGB-D videos. The accuracy of our framework is 5.51% higher than the state-of-the-art work in terms of average fusion, using fewer channels (5 channels instead of 12).
2 Related Work
2.1 RGB-D Based ASL Recognition
Sign language (SL) recognition has been studied for three
decades since the first attempt to recognize Japanese SL by
Tamura and Kawasaki in 1988 [77]. The existing SL recog-
nition research can be classified as sensor-based methods in-
cluding data gloves and body trackers to capture and track
hand and body motions [20, 35, 45, 50] and non-intrusive
camera-based methods by applying computer vision tech-
nologies [9, 10, 13, 14, 24, 29, 39, 42–44, 53, 66–69, 75, 83,
85]. While much research in this area focuses on the hands,
there is also some research focusing on linguistic informa-
tion conveyed by the face and head of a human performing
sign language, such as [5, 48, 51, 57]. More details about SL
recognition can be found in these survey papers [19,64].
As cost-effective consumer depth cameras have become
available in recent years, such as RGB-D cameras of Mi-
crosoft Kinect V2 [4], Intel Realsense [2], Orbbec Astra
[3], it has become practical to capture high resolution RGB
videos and depth maps as well as to track a set of skeleton
joints in real time. Compared to traditional 2D RGB images,
RGB-D images provide both photometric and geometric in-
formation. Therefore, recent research work has been moti-
vated to investigate ASL recognition using both RGB and
depth information [6,8,12,32,70,72,84,86, 88,89]. In this ar-
ticle, we briefly summarize ASL recognition methods based
on RGB-D images or videos.
Some early work of SL recognition based on RGB-D
cameras only focused on a very small number of signs from
static images [40, 70, 72]. Pugeault and Bowden proposed a multi-class random forest classification method to recognize 24 static ASL fingerspelling alphabet letters, ignoring the letters j and z (as they involve motion), by combining
both appearance and depth information of handshapes cap-
tured by a Kinect camera [70]. Keskin et al. [40] recognized
24 static handshapes of the ASL alphabet, based on scale in-
variant features extracted from depth images, and then fed
to a Randomized Decision Forest for classification at the
pixel level, where the final recognition label was voted based
on a majority. Ren et al. proposed a modified Finger-Earth Mover's Distance metric to recognize static handshapes for
10 digits captured using a Kinect camera [72].
While these systems only used the static RGB and depth
images, some studies employed the RGB-D videos for ASL
recognition. Zafrulla et al. developed a hidden Markov model
(HMM) to recognize 19 ASL signs collected by a Kinect and
compared the performance with that from colored-glove and
accelerometer sensors [88]. For the Kinect data, they also
compared the system performance between the signer seated
and standing and found that higher accuracy resulted when
the users were standing. Yang developed a hierarchical con-
ditional random field method to recognize 24 manual ASL
signs (seven one-handed and 17 two-handed) from the hand-
shape and motion in RGB-D videos [84]. Lang et al. [49]
presented a HMM framework to recognize 25 signs of Ger-
man Sign Language using depth-camera specific features.
Mehrotra et al. [56] employed a support vector machine
(SVM) classifier to recognize 37 signs of Indian Sign Lan-
guage based on 3D skeleton points captured using a Kinect
camera. Almeida et al. [6] also employed a SVM classifier to
recognize 34 signs of Brazilian Sign Language using hand-
shape, movement and position captured by a Kinect. Jiang
et al. proposed to recognize 34 signs of Chinese Sign Lan-
guage based on the color images and the skeleton joints cap-
tured by a Kinect camera [32]. Recently, Kumar et al. [47]
combined a Kinect camera with a Leap Motion sensor to
recognize 50 signs of Indian Sign Language.
As discussed above, SL consists of hand gestures, facial
expressions, and body poses. However, most existing work
has focused only on hand gestures without combining with
facial expressions and body poses. While a few attempted
to combine hand and face [5, 41, 48, 57, 68, 83], they only
use RGB videos. To the best of our knowledge, we believe
that this is the first work that combines multi-channel RGB-
D videos (RGB and depth) with fusion of multi-modality
features (hand, face, and body) for ASL recognition.
2.2 CNN for Action and Hand Gesture Recognition
Since the work of AlexNet [46] which makes use of the pow-
erful computation ability of GPUs, deep neural networks
(DNNs) have enjoyed a renaissance in various areas of com-
puter vision, such as image classification [17,76], object de-
tection [25,28], image description [16,36], and others. Many
efforts have been made to extend CNNs from the image
to the video domain [21], which is more challenging since
video data are much larger than images; therefore, handling
video data in the limited GPU memory is not tractable. An
intuitive way to extend image-based CNN structures to the
video domain is to perform the fine-tuning and classifica-
tion process on each frame independently, and then conduct
a later fusion, such as average scoring, to predict the action
class of the video [37]. To incorporate temporal information
in the video, [73] introduced a two-stream framework. One
stream was based on RGB images, and the other, on stacked
optical flows. Although that work proposed an innovative
way to learn temporal information using a CNN structure,
in essence, it was still image-based, since the third dimen-
sion of stacked optical flows collapsed immediately after the
first convolutional layer.
To model the sequential information of extracted fea-
tures from different segments of a video, [16] and [87] pro-
posed to input features into Recurrent Neural Network (RNN)
structures, and they achieved good results for action recog-
nition. The former emphasized pooling strategies and how
to fuse different features, while the latter focused on how
to train an end-to-end DNN structure that integrates CNNs
with RNNs. These networks mainly use CNN to extract spa-
tial features, then RNN is applied to extract the temporal in-
formation of the spatial features. 3DCNN was recently pro-
posed to learn the spatio-temporal features with 3D convolu-
tion operations [15,27,31,33,34,71,78], and has been widely
used in video analysis tasks such as video caption and ac-
tion detection. 3DCNN is usually trained with fixed-length
clips (usually 16 frames [27, 78]), and later fusion is per-
formed to obtain the final category of the entire video. Hara
et al. [27] proposed the 3D-ResNet by replacing all the 2D
kernels in 2D-ResNet with 3D convolution operations. With
its advantage of avoiding gradient vanishing and explosion,
the 3D-ResNet outperforms many complex networks.
ASL recognition shares properties with video action recognition; therefore, many networks designed for video action recognition have been applied to this task. Pigou et al. proposed temporal resid-
ual networks for gesture and sign language recognition [68]
and temporal convolutions on top of the features extracted
by 2DCNN for gesture recognition [67]. Huang et al. pro-
posed a Hierarchical Attention Network with Latent Space
(LS-HAN) which eliminates the pre-processing of the tem-
poral segmentation [29]. Pu et al. proposed to employ 3D
residual convolutional network (3D-ResNet) to extract vi-
sual features which are then fed to a stacked dilated convo-
lution network with connectionist temporal classification to
map the visual features into text sentence [69]. Camgoz et
al. attempted to generate spoken language translations from
sign language video [10]. Camgoz et al. proposed SubUNets
for simultaneous hand shape and continuous sign language
recognition [9]. Cui et al. proposed a weakly supervised
framework to train the network from videos with ordered
gloss labels but no exact temporal locations for continuous
sign language recognition [14]. In prior work, our research
team proposed a 3D-FCRNN for ASL recognition by com-
bining the 3DCNN and a fully connected RNN [86].
2.3 Public Camera-based ASL Datasets
As discussed in Section 2.1, technology to recognize ASL
signs from videos could enable new educational tools or as-
sistive technologies for people who are DHH, and there has
been significant prior research on sign language recognition.
However, a limiting factor for much of this research has been
the scarcity of video recordings of sign language that have
been annotated with time interval labels of the words that the
human has performed in the video: For ASL, there have been
some annotated video-based datasets [63] or collections of
motion capture recordings of humans wearing special sen-
sors [54]. Most publicly available datasets, e.g. [22,41], con-
tain general ASL vocabularies from RGB videos and a few
with RGB-D channels.
2D Camera-based ASL databases: The American Sign
Language Linguistic Research Project (ASLLRP) dataset con-
tains video clips of signing from the front and side and in-
cludes a close-up view of the face [63], with annotations
for 19 short narratives (1,002 utterances) and 885 additional
elicited utterances from four Deaf native ASL signers; an-
notation includes: the start and endpoints of each sign, a
unique gloss label for each sign, part of speech, and start
and end points of a range of non-manual behaviors (e.g.,
raised/lowered eyebrows, head position and periodic head
movements, expressions of the nose and mouth) also labeled
with respect to the linguistic information that they convey
Recognizing American Sign Language Manual Signs from RGB-D Videos 5
(b) Randomly sampled eight frames from the video clip of the same ASL sign
(a) Eight Consecutive frames from a video clip of an ASL sign
Fig. 2 Generating representative proxy video by our proposed random temporal augmentation. (a) Eight consecutive frames from a video clip of
an ASL sign. (b) Randomly sampled eight frames from the video clip of the same ASL sign. With the same number of frames, the proxy video
reserves more temporal dynamics of the ASL sign.
(serving to mark, e.g., different sentence types, topics, nega-
tion, etc.). Dreuw et al. [18] produced several subsets from
the ASLLRP dataset as benchmark databases for automatic
recognition of isolated and continuous sign language.
The American Sign Language Lexicon Video Dataset
(ASLLVD) [7] is a large dataset of videos of isolated signs
from ASL. It contains video sequences of about 3,000 dis-
tinct signs, each produced by 1 to 6 native ASL signers
recorded by four cameras under three views: front, side, and
face region, along with annotations of those sequences, in-
cluding start/end frames and class label (i.e., gloss-based
identification) of every sign, as well as hand and face lo-
cations at every frame.
The RVL-SLLL ASL Database [55] consists of three sets
of ASL videos with distinct motion patterns, distinct hand-
shapes, and structured sentences respectively. These videos
were captured from 14 native ASL signers (184 videos per
signer) under different lighting conditions. For annotation,
the videos with distinct motion patterns or distinct hand-
shapes are saved as separate clips. However, there are no detailed annotations for the videos of structured sentences, which limits the usefulness of the database.
RGB-D Camera-based ASL and Gesture Databases:
Recently, some RGB-D databases have been collected for
hand gesture and SL recognition, for ASL or other sign lan-
guages [12,22,66]. Here we only briefly summarize RGB-D
databases for ASL.
The ”Spelling-It-Out” dataset consists of 24 static hand-
shapes of the ASL fingerspelling alphabet, ignoring the let-
ters j and z as they involve motion, from four signers; each
signer repeats 500 samples for each letter in front of a Kinect
camera [70]. The NTU dataset consists of 10 static hand ges-
tures for digits 1 to 10 and was collected from 10 subjects by
a Kinect camera. Each subject performs 10 different poses
with variations in hand orientation, scale, articulation for the
same gesture, and there is a color image and the correspond-
ing depth map for each [72].
The Chalearn LAP IsoGD dataset [81] is a large-scale
hand gesture RGB-D dataset, which is derived from Chalearn
Gesture dataset (CGD 2011) [26]. This dataset consists of
47,933 RGB-D video clips falling into 249 classes of hand
gestures including mudras (Hindu/ Buddhist hand gestures),
Chinese numbers, and diving signals. Although it is not about
ASL recognition, it can be used to learn RGB-D features
from different environment settings. Using the learned fea-
tures as a pretrained model, the fine-tuned ASL recognition
model will be more robust to handle different backgrounds
and scales (e.g. distance variations between Kinect camera
and the signer).
To support our research, we have collected and anno-
tated a novel RGB-D ASL dataset, ASL-100-RGBD, de-
scribed in Section 4, with the following properties:
– 100 ASL signs have been collected, performed by 15 individual signers (often with multiple recordings from each signer).
– The ASL-100-RGBD dataset has been captured using a Kinect V2 camera and contains multiple channels including RGB, depth, skeleton joints, and HDface.
– Each video consists of the 100 ASL words with time-coded annotations produced in collaboration with ASL computational linguistic researchers.
– The 100 ASL words have been strategically selected to support sign recognition technology for ASL education tools (many of these words consist of hand gestures and facial expression changes), with the detailed vocabulary composition described in Section 4.
[Fig. 3 diagram: after proxy video generation, four 3DCNN streams (RGB Body Network, Depth Body Network, RGB Hands Network, RGB Face Network), each composed of Conv 1, Blocks 1–5, GAP, and a Dense (100) layer, are fused to produce the final prediction.]
Fig. 3 The pipeline of the proposed multi-channel multi-modality 3DCNN framework for ASL recognition. The multiple channels contain RGB,
Depth, and Optical flow while the multiple modalities include hand gestures, facial expressions and body poses. While the full size image is used
to represent body pose, to better model hand gestures and the facial expression changes, the regions of hands and face are obtained from the
RGB image based on the location guided by skeleton joints. The whole framework consists of two main components: proxy video generation and
3DCNN modeling. First, proxy videos are generated for each ASL sign by selecting a subset of frames spanning the whole video clip of each ASL
sign, to represent the overall temporal dynamics. Then the generated proxy videos of RGB, Depth, Optical flow, RGB of hands, and RGB of face
are fed into the multi-stream 3DCNN component. The predictions of these networks are weighted to obtain the final results of ASL recognition.
3 The Proposed Method for ASL Recognition
The pipeline of our proposed method is illustrated in Fig. 3.
There are two main components in the framework: random
temporal augmentation to generate proxy videos (which are
representative of the overall temporal dynamics of the video clip
of an ASL sign) and 3DCNN to recognize the class label of
the sign.
3.1 Random Temporal Augmentation for Proxy Video
Generation
The performance of the deep neural network greatly depends
on the amount of the training data. Large-scale training data
and different data augmentation techniques usually are needed
for deep networks to avoid over-fitting. During training, dif-
ferent kinds of data augmentation techniques, such as ran-
dom resizing and random cropping of images, are already
widely applied in 3DCNN training. In order to capture the
overall temporal dynamics, we apply a random temporal
augmentation, to generate a proxy video for each sign video
clip channel, by selecting a subset of frames, which has proved
to be very effective for our proposed framework.
Videos are often redundant in the temporal dimension,
and some consecutive frames are very similar without ob-
servable difference, as shown in Fig. 2 (a), which displays 8 consecutive frames of a video clip of an ASL sign, while the proxy video in Fig. 2 (b) displays the 8 frames selected from the same video clip by random temporal augmentation. With
temporal dynamics. Thus, proxy videos are generated to rep-
resent the overall temporal dynamics for each ASL word.
The process of proxy video generation by randomly sam-
pling is formulated in Eq. (1) below:
S_i = random(⌊N/T⌋) + ⌊N/T⌋ · i,    (1)

where N is the number of frames of a sign video, T is the number of frames sampled from the video, S_i is the i-th sampled frame, and random(⌊N/T⌋) generates one random number in the range [0, ⌊N/T⌋] for every i. To generate the proxy video, each video is uniformly divided into T intervals, and one frame is randomly sampled from every interval. If the total number of frames in a video is less than T, it is padded with the last frame to the length of T. These proxy videos make it feasible to train a deep neural network on the proposed dataset.

Fig. 4 The full list of the 100 ASL words in our “ASL-100-RGBD” dataset under 6 semantic categories. These ASL words have been strategically selected to support sign recognition technology for ASL education tools (many of these words consist of both hand gestures and facial expression changes).
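To make Eq. (1) concrete, the following is a minimal Python/NumPy sketch of the random temporal sampling; the function name and the default of T = 16 frames are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def sample_proxy_indices(num_frames: int, target_len: int = 16) -> np.ndarray:
    """Sample one random frame index from each of `target_len` uniform intervals
    of a video (Eq. 1); pad with the last frame if the video is too short."""
    if num_frames < target_len:
        indices = np.arange(num_frames)
        padding = np.full(target_len - num_frames, num_frames - 1)
        return np.concatenate([indices, padding])
    interval = num_frames // target_len                        # floor(N / T)
    offsets = np.random.randint(0, interval, size=target_len)  # random offset per interval
    return np.arange(target_len) * interval + offsets          # S_i = offset_i + floor(N/T) * i
```

Because a new set of indices is drawn every time the function is called, a different proxy video can be generated for the same clip at every training epoch.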
3.2 3D Convolutional Neural Network
3DCNN was first proposed for video action recognition [31],
and was improved in C3D [78] by using a similar archi-
tecture to VGG [74]. It obtained the state-of-the-art perfor-
mance for several video recognition tasks. The difference
between the 2DCNN and 3DCNN operation is that 3DCNN
has an extra temporal dimension, which can capture the spa-
tial and temporal information between video frames more
effectively.
After the emergence of C3D, many 3DCNN models were
proposed for video action recognition [11], [15], [71]. 3D-ResNet is the 3D version of ResNet, which introduced identity mappings to avoid gradient vanishing and explosion, making the training of very deep convolutional neural networks feasible. Compared to 2D-ResNet, the size of the convolution kernel in 3D-ResNet is w × h × t (where w is the width, h is the height, and t is the temporal dimension of the kernel), while it is w × h in 2D-ResNet. In this paper, 3D-ResNet is chosen as the base network for ASL recognition.
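As a concrete illustration (a minimal sketch assuming a PyTorch implementation, not the exact architecture used in our experiments), the snippet below shows a basic 3D residual block; the only structural difference from its 2D counterpart is the extra temporal dimension t of each convolution kernel:

```python
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """A basic 3D residual block: two 3x3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, t, h, w); each kernel spans w x h x t.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity mapping (shortcut connection)
```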
In order to handle the three important elements of ASL
recognition (hand gesture, facial expression, and body pose),
a hybrid framework is designed including two 3DCNN net-
works: one for full body, to capture the full body move-
ments including hands and face with the inputs of the multi-
channel proxy videos generated from the full images includ-
ing RGB, depth, and optical flow; and another for hand and face, to capture the details of the hands and face, with the inputs of the multi-channel proxy videos generated from the cropped regions of the left hand, right hand, and face. Note that for the Hand-Face network, RGB and depth channels are employed for the hand regions. The optical flow is not employed
since it cannot accurately track the quick and large hand
motions. For the face regions, only the RGB channel is employed, since facial expressions produce relatively little change in depth. The prediction results of the networks are weighted
to obtain the final prediction of each ASL sign.
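A minimal sketch of this late-fusion step is shown below; the stream names and the equal weights are illustrative assumptions rather than the exact values used in our experiments:

```python
import numpy as np

def fuse_predictions(stream_scores: dict, stream_weights: dict) -> int:
    """Weighted average of per-stream class scores (e.g., softmax outputs),
    returning the index of the predicted ASL sign."""
    total = sum(stream_weights.values())
    fused = sum(stream_weights[name] * np.asarray(scores)
                for name, scores in stream_scores.items()) / total
    return int(np.argmax(fused))

# Illustrative usage with hypothetical stream names and equal weights.
scores = {"rgb_body": np.random.rand(100),
          "depth_body": np.random.rand(100),
          "rgb_hands": np.random.rand(100),
          "rgb_face": np.random.rand(100)}
weights = {name: 1.0 for name in scores}
predicted_class = fuse_predictions(scores, weights)
```

In practice, the per-stream weights can be tuned on a validation set so that more reliable streams contribute more to the final decision.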
The optical flow images are calculated by stacking the
x-component, the y-component, and the magnitude of the
flow. Each value in the image is then rescaled to the range [0, 255].
This practice has yielded good performance in other studies
[16, 87]. As observed in the experimental results, by fusing
all the features generated by RGB, optical flow, and depth
images, the performance can be improved, which indicates
that complementary information is provided by different
channels in training deep neural networks.
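A minimal sketch of this flow-to-image conversion is given below, assuming OpenCV's Farnebäck dense optical flow and per-channel min-max rescaling (both assumptions for illustration):

```python
import cv2
import numpy as np

def flow_to_image(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Stack the x-component, y-component, and magnitude of dense optical flow
    into a 3-band uint8 image rescaled to [0, 255]."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)
    bands = []
    for channel in (fx, fy, mag):
        c_min, c_max = channel.min(), channel.max()
        scaled = (channel - c_min) / (c_max - c_min + 1e-8) * 255.0
        bands.append(scaled.astype(np.uint8))
    return np.stack(bands, axis=-1)  # shape (H, W, 3)
```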
4 Proposed ASL Dataset: “ASL-100-RGBD”
As mentioned in Section 2.3, a new dataset has been col-
lected for this research in collaboration with ASL computa-
tional linguistic researchers, from native ASL signers (indi-
viduals who have been using the language since very early
childhood) who performed a word list of 100 ASL signs
(See the full list of ASL words in Fig. 4) by using a Kinect
V2 camera. Participants responded affirmatively to the fol-
lowing screening question: Did you use ASL at home grow-
ing up, or did you attend a school as a very young child
where you used ASL? Participants were provided with a
slide-show presentation that asked them to perform a se-
quence of 100 individual ASL signs, without lowering their
hands between words. Since this new dataset includes 100
signs with RGB and depth data, we refer to it as the “ASL-
100-RGBD” dataset.
During the recording session, a native ASL signer met
the participant and conducted the session: prior research in
ASL computational linguistics has emphasized the impor-
tance of having only native signers present when recording
ASL videos so that the signer does not produce English-
influenced signing [54]. Several videos were recorded of
each of the 15 people, while they signed the 100 ASL signs.
Typically three videos were recorded from each person, to
produce a total collection of 42 videos (each video contains
all the 100 signs) and 4,200 samples of ASL signs.
To facilitate this collection process, we have developed
a recording system based on Kinect 2.0 RGB-D camera to
capture multimodality (facial expressions, hand gestures, and
body poses) from multiple channels of information (RGB
video and depth video) for ASL recognition. The recordings
also include skeleton and HDface information. The video
resolution is 1920 × 1080 pixels for the RGB channel and 512 × 424 pixels for the depth channel, respectively.
The 100 ASL signs in this collection were selected to
strategically support research on sign recognition for ASL
education applications, and the words were chosen based on
vocabulary that is traditionally included in introductory ASL
courses. Specifically, as discussed in [30], our recognition
system must identify a subset of ASL words that relate to a
list of errors often made by students who are learning ASL.
Our proposed educational tool [30] would receive as input a
video of a student who is performing ASL sentences, and
the system would automatically identify whether the stu-
dent’s performance may include one of several dozen errors,
which are common among students learning ASL. As part of
the operation of this system, we require a sign-recognition
component that can identify when a video of a person in-
cludes any of these 100 words during the video (and the
time period of the video when this word occurs). When one of these 100 key words is identified, the system will
consider other properties of the signer’s movements [30], to
determine whether the signer may have made a mistake in
their signing.
For instance, the 100 ASL signs include words related
to questions (e.g. WHO, WHAT), time-phrases (e.g. TO-
DAY, YESTERDAY), negation (e.g. NOT, NEVER), and
other categories of words that relate to key grammar rules
of ASL. A full listing of the words included in this dataset
appears in Fig. 4. Note that there is no one-to-one map-
ping between English words and ASL signs, and some ASL
signs have variations in their performance, e.g. due to ge-
ographic/regional differences or other factors. For this rea-
son, some words in Fig. 4 appear with integers after their
name, e.g. THURSDAY and THURSDAY2, to reflect more
than one variation in how the ASL word may be produced.
For instance, THURSDAY indicates a sign produced by the
signer’s dominant hand in the ”H” alphabet-letter handshape gently circling in space, whereas THURSDAY2 indicates
a sign produced with the signer’s dominant hand quickly
switching from the alphabet-letter handshape of ”T” to ”H”
while held in space in front of the torso. Both are commonly
used ASL signs for the concept of ”Thursday”; they simply
represent two different ASL words that could be used for the
same concept.
As shown in Fig. 4, the words are grouped into 6 semantic categories (Negative, WH Questions, Yes/No Questions,
Time, Pointing, and Conditional), which in some cases sug-
gest particular facial expressions that are likely to co-occur
with these words when they are used in ASL sentences. For
instance, time-related phrases that appear at the beginning
of ASL sentences tend to co-occur with a specific facial ex-
pression (head tilted back slightly and to the side, with eye-
brows raised). Additional details about how detecting words
in these various categories would be useful in the context of
educational software appear in [30].
After the videos were collected from participants, the
videos were analyzed by a team of ASL linguists, who pro-
duced time-coded annotations for each video. The linguists
used a coding scheme in which an English identifier label was used to correspond to each of the ASL words used in the videos, in a consistent manner across the videos. For example, all of the time spans in the videos when the human performed the ASL word “NOT” were labeled with the English string ”NOT” in our linguistic annotation.

Fig. 5 Four sample frames of each channel of an ASL sign from our dataset including RGB, skeleton joints (25 joints for every frame), depth map, basic face features (5 main face components), and HDFace (1,347 points).
Fig. 5 demonstrates several frames of each channel of an
ASL sign from our dataset including RGB, skeleton joints
(25 joints for every frame), depth map, basic face features
(5 main face components), and HDFace (1,347 points). With
the publication of this article, the ASL-100-RGBD dataset (see footnote 1) will
be released to the research community.
5 Experiments and Discussions
In this section, extensive experiments are conducted to eval-
uate the proposed approach on the newly collected “ASL-
100-RGBD” dataset and the Chalearn LAP IsoGD dataset
[81].
1 Some example videos can be found at our research website
http://media-lab.ccny.cuny.edu/wordpress/datecode/
5.1 Implementation Details
The same 3D-ResNet architecture is employed for all experi-
ments. Different channels and modalities are fed to the net-
work as input. The input channels are RGB, Depth, RG-
Bflow (i.e. Optical flow of RGB images), and Depthflow
(i.e. Optical flow of depth images) of modalities including
hands, face, and full body. The fusion of different channels is studied and compared.
Our proposed model is trained in PyTorch on four Titan
X GPUs. To avoid over-fitting, the pretrained model from
Kinetics or Chalearn dataset is employed and the follow-
ing data augmentation techniques were used: random cropping (using a patch size of 112 × 112) and random rotation (with a random number of degrees in the range [−10, 10]). The models are then fine-tuned for 50 epochs with an initial learning rate of λ = 3 × 10⁻³, reduced by a factor of 10 after every 25 epochs.
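A minimal PyTorch sketch of this fine-tuning schedule follows; the backbone definition, the use of SGD, and the momentum value are illustrative assumptions:

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical 3D backbone; in practice, load the 3D-ResNet pretrained on
# Kinetics or Chalearn and replace its final layer with a 100-way classifier.
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(64, 100),
)

optimizer = optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
# Reduce the learning rate by a factor of 10 every 25 epochs, for 50 epochs total.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    # ... iterate over batches of proxy videos, compute criterion(model(x), y),
    #     backpropagate, and call optimizer.step() ...
    scheduler.step()
```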
To apply the 3D-ResNet model, pre-trained on 3-band RGB images, to one-channel depth images or optical flow images, the depth images are simply converted to 3 bands in the RGB image format. For the optical flow images, the pre-trained 3D-ResNet model takes the x-component, the y-component, and the magnitude of the flow as the R, G, and B bands of the RGB format.
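A minimal sketch of the depth-map conversion described above (the min-max normalization to [0, 255] is an assumption for illustration):

```python
import numpy as np

def depth_to_rgb_format(depth: np.ndarray) -> np.ndarray:
    """Replicate a single-channel depth map into 3 bands so it matches the
    RGB input format expected by the pretrained 3D-ResNet."""
    d_min, d_max = depth.min(), depth.max()
    scaled = (depth - d_min) / (d_max - d_min + 1e-8) * 255.0
    band = scaled.astype(np.uint8)
    return np.stack([band, band, band], axis=-1)  # shape (H, W, 3)
```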
5.2 Experiments on ASL-100-RGBD
To prepare the training and testing for evaluation of the pro-
posed method on “ASL-100-RGBD” dataset, we first ex-
tracted the video clips for each ASL sign. We use 3,250 ASL clips for training (75% of the data) and the remaining 25% of the clips for testing. To ensure a subject-independent evaluation, no signer appears in both the training and testing sets. To augment the data, a new 16-frame proxy video is generated from each video by selecting a different subset of frames for each epoch during the training
phase.
5.2.1 Effects of Data Augmentations
The training dataset which contains 3,250 ASL video clips
of 100 ASL manual signs is relatively small for 3DCNN
training and could easily cause an over-fitting problem. In
order to extract more representative temporal dynamics as
well as avoid over-fitting, a random temporal augmentation
technique is applied to generate proxy videos (a new proxy
video for each epoch) for each ASL clip. The ASL recogni-
tion results of using the proposed proxy video (16 frames
per video) are compared with the traditional method (us-
ing the same number of consecutive frames). The network,
3DResNet-34, does not converge when trained with 16 consecutive frames, while the network trained with proxy videos obtained 68.4% on the testing dataset. This is likely because most of the movement in these videos comes from the hands, and consecutive frames cannot effectively represent the temporal and spatial information. Therefore, the network could not distinguish the clips based on only 16 consecutive
frames. We also evaluate the effect of random cropping (us-
ing a patch size of 112 × 112) and random rotation (with a random number of degrees in the range [−10, 10]).
Table 1 lists the effects of different data augmentation techniques on the performance of recognizing 100 ASL words using only the RGB channel. With the proxy videos, the
3DCNN model obtains 68.4% accuracy on the testing data
for recognizing 100 ASL signs. By adding the random crop,
the performance is improved by 4.4% and adding the ran-
dom rotation further improved the performance to 75.9%. In
the following experiments, proxy videos together with ran-
dom crop and random rotation are employed to augment the
data.
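A minimal sketch of the two spatial augmentations, applied here to a single frame with OpenCV (in practice the same random crop position and rotation angle would be shared across all frames of a proxy video; the border handling and the assumption that frames are pre-resized larger than the crop are illustrative):

```python
import numpy as np
import cv2

def augment_frame(frame: np.ndarray, crop_size: int = 112, max_deg: float = 10.0) -> np.ndarray:
    """Apply a random rotation in [-max_deg, max_deg] followed by a random
    crop of crop_size x crop_size to one video frame."""
    h, w = frame.shape[:2]
    angle = np.random.uniform(-max_deg, max_deg)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(frame, matrix, (w, h))
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    return rotated[top:top + crop_size, left:left + crop_size]
```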
Table 1 The comparison of the performance of different data augmen-
tation methods on only RGB channel with 16 frames for recogniz-
ing 100 ASL manual signs. All the models are pretrained on Kinet-
ics and finetuned on ASL-100-RGBD dataset. The best performance
is achieved with the random proxy video, random crop, and random
rotation.
Random Proxy Video      ✗                 ✓        ✓        ✓
Random Crop             ✗                 ✗        ✓        ✓
Random Rotation         ✗                 ✗        ✗        ✓
Performance             Not converging    68.4%    72.8%    75.9%
5.2.2 Effects of Network Architectures
In this experiment, the ASL recognition results of different numbers of layers (18, 34, 50, and 101) for 3DResNet are compared on full RGB, optical flow, and depth im-
ages. As shown in Table 2, the performance of 3DResNet-
18, 3DResNet-50, and 3DResNet-101 achieve comparable
results on the RGB channel. However, the performance on the optical flow and depth channels is much lower than that of the RGB channel because the network has been pre-trained on the Kinetics dataset, which contains only RGB images. As
shown in Table 2, 3DResNet-34 obtained the best perfor-
mance for all RGB, optical flow, and depth channels. Hence,
3DResNet-34 is chosen for all the subsequent experiments.
Table 2 The effects of number of layers for 3DResNet with 16 frames
on RGB, optical flow, and depth channels. All the models are pre-
trained on Kinetics and finetuned on ASL-100-RGBD dataset.
Network RGB (%) Optical Flow (%) Depth (%)
3DResNet-18 73.2 61.9 65.0
3DResNet-34 75.9 62.8 66.5
3DResNet-50 72.3 55.4 62.0
3DResNet-101 72.5 55.0 61.5
Fig. 6 Example images of three datasets. ASL-100-RGBD: various
ASL signs. Kinetics dataset: consisting of diverse human actions, involving different parts of the body. Chalearn IsoGD: various hand gestures including mudras (Hindu/Buddhist hand gestures) and diving
signals.
5.2.3 Effects of Pre-trained Models
To evaluate the effects of pre-trained models, we fine-tune
3DResNet-34 with pretrained models from the Kinetics [38] and the Chalearn LAP IsoGD [81] datasets, respectively. The Kinetics dataset consists of RGB videos of diverse human actions which involve different parts of the body, while the Chalearn LAP IsoGD dataset contains both RGB and depth videos of
various hand gestures including mudras (Hindu/ Buddhist
hand gestures), Chinese numbers and diving signals, as shown
in Fig. 6.
The results are shown in Table 3. The temporal duration
is fixed to 16 and the channels are RGB, Depth, and RG-
Bflow. In all channels, the performance using the pretrained
models from Chalearn dataset is better than pretrained mod-
els from Kinetics dataset. This is probably because all the
videos in Chalearn dataset are focused on hand gestures and
the network trained on this dataset can learn prior knowl-
edge of hand gestures. The Kinetics dataset consists of gen-
eral videos from YouTube and the network focuses on the
prior knowledge of motions. Therefore, for each channel the
pretrained model on the same channel of Chalearn dataset is
used in the subsequent experiments.
Table 3 The comparison of the performance of recognizing 100 ASL
words on 3DResNet-34 with different pretrained models.
Channels Kinetics (%) Chalearn (%)
RGB 75.9 76.38
Depth 66.5 68.18
RGB Flow 62.8 66.79
5.2.4 Effects of Temporal Duration of Proxy Videos
We study the effects of temporal duration (i.e. the number of frames used in proxy videos) by finetuning 3DResNet-34 on the ASL-100-RGBD dataset with proxy videos of 16, 32, and 64 frames, respectively. Note that the same temporal duration is also used to train the corresponding pre-trained model on the Chalearn dataset. Results are shown in Table 4. The network with 64 frames achieves the best performance. Therefore,
3D-ResNet-34 with 64 frames is used in all the following
experiments.
Table 4 The comparison of the performance of networks with different
temporal duration (i.e. number of frames used in proxy videos). All the
models are pretrained on Chalearn dataset and finetuned on ASL-100-
RGBD dataset by using same temporal duration.
Channel 16 frames (%) 32 frames (%) 64 frames (%)
RGB 76.38 80.73 87.83
Depth 68.18 74.21 81.93
RGB Flow 66.79 71.74 80.51
5.2.5 Effects of Different Input Channels
In this section, we examine the fusion results of different in-
put channels. The RGB channel provides global spatial and
temporal appearance information, the depth channel provides
the distance information, and the optical flow channel cap-
tures the motion information. The network is finetuned on
the three input channels respectively. The average fusion is
obtained by weighting the predicted results.
Table 5 shows the performance of ASL recognition on
ASL-100-RGBD dataset for each input channel and differ-
ent fusions. While the RGB channel alone achieves 87.83%, fusing it with optical flow boosts the performance to 89.02%. With the fusion of all three channels (RGB, Optical flow, and Depth), the performance is further improved to 89.91%. This indicates that the depth and optical flow
channels contain complementary information to RGB chan-
nel for ASL recognition.
Table 5 The comparison of the performance of networks with differ-
ent input channels and their fusions. All the models are pretrained on
Chalearn dataset and finetuned on ASL-100-RGBD dataset with 64
frames.
Channel(s)                        Performance
RGB                               87.83%
Depth                             81.93%
Optical Flow                      80.51%
RGB + Depth + Optical Flow        89.91%
RGB + Optical Flow                89.02%
Further two-channel fusion        89.71%
5.2.6 Effects of Different Modalities
To attain further insight into the features learned by the 3DCNN model for the RGB channel, in Fig. 7 we visualize examples of attention maps from the fifth convolution layer, generated on our test dataset by the trained RGB 3DCNN model for ASL recognition. These attention maps are computed by averaging the magnitude of the activations of the convolution layer, which reflects the attention of the network. The attention maps show that the model mostly focuses on the hands and face of the signer during the ASL recognition process.
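A minimal sketch of how such an attention map can be obtained from a 3DCNN feature map (the upsampling and normalization steps are assumptions for visualization):

```python
import torch
import torch.nn.functional as F

def attention_map(features: torch.Tensor, out_size: tuple) -> torch.Tensor:
    """Average the magnitude of activations over the channel dimension of a
    3DCNN feature map (batch, channels, t, h, w) and upsample to frame size."""
    attn = features.abs().mean(dim=1, keepdim=True)  # (batch, 1, t, h, w)
    attn = F.interpolate(attn, size=out_size, mode="trilinear", align_corners=False)
    # Normalize to [0, 1] for visualization as a heat map.
    attn = (attn - attn.amin()) / (attn.amax() - attn.amin() + 1e-8)
    return attn.squeeze(1)
```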
Fig. 7 Example RGB images and their corresponding attention maps from the fifth convolution layer of the 3DResNet-34 on our ASL recognition test dataset, in which the hands and face receive most of the attention.
Hence, we conduct experiments to analyze the effects of
each modality (hand gestures, facial expression, and body
poses) with the RGB channel. As shown in Fig. 3, the hand
regions and the face regions are obtained from the RGB im-
age based on the location guided by skeleton joints. The per-
formance of each modality and their fusions are summarized
in Table 6.
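A minimal sketch of this skeleton-guided cropping (the joint used and the crop size are illustrative assumptions):

```python
import numpy as np

def crop_region(frame: np.ndarray, joint_xy: tuple, size: int = 128) -> np.ndarray:
    """Crop a size x size patch centered on a skeleton joint (e.g., a wrist
    joint for a hand region), clamping the window to the image borders."""
    h, w = frame.shape[:2]
    x, y = int(joint_xy[0]), int(joint_xy[1])
    half = size // 2
    left = min(max(x - half, 0), max(w - size, 0))
    top = min(max(y - half, 0), max(h - size, 0))
    return frame[top:top + size, left:left + size]
```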
Table 6 The comparison of the performance of different modalities
and their fusions. All the models are pretrained on Chalearn dataset
and finetuned on ASL-100-RGBD dataset with 64 frames.
Modalities               Performance
Body                     87.83%
Hand                     80.9%
Body + Hand              89.81%
Body + Hand + Face       91.5%
In addition to the accuracy of ASL sign recognition, we
further analyzed the accuracy of the six categories (see Fig.
4 for details) for each modality and their combinations in
Table 7. For the categories that involve many facial expres-
sions, such as Question (Yes/No) and Negative, the accuracy of the hand modality is improved by more than 15% after fusion with the face modality. For the Conditional category, which utilizes more subtle facial expressions, the accuracy of the hand modality is not improved after fusion with the face modality.
Table 7 The performance (%) of different modalities and their fusions
on six categories listed in Fig. 4: Conditional (Cond), Negative (Neg),
Pointing (Point), Question (WH), Yes/No Question (Y/N) and Time.
The last column is the accuracy (%) for ASL signs.
Modalities Cond Neg Point WH Y/N Time Acc
Hand 90.0 78.1 68.4 84.3 68.4 81.4 80.9
Body 100.0 87.4 84.2 88.0 89.5 87.6 87.83
Body+Hand 90.9 86.6 89.5 88.7 94.7 90.2 89.81
Body+Hand+Face 90.9 93.3 84.2 90.6 84.2 91.8 91.5
5.2.7 Fusions of Different Channels and Modalities
The fusion results of different input channels and modali-
ties on ASL-100-RGBD dataset are shown in Table 8. The
experiments are based on 3DResNet-34 with 64 frames, pre-
trained on the Chalearn dataset. Among all the models, the fusion of RGB, Depth, RGB of Hands, and RGB of Face achieves the best performance with 92.88% accuracy. Adding RGBflow to this combination results in 92.48% accuracy, which is comparable but not improved, since the channels contain redundant information.
Table 8 Performance of 3DResNet-34 with 64 frames for the fusion of different channels and modalities. The channels considered are RGB, Depth, RGBflow, RGB of Hands, and RGB of Face; the four evaluated fusions achieve 91.19%, 92.48%, 92.48%, and 92.88%, where the best result (92.88%) corresponds to fusing RGB, Depth, RGB of Hands, and RGB of Face.
5.3 Experiments on Chalearn LAP IsoGD dataset
5.3.1 Effects of Network Architectures
The 3D-ResNet is pre-trained on Kinetics [38] for all the
experiments in this section. To find the best network archi-
tecture for Chalearn dataset, the parameters of 3D-ResNet
are studied on RGB videos. The results are shown in Table
9. By changing the number of layers to 18, 34, and 50 while fixing the temporal duration to 32, ResNet-34 achieved the best accuracy.
Table 9 Ablation study of number of layers of the network on RGB
videos of Chalearn Dataset.
Network Temporal Duration Accuracy
ResNet-18 32 52.69%
ResNet-34 32 56.28%
ResNet-50 32 54.57%
We also examined the performance of ResNet-34 by changing the temporal duration to 16, 32, and 64 frames. Our results indicate that ResNet-34 with 64 frames is the best architecture for the Chalearn dataset, as shown in Table 10.
Table 10 Ablation study of temporal duration on RGB videos of
Chalearn Dataset.
Network Temporal Duration Accuracy
ResNet-34 16 45.00%
ResNet-34 32 56.28%
ResNet-34 64 58.32%
5.3.2 Effects of Different Channels and Modalities
We evaluate the effects of different channels including RGB,
RGB flow, Depth, and Depth flow. Because the Chalearn
dataset is designed for hand gesture recognition, we fur-
ther analyze the effects of different hands (left and right), as
well as the whole body. We develop a method to distinguish
left and right hands in Chalearn Isolated Gesture dataset,
and will release the coordinates of hands (distinguished be-
tween right and left hands) with the publication of this arti-
cle. Since the Chalearn dataset is collected for recognizing
hand gestures, here, the face channel is not employed.
We train 12 3D-ResNet-34 networks with 64 frames using different combinations of channels and modalities, and show the results in Table 11. The accuracy of the right hand is significantly higher than that of the left hand. The reason is that for most of the gestures in the Chalearn dataset, the right hand is dominant and the left hand does not move much.
Table 11 Performance of 3D-ResNet-34 with 64 frames on Chalearn
Dataset for different channels and modalities.
Channel Global Channel (%) Left Hand (%) Right Hand (%)
RGB 58.32 18.01 48.58
Depth 63.16 19.43 54.15
RGB Flow 60.26 21.97 48.79
Depth Flow 55.37 20.28 47.07
5.3.3 Effects of Fusions on different channels and
Modalities
Here we analyze the effects of average fusion on different
channels and modalities. The results are shown in Table 12.
Using only the RGB and Depth channels, the accuracy is 67.58%, which is improved to 69.97% by adding RGB flow. We ob-
serve that among all different triplets of channels, Right Hand
RGB + Depth + RGBflow has the highest accuracy at 73.32%.
By applying the average fusion on four channels (RGB + RGBflow + Right Hand RGB + Right Hand Depth), our model achieves an accuracy of about 75.88%, which outperforms the average fusion results of all previous work on the Chalearn dataset. In the state-of-the-art work of [61], the accuracy of average fusion is 71.93% for 7 channels and 70.37% for 12 channels, respectively.
Finally, the average fusion of all global channels (RGB,
RGB flow, Depth, Depth flow) and Right hand channels (
Right hand RGB, Right hand RGB flow, Right hand Depth,
Right hand Depth flow) resulted in 76.04% accuracy and the
accuracy of 12 channels together resulted in 75.68%. This
means that the 12 channels contain redundant information,
and adding more channels does not necessarily improve the
results.
Table 12 Performance of 3DResNet-34 with 64 frames for the fusion of different channels and modalities on the Chalearn dataset.
Fused channels                                                Performance
RGB + Depth                                                   67.58%
RGB + Depth + RGBflow                                         69.97%
Depth + RGBflow + RGB of Right Hand                           73.32%
Further fusion of the listed channels                         75.53%
RGB + RGBflow + RGB of Right Hand + Depth of Right Hand       75.88%
5.3.4 Comparison with the State of the Art
Our framework achieves accuracies of 75.88% and 76.04% from the fusion of 5 and 8 channels, respectively, on the Chalearn IsoGD dataset. Table 13 lists the state-of-the-art results from the Chalearn IsoGD competition 2017 as well as a recent paper, FOANet [61]. As shown in the table, in terms of Average Fusion, our framework achieves around 6% higher accuracy than the state-of-the-art methods.
Table 13 Comparison with State-of-the-art Results on Chalearn
IsoGD Dataset.
Framework Accuracy on Test Set (%)
Our Results 76.04
FOANet (Average Fusion) [61] 70.37
Miao et al. (ASU) [58] 67.71
SYSU-IEEE 67.02
Lostoy 65.97
Wang et al. (AMRL) [82] 65.59
Zhang et al. (XDETVP) [90] 60.47
It is worth noting that FOANet [61] reported the accu-
racy of 82.07% by applying Sparse Fusion on the softmax
scores of 12 channels (combinations of right hand, left hand,
and whole body while each has 4 channels of RGB, Depth,
RGBflow and Depthflow). The purpose of using sparse fu-
sion is to learn which channels are important for each ges-
ture. The accuracy of the FOANet framework using average fusion is 70.37%, which is around 6% lower than our results and nearly 12% lower than the accuracy of sparse fusion.
While the authors of FOANet [61] had reported a 12% boost
from using sparse fusion in their original experiments, our
experiments do not reveal such a boost when implementing
a system following the technical details provided in [61].
Table 14 lists the accuracy on individual channels of our
network and FOANet [61]. In this table, the values inside the parentheses represent the accuracy of FOANet. As shown in
the table, in the Global channel, our framework outperforms
FOANet in all the four channels by 10% to 25%. Also, for
the RGB of the Right Hand, we obtain a comparable accuracy (48%) to FOANet. However, FOANet outperforms our results in the Right Hand for Depth, RGBflow, and Depthflow by nearly 10%. From our experiments, the performance
of ”Global” channels (whole body) in general is superior to
the Local channels (Right/ Left Hand) because the Global
channels include more information. By using the similar ar-
chitecture, FOANet reported 64% accuracy from Depth of
Right Hand and 38% from Depth of the entire frame. In-
stead, our framework achieves more consistent results. For
example, in our framework the accuracy of Depth channel
is higher than RGB and RGBflow for both Global and Right
Hand, while the accuracy in FOANet for Depth and RGB
are almost the same in the Global channel (around 40%) but
very different in the Right Hand channel (17% difference.)
Table 14 The accuracy (%) of 12 channels on the test set of the Chalearn IsoGD Dataset: comparison between our framework and FOANet [61]. The values inside the parentheses belong to FOANet.
Channel      Global (%)      Left Hand (%)   Right Hand (%)
RGB          58.32 (41.27)   18.01 (16.63)   48.58 (47.41)
Depth        63.16 (38.50)   19.43 (24.06)   54.15 (64.44)
RGB Flow     60.26 (50.96)   21.97 (24.02)   48.79 (59.69)
Depth Flow   55.37 (42.02)   20.28 (22.71)   47.07 (58.79)
6 Conclusion
In this paper, we have proposed a 3DCNN-based multi-channel and multi-modality framework, which learns complementary information and embeds the temporal dynamics of videos to recognize ASL manual signs from RGB-D videos. To validate our proposed method, we collaborated with ASL experts to collect an ASL dataset of 100 manual signs including both hand gestures and facial expressions, with full annotation of the word labels and temporal boundaries (starting and ending points). The experimental results have demonstrated that fusing multiple channels in our proposed framework improves the accuracy of recognizing ASL signs. This technology for identifying the appearance of specific ASL words has valuable applications for technologies that can benefit people who are DHH [9, 42, 43, 48, 52, 65, 68]. As an additional contribution, our “ASL-100-RGBD” dataset will be released to enable other members of
the research community to use this resource for training or evaluation of models for ASL recognition. The effectiveness of the proposed framework is also evaluated on the Chalearn IsoGD Dataset. Our method achieves 75.88% accuracy using only 5 channels, which is 5.51% higher than the state-of-the-art work using 12 channels in terms of average fusion.
Acknowledgements This material is based upon work supported by
the National Science Foundation under award numbers 1400802, 1400810,
and 1462280.
References
1. American deaf and hard of hearing statistics.
https://www.nidcd.nih.gov/health/statistics/quick-statistics-
hearing
2. Intel realsense technology: Observe the world in 3d.
https://www.intel.com/content/www/us/en/architecture-and-
technology/realsense-overview.html (2018)
3. Orbbec astra. https://orbbec3d.com/product-astra/ (2018)
4. Set up kinect for windows v2 or an xbox kinect sensor with kinect
adapter for windows. https://support.xbox.com/en-US/xbox-on-
windows/accessories/kinect-for-windows-v2-setup (2018)
5. von Agris, U., Knorr, M., Kraiss, K.F.: The significance of facial
features for automatic sign language recognition. In: Proceedings
of IEEE International Conference on Automatic Face & Gesture
Recognition (2008)
6. Almeida, S.G.M., Guimarães, F.G., Ramírez, J.: Feature extrac-
tion in brazilian sign language recognition based on phonological
structure and using rgb-d sensors. Expert Systems with Applica-
tions 41(16), 7259–7271 (2014)
7. Athitsos, V., Neidle, C., Sclaroff, S., Nash, J., Stefan, A., Yuan,
Q., Thangali, A.: The asl lexicon video dataset. In: Proceedings
of CVPR 2008 Workshop on Human Communicative Behaviour
Analysis. IEEE (2008)
8. Buehler, P., Everingham, M., Huttenlocher, D.P., Zisserman, A.:
Upper body detection and tracking in extended signing sequences.
International journal of computer vision 95(2), 180 (2011)
9. Camgöz, N.C., Hadfield, S., Koller, O., Bowden, R.: Subunets:
End-to-end hand shape and continuous sign language recognition.
In: ICCV, vol. 1 (2017)
10. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neu-
ral sign language translation. CVPR 2018 Proceedings (2018)
11. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new
model and the kinetics dataset. In: Computer Vision and Pattern
Recognition (CVPR), 2017 IEEE Conference on, pp. 4724–4733.
IEEE (2017)
12. Chai, X., Li, G., Lin, Y., Xu, Z., Tang, Y., Chen, X., Zhou, M.: Sign
language recognition and translation with kinect. In: Proceedings
of IEEE International Conference on Automatic Face and Gesture
Recognition (2013)
13. Charles, J., Pfister, T., Everingham, M., Zisserman, A.: Automatic
and efficient human pose estimation for sign language videos. In-
ternational Journal of Computer Vision 110(1), 70–90 (2014)
14. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural net-
works for continuous sign language recognition by staged opti-
mization. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2017)
15. Diba, A., Fayyaz, M., Sharma, V., Karami, A.H., Mahdi Arzani,
M., Yousefzadeh, R., Van Gool, L.: Temporal 3D ConvNets:
New Architecture and Transfer Learning for Video Classification.
ArXiv e-prints (2017)
16. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M.,
Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent con-
volutional networks for visual recognition and description. In:
Proceedings of the IEEE conference on Computer Vision and Pat-
tern Recognition, pp. 2625–2634 (2015)
17. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng,
E., Darrell, T.: Decaf: A deep convolutional activation feature for
generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
18. Dreuw, P., Forster, J., Ney, H.: Tracking benchmark databases for
video-based sign language recognition. In: Proc. ECCV Interna-
tional Workshop on Sign, Gesture, and Activity (2010)
19. Er-Rady, A., Thami, R.O.H., Faizi, R., Housni, H.: Automatic sign
language recognition: A survey. In: Proceedings of the 3rd In-
ternational Conference on Advanced Technologies for Signal and
Image Processing (2017)
20. Fang, G., Gao, W., Zhao, D.: Large-vocabulary continuous sign
language recognition based on transition-movement models. IEEE
Transactions on Systems, Man, and Cybernetics - Part A: Systems
and Humans 37(1) (2007)
21. Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., Tuytelaars, T.:
Rank pooling for action recognition. IEEE transactions on Pattern
Analysis and Machine Intelligence 39(4), 773–787 (2017)
22. Forster, J., Schmidt, C., Hoyoux, T., Koller, O., Zelle, U., Piater,
J.H., Ney, H.: Rwth-phoenix-weather: A large vocabulary sign
language recognition and translation corpus. In: LREC, pp. 3785–
3789 (2012)
23. Furman, N., Goldberg, D., Lusin, N.: Enrollments in
languages other than english in united states institu-
tions of higher education, fall 2010. Retrieved from
http://www.mla.org/2009 enrollmentsurvey (2010)
24. Gattupalli, S., Ghaderi, A., Athitsos, V.: Evaluation of deep learn-
ing based pose estimation for sign language recognition. In:
Proceedings of the 9th ACM International Conference on Perva-
sive Technologies Related to Assistive Environments, p. 12. ACM
(2016)
25. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hier-
archies for accurate object detection and semantic segmentation.
In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE
Conference on, pp. 580–587. IEEE (2014)
26. Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H.J.: The
chalearn gesture dataset (cgd 2011). Machine Vision and Appli-
cations 25(8), 1929–1951 (2014)
27. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3d cnns re-
trace the history of 2d cnns and imagenet? In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 6546–6555 (2018)
28. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in
deep convolutional networks for visual recognition. In: Computer
Vision–ECCV 2014, pp. 346–361. Springer (2014)
29. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based
sign language recognition without temporal segmentation. arXiv
preprint arXiv:1801.10111 (2018)
30. Huenerfauth, M., Gale, E., Penly, B., Pillutla, S., Willard, M., Har-
iharan, D.: Evaluation of language feedback methods for student
videos of american sign language. ACM Transactions on Acces-
sible Computing (TACCESS) 10(1), 2 (2017)
31. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks
for human action recognition. IEEE transactions on pattern anal-
ysis and machine intelligence 35(1), 221–231 (2013)
32. Jiang, Y., Tao, J., Ye, W., Wang, W., Ye, Z.: An isolated sign lan-
guage recognition system using rgb-d sensor with sparse coding.
In: Proceedings of IEEE 17th International Conference on Com-
putational Science and Engineering (2014)
33. Jing, L., Yang, X., Tian, Y.: Video you only look once: Overall
temporal convolutions for action recognition. Journal of Visual
Communication and Image Representation 52, 58–65 (2018)
34. Jing, L., Ye, Y., Yang, X., Tian, Y.: 3d convolutional neural net-
work with multi-model framework for action recognition. In: 2017
IEEE International Conference on Image Processing (ICIP), pp.
1837–1841. IEEE (2017)
35. Kadous, M.: Machine recognition of auslan signs using power-
gloves: towards large-lexicon recognition of sign language. In:
Proceedings of the Workshop on the Integration of Gesture in Lan-
guage and Speech, pp. 165–174 (1996)
36. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for
generating image descriptions. arXiv preprint arXiv:1412.2306
(2014)
37. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R.,
Fei-Fei, L.: Large-scale video classification with convolutional
neural networks. In: CVPR (2014)
38. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vi-
jayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P.,
et al.: The kinetics human action video dataset. arXiv preprint
arXiv:1705.06950 (2017)
39. Kelly, D., McDonald, J., Markham, C.: A person independent sys-
tem for recognition of hand postures used in sign language. Pattern
Recognition Letters 31(11), 1359–1368 (2010)
40. Keskin, C., Kra, F., Kara, Y., Akarun, L.: Hand pose estimation
and hand shape classification using multi-layered randomized de-
cision forests. In: In Proceedings of the European Conference on
Computer Vision, pp. 852–863 (2012)
41. Koller, O., Forster, J., Ney, H.: Continuous sign language recog-
nition: Towards large vocabulary statistical recognition systems
handling multiple signers. Computer Vision and Image Under-
standing 141, 108–125 (2015)
42. Koller, O., Ney, H., Bowden, R.: Deep learning of mouth shapes
for sign language. In: Proceedings of the IEEE International Con-
ference on Computer Vision Workshops, pp. 85–91 (2015)
43. Koller, O., Ney, H., Bowden, R.: Deep hand: How to train a cnn on
1 million hand images when your data is continuous and weakly
labelled. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3793–3802 (2016)
44. Koller, O., Zargaran, S., Ney, H., Bowden, R.: Deep sign: Enabling
robust statistical continuous sign language recognition via hybrid
cnn-hmms. International Journal of Computer Vision 126(12),
1311–1325 (2018)
45. Kong, W., Ranganath, S.: Towards subject independent continuous
sign language recognition: A segment and merge approach. Pat-
tern Recognition 47(3), 1294–1308 (2014)
46. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classifica-
tion with deep convolutional neural networks. In: Advances in
Neural Information Processing Systems, pp. 1097–1105 (2012)
47. Kumar, P., Gauba, H., Roy, P.P., Dogra, D.P.: A multimodal frame-
work for sensor based sign language recognition. Neurocomputing
259, 21–38 (2017)
48. Kumar, P., Roy, P.P., Dogra, D.P.: Independent bayesian classifier
combination based sign language recognition using facial expres-
sion. Information Sciences 428, 30–48 (2018)
49. Lang, S., Block, M., Rojas, R.: Sign language recognition using
kinect. In: In Proceedings of International Conference on Artificial
Intelligence and Soft Computing, pp. 394–402 (2012)
50. Liang, R.H., Ouhyoung, M.: A real-time continuous gesture
recognition system for sign language. In: Proceedings of the Third
IEEE International Conference on Automatic Face and Gesture
Recognition, pp. 558–567 (1998)
51. Liu, J., Liu, B., Zhang, S., Yang, F., Yang, P., Metaxas, D.N., Nei-
dle, C.: Recognizing eyebrow and periodic head gestures using
crfs for non-manual grammatical marker detection in asl. In: Proc.
of the 10th IEEE International Conference and Workshops on Au-
tomatic Face and Gesture Recognition (FG) (2013)
52. Liu, W., Fan, Y., Li, Z., Zhang, Z.: Rgbd video based human hand
trajectory tracking and gesture recognition system. Mathematical
Problems in Engineering 2015 (2015)
53. Liu, Z., Huang, F., Tang, G.W.L., Sze, F.Y.B., Qin, J., Wang, X.,
Xu, Q.: Real-time sign language recognition with guided deep
convolutional neural networks. In: Proceedings of the 2016 Sym-
posium on Spatial User Interaction, pp. 187–187. ACM (2016)
54. Lu, P., Huenerfauth, M.: Cuny american sign language motion-
capture corpus: first release. In: Proceedings of the 5th Workshop
on the Representation and Processing of Sign Languages: Interac-
tions between Corpus and Lexicon, The 8th International Confer-
ence on Language Resources and Evaluation (LREC 2012), Istan-
bul, Turkey (2012)
55. Martínez, A.M., Wilbur, R.B., Shay, R., Kak, A.C.: The rvl-slll asl
database. In: Proc. of IEEE International Conference Multimodal
Interfaces (2002)
56. Mehrotra, K., Godbole, A., Belhe, S.: Indian sign language recog-
nition using kinect sensor. In: In Proceedings of the International
Conference Image Analysis and Recognition, pp. 528–535 (2015)
57. Metaxas, D., Liu, B., Yang, F., Yang, P., Michael, N., Neidle, C.:
Recognition of nonmanual markers in asl using non-parametric
adaptive 2d-3d face tracking. In: Proc. of the Int. Conf. on Lan-
guage Resources and Evaluation (LREC), European Language Re-
sources Association (2012)
58. Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., Cao, X.,
Liu, Z., Chai, X., Liu, Z., et al.: Multimodal gesture recognition
based on the resc3d network. In: ICCV Workshops, pp. 3047–
3055 (2017)
59. Mitchell, R.E., Young, T.A., Bachleda, B., Karchmer, M.A.: How
many people use asl in the united states? why estimates need up-
dating. Sign Language Studies 6(3), 306–335 (2006)
60. Mulrooney, K.: American Sign Language Demystified, Hard Stuff
Made Easy. McGraw Hill (2010)
61. Narayana, P., Beveridge, J.R., Draper, B.A.: Gesture recognition:
Focus on the hands. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5235–5244 (2018)
62. Neidle, C., Thangali, A., Sclaroff, S.: Challenges in development
of the american sign language lexicon video dataset (asllvd) cor-
pus. In: Proceedings of the Language Resources and Evaluation
Conference (LREC) (2012)
63. Neidle, C., Vogler, C.: A new web interface to facilitate access
to corpora: Development of the asllrp data access interface (dai).
In: Proc. 5th Workshop on the Representation and Processing of
Sign Languages: Interactions between Corpus and Lexicon, LREC
(2012)
64. Ong, S.C., Ranganath, S.: Automatic sign language analysis:
A survey and the future beyond lexical meaning. IEEE Pattern
Analysis and Machine Intelligence 27(6), 873–891 (2005)
65. Palmeri, M., Vella, F., Infantino, I., Gaglio, S.: Sign languages
recognition based on neural network architecture. In: Interna-
tional Conference on Intelligent Interactive Multimedia Systems
and Services, pp. 109–118. Springer (2017)
66. Pigou, L., Dieleman, S., Kindermans, P.J., Schrauwen, B.: Sign
language recognition using convolutional neural networks. In:
Proceedings of European Conference on Computer Vision Work-
shops, pp. 572–578 (2014)
67. Pigou, L., Van Den Oord, A., Dieleman, S., Van Herreweghe, M.,
Dambre, J.: Beyond temporal pooling: Recurrence and temporal
convolutions for gesture recognition in video. International Jour-
nal of Computer Vision 126(2-4), 430–439 (2018)
68. Pigou, L., Van Herreweghe, M., Dambre, J.: Gesture and sign
language recognition with temporal residual networks. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3086–3093 (2017)
69. Pu, J., Zhou, W., Li, H.: Dilated convolutional network with iter-
ative optimization for continuous sign language recognition. In:
IJCAI, pp. 885–891 (2018)
70. Pugeault, N., Bowden, R.: Spelling it out: Real-time asl finger-
spelling recognition. In: Proc. of IEEE International Conference
on Computer Vision Workshops, pp. 1114–1119 (2011)
71. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation
with pseudo-3d residual networks. In: The IEEE International
Conference on Computer Vision (ICCV) (2017)
72. Ren, Z., Yuan, J., Meng, J., Zhang, Z.: Robust part-based hand
gesture recognition using kinect sensor. IEEE Trans. on Multime-
dia 15, 1110–1120 (2013)
73. Simonyan, K., Zisserman, A.: Two-stream convolutional networks
for action recognition in videos. In: Advances in Neural Informa-
tion Processing Systems, pp. 568–576 (2014)
74. Simonyan, K., Zisserman, A.: Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556
(2014)
75. Starner, T., Weaver, J., Pentland, A.: Real-time american sign lan-
guage recognition using desk and wearable computer based video.
IEEE Pattern Analysis and Machine Intelligence 20(12), 1371–
1375 (1998)
76. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov,
D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with
convolutions. arXiv preprint arXiv:1409.4842 (2014)
77. Tamura, S., Kawasaki, S.: Recognition of sign language motion
images. Pattern Recognition 21(4), 343–353 (1988)
78. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learn-
ing spatiotemporal features with 3d convolutional networks. In:
Proceedings of the IEEE International Conference on Computer
Vision, pp. 4489–4497 (2015)
79. Traxler, C.B.: The stanford achievement test: National norming
and performance standards for deaf and hard-of-hearing students.
Journal of deaf studies and deaf education 5(4), 337–348 (2000)
80. Valli, C., Lucas, C., Mulrooney, K.J., Villanueva, M.: Linguistics
of American Sign Language: An Introduction. Gallaudet Univer-
sity Press (2011)
81. Wan, J., Li, S., Zhao, Y., Zhou, S., Guyon, I., Escalera, S.:
Chalearn looking at people rgb-d isolated and continuous datasets
for gesture recognition. In: Proceedings of CVPR 2008 Work-
shops. IEEE (2016)
82. Wang, H., Wang, P., Song, Z., Li, W.: Large-scale multimodal ges-
ture recognition using heterogeneous networks. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pp. 3129–3137 (2017)
83. Yang, H., Sclaroff, S., Lee, S.: Sign language spotting with a
threshold model based on conditional random fields. IEEE Pat-
tern Analysis and Machine Intelligence 31(7), 1264–1277 (2009)
84. Yang, H.D.: Sign language recognition with the kinect sensor
based on conditional random fields. Sensors 15, 135–147 (2015)
85. Yang, R., Sarkar, S., Loeding, B.: Handling movement epenthesis
and hand segmentation ambiguities in continuous sign language
recognition using nested dynamic programming. IEEE Pattern
Analysis and Machine Intelligence 32(3), 462–477 (2010)
86. Ye, Y., Tian, Y., Huenerfauth, M.: Recognizing american sign lan-
guage gestures from within continuous videos. The 8th IEEE
Workshop on Analysis and Modeling of Faces and Gestures
(AMFG) in conjunction with CVPR 2018 (2017)
87. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals,
O., Monga, R., Toderici, G.: Beyond short snippets: Deep net-
works for video classification. In: Proceedings of the IEEE confer-
ence on Computer Vision and Pattern Recognition, pp. 4694–4702
(2015)
88. Zafrulla, Z., Brashear, H., Starner, T., Hamilton, H., Presti, P.:
American sign language recognition with the kinect. In: In Pro-
ceedings of the International Conference on Multimodal Inter-
faces, pp. 279–286 (2011)
89. Zhang, C., Tian, Y., Huenerfauth, M.: Multi-modality american
sign language recognition. In: Proceedings of IEEE International
Conference on Image Processing (ICIP) (2016)
90. Zhang, L., Zhu, G., Shen, P., Song, J., Shah, S.A., Bennamoun, M.:
Learning spatiotemporal features using 3dcnn and convolutional
lstm for gesture recognition. In: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 3120–3128
(2017)