978-1-5386-9422-0/18/$31.00 ©2018 IEEE
The 1st 2018 INAPR International Conference, 7 Sept 2018, Jakarta, Indonesia
Sentence Level Indonesian Sign Language
Recognition Using 3D Convolutional Neural
Network and Bidirectional Recurrent Neural
Network
Meita Chandra Ariesta
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
meita.ariesta@binus.ac.id

Fanny Wiryana
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
fanny.wiryana@binus.ac.id

Suharjito
Computer Science Department, BINUS Graduate Program – Master of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
suharjito@binus.edu

Amalia Zahra
Computer Science Department, BINUS Graduate Program – Master of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
amalia.zahra@binus.edu
Abstract—Sign Language Recognition (SLR) is a challenging research field with ample room for improvement. In this research, we propose sentence-level SLR using a deep learning method that combines a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (Bi-RNN). Specifically, a 3D CNN extracts features from each video frame, and a bidirectional RNN extracts features from the sequential behavior of the frames, from which a candidate sentence is generated. This paper offers two key contributions. The first is our proposed dataset of Indonesian Sign Language (SIBI), comprising 30 sentences in SIBI. The second is our novel approach of applying deep learning with the Connectionist Temporal Classification (CTC) loss function to sentence-level SLR. The results show that, among the configurations tested, Hyperparameter 1 achieves the best result. This research also found that a deeper network does not necessarily guarantee better results, and that the size of the dataset affects the performance of the system.
Keywords— sign language recognition, deep learning, CNN,
RNN, CTC loss, SIBI
I. INTRODUCTION
There has always been a communication gap between the hearing-impaired community and hearing people who do not speak sign language. Sign language consists of visible gestures, made mostly with movements of the hands, used primarily by hearing-impaired people to communicate. The official sign language of Indonesia is called Sistem Isyarat Bahasa Indonesia (SIBI). SIBI is mostly adopted from American Sign Language (ASL), follows the grammatical structure of Bahasa Indonesia, the official language of Indonesia, and is equipped with affixes such as me-, ber-, di-, ke-, pe-, ter-, and se- [1]. Although sign language is used daily by the hearing-impaired community, the rest of the population has limited knowledge of it. This creates a communication barrier and fosters discrimination in society.
Sign Language Recognition (SLR) systems are widely developed to bridge this gap. Numerous studies have explored various methods to achieve state-of-the-art results. Various input devices have also been experimented with, such as data gloves, the Leap Motion Controller, vision-based approaches, and the Kinect. Due to its limited applicability outside a laboratory, the unnatural experience it offers, and its high cost, the data-glove approach has been left behind. A more promising approach is the vision-based approach, because it can be applied directly to users and deployed in any application that has a camera, such as phone assistants and smart home interactions [2]. However, vision-based SLR is not an easy task, because sign language is delivered in various ways (hand shapes, position, orientation, and movement), and it is hard to extract information from such features [3,4].
Several feature extraction methods have been explored, yet the challenge remains, because an SLR system should work in natural environments outside a laboratory to be useful. These are the challenges researchers have tried to overcome over the years in developing SLR systems. Lighting, complex backgrounds, poses, and noise may also affect the performance of an SLR system. Another challenge is determining the start and end points of meaningful gestures (gesture segmentation) [5]. To overcome these problems, Huang et al. [3] proposed a 3D Convolutional Neural Network (CNN) to extract spatiotemporal features from videos obtained from a Kinect as the input device. The implementation of deep learning in vision-based SLR systems has helped overcome several challenges. Previously, several other studies on vision-based recognition using deep learning have been conducted. Deep learning has driven major advances in vision-based recognition and has produced satisfactory results in image and object recognition [6,7], image and object description [8], human behavior recognition [9,10], lip-reading [11], and especially SLR [3,12].
This paper proposes sentence-level sign language recognition using a deep learning method that combines a CNN and a Bidirectional Recurrent Neural Network (Bi-RNN). Specifically, a 3D CNN extracts features from each video frame, and a bidirectional RNN extracts features from the sequential behavior of the frames, from which a candidate sentence is generated. This paper offers two key contributions. The first is our proposed novel SIBI dataset, comprising 3,006 videos of 30 sentences in SIBI. The second is our novel approach of applying deep learning with the Connectionist Temporal Classification (CTC) loss function to sentence-level sign language recognition.
The remainder of this paper is structured as follows. Section II elaborates on related work conducted by other researchers. Section III describes the methodology applied in this research, followed by the proposed models, which are described in Section IV. Sections V and VI present the model training and the results, respectively. Finally, conclusions and future work are presented in Section VII.
II. RELATED WORKS
Several input devices have been exploited in Sign Language Recognition (SLR) systems. Data gloves, vision-based approaches, the Leap Motion Controller (LMC), and the Kinect are some of the input devices used for sign acquisition. The data-glove-based method is a relatively old data acquisition method for gesture recognition. It uses a glove, connected to a computer, with sensors that detect the movement and changes of the user's hands and fingers [13–16]. Due to its limited applicability outside a laboratory, the unnatural experience it offers, and its high cost, the data-glove-based approach has been left behind.
The LMC converts signals into computer commands and has been employed in a number of gesture recognition systems [17–20]. However, the LMC is less applicable in everyday life due to its absence from common gadgets. The Microsoft Kinect has also been widely used in gesture recognition [21–27]. Its strength lies in its capability to capture motion and convert it into usable features using its built-in 3D sensory camera. Several studies recommend the Microsoft Kinect for SLR systems.
In vision-based approaches, colored gloves are often used to help hand segmentation [28,29]. Hand segmentation is the process of separating the hands and other features from the rest of an image, and it is one of the many challenges in gesture recognition. Zhang et al. [28] employed colored gloves to support hand segmentation, along with a pupil-detection algorithm that uses the pupils as reference points to assist the process. The Canny Edge Detector is another method for hand segmentation; it detects the edges of hands in an image with a low error rate and performs well at edge detection [30–32]. The Elliptical Fourier Descriptor, designed to extract shape outlines, has also been employed for hand segmentation [33]. Skin detection identifies and isolates skin areas from the rest of an image for hand segmentation [34,35], and has been employed along with hand motion tracking to produce more accurate results [36]. Similarly, colored gloves give the hands a distinctive feature that assists the hand segmentation process [28,37,38]. On the other hand, determining the start and end points of meaningful gestures (gesture segmentation) is a challenge in SLR [5]. One way to address it is to implement a 3D Convolutional Neural Network (CNN) to extract spatiotemporal features from videos obtained from a Kinect [3]; 3D CNNs have also been used to extract features in lip-reading [11] and hand detection [39].
In the development of SLR systems, the Hidden Markov Model (HMM) has been widely exploited, in both glove-based [23,40,41] and vision-based [29,42–44] approaches. To speed up the recognition process, the Tied-Mixture Density Hidden Markov Model (TMDHMM) can be applied without significantly reducing system accuracy; it has been shown to recognize up to 92.5% of frequently used Chinese Sign Language (CSL).
However, HMMs have difficulty handling noisy data [45]. Therefore, Kaluri et al. [44] proposed applying a Wiener filter to eliminate noise in the images and an adaptive histogram technique to segment them, feeding the output into an HMM for training and recognition.
An earlier study on SIBI recognition [46] used a classification method called Generalized Learning Vector Quantization (GLVQ) to recognize the letters A–Z and 10 digits, using a Microsoft Kinect as the input device. Another work [27] recognized 10 words using GLVQ and Random Forest.
Deep learning methods have been widely used in computer vision, for example in object recognition. A deep learning CNN and RNN were used in [47] to classify images into 51 classes. Another work [48] derived textual descriptions from a given image using a CNN and a bidirectional RNN. The work in [3] also used a deep learning method, a 3D CNN, to recognize 25 words performed by nine people. However, research on sign language recognition using deep learning is still limited.
III. METHODOLOGY
This section describes the preparation of the study prior to the experiments. It covers Sistem Isyarat Bahasa Indonesia (SIBI), the data, and the translation of the dataset.
A. Sistem Isyarat Bahasa Indonesia (SIBI)
Sistem Isyarat Bahasa Indonesia (SIBI) is the official sign language system approved by the Indonesian government. The Ministry of Education and Culture of Indonesia published the first SIBI dictionary in 1994 [27]. The government promotes the use of SIBI to overcome the diversity of sign languages used throughout the country. 80% of SIBI is adopted from American Sign Language (ASL), while the remaining 20% is normalized from Bahasa Indonesia. SIBI is taught in schools for children with special needs as a means of communication.
The vocabulary of SIBI is mostly adopted from ASL and
follows the grammatical structure of Bahasa Indonesia, the
official language of Indonesia, and is equipped with affixes
(me-, ber-, di-, ke-, pe-, ter-, and se-) [1]. SIBI can therefore be regarded as a sign language normalized and standardized based on Bahasa Indonesia. Representing one word in Bahasa Indonesia may require up to five signs [49]. For example, the word “perjalanan” consists of the prefix “per-”, the root “jalan”, and the suffix “-an”; the root and each affix has its own SIBI representation, each expressed with one sign, and the signs are performed continuously.
B. Data
The data consisted of 3,006 videos of 30 sentences in SIBI. The data were collected at Santi Rama, a school for children with special needs in Jakarta, Indonesia; Santi Rama is one of the founding members of SIBI. Ten teachers volunteered to perform the 30 sentences in SIBI, with each recording repeated 10 times. Errors occurred during recording, both human and device errors, resulting in an uneven number of videos per sentence. The collected data were then reviewed one by one for validity checking.
Before the data were used for training, the collected videos were preprocessed into sequences of images, and the images were resized to 100 x 50 pixels. Lastly, because the videos vary in length and thus yield different numbers of extracted frames, we padded every extracted image sequence to a fixed length of 270 frames: for videos with fewer than 270 frames, padding was applied to the remaining frames.
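The padding step above can be sketched as follows; this is a minimal numpy sketch under our assumptions about array layout (frames stored as T x H x W x C), and the function name `pad_frames` is illustrative, not from the paper.

```python
import numpy as np

def pad_frames(frames, target_len=270):
    """Pad a frame sequence with zero frames up to target_len.

    frames: array of shape (T, H, W, C) with T <= target_len.
    Returns an array of shape (target_len, H, W, C): the original
    frames first, zero padding for the remaining positions.
    """
    t, h, w, c = frames.shape
    padded = np.zeros((target_len, h, w, c), dtype=frames.dtype)
    padded[:t] = frames
    return padded

# e.g. a 180-frame clip resized to 100 x 50 pixels (W x H), 3 channels
clip = np.random.rand(180, 50, 100, 3).astype(np.float32)
out = pad_frames(clip)
```

Videos that already reach 270 frames pass through unchanged except for dtype-preserving copy into the fixed-size buffer.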
The data was distributed into three groups: training,
testing, and validation. The data distribution can be seen in
Table 1.
TABLE I. DATA DISTRIBUTION

Training       60%
Validation     20%
Testing        20%
TOTAL         100%
C. Translation of Dataset
To increase the variance of the collected dataset, a translation using an affine transformation was applied. The translation slides the pixels horizontally (left, right) or vertically (up, down): each horizontal translation shifts the image by 105 pixels and each vertical translation by 35 pixels. The translation was applied in eight directions: right, left, up, down, right-up, right-down, left-up, and left-down. After translation, the total dataset consisted of 27,054 videos.
IV. PROPOSED MODEL
This section elaborates on our proposed model, which comprises the Convolutional Neural Network (CNN), activation function, dropout, softmax, Recurrent Neural Network (RNN), bidirectional RNN, Gated Recurrent Unit (GRU), and Connectionist Temporal Classification (CTC) loss.
A. Convolutional Neural Network (CNN)
‘Convolution Network’ itself, refers to a network with
mathematical operation named convolution. Convolution is a
special type of linear operation. Convolutional Neural
Network (CNN), straightforwardly, is a neural network that
uses convolution, at least in one of the layers [50]. The main
idea of CNN is to solve the problem of substantial number of
parameters and allowing a network to run deeper and faster
while decreasing the parameter [51]. CNN has learnable
weights and biases, and those parameters can be learnt using
supervised or unsupervised learning [3,52]. Furthermore, like
other neural networks, CNN has loss product function in its
last layer (fully connected layer). Generally, CNN consists of
stacks of convolutional layer, pooling layer, activation
function, and fully connected layer. One of the examples of
CNN architecture is displayed in Fig. 1. According to Ji et al.
[9], CNN architecture can be constructed by stacking some
layers of convolution and subsampling alternately.
Fig. 1. CNN Architecture
B. Activation Function
In the training process using gradient descent, training with the saturating nonlinearity f(x) = tanh(x) takes longer than training with the non-saturating nonlinearity:

f(x) = max(0, x)     (1)
A neuron with this nonlinearity is called a Rectified Linear Unit (ReLU). Training a deep convolutional network with ReLU has been proven to be faster than with the standard tanh unit [53]. ReLU is the activation function recommended for most neural networks; applying it turns the layer's linear transformation into a non-linear one.
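Eq. (1) is simple enough to state directly in code; a minimal numpy sketch:

```python
import numpy as np

def relu(x):
    """Non-saturating nonlinearity f(x) = max(0, x) from Eq. (1):
    negative inputs are clamped to zero, positive inputs pass through."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = relu(x)
```

Because the positive branch has a constant gradient of 1, ReLU avoids the gradient saturation that slows tanh-based training.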
C. Dropout
Deep learning models generally have a large number of parameters, so overfitting is always a concern. Dropout is a technique applied to address this problem: it cuts connections between neurons (drops units), thinning the network during training. Dropping units prevents the network from fitting the training data too closely. Furthermore, dropout provides a way to combine many neural network architectures simultaneously and efficiently.
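The unit-dropping idea can be sketched as below; this shows the common "inverted dropout" formulation, which is an assumption on our part (the paper does not specify the variant).

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: during training, zero each unit with probability p
    and rescale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is the identity."""
    if not train:
        return x
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

x = np.ones(1000)
y = dropout(x, p=0.5)
```

Each training pass thus samples a different thinned sub-network, which is what lets dropout act as an efficient ensemble of architectures.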
D. SOFTMAX Algorithm
The softmax function is the function most frequently used at the output layer of a deep learning model for classification, representing a probability distribution over n different classes [50].
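A minimal numpy sketch of softmax, using the standard max-subtraction trick for numerical stability (an implementation detail not discussed in the paper):

```python
import numpy as np

def softmax(z):
    """Map scores z to a probability distribution over the last axis.
    Subtracting the max does not change the result but avoids overflow."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([2.0, 1.0, 0.1]))
```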
E. Recurrent Neural Network (RNN)
The Recurrent Neural Network (RNN) has been a focus of research since the 1990s. An RNN is built to handle sequential data, or data with temporal patterns. The recurrent topology adds a feedback connection that links a neuron back to a preceding neuron, so that activation values can be carried across time steps (looping).
F. Bidirectional-RNN Architecture
A bidirectional RNN combines an RNN that runs forward through time, from the start to the end of a sequence, with one that runs backward through time, from the end to the start. Fig. 2 illustrates the bidirectional-RNN architecture, which is used to learn a mapping of the input sequence x.

Fig. 2. Bidirectional-RNN Architecture [50]

h(t) is the state of the sub-RNN that moves forward through time, and g(t) is the state of the sub-RNN that moves backward. The output unit o(t) computes a representation that depends on both the previous and the following states.
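The forward/backward combination can be sketched as below. This is a minimal numpy sketch using plain tanh cells for clarity; the paper's model uses GRU cells, and the function name `birnn` and the weight shapes are illustrative assumptions.

```python
import numpy as np

def birnn(x, Wf, Uf, Wb, Ub):
    """Minimal bidirectional vanilla RNN over an input sequence x of
    shape (T, D). h(t) runs forward, g(t) runs backward; the output
    o(t) concatenates both states, so it sees past and future context."""
    T = x.shape[0]
    H = Uf.shape[0]
    h = np.zeros((T, H))
    g = np.zeros((T, H))
    state = np.zeros(H)
    for t in range(T):                      # forward sub-RNN: h(t)
        state = np.tanh(x[t] @ Wf + state @ Uf)
        h[t] = state
    state = np.zeros(H)
    for t in reversed(range(T)):            # backward sub-RNN: g(t)
        state = np.tanh(x[t] @ Wb + state @ Ub)
        g[t] = state
    return np.concatenate([h, g], axis=1)   # o(t), shape (T, 2H)

rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
out = birnn(rng.normal(size=(T, D)),
            rng.normal(size=(D, H)), rng.normal(size=(H, H)),
            rng.normal(size=(D, H)), rng.normal(size=(H, H)))
```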
G. Gated Recurrent Unit (GRU)
Chung et al. [54] state that RNN training is difficult because of the long-range dependencies in the model architecture: the gradient tends to vanish (most of the time) or explode (rarely, but with devastating effects). To prevent this, a gated recurrent unit approach such as the Gated Recurrent Unit (GRU) is used; such units have proven successful at properly retaining information over long time dependencies.
A GRU is applied at every recurrent unit to adaptively store time-dependent information. It has gating units that control the flow of information within a unit without using a separate memory cell.
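One GRU step can be sketched as below; a minimal numpy sketch of the standard GRU formulation, with the weight packing (`W`, `U`, `b` holding update-gate, reset-gate, and candidate parameters) being our own illustrative convention.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, W, U, b):
    """One GRU step. W: (3, D, H), U: (3, H, H), b: (3, H).
    The gates decide how much of the old state to keep, which is what
    lets gradients survive long sequences without a separate memory cell."""
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])              # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])              # reset gate
    h_tilde = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return (1.0 - z) * h + z * h_tilde                   # interpolate old/new

rng = np.random.default_rng(0)
D, H = 4, 3
h_new = gru_cell(rng.normal(size=D), np.zeros(H),
                 rng.normal(size=(3, D, H)), rng.normal(size=(3, H, H)),
                 np.zeros((3, H)))
```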
H. Connectionist Temporal Classification (CTC) Loss
Connectionist Temporal Classification (CTC) is a function that allows a Recurrent Neural Network (RNN) to be trained for sequence transcription tasks without prior alignment between the input and the target sequences [55]. The CTC loss gives the likelihood of an output sequence given the provided input [56].
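The effect of the CTC blank symbol is easiest to see at decoding time. The sketch below shows greedy CTC decoding (collapse repeats, drop blanks), a simplification we add for illustration; training would instead minimize the CTC loss summed over all alignments.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding: take the argmax label per frame, collapse
    consecutive repeats, then drop blank symbols."""
    best = log_probs.argmax(axis=-1)
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(int(k))
        prev = k
    return out

# 6 frames, 3 symbols (0 = blank): per-frame argmax path 0 1 1 0 2 2
frames = np.array([[.9, .05, .05], [.1, .8, .1], [.1, .8, .1],
                   [.9, .05, .05], [.1, .1, .8], [.1, .1, .8]])
decoded = ctc_greedy_decode(np.log(frames))
```

This is why the linear layer in the hyperparameter tables outputs "32 + blank" classes: the extra blank lets the network emit "no label" on transition frames.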
V. MODEL TRAINING
The model used in this research adopts the Lipnet model. To evaluate it, we propose three additional models.
A. Lipnet
The Lipnet model uses three blocks of 3D-CNN and two blocks of bidirectional RNN, trained with the same hyperparameters as the original Lipnet. It was trained on the translated dataset, which comprises 27,054 image sequences, and uses Hyperparameter 1, as shown in Table II.
TABLE II. HYPERPARAMETER 1

Layer     Size/Stride/Padding        Input Size       Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(1,2,2)    270x3x50x100     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x32x25x50     T x C x H x W
3D-CNN    (3,5,5)/(1,2,2)/(1,2,2)    270x32x12x25     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x64x6x13      T x C x H x W
3D-CNN    (3,3,3)/(1,2,2)/(1,2,1)    270x64x3x16      T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x96x1x3       T x C x H x W
Bi-GRU    256                        270x(96x1x3)     T x (C x H x W)
Bi-GRU    256                        270x512          T x F
Linear    32 + blank                 270x512          T x F
Softmax                              270x33           T x V
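The layer sizes in Table II can be checked with the standard convolution/pooling output formula; a small sketch (assuming zero-fill padding and floor division):

```python
def conv_out(size, kernel, stride, pad):
    """Output length of one convolution/pooling axis:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# First 3D-CNN layer of Hyperparameter 1: kernel (3,5,5), stride (1,2,2),
# padding (1,2,2) on a 270x3x50x100 (T x C x H x W) input.
t = conv_out(270, 3, 1, 1)   # temporal axis preserved
h = conv_out(50, 5, 2, 2)    # height halved
w = conv_out(100, 5, 2, 2)   # width halved
```

The result (270, 25, 50) matches the 270x32x25x50 input of the following pooling layer, with 32 being the number of filters.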
B. First Model
This model is identical to the Lipnet model, except that it evaluates the usefulness of padding by omitting it. The model still uses three blocks of 3D-CNN and two blocks of bidirectional RNN, and was trained on the normal dataset of 3,006 image sequences. The architecture of Lipnet and the first model is shown in Fig. 3. This model uses Hyperparameter 2, as shown in Table III.
Fig. 3. Lipnet Model and the First Model
TABLE III. HYPERPARAMETER 2

Layer     Size/Stride/Padding        Input Size       Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(0,0,0)    270x3x50x100     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x32x23x48     T x C x H x W
3D-CNN    (3,5,5)/(1,1,1)/(0,0,0)    270x32x11x24     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x64x7x20      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(0,0,0)    270x64x3x10      T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x96x1x4       T x C x H x W
Bi-GRU    256                        270x(96x1x4)     T x (C x H x W)
Bi-GRU    256                        270x512          T x F
Linear    32 + blank                 270x512          T x F
Softmax                              270x33           T x V
C. Second Model
The second model is simpler than the Lipnet model: it uses one block of 3D-CNN and one block of bidirectional RNN, trained on the normal dataset. The architecture of the second model is shown in Fig. 4. This model uses Hyperparameter 3, as shown in Table IV.
TABLE IV. HYPERPARAMETER 3

Layer     Size/Stride/Padding        Input Size        Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(0,0,0)    270x3x50x100      T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x32x23x48      T x C x H x W
Bi-GRU    256                        270x(32x11x24)    T x (C x H x W)
Linear    32 + blank                 270x256           T x F
Softmax                              270x33            T x V
Fig. 4. The Second Model
D. Third Model
The third model is more complex than the Lipnet model: it uses eight blocks of 3D-CNN and two blocks of bidirectional RNN, trained on the normal dataset. The architecture of the third model is shown in Fig. 5. This model uses Hyperparameter 4, as shown in Table V.
Fig. 5. The Third Model
TABLE V. HYPERPARAMETER 4

Layer     Size/Stride/Padding        Input Size        Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(2,4,4)    270x3x50x100      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(2,4,4)    270x32x27x53      T x C x H x W
Pool      (1,2,2)                    270x32x33x59      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x64x16x29      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x96x18x31      T x C x H x W
Pool      (1,2,2)                    270x96x20x33      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x128x10x16     T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x128x12x18     T x C x H x W
Pool      (1,2,2)                    270x128x14x20     T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x256x7x10      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x256x9x12      T x C x H x W
Pool      (1,2,2)                    270x256x11x14     T x C x H x W
Bi-GRU    256                        270x(512x5x7)     T x (C x H x W)
Bi-GRU    256                        270x512           T x F
Linear    32 + blank                 270x512           T x F
Softmax                              270x33            T x V
VI. RESULT
To measure the performance of the trained models, the word error rate (WER) and character error rate (CER) are computed. WER and CER count the number of substitutions, deletions, and insertions needed to turn the hypothesis into the true label, normalized by the label length.
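The WER computation can be sketched as below; a minimal sketch using the standard Levenshtein dynamic program, with the example sentences being illustrative, not drawn from the paper's dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum number of substitutions, deletions,
    and insertions turning ref into hyp (dynamic programming)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(ref_sentence, hyp_sentence):
    """Word error rate: edit distance over words, divided by the number
    of reference words. CER is the same computation over characters."""
    ref = ref_sentence.split()
    return edit_distance(ref, hyp_sentence.split()) / len(ref)

rate = wer("saya pergi ke sekolah", "saya ke rumah sekolah")
```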
Table 6 below summarizes the performance of the system
using different models that have been described in the
previous section. Our proposed models are compared to the
Lipnet model according to WER and CER.
TABLE VI. EVALUATION RESULT

Model          WER (Word Error Rate)    CER (Character Error Rate)
Lipnet         87.10%                   70.00%
First Model    89.40%                   69.58%
Second Model   90.50%                   76.90%
Third Model    88.17%                   65.33%
Based on the performance shown in Table VI, the error rates are still high: the average WER across the models is 88.79%. The Lipnet model gives the best WER, with the lowest error of 87.10%, while the second model gives the worst, with a WER of 90.50%. By CER, the third model performs best, at 65.33%. Every trained model still performed poorly at recognizing sign language gestures, even though the models used similar data.
VII. CONCLUSION AND FUTURE WORKS
We proposed a deep learning model for sentence-level sign language recognition. The deep learning methods used in this research are a 3D-CNN and a bidirectional Recurrent Neural Network (RNN), applied to a 30-sentence SIBI video dataset collected by the researchers.
Based on the results presented in the previous section, the proposed models and the dataset do not seem to match well: the first, second, and third models, trained on the normal dataset, yield similar error rates, and the Lipnet model, though trained on the larger dataset, still produces a similar result. For future work, we would like to preprocess the data to better match the models, for example by handling noise in the dataset.
ACKNOWLEDGMENT
This research was supported by research grant No. 039A/VR.RTT/VI/2017 from the Ministry of Research, Technology and Higher Education of the Republic of Indonesia.
REFERENCES
[1] D. R. Kurnia and T. Slamet, “Menormalkan yang dianggap “tidak
normal” (studi kasus: penertiban bahasa isyarat tunarungu di SLB
Malang),” Indonesian Journal of Disability Studies, 3(1), pp. 34–43,
2016.
[2] M. R. Abid, E. M. Petriu, and E. Amjadian, “Dynamic sign language
recognition for smart home interactive application using stochastic
linear formal grammar,” IEEE Transactions on Instrumentation and
Measurement, 64(3), pp. 596–605, 2015.
[3] J. Huang, W. Zhou, H. Li, and W. Li, “Sign language recognition
using 3D convolutional neural networks,” IEEE International
Conference on Multimedia and Expo (ICME), IEEE, pp. 1–6, 2015.
[4] O. Koller, J. Forster, and H. Ney, “Continuous sign language
recognition: Towards large vocabulary statistical recognition systems
handling multiple signers,” Computer Vision and Image
Understanding, 141, pp. 108–125, 2015.
[5] M. K. Bhuyan, D. A. Kumar, K. F. MacDorman, and Y. Iwahori, “A
novel set of features for continuous hand gesture recognition,” Journal
on Multimodal User Interfaces, 8(4), pp. 333–343, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, IEEE, pp. 770–778, 2016.
[7] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” International Conference on
Learning Representations, pp. 1–14, 2014.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S.
Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent
convolutional networks for visual recognition and description,”
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2625–2634, 2015.
[9] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks
for human action recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 35(1), pp. 221–231, 2013.
[10] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt,
“Sequential deep learning for human action recognition,”
International Workshop on Human Behavior Understanding, pp. 29–
39, 2011.
[11] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas,
“Lipnet: Sentence-level lipreading,” arXiv Prepr arXiv161101599,
2016.
[12] O. Koller, H. Ney, and R. Bowden, “Deep learning of mouth shapes
for sign language,” Proceedings of the IEEE International Conference
on Computer Vision Workshops, pp. 85–91, 2015.
[13] S. A. Mehdi and Y. N. Khan, “Sign language recognition using sensor
gloves,” Proceedings of the 9th International Conference on Neural
Information Processing (ICONIP), IEEE, pp. 2204–2206, 2002.
[14] L. T. Phi, H. D. Nguyen, T. Q. Bui, and T. T. Vu, “A glove-based
gesture recognition system for Vietnamese sign language,”
Proceedings of the 15th International Conference on Control,
Automation and Systems (ICCAS), 13(16), pp. 1555–1559, 2015.
[15] A. Ranjini S. S. and M. Chaitra, "Sign language recognition system,"
International Journal on Recent and Innovation Trends in Computing
and Communication, 2(4), pp. 947–953, 2014.
[16] S. Saengsri, V. Niennattrakul, and C. A. Ratanamahatana, "TFRS:
Thai finger-spelling sign language recognition system," 2nd
International Conference on Digital Information and Communication
Technology and its Applications (DICTAP), pp. 457–462, 2012.
[17] M. Koul, P. Patil, V. Nandurkar, and S. Patil, "Sign language
recognition using leap motion sensor," International Research Journal
of Engineering and Technology (IRJET), 3(11), pp. 322–325, 2016.
[18] L. E. Potter, J. Araullo, and L. Carter, "The leap motion controller: A
view on sign language," Proceedings of the 25th Australian
Computer-Human Interaction, Conference on Augmentation,
Application, Innovation, Collaboration – OzCHI, ACM, pp. 175–178,
2013.
[19] M. U. Kakde, M. G. Nakrani, and A. M. Rawate, "A review paper on
sign language recognition system for deaf and dumb people using
image processing", International Journal of Engineering Research and
Technology, 5(3), pp. 590–592, 2016.
[20] H. Bhavsar, "Review on feature extraction methods of image based
sign language recognition system," Indian Journal of Computer
Science and Engineering, 8(3), pp. 249–259, 2017.
[21] S. B. Carneiro, E. D. D. M. Santos, M. D. A. Talles, J. O. Ferreira, S.
G S. Alcala, and A. F. Da Rocha, "Static gestures recognition for
Brazilian sign language with kinect sensor," SENSORS, IEEE, 2016.
[22] E. Escobedo and G. Camara, "A new approach for dynamic gesture
recognition using skeleton trajectory representation and histograms of
cumulative magnitudes," SIBGRAPI Conference on Graphics,
Patterns and Images, pp. 209–216, 2016.
[23] J. Ma, W. Gao, J. Wu, and C. Wang, “A continuous Chinese sign
language recognition system,” Proceedings of the 4th International
Conference on Automatic Face and Gesture Recognition, IEEE, pp.
428–433, 2000.
[24] Y. Jiang, J. Tao, W. Ye, W. Wang, and Z. Ye, "An isolated sign
language recognition system using RGB-D sensor with sparse
coding," 17th International Conference on Computational Science and
Engineering, IEEE, pp. 21–26, 2014.
[25] C. Keskin, F. Kırac, Y. E. Kara, and L. Akarun, "Real time hand pose
estimation using depth sensors," International Conference on
Computer Vision Workshops, IEEE, pp. 1228–1234, 2011.
[26] J. L. Raheja, A. Mishra, and A Chaudhary, "Indian sign language
recognition using SVM," Pattern Recognition and Image Analysis,
26(2), pp. 434–441, 2016.
[27] E. Rakun, M. Andriani, I. W. Wiprayoga, K. Danniswara, and A.
Tjandra, "Combining depth image and skeleton data from kinect for
recognizing words in the sign system for Indonesian language (SIBI
[sistem isyarat bahasa Indonesia])," International Conference on
Advanced Computer Science and Information Systems (ICACSIS),
pp. 387–392, 2013.
[28] L. G. Zhang, Y. Chen, G. Fang, X. Chen, and W. Gao, “A vision-
based sign language recognition system using tied-mixture density
HMM,” Proceedings of the 6th International Conference on
Multimodal Interfaces, ACM, pp. 198–204, 2004.
[29] T. Starner and A. Pentland, “Real-time American sign language
recognition from video using hidden Markov models,” Motion-Based
Recognition, Springer, pp. 227–243, 1997.
[30] E. A. Kalsh and N. S. Garewal, "Sign language recognition system,"
International Journal of Computational Engineering Research, 3(6),
pp.15–21, 2013.
[31] M. V. D. Prasad, V. Kishore, and A. Kumar, "Indian sign language
recognition system using new fusion based edge operator," Journal of
Theoretical and Applied Information Technology, 88(3), pp. 574–
584, August 2016.
[32] D. K. Ghosh and S. Ari, “On an algorithm for vision-based hand
gesture recognition,” Signal, Image and Video Processing, 10(4), pp.
655–662, 2016.
[33] P. V. V. Kishore, M. V. D. Prasad, C. R. Prasad, and R. Rahul, "4-
camera model for sign language recognition using elliptical Fourier
descriptors and ANN," International Conference on Signal Processing
and Communication Engineering Systems (SPACES), IEEE, pp. 34–
38, 2015.
[34] S. C. W. Ong and S. Ranganath, "Automatic sign language analysis: a
survey and the future beyond lexical meaning," IEEE Transactions on
Pattern Analysis and Machine Intelligence, 27(6), pp. 873–891, 2005.
[35] K. M. Lim, A. W. C. Tan, and S. C. Tan, "A feature covariance matrix
with serial particle filter for isolated sign language recognition,"
Expert Systems with Applications, Elsevier Ltd, 54, pp. 208–218,
2016.
[36] P. C. Pankajakshan and B. Thilagavathi, "Sign language recognition
system," Innovations in Information, Embedded and Communication
Systems (ICIIECS), IEEE, pp. 1–4, 2015.
[37] T. E. Starner, "Visual recognition of American sign language using
hidden Markov models," Massachusetts Institute of Technology, Dept.
of Brain and Cognitive Sciences, 1995.
[38] R. Y. Wang and J. Popović, “Real-time hand-tracking with a color
glove,” ACM Transactions on Graphics (TOG), 28(3), 63, 2009.
[39] S. Yan, Y. Xia, J. S. Smith, W. Lu, and B. Zhang, “Multiscale
convolutional neural networks for hand detection,” Applied
Computational Intelligence and Soft Computing, 2017.
[40] R. H. Liang and M. Ouhyoung, “A real-time continuous gesture
recognition system for sign language,” Proceedings of the 3rd
International Conference on Automatic Face and Gesture
Recognition, IEEE, pp. 558–567, 1998.
[41] H. Wang, M. C. Leu, and C. Oz, “American sign language
recognition using multi-dimensional hidden Markov models,” Journal
of Information Science and Engineering, 22(5), pp. 1109–1123, 2006.
[42] M. Elmezain, A. Al-Hamadi, J. Appenrodt, and B. Michaelis, “A
hidden Markov model-based continuous gesture recognition system
for hand motion trajectory,” 19th International Conference on Pattern
Recognition, IEEE, pp. 1–4, 2008.
[43] V. N. Pashaloudi and K. G. Margaritis, "A performance study of a
recognition system for Greek sign language alphabet letters," 9th
Conference on Speech and Computer, 2004.
[44] R. Kaluri and C. H. Pradeep, "An enhanced framework for sign
gesture recognition using hidden Markov model and adaptive
histogram technique," International Journal of Intelligent
Engineering and Systems, 10, 2017.
[45] M. C. Roh, S. Fazli, and S. W. Lee, “Selective temporal filtering and
its application to hand gesture recognition,” Applied Intelligence,
45(2), pp. 255–264, 2016.
[46] E. Rakun, M. F. Rachmadi, and K. Danniswara, “Spectral domain
cross correlation function and generalized learning vector
quantization for recognizing and classifying Indonesian sign
language,” Advanced Computer Science and Information Systems
(ICACSIS), pp. 978–979, 2012.
[47] R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng,
“Convolutional-recursive deep learning for 3D object classification,”
Advances in Neural Information Processing Systems, pp. 656–664,
2012.
[48] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for
generating image descriptions,” IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3128–3137, 2015.
[49] N. Palfreyman, “Sign language varieties of Indonesia: A linguistic
and sociolinguistic investigation,” Sign Language and Linguistics,
20(1), 2017.
[50] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT
Press, 2016.
[51] H. H. Aghdam and E. J. Heravi, Guide to Convolutional Neural
Networks, Springer International Publishing, 2017.
[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
86(11), pp. 2278–2324, 1998.
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classification with deep convolutional neural networks,” Advances In
Neural Information Processing Systems, pp. 1097–1105, 2012.
[54] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation
of gated recurrent neural networks on sequence modeling," NIPS
Workshop on Deep Learning, pp. 1–9, December 2014.
[55] A. Graves and N. Jaitly, “Towards end-to-end speech recognition
with recurrent neural networks,” Proceedings of the 31st International
Conference on Machine Learning (ICML), pp. 1764–1772, 2014.
[56] A. Maas, Z. Xie, D. Jurafsky, and A. Ng, “Lexicon-free
conversational speech recognition with neural networks,” Proceedings
of the 2015 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
Technologies, pp. 345–354, 2015.