Sentence Level Indonesian Sign Language Recognition Using 3D Convolutional Neural Network and Bidirectional Recurrent Neural Network
Meita Chandra Ariesta, Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480, meita.ariesta@binus.ac.id
Fanny Wiryana, Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480, fanny.wiryana@binus.ac.id
Suharjito, Computer Science Department, BINUS Graduate Program Master of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480, suharjito@binus.edu
Amalia Zahra, Computer Science Department, BINUS Graduate Program Master of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480, amalia.zahra@binus.edu
Abstract: Sign Language Recognition (SLR) is a challenging research field that still offers considerable room for improvement. In this research, we propose sentence-level SLR using a deep learning method that combines a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (Bi-RNN). Specifically, a 3D CNN is implemented to extract features from each video frame, and a bidirectional RNN is implemented to extract features from the sequential behavior of the video frames, which are then used to generate a candidate sentence. There are two key takeaways from this paper. The first is our proposed dataset of Indonesian Sign Language (SIBI), which comprises 30 sentences in SIBI. The second is our novel approach of using deep learning and the Connectionist Temporal Classification (CTC) loss function for sentence-level SLR. The results show that, among the hyperparameter settings evaluated, Hyperparameter 1 achieves the best result. This research also found that a deeper network does not necessarily guarantee better results, and that the size of the dataset affects the performance of the system.
Keywords: sign language recognition, deep learning, CNN, RNN, CTC loss, SIBI
I. INTRODUCTION
There has always been a communication gap between the hearing-impaired community and hearing people who do not use sign language. Sign language consists of visible gestures, made mainly by hand movements, used mostly by hearing-impaired people to communicate. The official sign language system of Indonesia is called Sistem Isyarat Bahasa Indonesia (SIBI). SIBI is mostly adopted from American Sign Language (ASL), follows the grammatical structure of Bahasa Indonesia, the official language of Indonesia, and is equipped with affixes, such as me-, ber-, di-, ke-, pe-, ter-, and se- [1]. Although sign language is used daily by the hearing-impaired community, the rest of the population has limited knowledge of it. This creates a communication barrier and discrimination in society.
Sign Language Recognition (SLR) systems are widely developed to bridge this gap. Numerous studies have explored various methods to achieve state-of-the-art results. Various input devices have also been experimented with, such as data gloves, the Leap Motion Controller, vision-based approaches, and the Kinect. Due to its limited applicability outside a laboratory, the unnatural experience it offers, and its high cost, the data-glove approach has largely been abandoned. The vision-based approach has been explored more extensively because it can be applied directly to the user and deployed in any application that has a camera, such as phone assistants and smart home interaction [2]. However, it is not an easy task, because sign language is delivered through various channels: hand shapes, position, orientation, and movements, and it is hard to extract information from such features [3,4].
Several feature extraction methods have been investigated, yet the challenge remains, because an SLR system must work in natural environments outside the laboratory to be useful. These are the challenges researchers have tried to overcome throughout the years of SLR system development. Lighting, complex backgrounds, poses, and noise may also affect the performance of an SLR system. Another challenge is determining the start and end points of meaningful gestures (gesture segmentation) [5]. To overcome these problems, Huang et al. [3] proposed a 3D Convolutional Neural Network (CNN) to extract spatial and temporal features from videos obtained using the Kinect as the input device. The implementation of deep learning in vision-based SLR systems has helped to overcome several of these challenges. Several other studies related to vision-based recognition using deep learning have also been conducted. Deep learning has driven major advances in vision-based recognition and has shown strong results in image and object recognition [6,7], image and object description [8], human behavior recognition [9,10], lip reading [11], and especially SLR [3,12].
This paper proposes sentence-level sign language recognition using a deep learning method that combines a CNN and a Bidirectional Recurrent Neural Network (Bi-RNN). Specifically, a 3D CNN is implemented to extract features from each video frame, and a bidirectional RNN is implemented to extract features from the sequential behavior of the video frames, which are then used to generate a candidate sentence. There are two key takeaways from this paper. The first is our proposed novel SIBI dataset, which comprises 3,006 videos of 30 sentences in SIBI. The second is our novel approach of using deep learning and the Connectionist Temporal Classification (CTC) loss function to recognize sign language at the sentence level.
The remainder of this paper is structured as follows. Section II reviews related work. Section III describes the methodology applied in this research, followed by the proposed models, which are described in Section IV. Sections V and VI present the model training and the results, respectively. Finally, conclusions and future work are presented in Section VII.
II. RELATED WORKS
Several input devices have been exploited in Sign Language Recognition (SLR) systems. Data gloves, vision-based approaches, the Leap Motion Controller (LMC), and the Kinect are some of the input devices used for sign acquisition. The data-glove method is a relatively old data acquisition method for gesture recognition. It uses a glove equipped with sensors that detect the movements and changes of the user's hands and fingers, connected to a computer [13-16]. Due to its limited applicability outside a laboratory, the unnatural experience it offers, and its high cost, the data-glove approach has largely been abandoned.
The LMC converts signals into computer commands and was employed in a number of existing gesture recognition systems [17-20]. However, the LMC is less applicable in everyday life because it is absent from common gadgets. The Microsoft Kinect has also been widely used in gesture recognition [21-27]. Its strength lies in its capability to capture every motion and convert it into usable features by using the built-in 3D sensory camera. Several studies recommend employing the Microsoft Kinect in SLR systems.
In vision-based approaches, colored gloves are often used to help hand segmentation [28,29]. Hand segmentation is the process of separating the hands and other features from the rest of an image, and it is one of the many challenges in gesture recognition. Zhang et al. [28] employed colored gloves to help the hand segmentation process, along with a pupil-detection algorithm that uses the pupils as reference points to assist the process. The Canny edge detector is another method used to detect the edges of hands in an image for hand segmentation, with a low error rate and good performance in detecting edges [30-32]. The elliptical Fourier descriptor, which is designed to extract shape outlines, has also been employed for hand segmentation [33]. Skin detection identifies and isolates skin areas from the rest of an image for hand segmentation [34,35], and it has been employed together with hand motion tracking to produce more accurate results [36]. Similarly, colored gloves are employed because they give the hand a distinctive appearance that assists the hand segmentation process [28,37,38]. On the other hand, determining the start and end points of meaningful gestures (gesture segmentation) is a challenge in SLR [5]. One method to address it is a 3D Convolutional Neural Network (CNN) that extracts spatial and temporal features from videos obtained with the Kinect [3]; 3D CNNs have also been used to extract features for lip reading [11] and hand detection [39].
In the development of SLR systems, the Hidden Markov Model (HMM) has been widely exploited. It has been used in both glove-based approaches [23,40,41] and vision-based approaches [29,42-44]. To speed up the recognition process, the Tied-Mixture Density Hidden Markov Model (TMDHMM) can be applied without significantly reducing the system accuracy; it was shown to recognize frequently used Chinese Sign Language (CSL) with up to 92.5% accuracy.
However, HMMs have difficulties handling noisy data [45]. Therefore, Kaluri et al. [44] proposed applying a Wiener filter to eliminate noise in the images and an adaptive histogram technique to segment them, with the output fed into an HMM for training and recognition [44].
An earlier study on SIBI sign language recognition [46] used a classification method called Generalized Learning Vector Quantization (GLVQ) to recognize the letters A-Z and 10 numbers, using the Microsoft Kinect as the input device. Another work [27] recognized 10 words using GLVQ and Random Forest.
Deep learning methods have been widely used in computer vision, for example in object recognition. A deep learning approach combining a CNN and an RNN was implemented in [47] to classify images into 51 classes. Another work [48] generated textual descriptions of a given image using a CNN and a bidirectional RNN. The work in [3] also used a deep learning method, a 3D CNN, to recognize 25 words performed by nine signers. However, research on sign language recognition using deep learning is still limited.
III. METHODOLOGY
This section describes the preparation of the study prior to the experiments. It covers Sistem Isyarat Bahasa Indonesia (SIBI), the data, and the translation of the dataset.
A. Sistem Isyarat Bahasa Indonesia (SIBI)
Sistem Isyarat Bahasa Indonesia (SIBI) is the official sign language system approved by the Indonesian government. The Ministry of Education and Culture of Indonesia published the first SIBI dictionary in 1994 [27]. The government promotes the use of SIBI to overcome the diversity of sign languages used throughout the country. 80% of SIBI signs are adopted from American Sign Language (ASL), while the remaining 20% are normalized from Bahasa Indonesia. SIBI is taught in schools for children with special needs as a means of communication.
The vocabulary of SIBI is mostly adopted from ASL and follows the grammatical structure of Bahasa Indonesia, the official language of Indonesia, and is equipped with affixes (me-, ber-, di-, ke-, pe-, ter-, and se-) [1]. SIBI can therefore be described as a sign language normalized and standardized on the basis of Bahasa Indonesia. Representing one word in Bahasa Indonesia may require up to five signs [49]. For example, the word "perjalanan" consists of the prefix "per-", the root word "jalan", and the suffix "-an"; the root and each affix has its own SIBI representation, each rendered as one sign, and the signs are performed in sequence.
B. Data
The data used consisted of 3,006 videos of 30 sentences
in SIBI. The data was collected in Santi Rama, a school for
children with special needs, in Jakarta, Indonesia. Santi
Rama is one of the founding members of SIBI. There are 10
teachers who volunteered to perform 30 sentences in SIBI in
which the recording is repeated 10 times. There are errors
during the recording of those videos, resulting the uneven
number of videos for each sentence. Those errors comprised
of human and recording device errors which occurred during
the recording process. The data collected were then reviewed
one by one for validity checking.
Before the data was used for training, each collected video was preprocessed into a sequence of images. The images were resized to 100 x 50 pixels. Lastly, because the videos vary in length and therefore in the number of extracted frames, every extracted image sequence was padded to a fixed length of 270 frames; for videos with fewer than 270 frames, padding was applied to the remaining positions. A sketch of this step is shown below.
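The following is a minimal sketch of this preprocessing step, assuming OpenCV and NumPy; the function name and the zero-valued padding are illustrative, since the paper does not state the exact padding values used.

```python
import cv2
import numpy as np

MAX_FRAMES = 270          # fixed sequence length
WIDTH, HEIGHT = 100, 50   # target frame size in pixels

def video_to_padded_sequence(video_path):
    """Read a video, resize every frame to 100x50, and pad/truncate to 270 frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (WIDTH, HEIGHT)))
    cap.release()

    seq = np.zeros((MAX_FRAMES, HEIGHT, WIDTH, 3), dtype=np.uint8)
    n = min(len(frames), MAX_FRAMES)
    seq[:n] = np.stack(frames[:n])   # remaining positions stay as zero padding
    return seq
```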
The data was distributed into three groups: training, testing, and validation. The data distribution can be seen in Table I.
TABLE I. DATA DISTRIBUTION
Training      60%
Validation    20%
Testing       20%
TOTAL         100%
C. Translation of Dataset
To multiply the variance of the collected dataset, a
translation using affine transformation was applied to this
work. The translation was performed by sliding the pixel
horizontally (left, right) or vertically (up, down). For every
horizontal translation, image pixel slides horizontally for 105
pixels and vertically for 35 pixels. The translation was
applied in eight directions: right, left, up, down, right-up,
right-down, left-up, and left-down. The total dataset after the
translation consisted of 27,054 videos.
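A minimal sketch of this augmentation, assuming OpenCV and NumPy; the helper names are illustrative.

```python
import cv2
import numpy as np

DX, DY = 105, 35   # horizontal and vertical shift in pixels
DIRECTIONS = [(1, 0), (-1, 0), (0, -1), (0, 1),    # right, left, up, down
              (1, -1), (1, 1), (-1, -1), (-1, 1)]  # the four diagonal directions

def translate_frame(frame, sx, sy):
    """Shift one frame by (sx*DX, sy*DY) pixels with an affine transformation."""
    h, w = frame.shape[:2]
    m = np.float32([[1, 0, sx * DX], [0, 1, sy * DY]])  # 2x3 translation matrix
    return cv2.warpAffine(frame, m, (w, h))

def augment_sequence(frames):
    """Return the original sequence plus its eight translated copies."""
    out = [frames]
    for sx, sy in DIRECTIONS:
        out.append([translate_frame(f, sx, sy) for f in frames])
    return out   # 3,006 sequences become 27,054 after augmentation
```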
IV. PROPOSED MODEL
This section elaborates on our proposed model, which comprises a Convolutional Neural Network (CNN), activation function, dropout, the softmax algorithm, a Recurrent Neural Network (RNN), bidirectional RNN, the Gated Recurrent Unit (GRU), and the Connectionist Temporal Classification (CTC) loss.
A. Convolutional Neural Network (CNN)
The term 'convolutional network' refers to a network that uses the mathematical operation called convolution, a special type of linear operation. A Convolutional Neural Network (CNN), straightforwardly, is a neural network that uses convolution in at least one of its layers [50]. The main idea of a CNN is to address the problem of a substantial number of parameters, allowing a network to run deeper and faster while decreasing the parameter count [51]. A CNN has learnable weights and biases, and those parameters can be learnt using supervised or unsupervised learning [3,52]. Furthermore, like other neural networks, a CNN has a loss function computed at its last layer (the fully connected layer). Generally, a CNN consists of stacks of convolutional layers, pooling layers, activation functions, and fully connected layers. An example of a CNN architecture is displayed in Fig. 1. According to Ji et al. [9], a CNN architecture can be constructed by alternately stacking convolution and subsampling layers; a small code sketch of one such block follows Fig. 1.
Fig. 1. CNN Architecture
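As a concrete illustration, the sketch below applies one convolution-plus-subsampling block to a video tensor using PyTorch (an assumption; the paper does not name its framework). The kernel, stride, and padding follow the first block of Table II, and the output channel count of 32 is taken from Table III. PyTorch's Conv3d expects the channel axis before the temporal axis, so the T x C x H x W ordering used in the tables is permuted first.

```python
import torch
import torch.nn as nn

# First 3D-CNN block of Table II: kernel (3,5,5), stride (1,2,2), padding (1,2,2),
# followed by a (1,2,2) max pooling.
conv = nn.Conv3d(in_channels=3, out_channels=32,
                 kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2))
pool = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

x = torch.randn(270, 3, 50, 100)           # one sequence in T x C x H x W order
x = x.permute(1, 0, 2, 3).unsqueeze(0)     # -> N x C x T x H x W for Conv3d
y = pool(torch.relu(conv(x)))
print(y.shape)   # temporal length (270) is preserved, H and W are reduced
```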
B. Activation Function
In training with gradient descent, a saturating nonlinearity such as f(x) = tanh(x) takes longer to train than the non-saturating nonlinearity:

f(x) = max(0, x) (1)

A neuron with the nonlinearity above is called a Rectified Linear Unit (ReLU). Training a deep convolutional network with ReLUs has been shown to be faster than with the standard tanh unit [53].

ReLU is the activation function recommended for most neural networks. Applying the ReLU activation function turns the output of the linear transformation into a non-linear one.
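A small NumPy illustration of Eq. (1) next to tanh, showing that ReLU does not saturate for large positive inputs.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = np.maximum(0, x)   # Eq. (1): f(x) = max(0, x), non-saturating for x > 0
tanh = np.tanh(x)         # saturates toward -1 and 1 as |x| grows
print(relu)   # [0.  0.  0.  0.5 2. ]
print(tanh)
```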
C. Dropout
Deep learning models generally have a large number of parameters, so overfitting is a constant concern when using this type of network. Dropout is a technique applied to address this problem. Dropout randomly drops units and their connections, shrinking the effective network during training. Dropping units prevents the network from becoming too specialized to the training data provided. Furthermore, dropout provides a way to combine many neural network architectures simultaneously and efficiently.
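A brief sketch of dropout behavior, again assuming PyTorch; the dropout probability of 0.5 is illustrative, as the paper does not report the rate it used.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # illustrative rate; not taken from the paper
x = torch.ones(4, 8)

drop.train()
print(drop(x))   # roughly half the entries are zeroed; survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))   # at evaluation time dropout is the identity
```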
D. SOFTMAX Algorithm
The softmax function is the function most frequently used as the output or classification function to represent a probability distribution over n different classes [50]; it is applied to the output layer of the deep learning model.
E. Recurrent Neural Network (RNN)
The Recurrent Neural Network (RNN) has been a focus for researchers since the 1990s. An RNN is built to handle sequential data, or data with temporal patterns. The recurrent topology is obtained by adding a feedback connection, which connects a neuron back to a preceding neuron so that activation values can be carried over time (looping).
F. Bidirectional-RNN Architecture
A bidirectional RNN combines an RNN that runs forward through time from the start to the end of a sequence with one that runs backward through time from the end to the start. Fig. 2 illustrates the bidirectional RNN architecture, which is used to learn and map the input sequence x.
Fig. 2. Bidirectional-RNN Architecture [50]
h(t) is the state of the sub-RNN that moves forward through time, and g(t) is the state of the sub-RNN that moves backward. The output unit o(t) computes a representation that depends on both the previous and the next states.
G. Gated Recurrent Unit (GRU)
Chung et al. [54] state that RNN training is difficult because of the long-term dependencies in the model architecture: the gradient tends to vanish (most of the time) or explode (rarely, but with devastating effects). To prevent this, a gated recurrent unit approach such as the Gated Recurrent Unit (GRU) is used. Such recurrent units have been shown to capture information from long time dependencies properly.
The GRU allows every recurrent unit to adaptively capture dependencies over time. It has gating units that control the flow of information inside the unit without using a separate memory cell.
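A minimal sketch of a bidirectional GRU layer over a sequence of frame-level features, assuming PyTorch; the input width of 384 matches the flattened 96 x 1 x 4 feature map of Table III and is otherwise illustrative.

```python
import torch
import torch.nn as nn

T, F_IN, HIDDEN = 270, 384, 256   # frames, flattened CNN features, GRU width
bigru = nn.GRU(input_size=F_IN, hidden_size=HIDDEN,
               batch_first=True, bidirectional=True)

x = torch.randn(1, T, F_IN)       # (batch, time, features)
out, h_n = bigru(x)
print(out.shape)                  # (1, 270, 512): forward and backward states concatenated
```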
H. Connectionist Temporal Classification (CTC) Loss
Connectionist Temporal Classification (CTC) is a function that allows a Recurrent Neural Network (RNN) to be trained for sequence transcription tasks without prior information about the alignment between the input and the target sequences [55]. The CTC loss gives the likelihood of the output label sequence given the input sequence [56].
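A sketch of how a CTC loss is applied in PyTorch (framework assumed): per-frame log-probabilities over a character vocabulary of 32 symbols plus a blank, as in the tables below, are aligned to an unsegmented target sequence. All sizes are illustrative.

```python
import torch
import torch.nn as nn

T, N, V = 270, 1, 33   # frames, batch size, 32 characters + CTC blank
S = 20                 # target sentence length in characters (illustrative)

log_probs = torch.randn(T, N, V, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, V, (N, S), dtype=torch.long)   # index 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back through the per-frame predictions
```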
V. MODEL TRAINING
The model used in this research is adapted from the Lipnet model. To evaluate the Lipnet model, we propose three additional models.
A. Lipnet
The Lipnet model uses three blocks of 3D-CNN and two blocks of bidirectional RNN. The hyperparameters used to train this model were the same as those of Lipnet. The model was trained on the translated dataset comprising 27,054 image sequences, using Hyperparameter 1, as shown in Table II; a code sketch of this architecture follows the table.
TABLE II. HYPERPARAMETER 1
Layer      Size/Stride/Padding        Dimension Order
3D-CNN     (3,5,5)/(1,2,2)/(1,2,2)    T x C x H x W
Pool       (1,2,2)/(1,2,2)            T x C x H x W
3D-CNN     (3,5,5)/(1,2,2)/(1,2,2)    T x C x H x W
Pool       (1,2,2)/(1,2,2)            T x C x H x W
3D-CNN     (3,3,3)/(1,2,2)/(1,2,1)    T x C x H x W
Pool       (1,2,2)/(1,2,2)            T x C x H x W
Bi-GRU     256                        T x (C x H x W)
Bi-GRU     256                        T x F
Linear     32 + blank                 T x F
Softmax                               T x V
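For concreteness, the sketch below assembles the layers of Table II into an end-to-end network in PyTorch (framework assumed). The channel widths 32/64/96 follow Table III, since Table II does not list them, and the flattened feature size of 96 is what those layers produce for 100 x 50 input frames; both should be treated as assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SLRNet(nn.Module):
    """Lipnet-style 3D-CNN + Bi-GRU network producing per-frame log-probabilities for CTC."""
    def __init__(self, vocab_size=33):           # 32 characters + CTC blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 32, (3, 5, 5), (1, 2, 2), (1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2), (1, 2, 2)),
            nn.Conv3d(32, 64, (3, 5, 5), (1, 2, 2), (1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2), (1, 2, 2)),
            nn.Conv3d(64, 96, (3, 3, 3), (1, 2, 2), (1, 2, 1)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2), (1, 2, 2)),
        )
        self.gru1 = nn.GRU(96, 256, batch_first=True, bidirectional=True)  # 96 = C*H*W for 100x50 frames
        self.gru2 = nn.GRU(512, 256, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, vocab_size)

    def forward(self, x):                         # x: (N, C, T, H, W)
        x = self.cnn(x)
        n, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(n, t, c * h * w)   # T x (C x H x W)
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        return self.fc(x).log_softmax(dim=-1)     # fed to the CTC loss during training

model = SLRNet()
out = model(torch.randn(1, 3, 270, 50, 100))      # (1, 270, 33) per-frame distributions
```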
B. First Model
This model is identical to the Lipnet model, except that it evaluates the usefulness of padding by not applying it. The model still uses three blocks of 3D-CNN and two blocks of bidirectional RNN. The dataset used for this model is the normal dataset comprising 3,006 image sequences. The architecture of the Lipnet model and the first model is shown in Fig. 3. This model used Hyperparameter 2, as shown in Table III.
Fig. 3. Lipnet Model and the First Model
TABLE III. HYPERPARAMETER 2
Layer      Size/Stride/Padding        Input Size        Dimension Order
3D-CNN     (3,5,5)/(1,2,2)/(0,0,0)    270x3x50x100      T x C x H x W
Pool       (1,2,2)/(1,2,2)            270x32x23x48      T x C x H x W
3D-CNN     (3,5,5)/(1,1,1)/(0,0,0)    270x32x11x24      T x C x H x W
Pool       (1,2,2)/(1,2,2)            270x64x7x20       T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(0,0,0)    270x64x3x10       T x C x H x W
Pool       (1,2,2)/(1,2,2)            270x96x1x4        T x C x H x W
Bi-GRU     256                        270x(96x1x4)      T x (C x H x W)
Bi-GRU     256                        270x512           T x F
Linear     32 + blank                 270x512           T x F
Softmax                               270x33            T x V
C. Second Model
The second model is simpler than the Lipnet model. It uses one block of 3D-CNN and one block of bidirectional RNN, and was trained on the normal dataset. The architecture of the second model is shown in Fig. 4. This model used Hyperparameter 3, as shown in Table IV.
TABLE IV. HYPERPARAMETER 3
Layer      Size/Stride/Padding        Input Size        Dimension Order
3D-CNN     (3,5,5)/(1,2,2)/(0,0,0)    270x3x50x100      T x C x H x W
Pool       (1,2,2)/(1,2,2)            270x32x23x48      T x C x H x W
Bi-GRU     256                        270x(32x11x24)    T x (C x H x W)
Linear     32 + blank                 270x256           T x F
Softmax                               270x33            T x V
Fig. 4. The Second Model
D. Third Model
The third model is more complex than the Lipnet model. It uses eight blocks of 3D-CNN and two blocks of bidirectional RNN, and was trained on the normal dataset. The architecture of the third model is shown in Fig. 5. This model used Hyperparameter 4, as shown in Table V.
Fig. 5. The Third Model
TABLE V. HYPERPARAMETER 4
Layer      Size/Stride/Padding        Input Size        Dimension Order
3D-CNN     (3,5,5)/(1,2,2)/(2,4,4)    270x3x50x100      T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(2,4,4)    270x32x27x53      T x C x H x W
Pool       (1,2,2)                    270x32x33x59      T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(1,2,2)    270x64x16x29      T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(1,2,2)    270x96x18x31      T x C x H x W
Pool       (1,2,2)                    270x96x20x33      T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(1,2,2)    270x128x10x16     T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(1,2,2)    270x128x12x18     T x C x H x W
Pool       (1,2,2)                    270x128x14x20     T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(1,2,2)    270x256x7x10      T x C x H x W
3D-CNN     (3,3,3)/(1,1,1)/(1,2,2)    270x256x9x12      T x C x H x W
Pool       (1,2,2)                    270x256x11x14     T x C x H x W
Bi-GRU     256                        270x(512x5x7)     T x (C x H x W)
Bi-GRU     256                        270x512           T x F
Linear     32 + blank                 270x512           T x F
Softmax                               270x33            T x V
VI. RESULT
To measure the performance of the trained models, the word error rate (WER) and character error rate (CER) are computed. WER and CER count the number of substitutions, deletions, and insertions obtained by comparing the true label with the hypothesis; a sketch of the computation is given below.
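A minimal sketch of this metric as a word-level (or character-level) edit distance divided by the reference length; the implementation is illustrative, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```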
Table VI summarizes the performance of the system using the different models described in the previous section. Our proposed models are compared with the Lipnet model in terms of WER and CER.
TABLE VI. EVALUATION RESULT
Model           WER (Word Error Rate)    CER (Character Error Rate)
Lipnet          87.10%                   70%
First Model     89.40%                   69.58%
Second Model    90.50%                   76.90%
Third Model     88.17%                   65.33%
Based on the performance shown in Table VI, the errors are still high. The average WER over all evaluated models is 88.79%. The Lipnet model gives the best result among the models in terms of WER, with the lowest error of 87.10%, while the second model gives the worst result with a WER of 90.50%. In terms of CER, the third model gives the best result, with a CER of 65.33%. Every trained model still performed poorly in recognizing sign language gestures, even though the models used similar data.
VII. CONCLUSION AND FUTURE WORKS
We proposed a deep learning model for sentence-level sign language recognition. The deep learning methods used in this research are a 3D-CNN and a bidirectional Recurrent Neural Network (RNN). This research used 30 sentences from the SIBI video dataset collected by the authors.
Based on the results presented in the previous section, the proposed models and the dataset do not seem to match well. The first, second, and third models, which used the normal dataset, yield similar error rates, and the Lipnet model, which was trained on the larger translated dataset, still produces a similar result. For future work, we would like to preprocess the data to better match the model, for example by handling noise in the dataset.
ACKNOWLEDGMENT
This research was supported by research grant No. 039A/VR.RTT/VI/2017 from the Ministry of Research, Technology and Higher Education of the Republic of Indonesia.
REFERENCES
[1] D. R. Kurnia and T. Slamet, "Menormalkan yang dianggap 'tidak normal' (studi kasus: penertiban bahasa isyarat tunarungu di SLB Malang)," Indonesian Journal of Disability Studies, 3(1), pp. 34-43, 2016.
[2] M. R. Abid, E. M. Petriu, and E. Amjadian, "Dynamic sign language recognition for smart home interactive application using stochastic linear formal grammar," IEEE Transactions on Instrumentation and Measurement, 64(3), pp. 596-605, 2015.
[3] J. Huang, W. Zhou, H. Li, and W. Li, "Sign language recognition using 3D convolutional neural networks," IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp. 1-6, 2015.
[4] O. Koller, J. Forster, and H. Ney, "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers," Computer Vision and Image Understanding, 141, pp. 108-125, 2015.
[5] M. K. Bhuyan, D. A. Kumar, K. F. MacDorman, and Y. Iwahori, "A novel set of features for continuous hand gesture recognition," Journal on Multimodal User Interfaces, 8(4), pp. 333-343, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 770-778, 2016.
[7] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," International Conference on Learning Representations, pp. 1-14, 2014.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634, 2015.
[9] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), pp. 221-231, 2013.
[10] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," International Workshop on Human Behavior Understanding, pp. 29-39, 2011.
[11] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "Lipnet: Sentence-level lipreading," arXiv preprint arXiv:1611.01599, 2016.
[12] O. Koller, H. Ney, and R. Bowden, "Deep learning of mouth shapes for sign language," Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 85-91, 2015.
[13] S. A. Mehdi and Y. N. Khan, "Sign language recognition using sensor gloves," Proceedings of the 9th International Conference on Neural Information Processing (ICONIP), IEEE, pp. 2204-2206, 2002.
[14] L. T. Phi, H. D. Nguyen, T. Q. Bui, and T. T. Vu, "A glove-based gesture recognition system for Vietnamese sign language," Proceedings of the 15th International Conference on Control, Automation and Systems (ICCAS), 13(16), pp. 1555-1559, 2015.
[15] A. Ranjini S. S. and M. Chaitra, "Sign language recognition system," International Journal on Recent and Innovation Trends in Computing and Communication, 2(4), pp. 947-953, 2014.
[16] S. Saengsri, V. Niennattrakul, and C. A. Ratanamahatana, "TFRS: Thai finger-spelling sign language recognition system," 2nd International Conference on Digital Information and Communication Technology and its Applications (DICTAP), pp. 457-462, 2012.
[17] M. Koul, P. Patil, V. Nandurkar, and S. Patil, "Sign language recognition using leap motion sensor," International Research Journal of Engineering and Technology (IRJET), 3(11), pp. 322-325, 2016.
[18] L. E. Potter, J. Araullo, and L. Carter, "The leap motion controller: A view on sign language," Proceedings of the 25th Australian Computer-Human Interaction Conference on Augmentation, Application, Innovation, Collaboration (OzCHI), ACM, pp. 175-178, 2013.
[19] M. U. Kakde, M. G. Nakrani, and A. M. Rawate, "A review paper on sign language recognition system for deaf and dumb people using image processing," International Journal of Engineering Research and Technology, 5(3), pp. 590-592, 2016.
[20] H. Bhavsar, "Review on feature extraction methods of image based sign language recognition system," Indian Journal of Computer Science and Engineering, 8(3), pp. 249-259, 2017.
[21] S. B. Carneiro, E. D. D. M. Santos, M. D. A. Talles, J. O. Ferreira, S. G. S. Alcala, and A. F. Da Rocha, "Static gestures recognition for Brazilian sign language with kinect sensor," SENSORS, IEEE, 2016.
[22] E. Escobedo and G. Camara, "A new approach for dynamic gesture recognition using skeleton trajectory representation and histograms of cumulative magnitudes," SIBGRAPI Conference on Graphics, Patterns and Images, pp. 209-216, 2016.
[23] J. Ma, W. Gao, J. Wu, and C. Wang, "A continuous Chinese sign language recognition system," Proceedings of the 4th International Conference on Automatic Face and Gesture Recognition, IEEE, pp. 428-433, 2000.
[24] Y. Jiang, J. Tao, W. Ye, W. Wang, and Z. Ye, "An isolated sign language recognition system using RGB-D sensor with sparse coding," 17th International Conference on Computational Science and Engineering, IEEE, pp. 21-26, 2014.
[25] C. Keskin, F. Kırac, Y. E. Kara, and L. Akarun, "Real time hand pose estimation using depth sensors," International Conference on Computer Vision Workshops, IEEE, pp. 1228-1234, 2011.
[26] J. L. Raheja, A. Mishra, and A. Chaudhary, "Indian sign language recognition using SVM," Pattern Recognition and Image Analysis, 26(2), pp. 434-441, 2016.
[27] E. Rakun, M. Andriani, I. W. Wiprayoga, K. Danniswara, and A. Tjandra, "Combining depth image and skeleton data from kinect for recognizing words in the sign system for Indonesian language (SIBI [sistem isyarat bahasa Indonesia])," International Conference on Advanced Computer Science and Information Systems (ICACSIS), pp. 387-392, 2013.
[28] L. G. Zhang, Y. Chen, G. Fang, X. Chen, and W. Gao, "A vision-based sign language recognition system using tied-mixture density HMM," Proceedings of the 6th International Conference on Multimodal Interfaces, ACM, pp. 198-204, 2004.
[29] T. Starner and A. Pentland, "Real-time American sign language recognition from video using hidden Markov models," Motion-Based Recognition, Springer, pp. 227-243, 1997.
[30] E. A. Kalsh and N. S. Garewal, "Sign language recognition system," International Journal of Computational Engineering Research, 3(6), pp. 15-21, 2013.
[31] M. V. D. Prasad, V. Kishore, and A. Kumar, "Indian sign language recognition system using new fusion based edge operator," Journal of Theoretical and Applied Information Technology, 88(3), pp. 574-584, August 2016.
[32] D. K. Ghosh and S. Ari, "On an algorithm for vision-based hand gesture recognition," Signal, Image and Video Processing, 10(4), pp. 655-662, 2016.
[33] P. V. V. Kishore, M. V. D. Prasad, C. R. Prasad, and R. Rahul, "4-camera model for sign language recognition using elliptical Fourier descriptors and ANN," International Conference on Signal Processing and Communication Engineering Systems (SPACES), IEEE, pp. 34-38, 2015.
[34] S. C. W. Ong and S. Ranganath, "Automatic sign language analysis: a survey and the future beyond lexical meaning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), pp. 873-891, 2005.
[35] K. M. Lim, A. W. C. Tan, and S. C. Tan, "A feature covariance matrix with serial particle filter for isolated sign language recognition," Expert Systems with Applications, Elsevier Ltd, 54, pp. 208-218, 2016.
[36] P. C. Pankajakshan and B. Thilagavathi, "Sign language recognition system," Innovations in Information, Embedded and Communication Systems (ICIIECS), IEEE, pp. 1-4, 2015.
[37] T. E. Starner, "Visual recognition of American sign language using hidden Markov models," Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, 1995.
[38] R. Y. Wang and J. Popović, "Real-time hand-tracking with a color glove," ACM Transactions on Graphics (TOG), 28(3), 63, 2009.
[39] S. Yan, Y. Xia, J. S. Smith, W. Lu, and B. Zhang, "Multiscale convolutional neural networks for hand detection," Applied Computational Intelligence and Soft Computing, 2017.
[40] R. H. Liang and M. Ouhyoung, "A real-time continuous gesture recognition system for sign language," Proceedings of the 3rd International Conference on Automatic Face and Gesture Recognition, IEEE, pp. 558-567, 1998.
[41] H. Wang, M. C. Leu, and C. Oz, "American sign language recognition using multi-dimensional hidden Markov models," Journal of Information Science and Engineering, 22(5), pp. 1109-1123, 2006.
[42] M. Elmezain, A. Al-Hamadi, J. Appenrodt, and B. Michaelis, "A hidden Markov model-based continuous gesture recognition system for hand motion trajectory," 19th International Conference on Pattern Recognition, IEEE, pp. 1-4, 2008.
[43] V. N. Pashaloudi and K. G. Margaritis, "A performance study of a recognition system for Greek sign language alphabet letters," 9th Conference Speech and Computer, 2004.
[44] R. Kaluri and C. H. Pradeep, "An enhanced framework for sign gesture recognition using hidden Markov model and adaptive histogram technique," International Journal of Intelligence and Engineering System, 10, 2007.
[45] M. C. Roh, S. Fazli, and S. W. Lee, "Selective temporal filtering and its application to hand gesture recognition," Applied Intelligence, 45(2), pp. 255-264, 2016.
[46] E. Rakun, M. F. Rachmadi, and K. Danniswara, "Spectral domain cross correlation function and generalized learning vector quantization for recognizing and classifying Indonesian sign language," Advanced Computer Science and Information Systems (ICACSIS), pp. 978-979, 2012.
[47] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, "Convolutional-recursive deep learning for 3D object classification," Advances in Neural Information Processing Systems, pp. 656-664, 2012.
[48] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137, 2015.
[49] N. Palfreyman, "Sign language varieties of Indonesia: A linguistic and sociolinguistic investigation," Sign Language and Linguistics, 20(1), 2017.
[50] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[51] H. H. Aghdam and E. J. Heravi, Guide to Convolutional Neural Networks, Springer International Publishing, 2017.
[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 86(11), pp. 2278-2324, 1998.
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[54] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," Workshop on Deep Learning (NIPS), pp. 1-9, December 2014.
[55] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1764-1772, 2014.
[56] A. Maas, Z. Xie, D. Jurafsky, and A. Ng, "Lexicon-free conversational speech recognition with neural networks," Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 345-354, 2015.