RESEARCH ARTICLE
Advanced Science Letters, Vol. 4, No. 2, pp. 400–407, 2016
Copyright © 2016 American Scientific Publishers. All rights reserved.
Recognition of Sign Language System for
Indonesian Language using Long Short-Term
Memory Neural Networks
Erdefi Rakun1, Aniati M. Arymurthy1, Lim Y. Stefanus1, Alfan F. Wicaksono1, I Wayan W. Wisesa1
1 Fakultas Ilmu Komputer, Universitas Indonesia, Depok 16424, Jawa Barat, Indonesia
*Email address: efi@cs.ui.ac.id
SIBI (the Sign Language System for the Indonesian Language) is the official sign language system for the Indonesian language. This research aims to find a suitable model for performing SIBI-to-text translation on inflectional word gestures. Extant research has been able to translate the alphabet, root words, and numbers from SIBI to text. Inflectional words are root words with prefixes, infixes, and suffixes, or some combination of the three. A new method that splits an inflectional word into three feature vector sets was developed. This reduces the number of feature sets used, which would otherwise be as large as the product of the prefix, suffix, and root word feature sets of the inflectional word gestures. Long Short-Term Memory (LSTM) is used, as this model can take entire sequences as input and does not have to rely on pre-clustered per-frame data. LSTM suits this system well, as SIBI sequence data has a long-term temporal dependency. The 2-layer LSTM performed the best, being 95.4% accurate with root words. The same model is 77% accurate with inflectional words, using the combined skeleton-image feature set with an 800-epoch limit. The lower accuracy with inflectional words is due to difficulties in recognizing prefixes and suffixes.

Keywords: Inflectional Words, Long Short-Term Memory, Deep Learning, Kinect, sign language, SIBI.
1. INTRODUCTION
SIBI (Sistem Isyarat Bahasa Indonesia, the Sign Language System for the Indonesian Language) is the official method of communication for the hearing-impaired in Indonesia. SIBI is used for communication between the hearing-impaired, as well as between the speech/hearing-impaired and those without the impairment [1]. SIBI is the Indonesian language, complete with its native syntax, represented in gestures. There are four types of SIBI gestures: root word, affix, inflectional, and pronoun gestures. Pronouns are further divided into four categories: personal, possessive, pointer, and conjunctive [1].
Like most other sign languages, SIBI is not trivial to
master, and consequently there is a need for a system to
convert SIBI gestures into a text output. The challenge in
creating a SIBI-to-text translation system is the
recognition of the four different types of gestures in SIBI.
The translation system in progress must eventually be
able to recognize the gestures associated with all the
aforementioned linguistic elements efficiently, quickly,
and accurately.
This research and extant ones have attempted to create the components that make up a SIBI-to-text translation system. The first to be created was a system that translates solely the alphabet and numbers [2]. A system that can translate root words was then created using the acquired knowledge. Previous work [3] and this research focus on creating a system that can translate
inflectional words. Inflectional words are root words
combined with prefixes, suffixes, infixes, or a mix of
some of the three. These additional morphemes and
inflections add extra information to the root word, as well
as ensuring that the resulting inflectional words serve a
particular grammatical or logical purpose. As an
illustration, adding the prefix "me-", and the suffix "-i" to
the root word "lempar" (to throw), results in the word
"melempari" (to throw at). This is a different result than
when "me-" and "-kan" are added to "lempar," which
results in the inflectional word "melemparkan" (to throw
with). The subject of the latter is what is being thrown,
and the subject of the former is the target of the throw.
There are no specific gestures for inflectional words
in SIBI, but the gestures that represent them are
constructed modularly. For instance, there is no single
gesture for “berlompatan” (= to jump about); instead,
three gestures are used: the gesture for the prefix “ber-” +
the gesture for the root word “lompat” (=jump) + the
gesture for the suffix “-an,” as shown in Fig. 1. This method of gesture concatenation is unique to SIBI [4].

Fig. 1. Gesture representing the inflectional word “berlompatan” (= to jump about) [4]
The translation system needs to employ a feature extraction technique that yields a minimally-sized feature vector set, yet is still capable of recognizing different types of gestures [5]. There are 7 prefixes, 11 suffixes, and thousands of root words in the Indonesian language. To construct an inflectional word, one can concatenate a maximum of 3 prefixes and 2 suffixes [6]. The number of possible combinations is in the thousands, so in order to recognize all existing inflectional words, the SIBI translation system would need a correspondingly large number of feature vector sets.
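To make the scale concrete, here is a back-of-the-envelope comparison of a per-word feature set versus the component-based approach described next (the root-word count is an assumed order of magnitude, for illustration only):

```python
# Rough comparison of feature-set counts (illustrative numbers only).
n_prefixes, n_suffixes = 7, 11
n_roots = 3000  # assumption: Indonesian has thousands of root words

# One feature vector set per whole inflectional word (prefix + root + suffix):
whole_word_sets = n_prefixes * n_roots * n_suffixes

# Component-based approach: one set each for prefixes, root words, suffixes.
component_sets = n_prefixes + n_roots + n_suffixes

print(whole_word_sets)   # 231000
print(component_sets)    # 3018
```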
In this research, to reduce the number of required feature vector sets, the gesture of an inflectional word is separated into, and recognized by, its components (Fig. 2).

Fig. 2. The possible components of an inflectional word gesture: a frame sequence in which epenthesis gestures are interleaved with the prefix, root word, and suffix gestures.

With this component recognition capability, the translation system uses only three feature vector sets: one for prefix gestures, one for root word gestures, and one for suffix gestures. This is much smaller than generating a separate feature vector set for each inflectional word in the Indonesian language. This separation method greatly reduces the computation time required to interpret inflectional word gestures and improves the efficiency of the translation system.
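Once the three components have been recognized, assembling the text output is a simple concatenation of the component labels; a toy sketch (the function and label spellings are illustrative):

```python
def assemble_word(prefix, root, suffix):
    """Concatenate recognized component labels into the output word."""
    return "".join(part for part in (prefix, root, suffix) if part)

# The example from the text: "ber-" + "lompat" + "-an" -> "berlompatan".
print(assemble_word("ber", "lompat", "an"))   # berlompatan
print(assemble_word("me", "lempar", "i"))     # melempari
```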
In addition to being able to recognize inflectional
words, the model developed by this research can be
adjusted to recognize other types of words in SIBI, such as root words, pronouns, conjunctions, repeated words (for example, "kemerah-merahan", reddish), and compound words (for example, "bertanggung jawab", to be responsible for). This adjustability is the main contribution of this research in the realm of SIBI gesture-to-text translation.
This research attempts to use Long Short-Term
Memory (LSTM [7]) neural networks, due to their ability to take advantage of the long-term temporal dependencies between the frames of a SIBI sequence to improve the model's predictions.
2. RELATED WORKS
Only a few studies in Indonesia focus on translating gestures to text. Kurniawan [8] and Rakun [2] translated the alphabet to text, while Najiburahman [9], Rakun [10], and Marcelita [11] focused on translating root words, not complex inflectional words. Najiburahman [9] categorized words in SIBI into words with one, two, or three gestures, while this research uses image and skeleton features to correctly identify any word expressed in SIBI, irrespective of the number of gestures that make up the word.
Much related research has been done outside Indonesia. Most of it attempts to recognize a particular sign language using Hidden Markov Models (HMM) [12]-[18]. This group of extant research is collectively unsuited for the recognition of SIBI's inflectional words.
In previous research, we tried dividing the inflectional word gestures into the following subsequences: prefix, root word, and suffix. These subsequences were then classified by a heuristic Hidden Markov Model (HMM), with a best accuracy of 67.77% [3]. One shortcoming of using HMM is that a K-Means pre-clustered feature set had to be used as the HMM's observed variable [19]. This pre-clustering scheme results in some loss of information from the extracted feature set. It is highly likely that the accuracy will improve if a model that can use features that are not pre-clustered is employed.
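For concreteness, the pre-clustering step in question looks roughly like the following scikit-learn sketch (the cluster count and feature dimensionality are illustrative assumptions); each frame's feature vector is collapsed to a single discrete symbol, which is where the information loss occurs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Per-frame feature vectors from the training set (placeholder data;
# the dimensions are illustrative assumptions).
frames = np.random.rand(5000, 16)

kmeans = KMeans(n_clusters=64, random_state=0).fit(frames)

# Each frame is reduced to a single discrete symbol to serve as the HMM's
# observed variable; the within-cluster variation of the features is lost.
observed_symbols = kmeans.predict(frames)
print(observed_symbols[:10])
```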
HMM's architecture dictates that each frame be assigned a label, which is clearly not the ideal way to
classify the sequences. The frame-labeling method results
in one sequence having multiple labels, making it difficult
to ascertain the validity of the error figure returned by the
model. HMM's classification is done per frame, a smaller unit than the unit the model is actually required to classify, namely the components of inflectional words, each of which is represented by a whole sequence in the data.
In the interest of overcoming HMM's apparent
limitations, LSTM neural networks were used in this
research. The unique ability of LSTM is its capacity to decide whether to store, ignore, or forget information. This stems from its four interacting neural-network layers acting as gating units [7],[20],[21]. Because of this ability, LSTM is considered well suited to speech and handwriting recognition and polyphonic music modeling [22], as well as Chinese word segmentation [23].
These tasks bear some similarities to the objective of this
research, as for instance with handwriting recognition, a
small sequence of a pen's strokes is relatively meaningless
unless the model correctly identifies the preceding
sequence of strokes. In short, highly accurate execution of
any of the above would be difficult if their respective
models cannot take advantage of their data's temporal
dependencies.
3. SCOPE OF RESEARCH
The prefixes and suffixes examined in this research are all the prefixes, all the suffixes, and the prefix + root word + suffix combinations in the Indonesian language's grammatical structure [6]. Inflectional word gestures in this research are defined as:
One prefix gesture + root word gesture, or
Root word gesture + one suffix gesture, or
One prefix gesture + root word gesture + one suffix
gesture.
SIBI categorizes its gestures into two groups: primary
gestures and supporting gestures. Primary gestures
determine the meaning of the gesture, whereas supporting
gestures give additional meaning to the gesture. An example of a supporting gesture is a facial expression added to the gesture, fulfilling a role similar to that of intonation in speech. This research focuses on primary
gestures performed by both hands as well as the finger
movements of both hands.
4. EXPERIMENTS
4.1. DATASETS
This experiment uses 19 root words and 144 inflectional words, each recorded 10 times, for a total of 1,630 sequences. The gestures were performed by 2 teachers from the Santi Rama School for the hearing-impaired in Jakarta. The sequences are broken down into subsequences of prefixes, root words, and suffixes. The training data set consists of 457 root word, 290 prefix, and 238 suffix subsequences, for a total of 985 subsequences. The testing data set's proportions are 416:264:213, for a total of 893 subsequences.
4.2. FEATURES
This study uses Microsoft Kinect to record the SIBI
gestures. The output from the depth sensor and the
skeleton tracking output from the Kinect are used to
obtain the experimental features. The movement of the tracked skeleton is computed from the angles between pairs of joints in the skeleton, shown in Fig. 3. The shoulder-center joint is used as the origin, and the angles of interest are those formed between the origin and the elbow, and between the origin and the hand. Each of these angles is described in terms of two angular components, θ and φ. The sequences of these angles represent the movement of the combined arm-and-hand shape from frame to frame.

Fig. 3. Joints from skeleton tracking generated by Kinect
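As an illustration of how such joint angles can be computed, here is a sketch assuming θ and φ denote the azimuth and elevation of a joint's direction vector from the origin joint (the exact definitions used in the paper are not reproduced here):

```python
import numpy as np

def joint_angles(origin, joint):
    """Angles of a joint's direction vector relative to an origin joint.

    Interpreting theta/phi as azimuth and elevation is one plausible reading
    of the paper's two angle components; the original definitions are assumed.
    """
    x, y, z = np.asarray(joint, dtype=float) - np.asarray(origin, dtype=float)
    theta = np.arctan2(y, x)             # azimuth in the x-y plane
    phi = np.arctan2(z, np.hypot(x, y))  # elevation above the x-y plane
    return theta, phi

# Example: the right hand relative to the shoulder-center joint.
print(joint_angles((0.0, 0.0, 1.0), (0.3, 0.2, 1.4)))
```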
We used MATLAB's regionprops function [24] to extract the image-based features. This function measures the region properties in an image and returns them as a structured array. Each frame's depth image is transformed into a binary depth matrix, which is then used as the parameter for the regionprops function.
The array returned by this function may contain more than one object, depending on the object(s) or region(s) found within the image. The largest region is assumed to be the hand-blob area, and this area is selected as the extraction region. The features chosen from the extraction region are: area, centroid, major and minor axes, orientation, and normalized convex hull. These five features sufficiently define the hand's position and shape. The area helps the model recognize different hand shapes; the major and minor axes help define the perimeter of the hand blob. The orientation feature defines the orientation of the hand in each
frame; the convex hull represents the smallest convex polygon that envelops the extraction region. A normalized convex hull is an alternate representation of the convex hull, in which the coordinates of the convex hull's vertices are described relative to the extraction region's centroid, as opposed to a static origin.
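An equivalent extraction can be sketched in Python, with scikit-image and SciPy standing in for MATLAB's regionprops (the function and output layout below are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.spatial import ConvexHull
from skimage.measure import label, regionprops

def hand_blob_features(binary_depth):
    """Extract the image-based features from one binary depth frame."""
    regions = regionprops(label(binary_depth))
    hand = max(regions, key=lambda r: r.area)   # largest region = hand blob

    # Normalized convex hull: hull vertices relative to the region centroid.
    hull = ConvexHull(hand.coords)
    hull_pts = hand.coords[hull.vertices] - np.asarray(hand.centroid)

    return {
        "area": hand.area,
        "centroid": hand.centroid,
        "major_axis": hand.major_axis_length,
        "minor_axis": hand.minor_axis_length,
        "orientation": hand.orientation,
        "convex_hull": hull_pts,
    }
```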
4.3. OUTLINE OF EXPERIMENTS
The whole experiment is outlined in Fig. 4. Our inputs consist of skeleton tracking data and frame video (depth image) data captured by the Kinect. The feature extraction process yields skeleton point angles, depth image properties, and combined skeleton-image feature data.

Fig. 4. Experiment flow
The raw data contains epentheses, which are transitional gestures. They do not carry any information by themselves, since they are present simply as links between the other gestures. During preprocessing, the epentheses are cut out of the data, so that what remains are only the gestures of each inflectional word's components, namely the prefix, the root word, and the suffix.
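A sketch of this preprocessing step, assuming each frame carries a per-frame component annotation (the label names and data layout are assumptions for illustration):

```python
def split_components(frames, labels):
    """Drop epenthesis frames and group the rest into labeled subsequences.

    `labels` is a per-frame annotation such as "epenthesis", "prefix",
    "root", or "suffix" (an assumed annotation format).
    """
    segments, current_label, current = [], None, []
    for frame, lab in zip(frames, labels):
        if lab == "epenthesis":              # transitional frames are cut out
            if current:
                segments.append((current_label, current))
                current_label, current = None, []
            continue
        if lab != current_label and current: # component boundary
            segments.append((current_label, current))
            current = []
        current_label = lab
        current.append(frame)
    if current:
        segments.append((current_label, current))
    return segments  # e.g. [("prefix", [...]), ("root", [...]), ("suffix", [...])]
```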
The average sequence length of the preprocessed data set is then calculated; it turns out to be 32 frames. All sequences are then homogenized to that length. The entire data set is then split into training and testing data.
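Homogenization amounts to truncating longer sequences and padding shorter ones; a minimal sketch with the Keras utility (the post-padding/post-truncation policy is an assumption):

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

AVG_LEN = 32  # average sequence length of the preprocessed data set

# Dummy stand-ins for the extracted per-frame feature sequences.
sequences = [np.random.rand(n, 16) for n in (45, 20, 32)]

homogenized = pad_sequences(sequences, maxlen=AVG_LEN, dtype="float32",
                            padding="post", truncating="post")
print(homogenized.shape)  # (3, 32, 16)
```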
The skeleton, image, and combined data sets are then fed into the LSTM model. The parameters measured for evaluation purposes are the accuracy on both the training and testing data sets, as well as the time required to perform training and testing. The experiment uses an LSTM implementation written in Python 2.7 with the Keras and Theano libraries. The experiments were run on an i7-equipped PC.
4.4. MODELS
Fig. 5 shows the 1-layer and 2-layer LSTM architectures used in this experiment. In this LSTM implementation, one label is assigned to one whole sequence, as opposed to labeling each frame. x_n denotes the feature vector of the n-th frame. A more detailed visualization of the internals of each LSTM block is shown in Fig. 6.

Fig. 5. Block diagram of (left) 1-layer LSTM and (right) 2-layer LSTM
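A minimal Keras sketch of the 2-layer variant follows; the hidden sizes, optimizer, and class count are assumptions for illustration (19 root words + 7 prefixes + 11 suffixes would give 37 component labels), not the paper's exact hyperparameters:

```python
from keras.models import Sequential
from keras.layers import Dense, LSTM

N_FRAMES = 32     # homogenized sequence length
N_FEATURES = 16   # assumed size of the per-frame combined feature vector
N_CLASSES = 37    # assumed: 7 prefixes + 19 root words + 11 suffixes

model = Sequential()
model.add(LSTM(128, return_sequences=True,
               input_shape=(N_FRAMES, N_FEATURES)))   # first LSTM layer
model.add(LSTM(128))              # second layer; emits one vector per sequence
model.add(Dense(N_CLASSES, activation="softmax"))     # one label per sequence
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Training with an epoch limit as in Table 1, e.g.:
# model.fit(x_train, y_train, epochs=800, validation_data=(x_test, y_test))
```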
Using a sigmoid function σ, the forget gate discards irrelevant information from the previous cell state C_{t-1}. Then the input gate layer, using a sigmoid function σ, and a separate tanh layer cooperate to calculate the new value of the current cell state C_t. The fourth layer determines the value of the output h_t. The sigmoid function σ on this final layer determines which part of C_t is carried through into h_t, and the tanh function in this final layer ensures that the value of h_t lies between -1 and 1. Using both of these functions, h_t contains only the relevant portion of C_t [21].

Fig. 6. Four interacting layers of LSTM [21]
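In the standard notation of [7] and [21], the gate computations just described are:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) &&\text{(input gate)}\\
\tilde{C}_t &= \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) &&\text{(candidate state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{(new cell state)}\\
o_t &= \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) &&\text{(output)}
\end{aligned}
```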
5. RESULTS
The independent variables here are the epoch limit (denoted nE), the feature type (skeleton / SKL, image / IMG, combined / CMB), and the model type (1- and 2-layer LSTM, denoted L1 and L2). The dependent variables are the training and testing times, and the accuracy of the predictions on the testing data. Table 1 shows how each model's accuracy varies with the feature type and with increasing epoch limits, using the testing data set.
Table 1. Inflectional word prediction accuracy for the different types of features

nE     CMB-L1  CMB-L2  SKL-L1  SKL-L2  IMG-L1  IMG-L2
100    0.559   0.677   0.594   0.624   0.479   0.677
200    0.560   0.736   0.637   0.686   0.533   0.710
300    0.601   0.727   0.653   0.709   0.501   0.703
400    0.604   0.721   0.646   0.730   0.498   0.709
500    0.577   0.753   0.646   0.723   0.516   0.736
600    0.587   0.766   0.665   0.709   0.504   0.718
700    0.591   0.739   0.661   0.712   0.503   0.714
800    0.580   0.770   0.675   0.747   0.494   0.709
900    0.615   0.747   0.673   0.705   0.508   0.705
1000   0.598   0.759   0.676   0.703   0.493   0.727
From these results, it can be concluded that the model is most accurate when using the combined feature data set, as opposed to the individual feature data sets. This was achieved with an 800-epoch limit for the 2-layer LSTM, with an accuracy of 77%.
The time required by the 2-layer LSTM for training or testing is nearly twice that required by the 1-layer LSTM. However, this increase is offset by the gain in accuracy when using 2 layers. Using a combined feature set and an 800-epoch limit, the 2-layer LSTM is 77% accurate, whereas the 1-layer LSTM is only 58% accurate.
6. ANALYSIS
The results of the prediction are summarized in the confusion matrix shown in Fig. 7. This matrix shows that the errors occur mainly with prefixes and suffixes. Consequently, further experiments were done to shed more light on this issue.
6.1. ANALYSIS FROM THE DATA AND FEATURES POINT OF VIEW
In order to understand the experimental results better, the LSTM was then run on the individual groups (prefixes, root words, and suffixes). An additional process was implemented to separate both the training and testing data into three different groups. This also changes the average sequence length, depending on which type a sequence belongs to. The resulting average lengths for prefixes, root words, and suffixes are 30, 37, and 22 frames, respectively.
The best result for each group of data can be seen in Table 2. From this table, it can be concluded that the model is again most accurate when using the combined feature data set. This was achieved with the 2-layer LSTM, with an accuracy of 95.4% when tested on root words only. The lowest accuracies were attained for prefixes and suffixes; their confusion matrices are shown in Fig. 8 and Fig. 9, respectively.

Table 2. Best prediction result for the inflectional, root word, prefix, and suffix groups

Group          Avg. sequence length (frames)   nE    Accuracy
Inflectional   32                              800   0.770
Root           37                              900   0.954
Prefix         30                              700   0.667
Suffix         22                              800   0.690

Fig. 7. Confusion matrix for the prediction results on the testing data
Fig. 8. Confusion matrix for the prefixes
Fig. 9. Confusion matrix for the suffixes
This happens mainly because of the way prefixes are expressed in SIBI. A prefix is expressed by having the right hand form the first letter of the prefix, with the left palm facing to the right, and moving both hands to meet in the middle. The issue is that the left hand's orientation is the same for every prefix, which reduces the uniqueness of a prefix's features. This is particularly true within the prefix group (me-, te-, and se-), as seen in Fig. 10.

Fig. 10. L-R: gestures for the prefixes me-, te-, and se-
The suffixes also suffer from a general lack of feature uniqueness. In SIBI, after the root word has been expressed, the suffix is expressed by having the right hand form the first letter of the suffix. The right hand then moves from the final position of the root word's gesture (in front of the chest) towards the right hip in a downward arc. The first letters of the suffixes have very similar xy-plane projections to begin with, and the downward arc is similar in every suffix. The left hand remains stationary in all suffix gestures. All of the above means that the skeleton data for the suffixes have a large degree of commonality, so the model has to rely solely on the right hand's image data. This problem proves especially difficult to solve among the suffix group -kah, -kan, and -lah, as shown in Fig. 11.

Fig. 11. L-R: gestures for the suffixes -kah, -kan, and -lah
7. CONCLUSIONS AND FUTURE WORKS
Using LSTM results in a higher accuracy, from 66.7% with HMM to 77.4% with the 2-layer LSTM. The remaining errors are mostly due to misidentification of prefixes and suffixes. These errors occur because of the lack of feature uniqueness in the gestures of the error-causing prefixes and suffixes.
Experiments using the skeleton, image, and combined feature sets reveal that the model performs best with the combined features. This is because the relevant movements of the arm and hand, as well as the finger shapes, are best captured by the combined feature set.
When tasked with identifying only the root words, the 2-layer LSTM is 95.4% accurate. This means that the 2-layer LSTM works well if the gestures to be identified are sufficiently unique relative to each other. It is also worth noting that root word sequences are 37 frames long on average, longer than the prefix or suffix sequences. Further investigation is needed to study how the LSTM's accuracy varies with sequence length.
The time required for training and testing the 2-layer LSTM is double that of the 1-layer LSTM, but the 2-layer LSTM is more accurate. Using a combined feature set and an 800-epoch limit, the 2-layer LSTM is 77% accurate, whereas the 1-layer LSTM is only 58% accurate.
To improve the performance of this model, a better image processing technique is needed, one that can capture the fingers' shapes more faithfully, in order to resolve the current problem of insufficient uniqueness in the prefix and suffix features.
ACKNOWLEDGMENTS
This work is supported by SINAS 2015 Research
Grant #RT-2015-0547, from The Ministry of Research,
Technology and Higher Education of Indonesia. This
support is gratefully received and acknowledged. The authors also wish to thank M. I. Mas for the photographs and final proofreading.
REFERENCES
[1] S. Siswomartono, Cara mudah belajar SIBI (Sistem Isyarat Bahasa Indonesia) [An Easy Way to Learn SIBI], Yayasan Santi Rama, 2007.
[2] E. Rakun, M. Febrian Rachmadi, A. Tjandra, and K. Danniswara, Spectral domain cross correlation function and generalized Learning Vector Quantization for recognizing and classifying Indonesian Sign Language, In IEEE International Conference on Advanced Computer Science and Information Systems (ICACSIS), pages 213–218, Depok, 2012.
[3] E. Rakun, M. I. Fanany, I. W. W. Wisesa, and A. Tjandra, A heuristic Hidden Markov Model to recognize inflectional words in sign system for Indonesian language known as SIBI (Sistem Isyarat Bahasa Indonesia), In IEEE International Conference on Technology, Informatics, Management, Engineering & Environment (TIME-E), pages 53–58, Samosir, 2015.
[4] Kamus Sistem Isyarat Bahasa Indonesia [Dictionary of the Sign Language System for the Indonesian Language], Departemen Pendidikan Nasional, 2001.
[5] S. Kausar and M. Y. Javed, A Survey on Sign Language Recognition, In IEEE Frontiers of Information Technology (FIT), pages 95–98, Islamabad, 2011.
[6] M. Adriani, J. Asian, B. Nazief, S. M. M. Tahaghoghi, and H. E. Williams, Stemming Indonesian: A confix-stripping approach, ACM Transactions on Asian Language Information Processing (TALIP), 6(4):1–33, 2007.
[7] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, 9(8):1735–1780, 1997.
[8] W. Kurniawan and A. Harjoko, Pengenalan Bahasa Isyarat dengan Metode Segmentasi Warna Kulit dan Center of Gravity [Sign Language Recognition Using Skin Color Segmentation and Center of Gravity], Indonesian Journal of Electronics and Instrumentation Systems (IJEIS), 1(2):67–78, 2011.
[9] M. Najiburahman, Simulasi dan Analisis Sistem Penerjemah Bahasa SIBI Menjadi Bahasa Indonesia Menggunakan Metode Klasifikasi Hidden Markov Model [Simulation and Analysis of a SIBI-to-Indonesian Translation System Using Hidden Markov Model Classification], Bachelor's thesis, Universitas Telkom, 2015.
[10] E. Rakun, M. Andriani, I. W. Wiprayoga, K. Danniswara, and A. Tjandra, Combining depth image and skeleton data from Kinect for recognizing words in the sign system for Indonesian language (SIBI [Sistem Isyarat Bahasa Indonesia]), In IEEE International Conference on Advanced Computer Science and Information Systems (ICACSIS), pages 387–392, Bali, 2013.
[11] F. Marcelita, Pengenalan Bahasa Isyarat dari Video Menggunakan Ciri Geometris, K-Means, dan Hidden Markov Model [Sign Language Recognition from Video Using Geometric Features, K-Means, and Hidden Markov Models], Bachelor's thesis, Universitas Telkom, 2008.
[12] T. Starner and A. Pentland, Real-Time American Sign Language Recognition from Video Using Hidden Markov Models, In IEEE International Symposium on Computer Vision, pages 265–270, Coral Gables, FL, 1995.
[13] K. Grobel and M. Assan, Isolated sign language recognition using hidden Markov models, In IEEE International Conference on Systems, Man, and Cybernetics: Computational Cybernetics and Simulation, pages 162–167, Orlando, FL, 1997.
[14] T. Matsuo, Y. Shirai, and N. Shimada, Automatic generation of HMM topology for sign language recognition, In IEEE 19th International Conference on Pattern Recognition (ICPR), pages 1–4, Tampa, FL, 2008.
[15] M. Maebatake, I. Suzuki, M. Nishida, Y. Horiuchi, and S. Kuroiwa, Sign Language Recognition Based on Position and Movement Using Multi-Stream HMM, In Proceedings of the 2nd IEEE International Symposium on Universal Communication (ISUC), pages 478–481, Osaka, 2008.
[16] S. Theodorakis, A. Katsamanis, and P. Maragos, Product-HMMs for automatic sign language recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1601–1604, Taipei, 2009.
[17] C. Vogler and D. Metaxas, Parallel hidden Markov models for American sign language recognition, In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 1, pages 116–122, Kerkyra, 1999.
[18] M. Jebali, P. Dalle, and M. Jemni, HMM-based method to overcome spatiotemporal sign language recognition issues, In IEEE International Conference on Electrical Engineering and Software Applications (ICEESA), pages 1–6, Hammamet, 2013.
[19] L. R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, 77(2):257–286, 1989.
[20] K. Cho, B. van Merrienboer, and D. Bahdanau, On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, 2014.
[21] C. Olah, Understanding LSTM Networks, http://colah.github.io/posts/2015-08-Understanding-LSTMs/, last accessed April 6, 2015.
[22] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, arXiv preprint arXiv:1503.04069, pages 1–10, 2015.
[23] X. Chen, X. Qiu, C. Zhu, P. Liu, and X. Huang, Long Short-Term Memory Neural Networks for Chinese Word Segmentation, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206, Lisbon, Portugal, 2015.
[24] MATLAB, version 7.10.0 (R2010a), The MathWorks Inc., Natick, Massachusetts, 2010.
Received: May 10, 2016. Accepted: -