RESEARCH ARTICLE
Advanced Science Letters, Vol. 4, No. 2, pp. 400–407, 2016
Copyright © 2016 American Scientific Publishers. All rights reserved.
Recognition of Sign Language System for
Indonesian Language using Long Short-Term
Memory Neural Networks
Erdefi Rakun1, Aniati M. Arymurthy1, Lim Y. Stefanus1, Alfan F. Wicaksono1, I Wayan W. Wisesa1
1 Fakultas Ilmu Komputer, Universitas Indonesia, Depok 16424, Jawa Barat, Indonesia
*Email address: efi@cs.ui.ac.id
SIBI (the Sign Language System for the Indonesian Language) is the official sign language system for the Indonesian language. This research aims to find a suitable model for performing SIBI-to-text translation on inflectional word gestures. Extant research has been able to translate the alphabet, root words, and numbers from SIBI to text. Inflectional words are root words with prefixes, infixes, and suffixes, or some combination of the three. A new method that splits an inflectional word into three feature vector sets was developed. This reduces the number of feature sets used, which would otherwise be as large as the product of the prefix, suffix, and root word feature sets of the inflectional word gestures. Long Short-Term Memory (LSTM) is used, as this model can take entire sequences as input and does not have to rely on pre-clustered per-frame data. LSTM suits this system well, as SIBI sequence data has a long-term temporal dependency. The 2-layer LSTM performed the best, being 95.4% accurate with root words. The same model is 77% accurate with inflectional words, using the combined skeleton-image feature set with an 800-epoch limit. The lower accuracy with inflectional words is due to difficulties in recognizing prefixes and suffixes.

Keywords: Inflectional Words, Long Short-Term Memory, Deep Learning, Kinect, sign language, SIBI.
1. INTRODUCTION
SIBI (Sistem Isyarat Bahasa Indonesia, the Sign Language System for the Indonesian Language) is the official method of communication for the hearing-impaired in Indonesia. SIBI is used for communication between the hearing-impaired, as well as between the speech/hearing-impaired and those without the impairment [1]. SIBI is the Indonesian language, complete with its native syntax, represented in gestures. There are four types of SIBI gestures: root word, affix, inflectional, and pronoun gestures. Pronouns are further divided into four categories: personal, possessive, pointer, and conjunctive [1].
Like most other sign languages, SIBI is not trivial to
master, and consequently there is a need for a system to
convert SIBI gestures into a text output. The challenge in
creating a SIBI-to-text translation system is the
recognition of the four different types of gestures in SIBI.
The translation system in progress must eventually be
able to recognize the gestures associated with all the
aforementioned linguistic elements efficiently, quickly,
and accurately.
This research and extant ones have attempted to create the components that make up a SIBI-to-text translation system. The first to be created was a system that translates solely the alphabet and numbers [2]. A system that can translate root words was then created using the acquired knowledge. Previous work [3] and this research focus on creating a system that can translate
inflectional words. Inflectional words are root words
combined with prefixes, suffixes, infixes, or a mix of
some of the three. These additional morphemes and
inflections add extra information to the root word, as well
as ensuring that the resulting inflectional words serve a
particular grammatical or logical purpose. As an
illustration, adding the prefix "me-", and the suffix "-i" to
the root word "lempar" (to throw), results in the word
"melempari" (to throw at). This is a different result than
when "me-" and "-kan" are added to "lempar," which
results in the inflectional word "melemparkan" (to throw
with). The subject of the latter is what is being thrown,
and the subject of the former is the target of the throw.
There are no specific gestures for inflectional words
in SIBI, but the gestures that represent them are
constructed modularly. For instance, there is no single
gesture for “berlompatan” (= to jump about); instead,
three gestures are used: the gesture for the prefix “ber-” +
the gesture for the root word “lompat” (=jump) + the
gesture for the suffix “-an,” as shown in Fig. 1. This method of gesture concatenation is unique to SIBI [4].

Fig. 1. Gesture representing the inflectional word “berlompatan” (= to jump about) [4]
The translation system needs to employ a feature extraction technique that yields a minimally-sized feature vector set, yet is still capable of recognizing different types of gestures [5]. There are 7 prefixes, 11 suffixes, and thousands of root words in the Indonesian language. To construct an inflectional word, one can concatenate a maximum of 3 prefixes and 2 suffixes [6]. The number of possible combinations is in the thousands, so in order to recognize all existing inflectional words, the SIBI translation system would need a correspondingly large number of feature vector sets.
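To make the scale concrete, here is a back-of-the-envelope comparison of a per-word feature set versus the component-based approach described next (the root-word count is an assumed order of magnitude, for illustration only):

```python
# Rough comparison of feature-set counts (illustrative numbers only).
n_prefixes, n_suffixes = 7, 11
n_roots = 3000  # assumption: Indonesian has thousands of root words

# One feature vector set per whole inflectional word (prefix + root + suffix):
whole_word_sets = n_prefixes * n_roots * n_suffixes

# Component-based approach: one set each for prefixes, root words, suffixes.
component_sets = n_prefixes + n_roots + n_suffixes

print(whole_word_sets)   # 231000
print(component_sets)    # 3018
```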
In this research, to reduce the number of required feature vector sets, the gesture of an inflectional word is separated into, and recognized by, its components (Fig. 2).

Fig. 2. The possible components of an inflectional word gesture: a frame sequence in which epenthesis gestures are interleaved with the prefix, root word, and suffix gestures.

With this component recognition capability, the translation system uses only three feature vector sets: one for prefix gestures, one for root word gestures, and one for suffix gestures. This is much smaller than generating a separate feature vector set for each inflectional word in the Indonesian language. This separation method greatly reduces the computation time required to interpret inflectional word gestures and improves the efficiency of the translation system.
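Once the three components have been recognized, assembling the text output is a simple concatenation of the component labels; a toy sketch (the function and label spellings are illustrative):

```python
def assemble_word(prefix, root, suffix):
    """Concatenate recognized component labels into the output word."""
    return "".join(part for part in (prefix, root, suffix) if part)

# The example from the text: "ber-" + "lompat" + "-an" -> "berlompatan".
print(assemble_word("ber", "lompat", "an"))   # berlompatan
print(assemble_word("me", "lempar", "i"))     # melempari
```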
In addition to being able to recognize inflectional
words, the model developed by this research can be
adjusted to recognize other types of words in SIBI, such as root words, pronouns, conjunctions, repeated words (for example, "kemerah-merahan", reddish), and compound words (for example, "bertanggung jawab", to be responsible for). This adjustability is the main contribution of this research in the realm of SIBI gesture-to-text translation.
This research attempts to use Long Short-Term
Memory (LSTM [7]) neural networks, due to their ability to take advantage of the long-term temporal dependencies between the frames of a SIBI sequence to improve the model's predictions.
2. RELATED WORKS
Only a few studies in Indonesia focus on translating gestures to text. Kurniawan [8] and Rakun [2] translated the alphabet to text, while Najiburahman [9], Rakun [10], and Marcelita [11] focused on translating root words, not complex inflectional words. Najiburahman [9] categorized words in SIBI into words with one, two, or three gestures, while this research uses image and skeleton features to correctly identify any word expressed in SIBI, irrespective of the number of gestures that make up the word.
Much related research has been done outside Indonesia. Most of it attempts to recognize a particular sign language using Hidden Markov Models (HMM) [12]-[18]. This group of extant research is collectively unsuited for the recognition of SIBI's inflectional words.
In previous research, we tried dividing the inflectional word gestures into the following subsequences: prefix, root word, and suffix. These subsequences were then classified by a heuristic Hidden Markov Model (HMM), with a best accuracy of 67.77% [3]. One shortcoming of using HMM is that a K-Means pre-clustered feature set had to be used as the HMM's observed variable [19]. This pre-clustering scheme results in some loss of information from the extracted feature set. It is highly likely that the accuracy will improve if a model that can use features that are not pre-clustered is employed.
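For concreteness, the pre-clustering step in question looks roughly like the following scikit-learn sketch (the cluster count and feature dimensionality are illustrative assumptions); each frame's feature vector is collapsed to a single discrete symbol, which is where the information loss occurs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Per-frame feature vectors from the training set (placeholder data;
# the dimensions are illustrative assumptions).
frames = np.random.rand(5000, 16)

kmeans = KMeans(n_clusters=64, random_state=0).fit(frames)

# Each frame is reduced to a single discrete symbol to serve as the HMM's
# observed variable; the within-cluster variation of the features is lost.
observed_symbols = kmeans.predict(frames)
print(observed_symbols[:10])
```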
HMM's architecture dictates that each frame be assigned a label, which is clearly not the ideal way to
classify the sequences. The frame-labeling method results
in one sequence having multiple labels, making it difficult
to ascertain the validity of the error figure returned by the
model. HMM's classification is done per frame, a smaller unit than the unit the model is actually required to classify, namely the components of inflectional words, each of which is represented by a whole sequence in the data.
In the interest of overcoming HMM's apparent
limitations, LSTM neural networks were used in this
research. The unique ability of LSTM is its capacity to decide whether to store, ignore, or forget information. This stems from its four interacting neural-network layers acting as gating units [7],[20],[21]. Because of this ability, LSTM is considered well suited to speech and handwriting recognition and polyphonic music modeling [22], as well as Chinese word segmentation [23].
These tasks bear some similarities to the objective of this
research, as for instance with handwriting recognition, a
small sequence of a pen's strokes is relatively meaningless
unless the model correctly identifies the preceding
sequence of strokes. In short, highly accurate execution of
any of the above would be difficult if their respective
models cannot take advantage of their data's temporal
dependencies.
3. SCOPE OF RESEARCH
The prefixes and suffixes examined in this research are all the prefixes, all the suffixes, and the prefix + root word + suffix combinations in the Indonesian language's grammatical structure [6]. Inflectional word gestures in this research are defined as:
One prefix gesture + root word gesture, or
Root word gesture + one suffix gesture, or
One prefix gesture + root word gesture + one suffix
gesture.
SIBI categorizes its gestures into two groups: primary
gestures and supporting gestures. Primary gestures
determine the meaning of the gesture, whereas supporting
gestures give additional meaning to the gesture. An example of a supporting gesture is a facial expression added to the gesture, fulfilling a role similar to that of intonation in speech. This research focuses on primary
gestures performed by both hands as well as the finger
movements of both hands.
4. EXPERIMENTS
4.1. DATASETS
This experiment uses 19 root words and 144 inflectional words, each recorded 10 times, for a total of 1,630 sequences. The gestures were performed by 2 teachers from the Santi Rama School for the hearing-impaired in Jakarta. The sequences are broken down into subsequences of prefixes, root words, and suffixes. The training data set consists of 457 root word, 290 prefix, and 238 suffix subsequences, for a total of 985 subsequences. The testing data set's proportions are 416:264:213, for a total of 893 subsequences.
4.2. FEATURES
This study uses Microsoft Kinect to record the SIBI
gestures. The output from the depth sensor and the
skeleton tracking output from the Kinect are used to
obtain the experimental features. The movement of the tracked skeleton is computed from the angles between pairs of joints in the skeleton, shown in Fig. 3. The shoulder-center joint is used as the origin, and the angles of interest are those formed between the origin and the elbow, and between the origin and the hand. Each of these angles is described in terms of two angular components, θ and φ. The sequences of these angles represent the movement of the combined arm-and-hand shape from frame to frame.

Fig. 3. Joints from skeleton tracking generated by Kinect
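As an illustration of how such joint angles can be computed, here is a sketch assuming θ and φ denote the azimuth and elevation of a joint's direction vector from the origin joint (the exact definitions used in the paper are not reproduced here):

```python
import numpy as np

def joint_angles(origin, joint):
    """Angles of a joint's direction vector relative to an origin joint.

    Interpreting theta/phi as azimuth and elevation is one plausible reading
    of the paper's two angle components; the original definitions are assumed.
    """
    x, y, z = np.asarray(joint, dtype=float) - np.asarray(origin, dtype=float)
    theta = np.arctan2(y, x)             # azimuth in the x-y plane
    phi = np.arctan2(z, np.hypot(x, y))  # elevation above the x-y plane
    return theta, phi

# Example: the right hand relative to the shoulder-center joint.
print(joint_angles((0.0, 0.0, 1.0), (0.3, 0.2, 1.4)))
```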
We used MATLAB's regionprops function [24] to extract the image-based features. This function measures the region properties in an image and returns them as a structured array. Each frame's depth image is transformed into a binary depth matrix, which is then used as the parameter for the regionprops function.
The array returned by this function may contain more than one object, depending on the object(s) or region(s) found within the image. The largest region is assumed to be the hand-blob area, and this area is selected as the extraction region. The features chosen from the extraction region are: area, centroid, major and minor axes, orientation, and normalized convex hull. These five features sufficiently define the hand's position and shape. The area helps the model recognize different hand shapes; the major and minor axes help define the perimeter of the hand blob. The orientation feature defines the orientation of the hand in each
frame; the convex hull represents the smallest convex polygon that envelops the extraction region. A normalized convex hull is an alternate representation of the convex hull, in which the coordinates of the convex hull's vertices are described relative to the extraction region's centroid, as opposed to a static origin.
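An equivalent extraction can be sketched in Python, with scikit-image and SciPy standing in for MATLAB's regionprops (the function and output layout below are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.spatial import ConvexHull
from skimage.measure import label, regionprops

def hand_blob_features(binary_depth):
    """Extract the image-based features from one binary depth frame."""
    regions = regionprops(label(binary_depth))
    hand = max(regions, key=lambda r: r.area)   # largest region = hand blob

    # Normalized convex hull: hull vertices relative to the region centroid.
    hull = ConvexHull(hand.coords)
    hull_pts = hand.coords[hull.vertices] - np.asarray(hand.centroid)

    return {
        "area": hand.area,
        "centroid": hand.centroid,
        "major_axis": hand.major_axis_length,
        "minor_axis": hand.minor_axis_length,
        "orientation": hand.orientation,
        "convex_hull": hull_pts,
    }
```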
4.3. OUTLINE OF EXPERIMENTS
The whole experiment is outlined in Fig. 4. Our inputs consist of skeleton tracking data and frame video (depth image) data captured by the Kinect. The feature extraction process yields skeleton point angles, depth image properties, and combined skeleton-image feature data.

Fig. 4. Experiment flow
The raw data contains epentheses, which are transitional gestures. They do not carry any information by themselves, since they are present simply as links between the other gestures. During preprocessing, the epentheses are cut out of the data, so that what remains are only the gestures of each inflectional word's components, namely the prefix, the root word, and the suffix.
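A sketch of this preprocessing step, assuming each frame carries a per-frame component annotation (the label names and data layout are assumptions for illustration):

```python
def split_components(frames, labels):
    """Drop epenthesis frames and group the rest into labeled subsequences.

    `labels` is a per-frame annotation such as "epenthesis", "prefix",
    "root", or "suffix" (an assumed annotation format).
    """
    segments, current_label, current = [], None, []
    for frame, lab in zip(frames, labels):
        if lab == "epenthesis":              # transitional frames are cut out
            if current:
                segments.append((current_label, current))
                current_label, current = None, []
            continue
        if lab != current_label and current: # component boundary
            segments.append((current_label, current))
            current = []
        current_label = lab
        current.append(frame)
    if current:
        segments.append((current_label, current))
    return segments  # e.g. [("prefix", [...]), ("root", [...]), ("suffix", [...])]
```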
The average sequence length of the preprocessed data set is then calculated; it turns out to be 32 frames. All sequences are then homogenized to that length. The entire data set is then split into training and testing data.
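Homogenization amounts to truncating longer sequences and padding shorter ones; a minimal sketch with the Keras utility (the post-padding/post-truncation policy is an assumption):

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

AVG_LEN = 32  # average sequence length of the preprocessed data set

# Dummy stand-ins for the extracted per-frame feature sequences.
sequences = [np.random.rand(n, 16) for n in (45, 20, 32)]

homogenized = pad_sequences(sequences, maxlen=AVG_LEN, dtype="float32",
                            padding="post", truncating="post")
print(homogenized.shape)  # (3, 32, 16)
```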
The skeleton, image, and combined data sets are then fed into the LSTM model. The parameters measured for evaluation purposes are the accuracy on both the training and testing data sets, as well as the time required to perform training and testing. The experiment uses an LSTM implementation written in Python 2.7 with the Keras and Theano libraries. The experiments were run on an i7-equipped PC.
4.4. MODELS
Fig. 5 shows the 1-layer and 2-layer LSTM architectures used in this experiment. In this LSTM implementation, one label is assigned to one whole sequence, as opposed to labeling each frame. x_n denotes the feature vector of the n-th frame. A more detailed visualization of the internals of each LSTM block is shown in Fig. 6.

Fig. 5. Block diagram of (left) 1-layer LSTM and (right) 2-layer LSTM
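A minimal Keras sketch of the 2-layer variant follows; the hidden sizes, optimizer, and class count are assumptions for illustration (19 root words + 7 prefixes + 11 suffixes would give 37 component labels), not the paper's exact hyperparameters:

```python
from keras.models import Sequential
from keras.layers import Dense, LSTM

N_FRAMES = 32     # homogenized sequence length
N_FEATURES = 16   # assumed size of the per-frame combined feature vector
N_CLASSES = 37    # assumed: 7 prefixes + 19 root words + 11 suffixes

model = Sequential()
model.add(LSTM(128, return_sequences=True,
               input_shape=(N_FRAMES, N_FEATURES)))   # first LSTM layer
model.add(LSTM(128))              # second layer; emits one vector per sequence
model.add(Dense(N_CLASSES, activation="softmax"))     # one label per sequence
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Training with an epoch limit as in Table 1, e.g.:
# model.fit(x_train, y_train, epochs=800, validation_data=(x_test, y_test))
```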
Using a sigmoid function σ, the forget gate discards irrelevant information from the previous cell state C_{t-1}. Then the input gate layer, using a sigmoid function σ, and a separate tanh layer cooperate to calculate the new value of the current cell state C_t. The fourth layer determines the value of the output h_t. The sigmoid function σ on this final layer determines which part of C_t is carried through into h_t, and the tanh function in this final layer ensures that the value of h_t lies between -1 and 1. Using both of these functions, h_t contains only the relevant portion of C_t [21].

Fig. 6. Four interacting layers of LSTM [21]
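In the standard notation of [7] and [21], the gate computations just described are:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) &&\text{(input gate)}\\
\tilde{C}_t &= \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) &&\text{(candidate state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{(new cell state)}\\
o_t &= \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) &&\text{(output)}
\end{aligned}
```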
5. RESULTS
The independent variables here are the epoch limit (denoted nE), the feature type (skeleton / SKL, image / IMG, combined / CMB), and the model type (1- and 2-layer LSTM, denoted L1 and L2). The dependent variables are the training and testing times, and the accuracy of the predictions on the testing data. Table 1 shows how each model's accuracy varies with the feature type and with increasing epoch limits, using the testing data set.
Table 1. Inflectional word prediction accuracy for the different types of features

nE     CMB-L1  CMB-L2  SKL-L1  SKL-L2  IMG-L1  IMG-L2
100    0.559   0.677   0.594   0.624   0.479   0.677
200    0.560   0.736   0.637   0.686   0.533   0.710
300    0.601   0.727   0.653   0.709   0.501   0.703
400    0.604   0.721   0.646   0.730   0.498   0.709
500    0.577   0.753   0.646   0.723   0.516   0.736
600    0.587   0.766   0.665   0.709   0.504   0.718
700    0.591   0.739   0.661   0.712   0.503   0.714
800    0.580   0.770   0.675   0.747   0.494   0.709
900    0.615   0.747   0.673   0.705   0.508   0.705
1000   0.598   0.759   0.676   0.703   0.493   0.727
From these results, it can be concluded that the model is most accurate when using the combined feature data set, as opposed to the individual feature data sets. This was achieved with an 800-epoch limit for the 2-layer LSTM, with an accuracy of 77%.
The time required by the 2-layer LSTM for training or testing is nearly twice that required by the 1-layer LSTM. However, this increase is offset by the gain in accuracy when using 2 layers. Using a combined feature set and an 800-epoch limit, the 2-layer LSTM is 77% accurate, whereas the 1-layer LSTM is only 58% accurate.
6. ANALYSIS
The results of the prediction are summarized in the confusion matrix shown in Fig. 7. This matrix shows that the errors occur mainly with prefixes and suffixes. Consequently, further experiments were done to shed more light on this issue.
6.1. ANALYSIS FROM THE DATA AND FEATURES POINT OF VIEW
In order to understand the experimental results better, the LSTM was then run on the individual groups (prefixes, root words, and suffixes). An additional process was implemented to separate both the training and testing data into three different groups. This also changes the average sequence length, depending on which type a sequence belongs to. The resulting average lengths for prefixes, root words, and suffixes are 30, 37, and 22 frames, respectively.
The best result for each group of data can be seen in Table 2. From this table, it can be concluded that the model is again most accurate when using the combined feature data set. This was achieved with the 2-layer LSTM, with an accuracy of 95.4% when tested on root words only. The lowest accuracies were attained for prefixes and suffixes; their confusion matrices are shown in Fig. 8 and Fig. 9, respectively.

Table 2. Best prediction result for the inflectional, root word, prefix, and suffix groups

Group          Avg. sequence length (frames)   nE    Accuracy
Inflectional   32                              800   0.770
Root           37                              900   0.954
Prefix         30                              700   0.667
Suffix         22                              800   0.690

Fig. 7. Confusion matrix for the prediction results on the testing data
Fig. 8. Confusion matrix for the prefixes
Fig. 9. Confusion matrix for the suffixes
This happens mainly because of the way prefixes are expressed in SIBI. A prefix is expressed by having the right hand form the first letter of the prefix, with the left palm facing to the right, and moving both hands to meet in the middle. The issue is that the left hand's orientation is the same for every prefix, which reduces the uniqueness of a prefix's features. This is particularly true within the prefix group (me-, te-, and se-), as seen in Fig. 10.

Fig. 10. L-R: gestures for the prefixes me-, te-, and se-
The suffixes also suffer from a general lack of feature uniqueness. In SIBI, after the root word has been expressed, the suffix is expressed by having the right hand form the first letter of the suffix. The right hand then moves from the final position of the root word's gesture (in front of the chest) towards the right hip in a downward arc. The first letters of the suffixes have very similar xy-plane projections to begin with, and the downward arc is similar in every suffix. The left hand remains stationary in all suffix gestures. All of the above means that the skeleton data for the suffixes have a large degree of commonality, so the model has to rely solely on the right hand's image data. This problem proves especially difficult to solve among the suffix group -kah, -kan, and -lah, as shown in Fig. 11.

Fig. 11. L-R: gestures for the suffixes -kah, -kan, and -lah
7. CONCLUSIONS AND FUTURE WORKS
Using LSTM results in a higher accuracy, from 66.7% with HMM to 77.4% with the 2-layer LSTM. The remaining errors are mostly due to misidentification of prefixes and suffixes. These errors occur because of the lack of feature uniqueness in the gestures of the error-causing prefixes and suffixes.
Experiments using the skeleton, image, and combined feature sets reveal that the model performs best with the combined features. This is because the relevant movements of the arm and hand, as well as the finger shapes, are best captured by the combined feature set.
When tasked with identifying only the root words, the 2-layer LSTM is 95.4% accurate. This means that the 2-layer LSTM works well if the gestures to be identified are sufficiently unique relative to each other. It is also worth noting that root word sequences are 37 frames long on average, longer than the prefix or suffix sequences. Further investigation is needed to study how the LSTM's accuracy varies with sequence length.
The time required for training and testing the 2-layer LSTM is double that of the 1-layer LSTM, but the 2-layer LSTM is more accurate. Using a combined feature set and an 800-epoch limit, the 2-layer LSTM is 77% accurate, whereas the 1-layer LSTM is only 58% accurate.
To improve the performance of this model, a better image processing technique is needed, one that can capture the fingers' shapes more faithfully, in order to resolve the current problem of insufficient uniqueness in the prefix and suffix features.
ACKNOWLEDGMENTS
This work is supported by SINAS 2015 Research
Grant #RT-2015-0547, from The Ministry of Research,
Technology and Higher Education of Indonesia. This
support is gratefully received and acknowledged. The authors also wish to thank M. I. Mas for the photographs and final proofreading.
REFERENCES
[1] S. Siswomartono, Cara mudah belajar SIBI (Sistem Isyarat Bahasa Indonesia) [An Easy Way to Learn SIBI], Yayasan Santi Rama, 2007.
[2] E. Rakun, M. Febrian Rachmadi, A. Tjandra, and K. Danniswara, Spectral domain cross correlation function and generalized Learning Vector Quantization for recognizing and classifying Indonesian Sign Language, In IEEE International Conference on Advanced Computer Science and Information Systems (ICACSIS), pages 213–218, Depok, 2012.
[3] E. Rakun, M. I. Fanany, I. W. W. Wisesa, and A. Tjandra, A heuristic Hidden Markov Model to recognize inflectional words in sign system for Indonesian language known as SIBI (Sistem Isyarat Bahasa Indonesia), In IEEE International Conference on Technology, Informatics, Management, Engineering & Environment (TIME-E), pages 53–58, Samosir, 2015.
[4] Kamus Sistem Isyarat Bahasa Indonesia [Dictionary of the Sign Language System for the Indonesian Language], Departemen Pendidikan Nasional, 2001.
[5] S. Kausar and M. Y. Javed, A Survey on Sign Language Recognition, In IEEE Frontiers of Information Technology (FIT), pages 95–98, Islamabad, 2011.
[6] M. Adriani, J. Asian, B. Nazief, S. M. M. Tahaghoghi, and H. E. Williams, Stemming Indonesian: A confix-stripping approach, ACM Transactions on Asian Language Information Processing (TALIP), 6(4):1–33, 2007.
[7] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, 9(8):1735–1780, 1997.
[8] W. Kurniawan and A. Harjoko, Pengenalan Bahasa Isyarat dengan Metode Segmentasi Warna Kulit dan Center of Gravity [Sign Language Recognition Using Skin Color Segmentation and Center of Gravity], Indonesian Journal of Electronics and Instrumentation Systems (IJEIS), 1(2):67–78, 2011.
[9] M. Najiburahman, Simulasi dan Analisis Sistem Penerjemah Bahasa SIBI Menjadi Bahasa Indonesia Menggunakan Metode Klasifikasi Hidden Markov Model [Simulation and Analysis of a SIBI-to-Indonesian Translation System Using Hidden Markov Model Classification], Bachelor's thesis, Universitas Telkom, 2015.
[10] E. Rakun, M. Andriani, I. W. Wiprayoga, K. Danniswara, and A. Tjandra, Combining depth image and skeleton data from Kinect for recognizing words in the sign system for Indonesian language (SIBI [Sistem Isyarat Bahasa Indonesia]), In IEEE International Conference on Advanced Computer Science and Information Systems (ICACSIS), pages 387–392, Bali, 2013.
[11] F. Marcelita, Pengenalan Bahasa Isyarat dari Video Menggunakan Ciri Geometris, K-Means, dan Hidden Markov Model [Sign Language Recognition from Video Using Geometric Features, K-Means, and Hidden Markov Models], Bachelor's thesis, Universitas Telkom, 2008.
[12] T. Starner and A. Pentland, Real-Time American Sign Language Recognition from Video Using Hidden Markov Models, In IEEE International Symposium on Computer Vision, pages 265–270, Coral Gables, FL, 1995.
[13] K. Grobel and M. Assan, Isolated sign language recognition using hidden Markov models, In IEEE International Conference on Systems, Man, and Cybernetics: Computational Cybernetics and Simulation, pages 162–167, Orlando, FL, 1997.
[14] T. Matsuo, Y. Shirai, and N. Shimada, Automatic generation of HMM topology for sign language recognition, In IEEE 19th International Conference on Pattern Recognition (ICPR), pages 1–4, Tampa, FL, 2008.
[15] M. Maebatake, I. Suzuki, M. Nishida, Y. Horiuchi, and S. Kuroiwa, Sign Language Recognition Based on Position and Movement Using Multi-Stream HMM, In Proceedings of the 2nd IEEE International Symposium on Universal Communication (ISUC), pages 478–481, Osaka, 2008.
[16] S. Theodorakis, A. Katsamanis, and P. Maragos, Product-HMMs for automatic sign language recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1601–1604, Taipei, 2009.
[17] C. Vogler and D. Metaxas, Parallel hidden Markov models for American sign language recognition, In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 1, pages 116–122, Kerkyra, 1999.
[18] M. Jebali, P. Dalle, and M. Jemni, HMM-based method to overcome spatiotemporal sign language recognition issues, In IEEE International Conference on Electrical Engineering and Software Applications (ICEESA), pages 1–6, Hammamet, 2013.
[19] L. R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, 77(2):257–286, 1989.
[20] K. Cho, B. van Merrienboer, and D. Bahdanau, On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, 2014.
[21] C. Olah, Understanding LSTM Networks, http://colah.github.io/posts/2015-08-Understanding-LSTMs/, last accessed April 6, 2015.
[22] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, arXiv preprint arXiv:1503.04069, pages 1–10, 2015.
[23] X. Chen, X. Qiu, C. Zhu, P. Liu, and X. Huang, Long Short-Term Memory Neural Networks for Chinese Word Segmentation, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206, Lisbon, Portugal, 2015.
[24] MATLAB, version 7.10.0 (R2010a), The MathWorks Inc., Natick, Massachusetts, 2010.
Received: May 10, 2016. Accepted: -