978-1-5386-9422-0/18/$31.00 ©2018 IEEE
The 1st 2018 INAPR International Conference, 7 Sept 2018, Jakarta, Indonesia
Sentence Level Indonesian Sign Language
Recognition Using 3D Convolutional Neural
Network and Bidirectional Recurrent Neural
Network
Meita Chandra Ariesta
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
meita.ariesta@binus.ac.id

Fanny Wiryana
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
fanny.wiryana@binus.ac.id

Suharjito
Computer Science Department, BINUS Graduate Program – Master of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
suharjito@binus.edu

Amalia Zahra
Computer Science Department, BINUS Graduate Program – Master of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
amalia.zahra@binus.edu
Abstract—Sign Language Recognition (SLR) is a challenging research field with ample room for improvement. In this research, we propose sentence-level SLR using a deep learning method that combines a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (Bi-RNN). Specifically, a 3D CNN extracts features from each video frame, and a bidirectional RNN extracts features from the sequential behavior of the frames, from which a candidate sentence is generated. This paper offers two key contributions. The first is our proposed dataset of Indonesian Sign Language (SIBI), comprising 30 sentences in SIBI. The second is our novel approach of applying deep learning with the Connectionist Temporal Classification (CTC) loss function to sentence-level SLR. The results show that, among the configurations tested, Hyperparameter 1 achieves the best result. This research also found that a deeper network does not necessarily guarantee better results, and that the size of the dataset affects the performance of the system.
Keywords— sign language recognition, deep learning, CNN,
RNN, CTC loss, SIBI
I. INTRODUCTION
There has always been a communication gap between the hearing-impaired community and hearing people who do not speak sign language. Sign language consists of visible gestures, made mostly with movements of the hands, used primarily by hearing-impaired people to communicate. The official sign language of Indonesia is called Sistem Isyarat Bahasa Indonesia (SIBI). SIBI is mostly adopted from American Sign Language (ASL), follows the grammatical structure of Bahasa Indonesia, the official language of Indonesia, and is equipped with affixes such as me-, ber-, di-, ke-, pe-, ter-, and se- [1]. Although sign language is used daily by the hearing-impaired community, the rest of the population has limited knowledge of it. This creates a communication barrier and fosters discrimination in society.
Sign Language Recognition (SLR) systems are widely developed to bridge this gap. Numerous studies have explored various methods to achieve state-of-the-art results. Various input devices have also been experimented with, such as data gloves, the Leap Motion Controller, vision-based approaches, and the Kinect. Due to its limited applicability outside a laboratory, the unnatural experience it offers, and its high cost, the data-glove approach has been left behind. A more promising approach is the vision-based approach, because it can be applied directly to users and deployed in any application that has a camera, such as phone assistants and smart home interactions [2]. However, vision-based SLR is not an easy task, because sign language is delivered in various ways (hand shapes, position, orientation, and movement), and it is hard to extract information from such features [3,4].
Several feature extraction methods have been explored, yet the challenge remains, because an SLR system should work in natural environments outside a laboratory to be useful. These are the challenges researchers have tried to overcome over the years in developing SLR systems. Lighting, complex backgrounds, poses, and noise may also affect the performance of an SLR system. Another challenge is determining the start and end points of meaningful gestures (gesture segmentation) [5]. To overcome these problems, Huang et al. [3] proposed a 3D Convolutional Neural Network (CNN) to extract spatiotemporal features from videos obtained from a Kinect as the input device. The implementation of deep learning in vision-based SLR systems has helped overcome several challenges. Previously, several other studies on vision-based recognition using deep learning have been conducted. Deep learning has driven major advances in vision-based recognition and has produced satisfactory results in image and object recognition [6,7], image and object description [8], human behavior recognition [9,10], lip-reading [11], and especially SLR [3,12].
This paper proposes sentence-level sign language recognition using a deep learning method that combines a CNN and a Bidirectional Recurrent Neural Network (Bi-RNN). Specifically, a 3D CNN extracts features from each video frame, and a bidirectional RNN extracts features from the sequential behavior of the frames, from which a candidate sentence is generated. This paper offers two key contributions. The first is our proposed novel SIBI dataset, comprising 3,006 videos of 30 sentences in SIBI. The second is our novel approach of applying deep learning with the Connectionist Temporal Classification (CTC) loss function to sentence-level sign language recognition.
The remainder of this paper is structured as follows. Section II elaborates on related work conducted by other researchers. Section III describes the methodology applied in this research, followed by the proposed models, which are described in Section IV. Sections V and VI present the model training and the results, respectively. Finally, conclusions and future work are presented in Section VII.
II. RELATED WORKS
Several input devices have been exploited in Sign Language Recognition (SLR) systems. Data gloves, vision-based approaches, the Leap Motion Controller (LMC), and the Kinect are some of the input devices used for sign acquisition. The data-glove-based method is a relatively old data acquisition method for gesture recognition. It uses a glove, connected to a computer, with sensors that detect the movement and changes of the user's hands and fingers [13–16]. Due to its limited applicability outside a laboratory, the unnatural experience it offers, and its high cost, the data-glove-based approach has been left behind.
The LMC converts signals into computer commands and has been employed in a number of gesture recognition systems [17–20]. However, the LMC is less applicable in everyday life due to its absence from common gadgets. The Microsoft Kinect has also been widely used in gesture recognition [21–27]. Its strength lies in its capability to capture motion and convert it into usable features using its built-in 3D sensory camera. Several studies recommend the Microsoft Kinect for SLR systems.
In vision-based approaches, colored gloves are often used to help hand segmentation [28,29]. Hand segmentation is the process of separating the hands and other features from the rest of an image, and it is one of the many challenges in gesture recognition. Zhang et al. [28] employed colored gloves to support hand segmentation, along with a pupil-detection algorithm that uses the pupils as reference points to assist the process. The Canny Edge Detector is another method for hand segmentation; it detects the edges of hands in an image with a low error rate and performs well at edge detection [30–32]. The Elliptical Fourier Descriptor, designed to extract shape outlines, has also been employed for hand segmentation [33]. Skin detection identifies and isolates skin areas from the rest of an image for hand segmentation [34,35], and has been employed along with hand motion tracking to produce more accurate results [36]. Similarly, colored gloves give the hands a distinctive feature that assists the hand segmentation process [28,37,38]. On the other hand, determining the start and end points of meaningful gestures (gesture segmentation) is a challenge in SLR [5]. One way to address it is to implement a 3D Convolutional Neural Network (CNN) to extract spatiotemporal features from videos obtained from a Kinect [3]; 3D CNNs have also been used to extract features in lip-reading [11] and hand detection [39].
In the development of SLR systems, the Hidden Markov Model (HMM) has been widely exploited, in both glove-based [23,40,41] and vision-based [29,42–44] approaches. To speed up the recognition process, the Tied-Mixture Density Hidden Markov Model (TMDHMM) can be applied without significantly reducing system accuracy; it has been shown to recognize up to 92.5% of frequently used Chinese Sign Language (CSL).
However, HMMs have difficulty handling noisy data [45]. Therefore, Kaluri et al. [44] proposed applying a Wiener filter to eliminate noise in the images and an adaptive histogram technique to segment them, feeding the output into an HMM for training and recognition.
An earlier study on SIBI recognition [46] used a classification method called Generalized Learning Vector Quantization (GLVQ) to recognize the letters A–Z and 10 digits, using a Microsoft Kinect as the input device. Another work [27] recognized 10 words using GLVQ and Random Forest.
Deep learning methods have been widely used in computer vision, for example in object recognition. A deep learning CNN and RNN were used in [47] to classify images into 51 classes. Another work [48] derived textual descriptions from a given image using a CNN and a bidirectional RNN. The work in [3] also used a deep learning method, a 3D CNN, to recognize 25 words performed by nine people. However, research on sign language recognition using deep learning is still limited.
III. METHODOLOGY
This section describes the preparation of the study prior to the experiments. It covers Sistem Isyarat Bahasa Indonesia (SIBI), the data, and the translation of the dataset.
A. Sistem Isyarat Bahasa Indonesia (SIBI)
Sistem Isyarat Bahasa Indonesia (SIBI) is the official sign language system approved by the Indonesian government. The Ministry of Education and Culture of Indonesia published the first SIBI dictionary in 1994 [27]. The government promotes the use of SIBI to overcome the diversity of sign languages used throughout the country. 80% of SIBI is adopted from American Sign Language (ASL), while the remaining 20% is normalized from Bahasa Indonesia. SIBI is taught in schools for children with special needs as a means of communication.
The vocabulary of SIBI is mostly adopted from ASL and
follows the grammatical structure of Bahasa Indonesia, the
official language of Indonesia, and is equipped with affixes
(me-, ber-, di-, ke-, pe-, ter-, and se-) [1]. SIBI can therefore be regarded as a sign language normalized and standardized based on Bahasa Indonesia. Representing one word in Bahasa Indonesia may require up to five signs [49]. For example, the word “perjalanan” consists of the prefix “per-”, the root “jalan”, and the suffix “-an”; the root and each affix has its own SIBI representation, each expressed with one sign, and the signs are performed continuously.
B. Data
The data consisted of 3,006 videos of 30 sentences in SIBI. The data were collected at Santi Rama, a school for children with special needs in Jakarta, Indonesia; Santi Rama is one of the founding members of SIBI. Ten teachers volunteered to perform the 30 sentences in SIBI, with each recording repeated 10 times. Errors occurred during recording, both human and device errors, resulting in an uneven number of videos per sentence. The collected data were then reviewed one by one for validity checking.
Before the data were used for training, the collected videos were preprocessed into sequences of images, and the images were resized to 100 x 50 pixels. Lastly, because the videos vary in length and thus yield different numbers of extracted frames, we padded every extracted image sequence to a fixed length of 270 frames: for videos with fewer than 270 frames, padding was applied to the remaining frames.
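The padding step above can be sketched as follows; this is a minimal numpy sketch under our assumptions about array layout (frames stored as T x H x W x C), and the function name `pad_frames` is illustrative, not from the paper.

```python
import numpy as np

def pad_frames(frames, target_len=270):
    """Pad a frame sequence with zero frames up to target_len.

    frames: array of shape (T, H, W, C) with T <= target_len.
    Returns an array of shape (target_len, H, W, C): the original
    frames first, zero padding for the remaining positions.
    """
    t, h, w, c = frames.shape
    padded = np.zeros((target_len, h, w, c), dtype=frames.dtype)
    padded[:t] = frames
    return padded

# e.g. a 180-frame clip resized to 100 x 50 pixels (W x H), 3 channels
clip = np.random.rand(180, 50, 100, 3).astype(np.float32)
out = pad_frames(clip)
```

Videos that already reach 270 frames pass through unchanged except for dtype-preserving copy into the fixed-size buffer.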
The data was distributed into three groups: training,
testing, and validation. The data distribution can be seen in
Table 1.
TABLE I. DATA DISTRIBUTION

Training       60%
Validation     20%
Testing        20%
TOTAL         100%
C. Translation of Dataset
To increase the variance of the collected dataset, a translation using an affine transformation was applied. The translation slides the pixels horizontally (left, right) or vertically (up, down): each horizontal translation shifts the image by 105 pixels and each vertical translation by 35 pixels. The translation was applied in eight directions: right, left, up, down, right-up, right-down, left-up, and left-down. After translation, the total dataset consisted of 27,054 videos.
IV. PROPOSED MODEL
This section elaborates on our proposed model, which comprises the Convolutional Neural Network (CNN), activation function, dropout, softmax, Recurrent Neural Network (RNN), bidirectional RNN, Gated Recurrent Unit (GRU), and Connectionist Temporal Classification (CTC) loss.
A. Convolutional Neural Network (CNN)
‘Convolution Network’ itself, refers to a network with
mathematical operation named convolution. Convolution is a
special type of linear operation. Convolutional Neural
Network (CNN), straightforwardly, is a neural network that
uses convolution, at least in one of the layers [50]. The main
idea of CNN is to solve the problem of substantial number of
parameters and allowing a network to run deeper and faster
while decreasing the parameter [51]. CNN has learnable
weights and biases, and those parameters can be learnt using
supervised or unsupervised learning [3,52]. Furthermore, like
other neural networks, CNN has loss product function in its
last layer (fully connected layer). Generally, CNN consists of
stacks of convolutional layer, pooling layer, activation
function, and fully connected layer. One of the examples of
CNN architecture is displayed in Fig. 1. According to Ji et al.
[9], CNN architecture can be constructed by stacking some
layers of convolution and subsampling alternately.
Fig. 1. CNN Architecture
B. Activation Function
In the training process using gradient descent, training with the saturating nonlinearity f(x) = tanh(x) takes longer than training with the non-saturating nonlinearity:

f(x) = max(0, x)     (1)
A neuron with this nonlinearity is called a Rectified Linear Unit (ReLU). Training a deep convolutional network with ReLU has been proven to be faster than with the standard tanh unit [53]. ReLU is the activation function recommended for most neural networks; applying it turns the layer's linear transformation into a non-linear one.
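Eq. (1) is simple enough to state directly in code; a minimal numpy sketch:

```python
import numpy as np

def relu(x):
    """Non-saturating nonlinearity f(x) = max(0, x) from Eq. (1):
    negative inputs are clamped to zero, positive inputs pass through."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = relu(x)
```

Because the positive branch has a constant gradient of 1, ReLU avoids the gradient saturation that slows tanh-based training.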
C. Dropout
Deep learning models generally have a large number of parameters, so overfitting is always a concern. Dropout is a technique applied to address this problem: it cuts connections between neurons (drops units), thinning the network during training. Dropping units prevents the network from fitting the training data too closely. Furthermore, dropout provides a way to combine many neural network architectures simultaneously and efficiently.
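The unit-dropping idea can be sketched as below; this shows the common "inverted dropout" formulation, which is an assumption on our part (the paper does not specify the variant).

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: during training, zero each unit with probability p
    and rescale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is the identity."""
    if not train:
        return x
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

x = np.ones(1000)
y = dropout(x, p=0.5)
```

Each training pass thus samples a different thinned sub-network, which is what lets dropout act as an efficient ensemble of architectures.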
D. SOFTMAX Algorithm
The softmax function is the function most frequently used at the output layer of a deep learning model for classification, representing a probability distribution over n different classes [50].
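A minimal numpy sketch of softmax, using the standard max-subtraction trick for numerical stability (an implementation detail not discussed in the paper):

```python
import numpy as np

def softmax(z):
    """Map scores z to a probability distribution over the last axis.
    Subtracting the max does not change the result but avoids overflow."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([2.0, 1.0, 0.1]))
```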
E. Recurrent Neural Network (RNN)
The Recurrent Neural Network (RNN) has been a focus of research since the 1990s. An RNN is built to handle sequential data, or data with temporal patterns. The recurrent topology adds a feedback connection that links a neuron back to a preceding neuron, so that activation values can be carried across time steps (looping).
F. Bidirectional-RNN Architecture
A bidirectional RNN combines an RNN that runs forward through time, from the start to the end of a sequence, with one that runs backward through time, from the end to the start. Fig. 2 illustrates the bidirectional-RNN architecture, which is used to learn a mapping of the input sequence x.

Fig. 2. Bidirectional-RNN Architecture [50]

h(t) is the state of the sub-RNN that moves forward through time, and g(t) is the state of the sub-RNN that moves backward. The output unit o(t) computes a representation that depends on both the previous and the following states.
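The forward/backward combination can be sketched as below. This is a minimal numpy sketch using plain tanh cells for clarity; the paper's model uses GRU cells, and the function name `birnn` and the weight shapes are illustrative assumptions.

```python
import numpy as np

def birnn(x, Wf, Uf, Wb, Ub):
    """Minimal bidirectional vanilla RNN over an input sequence x of
    shape (T, D). h(t) runs forward, g(t) runs backward; the output
    o(t) concatenates both states, so it sees past and future context."""
    T = x.shape[0]
    H = Uf.shape[0]
    h = np.zeros((T, H))
    g = np.zeros((T, H))
    state = np.zeros(H)
    for t in range(T):                      # forward sub-RNN: h(t)
        state = np.tanh(x[t] @ Wf + state @ Uf)
        h[t] = state
    state = np.zeros(H)
    for t in reversed(range(T)):            # backward sub-RNN: g(t)
        state = np.tanh(x[t] @ Wb + state @ Ub)
        g[t] = state
    return np.concatenate([h, g], axis=1)   # o(t), shape (T, 2H)

rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
out = birnn(rng.normal(size=(T, D)),
            rng.normal(size=(D, H)), rng.normal(size=(H, H)),
            rng.normal(size=(D, H)), rng.normal(size=(H, H)))
```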
G. Gated Recurrent Unit (GRU)
Chung et al. [54] state that RNN training is difficult because of the long-range dependencies in the model architecture: the gradient tends to vanish (most of the time) or explode (rarely, but with devastating effects). To prevent this, a gated recurrent unit approach such as the Gated Recurrent Unit (GRU) is used; such units have proven successful at properly retaining information over long time dependencies.
A GRU is applied at every recurrent unit to adaptively store time-dependent information. It has gating units that control the flow of information within a unit without using a separate memory cell.
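One GRU step can be sketched as below; a minimal numpy sketch of the standard GRU formulation, with the weight packing (`W`, `U`, `b` holding update-gate, reset-gate, and candidate parameters) being our own illustrative convention.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, W, U, b):
    """One GRU step. W: (3, D, H), U: (3, H, H), b: (3, H).
    The gates decide how much of the old state to keep, which is what
    lets gradients survive long sequences without a separate memory cell."""
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])              # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])              # reset gate
    h_tilde = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return (1.0 - z) * h + z * h_tilde                   # interpolate old/new

rng = np.random.default_rng(0)
D, H = 4, 3
h_new = gru_cell(rng.normal(size=D), np.zeros(H),
                 rng.normal(size=(3, D, H)), rng.normal(size=(3, H, H)),
                 np.zeros((3, H)))
```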
H. Connectionist Temporal Classification (CTC) Loss
Connectionist Temporal Classification (CTC) is a function that allows a Recurrent Neural Network (RNN) to be trained for sequence transcription tasks without prior alignment between the input and the target sequences [55]. The CTC loss gives the likelihood of an output sequence given the provided input [56].
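The effect of the CTC blank symbol is easiest to see at decoding time. The sketch below shows greedy CTC decoding (collapse repeats, drop blanks), a simplification we add for illustration; training would instead minimize the CTC loss summed over all alignments.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding: take the argmax label per frame, collapse
    consecutive repeats, then drop blank symbols."""
    best = log_probs.argmax(axis=-1)
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(int(k))
        prev = k
    return out

# 6 frames, 3 symbols (0 = blank): per-frame argmax path 0 1 1 0 2 2
frames = np.array([[.9, .05, .05], [.1, .8, .1], [.1, .8, .1],
                   [.9, .05, .05], [.1, .1, .8], [.1, .1, .8]])
decoded = ctc_greedy_decode(np.log(frames))
```

This is why the linear layer in the hyperparameter tables outputs "32 + blank" classes: the extra blank lets the network emit "no label" on transition frames.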
V. MODEL TRAINING
The model used in this research adopts the Lipnet model. To evaluate it, we propose three additional models.
A. Lipnet
The Lipnet model uses three blocks of 3D-CNN and two blocks of bidirectional RNN, trained with the same hyperparameters as the original Lipnet. It was trained on the translated dataset, which comprises 27,054 image sequences, and uses Hyperparameter 1, as shown in Table II.
TABLE II. HYPERPARAMETER 1

Layer     Size/Stride/Padding        Input Size       Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(1,2,2)    270x3x50x100     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x32x25x50     T x C x H x W
3D-CNN    (3,5,5)/(1,2,2)/(1,2,2)    270x32x12x25     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x64x6x13      T x C x H x W
3D-CNN    (3,3,3)/(1,2,2)/(1,2,1)    270x64x3x16      T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x96x1x3       T x C x H x W
Bi-GRU    256                        270x(96x1x3)     T x (C x H x W)
Bi-GRU    256                        270x512          T x F
Linear    32 + blank                 270x512          T x F
Softmax                              270x33           T x V
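The layer sizes in Table II can be checked with the standard convolution/pooling output formula; a small sketch (assuming zero-fill padding and floor division):

```python
def conv_out(size, kernel, stride, pad):
    """Output length of one convolution/pooling axis:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# First 3D-CNN layer of Hyperparameter 1: kernel (3,5,5), stride (1,2,2),
# padding (1,2,2) on a 270x3x50x100 (T x C x H x W) input.
t = conv_out(270, 3, 1, 1)   # temporal axis preserved
h = conv_out(50, 5, 2, 2)    # height halved
w = conv_out(100, 5, 2, 2)   # width halved
```

The result (270, 25, 50) matches the 270x32x25x50 input of the following pooling layer, with 32 being the number of filters.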
B. First Model
This model is identical to the Lipnet model, except that it evaluates the usefulness of padding by omitting it. The model still uses three blocks of 3D-CNN and two blocks of bidirectional RNN, and was trained on the normal dataset of 3,006 image sequences. The architecture of Lipnet and the first model is shown in Fig. 3. This model uses Hyperparameter 2, as shown in Table III.
Fig. 3. Lipnet Model and the First Model
TABLE III. HYPERPARAMETER 2

Layer     Size/Stride/Padding        Input Size       Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(0,0,0)    270x3x50x100     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x32x23x48     T x C x H x W
3D-CNN    (3,5,5)/(1,1,1)/(0,0,0)    270x32x11x24     T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x64x7x20      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(0,0,0)    270x64x3x10      T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x96x1x4       T x C x H x W
Bi-GRU    256                        270x(96x1x4)     T x (C x H x W)
Bi-GRU    256                        270x512          T x F
Linear    32 + blank                 270x512          T x F
Softmax                              270x33           T x V
C. Second Model
The second model is simpler than the Lipnet model: it uses one block of 3D-CNN and one block of bidirectional RNN, trained on the normal dataset. The architecture of the second model is shown in Fig. 4. This model uses Hyperparameter 3, as shown in Table IV.
TABLE IV. HYPERPARAMETER 3

Layer     Size/Stride/Padding        Input Size        Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(0,0,0)    270x3x50x100      T x C x H x W
Pool      (1,2,2)/(1,2,2)            270x32x23x48      T x C x H x W
Bi-GRU    256                        270x(32x11x24)    T x (C x H x W)
Linear    32 + blank                 270x256           T x F
Softmax                              270x33            T x V
Fig. 4. The Second Model
D. Third Model
The third model is more complex than the Lipnet model: it uses eight blocks of 3D-CNN and two blocks of bidirectional RNN, trained on the normal dataset. The architecture of the third model is shown in Fig. 5. This model uses Hyperparameter 4, as shown in Table V.
Fig. 5. The Third Model
TABLE V. HYPERPARAMETER 4

Layer     Size/Stride/Padding        Input Size        Dimension Order
3D-CNN    (3,5,5)/(1,2,2)/(2,4,4)    270x3x50x100      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(2,4,4)    270x32x27x53      T x C x H x W
Pool      (1,2,2)                    270x32x33x59      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x64x16x29      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x96x18x31      T x C x H x W
Pool      (1,2,2)                    270x96x20x33      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x128x10x16     T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x128x12x18     T x C x H x W
Pool      (1,2,2)                    270x128x14x20     T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x256x7x10      T x C x H x W
3D-CNN    (3,3,3)/(1,1,1)/(1,2,2)    270x256x9x12      T x C x H x W
Pool      (1,2,2)                    270x256x11x14     T x C x H x W
Bi-GRU    256                        270x(512x5x7)     T x (C x H x W)
Bi-GRU    256                        270x512           T x F
Linear    32 + blank                 270x512           T x F
Softmax                              270x33            T x V
VI. RESULT
To measure the performance of the trained models, the word error rate (WER) and character error rate (CER) are computed. WER and CER count the number of substitutions, deletions, and insertions needed to turn the hypothesis into the true label, normalized by the label length.
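The WER computation can be sketched as below; a minimal sketch using the standard Levenshtein dynamic program, with the example sentences being illustrative, not drawn from the paper's dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum number of substitutions, deletions,
    and insertions turning ref into hyp (dynamic programming)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(ref_sentence, hyp_sentence):
    """Word error rate: edit distance over words, divided by the number
    of reference words. CER is the same computation over characters."""
    ref = ref_sentence.split()
    return edit_distance(ref, hyp_sentence.split()) / len(ref)

rate = wer("saya pergi ke sekolah", "saya ke rumah sekolah")
```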
Table 6 below summarizes the performance of the system
using different models that have been described in the
previous section. Our proposed models are compared to the
Lipnet model according to WER and CER.
TABLE VI. EVALUATION RESULT

Model          WER (Word Error Rate)    CER (Character Error Rate)
Lipnet         87.10%                   70.00%
First Model    89.40%                   69.58%
Second Model   90.50%                   76.90%
Third Model    88.17%                   65.33%
Based on the performance shown in Table VI, the error rates are still high: the average WER across the models is 88.79%. The Lipnet model gives the best WER, with the lowest error of 87.10%, while the second model gives the worst, with a WER of 90.50%. By CER, the third model performs best, at 65.33%. Every trained model still performed poorly at recognizing sign language gestures, even though the models used similar data.
VII. CONCLUSION AND FUTURE WORKS
We proposed a deep learning model for sentence-level sign language recognition. The deep learning methods used in this research are a 3D-CNN and a bidirectional Recurrent Neural Network (RNN), applied to a 30-sentence SIBI video dataset collected by the researchers.
Based on the results presented in the previous section, the proposed models and the dataset do not seem to match well: the first, second, and third models, trained on the normal dataset, yield similar error rates, and the Lipnet model, though trained on the larger dataset, still produces a similar result. For future work, we would like to preprocess the data to better match the models, for example by handling noise in the dataset.
ACKNOWLEDGMENT
This research was supported by research grant No. 039A/VR.RTT/VI/2017 from the Ministry of Research, Technology and Higher Education of the Republic of Indonesia.
REFERENCES
[1] D. R. Kurnia and T. Slamet, “Menormalkan yang dianggap “tidak
normal” (studi kasus: penertiban bahasa isyarat tunarungu di SLB
Malang),” Indonesian Journal of Disability Studies, 3(1), pp. 34–43,
2016.
[2] M. R. Abid, E. M. Petriu, and E. Amjadian, “Dynamic sign language
recognition for smart home interactive application using stochastic
linear formal grammar,” IEEE Transactions on Instrumentation and
Measurement, 64(3), pp. 596–605, 2015.
[3] J. Huang, W. Zhou, H. Li, and W. Li, “Sign language recognition
using 3D convolutional neural networks,” IEEE International
Conference on Multimedia and Expo (ICME), IEEE, pp. 1–6, 2015.
[4] O. Koller, J. Forster, and H. Ney, “Continuous sign language
recognition: Towards large vocabulary statistical recognition systems
handling multiple signers,” Computer Vision and Image
Understanding, 141, pp. 108–125, 2015.
[5] M. K. Bhuyan, D. A. Kumar, K. F. MacDorman, and Y. Iwahori, “A
novel set of features for continuous hand gesture recognition,” Journal
on Multimodal User Interfaces, 8(4), pp. 333–343, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, IEEE, pp. 770–778, 2016.
[7] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” International Conference on
Learning Representations, pp. 1–14, 2014.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S.
Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent
convolutional networks for visual recognition and description,”
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2625–2634, 2015.
[9] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks
for human action recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, 35(1), pp. 221–231, 2013.
[10] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt,
“Sequential deep learning for human action recognition,”
International Workshop on Human Behavior Understanding, pp. 29–
39, 2011.
[11] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas,
“Lipnet: Sentence-level lipreading,” arXiv Prepr arXiv161101599,
2016.
[12] O. Koller, H. Ney, and R. Bowden, “Deep learning of mouth shapes
for sign language,” Proceedings of the IEEE International Conference
on Computer Vision Workshops, pp. 85–91, 2015.
[13] S. A. Mehdi and Y. N. Khan, “Sign language recognition using sensor
gloves,” Proceedings of the 9th International Conference on Neural
Information Processing (ICONIP), IEEE, pp. 2204–2206, 2002.
[14] L. T. Phi, H. D. Nguyen, T. Q. Bui, and T. T. Vu, “A glove-based
gesture recognition system for Vietnamese sign language,”
Proceedings of the 15th International Conference on Control,
Automation and Systems (ICCAS), 13(16), pp. 1555–1559, 2015.
[15] A. Ranjini S. S. and M. Chaitra, "Sign language recognition system,"
International Journal on Recent and Innovation Trends in Computing
and Communication, 2(4), pp. 947–953, 2014.
[16] S. Saengsri, V. Niennattrakul, and C. A. Ratanamahatana, "TFRS:
Thai finger-spelling sign language recognition system," 2nd
International Conference on Digital Information and Communication
Technology and its Applications (DICTAP), pp. 457–462, 2012.
[17] M. Koul, P. Patil, V. Nandurkar, and S. Patil, "Sign language
recognition using leap motion sensor," International Research Journal
of Engineering and Technology (IRJET), 3(11), pp. 322–325, 2016.
[18] L. E. Potter, J. Araullo, and L. Carter, "The leap motion controller: A
view on sign language," Proceedings of the 25th Australian
Computer-Human Interaction, Conference on Augmentation,
Application, Innovation, Collaboration – OzCHI, ACM, pp. 175–178,
2013.
[19] M. U. Kakde, M. G. Nakrani, and A. M. Rawate, "A review paper on
sign language recognition system for deaf and dumb people using
image processing", International Journal of Engineering Research and
Technology, 5(3), pp. 590–592, 2016.
[20] H. Bhavsar, "Review on feature extraction methods of image based
sign language recognition system," Indian Journal of Computer
Science and Engineering, 8(3), pp. 249–259, 2017.
[21] S. B. Carneiro, E. D. D. M. Santos, M. D. A. Talles, J. O. Ferreira, S.
G S. Alcala, and A. F. Da Rocha, "Static gestures recognition for
Brazilian sign language with kinect sensor," SENSORS, IEEE, 2016.
[22] E. Escobedo and G. Camara, "A new approach for dynamic gesture
recognition using skeleton trajectory representation and histograms of
cumulative magnitudes," SIBGRAPI Conference on Graphics,
Patterns and Images, pp. 209–216, 2016.
[23] J. Ma, W. Gao, J. Wu, and C. Wang, “A continuous Chinese sign
language recognition system,” Proceedings of the 4th International
Conference on Automatic Face and Gesture Recognition, IEEE, pp.
428–433, 2000.
[24] Y. Jiang, J. Tao, W. Ye, W. Wang, and Z. Ye, "An isolated sign
language recognition system using RGB-D sensor with sparse
coding," 17th International Conference on Computational Science and
Engineering, IEEE, pp. 21–26, 2014.
[25] C. Keskin, F. Kırac, Y. E. Kara, and L. Akarun, "Real time hand pose
estimation using depth sensors," International Conference on
Computer Vision Workshops, IEEE, pp. 1228–1234, 2011.
[26] J. L. Raheja, A. Mishra, and A Chaudhary, "Indian sign language
recognition using SVM," Pattern Recognition and Image Analysis,
26(2), pp. 434–441, 2016.
[27] E. Rakun, M. Andriani, I. W. Wiprayoga, K. Danniswara, and A.
Tjandra, "Combining depth image and skeleton data from kinect for
recognizing words in the sign system for Indonesian language (SIBI
[sistem isyarat bahasa Indonesia])," International Conference on
Advanced Computer Science and Information Systems (ICACSIS),
pp. 387–392, 2013.
[28] L. G. Zhang, Y. Chen, G. Fang, X. Chen, and W. Gao, “A vision-
based sign language recognition system using tied-mixture density
HMM,” Proceedings of the 6th International Conference on
Multimodal Interfaces, ACM, pp. 198–204, 2004.
[29] T. Starner and A. Pentland, “Real-time American sign language
recognition from video using hidden Markov models,” Motion-Based
Recognition, Springer, pp. 227–243, 1997.
[30] E. A. Kalsh and N. S. Garewal, "Sign language recognition system,"
International Journal of Computational Engineering Research, 3(6),
pp.15–21, 2013.
[31] M. V. D. Prasad, V. Kishore, and A. Kumar, "Indian sign language
recognition system using new fusion based edge operator," Journal of
Theoretical and Applied Information Technology, 88(3), pp. 574–
584, August 2016.
[32] D. K. Ghosh and S. Ari, “On an algorithm for vision-based hand
gesture recognition,” Signal, Image and Video Processing, 10(4), pp.
655–662, 2016.
[33] P. V. V. Kishore, M. V. D. Prasad, C. R. Prasad, and R. Rahul, "4-
camera model for sign language recognition using elliptical Fourier
descriptors and ANN," International Conference on Signal Processing
and Communication Engineering Systems (SPACES), IEEE, pp. 34–
38, 2015.
[34] S. C. W. Ong and S. Ranganath, "Automatic sign language analysis: a
survey and the future beyond lexical meaning," IEEE Transactions on
Pattern Analysis and Machine Intelligence, 27(6), pp. 873–891, 2005.
[35] K. M. Lim, A. W. C. Tan, and S. C. Tan, "A feature covariance matrix
with serial particle filter for isolated sign language recognition,"
Expert Systems with Applications, Elsevier Ltd, 54, pp. 208–218,
2016.
[36] P. C. Pankajakshan and B. Thilagavathi, "Sign language recognition
system," Innovations in Information, Embedded and Communication
Systems (ICIIECS), IEEE, pp. 1–4, 2015.
[37] T. E. Starner, "Visual recognition of American sign language using
hidden Markov models," Massachusetts Institute of Technology, Dept.
of Brain and Cognitive Sciences, 1995.
[38] R. Y. Wang and J. Popović, “Real-time hand-tracking with a color
glove,” ACM Transactions on Graphics (TOG), 28(3), 63, 2009.
[39] S. Yan, Y. Xia, J. S. Smith, W. Lu, and B. Zhang, “Multiscale
convolutional neural networks for hand detection,” Applied
Computational Intelligence and Soft Computing, 2017.
[40] R. H. Liang and M. Ouhyoung, “A real-time continuous gesture
recognition system for sign language,” Proceedings of the 3rd
International Conference on Automatic Face and Gesture
Recognition, IEEE, pp. 558–567, 1998.
[41] H. Wang, M. C. Leu, and C. Oz, “American sign language
recognition using multi-dimensional hidden Markov models,” Journal
of Information Science and Engineering, 22(5), pp. 1109–1123, 2006.
[42] M. Elmezain, A. Al-Hamadi, J. Appenrodt, and B. Michaelis, “A
hidden Markov model-based continuous gesture recognition system
for hand motion trajectory,” 19th International Conference on Pattern
Recognition, IEEE, pp. 1–4, 2008.
[43] V. N. Pashaloudi and K. G. Margaritis, "A performance study of a
recognition system for Greek sign language alphabet letters," 9th
Conference on Speech and Computer, 2004.
[44] R. Kaluri and C. H. Pradeep, "An enhanced framework for sign
gesture recognition using hidden Markov model and adaptive
histogram technique," International Journal of Intelligent
Engineering and Systems, 10, 2017.
[45] M. C. Roh, S. Fazli, and S. W. Lee, “Selective temporal filtering and
its application to hand gesture recognition,” Applied Intelligence,
45(2), pp. 255–264, 2016.
[46] E. Rakun, M. F. Rachmadi, and K. Danniswara, “Spectral domain
cross correlation function and generalized learning vector
quantization for recognizing and classifying Indonesian sign
language,” Advanced Computer Science and Information Systems
(ICACSIS), pp. 978–979, 2012.
[47] R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng,
“Convolutional-recursive deep learning for 3D object classification,”
Advances in Neural Information Processing Systems, pp. 656–664,
2012.
[48] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for
generating image descriptions,” IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3128–3137, 2015.
[49] N. Palfreyman, “Sign language varieties of Indonesia: A linguistic
and sociolinguistic investigation,” Sign Language and Linguistics,
20(1), 2017.
[50] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT
Press, 2016.
[51] H. H. Aghdam and E. J. Heravi, Guide to Convolutional Neural
Networks, Springer International Publishing, 2017.
[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
86(11), pp. 2278–2324, 1998.
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classification with deep convolutional neural networks,” Advances In
Neural Information Processing Systems, pp. 1097–1105, 2012.
[54] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation
of gated recurrent neural networks on sequence modeling," NIPS
Workshop on Deep Learning, pp. 1–9, December 2014.
[55] A. Graves and N. Jaitly, “Towards end-to-end speech recognition
with recurrent neural networks,” Proceedings of the 31st International
Conference on Machine Learning (ICML), pp. 1764–1772, 2014.
[56] A. Maas, Z. Xie, D. Jurafsky, and A. Ng, “Lexicon-free
conversational speech recognition with neural networks,” Proceedings
of the 2015 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
Technologies, pp. 345–354, 2015.