Sign Language Recognition
1 Satwik Ram Kodandaram, 2 N Pavan Kumar, 3 Sunil G L
1 8th Sem UG Student, Visvesvaraya Technological University, Karnataka, India
2 8th Sem UG Student, Visvesvaraya Technological University, Karnataka, India
3 Assistant Professor, Visvesvaraya Technological University, Karnataka, India
E-mail: satwikram29@gmail.com, pavankumarpk031999@gmail.com, sunilgl.gls@gmail.com
ABSTRACT
Sign Language is mainly used by deaf (hard of hearing) and dumb people to exchange information within their own community and with other people. It is a language in which people use hand gestures to communicate, as they cannot speak or hear. Sign Language Recognition (SLR) deals with recognizing hand gestures, starting from gesture acquisition and continuing until text or speech is generated for the corresponding hand gestures. Hand gestures in sign language can be classified as static or dynamic. Static hand gesture recognition is simpler than dynamic hand gesture recognition, but both are important to the human community. We can use Deep Learning and Computer Vision to recognize hand gestures by building Deep Neural Network architectures (Convolutional Neural Network architectures), where the model learns to recognize hand gesture images over a number of epochs. Once the model successfully recognizes a gesture, the corresponding English text is generated, and the text can then be converted to speech. This model will be efficient, and hence communication for deaf (hard of hearing) and dumb people will be easier. In this paper, we discuss how Sign Language Recognition is done using Deep Learning.
Index words: Hand Gestures; Sign Language
Recognition; Convolutional Neural Networks;
Computer Vision; Text to Speech.
1. INTRODUCTION
Deaf (hard of hearing) and dumb people use Sign Language (SL) [1], with hand and body gestures, as their primary means to express their ideas and thoughts within their own community and with other people.
It has its own vocabulary, meaning, and syntax
which is different from the spoken language or
written language. Spoken language is a language
produced by articulate sounds mapped against
specific words and grammatical combinations to
convey meaningful messages. Sign language uses
visual hand and body gestures to convey meaningful
messages. There are somewhere between 138 and 300 different sign languages used around the world today. In India, there are only about 250 certified sign language interpreters for a deaf population of around 7 million. This makes it difficult to teach sign language to deaf and dumb people, as only a limited number of sign language interpreters exist today. Sign Language Recognition is an attempt to recognize these hand gestures and convert them to the corresponding text or speech. Today, Computer Vision and Deep Learning have gained a lot of popularity, and many state-of-the-art (SOTA) models can be built. Using Deep Learning algorithms and image processing, we can classify these hand gestures and produce the corresponding text. An example is mapping the sign language gesture for the alphabet "A" to the English text "A" or the corresponding speech.
Fig. 1. Sign Language Hand Gestures
In Deep Learning, the Convolutional Neural Network (CNN) is the most popular neural network algorithm and is widely used for image and video tasks. For CNNs, we have advanced architectures like LeNet-5 [2] and MobileNetV2 [3], which can be used to achieve state-of-the-art (SOTA) results. We can combine these architectures using neural network ensemble techniques [4]. By doing this, we can achieve an almost 100% accurate model that recognizes the hand gestures. This model can be deployed in a web framework like Django, in a standalone application, or on embedded devices, where the hand gestures are recognized from the live camera and then converted to text. This system will help deaf and dumb people to communicate easily.
Fig. 2. Convolutional Neural Networks
1.1 Motivation
The various advantages of building a Sign Language Recognition system include:
• Sign language hand gesture to text/speech translation systems or dialog systems that can be used in specific public domains such as airports, post offices, or hospitals.
• Sign Language Recognition (SLR) can help translate video to text or speech, enabling communication between hearing and deaf people.
1.2 Problem Statement
Sign language uses many gestures, so it looks like a movement language consisting of a series of hand and arm motions. Different countries have different sign languages and hand gestures. It is also noted that some unknown words are translated by simply showing the gesture for each alphabet in the word. In addition, sign language includes specific gestures for each alphabet in the English dictionary and for each number between 0 and 9. Based on this, sign language gestures are made up of two groups, namely static gestures and dynamic gestures. The static gesture is used for alphabet and number representation, whereas the dynamic gesture is used for specific concepts, including words, sentences, etc. The static gesture consists of hand gestures only, whereas the latter includes motion of the hands, the head, or both. Sign language is a visual language and consists of 3 major components: finger-spelling, word-level sign vocabulary, and non-manual features. Finger-spelling is used to spell words letter by letter and convey the message, whereas the latter two are keyword-based. However, the design of a sign language translator is quite challenging despite many research efforts during the last few decades. Moreover, even the same signs have significantly different appearances for different signers and from different viewpoints. This work focuses on the creation of a static sign language translator using a Convolutional Neural Network. We created a lightweight network that can be used in embedded devices, standalone applications, or web applications with limited resources.
1.3 Objectives
The main objective of this project is to contribute to the field of automatic sign language recognition and its translation to text or speech. In our project, we focus on static sign language hand gestures. This work focuses on recognizing the hand gestures for the 26 English alphabets (A-Z) and 10 digits (0-9) using Deep Neural Networks (DNN). We created a convolutional neural network classifier that can classify the hand gestures into English alphabets and digits. We have trained the neural network under different configurations and architectures, such as LeNet-5 [2], MobileNetV2 [3], and our own architecture. We used the horizontal voting ensemble technique to maximize the accuracy of the model. We have also created a web application using the Django REST Framework to test our results with a live camera.
2. LITERATURE REVIEW
2.1 Real-time sign language fingerspelling
recognition using convolutional neural networks
from depth map [5].
This work focuses on static fingerspelling in American Sign Language. It presents a method for implementing a sign-language-to-text/voice conversion system without using handheld gloves and sensors, by capturing the gestures continuously and converting them to voice. In this method, only a few images were captured for recognition.
2.2 Design of a communication aid for physically
challenged [6].
The system was developed in the MATLAB environment. It consists mainly of two phases, namely a training phase and a testing phase. In the training phase, the authors used feed-forward neural networks. The problem here is that MATLAB is not very efficient, and integrating the concurrent attributes as a whole is also difficult.
2.3 American Sign Language Interpreter System
for Deaf and Dumb Individuals [7].
The discussed procedures could recognize 20 out of 24 static ASL alphabets. The alphabets A, M, N, and S could not be recognized due to the occlusion problem. Only a limited number of images were used.
3. IMPLEMENTATION
3.1 Dataset
We have used multiple datasets and trained multiple
models to achieve good accuracy.
3.1.1 ASL Alphabet
The data is a collection of images of the alphabet from American Sign Language, separated into 29 folders that represent the different classes.
The training dataset consists of 87,000 images of 200x200 pixels. There are 29 classes, of which 26 are the English alphabets A-Z and the remaining 3 are SPACE, DELETE, and NOTHING. These 3 classes are very important and helpful in real-time applications.
3.1.2 Sign Language Gesture Images Dataset
The dataset consists of 37 different hand sign gestures, which include the A-Z alphabet gestures, the 0-9 number gestures, and a gesture for space, which is how deaf (hard of hearing) and dumb people represent the space between two letters or two words while communicating.
Each gesture has 1,500 images of 50x50 pixels, so with 37 gestures there are 55,500 images in total. A Convolutional Neural Network (CNN) is well suited to this dataset for model training and gesture prediction.
3.2 Data Pre-processing
An image is nothing more than a 2-dimensional array of numbers or pixels ranging from 0 to 255. Typically, 0 means black and 255 means white. An image can be defined by a mathematical function f(x, y), where 'x' represents the horizontal and 'y' the vertical coordinate in a coordinate plane. The value of f(x, y) at any point gives the pixel value at that point of the image.
Image pre-processing is the use of algorithms to perform operations on images. It is important to pre-process the images before sending them for model training. For example, all the images should have the same size, such as 200x200 pixels; otherwise, the model cannot be trained.
Fig. 3. Sample Image without Pre-processing
The steps we have taken for image pre-processing are (a minimal code sketch follows Fig. 4):
1. Read the images.
2. Resize or reshape all the images to the same size.
3. Remove noise.
4. Normalize the pixel values to the range 0 to 1 by dividing the image array by 255.
Fig. 4. Pre-Processed Image
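The sketch below illustrates these pre-processing steps with OpenCV; the target size and the Gaussian-blur noise removal are illustrative choices, not the exact settings used in our experiments.

import cv2

def preprocess(path, size=(50, 50)):
    img = cv2.imread(path)                     # read the image
    img = cv2.resize(img, size)                # resize all images to the same size
    img = cv2.GaussianBlur(img, (3, 3), 0)     # remove noise (illustrative choice)
    return img.astype("float32") / 255.0       # normalize pixel values to the range 0 to 1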
3.3 Convolutional Neural Networks (CNN)
Computer Vision is a field of Artificial Intelligence
that focuses on problems related to images and
videos. CNNs combined with computer vision are capable of solving complex visual problems.
Fig. 5. Working of CNN
A Convolutional Neural Network has two main phases, namely feature extraction and classification. A series of convolution and pooling operations is performed to extract the features of the image. The size of the output matrix decreases as we keep applying filters:

Size of new matrix = (Size of old matrix - Filter size) + 1

For example, applying a 3x3 filter to a 200x200 image produces a 198x198 output.
A fully connected layer in the convolutional neural network serves as the classifier. The last layer predicts the probability of each class. The main steps involved in a convolutional neural network are:
1. Convolution
2. Pooling
3. Flatten
4. Full connection
3.3.1 Convolution
Convolution is a filter applied to an image to extract features from it. Different filters extract different features, such as edges and highlighted patterns, in an image. The filters are initialized randomly and learned during training.
The convolution operation creates a filter of some size, say 3x3, which is a common default. The filter then slides over the image from the top-left corner to the bottom-right corner, performing an element-wise multiplication and summation at each position. The obtained result is the extracted feature map (a minimal code sketch is given after Fig. 6).
Fig. 6. Convolution
Fig. 6. Feature Extraction
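A minimal NumPy sketch of this "valid" convolution operation is shown below; it is for illustration only and is not the implementation used inside the deep learning framework.

import numpy as np

def convolve2d(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1          # (size of old matrix - filter size) + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # element-wise multiplication of the image window with the filter, then sum
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out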
3.3.2 Pooling
After the convolution operation, the pooling layer
will be applied. The pooling layer is used to reduce
the size of the image. There are two types of pooling:
1. Max Pooling
2. Average Pooling
3.3.2.1 Max pooling
Max pooling selects the maximum pixel value from each window of the matrix.
Fig. 7. Max Pooling
This method helps extract the features with high importance, i.e., those that are highlighted in the image.
3.3.2.2 Average pooling
Unlike max pooling, average pooling takes the average of the pixel values in each window.
Fig. 8. Average Pooling
In most cases, max pooling is used because it generally performs better than average pooling (a small code sketch of both operations is given below).
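The sketch below illustrates 2x2 max pooling and average pooling with NumPy on a small example matrix; the window size and example values are illustrative.

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    h, w = feature_map.shape
    # group the matrix into non-overlapping size x size windows (extra rows/columns are cropped)
    blocks = feature_map[:h - h % size, :w - w % size].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))         # keep the largest value in each window
    return blocks.mean(axis=(1, 3))            # average pooling: mean of each window

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]], dtype=float)
print(pool2d(x, mode="max"))                   # [[6. 4.] [8. 9.]]
print(pool2d(x, mode="avg"))                   # [[3.75 2.25] [4.5 4.]]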
3.3.3 Flatten
Fig. 9. Flatten
The resulting matrix is multi-dimensional. Flattening converts this data into a one-dimensional array that is fed as input to the next layer. We flatten the output of the convolutional layers to create a single feature vector.
3.3.4 Full Connection
Fig. 10. Full Connection
A fully connected layer is simply a feed-forward neural network. It takes the flattened feature vector as input and produces the prediction. Based on the ground truth, the loss is calculated and the weights are updated using the gradient-descent backpropagation algorithm.
3.4 Convolutional Neural Network (CNN) Architectures
3.4.1 LeNet-5
Fig. 11. LeNet-5 Implementation
The LeNet-5 [2] architecture consists of two pairs of
convolutional and average pooling layers, followed
by a flattening convolutional layer, then two fully
connected layers, and finally a SoftMax classifier.
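A minimal Keras sketch of this LeNet-5 layout is shown below; the input size (50x50 grayscale) and class count (37) are illustrative assumptions, and a Flatten layer followed by a Dense layer stands in for the original flattening convolutional layer.

from tensorflow.keras import layers, models

def build_lenet5(input_shape=(50, 50, 1), num_classes=37):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(6, kernel_size=5, activation="tanh"),    # first convolution
        layers.AveragePooling2D(pool_size=2),                  # first average pooling
        layers.Conv2D(16, kernel_size=5, activation="tanh"),   # second convolution
        layers.AveragePooling2D(pool_size=2),                  # second average pooling
        layers.Flatten(),
        layers.Dense(120, activation="tanh"),                  # fully connected layers
        layers.Dense(84, activation="tanh"),
        layers.Dense(num_classes, activation="softmax"),       # SoftMax classifier
    ])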
3.4.2 MobileNetV2
Fig. 12. MobileNetV2 Implementation
MobileNetV2 [3] is a convolutional neural network
architecture that performs well on mobile devices.
The architecture of MobileNetV2 contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers. This network is lightweight and efficient.
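One common way to use this architecture is to take a MobileNetV2 backbone and add a small classification head, as sketched below; the input size, ImageNet weights, and class count are illustrative assumptions rather than our exact configuration.

import tensorflow as tf

def build_mobilenetv2(input_shape=(224, 224, 3), num_classes=29):
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet")
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)   # pool the backbone features
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=base.input, outputs=outputs)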
3.4.3 Own Architecture
Fig. 13. Own Architecture Implementation
In our own architecture, we have implemented 3
convolution layers followed by batch normalization
and max pooling, followed by global average
pooling with dense layer and batch normalization,
and a final dense layer for classification.
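A minimal Keras sketch of this architecture is shown below; the filter counts, input size, and class count are illustrative assumptions.

from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(50, 50, 3), num_classes=37):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),   # convolution block 1
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),   # convolution block 2
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),  # convolution block 3
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),           # global average pooling
        layers.Dense(128, activation="relu"),      # dense layer with batch normalization
        layers.BatchNormalization(),
        layers.Dense(num_classes, activation="softmax"),  # final classification layer
    ])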
3.5 Proposed Model
We have trained 3 models on 2 different datasets so that they perform well on unseen data. We have trained LeNet-5, MobileNetV2, and our own architecture. Instead of taking only the best of the 3 models, we have taken all 3 models and built a final model that performs an ensemble of these 3 models.
3.5.1 Neural Network Ensemble Horizontal
Voting
In Machine Learning, we have ensemble techniques in which we train multiple sub-models and combine their outputs. The Random Forest algorithm is an example, as it combines multiple decision trees. Similarly, we can build ensembles of neural networks [4]. There are many ensemble techniques for neural networks, such as stacked generalization [8], ensemble learning via negative correlation [9], and probabilistic modelling with neural networks [10] [11]. We have implemented the Horizontal Voting Ensemble method to improve the performance of the neural networks.
Horizontal voting is an ensemble technique for neural networks in which we train several sub-models and make predictions using all of them. For the final prediction, we collect the predictions from all the sub-models and see which class has received the maximum number of votes; the final prediction is the class with the maximum votes. For this, we have used 3 sub-models; an odd number of sub-models avoids a tie between two classes in the worst case.
Let models be the set of neural network models trained on the training set T(xi, yi), such that m ∈ models. Let yhat be the predictions obtained by all the models on the test set T'(xi, yi). Let 'array' be the function for converting a list to an array.
Algorithm 1: Horizontal Voting
Input: models, test set T'(xi, yi), and an empty list yhat
Output: yhat, the final predictions
1. Step 1: Obtain the predictions of each model
2. for each test sample x[i] in T' do
3.     for each m in models do
4.         collect the prediction m.predict(x[i]) as a vote
5.     end for
6.     yhat[i] ← the class with the highest number of votes for the i-th sample
7. end for
8. Step 2: Convert the list into an array
9. yhat ← array(yhat)
10. Step 3: return yhat
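A minimal Python sketch of Algorithm 1 is shown below, assuming models is a list of trained Keras classifiers and x_test is an array of pre-processed test images.

import numpy as np

def horizontal_voting(models, x_test):
    # Step 1: obtain the predicted class of every sub-model for every test image
    votes = np.stack([np.argmax(m.predict(x_test), axis=1) for m in models])
    yhat = []
    for i in range(votes.shape[1]):
        # the final prediction is the class with the highest number of votes
        classes, counts = np.unique(votes[:, i], return_counts=True)
        yhat.append(classes[np.argmax(counts)])
    # Steps 2 and 3: convert the list into an array and return it
    return np.array(yhat)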
For the MobileNetV2 model, we have used the Adam optimizer with learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, and epsilon = 1e-07.
While training all the models, we have used the ReduceLROnPlateau callback with factor = 0.2, patience = 2, and min_lr = 0.001.
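The sketch below shows how this training configuration can be expressed in Keras; the loss function, validation split, and exact number of epochs are illustrative assumptions (we trained for around 10-15 epochs with a batch size of 32, as reported in Section 4).

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.2, patience=2, min_lr=0.001)

# `model`, `x_train`, and `y_train` are placeholders for an architecture from
# Section 3.4 and the pre-processed training data.
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_split=0.1,
          epochs=15, batch_size=32, callbacks=[reduce_lr])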
By using this horizontal ensemble technique, we
have achieved 99.8% accuracy.
We have deployed all the models with the Django web framework and built a simple frontend that accepts an image from the user and returns the response.
We have also built an API that opens the live camera, detects the hand gestures, and then converts them to the corresponding English alphabets.
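A hypothetical sketch of such a prediction endpoint using Django REST Framework is shown below; the view name, request field, target size, and the `models`, `horizontal_voting`, and `labels` names refer to the earlier sketches and are illustrative assumptions, not our exact implementation.

import cv2
import numpy as np
from rest_framework.views import APIView
from rest_framework.response import Response

class GesturePredictionView(APIView):
    def post(self, request):
        # decode the uploaded image and pre-process it as in Section 3.2
        data = np.frombuffer(request.FILES["image"].read(), dtype=np.uint8)
        img = cv2.imdecode(data, cv2.IMREAD_COLOR)
        img = cv2.resize(img, (50, 50)).astype("float32") / 255.0
        # ensemble prediction and class-index-to-letter mapping (illustrative)
        pred = int(horizontal_voting(models, np.expand_dims(img, axis=0))[0])
        return Response({"prediction": labels[pred]})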
4. EXPERIMENTAL RESULTS
Fig. 14. Training graphs
We have trained all the models for around 10-15
epochs with a batch size of 32.
Fig. 15. Model Prediction during training
Models          Accuracy
MobileNetV2     98.9%
LeNet-5         97%
Own Model       98%
Ensemble        99.8%
Table 1. Performance Results
All the models performed well on the test cases. After applying the horizontal voting ensemble technique to these 3 models, we have achieved 99.8% accuracy, which is almost 100%.
Fig. 16. JSON Response from the Application
We have used the Django REST Framework as the backend for our project. This is a sample JSON response that is sent to the frontend when a user uploads an image.
Fig. 17. Model Prediction on Live Camera
Prediction: “W”
We have used OpenCV to test our results in the live
camera. This is a sample result on a live camera.
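A minimal sketch of such a live-camera test loop with OpenCV is shown below; the frame pre-processing and the `models`, `horizontal_voting`, and `labels` names refer to the earlier sketches and are illustrative assumptions.

import cv2
import numpy as np

cap = cv2.VideoCapture(0)                      # open the default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    roi = cv2.resize(frame, (50, 50)).astype("float32") / 255.0   # pre-process the frame
    pred = int(horizontal_voting(models, np.expand_dims(roi, axis=0))[0])
    cv2.putText(frame, "Prediction: " + labels[pred], (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("Sign Language Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):      # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()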
Fig. 18. Model Prediction on Live Camera
Prediction: “T”
Fig. 19. Model Prediction on Live Camera
Prediction: “C”
Fig. 20. Model Prediction on Live Camera
Prediction: “M”
5. CONCLUSION
In conclusion, we were able to successfully develop a practical and meaningful system that can understand sign language and translate it to the corresponding text. There are still shortcomings in our system: it can detect the hand gestures for the digits 0-9 and the alphabets A-Z, but it does not cover body gestures and other dynamic gestures. We are confident that it can be improved and optimized in the future.
REFERENCES
[1] Brill R. 1986. The Conference of Educational
Administrators Serving the Deaf: A History.
Washington, DC: Gallaudet University Press.
[2] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner,
"Gradient-based learning applied to document
recognition," in Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278-2324, Nov. 1998, doi:
10.1109/5.726791.
[3] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov
and L. Chen, "MobileNetV2: Inverted Residuals and
Linear Bottlenecks," 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2018,
pp. 4510-4520, doi: 10.1109/CVPR.2018.00474.
[4] L. K. Hansen and P. Salamon, "Neural network
ensembles," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 12, no. 10,
pp. 993-1001, Oct. 1990, doi: 10.1109/34.58871.
[5] Kang, Byeongkeun, Subarna Tripathi, and Truong Q. Nguyen. "Real-time sign language fingerspelling recognition using convolutional neural networks from depth map." arXiv preprint arXiv:1509.03001 (2015).
[6] Suganya, R., and T. Meeradevi. "Design of a communication aid for physically challenged." In Electronics and Communication Systems (ICECS), 2015 2nd International Conference on, pp. 818-822. IEEE, 2015.
[7] Sruthi Upendran, Thamizharasi A., "American Sign Language Interpreter System for Deaf and Dumb Individuals," 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2014.
[8] David H. Wolpert, Stacked generalization,
Neural Networks, Volume 5, Issue 2, 1992, Pages
241-259, ISSN 0893-6080,
https://doi.org/10.1016/S0893-6080(05)80023-1.
[9] Y. Liu, X. Yao, Ensemble learning via negative
correlation, Neural Networks, Volume 12, Issue
10,1999, Pages 1399-1404, ISSN 0893-6080,
https://doi.org/10.1016/S0893-6080(99)00073-8.
[10] MacKay D.J.C. (1995) Developments in Probabilistic Modelling with Neural Networks - Ensemble Learning. In: Kappen B., Gielen S. (eds) Neural Networks: Artificial Intelligence and Industrial Applications. Springer, London. https://doi.org/10.1007/978-1-4471-3087-1_37
[11] Polikar R. (2012) Ensemble Learning. In:
Zhang C., Ma Y. (eds) Ensemble Machine Learning.
Springer, Boston, MA. https://doi.org/10.1007/978-
1-4419-9326-7_1
Suganya, R., and T. Meeradevi. "Design of a communication aid for phys-ically challenged." In Electronics and Communication Systems (ICECS), 2015 2nd International Conference on, pp. 818-822. IEEE, 2015.