Citation: Ganesan, J.; Azar, A.T.; Alsenan, S.; Kamal, N.A.; Qureshi, B.; Hassanien, A.E. Deep Learning Reader for Visually Impaired. Electronics 2022, 11, 3335. https://doi.org/10.3390/electronics11203335
Academic Editor: George A. Papakostas
Received: 3 September 2022; Accepted: 12 October 2022; Published: 16 October 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Deep Learning Reader for Visually Impaired
Jothi Ganesan 1, Ahmad Taher Azar 2,3,* , Shrooq Alsenan 2, Nashwa Ahmad Kamal 4, Basit Qureshi 2
and Aboul Ella Hassanien 5
1 Department of Computer Applications, Sona College of Arts and Science, Salem 636005, Tamil Nadu, India
2 College of Computer & Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia
3 Faculty of Computers and Artificial Intelligence, Benha University, Benha 13518, Egypt
4 Faculty of Engineering, Cairo University, Giza 12613, Egypt
5 Faculty of Computers and Artificial Intelligence, Cairo University, Giza 12613, Egypt
* Correspondence: aazar@psu.edu.sa or ahmad_t_azar@ieee.org or ahmad.azar@fci.bu.edu.eg
Abstract:
Recent advances in machine and deep learning algorithms and enhanced computational
capabilities have revolutionized healthcare and medicine. Nowadays, research on assistive technology
has benefited from such advances in creating visual substitution for visual impairment. Several
obstacles exist for people with visual impairment in reading printed text which is normally substituted
with a pattern-based display known as Braille. Over the past decade, more wearable and embedded
assistive devices and solutions were created for people with visual impairment to facilitate the reading
of texts. However, assistive tools for comprehending the embedded meaning in images or objects are
still limited. In this paper, we present a Deep Learning approach for people with visual impairment
that addresses the aforementioned issue with a voice-based form to represent and illustrate images
embedded in printed texts. The proposed system is divided into three phases: collecting input
images, extracting features for training the deep learning model, and evaluating performance. The
proposed approach leverages deep learning algorithms, namely the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), for extracting salient features, captioning images, and converting written text to speech. The CNN is implemented for detecting features from the printed image and its associated caption. The LSTM network is used as a captioning tool to describe the detected text from images. The identified captions and detected text are converted into voice messages for the user via a Text-To-Speech API. The proposed CNN-LSTM model is investigated using various network architectures, namely GoogleNet, AlexNet, ResNet, SqueezeNet, and VGG16. The empirical results show that the CNN-LSTM training model with the ResNet architecture achieves the highest image-caption prediction accuracy of 83%.
Keywords:
artificial intelligence; Convolutional Neural Network architectures; Long Short Term
Memory; visually impaired individuals; assistive device; deep learning
1. Introduction
Over the past decade, machine learning algorithms and applications have contributed
to new advances in the field of assistive technology. Researchers are leveraging such
advancements to continuously improve human quality of life, especially for those with disabilities or alarming health conditions [1]. Assistive technology (AT) deploys devices, services, or programs to improve the functional capabilities of people with disabilities [2]. The scope of assistive technology research comprises hearing impairment, visual impairment, and cognitive impairment, among others [3–5].
Vision impairment can vary from mild, moderate, and severe impairment to
total blindness. In the light of the recent advances in machine learning and deep learn-
ing, research studies and new solutions for people with visual impairment have gained
more popularity. The main goal is to provide people with visual impairment with visual
substitution by creating navigation or orientation solutions. Such solutions can ensure
self-independence, confidence, and safety for people with visual impairment in their daily tasks [6]. According to estimates, approximately 253 million individuals suffer from visual
impairments: 217 million have low-to-high vision impairments, and 36 million are blind.
Figures have also shown that, amongst this population, 4.8% were born with visual deficiencies such as blindness; for 90% of these individuals, the ailments have other causes, including accidents, diabetes, glaucoma, and macular degeneration.
The world’s population is not only growing, but also getting older, meaning more
people will lose their sight due to chronic diseases [7]. Such impediments can have knock-
on effects; for example, individuals with visual impairments who want an education
may need specialized help in the form of a helper or equipment. Learners with visual
impairments can now make use of course content in different forms, such as audiotapes,
Braille, and magnified material [8]. It is worth noting that these tools read the text instead
of images. Technological advancements have been employed in educational environments
to assist people with visual impairment, blind people, and special-needs learners, and these
developments, particularly concerning machine learning, are ongoing.
The main objective of conducting visual impairment research studies is to achieve
visual enhancement, vision replacement, or vision substitution as originally classified by
Welsh Richard in 1981 [9]. Vision enhancement involves acquiring signals from a camera, which are processed to produce an output display through a head-mounted device. Vision replacement deals with displaying visual information to the human brain’s visual cortex or the optic nerve. Vision substitution concentrates on delivering nonvisual output, such as auditory signals [10,11]. In this paper, we focus on a vision substitution solution that delivers a vocal description of both printed text and images to people with visual impairment.
There are three main areas of concentration concerning research on people with visual
impairment; namely, mobility, object detection and recognition, and navigation. In the era of data explosion and information availability, it is imperative to consider means of information access for people with visual impairment, especially printed information and images [6]. Over the past decades, authors have leveraged state-of-the-art machine learning
algorithms to develop solutions supporting each of the aforementioned areas.
Deep learning has evolved in prominence as a field of study that seeks innovative
approaches for automating different tasks depending on input data [12–18]. Deep learning is a type of artificial intelligence technique that can be used for image classification, recognition, virtual assistants, healthcare, authentication systems, natural language processing, fraud detection, and other purposes. This study describes an Intelligent Reader system that employs deep learning techniques to help people with visual impairment read and describe images in a printed text book. In the proposed technique, a Convolutional Neural Network (CNN) [19] is utilised to extract features from input images, while Long Short-Term Memory (LSTM) [20] is used to describe visual information in an image. The
intelligent learning system generates a voice message comprising text and graphic infor-
mation from a printed text book using the text-to-speech approach. Deep learning-based
technologies increase image-related task performance and can help people with visual im-
pairment live better lives. The overall architecture of the proposed solution is demonstrated
in Figure 1.
The proposed intelligent reader system reads text using optical character recognition (OCR) and the Google Text-to-Speech (TTS) approach, which converts textual input into voice messages. The input images are used to train a CNN-LSTM model that predicts the appropriate caption of an image and sends it to the intelligent reader system. The reader system transmits all data to visually impaired users in the form of audio messages. The proposed approach is divided into three phases: acquisition of input images, extraction of features for training the deep learning model, and performance assessment. The efficiency of the constructed model is evaluated using different deep learning architectures, including ResNet, AlexNet, GoogleNet, SqueezeNet, and VGG16. The experimental results suggest that the ResNet network design outperforms other architectures in terms of accuracy.
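To make the pipeline concrete, the sketch below strings the three components together, assuming pytesseract for OCR and the gTTS package for the Google Text-to-Speech step; the captioning function is a placeholder for the CNN-LSTM model detailed in Section 4, and all file names are illustrative.

```python
# Minimal sketch of the reader pipeline: OCR a printed page, caption any
# embedded figure, and speak the result. pytesseract and gTTS are assumed
# tooling choices; the paper itself only names OCR and Google Text-to-Speech.
import pytesseract
from PIL import Image
from gtts import gTTS


def caption_image(image: Image.Image) -> str:
    """Placeholder for the CNN-LSTM captioning model described in Section 4."""
    raise NotImplementedError


def read_page(page_path: str, figure_paths: list[str], out_path: str = "page.mp3") -> None:
    # 1. Extract the printed text from the scanned page.
    text = pytesseract.image_to_string(Image.open(page_path))

    # 2. Describe each embedded figure with the captioning model.
    captions = [caption_image(Image.open(p)) for p in figure_paths]

    # 3. Convert the combined text and captions into a voice message.
    message = text + " " + " ".join("Image description: " + c for c in captions)
    gTTS(text=message, lang="en").save(out_path)
```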
This paper provides the following contributions. First, it delivers an Electronic Travel
Aids (ETA) vision substitution solution for people with visual impairment that includes
spatial inputs such as photography or visual content. Although many studies have pro-
posed text-to-speech solutions, this paper utilizes deep learning capabilities to describe
images as well as text to a person with visual impairment. Second, it briefs the reader on the most significant deep learning architectures for image recognition, along with the most notable features of each architecture. Finally, this paper proposes and implements a deep learning architecture utilizing CNN and LSTM algorithms: content is extracted from text and images with the former, and captions are predicted with the latter.
Figure 1. The Deep Learning Reader Architecture.
In recent decades, many researchers have developed assistive devices/systems to read text books for people with visual impairment, helping them enhance their
learning skills without the assistance of a tutor. Reading image content is a challenging task
for visually impaired students. The proposed system is unique in that it incorporates the
intelligence of two deep learning approaches, CNN and LSTM, to assist people with visual
impairment in reading a text book (both text and image content) without the assistance of a
human. The proposed approach reads the text content in the book using OCR and then
provides an audio message. If any images are present in the text book between the text passages,
the system uses the CNN model to extract the features of the image, and the LSTM model
to describe the captions of the images. Following that, the image captions are translated
into voice messages. As a result, visually challenged persons understand the concept of the
text book without any ambiguity. The suggested method combines the benefits of OCR,
CNN, LSTM, and TTS to read and describe the complete book content through audio/voice
message.
The rest of the paper is structured as follows: Section 2 covers previously proposed solutions available for visual impairment. The preliminaries of the various deep learning architectures are explained in Section 3. Section 4 presents the proposed CNN-LSTM design. The empirical findings and model evaluation are detailed in Section 5. We conclude this endeavour in Section 6 with concluding remarks and future work.
2. Related Work
Vision impairment is a common disability with different levels of severity. Assistive technology has contributed to providing visual substitution in the form of products, devices, software, or systems [21–23]. Visual substitution is an alternative means to capture visual images, directions, or movements and deliver them in a non-visual manner through audio or Braille [2]. Visual substitution can be categorized into three main categories; namely, Electronic Travel Aids (ETAs), Electronic Orientation Aids (EOAs), and Position Locator Devices (PLDs) [24]. An overview of each category of visual substitution is discussed below.
1. Electronic Travel Aids (ETAs)
ETAs are devices that translate environmental information, typically identified via human vision, using non-visual sensing. They include sensing inputs such as a camera, Radio Frequency Identification (RFID), Bluetooth, or Near-Field Communication (NFC) to receive environmental inputs, and feedback modalities to deliver information to the user in a non-visual form such as audio, tactile cues, or vibrations.
2. Electronic Orientation Aids (EOAs)
EOA devices provide a navigation path and identify obstacles for people with visual impairment. The objective of EOA devices is to improve safety and mobility in unfamiliar environments by detecting obstacles and delivering information by means of audio or vibrations [25].
3. Position Locator Devices (PLDs)
PLDs provide precise positioning of devices utilizing the Global Positioning System (GPS) and Geographic Information System (GIS). Such technologies have the limitation that they ought to be used outdoors and need to be coupled with other sensors to identify obstacles during navigation.
This paper delivers an ETA vision substitution solution for people with visual impair-
ment. Technological developments, including computer vision and deep and machine
learning, are utilized in an autonomous learning system for people with visual impairment.
2.1. AT Based on Deep Learning Techniques
The author in [26] outlined a one-off, low-cost wearable assistive technology (AT) device running on solar power that provides users with continuous, real-time object recognition to aid individuals with visual impairment. This system comprises three elements: a camera, a system-on-module (SoM) processing unit, and an ultrasonic sensor. The user wears the camera like a pair of glasses to provide real-time recordings, while the SoM, worn as a belt, processes information from the camera, and the sensor detects objects. Lin et al.
[27] proposed a deep learning-based support system to heighten users’ ability to perceive
their surroundings. This system involves a wearable terminal with an earpiece, an RGBD camera, and an earphone. A CPU supports the deep learning computation, and a smartphone is employed whenever touch-based actions are necessary. The system also provides safe, clear walking directives thanks to the RGBD information and semantic maps. The work in [28] employs a deep convolutional neural network-based architecture, “RetinaNet”, to create a system that detects indoor objects. The assessment of detection levels utilizes different backbones, including AlexNet, GoogleNet, ResNet, SqueezeNet, and VGGNet. Applying this system resulted in a detection mean average precision (mAP) of 84.61%.
To assist people with visual impairments, Tasnim et al. [29] outlined an automatic
process solution to detect Bangladeshi banknotes by way of a convolutional neural net-
work. The research proved successful, as demonstrated by the fact that the system was
92% accurate in specifying the notes and could provide written and audio outputs. The
researcher in [30] designed a smart glass system for blind people and those with visual impairment using computer vision and deep learning algorithms. This proposed approach includes
four distinct modules: low-light image enhancement, object detection, audio feedback, and
tactile graphics generation. In the first module, a deep learning approach is used to improve
the quality of the dark image, and objects/texts are recognized using an object recognition
method. Finally, the text to speech module produces an audio output. In this method, the
object detection model is trained on 133 different types of sounds. The ExDark data set
is used to assess the effectiveness of the proposed approach. Reading books, detecting
currency notes, and determining parcel specifics are all challenging tasks for people with
visual impairment. Mishra et al. [31] developed ChartVi, an automated chart summarising
system that accepts various types of chart images such as line, pie, bar, and so on and
generates a summary. The CNN-VGG16 network model is utilised in this approach to
identify the chart image categories, and then feature extraction techniques are employed to
automatically separate graphical and textual information. The inpainting method removed
the grid lines from the chart. Finally, the chart summary is divided into three sections: prime, core, and wrapping. The prime section of the chart comprises the fundamental information about the chart, such as the title, axis titles, and range; the core part contains the actual meaning of the chart; and the wrapping part contains the details of multi-series charts. According to the empirical studies, ChartVi achieves 97.09%
accuracy in chart type classification, >95% accuracy in textual segmentation, and 98%
accuracy in graphical extraction. Consequently, a database containing thousands of images
of these banknotes was created. Developments such as these mean blind individuals or
those with visual impairments can participate in everyday activities.
2.2. AT Based on Raspberry Pi
According to Zamir et al. [32], a smart reader system based on the Raspberry
Pi can turn text into spoken signals. A camera recognizes printed text thanks to optical
character recognition (OCR). This method proposes to create a system that converts images
to audio using the Raspberry Pi single-board computer. In [33], the authors presented the
unified descriptor network, Dual Desc, that could outperform the NetVLAD architecture in
terms of describing images. A wearable device validates real-world information, and the
suggested visual localization approach employs multimodal images to avoid issues associated with RGB photos. The author of [34] developed a voice mentor system that reads
content such as books, currency notes, and shopping parcels and provides audio output to
the user. The Raspberry Pi is deployed in this approach to support the portable camera
and audio signals through headphones. To extract text from images and transform the text
to audio, optical character recognition (OCR) is used. Chauhan et al. [35] use a Raspberry Pi 3B model and ultrasonic sensors to create Ikshana, an intelligent assistive device for vision-impaired users. This device is designed to help people with a variety of daily chores, including character recognition, facial detection, currency denomination identification,
and obstacle detection. OCR software is used to extract text from printed books and internet
content. The assisting device’s design includes a Raspberry Pi 3B model as the computing
unit, a Raspberry Pi camera, buttons, and ultrasonic sensors. The headphone acts as a
narrative agent, directing the audio output to the user.
A smart electronic assistive device, consisting of two gadgets, glasses and a smart cane, is designed by Flores et al. [36]. The glasses utilise an image processing
technique to recognise text, while the smart cane detects obstacles in the walking path
by using sensors named VL53L0X and Ultrasonic. The developed device achieves 100%
accuracy in obstacle detection, 98.13% accuracy in text recognition, and 91.33% accuracy in
natural scene identification.
2.3. AT Based on Internet of Things (IoT)
The author of [37] developed an intelligent assistive system based on machine learning and the Internet of Things (IoT) to recognise the acquaintances of people with visual impairment in their regular activities. The author built the proposed system using three major technologies:
machine learning, image processing, and IoT. In this system, the data ingestion layer is used
to store input images, while the data analysis layer analyses processed data and evaluates
the system’s accuracy and efficiency using machine learning. Finally, the application layer
builds a mobile app that may be used to detect a new individual whose face samples must be saved in the cloud and that gives haptic feedback to the person with visual impairment when an acquaintance is detected. The researcher in [38] created an IoT-based automatic object identification system that can recognise objects and currency notes in
real time. This system employs four kinds of sensors to detect obstructions in the front,
left, right, and floor directions. To detect the currency note, the Single Shot Detector
(SSD) model using MobileNet and Tensorflow-lite is utilised. There were 365 people with
visual impairment evaluated with this technology, and 82% of them thought the cost was
acceptable, 13% thought it was moderate, and the remaining 5% thought it was relatively
high. The proposed system’s overall accuracy in object identification and recognition is
99.31% and 98.43%, respectively.
2.4. Image Captioning Techniques
Image captioning techniques are utilised in a wide range of applications, including bridge damage detection, remote sensing image captioning, language caption synthesis, and construction. Chun et al. [39] used an image captioning technique to describe the
damage state of a bridge. A deep learning model is used in this work to produce descriptive
sentences from an image. This method can also detect many types of damage in bridge
images and provide a full interpretation of complicated imagery. The real time dataset is
created during inspection work on 3118 bridges controlled by Japan’s Kanto Regional De-
velopment Bureau’s MLIT from 2004 to 2018. The developed technique uses the Bilingual
Evaluation Understudy (BLEU) score to evaluate the algorithm’s performance. The pro-
posed method achieves 69.3% accuracy for accurately generating explanatory phrases that
give user-friendly, text-based descriptions of bridge damage in images. The researcher in [40] used Meta captioning to develop a remote sensing image captioning system. The Meta
characteristics are extracted from two tasks in this approach: remote sensing classification
and natural image classification. Because of the scarcity of training datasets, effective remote
sensing image captioning is extremely difficult. The Meta features are then employed
for remote sensing image captioning. The ResNet network is used to train natural image
categorization. To illustrate the efficiency of the Meta captioning framework, three distinct
remote sensing captioning datasets were employed in the experimental analysis: Sydney-
Captions, the Remote Sensing Image Captioning Dataset, and the University of California
Merced dataset.
In [41], an integrated approach for extracting semantic information about items, be-
haviours, and interactions from construction images with visual links was devised. In this
approach, the CNN model is used to extract the prominent features from the entire image,
and the Mask R-CNN-based Encoder model is used to forecast the image’s description
words based on the input features. To train the model, 41,668 images were collected from
174 distinct construction sites and divided into training and validation sets. According to
the results of the experimental analysis, the proposed method produces BLEU Scores of 0.61,
0.52, 0.44, and 0.36 for BLEU1, BLEU2, BLEU3, and BLEU4, respectively. Afyouni et al. [42] developed AraCap, an Arabic Image Caption Generation approach that combines an object-
based and image captioning framework. The COCO and Flickr30k datasets are used to
assess the method’s performance. The proposed method includes the object detection and
image captioning processes in a sequential order. Using a similarity score, the proposed
approach generates captions that are compared to original captions from public databases.
The results show that the similarity scores of the proposed models for Arabic generated cap-
tions surpassed the basic captioning technique. A remote sensing captioning model was constructed in [43] utilising a Variational Autoencoder and a Reinforcement Learning-based
Two-stage Multi-task Learning Model (VRTMM). CNN is used in this method to extract
both semantic and spatial characteristics from an image. Then, Reinforcement Learning
is used to improve the quality of the generated phrases. To identify the remote sensing
image scene, a publicly accessible Remote Sensing Image dataset of 31,500 images and 45 scene classifications was used. The results of the experiments illustrate that the proposed
model is successful at remote sensing image captioning and produces a new state-of-the-art
outcome.
Table 1 illustrates the learning approaches employed in recent studies designed for
people with visual impairment.
Table 1. Summarization of the Related Works.

Author(s) | Year | Technique Used | Developed System
Denic et al. [44] | 2019 | CNN | Object detection system
Felix et al. [45] | 2019 | Android mobile app | Blind assistive technology
Durgadevi et al. [46] | 2020 | Image classification | Indoor object detection
Lin et al. [27] | 2019 | DL | A system that assists people in determining their perspective of their environment
Zamir et al. [32] | 2019 | Raspberry Pi | OCR-based text detection system
Calabrese et al. [26] | 2020 | DL | Object detection system
Afif et al. [28] | 2020 | Deep CNN | Indoor object detector system
Shen et al. [43] | 2020 | CNN, Variational Autoencoder and Reinforcement Learning | Remote sensing image captioning
Cheng et al. [33] | 2021 | NetVLAD | Image description system
Tasnim et al. [29] | 2021 | CNN | Bangladeshi banknote detection system
Mukhiddinov et al. [30] | 2021 | CNN | Smart glass for object detection
Afyouni et al. [42] | 2021 | CNN, LSTM | Arabic image caption generation
Sahithi et al. [34] | 2022 | Raspberry Pi and OCR | Voice mentor system that reads text content
Chauhan et al. [35] | 2021 | Raspberry Pi and OCR | Ikshana: character, facial, object and currency identification system
Flores et al. [36] | 2021 | Image processing techniques and ultrasonic sensors | Obstacle/object detection
Aravindan et al. [37] | 2021 | Machine learning and IoT | Recognise the acquaintances of visually impaired people in their regular activities
Rahman [38] | 2021 | Deep learning and IoT | Object and currency note identification
Mishra et al. [31] | 2022 | CNN-VGG16 | Summarization of chart images
Chun et al. [39] | 2022 | CNN | Bridge damage detection captioning method
Yang et al. [40] | 2022 | LSTM | Remote sensing image captioning
Wang et al. [41] | 2022 | CNN, Mask-RCNN | Extract visual information about construction images
Figure 2 depicts a summary of the relevant literature. Recently, new innovations in the field of assistive technology have emerged, providing excellent assistance to people with visual impairment in a variety of ways. According to the above literature, 54% of researchers
applied deep learning and artificial intelligence approaches to design an assistive device
or system. The key advantage of these systems is that they are mobile apps, making
it very easy for the user to utilise them. The figure also depicts that 23% of the assistive devices are Raspberry Pi and IoT based hardware devices. The remaining 23% of researchers analyse image captioning methods using deep learning. The significant applications
of the literature discussed above include text recognition, currency note identification,
bridge damage detection, language prediction, remote sensing image captioning and facial
recognition. It is extremely difficult for visually challenged persons to comprehend image
information presented in textbooks, articles, and online advertisements. To overcome
these limitations, the proposed system uses deep learning algorithms to provide image
information in the form of audio output.
Figure 2. Related Literature Summary.
3. Preliminaries
Deep learning is a type of machine learning and artificial intelligence (AI) that models learning from data. This approach helps academic researchers to gather, assess, and decipher substantial amounts of information because it streamlines and quickens the process.
A vast amount of research has been conducted in the field of computer vision in recent
decades. Image classification, image segmentation, video tracking, pedestrian identification,
object detection, and many other applications are examples of computer vision applications.
One of the most essential computer vision techniques is object detection, which is used
to discover and locate objects/obstacles inside an image or video. Object identification
approaches include drawing bounding boxes and representing various things of interest in
a given image. Several deep learning variations based on artificial neural networks have
been employed such as Multilayer Perceptron (MLP), Recurrent Neural Networks (RNN),
Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM), where different architectures play an important role in different applications [47].
3.1. Multilayer Perceptron (MLP)
The Multilayer Perceptron is a feed-forward artificial neural network algorithm that has input, output, and one or more hidden layers [48]. The perceptron can use the Rectified Linear Unit (ReLU) [49] or Sigmoid activation function, which is combined with the initial weights in a weighted sum for prediction. In the fully connected layers of an MLP, all the nodes are
connected with the next and previous layer. There are many applications of multi-layer
perceptron such as speech recognition, pattern recognition, sentiment analysis, etc.
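As a concrete illustration of the layer structure described above, the following is a minimal MLP sketch, assuming PyTorch as the framework and illustrative layer sizes (784 inputs, one hidden layer of 256 units, 10 output classes).

```python
# A minimal multilayer perceptron sketch: one hidden layer with ReLU
# activation and a softmax over the output scores. Sizes are illustrative.
import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(784, 256),   # input layer -> hidden layer (weighted sums)
    nn.ReLU(),             # Rectified Linear Unit activation
    nn.Linear(256, 10),    # hidden layer -> output layer (10 classes)
)

logits = mlp(torch.randn(32, 784))   # a batch of 32 flattened inputs
probs = logits.softmax(dim=1)        # class probabilities for prediction
```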
3.2. Convolutional Neural Networks (CNNs)
A CNN [50] can be defined as a kind of deep learning neural network that has aided the
development of classifying and recognizing images. The CNNs are composed of several
different basic layers followed by the activation layers. A CNN is made up of three layers:
a convolutional layer, a pooling layer, and a fully connected layer. The Convolutional layer
involves a procedure wherein a succession of layers retrieve low- to high-level features from
the input layer. Meanwhile, the fully connected layer utilizes the Softmax Classification
method to calculate and arrange the class label scores. The pooling layer is responsible for
reducing the convoluted features’ spatial dimensions. This pooling comprises two kinds:
average pooling and maximum pooling. The former provides an average of each value from
the part of the image within the kernel’s boundaries, while the latter returns the topmost
value. The fully connected (FC) layer performs classification using the characteristics
retrieved by the previous layers and their various filters. FC layers typically use a softmax
activation function to classify inputs, yielding a probability ranging from 0 to 1. Figure 3
shows the CNN architecture.
Figure 3. CNN Architecture [51].
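The following is a minimal sketch, assuming PyTorch, of the three layer types described above: convolutions with ReLU, max pooling, and a fully connected softmax classifier. The channel counts and the 224 × 224 input size are illustrative, not the paper's exact configuration.

```python
# Illustrative CNN with the three layer types described above: convolution,
# max pooling (halving spatial dimensions), and a fully connected classifier
# whose softmax yields class scores between 0 and 1.
import torch
from torch import nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 112 -> 56
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 10),                  # fully connected layer
    nn.Softmax(dim=1),                            # class scores in [0, 1]
)

scores = cnn(torch.randn(1, 3, 224, 224))         # one 224x224 RGB image
```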
Numerous CNN architectures, such as AlexNet [52], VGG16 [53], SqueezeNet [54], ResNet [55], and GoogLeNet [56], have emerged in recent years, with many differences
in terms of layer types, hyper-parameters, and so on. The most significant predefined
networks are discussed in this article.
3.3. CNN-AlexNet
AlexNet is a pioneering architecture in the field of computer vision. This model takes
images with dimensions of 227 × 227 × 3 as input. As the number of filters increases, the model is trained deeper and more features are extracted. In addition, the filter size decreases in the deeper layers, meaning the filters become progressively smaller. RGB images are fed into the deep learning model’s input layer. Softmax is the activation function utilised in the output layer [52].
3.4. CNN-VGG16
The VGG16 is a typical convolution neural network (CNN) architecture developed by
Karen Simonyan and Andrew Zisserman of the University of Oxford. The architecture’s performance was assessed using the ImageNet dataset [57], on which it obtained 92.7 percent top-5 test accuracy in 2014. In comparison to AlexNet, VGG16 replaces large kernel-sized filters with stacks of small 3 × 3 filters. The architecture’s input image dimensions are set at 224 × 224 × 3. All of the hidden layers in this network are followed by the ReLU activation function. Finally, the softmax layer serves as the output layer [53].
3.5. CNN-GoogLeNet
The primary goal of the Inception architectural model is to use fewer computational resources than earlier designs. The initial version of the Inception model is called “GoogLeNet”, and it has 22 layers. These networks have learnt feature representations for a variety of images. The network’s input dimensions are 224 × 224 × 3. The GoogLeNet architecture differs from prior designs such as AlexNet and VGG16 in that it uses global average pooling while building a deeper architecture. The Rectified Linear Unit (ReLU) is used as the activation function in this architecture’s convolutions [56].
3.6. CNN-ResNet
Deep neural networks need more time to train the model and are more prone to
overfitting. To overcome these shortcomings, Microsoft launched ResNet, a residual
learning framework that eases the training of networks that are substantially deeper than those previously employed. Every few stacked layers in this network design directly fit a desired underlying mapping [55].
3.7. CNN-SqueezeNet
SqueezeNet is a smaller neural network that was created to be a more compact alternative to AlexNet. This architecture has 50× fewer parameters than AlexNet and runs about 3× faster. It uses ReLU activations in all squeeze and expand layers [54].
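All five backbones discussed above ship pretrained in torchvision, which is an assumed implementation detail rather than the paper's stated tooling; the sketch below loads them and truncates ResNet before its classifier so that it can act as a feature extractor for captioning.

```python
# Load the five compared backbones pretrained on ImageNet (torchvision is
# an assumption) and turn ResNet into a caption-encoder feature extractor
# by dropping its final classification layer.
import torch
from torch import nn
from torchvision import models

backbones = {
    "AlexNet": models.alexnet(pretrained=True),
    "VGG16": models.vgg16(pretrained=True),
    "GoogLeNet": models.googlenet(pretrained=True),
    "ResNet": models.resnet50(pretrained=True),
    "SqueezeNet": models.squeezenet1_0(pretrained=True),
}

# Everything up to (and including) global average pooling, minus the classifier.
resnet_encoder = nn.Sequential(*list(backbones["ResNet"].children())[:-1]).eval()
with torch.no_grad():
    features = resnet_encoder(torch.randn(1, 3, 224, 224)).flatten(1)  # shape (1, 2048)
```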
Table 2 describes the 3D depiction of each deep learning network architecture. TensorSpace (https://tensorspace.org/index.html (accessed on 15 October 2022)) is an interactive visualization tool that exposes data connections between network layers.
Table 2. Visualization of Deep Learning Architectures: 3D renderings of GoogleNet, AlexNet, ResNet, VGG16, and SqueezeNet (visualizations omitted).
3.8. Long Short Term Memory (LSTM)
Long Short-Term Memory (LSTM) [58] networks are Recurrent Neural Networks (RNNs) that have the capacity to grasp order dependency in sequence prediction scenarios. An RNN extends a feed-forward neural network with internal memory. In this network, the output of the preceding step acts as an input to the current step: after its generation, the output is replicated and returned to the RNN. During the decision-making process, the network combines the current input with information acquired from previous inputs, which helps it identify the order of the sequence. LSTM net-
works can be used in different contexts, including activity recognition, grammar learning,
handwriting identification, human action detection, picture description, rhythm learning,
time series prediction, voice recognition, and video description. Figure 4illustrates the
LSTM architecture.
Figure 4. LSTM Network Architecture [59].
LSTM networks comprise numerous memory blocks, otherwise referred to as cells
and illustrated in the image as rectangles. These blocks take responsibility for recording
information, and modifying this information occurs using one of four gate methods. LSTMs
handle Short-Term Memory (STM) and Long-Term Memory (LTM), while the gates aim
to streamline the computation process. In this instance, the LTM moves to the forget gate,
where it loses data that does not serve a purpose: conversely, the learn gate makes it
possible to grasp data from the STM, and the remember gate amends LTM data and brings
it up to date, and the use gate forecasts the output of the current event.
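A minimal LSTM sketch in PyTorch (an assumed framework) follows: word indices are embedded and fed through the recurrent layer, which maintains the short-term (hidden) and long-term (cell) states described above. The vocabulary and dimension sizes are illustrative.

```python
# Minimal LSTM sketch: embed a sequence of word indices and run the
# recurrent layer, which carries a hidden state (short-term memory) and a
# cell state (long-term memory) across time steps.
import torch
from torch import nn

embed = nn.Embedding(num_embeddings=5000, embedding_dim=256)  # 5000-word vocabulary
lstm = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)

tokens = torch.randint(0, 5000, (1, 12))   # one caption of 12 word indices
outputs, (h_n, c_n) = lstm(embed(tokens))  # h_n: hidden state, c_n: cell state
print(outputs.shape)                       # (1, 12, 512): one hidden state per word
```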
4. The Proposed CNN-LSTM Design
The proposed approach involves feeding the input file into the intelligent reader
system, which utilizes an Optical Character Recognition (OCR) tool that scrutinizes the file’s contents, while the Google Text-to-Speech (TTS) technique adapts written input into voice
responses. When a file has images, the trained CNN-LSTM model predicts the related
captions, which are forwarded to the intelligent reader system. The reader system passes
on all data in the form of voice messages. The proposed system is divided into three
phases: collecting input images, extracting features for training the deep learning model,
and evaluating performance. Such an approach aims to ease concerns over predicting
sequences, including spatial inputs such as photography or visual content. Figure 5 depicts
the suggested CNN-LSTM model’s architecture.
Phase 1 (Input Image Collection): The input images are collected and preprocessed. In
this research, the Flickr 8K dataset, which comprises images and associated human descriptions,
is utilised for model training.
Phase 2 (Model Training): It consists of two main parts: feature extraction and a
language prediction model built with two deep learning techniques: Convolutional Neural
Network (CNN) and Long Short-Term Memory (LSTM). The CNN is a customized deep neural network used for image classification and recognition. Images in the CNN model are represented as a 2D matrix that can be scaled, translated, and rotated. The CNN model analyses the images from top to bottom and left to right, extracting salient features for image categorization. In this network architecture, a convolutional layer with 3 × 3 kernels is utilised for feature extraction with the ReLU activation function. To minimise the dimensions of the input picture, a max-pooling layer with 2 × 2 kernels is utilised. The extracted features are then put into the
LSTM model, which will provide the image caption. LSTM is a subsection of Recurrent
Neural Networks (RNN) that was created to solve sequence prediction issues. The output
from the last hidden state of the CNN (encoder) is fed as the input of the decoder. Let $x_1$ be the <START> vector and the required label $y_1$ the first word in the sequence. In the same way, let $x_2$ be the word vector of the first word, with the network expected to identify the next word. Lastly, $x_T$ is the last word, and $y_T$ is the <END> token. The visualization of the language prediction model is depicted in Figure 6.
Figure 5. CNN-LSTM Design.
Figure 6. Language Prediction Model.
The language model takes the image pixels $I$ and the input word vectors, denoted as $(x_1, x_2, \ldots, x_n)$, and determines the series of hidden states $(h_1, h_2, \ldots, h_n)$ that produce the outputs $(y_1, y_2, \ldots, y_n)$. As the initial hidden state, the image feature vectors are transmitted only once. As a result, the image vector $I$, the previous hidden state $h_{t-1}$, and the current input $x_t$ are used to determine the next hidden state. A softmax layer is used on the hidden state activation to generate the current output $y_t$:

$b_v = W_{hi}\left[\mathrm{CNN}_{\theta_c}(I)\right]$  (1)

$h_t = f\left(W_{hx} x_t + W_{hh} h_{t-1} + b_h + \mathbb{1}(t{=}1) \odot b_v\right)$  (2)

$y_t = \mathrm{softmax}\left(W_{oh} h_t + b_o\right)$  (3)
The CNN-LSTM is a deep learning architecture that combines two algorithms: CNN and LSTM. The former extracts the salient features of the input images for sequence prediction, and the latter predicts the captions. The developed deep network model is evaluated using various architectures, including ResNet, AlexNet, GoogleNet, SqueezeNet, and VGG16.
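The sketch below is one possible PyTorch rendering of the language model in Equations (1)-(3): the CNN feature vector initializes the hidden state, and each word embedding advances an LSTM cell whose output is projected onto the vocabulary. Layer sizes and names are assumptions, not the authors' exact implementation.

```python
# Sketch of the language model in Equations (1)-(3): the CNN feature vector
# initializes the hidden state (Eq. 1), each word embedding x_t advances the
# LSTM cell (Eq. 2), and a projection over the vocabulary gives y_t (Eq. 3).
import torch
from torch import nn


class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=5000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # Eq. (1): W_hi [CNN(I)]
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)  # Eq. (2): next hidden state
        self.out = nn.Linear(hidden_dim, vocab_size)    # Eq. (3): softmax in the loss

    def forward(self, image_feats, word_ids):
        h = self.init_h(image_feats)                    # image injected once, at t = 1
        c = torch.zeros_like(h)
        logits = []
        for t in range(word_ids.size(1)):
            x_t = self.embed(word_ids[:, t])            # current input word vector
            h, c = self.cell(x_t, (h, c))
            logits.append(self.out(h))                  # scores over the vocabulary
        return torch.stack(logits, dim=1)
```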
Phase 3 (Testing): In phase 3, the trained model is tested using the test dataset. The CNN-
LSTM model predicts the caption sequence from the test image. The proposed approach’s
efficiency is determined using metrics such as BLEU, precision, recall, and accuracy. Using
Google Text-to-Speech API, the output captions are turned into audio messages. The
intelligent reader system based on deep learning enables people with visual impairment to
easily understand text as well as images displayed in text content.
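A hedged sketch of the Phase 3 inference step follows: greedy decoding from the <START> token until <END>, reusing the CaptionDecoder fields from the previous sketch, and gTTS as an assumed client for the Google Text-to-Speech API. The vocabulary dictionaries are illustrative.

```python
# Greedy caption generation followed by text-to-speech. `word_to_id` and
# `id_to_word` are assumed vocabulary dictionaries; `encoder`/`decoder` are
# the feature extractor and CaptionDecoder from the earlier sketches.
import torch
from gtts import gTTS


def generate_caption(encoder, decoder, image, word_to_id, id_to_word, max_len=20):
    feats = encoder(image.unsqueeze(0)).flatten(1)   # CNN feature vector
    word = torch.tensor([word_to_id["<START>"]])
    h = decoder.init_h(feats)
    c = torch.zeros_like(h)
    words = []
    for _ in range(max_len):
        h, c = decoder.cell(decoder.embed(word), (h, c))
        word = decoder.out(h).argmax(dim=1)          # greedy: most probable next word
        if id_to_word[word.item()] == "<END>":
            break
        words.append(id_to_word[word.item()])
    return " ".join(words)


# caption = generate_caption(resnet_encoder, decoder, test_image, word_to_id, id_to_word)
# gTTS(text=caption, lang="en").save("caption.mp3")   # audio message for the user
```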
5. Results and Discussion
5.1. Dataset Collection
In this research, the Flickr8k dataset [60,61] is employed to train the model. The Flickr8k dataset contains 8092 images, each annotated with 5 sentences using Amazon Mechanical Turk. The annotations on each image allow for progress in automatic image description and grounded language understanding. The accompanying Flickr8k text file contains the image names and captions. For training the deep learning model, the dataset is divided into three parts: 80%, 10%, and 10% for training, validation, and testing, respectively. Table 3
shows a sample image with a caption.
Table 3. Sample Images and Descriptions/Captions (images omitted).
- A child in a pink dress is climbing up a set of stairs in an entry way.
- A man stands in front of a very tall building.
- The white dog is playing in a green field with a yellow toy.
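A minimal sketch of the dataset preparation described above: the Flickr8k caption file pairs each image with its five descriptions, and the image list is split 80/10/10. The file name and fixed random seed are assumptions.

```python
# Parse the Flickr8k caption file ("image.jpg#0<TAB>caption" per line) and
# split the 8092 images 80/10/10 into training, validation, and test sets.
import random
from collections import defaultdict

captions = defaultdict(list)
with open("Flickr8k.token.txt", encoding="utf-8") as f:        # assumed file name
    for line in f:
        image_id, caption = line.strip().split("\t")
        captions[image_id.split("#")[0]].append(caption.lower())

images = sorted(captions)
random.seed(42)                                                # assumed seed
random.shuffle(images)
n = len(images)
train = images[: int(0.8 * n)]
val = images[int(0.8 * n): int(0.9 * n)]
test = images[int(0.9 * n):]
print(len(train), len(val), len(test))   # roughly 6473 / 809 / 810 for 8092 images
```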
5.2. Results and Discussion
The training process involved feeding the dataset, which was the input, into the model.
This research employed a CNN and an LSTM to ascertain an image’s caption: the CNN extracted the features, and the trained LSTM model produced the caption. After training, the model accepts an image and subsequently summarizes its content. The trained model helps capture the information encoded in the image, as illustrated in Figure 7.
Figure 7. Image Description presented using Trained Model [61].
As a consequence of the parameter analysis, several metrics were investigated, in-
cluding input image size, activation function, developer name and year of creation, and
top-5 error rate. Table 4 depicts the input parameter values. In this table, the input image size of AlexNet is 227 × 227 × 3, while that of the remaining network architectures is 224 × 224 × 3. The activation function determines the output of the neural network; common choices include sigmoid, tanh, ReLU, and softmax. In this article, AlexNet and VGG16 employed the softmax activation function in the output layer, whereas the others used ReLU activation. When compared to the other architectures, ResNet has the lowest error rate. In the empirical analysis, the batch size and number of epochs were set to 512 and 200, respectively.
Table 4. Parameter Values of the Proposed Method.

CNN Architecture | Input Image Size | Activation Function | Batch Size, Epochs | Top-5 Error Rate
AlexNet | 227 × 227 × 3 | Softmax | 512, 200 | 15.3%
GoogleNet | 224 × 224 × 3 | ReLU | 512, 200 | 6.67%
VGG16 | 224 × 224 × 3 | Softmax | 512, 200 | 7.32%
ResNet | 224 × 224 × 3 | ReLU | 512, 200 | 3.6%
SqueezeNet | 224 × 224 × 3 | ReLU | 512, 200 | 19.7%
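The input sizes in Table 4 translate into a simple preprocessing step; the sketch below, assuming torchvision transforms and the usual ImageNet normalization statistics, resizes images to 227 × 227 for AlexNet and 224 × 224 for the other backbones.

```python
# Resize and normalize inputs per backbone, matching the sizes in Table 4.
# The normalization mean/std are the standard ImageNet statistics (assumed).
from torchvision import transforms


def make_transform(arch: str) -> transforms.Compose:
    size = 227 if arch.lower() == "alexnet" else 224
    return transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
```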
The suggested approach features a document comprising words and images: the
LSTM model predicts the caption, and a voice message disseminates all relevant data.
Figures 8 and 9 demonstrate the output of the proposed deep learning reader.
BiLingual Evaluation Understudy (BLEU) [62] evaluates the performance levels of the
image captioning system and carries out an investigation of the n-gram correlation between
the reference translation statement and the translation statement under consideration.
The BLEU score is computed using the following equations:

$\mathrm{BLEU} = \min\left(1,\ \exp\left(1 - \dfrac{\text{reference length}}{\text{output length}}\right)\right)\left(\prod_{i=1}^{4} \mathrm{precision}_i\right)^{1/4}$  (4)

$\mathrm{precision}_i = \dfrac{\sum_{snt \in \text{Cand-Corpus}}\ \sum_{i \in snt} \min\left(m^i_c,\ m^i_r\right)}{w^i_t = \sum_{snt' \in \text{Cand-Corpus}}\ \sum_{i' \in snt'} m^{i'}_c}$  (5)

where $m^i_c$ is the count of $i$-grams in the candidate matching the reference translation, $m^i_r$ is the count of $i$-grams in the reference translation, and $w^i_t$ is the total number of $i$-grams in the candidate translation.
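For reference, BLEU-1 and BLEU-2 can be computed with NLTK (an assumed tooling choice); the toy references and candidate below are illustrative, not the experiment's data.

```python
# BLEU-1 and BLEU-2 over a (toy) corpus: each candidate caption is compared
# against its set of reference captions, as in Table 5.
from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "child", "climbs", "the", "stairs"],
               ["a", "girl", "in", "a", "pink", "dress", "goes", "up", "stairs"]]]
candidates = [["a", "child", "in", "a", "pink", "dress", "climbs", "stairs"]]

bleu1 = corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0))   # 1-gram only
bleu2 = corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)) # 1- and 2-gram
print(f"BLEU-1: {bleu1:.4f}, BLEU-2: {bleu2:.4f}")
```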
A higher BLEU score indicates correspondingly high performance levels. Table 5
features the 1- and 2-gram BLEU scores for the AlexNet, GoogleNet, ResNet, SqueezeNet,
and VGG16 networks. Studies have found that the ResNet network architecture exceeds
the other networks’ performance levels.
Table 5. BLEU Score Values.
Architecture BLEU-1 BLEU-2
Alexnet 0.6347 0.6217
GoogleNet 0.7286 0.7368
VGG 16 0.7824 0.7303
ResNet 0.8126 0.8026
SqueezeNet 0.6012 0.6175
Figure 8. Deep learning Reader Output 1.
Figure 9. Deep learning Reader Output 2.
Table 6 summarises the performance of the proposed framework and the other image captioning approaches presented in the related work section. Image captioning is used in a variety of applications such as bridge damage detection, remote sensing image captioning, language caption synthesis, construction, and so on. According to the table, the BLEU scores for the construction image captioning method, the remote sensing image captioning method, and Arabic Image Caption Generation were 0.56, 0.77, and 0.81, respectively. For describing
the image caption, the proposed method used various Convolutional Neural Network
(CNN) pre-trained networks such as AlexNet, GoogleNet, ResNet, SqueezeNet, and VGG16.
The empirical results show that the proposed CNN–ResNet network model achieves
a higher BLEU score value than other network models and existing image captioning
approaches.
Table 6. The Performance Comparison with Existing Approaches.

Author(s) | Deep Learning Technique Utilized | Application | BLEU-1 | BLEU-2
Chun et al. | CNN model | Bridge damage detection | 0.768 | 0.732
Yang et al. | LSTM | Remote sensing image captioning | 0.8108 | 0.7451
Wang et al. | CNN, Mask R-CNN | Extract visual information about construction images | 0.6100 | 0.5200
Afyouni et al. | CNN, LSTM | Arabic Image Caption Generation | 0.81 (similarity score) | —
Shen et al. | CNN, VA and Reinforcement Learning | Remote sensing image captioning | 0.7934 | 0.6794
The Proposed Method | CNN, LSTM–AlexNet | Image captioning for visually impaired people | 0.6347 | 0.6217
The Proposed Method | CNN, LSTM–GoogleNet | Image captioning for visually impaired people | 0.7286 | 0.7368
The Proposed Method | CNN, LSTM–VGG16 | Image captioning for visually impaired people | 0.7824 | 0.7303
The Proposed Method | CNN, LSTM–ResNet | Image captioning for visually impaired people | 0.8126 | 0.8026
The Proposed Method | CNN, LSTM–SqueezeNet | Image captioning for visually impaired people | 0.6012 | 0.6175
The suggested CNN-LSTM algorithm’s efficiency is assessed using evaluation metrics such as precision, recall, and accuracy. Precision is the proportion of predicted captions of a given class that actually belong to that class. Recall is the proportion of actual instances of a class that the model correctly predicts. Model accuracy relates to the ability of the trained model to make correct predictions overall. These metrics are defined as follows:

$\mathrm{Precision} = \dfrac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}}$  (6)

$\mathrm{Recall} = \dfrac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}$  (7)

$\mathrm{Accuracy} = \dfrac{\text{No. of Correct Predictions}}{\text{Total No. of Predictions}}$  (8)
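A small sketch, assuming scikit-learn and illustrative binary correctness labels, shows how these three metrics are computed in practice.

```python
# Compute Equations (6)-(8) from binary caption-correctness labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 0, 1, 0, 1, 0, 0]   # 1: the tagged (reference) caption is positive
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # 1: the model predicts a positive caption

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("Accuracy: ", accuracy_score(y_true, y_pred))   # correct / total
```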
To evaluate the correctness of the predicted image captions, they are compared against the captions of the tagged images. In this empirical analysis, a true positive means that the
predicted model accurately predicts image captions that are tagged positive captions in
the class labels. True negatives indicate that the predicted model properly predicts image
captions with negative tags in the class labels. A false positive is an outcome in which the
model forecasts the positive class inaccurately. Similarly, false negative is an output in which
the model predicts negative captions incorrectly. The ResNet network performs better
than the existing architectures. The next step in the methodology employs text-to-speech to vocally narrate the data for people with visual impairment. Figure 10 demonstrates the evaluation metric values for several architectures such as AlexNet, GoogleNet, SqueezeNet, VGG16, and ResNet.
Figure 10. Prediction Accuracy.
According to the experimental results, ResNet has the highest precision value of 85.45%, while AlexNet has the lowest precision value of 68.26%. In terms of recall, SqueezeNet has the lowest value of 69.26%, while ResNet has the highest value of 83.12%. The ResNet architecture has the highest overall accuracy of 86.74% when compared to the other network architectures. The empirical results indicate that the CNN-LSTM model with the ResNet network architecture outperforms the others in image caption prediction.
In this experimental analysis, the model is trained with 200 epochs. The training and
validation loss for the AlexNet architecture is depicted in Figure 11. The loss starts at 0.7 and gradually decreases to an average value of 0.2.
Figure 11. Training and Validation loss (Alexnet).
Figure 12 demonstrates the training and validation loss for GoogleNet architecture.
According to the GoogleNet training and validation graph, the loss value starts at 0.6 and ends at 0.25, indicating that this network performs worse at image caption prediction.
Figure 12. Training and Validation loss (GoogleNet).
Figure 13 demonstrates the training and validation loss of the ResNet architecture.
According to this figure, the loss rate started at 0.8 and gradually decreased below 0.1 after
the 80th epoch. When compared to the other network architectures, ResNet produces better accuracy and a lower loss rate.
Figure 13. Training and Validation loss (ResNet).
The loss rate for the VGG16 network architecture started at 0.75 and ended at 0.2, as shown in Figure 14. The loss rates for training and validation appear to be nearly the same. It is also important to note that the training loss decreases in a smooth manner, with no ups and downs.
Figure 14. Training and Validation loss (VGG16).
Finally, Figure 15 shows the SqueezeNet training and validation graph. In this figure, both curves deviate towards high values, indicating that this network performs worse at image caption prediction. The graphs confirm a decrease in loss across the different tested models as the number of epochs increases during validation and training, which reflects the learning capability of the models.
Figure 15. Training and Validation loss (SqueezeNet).
The experimental results show that the SqueezeNet network has a very high loss rate, implying that it is less efficient than the other networks. In a comparison of the ResNet and VGG16 networks, ResNet produces superior accuracy and lower loss rates. Similarly, when compared to the other networks, GoogleNet has an average loss rate. The ResNet design is thus considered to better fit the training data and better forecast incoming data.
6. Conclusions
This work involves generating a deep learning-based intelligent system to assist individuals with visual impairments. The system takes text and images from coursebooks as input: the CNN extracts the relevant features, and the LSTM describes the visual input. Users receive the data in the form of voice messages generated by the text-to-speech module. The LSTM model is trained in combination with the AlexNet, GoogleNet, ResNet, SqueezeNet, and VGG16 networks. According to the experimental results, the ResNet-based CNN-LSTM training model provides the most suitable image descriptions and predictions.
The intelligent system means individuals with visual impairments can easily comprehend text and images, although limitations exist, such as the reliance on the Flickr8k dataset to provide image data. Subsequent studies will utilize transfer learning to refine descriptions of images based on real-time photos and their descriptive content.
Author Contributions:
Conceptualization, J.G., A.T.A. and N.A.K.; Data curation, S.A., B.Q. and
A.E.H.; Formal analysis, J.G., A.T.A., S.A., N.A.K., B.Q. and A.E.H.; Investigation, A.T.A., S.A., N.A.K.,
B.Q. and A.E.H.; Methodology, J.G., A.T.A., S.A., N.A.K., B.Q. and A.E.H.; Resources, J.G., S.A., B.Q.
and A.E.H.; Software, J.G. and N.A.K.; Supervision, A.T.A.; Validation, A.T.A., S.A., B.Q. and A.E.H.;
Visualization, J.G., N.A.K., B.Q. and A.E.H.; Writing—original draft, J.G., A.T.A., S.A. and N.A.K.;
Writing—review & editing, J.G., A.T.A., S.A., N.A.K., B.Q. and A.E.H. All authors have read and
agreed to the published version of the manuscript.
Funding: This research is funded by Prince Sultan University, Riyadh, Saudi Arabia.
Acknowledgments:
The authors would like to thank Prince Sultan University, Riyadh, Saudi Arabia
for supporting this work. Special acknowledgement to Automated Systems & Soft Computing Lab
(ASSCL), Prince Sultan University, Riyadh, Saudi Arabia.
Conflicts of Interest: The authors declare no conflict of interest.
References
1.
Triantafyllidis, A.K.; Tsanas, A. Applications of machine learning in real-life digital health interventions: Review of the literature.
J. Med. Internet Res. 2019,21, e12286. [CrossRef]
2.
Manjari, K.; Verma, M.; Singal, G. A survey on assistive technology for visually impaired. Internet Things
2020
,11, 100188.
[CrossRef]
3. Park, C.; Took, C.C.; Seong, J.K. Machine learning in biomedical engineering. Biomed. Eng. Lett. 2018,8, 1–3. [CrossRef]
4.
Pellegrini, E.; Ballerini, L.; Hernandez, M.d.C.V.; Chappell, F.M.; González-Castro, V.; Anblagan, D.; Danso, S.; Muñoz-Maniega,
S.; Job, D.; Pernet, C.; et al. Machine learning of neuroimaging for assisted diagnosis of cognitive impairment and dementia:
A systematic review. Alzheimer’s Dementia Diagn. Assess. Dis. Monit. 2018,10, 519–535. [CrossRef] [PubMed]
5.
Swenor, B.K.; Ramulu, P.Y.; Willis, J.R.; Friedman, D.; Lin, F.R. The prevalence of concurrent hearing and vision impairment in the
United States. JAMA Intern. Med. 2013,173, 312–313.
6.
Bhowmick, A.; Hazarika, S.M. An insight into assistive technology for the visually impaired and blind people: State-of-the-art
and future trends. J. Multimodal User Interfaces 2017,11, 149–172. [CrossRef]
7.
Lee, B.H.; Lee, Y.J. Evaluation of medication use and pharmacy services for visually impaired persons: Perspectives from both
visually impaired and community pharmacists. Disabil. Health J. 2019,12, 79–86. [CrossRef]
8.
Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.Y. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intell.
Transp. Syst. 2014,16, 865–873. [CrossRef]
9.
Welsh, R. Foundations of Orientation and Mobility; Technical Report; American Printing House for the Blind: Louisville, KY, USA,
1981.
10.
Martínez, B.D.C.; Villegas, O.O.V.; Sánchez, V.G.C.; Jesús Ochoa Domínguez, H.d.; Maynez, L.O. Visual perception substitution
by the auditory sense. In Proceedings of the International Conference on Computational Science and Its Applications, Santander,
Spain, 20–23 June 2011; pp. 522–533.
11.
Dakopoulos, D.; Bourbakis, N.G. Wearable obstacle avoidance electronic travel aids for blind: A survey. IEEE Trans. Syst. Man,
Cybern. Part C (Appl. Rev.) 2009,40, 25–35. [CrossRef]
Electronics 2022,11, 3335 20 of 22
12. Li, Z.; Song, F.; Clark, B.C.; Grooms, D.R.; Liu, C. A wearable device for indoor imminent danger detection and avoidance with region-based ground segmentation. IEEE Access 2020, 8, 184808–184821. [CrossRef]
13. Elkholy, H.A.; Azar, A.T.; Magd, A.; Marzouk, H.; Ammar, H.H. Classifying Upper Limb Activities Using Deep Neural Networks. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision, Cairo, Egypt, 8–10 April 2020; pp. 268–282.
14. Mohamed, N.A.; Azar, A.T.; Abbas, N.E.; Ezzeldin, M.A.; Ammar, H.H. Experimental Kinematic Modeling of 6-DOF Serial Manipulator Using Hybrid Deep Learning. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision, Cairo, Egypt, 8–10 April 2020; pp. 283–295.
15. Ibrahim, H.A.; Azar, A.T.; Ibrahim, Z.F.; Ammar, H.H.; Hassanien, A.; Gaber, T.; Oliva, D.; Tolba, F. A Hybrid Deep Learning Based Autonomous Vehicle Navigation and Obstacles Avoidance. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision, Cairo, Egypt, 8–10 April 2020; pp. 296–307.
16. Sayed, A.S.; Azar, A.T.; Ibrahim, Z.F.; Ibrahim, H.A.; Mohamed, N.A.; Ammar, H.H. Deep Learning Based Kinematic Modeling of 3-RRR Parallel Manipulator. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision, Cairo, Egypt, 8–10 April 2020; pp. 308–321.
17. Azar, A.T.; Koubaa, A.; Ali Mohamed, N.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A.; et al. Drone Deep Reinforcement Learning: A Review. Electronics 2021, 10, 999. [CrossRef]
18. Koubâa, A.; Ammar, A.; Alahdab, M.; Kanhouch, A.; Azar, A.T. DeepBrain: Experimental Evaluation of Cloud-Based Computation Offloading and Edge Computing in the Internet-of-Drones for Deep Learning Applications. Sensors 2020, 20, 5240. [CrossRef] [PubMed]
19. Guo, T.; Dong, J.; Li, H.; Gao, Y. Simple convolutional neural network on image classification. In Proceedings of the 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), Beijing, China, 10–12 March 2017; pp. 721–724.
20. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [CrossRef]
21. Shelton, A.; Ogunfunmi, T. Developing a deep learning-enabled guide for the visually impaired. In Proceedings of the 2020 IEEE Global Humanitarian Technology Conference (GHTC), Seattle, WA, USA, 29 October–1 November 2020; pp. 1–8.
22. Tapu, R.; Mocanu, B.; Zaharia, T. Wearable assistive devices for visually impaired: A state of the art survey. Pattern Recognit. Lett. 2020, 137, 37–52. [CrossRef]
23. Swathi, K.; Vamsi, B.; Rao, N.T. A Deep Learning-Based Object Detection System for Blind People. In Smart Technologies in Data Science and Communication; Springer: Berlin/Heidelberg, Germany, 2021; pp. 223–231.
24. Rao, A.S.; Gubbi, J.; Palaniswami, M.; Wong, E. A vision-based system to detect potholes and uneven surfaces for assisting blind people. In Proceedings of the 2016 IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 23–27 May 2016; pp. 1–6.
25. Hoang, V.N.; Nguyen, T.H.; Le, T.L.; Tran, T.T.H.; Vuong, T.P.; Vuillerme, N. Obstacle detection and warning for visually impaired people based on electrode matrix and mobile Kinect. In Proceedings of the 2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science (NICS), Ho Chi Minh City, Vietnam, 16–18 September 2015; pp. 54–59.
26. Calabrese, B.; Velázquez, R.; Del-Valle-Soto, C.; de Fazio, R.; Giannoccaro, N.I.; Visconti, P. Solar-Powered Deep Learning-Based Recognition System of Daily Used Objects and Human Faces for Assistance of the Visually Impaired. Energies 2020, 13, 6104. [CrossRef]
27. Lin, Y.; Wang, K.; Yi, W.; Lian, S. Deep learning based wearable assistive system for visually impaired people. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27–29 October 2019.
28. Afif, M.; Ayachi, R.; Said, Y.; Pissaloux, E.; Atri, M. An evaluation of RetinaNet on indoor object detection for blind and visually impaired persons assistance navigation. Neural Process. Lett. 2020, 51, 1–15. [CrossRef]
29. Tasnim, R.; Pritha, S.T.; Das, A.; Dey, A. Bangladeshi Banknote Recognition in Real-Time Using Convolutional Neural Network for Visually Impaired People. In Proceedings of the 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 5–7 January 2021; pp. 388–393.
30. Mukhiddinov, M.; Cho, J. Smart glass system using deep learning for the blind and visually impaired. Electronics 2021, 10, 2756. [CrossRef]
31. Mishra, P.; Kumar, S.; Chaube, M.K.; Shrawankar, U. ChartVi: Charts summarizer for visually impaired. J. Comput. Lang. 2022, 69, 101107. [CrossRef]
32. Zamir, M.F.; Khan, K.B.; Khan, S.A.; Rehman, E. Smart Reader for Visually Impaired People Based on Optical Character Recognition. In Proceedings of the International Conference on Intelligent Technologies and Applications, Bahawalpur, Pakistan, 6–8 November 2019; pp. 79–89.
33. Cheng, R.; Hu, W.; Chen, H.; Fang, Y.; Wang, K.; Xu, Z.; Bai, J. Hierarchical visual localization for visually impaired people using multimodal images. Expert Syst. Appl. 2021, 165, 113743. [CrossRef]
34. Sahithi, P.; Bhavana, V.; ShushmaSri, K.; Jhansi, K.; Madhuri, C. Speech Mentor for Visually Impaired People. In Smart Intelligent Computing and Applications; Springer: Berlin/Heidelberg, Germany, 2022; Volume 1, pp. 441–450.
35. Chauhan, S.; Patkar, D.; Dabholkar, A.; Nirgun, K. Ikshana: Intelligent Assisting System for Visually Challenged People. In Proceedings of the 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 7–9 October 2021; pp. 1154–1160.
36. Flores, I.; Lacdang, G.C.; Undangan, C.; Adtoon, J.; Linsangan, N.B. Smart Electronic Assistive Device for Visually Impaired Individual through Image Processing. In Proceedings of the 2021 IEEE 13th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), Manila, Philippines, 28–30 November 2021; pp. 1–6.
37. Aravindan, C.; Arthi, R.; Kishankumar, R.; Gokul, V.; Giridaran, S. A Smart Assistive System for Visually Impaired to Inform Acquaintance Using Image Processing (ML) Supported by IoT. In Hybrid Artificial Intelligence and IoT in Healthcare; Springer: Berlin/Heidelberg, Germany, 2021; pp. 149–164.
38. Rahman, M.A.; Sadi, M.S. IoT enabled automated object recognition for the visually impaired. Comput. Methods Programs Biomed. Update 2021, 1, 100015. [CrossRef]
39. Chun, P.J.; Yamane, T.; Maemura, Y. A deep learning-based image captioning method to automatically generate comprehensive explanations of bridge damage. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1387–1401. [CrossRef]
40. Yang, Q.; Ni, Z.; Ren, P. Meta captioning: A meta learning based remote sensing image captioning framework. ISPRS J. Photogramm. Remote Sens. 2022, 186, 190–200. [CrossRef]
41. Wang, Y.; Xiao, B.; Bouferguene, A.; Al-Hussein, M.; Li, H. Vision-based method for semantic information extraction in construction by integrating deep learning object detection and image captioning. Adv. Eng. Inform. 2022, 53, 101699. [CrossRef]
42. Afyouni, I.; Azhar, I.; Elnagar, A. AraCap: A hybrid deep learning architecture for Arabic Image Captioning. Procedia Comput. Sci. 2021, 189, 382–389. [CrossRef]
43. Shen, X.; Liu, B.; Zhou, Y.; Zhao, J.; Liu, M. Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning. Knowl.-Based Syst. 2020, 203, 105920. [CrossRef]
44. Denić, D.; Aleksov, P.; Vcković, I. Object Recognition with Machine Learning for People with Visual Impairment. In Proceedings of the 2021 15th International Conference on Advanced Technologies, Systems and Services in Telecommunications (TELSIKS), Nis, Serbia, 20–22 October 2021; pp. 389–392.
45. Felix, S.M.; Kumar, S.; Veeramuthu, A. A smart personal AI assistant for visually impaired people. In Proceedings of the 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–12 May 2018; pp. 1245–1250.
46. Durgadevi, S.; Thirupurasundari, K.; Komathi, C.; Balaji, S.M. Smart Machine Learning System for Blind Assistance. In Proceedings of the 2020 International Conference on Power, Energy, Control and Transmission Systems (ICPECTS), Chennai, India, 10–11 December 2020; pp. 1–4.
47. Koubaa, A.; Azar, A.T. Deep Learning for Unmanned Systems; Springer: Cham, Switzerland, 2021.
48. Popescu, M.C.; Balas, V.E.; Perescu-Popescu, L.; Mastorakis, N. Multilayer perceptron and neural networks. WSEAS Trans. Circuits Syst. 2009, 8, 579–588.
49. Agarap, A.F. Deep learning using rectified linear units (ReLU). arXiv 2018, arXiv:1803.08375.
50. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [CrossRef]
51. Alom, M.Z.; Taha, T.M.; Yakopcic, C.; Westberg, S.; Sidike, P.; Nasrin, M.S.; Hasan, M.; Van Essen, B.C.; Awwal, A.A.; Asari, V.K. A state-of-the-art survey on deep learning theory and architectures. Electronics 2019, 8, 292. [CrossRef]
52. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [CrossRef]
53. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
54. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
56. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
57. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
58. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45.
59. Yan, S. Understanding LSTM Networks. Volume 11. 2015. Available online: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (accessed on 11 October 2022).
60. Hodosh, M.; Young, P.; Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. 2013, 47, 853–899. [CrossRef]
61. Johnson, J.; Karpathy, A.; Fei-Fei, L. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4565–4574.
62. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 7–12 July 2002; pp. 311–318.