Construction of a Voice Driven Life Assistant System for Visually Impaired People
Runze Chen, Zhanhong Tian, Hailun Liu, Fang Zhao, Shuai Zhang, Haobo Liu
School of Software Engineering
Beijing University of Posts and Telecommunications
Beijing, China
e-mail: chenrz925@bupt.edu.cn
Abstract—The rapid development of artificial intelligence and mobile computing brings a more convenient life to blind and visually impaired people. This paper presents a prototype of a voice assistant specially designed for them. The system provides fundamental services including fall detection, safety care, mobile phone accessibility, daily information broadcasting, and view description to make daily life easier. Natural language understanding, voice recognition, and speech synthesis have been integrated so that users can operate the majority of a mobile phone's functions. In addition, a built-in fall detection algorithm based on a tri-axis accelerometer and an object detection algorithm based on Mask R-CNN enrich the users' perception of their surroundings while keeping them safe.
Keywords-voice assistant; navigation; visually impaired;
natural language understanding; accessibility; mobile computing
I. INTRODUCTION
With the rapid development of artificial intelligence and mobile computing, modern technology has brought more convenience to blind and visually impaired people. It is estimated that about 253 million people live with vision impairment [1]. Without assistance from family or friends, visually impaired people face an inconvenient daily life. One effective method of guidance is the guide dog; however, guide dogs require considerable money and time to train and care for. Blind people also need a way to learn about life outside their homes, and they want access to the Internet and mobile services like everyone else. For them, however, many obstacles remain that call for improvement from society and technology, including a lack of information resources for the blind, inadequate infrastructure, and a lack of technical investment [2]. According to our research, the rapid development of artificial intelligence and mobile computing technology can be an ideal way to help blind and visually impaired people perceive their surroundings.
Many solutions have been proposed to assist blind and visually impaired people. Some design hardware systems that provide fundamental functions. For example, Mohamed Manoufali et al. [3] designed a cane for the blind with ultrasonic obstacle detection, and Siti Fauziah Toha et al. [4] proposed a similar assistance idea. However, those solutions cannot detect the objects around the user. Another type of solution provides guidance and services to blind users. Jiayin S. et al. [5] built a guide device with some fundamental functions, but the user must press buttons to use those services, and the user experience and functionality are limited.
In this paper, we present "beEYE", an extensible system running on Android phones that provides functions for visually impaired people, including messaging, describing the street view, and navigating to a given place. We have integrated these discrete functions into a unified system with a voice interface for the blind. With our system, we hope to greatly improve their daily life.
II. RELATED WORK
Many solutions exist to simplify the way people interact with computers. Natural language understanding and voice recognition have become stable enough that blind people also have the chance to use mobile phones easily. To understand the user's intention and extract key information from spoken sentences, natural language understanding technology must classify the intent and extract the entities from the raw sentence. Microsoft has released LUIS [6], a natural language understanding service that extracts the intent and entities from a sentence. An open-source project named Rasa NLU [7] can also classify intents and extract entities; however, Rasa NLU needs to be modified to understand Chinese text.
A fall detection system designed by Wang Rong et al. [8] provides a solution for detecting elderly people's movement. As a risk-warning service, fall detection can also protect blind and visually impaired people and alert their family when an abnormal event happens. Kaiming He et al. created Mask R-CNN [10], an object detection method that extends Fast R-CNN [9]. Object detection can help blind people know what appears in their walking direction, so we integrated the Mask R-CNN algorithm into the system to describe the view in front of the blind user.
III. SYSTEM ARCHITECTURE
We have unified the different approaches into a single platform with a voice interface, so users only need to speak to the platform to get a service. The system consists of five modules. To run this system, an Android device with GPS, a three-axis gyroscope, and an Internet connection is needed. Optionally, a Bluetooth headphone provides a better experience. The overall structure of the modules in this study is displayed in Fig. 1.

International Conference on Artificial Intelligence and Big Data
978-1-5386-6987-7/18/$31.00 ©2018 IEEE
Figure 1. Module diagram of the project.
When a user speaks a command or query to the device, the application connects to the back-end server, which provides a RESTful [11] service that decides which module to access based on the natural language command. The redirected command authenticates the user and either provides the requested information or invokes local modules to compute or react to the user's physical action. Some safety-related modules run independently to sense abnormal movement of blind users. When the user makes an abnormal action, the platform can immediately detect it and notify the user's guardian with the location and latest walking trail. The vision description module collects data from sensors and requires the computation of algorithms deployed on the server. Fig. 2 shows the deployment of the project.
Figure 2. Deployment diagram of the project.
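The back-end routing step can be pictured as a table from intent names to module handlers. The sketch below is purely illustrative: the intent and module names, and the `dispatch` helper itself, are assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the back-end dispatch step: the intent name produced
# by the NLU service is mapped to the module that handles it. The module and
# intent names here are illustrative, not the authors' actual API.

def dispatch(intent, entities):
    handlers = {
        "Navigation": lambda e: "routing to " + e.get("Location", "unknown"),
        "Weather": lambda e: "fetching weather for " + e.get("Location", "here"),
        "Dial": lambda e: "dialing " + e.get("Contact", "unknown"),
    }
    handler = handlers.get(intent)
    if handler is None:
        # unknown intents fall back to a spoken error prompt
        return "unrecognized command"
    return handler(entities)

print(dispatch("Navigation", {"Location": "BUPT", "Transport": "Walking"}))
# → routing to BUPT
```

Keeping the mapping in one place lets new service modules be registered without touching the voice front end.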
A. Dialogue System
Chinese speech recognition technology has been well developed by iFLYTEK [12], so we integrated it into the platform to obtain raw natural Chinese sentences from the users' voice. However, the dictionary of the voice recognizer needs to be extended, because customized words such as names in the contact list must also be recognized to support the accessibility services; for example, a wrongly transcribed homophonic name cannot be found in the contact database. To solve this problem, we locally extended the dictionary of the voice recognition module with limited user data to achieve better accuracy.
Another important problem is understanding the intent of visually impaired users and extracting the key information from the expression parsed from the human voice. To classify the intent of a user's command, this study trained a model based on LUIS [6] that understands natural language with good accuracy and stability. For example, if the user says "I want to walk to Beijing University of Posts and Telecommunications", the model parses this command into the intent "Navigation", the entity "Beijing University of Posts and Telecommunications" of type "Location", and the transport entity "Walking". With the intent and entities, the navigation service can search for and plan a route in order to navigate the user to the desired place.
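For the example sentence above, extracting the intent and its entities amounts to reading two fields from the NLU service's JSON reply. The field names in this sketch follow the LUIS v2 response shape; treat them as an assumption if the endpoint version differs.

```python
import json

# Sketch of extracting the intent and entities from a LUIS-style JSON reply.
# The "topScoringIntent" / "entities" field names follow the LUIS v2 response
# shape and are an assumption here, not confirmed by the paper.
reply = json.loads("""
{
  "query": "I want to walk to Beijing University of Posts and Telecommunications",
  "topScoringIntent": {"intent": "Navigation", "score": 0.97},
  "entities": [
    {"entity": "Beijing University of Posts and Telecommunications", "type": "Location"},
    {"entity": "Walking", "type": "Transport"}
  ]
}
""")

intent = reply["topScoringIntent"]["intent"]
# index entities by their type so the navigation service can look up slots
slots = {e["type"]: e["entity"] for e in reply["entities"]}
print(intent, "->", slots["Location"], "by", slots["Transport"])
```

The navigation module then only needs the `Location` and `Transport` slots to plan a route.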
Responding to the user with a human voice is the output of the system. When the user frequently gives the system commands to execute different tasks, a new response may be produced before the previous voice output finishes, and without any safeguard the latest response would interrupt the one currently being spoken. To deal with this problem, an asynchronous queue is placed as a buffer for the different responses, which also reduces the coupling between the voice interface and the diverse services. To make sure a request is completely received, the recorder can be woken up by the user's touch anywhere on the screen, and the device vibrates to notify the user when recording starts. Fig. 3 shows the process of dealing with the human voice.
Figure 3. Flowchart diagram of voice interface module.
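The buffering idea can be sketched with a thread-safe queue and a single speaker worker, so replies are voiced strictly in arrival order. This is a minimal sketch, assuming a `speak()` stand-in for the real text-to-speech call; the example replies are invented.

```python
import queue
import threading

# Minimal sketch of the response buffer: services enqueue text replies and a
# single worker speaks them in arrival order, so a new reply never interrupts
# the one currently being synthesized. speak() is a stand-in for the TTS call.

spoken = []

def speak(text):
    spoken.append(text)  # real system: hand the text to the speech synthesizer

responses = queue.Queue()

def speaker_worker():
    while True:
        text = responses.get()
        if text is None:  # sentinel: shut the worker down
            break
        speak(text)

worker = threading.Thread(target=speaker_worker)
worker.start()

# Two services answer almost simultaneously; both replies are spoken in order.
responses.put("Route planned. Walk straight for 200 meters.")
responses.put("New message from Zhang Wei.")
responses.put(None)
worker.join()
print(spoken)
```

Because services only ever `put` onto the queue, they stay decoupled from the voice interface, as the text above describes.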
B. Navigation & Security
The project provides walking routes and public transport transfers for blind users. When the user commands the platform to navigate to a certain place, the platform schedules the best route and starts the backend navigation service. With the navigation service, users know when to change direction and how far they need to walk. The AMap [13] service is integrated to provide the key data for navigation, so the platform can access more complete route information for users. When users need to know where they are, they can ask the platform, and it provides an accurate answer. To avoid unexpected events, the navigation service also judges whether the user is walking in the right direction using the device's built-in compass, as a supplement to the blind cane.
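The direction check reduces to comparing the compass heading against the bearing of the current route segment, with care for wrap-around at 360 degrees. In this sketch, the 30-degree tolerance is an illustrative choice, not a value from the paper.

```python
# Sketch of the compass-based direction check: compare the device heading
# with the bearing of the current route segment and warn on drift.
# The 30-degree tolerance is an assumed, illustrative value.

def off_course(heading_deg, bearing_deg, tolerance=30.0):
    # smallest angular difference, handling wrap-around at 360 degrees
    diff = abs(heading_deg - bearing_deg) % 360.0
    diff = min(diff, 360.0 - diff)
    return diff > tolerance

print(off_course(10.0, 350.0))   # 20 degrees apart: still on course
print(off_course(90.0, 350.0))   # 100 degrees apart: warn the user
```

The wrap-around step matters: headings of 10° and 350° are only 20° apart, not 340°.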
Related to location and movement information, security is also important for users. This study implements a backend algorithm to judge the state of the user [8]; when an unexpected fall happens, the system immediately notifies the guardian by messaging the location and dialing. With real-time access to the tri-axis accelerometer, the algorithm can estimate the user's attitude to ensure their safety. The attitude angles can be determined from the built-in tri-axis accelerometer using the following equations.
pitch = arctan( A_x / sqrt(A_y^2 + A_z^2) )    (1)
roll = arctan( A_y / sqrt(A_x^2 + A_z^2) )     (2)
yaw = arctan( sqrt(A_x^2 + A_y^2) / A_z )      (3)
In the above equations, pitch is the rotation angle around the Y axis, i.e., the angle of the body's backward tilt; roll is the rotation angle around the X axis, i.e., the body's side-to-side tilt; and yaw is the rotation angle around the Z axis, i.e., the rotation of the body from left to right. The study collects pitch, roll, and yaw data, which are used to train and analyze the normal range of the user's movement. To suppress noise in the data, a Kalman filter is applied to improve the system's reliability.
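The attitude-angle computation above can be sketched directly from accelerometer readings. This is a minimal sketch assuming raw readings (ax, ay, az) in any consistent unit; `atan2` with a non-negative second argument equals the arctan of the ratio in the equations.

```python
import math

# Worked example of the attitude-angle equations above, from raw
# accelerometer readings (ax, ay, az) in any consistent unit.

def attitude(ax, ay, az):
    pitch = math.atan2(ax, math.hypot(ay, az))  # rotation around the Y axis
    roll = math.atan2(ay, math.hypot(ax, az))   # rotation around the X axis
    yaw = math.atan2(math.hypot(ax, ay), az)    # rotation around the Z axis
    return math.degrees(pitch), math.degrees(roll), math.degrees(yaw)

# Phone lying flat: gravity entirely on the Z axis, so all angles are zero.
print(attitude(0.0, 0.0, 9.8))  # → (0.0, 0.0, 0.0)
```

In practice these raw angles would then be smoothed by the Kalman filter before thresholding, as the text describes.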
To better ensure the user's safety, we collect extra data from different sensors, including angular velocity (from the gyroscope) and acceleration (from the accelerometer), and improve this method. With these data, we calculate the acceleration vector sum (AVS) and the angular velocity vector sum (AVVS) to detect the user's movement, using equations (4) and (5).
AVS = sqrt(a_x^2 + a_y^2 + a_z^2)     (4)
AVVS = sqrt(ω_x^2 + ω_y^2 + ω_z^2)    (5)
With several sets of data collected consecutively, the system raises an alarm that a fall has happened when half of the data sets exceed the threshold. If almost all of the data sets exceed the threshold, the system gives a notification asking whether the user wants to stop the alarm within five seconds.
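The voting rule above can be sketched as follows. The threshold of 25 follows the value the paper reports using in its tests; the sample values and the exact voting helper are illustrative assumptions.

```python
import math

# Sketch of the fall-detection voting rule: over several consecutive samples,
# raise an alarm when at least half of the (AVS, AVVS) pairs exceed the
# threshold. The threshold of 25 follows the paper's tests; the sample data
# below is invented for illustration.

THRESHOLD = 25.0

def avs(ax, ay, az):
    # acceleration vector sum, equation (4); AVVS is computed the same way
    # from gyroscope readings
    return math.sqrt(ax * ax + ay * ay + az * az)

def is_fall(samples, threshold=THRESHOLD):
    # samples: list of (AVS, AVVS) pairs collected consecutively
    exceeded = sum(1 for a, w in samples if a > threshold or w > threshold)
    return exceeded >= len(samples) / 2

normal = [(9.8, 1.2), (10.1, 0.9), (9.7, 1.5), (10.0, 1.1)]
fall = [(40.2, 31.5), (38.7, 29.0), (12.3, 8.4), (35.1, 27.8)]
print(is_fall(normal), is_fall(fall))  # → False True
```

Voting over a window rather than reacting to a single spike is what keeps ordinary movements, such as walking upstairs, from triggering false alarms.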
C. Accessibility
The accessibility module provides a unified approach to accessing the functions of the mobile phone. The platform currently provides the features displayed in Fig. 4.
Figure 4. Contents of accessibility service.
To provide the above features, we developed a solution based on the accessibility service of the Android operating system, and the platform works together with applications from other companies, such as WeChat.
D. Information Service
Figure 5. Contents of information service.
The platform also provides a diversity of information services to facilitate access to daily-life information. We integrated some information APIs and developed some built-in modules to provide different kinds of information for users to query. Currently, the project provides the information services displayed in Fig. 5.
E. Vision Description
Vision description helps users know more about the circumstances ahead. To some degree, it can also help blind users avoid dangers such as collisions with pedestrians or other facilities on the sidewalk. We implemented and integrated Mask R-CNN [10] to recognize the objects captured by the device's camera. Trained on the Cityscapes [14] dataset, the vision description module can describe certain objects captured by the camera well, such as bicycles, cars, or pedestrians.
The vision description service starts when the user asks what is in front of them. The system automatically captures an image of the foreground and sends it to our back-end server to compute the result. When the result of the Mask R-CNN [10] model arrives, the module lists all the detected objects and speaks them out.
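The final step, turning the model's detection list into a spoken sentence, can be sketched as a simple aggregation. The labels and the sentence template here are illustrative assumptions; the real module works on the model's per-instance class output.

```python
from collections import Counter

# Sketch of turning a Mask R-CNN detection list into a sentence for the
# speech synthesizer. Labels and phrasing are illustrative, not the paper's.

def describe(detections):
    counts = Counter(detections)  # e.g. {"car": 2, "pedestrian": 1}
    if not counts:
        return "Nothing recognized ahead."
    parts = [f"{n} {label}" + ("s" if n > 1 else "") for label, n in counts.items()]
    return "Ahead of you: " + ", ".join(parts) + "."

print(describe(["car", "car", "pedestrian", "bicycle"]))
# → Ahead of you: 2 cars, 1 pedestrian, 1 bicycle.
```

Grouping duplicate labels keeps the spoken description short, which matters when the output is voice rather than a screen.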
IV. TEST RESULTS
During the development of the prototype, we tested all the functions and features of the platform.
Figure 6. Walking route of volunteer.
We invited volunteers to experience and test the system. The volunteers were asked to wear a blindfold to experience the real conditions of blind or visually impaired users, and we tested on the campus of Beijing University of Posts and Telecommunications. A volunteer walked along the route in Fig. 6. When walking along the route calculated by the navigation service, the volunteer could successfully start and stop the navigation, and the navigation service could immediately reroute when the walking direction suddenly changed.
During the navigation process, the volunteer commanded the system to describe the foreground. After a few seconds of delay, the server responded with the objects detected in the images captured by the camera. Because of camera instability, the volunteer needed to hold the phone steady to capture clearer images. We tested this service in street and classroom conditions, and the algorithm produced good results. Fig. 7 displays the effect of the object detection algorithm.
We deployed the object detection service on a server with an NVIDIA Tesla P100 GPU and evaluated its performance when accessed via our web API. We collected about 100 images randomly captured by volunteers to evaluate the performance of object detection. We performed the test in different Internet environments, each with a stable and available connection. The results are recorded in Table I.
TABLE I. COMPUTING TIME OF OBJECT DETECTION IN DIFFERENT INTERNET ENVIRONMENTS

Internet Connection | Average Computing Time (ms) | Average Server Computing Time (ms)
Indoor Wi-Fi        | 1582                        | 1276
4G (CMCC)           | 1707                        | 1310
4G (ChinaNet)       | 1910                        | 1315
Table I displays the time needed to obtain a result from the captured images. The average computing time includes both the communication over the Internet and the run of the backend object detection algorithm, while the average server computing time includes only the backend algorithm. We can infer from the results that the Internet connection is not the key factor making the object detection process take more than one second; a carefully simplified model that does not lose much detection accuracy could help improve efficiency.
The recognition of commands was also tested by the team. To evaluate the performance of the dialogue system, we recorded the accuracy of conversations. Table II shows that the whole dialogue system can recognize and extract the intent and entities from the users' voice with good accuracy.
Figure 7. Effect of object detection algorithm.
However, the accuracy of entity recognition could still be improved if a larger dataset of natural sentences were available, and we are working to enlarge the sentence dataset to improve accuracy.
TABLE II. ACCURACY OF TESTING THE VOICE INTERFACE

Function   | Intent Accuracy | Entity Accuracy
Weather    | 98.9%           | 91.9%
Navigation | 95.3%           | 88.6%
Joke       | 99.5%           | No entity
Message    | 97.6%           | 92.3%
Dial       | 96.7%           | 90.1%
Location   | 100.0%          | No entity
Figure 8. AVS and AVVS when walking normally.
Figure 9. AVS and AVVS when walking upstairs.
Figure 10. AVS and AVVS when putting the mobile phone into a trouser pocket.
Figure 11. AVS and AVVS when falling backward.
The fall detection service was also tested under different conditions, some of which are displayed in Fig. 8 through Fig. 11. As the figures show, AVS and AVVS can capture the movement of the device and help the system determine the user's state when the sensor values exceed the threshold. It can be noticed that the change stays below the threshold when the user is walking normally, while the change is dramatic when a fall happens. When testing the algorithm, we set the threshold of both AVS and AVVS to 25, and the test produced reasonable results.
V. CONCLUSION
This paper implemented a prototype of a voice assistant that provides daily services to blind and visually impaired people. The overall architecture of the platform is displayed in Fig. 12. As a supplement to the blind cane, the system can help blind people use mobile information services easily and partially keep users safe. However, the dream of helping blind people walk without a cane or guide dog still requires further work. With this prototype voice assistant, we hope blind and visually impaired people can enjoy a more convenient daily life.
Figure 12. Conceptual figure of the project.
The system provides a complete outdoor navigation service, but it still lacks indoor navigation. Some solutions exist, but they require extra hardware components to feed key data into the navigation algorithms. The security functions need more development, and the system needs more hardware devices that integrate sensors with higher accuracy.
The system also has some limitations in the functions and features of the accessibility and information services; more information sources can be collected and integrated into the system in future work. The performance of the prototype can also be improved, as can the algorithms in some modules, for example natural language understanding, object detection, and obstacle detection. More algorithms can be tested and researched to achieve better performance and accuracy in the project's functions.
ACKNOWLEDGMENT
This work was supported by the Research Innovation Fund for College Students of Beijing University of Posts and Telecommunications.
REFERENCES
[1] World Health Organization. (2017). "Vision impairment and
blindness." Retrieved March 16, 2018, from
http://www.who.int/mediacentre/factsheets/fs282/en/.
[2] Ying, Z. and G. Chaobing (2014). "Research on Obstacles of
Information Acquisition for the Blind in China." Journal of Modern
Information(07): 10-13.
[3] Manoufali, M., et al. (2011). Smart guide for blind people. 2011 International Conference and Workshop on the Current Trends in Information Technology, CTIT'11, October 26-27, 2011, Dubai, United Arab Emirates, IEEE Computer Society.
[4] Mutiara, G. A., et al. (2016). Smart guide extension for blind cane.
4th International Conference on Information and Communication
Technology, ICoICT 2016, May 25, 2016 - May 27, 2016, Bandung,
Indonesia, Institute of Electrical and Electronics Engineers Inc.
[5] Song, J., et al. (2016). "The design of a guide device with multi-function to aid travel for blind person." International Journal of Smart Home 10(4): 77-86.
[6] Microsoft (2018). "LUIS." Retrieved March 16, 2018, from
https://www.luis.ai/.
[7] Rasa Technologies GmbH. (2018). "Rasa NLU." Retrieved March 16,
2018, from https://nlu.rasa.ai/.
[8] Rong, W., et al. (2012). "Design and implementation of fall detection
system using tri-axis accelerometer." Journal of Computer
Applications(05): 1450-1452+1456.
[9] Girshick, R. (2015). Fast R-CNN. 15th IEEE International
Conference on Computer Vision, ICCV 2015, December 11, 2015 -
December 18, 2015, Santiago, Chile, Institute of Electrical and
Electronics Engineers Inc.
[10] He, K., et al. (2017). Mask R-CNN. 16th IEEE International
Conference on Computer Vision, ICCV 2017, October 22, 2017 -
October 29, 2017, Venice, Italy, Institute of Electrical and Electronics
Engineers Inc.
[11] Fielding, R. T. (2000). Architectural styles and the design of network-
based software architectures, University of California, Irvine: xvii,
162 leaves.
[12] iFLYTEK. "iFLYTEK." Retrieved March 15, 2018, from
http://www.xfyun.cn/.
[13] AMap. "AMap." Retrieved March 18, 2018, from
http://lbs.amap.com/.
[14] Cordts, M., et al. (2016). The Cityscapes Dataset for Semantic Urban
Scene Understanding. 29th IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2016, June 26, 2016 - July 1, 2016,
Las Vegas, NV, United states, IEEE Computer Society.