Computer vision-based methodology to improve interaction
for people with motor and speech impairment
RÚBIA E. O. SCHULTZ ASCARI,Department of Informatics - UFPR and UTFPR, Brazil
ROBERTO PEREIRA, Department of Informatics - UFPR, Brazil
LUCIANO SILVA, Department of Informatics - UFPR, Brazil
Augmentative and Alternative Communication (AAC) aims to complement or replace spoken language to
compensate for expression difficulties faced by people with speech impairments. Computing systems have been
developed to support AAC; however, partially due to technical problems, poor interfaces, and limited interaction
functions, AAC systems are not widespread, adopted, and used, therefore reaching a limited audience. This
paper proposes a methodology to support AAC for people with motor impairments, using computer vision
and machine learning techniques to allow for personalized gestural interaction. The methodology was applied
in a pilot system used by both volunteers without disabilities, and by volunteers with motor and speech
impairments, to create datasets with personalized gestures. The created datasets and a public dataset were
used to evaluate the technologies employed for gesture recognition, namely the Support Vector Machine
(SVM) and Convolutional Neural Network (using Transfer Learning), and for motion representation, namely
the conventional Motion History Image and Optical Flow-Motion History Image (OF-MHI). Results obtained
from the estimation of prediction error using K-fold cross-validation suggest SVM associated with OF-MHI
presents slightly better results for gesture recognition. Results indicate the technical feasibility of the proposed
methodology, which uses a low-cost approach, and reveals the challenges and specific needs observed during
the experiment with the target audience.
CCS Concepts: • Human-centered computing → Human computer interaction (HCI); • Social and
professional topics → People with disabilities; • Computing methodologies → Motion capture.
Additional Key Words and Phrases: Assistive Technology, Augmentative and Alternative Communication,
Computer Vision, Gesture Recognition, Accessibility
ACM Reference Format:
Rúbia E. O. Schultz Ascari, Roberto Pereira, and Luciano Silva. 2020. Computer vision-based methodology
to improve interaction for people with motor and speech impairment. J. ACM XX, X, Article XXX (X 2020),
34 pages. https://doi.org/10.1145/1122445.1122456
1 INTRODUCTION
People with disabilities very often must deal with different barriers to participate in social and
economic life, requiring support from family members and caregivers, or the aid of technical
solutions that facilitate interaction with the environment and other people. Although computers
Corresponding author: rubia@utfpr.edu.br
Authors’ addresses: Rúbia E. O. Schultz Ascari, Department of Informatics - UFPR and UTFPR, Curitiba, Brazil, rubia@utfpr.
edu.br; Roberto Pereira, Department of Informatics - UFPR, Curitiba, Brazil, rpereira@inf.ufpr.br; Luciano Silva, Department
of Informatics - UFPR, Curitiba, Brazil, luciano@ufpr.br.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2020 Association for Computing Machinery.
0004-5411/2020/X-ARTXXX $15.00
https://doi.org/10.1145/1122445.1122456
are present in many aspects of daily life, computer systems still impose barriers on people
with disabilities, failing to offer the support they could be designed to offer.
Designing systems and interfaces for Assistive Technology (AT) is particularly challenging, as
there is no "average user" on which to base solutions that would work for users with specific and
diverse needs [91]. Selecting an AT requires maximizing the flow of information and minimizing
the effort (physical and mental) needed to use it [2]. When developing AT devices, end users
and their view of what an ideal solution means must be considered, finding the balance between
functionality, performance, ease of use, and aesthetics.
Speech impairment is a condition in which the ability to produce the speech sounds necessary
to communicate with others is compromised. People with speech impairments very often have
an associated motor disability, affecting their ability to interact with other people and with the
environment. Therefore, alternatives are needed for people who are totally or partially unable
to move or control their limbs, and who cannot rely solely on verbal communication. Offering
specific resources, services, strategies, and practices, AT aims to help people with disabilities to be
socially included, and become or remain independent.
Augmentative and Alternative Communication (AAC) refers to forms of communication that
complement or replace speech to compensate for speech difficulties by using intervention strategies
and non-verbal communication systems [46]. AAC mediated by computational applications enables
users with motor and speech impairments to access a computer, using it not only to express
themselves, but as an educational or training tool as well. Such possibilities may support people’s
communicative abilities, contributing to their training and learning [7].
There are many input devices and different technologies that open up new paradigms in Human-
Computer Interaction (HCI). Systems based on multimodal interaction provide extended possibilities
for users, and are able to adjust to the users' specific needs, making systems more flexible [65].
Ordinary computers and mobile phones, for instance, are equipped with cameras that favor Computer
Vision (CV) interfaces, providing another possibility for interaction via these devices. Easy access to
camera devices has allowed for the generation of new AT resources that do not involve expensive
or customized devices to accommodate special access needs, because they are software-based,
enabling cost reduction and improved availability as envisaged by Betke et al. [15]. Non-invasive
techniques based on CV allow for non-conventional interaction methods to be considered, including
the recognition of movement of the hands [94, 95], head [92] and other body parts to perform
actions on computer systems [78].
Gesture recognition allows people to interact with machines without the need for other
devices (e.g. a mouse or keyboard). This interaction mode is capable of dealing with the particularities
and limitations of each user's performance of a movement, thus being considered "natural", and
even intuitive, as people learn gestures from childhood [25]. Although solutions in gestural
interaction have become popular, their application for AAC still requires experiments to evaluate
these technologies, their possibilities, and limitations. Examples of applications are needed to
demonstrate the technical viability of gesture recognition, and to allow for the development of
low-cost solutions that attend to a diversity of people and their physical, cognitive, social, and
economic conditions.
Users with motor impairment may present very particular postures and involuntary movements,
as well as short-term fatigue and varying motor capacities that are challenging for AAC systems.
In order to generate a computational solution that takes into account the characteristics and the
diversity of its target audience, this paper presents a methodology to support the development
of AAC for people who have motor and speech disabilities, making use of CV techniques and
machine learning to enable personalized gestural interaction. The methodology can support people,
such as users and caregivers, to generate and update a customized set of gestures that will be used
to train a gestural-based interactive AAC system. Therefore, people may create a personalized
gesture language for communication purposes, taking into account their abilities and limitations
when performing movements, thus allowing for other people to recognize these movements. This
paper presents the methodology and the results obtained from the use of machine learning and
motion representation techniques to recognize gestures using a system developed based on the
methodology. Gestures were obtained from a public dataset, and from two controlled experiments
with teachers and students with different skills.
The constructive nature of this research requires a progressive and incremental strategy where
progress is evaluated and informs further steps of research and development. In [8], we introduced
the first version of the methodology and results from an exploratory evaluation with HCI experts
where a prototype was used for gesture recognition. Now, in this paper, we present the improved
methodology and results of a system developed based on it (an evolution of the prototype proposed
in [8]), where machine learning and motion representation techniques were applied to recognize
gestures. Gestures were obtained from a public dataset and from two controlled experiments with
teachers and students with different skills. Because of the intrinsic complexity of evaluating research
of a constructive nature, different evaluation strategies are needed to evaluate the methodology
and its application via computing technology. Therefore, the main contribution of this paper is
a multiple evaluation conducted in different steps, or stages, each with a different focus, and the
results obtained from each step, highlighting challenges and necessary improvements for both
the system and the methodology. The results presented in this paper have informed the research
progress and the system evolution: functional requirements and improvements are presented in [9],
characterizing the system as an Assistive Technology for AAC, and new features for the system,
including a game-based approach, are presented and evaluated in [10].
2 RELATED WORK
The eld of AAC includes research and the development of designs in education, systems, and prac-
tices, enabling the cooperation between several areas, and, therefore, requires a multidisciplinary
approach [
102
]. In the literature, dierent initiatives can be found to make AAC systems eectively
usable, and dierent proposals involving CV techniques can be found.
Krueger et al. [
63
] were one of the rst to exemplify the use of video to recognize hand movements
as an interaction mode. Jacob [
58
] investigated appearance-based interaction techniques into
real-time applications for people with disabilities. Jacob discussed some factors and technical
considerations for using eye movements as data input in interfaces for computing systems. Since
then several studies have been developed to improve the support for people with physical disabilities
by using AT, and interaction modalities based on the recognition of body movements. Table 1
presents research on gesture recognition that applies dierent devices and mostly CV techniques
for AT.
Studies presented in Table 1 are organized according to the type of device and the part of the human
body used for tracking, indicating the target audience of each study. As for the parts of the human
body used in each study, the type "Various" was included in the "Body Parts vs. Devices used"
column to refer to the use of two or more parts of the body as a visual signal, or to the tracking of
body movements in general. In Kane et al. [60], however, CV is applied for the identification of
context and location for AAC purposes, not for the identification or recognition of any part of the
users' body.
As presented in Table 1, simple devices such as webcams were used by several different initiatives
(43% of the presented papers — 34 papers), mainly because they represent a viable alternative
for detecting and tracking movements, especially due to their low cost. The same advantage
holds for mobile devices, which are increasingly accessible.
Table 1. Examples of related studies that focus on gesture recognition, organized by the body part tracked (superscript letters indicate the target audience; see the legend below).

Mouth/Tongue (5 studies): [70]n; [97]d; [82]a; [83]a; [5]i
Head/Face (14 studies): [86]f; [16]r; [116]a; [24]a; [81]n; [118]f; [121]a; [127]a; [6]a; [126]k; [53]f; [43]a; [12]a; [113]k
Nose/Nostrils (2 studies): [76]n; [35]a
Hands (11 studies): [122]g; [48]f; [13]a; [120]a; [105]m; [40]h; [30]a; [31]a; [101]f; [88]a; [89]a
Eyes (30 studies): [85]a; [132]j; [23]a; [26]j; [62]a; [84]a; [100]a; [68]a; [78]c; [131]a; [72]a; [93]a; [44]p; [58]a; [55]i; [75]i; [11]i; [17]i; [34]p; [20]q; [74]e; [52]l; [50]i; [3]i; [41]k; [87]i; [129]i; [19]a; [45]j; [114]i
Feet (1 study): [128]a
Various (15 studies): [79]a; [96]o; [15]a; [14]a; [104]r; [49]n; [119]a; [36]k; [64]a; [27]a; [108]n; [107]n; [51]a; [18]a; [109]a
None (1 study): [60]b

Totals by device type: mobile camera 5; depth camera 7; single camera or webcam 34; thermal camera 3; eye tracker 9; others/not informed 12; more than one device 9; total 79 studies.

Target audience: people with: a - physical disability; b - aphasia; c - spinal muscular atrophy; d - deficiency of dexterity; e - neuro-motor deficiency; f - motor and speech disabilities; g - hearing and speech difficulties; h - speech difficulties; i - severe motor difficulties; j - amyotrophic lateral sclerosis; k - upper limb motor disabilities; l - high spinal cord injury; m - acquired brain injury; n - cerebral palsy; o - cortical diseases (Alzheimer); p - Total Block Syndrome; q - advanced stage of multiple sclerosis; r - tetraplegia.
More recent work has also used depth data (7 papers since 2014) from devices such as a Kinect, a BumbleBee depth sensor, a monocular
infrared depth camera, and an image range sensor. Gaze detection/tracking research has also
received increased attention (11% of papers presented in Table 1 — 9 papers), possibly because eye
movements may be the only remaining movement some people with severe disabilities can control
voluntarily.
Tracking specic regions of the human body as a form of interaction with a specic target
audience tends to generate solutions more adapted to the diversity of users and their interests.
However, the accessibility of these solutions may fail if they do not allow users to eectively
adapt or customize solutions before they begin using them. Even when adaptation mechanisms are
provided, solutions must guarantee that users will be able to nd and use them.
Research aiming at developing successful sign language recognition, generation, and translation
systems is related to our study despite having deaf and hard of hearing people as its main target audience.
People with motor impairments, in general, present difficulties in performing movements,
and the correct execution of a sizeable predefined gesture set, such as that used in sign language, is a
challenge. Even so, the contributions obtained from studies aimed at sign language recognition
using Computer Vision can undoubtedly contribute positively to the development of technologies
aimed at people with motor and speech difficulties. Non-intrusive vision-based sign language
recognition is the currently dominant approach [22]; however, for Martins et al. [80], although
existing devices can easily capture gestures and expressions, they face some problems: the vast
number of gestures and the similarity between them; different sign languages due to culture, individual
social life, and the way gestures were taught; and the fact that the sequence of gestures to express a sentence
can be difficult to calculate, because it is difficult to detect where a gesture starts and ends and
where the next one begins. Thus, there are still some critical challenges to be solved. The studies of
Martins et al. [80], Ghanem et al. [47], and Bragg et al. [22] present key backgrounds, a review of the
state-of-the-art, a set of pressing challenges, and a call to action for research in this area.
Although they employ sensors rather than cameras to track users (a magnetic tracker and electromyography),
studies from Roy et al. [107, 108] show that people with a speech disability may be
able to perform gestures that are replicable, and that can be mapped into words or concepts. Due
to physical disabilities, these gestures may not follow any standardized form, or be recognized as
iconic representations. According to the authors, people with cerebral palsy are able to perform
actions or gestures with their arms that are recognizable by observers in their family; the authors also found
that, by encouraging free expression, the number and variety of different gestures which
can be performed by individuals is much greater than previously thought. When a person has
severe limitations regarding self-expression, the knowledge that observers (e.g. caregivers, family
members) have about an individual's ability to perform movements is fundamental to create a
personalized gesture language. The research we present in this paper aims to support work that
uses such information, allowing people with disabilities to interact with a computing system by
using an interaction language composed of their own gestures.
3 A METHODOLOGY TO SUPPORT AAC VIA PERSONALIZED GESTURAL
INTERACTION
Considering the literature presented previously, we have identified that initiatives usually focus on
specific situations and characteristics, offering little or no flexibility for people and their different
contexts of use, therefore requiring that people adapt themselves to the system instead of adapting
the system to people's different needs. Designing AAC systems for people who have communication
difficulties and motor impairment is a challenge as, regardless of the origin of motor problems,
people usually have very particular postures and involuntary movements that can sometimes be
uncontrollable, making it impossible to use several interfaces.
In this research, we investigate a methodology that can be used to design systems based on
gesture recognition in which gestures and their meanings are created and configured by users
and their caregivers. The methodology was conceived to enable the recognition of patterns in
gestural interaction, captured using a camera, and to be substantiated by different cameras or
complementary input devices (e.g., brain-computer interfaces or mobile device sensors) that enable
multimodal interaction with the AAC system.
Based on the Problem-Solving perspective [99] to describe research in HCI, the problem in this
research can be understood as having a mixed nature, with characteristics of both an empirical and a
constructive nature. Its empirical nature is due to the fact that experimentation is required to
test and describe the effects of a methodology designed to support AAC based on personalized
gestural interaction. It is constructive in the sense that it aggregates information to understand
the use of an AAC computer system by people with motor and communication impairments.
Figure 1 presents a scheme for the proposed methodology, using the Business Process Modeling
Notation, a graphical notation for business process modeling [37], showing the responsibilities for
the execution of activities, as well as how work flows across functions, or how functions transfer
the responsibility for an activity.
Fig. 1. Scheme for the proposed methodology, divided into four lanes that define responsibilities for the
execution of activities.
The scheme can be understood from a macro level, but depends on a series of manipulations
and specic processes performed at a micro level, whose specic steps have already been tried
and evaluated in a previous experiment with HCI experts [
8
]. The results obtained in this previous
experiment reinforced our perception that a methodology aimed at personalized gestural interaction
is feasible, and can be applied in an assistive context, increasing the possibilities for people with
motor disabilities to communicate by means of AAC systems.
A pilot system, named Personal Gesture Communication Assistant (PGCA), was developed to
analyze and evaluate the feasibility of the methodology, its potential, and its limits. The system's
first version was designed following the proposed methodology that allows for the creation of
personalized gestural interaction for AAC, and was evaluated by HCI experts in order to identify
usability and accessibility issues, as well as to validate its requirements before testing the
system with the target audience. Previous evaluations and experiments are needed before involving
the user, so as not to take a solution with problems and errors that could have been anticipated in a
lab test¹ to the field. As a result of the evaluation, technical limitations and interaction problems
were identified, as well as suggestions for interface improvements. The evaluation activity with
experts indicated the need for improvements before an experiment in a real context was possible
and productive, helping to anticipate problems that would make it difficult for the system to be
flexible and adaptable to each user's characteristics, or even that would prevent its use by people
with different limitations.
Figure 2 presents three interfaces of the system: A. Caregiver area, where datasets are created; B.
User area, where gesture recognition is used for interacting with the system and with communication
boards; and C. Communication boards area, where new boards can be generated by selecting images.
The user's interaction with the system, and the different techniques used for gesture
recognition and motion representation, are briefly described below.
3.1 Interaction with the PGCA system
When the system is started, a camera is automatically enabled to allow the user to perform the
calibration process, which consists of positioning the user in the capture center area of the camera,
thus facilitating the standardization of postures for recording and recognizing movements. After
completing this process, the caregiver can begin customizing the system using the specific guide
(caregiver area), which assists the user to record examples of gestures for training, as well as for
later use as a way of interacting with the system. Ideally, the methodology will allow any user to
independently customize the system through gestures. For this exploratory version of the system,
caregiver assistance is needed for the initial configuration of the system, and for recording and
labeling the gestures with the user. While the support of a caregiver will always be necessary when
people have more extreme disabilities, particularly cognitive disabilities that affect intentional
interaction, the system must be designed to allow its configuration and use by users with at least
one identifiable movement. All actions performed in the system through gestural interaction are
stored in a text file (log) in order to record relevant events during the interaction with the interface.
In the caregiver area (Figure 2 - A), images are recorded representing users' movement history
for each gesture, then becoming classes that can be labeled with words that will be used for
communication or interaction purposes (e.g.: Hi, Goodbye, Bathroom, Food, Water, Confirm, Undo).
After recording several representative pictures of the same gesture (the more samples in the dataset,
the better the results), the user can train the system via a button available at the bottom of the caregiver
area. The training process expands the dataset by Data Augmentation, as well as extracts and
classifies the features. Then the user can evaluate the system accuracy.
Fig. 2. Three main interfaces of the pilot system developed: A) Caregiver area, where datasets are created; B)
User area, where gesture recognition is used for interaction through the use of communication boards; C)
Communication boards area, where new boards can be generated by selecting images.

¹ Intensive testing and evaluations before evaluating a product with the target audience are mainly a matter of ethics, as
people with disabilities cannot be treated as subjects of research.

The user area (Figure 2 - B) can be used after system training. This area represents the main
interface with AAC functions, in which gesture interaction will allow for the execution of the
following functions: 1. Detection and representation of categories of gestures by writing the
corresponding label in a text box, emitting a sound (by means of synthesized voice) referring to
the word, and presenting a related image. 2. System configuration for personalizing navigation
functions, and relating gestures to functionalities. 3. Selecting whether the mode of navigation in
the communication boards is automatic (based on time) or manual (via gestures), varying between
communication boards of alphabetical characters or of several figures. 4. Selecting images when navigating
in a communication board, where it is possible to select a letter or a figure to show its related
description, and play the corresponding sound. 5. Simulating keyboard use via communication
boards for typing characters or commands, allowing its use as input for the PGCA interface, as well
as for other applications such as Internet browsers or text editors (see the sketch after the next paragraph).
The techniques used for data augmentation, gesture recognition, and motion representation are briefly described
below.
In the communication boards area (Figure 2 - C), the caregiver can create different communication
boards composed of images. Next, the user can select images from these boards for communication
purposes.
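Function 5 above maps recognized gestures onto simulated keystrokes so that other applications (text editors, Internet browsers) can be driven from the communication boards. The paper does not name the library used for key injection; the minimal sketch below assumes the pynput package, and the gesture-to-key mapping shown is hypothetical.

```python
from pynput.keyboard import Controller, Key

keyboard = Controller()

# Hypothetical mapping from recognized gesture labels to keys, in the spirit of
# function 5 above; pynput is our choice, not necessarily the one used by PGCA.
GESTURE_TO_KEY = {"No": Key.enter, "Intelligent": Key.enter}

def act_on_gesture(label):
    """Press the key the caregiver associated with the label, or type the label."""
    if label in GESTURE_TO_KEY:
        key = GESTURE_TO_KEY[label]
        keyboard.press(key)
        keyboard.release(key)
    else:
        keyboard.type(label + " ")  # write the word into the focused application
```

In the experiments reported in Section 5.3, for example, the "No" and "Intelligent" classes were associated with the ENTER key in this spirit.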
3.2 Data Augmentation
In many classication problems, the available data is insucient to train accurate and robust
classiers, being necessary to apply the data augmentation process [
39
]. Data augmentation
transforms the base data and increases the number of training data. The transformed images are
usually produced from the original images with very little computation and are generated during
training [
4
]. The augmented data will represent a more comprehensive set of possible data points,
thus minimizing the distance between the training and validation set, as well as any future testing
sets [
112
]. In the study of Alani et al. [
4
], data augmentation is initially applied, which shifts
images both horizontally and vertically to the extent of 20% of the original dimensions randomly,
to increase the size of the dataset numerically and to add the robustness needed for a deep learning
approach.
For this research, each sample training data is augmented, creating another eight variations, by
rotating and scaling the original image, aiming to simulate small changes in camera positioning
or distances that may occur when users interact with the system. Figure 3 shows an example of a
hand gesture represented by a dynamic gesture image, where the original (central) image is used to
generate eight additional images for enlarging the dataset used by the system. Variations employed
were -10 and 10 for the angle, and 0.9 and 1.1 for scale.
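As a minimal sketch of this step, assuming OpenCV (the function name and interpolation defaults are ours), the eight variations can be produced by combining the rotation angles {-10, 0, 10} with the scale factors {0.9, 1.0, 1.1} and discarding the identity combination:

```python
import cv2

def augment(image):
    """Create eight variations of one motion-representation image by rotating
    and scaling, simulating small changes in camera position or distance."""
    h, w = image.shape[:2]
    center = (w / 2, h / 2)
    variations = []
    for angle in (-10, 0, 10):          # degrees
        for scale in (0.9, 1.0, 1.1):   # zoom factors
            if angle == 0 and scale == 1.0:
                continue                # skip the unchanged original
            M = cv2.getRotationMatrix2D(center, angle, scale)
            variations.append(cv2.warpAffine(image, M, (w, h)))
    return variations                   # eight augmented images
```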
Fig. 3. Example of the application of Data Augmentation. From a representative dynamic gesture image,
eight other variations are generated by rotating and scaling operations.
3.3 Motion representation by Conventional MHI
For the pilot system, movements performed in front of a single camera are captured and represented
as a Motion History Image (MHI). Proposed originally by Davis and Bobick [21] [29], MHI is a global
spatio-temporal representation of motion that has been applied to motion analysis and tracking for
different purposes, such as gesture recognition [120] or human action recognition [56] [131]. MHI
converts the 3D space-time information from a video sequence into a single 2D intensity image. The
movements include information such as time and space, and the MHI image reflects not only the
position of a spatial action but also the movement order. In the MHI, a high fixed intensity is assigned
to a foreground pixel (moving object), while the intensity value is decreased by a small constant for
a background pixel [117]. The intensity value in the MHI records the history of temporal changes
in each pixel location. The MHI $H_\tau(x, y, t)$ is computed from an update function $\psi(x, y, t)$ described
by Davis and Bobick [29] in Equation (1):

$$H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } \psi(x, y, t) = 1 \ (\text{foreground}) \\ \max\big(0,\, H_\tau(x, y, t-1) - \delta\big), & \text{otherwise} \end{cases} \quad (1)$$

where (x, y, t) are the spatial coordinates (x, y) of an image pixel at a given time t (in terms of image
frame number). The duration τ determines the temporal extent of the movement in terms of frames,
and δ is the decay parameter. We used τ = 3 and δ = time stamp. $\psi(x, y, t)$ is defined in Equation
(2) as described by [38]:

$$\psi(x, y, t) = \begin{cases} 1, & \text{if } D(x, y, t) \geq \xi \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where ξ is a difference threshold and $D(x, y, t)$ is an image comprised of the pixel intensity difference between frames separated by
a temporal distance Δ, defined in Equation (3):

$$D(x, y, t) = |I(x, y, t) - I(x, y, t \pm \Delta)| \quad (3)$$

where I(x, y, t) is the intensity value of pixel (x, y) at the t-th frame of the image sequence. We
used the "updateMotionHistory" function available in the OpenCV (Open Source Computer Vision)
library to calculate the MHI.
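A minimal sketch of one MHI update step is shown below, assuming the opencv-contrib build in which the motion-template functions live in the cv2.motempl module; the silhouette threshold value is our assumption, since the paper does not report it.

```python
import cv2
import numpy as np

MHI_DURATION = 3      # tau: temporal extent in frames, as reported above
DIFF_THRESHOLD = 32   # silhouette threshold; assumed, not reported in the paper

def update_mhi(prev_gray, curr_gray, mhi, timestamp):
    """One MHI update: frame differencing -> binary silhouette -> motion history.
    `mhi` is a float32 image of the same size as the frames; `timestamp` counts frames."""
    diff = cv2.absdiff(curr_gray, prev_gray)                      # D(x, y, t)
    _, silhouette = cv2.threshold(diff, DIFF_THRESHOLD, 1,
                                  cv2.THRESH_BINARY)              # psi(x, y, t)
    cv2.motempl.updateMotionHistory(silhouette.astype(np.uint8), mhi,
                                    timestamp, MHI_DURATION)      # H_tau(x, y, t)
    return mhi
```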
3.4 Motion representation by Optical Flow based MHI
In a second phase, optical flow was evaluated to aggregate velocity information to
images that represent the motion history performed in front of the camera. Optical flow [54] [73]
denotes a shift in the same scene in an image sequence at a different time instant, estimating
pixel-level movement between two images. In conventional MHI, every detected foreground pixel is
assigned a fixed intensity value τ and speed differences are not considered: a slow movement
and a fast movement of different body parts will have the same motion strength [117]. Different
proposals have been presented to add velocity information to MHI by means of optical flow, such
as Tsai et al. [117], Fan and Tjahjadi [38], and Khalifa et al. [61]. A proposal similar to [38] was used for
this research, but in our system, a labeling algorithm (the "connectedComponentsWithStats" function
available in the OpenCV library) is applied to the silhouette obtained in different frames in order
to identify connected regions. Afterward, the Lucas-Kanade optical flow [73] is calculated for the
centroid pixels of each of these regions, and this displacement value is replicated to the other
pixels of the same region, in order to speed up the tracking process via optical flow. We used the
"calcOpticalFlowPyrLK" function available in the OpenCV library to calculate the Lucas-Kanade optical
flow, with parameters winSize (31 × 31), minEigThreshold (0.001), and default values for the other
parameters. When the "Zoom in" option is checked in the system settings, the facial landmarks are
entered as points to be tracked, highlighting and improving the perception of facial movements.
The resulting intensity value indicates a history of motion speeds at that location. The optical
flow-based MHI (OF-MHI) is defined in Equation (4), described by Fan and Tjahjadi [38]:

$$E(x, y, t) = s(x, y, t) + E(x, y, t-1) \cdot \alpha \quad (4)$$

where s(x, y, t) represents the optical flow length of pixel (x, y) at time frame t, and α is the
update rate used (0 < α < 1). The motion strength is given by the flow length s(x, y, t) for each
individual pixel (x, y). The intensity of a pixel is increased if it is a foreground point. A small value
of α creates an accelerated decrease in motion strength, and only the recent short-term movements
are retained in the temporal template. Larger values of α, in turn, will originate a long-term history
in the temporal template. We used 0.85 for the α parameter.
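A sketch of one OF-MHI update, under the description above: connected regions of the binary silhouette are labeled with connectedComponentsWithStats, Lucas-Kanade flow is estimated at each region centroid, the flow length is replicated to the region's pixels as s(x, y, t), and Equation (4) accumulates the result (how the silhouette is obtained is assumed to follow Section 3.3).

```python
import cv2
import numpy as np

ALPHA = 0.85                                               # update rate, as in the text
LK_PARAMS = dict(winSize=(31, 31), minEigThreshold=0.001)  # remaining parameters left at defaults

def update_of_mhi(prev_gray, curr_gray, silhouette, of_mhi):
    """One OF-MHI step. `silhouette` is a binary uint8 image; `of_mhi` is float32."""
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(silhouette)
    s = np.zeros(curr_gray.shape, np.float32)
    if num > 1:                                            # label 0 is the background
        pts = centroids[1:].astype(np.float32).reshape(-1, 1, 2)
        nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts,
                                                     None, **LK_PARAMS)
        lengths = np.linalg.norm(nxt - pts, axis=2).ravel()
        for label in range(1, num):
            if status[label - 1][0] == 1:
                s[labels == label] = lengths[label - 1]    # replicate flow length to the region
    return s + of_mhi * ALPHA                              # Equation (4)
```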
Figure 4 shows an example of motion representation using conventional MHI and OF-MHI. Both
images are displayed in grayscale, where the darker color represents the most recent movement. In
the OF-MHI, darker tones also represent regions where higher speed movements occurred.
Fig. 4. Examples of motion representation by means of MHI (A) and OF-MHI (B).
3.5 Gesture Recognition by HOG and SVM
A multiclass Support Vector Machine (SVM) discriminant classifier using a Radial Basis Function
(RBF) kernel, and the Histogram of Oriented Gradients (HOG) feature descriptor, were used to
recognize gestures created by the user and the caregiver. Discriminant classifiers are trained to
separate classes. SVM [111] is a linear binary classifier that assigns a given sample to one of only
two possible classes [90]: it separates data into two classes by learning a hyperplane in a higher
dimensional space. To address problems with multiple classes, SVM can be adapted by applying
other methods. For our research, the "one-versus-all" method was used: given an n-class problem,
a binary model is constructed for each class; the training set consists of examples of this class
as positive labels, and examples of the other classes as negative labels. HOG is a feature descriptor
used for object detection, obtained from an image gradient histogram. HOG represents structural
edge (gradient) features, and the quantization of spatial position and orientation can suppress the
influence of translation and rotation to some extent. For the system, the HOG was extracted from
the whole MHI or OF-MHI resized to 64 × 48, as exemplified in Figure 5, generating a feature vector
of 1260 positions. To extract features using the HOG descriptor we used the "HOGDescriptor"
function available in the OpenCV library, with parameters winSize (64, 48), blockSize (16, 16),
blockStride (8, 8), cellSize (8, 8), nbins (9), derivAperture (1), winSigma (4), histogramNormType (0),
and default values for the other parameters. Subsequently, the "hog.compute" function was used,
with parameters winStride (32 × 24) and padding (0, 0).
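The pipeline can be sketched as follows; the HOG parameters are those listed above (yielding 7 × 5 blocks × 4 cells × 9 bins = 1260 features per image), while the use of scikit-learn's one-versus-all wrapper around an RBF SVC is our assumption, since the paper does not state which SVM implementation was used.

```python
import cv2
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# HOG configured as in the text: a single 64x48 window, 16x16 blocks, 8x8 stride and
# cells, 9 bins -> (7 x 5 blocks) x (4 cells) x (9 bins) = 1260 features per image.
hog = cv2.HOGDescriptor((64, 48), (16, 16), (8, 8), (8, 8), 9)

def hog_features(gray_mhi):
    """gray_mhi: an 8-bit grayscale MHI or OF-MHI loaded from the stored samples."""
    resized = cv2.resize(gray_mhi, (64, 48))
    return hog.compute(resized, winStride=(32, 24), padding=(0, 0)).ravel()

def train_classifier(images, labels):
    """One-versus-all SVM with an RBF kernel over HOG features (sketch)."""
    X = np.array([hog_features(img) for img in images])
    clf = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
    clf.fit(X, labels)
    return clf
```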
Fig. 5. Schema used by SVM-based classifier.
3.6 Gesture Recognition by CNN
The Convolutional Neural Network (CNN) features can give a good description of image content;
thus the potential of deep learning by means of CNN was also explored. Therefore, feature extraction
based on the HOG descriptor with the SVM classifier was replaced by an automatic process
performed by CNN, which works directly with images, performing the feature extraction internally.
Figure 6 shows the scheme employed by the CNN-based classifier.
Fig. 6. Schema used by CNN-based classifier.
To train a CNN from scratch, a large and varied dataset is necessary and, since in our context
the number of samples is limited because each user creates his/her own dataset,
Transfer Learning could be a viable alternative to improve the learning mechanism in one domain
by transferring information from a related domain [125]. Therefore, we used the TensorFlow [1]
Inception V3 model [115] (a codename for a deep CNN architecture, originally trained on the ImageNet
dataset [110]) as the basis to retrain a custom set of images. Afterward, we applied Transfer
Learning by retraining Inception's final layer, for 4000 steps, with new categories in order to build a
custom image classifier according to the labels and gestures captured by the system's users. The whole
MHI, or OF-MHI, resized to 64 × 48, is used as input for the network after the data augmentation
process has been executed.
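A rough tf.keras equivalent of this retraining step is sketched below; the paper used the TensorFlow Inception V3 retraining scripts, so the frozen-base-plus-new-softmax-head construction, the optimizer, and the input resizing to 299 × 299 are our assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 8  # number of gesture classes in one user's dataset (example value)

def build_transfer_model():
    """Freeze an ImageNet-pretrained Inception V3 base and train only a new
    softmax head on the MHI/OF-MHI images (resized and replicated to 3 channels)."""
    base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                             input_shape=(299, 299, 3), pooling="avg")
    base.trainable = False                       # keep the pretrained features
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```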
3.7 Apparatus
The materials used for the experiments performed in all steps reported in this paper were a laptop
with 8GB of RAM and the webcam coupled to the laptop. Data collection took place in different
environments, with varying lighting conditions. When performing gestures, users were positioned
in front of the table on which the laptop was located. People with disabilities who used a wheelchair
stayed a little further away compared to users who could sit in a simple chair closer to the table.
The volunteers who participated in the first evaluation step performed data collection in their work
environment. Students who participated in the third step of system evaluation performed data
collection in their school environment.
Regarding the algorithms employed, we chose to use well-established algorithms, although
some approaches are not novel. We employed methods available in the OpenCV library (Open
Source Computer Vision Library) because it offers high computational efficiency, and simple use of
Computer Vision and Machine Learning infrastructures [28]. Using MHI images has advantages
related to simplicity, robustness in motion representation, and low computation. Lucas-Kanade
optical flow has very fast calculations and accurate time derivatives [98], and it proved to be
effective in aggregating velocity information from the movement into the MHI. The Support Vector
Machine (SVM) has the advantage of offering a strong generalization ability, a simple architecture, as
well as the ability to classify from few samples [71] [77]. CNN has shown growing popularity, partially
due to its success in image classification and other Computer Vision fields [33] [124]. Transfer
learning can take advantage of the experience acquired by a deep CNN pre-trained with a
large dataset for a specific task, and improve the performance of gesture recognition (our task)
with a small dataset composed of a restricted number of samples (our context).
4 EVALUATING THE PGCA SYSTEM
As the target audience of this research involves groups of users considered vulnerable, the project
was submitted for evaluation by the Research Ethics Committee of the University linked to this
study. The approval from the Committee provided the legal conditions for testing the system
with users from three co-participating institutions. After HCI experts evaluated the system under
laboratory conditions, tests with users without disabilities (Step 1), tests using a public dataset (Step
2), and tests with users with speech and motor impairments (Step 3) were planned and conducted.
Figure 7 represents the evaluation of the system, showing the datasets used and the objective of
each experiment, identified by steps.
For the experiment performed with users without disabilities, teachers from one of the participating
institutions were invited. Five people accepted the invitation and participated in the first
step of data collection. The objective of this step was to evaluate whether the system would be able
to recognize personalized gestures when trained with few samples. Subsequently, aiming to identify
the best strategies for a new version of the PGCA system, two classifiers for gesture recognition
and two motion representations were evaluated according to their performance. This evaluation
step was conducted using the Keck Gesture Dataset, a public dataset available in [59].
Fig. 7. Experiments carried out to evaluate the PGCA system.
Finally, during a third step, after analyzing the results of the previous tests, improvements were
implemented in the PGCA system and an experiment with the target audience was conducted. The
main objective of the experiment was to verify whether the PGCA system would support the target
audience in generating a customized dataset, and also to analyze whether our system is robust and
eective for communication purposes, taking into consideration the possible limitations of use in
daily life by people with dierent disabilities.
5 RESULTS
This section describes the experiments conducted to evaluate the proposed system, and the main
results obtained. Images pertaining to datasets created by volunteer teachers and students are
presented only in the form of MHI or OF-MHI to maintain the participants' anonymity.
5.1 Step 1 - Evaluation of machine learning techniques using datasets created by
volunteers without disabilities
For the rst experiment, ve volunteers (people with no motor and speech impairment) were invited
to create datasets composed of six to eight dierent gestures, with labels dened by the volunteers
themselves. The researcher who conducted the experiment played the role of a caregiver. The
experiment was designed to evaluate the accuracy of the classier regarding gesture recognition.
For this evaluation step, only the Caregiver Area and the User Area were available for use in
the PGCA system. Only the traditional MHI was implemented, and the captured samples were
registered only as images. There was no possibility to store videos of performed movements. The
Caregiver Area did not provide any form to validating the captured samples, allowing the creation
of data sets of personalized gestures, training, and system evaluation. The User Area was used only
to test the recognition of the gestures for which the system was trained.
Volunteers P1 and P2 created datasets with eight distinct classes, registering twenty samples by
class. Volunteer P3 created a dataset with eight distinct classes, registering fteen samples per class.
Volunteers P4 and P5 created datasets with six distinct classes. Volunteer P4 registered twenty
samples per class, and Volunteer P5 registered fifteen samples per class. Figures 8 and 9 present
examples of samples generated by volunteers to compose each dataset. Volunteers P1 and P2 used
the option "Zoom in", available in the system settings, highlighting and improving the perception
of facial movements.
Fig. 8. Gesture samples performed by Volunteers P1, P2 and P3.
During the analysis of the data captured in evaluation step 1, the need for a change in the
methodology was identified: the methodology initially [8] performed the data augmentation process before separating
the original data into training and test data. This situation could generate an overly optimistic performance
evaluation. Therefore, the methodology was updated to divide the original dataset into training
and test data, and to perform the data augmentation process only on the training datasets. This
change was made before evaluating the system performance on the datasets created by volunteers
without disabilities.
To evaluate the performance of the classifiers, the prediction error was estimated by K-fold
cross-validation with ten folds, separating 90% of the data for training and 10% for
testing. The quantity of training data was expanded by Data Augmentation, where additional
samples were created from existing data.
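The revised procedure can be sketched as follows, reusing the hypothetical augment routine from Section 3.2; the point is that augmentation happens inside each fold, on the training portion only, so augmented copies of a test sample never leak into training. The train_and_score callback stands in for the classifiers described in Section 3.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_estimate(images, labels, train_and_score, augment, k=10):
    """Split the ORIGINAL samples first, then augment only the training folds.
    `train_and_score(X_tr, y_tr, X_te, y_te)` returns one accuracy per fold."""
    labels = np.asarray(labels)
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=k, shuffle=True).split(images, labels):
        X_train, y_train = [], []
        for i in train_idx:
            X_train.append(images[i]); y_train.append(labels[i])
            for variation in augment(images[i]):      # augmentation on training data only
                X_train.append(variation); y_train.append(labels[i])
        X_test = [images[i] for i in test_idx]
        scores.append(train_and_score(X_train, y_train, X_test, labels[test_idx]))
    return np.mean(scores), np.std(scores)
```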
Fig. 9. Gesture samples performed by Volunteers P4 and P5.
After running tests on all folds, the overall accuracy (weighted average), standard deviation,
variance, and Cohen's kappa (a statistical measure of inter-rater agreement) were calculated for
each of them. HOG + SVM and CNN were the machine learning techniques used. Results obtained
for the datasets generated by the five volunteers are presented in Table 3, where the two learning methods
are compared.
Table 3. Volunteers' datasets - machine learning method comparison: overall accuracy, Cohen's kappa, standard deviation, and variance.

             HOG + SVM                              CNN
Volunteer    Acc.    Cohen k   Std dev.  Var.       Acc.    Cohen k   Std dev.  Var.
P1 - MHI     0.981   0.979     0.04      0.00180    0.994   0.993     0.02      0.0004
P2 - MHI     0.981   0.978     0.02      0.00090    0.987   0.985     0.02      0.0007
P3 - MHI     0.975   0.971     0.07      0.00620    0.941   0.936     0.07      0.0069
P4 - MHI     0.974   0.970     0.03      0.00160    0.974   0.970     0.03      0.0016
P5 - MHI     0.988   0.986     0.02      0.00006    1       1         0         0
Typically, a perfect classification would produce a variance and standard deviation of zero, and an
accuracy and kappa value of one. According to the Landis and Koch criteria [66] for the interpretation
of the kappa value: 0.0 to 0.2 = slight agreement, 0.2 to 0.4 = fair agreement, 0.4 to 0.6 = moderate
agreement, 0.6 to 0.8 = substantial agreement, and 0.8 to 1.0 = almost perfect agreement.
In this experiment, the classifiers presented satisfactory results, since the average accuracy
obtained on all datasets was high (higher than 0.94), low standard deviation and variance were
observed, and kappa values indicated almost perfect agreement. The CNN-based classifier presented
slightly better accuracy in comparison to the SVM-based classifier in four of the five datasets used.
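One plausible reading of how the per-fold results are aggregated into the quantities reported in Table 3 is sketched below, assuming scikit-learn's metric functions; whether kappa is computed over pooled folds or averaged per fold is not stated in the text, so this is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

def summarize_folds(fold_true, fold_pred):
    """Aggregate per-fold predictions: weighted-average accuracy, Cohen's kappa,
    and the standard deviation and variance of the per-fold accuracies."""
    accs = [accuracy_score(t, p) for t, p in zip(fold_true, fold_pred)]
    sizes = [len(t) for t in fold_true]
    pooled_true = np.concatenate(fold_true)
    pooled_pred = np.concatenate(fold_pred)
    return {"accuracy": float(np.average(accs, weights=sizes)),
            "cohen_kappa": cohen_kappa_score(pooled_true, pooled_pred),
            "std_dev": float(np.std(accs)),
            "variance": float(np.var(accs))}
```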
5.2 Step 2 - Evaluation of machine learning techniques and motion representations
using Keck Gesture Dataset
In the second evaluation step, the public Keck Gesture Dataset was used to evaluate the
system's performance using two classifiers (HOG + SVM and CNN) and two distinct motion
representations (conventional MHI and OF-MHI). The Keck Gesture Dataset is composed of fourteen
distinct gestures, performed by three people in front of a static background. For each gesture,
each person performs three repetitions. Examples of motion representations generated for each
of the fourteen gesture classes available in the Keck Gesture Dataset are presented in Figure 10. For
this step, besides all features available previously in the PGCA system, a second form of motion
representation was included: the optical flow-based motion history image (OF-MHI).
Fig. 10. Samples of gestures from the Keck Gesture Dataset represented by Conventional MHI and Optical
Flow based MHI. MHI images were resized to give emphasis to moving regions.
For the evaluation of this dataset, nine samples were available for each of the fourteen classes
existing in the dataset: a total of one hundred and twenty-six original samples. The "Zoom in"
option was used to enlarge the regions where movements are performed.
Results obtained by K-fold cross-validation with nine folds are presented in Table 4. Each fold
contained one sample per class for testing, and seventy-two samples per class for training (after
running the Data Augmentation process).
There are different works aimed at recognizing gestures, actions, or images in which the Keck
Gesture Dataset was used to assess the accuracy of classifiers or methods used.
Table 4. Keck Gesture Dataset - machine learning and motion representation method comparison: overall accuracy, Cohen's kappa, standard deviation, and variance.

Keck Dataset   HOG + SVM                            CNN
               Acc.   Cohen k   Std dev.  Var.      Acc.   Cohen k   Std dev.  Var.
MHI            0.88   0.87      0.04      0.0025    0.90   0.89      0.05      0.0038
OF-MHI         0.89   0.88      0.04      0.0026    0.87   0.86      0.07      0.0060
As no study was found using precisely the same form of evaluation employed by us (the K-fold
cross-validation procedure described above), a direct comparison of the classifiers' performance was not conducted. However, we
consider it worth mentioning some relevant works and the results obtained with the Keck Gesture
Dataset. For example, in the study of Pei et al. [103], a fast inverted index-based algorithm is
introduced for multi-class action recognition; results presented using the proposed method indicate
an accuracy of up to 89.88%. Fu et al. [42] considered the action recognition problem based on
geometrical structure; the method proposed by the authors uses a low-dimensional structure on the
Grassmannian manifold to represent video sequences by using the linear structure of the tangent
space, and presented a recognition accuracy of 93.4%. Wan et al. [123] presented a class-specific
dictionary learning approach via information theory for action and gesture recognition, and the
recognition accuracy achieved is up to 95.1%. The study of Zhang et al. [130] introduced a hybrid
model based on CNN for image classification, and the results indicate an accuracy of up to 93.15%.
In our experiment, both classifiers presented satisfactory results, using the MHI as well as the OF-MHI.
The classification using the two datasets created from the Keck Gesture Dataset presented
valid accuracy, and statistical data with few variations. Next, the two classifiers and the two motion
representations were evaluated once again during the following experiment, conducted with the
target audience.
5.3 Step 3 - Evaluation of the methodology using datasets created by students with
motor and speech impairments
For evaluating the system with the target audience, several improvements were introduced in the
system: a) storage of the video referring to the movements used to create the datasets; b) a guide
for creating picture communication boards; c) new configuration options for simulating the use of
the keyboard; d) visualization in video form of the movement related to selected gestures in the
configuration screen; e) the possibility to choose different communication boards in the User Area; f)
registration of new information regarding the main actions performed on the system interface in a
log file.
For the tests with the target audience, visits were made to four schools in the co-participating
institutions: one of them is a specialized educational institution for students with disabilities, and
the others are public schools where there are students with disabilities in mainstream education.
After the researchers met with several students with disabilities registered in the participating
institutions, a first selection was made looking for the students who would have a greater comprehension
capacity, and the ability to carry out voluntary movements, according to the perception of
the teachers who accompany them daily. In the four schools, after conducting interviews with the
teachers, support teachers, or LIBRAS (Brazilian sign language) interpreters, seven students with
characteristics considered desirable for participation in the experiment were identified (i.e. people with
motor and speech disabilities and without significant cognitive limitation). One student, among those
selected, was not authorized by the family to participate.
All selected students are characterized as people with cerebral palsy, with different levels of
disabilities. Table 5 describes some of the characteristics of the participating students.
Table 5. Characteristics of students with motor and speech impairments who participated in the experiment.

Student A (M, 18 years old). Medical report: cerebral palsy due to sequelae from complications during labor. Voluntary movements: head movements.
Student B (F, 29 years old). Medical report: brain damage and discrete hydrocephalus. Voluntary movements: movements of head and hands.
Student C (M, 38 years old). Medical report: quadriplegia with athetosis component, bilateral sensorineural hearing loss. Voluntary movements: movements of head and hands.
Student D (F, 20 years old). Medical report: pseudobulbar palsy; generalized hypotonia and hyperreflexia. Voluntary movements: movements of head and hands, facial expressions.
Student E (F, 18 years old). Medical report: static encephalopathy and spastic quadriplegia. Voluntary movements: head movements, facial expressions.
Student F (M, 18 years old). Medical report: static encephalitis, epilepsy, and Rubinstein-Taybi Syndrome. Voluntary movements: movements of head and hands.

Sex: M - Male; F - Female.
Each student who participated in this experiment was accompanied by a teacher who played
the "caregiver user" role in the system, informing the gestures that the student usually uses to
communicate, and the meaning of each of these gestures. Therefore, tasks conducted with the help
of these teachers allowed us to evaluate the system regarding the creation of a dataset with
gestures personalized for each student. The tasks expected to be performed during the experiment
were: 1. creating the dataset by capturing gestures for training the system; 2. training and evaluating
the system; 3. using the system to recognize gestures; 4. using gestures to select images in the
communication board; and 5. using gestures for interacting with the text editor or Internet browser.
5.3.1 Student A. Student A uses only two head gestures in the school environment to communicate
with his/her classmates and teachers, which refer to "Yes" and "No". This student has a very preserved
capacity for comprehension. Data collection for the system training and the interaction tests with
the interface were performed during two different sessions. For this student, the system was
configured to use the "Zoom in" option in order to better capture facial movements. Student A
presented some involuntary movements during the interaction with the system, leaning his leg or
arm on the table where the computer and the camera were arranged, generating some samples
with information about the back of the room, considered as noise. The dataset created with this
student was composed of two classes; fifteen samples from each class were considered valid for
training the system. After training the system, the first step for testing interaction with the interface
looked at whether the system could recognize the gestures for which it had been trained. The
two gestures were correctly identified by the system when performed voluntarily by the student.
Subsequently, the system was configured by associating the "Yes" class with the system's confirm
option. This configuration allowed for the testing of the communication board, and whether the
student could write words related to specific requests, such as "Sleep", "Bath", "Food", and others.
Next, the system was configured by associating the "No" class with the ENTER key, and the option
"simulate keyboard" was selected, allowing the use of other applications. This specific configuration
allowed the user to select images on a communication board in order to write words directly
into a text editor, simulating the pressing of the keyboard's ENTER key when performing the "No"
gesture. Some false positives occurred, which were later corrected by adjusting the confidence level
of the recognition system.
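Adjusting the confidence level can be read as rejecting predictions whose top-class probability falls below a threshold; a sketch built on the probabilistic classifier from the Section 3.5 example is shown below, with a hypothetical threshold value.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.80   # hypothetical value; the paper does not report the one used

def recognize(clf, feature_vector):
    """Return the predicted gesture label, or None when the top-class probability
    is below the threshold, so that uncertain movements trigger no action."""
    probs = clf.predict_proba([feature_vector])[0]
    best = int(np.argmax(probs))
    if probs[best] < CONFIDENCE_THRESHOLD:
        return None
    return clf.classes_[best]
```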
5.3.2 Student B. Student B does not present significant motor impairment and uses hand and head
gestures in the school environment. However, she avoids interacting with unknown people. She
has some difficulties understanding and is very shy, demanding constant encouragement from the
teacher who accompanied her when performing the gestures. Data collection for system training
was performed during a session that happened in one day, and interaction tests with the interface
were performed during another session three days later. The "Zoom in" option was disabled in the
system. Her customized dataset was composed of five classes with thirteen samples each. After
training the system, during the interaction test, the gestures referring to "Yes", "Fast", and "Bunny"
(because of the Easter Bunny) were correctly recognized. The gestures referring to "Food" and "No"
generated some erroneous interpretations, but this did not prevent interaction with the system.
Subsequently, the system was configured by associating the "Yes" class with the system's confirm
option, and the student selected images related to handicrafts on a communication board, as this is
a subject of interest to the student, according to her teacher. This board was used to write words
associated with each figure in the system interface when the student performed the "Yes" gesture.
In addition, a communication board composed of images of vowels was used for the student to
indicate the first letter of the word "Elephant". With the teacher's help, the student made the "Yes"
gesture to select the figure referring to the letter "E" on the system interface, and the corresponding
sound was emitted. Because the student demanded a lot of intervention from the teacher, and had
little initiative to select pictures on her own, no other interaction test was performed.
5.3.3 Student C. Student C presents motor and speech impairment, and severe hearing loss. He
uses hand and head gestures in the school environment. He uses some gestures from LIBRAS as well
as signs from home, but because of motor impairment in the hands, not all LIBRAS interpreters can
understand his communication intentions. Student C has a preserved capacity for understanding,
and, for the execution of the experiment, he was accompanied by his LIBRAS interpreter and
caregiver. Data collection for system training and interaction tests were performed during two
dierent sessions. The "Zoom in" option was disabled. Initially, the dataset created by this student
was composed of ten classes, with twelve samples each. After training the system, during the
interaction test, some gestures characterized with similar movements were being confused by the
classier, such as the gestures "Yes", "I", "Mom", and "Water". The stored videos and samples related
to these gestures showed the movements for these gestures are very similar to each other, with
variations only in the position of the ngers or hand and, probably because of this, the training
performed was not enough for the system to recognize the dierence in these signs. Therefore,
during another session one month later, a second dataset was created containing only seven gestures
with seventeen samples each, leaving out the gestures "I", "Mom", and " Water ", and keeping the"
Yes "gesture. With the new dataset, most of the gestures were correctly recognized by the system,
except the gesture referring to "Bathroom" that was not recognized in some situations. Subsequently,
the system was congured by associating the "Yes" class with the system’s conrm option, allowing
for the student to test the image selection on the communication boards. Next, the system was
congured by associating the "Intelligent" class with the ENTER key, and the "simulate keyboard"
option was selected. Then, the student selected images on a communication board to write words
in a text editor, and to simulate the use of the ENTER key. It was also possible for the student to
select a gure on a communication board with keyword options in order to search in an Internet
browser, writing directly into the browser URL, and simulating the ENTER key to search for the
keyword. The interaction test was nalized after this step. However, following the same interaction
structure, other gestures could be associated with the TAB key to navigate between the search
results, and to select the desired page using the ENTER key simulation.
5.3.4 Student D. This student uses head gestures and facial expressions in the school environment,
has a well-preserved capacity for comprehension, and is able to voluntarily move the right arm
despite many spastic movements. The gesture referring to "Yes" is the raising of the eyebrows.
However, since the gesture has a lot of associated head movement, the system was not able to
correctly register the movement of this facial expression. In order to create a dataset for this student
to interact with the system, we chose to capture the movement referring to "No" (moving the
head to both sides), and the movement referring to "Hand" (moving his right arm). Data collection
was performed during one session, and the interaction tests during another session five days later.
Ten samples were considered valid for each class in the dataset. After training the system, the
two gestures were correctly identified by the system during the interaction test. Subsequently,
the system was configured by associating the "Hand" class with the system's confirm option,
and the "No" class with the ENTER key. The same interaction tests with the Internet browser
and text editor performed by Students A and C were performed by this student. However, some
involuntary movements occurred with the arm used to make selections, and, unintentionally, the
system selected items on the boards several times. According to the accompanying teacher, as the
student is accustomed to using the eyebrows to give affirmative answers, using the arm is still a
challenging task for this student, and would require more training.
5.3.5 Student E. Student E uses only small head gestures and facial expressions to expose her
communication intentions in the school environment. The teachers expressed doubts about the
student's level of understanding. Two attempts were made to collect data with this student during two
different sessions on different days. However, this student makes very restricted head movements;
for the same intention to communicate "Yes", she sometimes moved her head and sometimes she
just smiled. According to the teachers, when eating, the student puts her tongue out to indicate
she does not want a specific food. However, during the experiment, this same gesture referring to
"No" was never performed by the student, even after several attempts by the teacher, who asked
questions seeking a negative response from the student. Therefore, the experiment with Student
E was finalized. The experiment was designed to obtain images during explicit training, when
the user answers the caregiver's questions. A feature that might be added in the future is for the
system to learn from an annotated video in which the caregiver indicates the meaning of a student's
gestures and expressions.
5.3.6 Student F. Student F has progressively lost motor functions, and uses hand and head
gestures in the school environment, mainly pointing to objects of interest. According to the
teachers, the student has a well-preserved comprehension capacity. The first author, who conducted
the experiment, participated in some classes with the student, observing his gestural interaction.
However, it was not possible to create a dataset for this student, even though two attempts were made
to collect data on different dates and with the support of different teachers. During both sessions,
despite having the necessary motor conditions, the student showed no interest in executing the
gestures when requested. After the second attempt to capture his gestures, the researcher asked
the student whether he did not want to be filmed, and the student emitted a sound (considering his speech
limitations) which was understood as a "No". The data collection session was then terminated.
Taking into account the situations reported when creating customized datasets with the selected
students, it was possible to create datasets with gestures performed in a personalized manner by
four of the seven selected students. Figures 11, 12, and 13 present examples of samples generated
by students to compose each dataset.
To evaluate the classifiers' performance, the prediction error was estimated using K-fold cross-
validation with ten folds, separating 90% of the data for training and 10% for testing.
The number of samples used for composing the dataset created by each student varied according
to the number of gestures captured and considered valid by the first author, who carried
out the experiment. That is, after a series of records of movements performed by the students, only
the samples considered similar to each other (correctly representative of the same class) were used,
while the others were discarded. Results obtained by K-fold cross-validation are presented in Table 6.
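As an illustration, the sketch below shows one way the reported per-dataset metrics (overall accuracy, Cohen's kappa, standard deviation, and variance) could be computed with ten-fold cross-validation over HOG features and an SVM. It is a minimal sketch, assuming scikit-image and scikit-learn and equally sized grayscale motion images; parameter values are illustrative and not necessarily those used in the PGCA system.

# Hypothetical evaluation sketch; not the exact PGCA implementation.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate_hog_svm(images, labels, n_splits=10):
    """images: list of equally sized 2-D grayscale MHI/OF-MHI arrays; labels: class names."""
    X = np.array([hog(img, pixels_per_cell=(16, 16), cells_per_block=(2, 2)) for img in images])
    y = np.array(labels)
    accs, kappas = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):      # roughly 90% train / 10% test per fold
        clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        kappas.append(cohen_kappa_score(y[test_idx], pred))
    return np.mean(accs), np.mean(kappas), np.std(accs), np.var(accs)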
Fig. 11. Gesture samples by Students A and D, represented by MHI and OF-MHI.
Fig. 12. Student B’s gesture samples represented by MHI and OF-MHI.
Fig. 13. Student C’s gesture samples represented by MHI and OF-MHI.
During the experiment to test the PGCA system for interaction, the OF-MHI option for motion
representation and the CNN-based classifier for gesture recognition were used.
However, for each gesture captured by the system and represented in an image, the video of the
movement was also stored, which later allowed for the simulation of gesture recognition using the
two forms of motion representation and the two classifiers evaluated during previous experiments.
Table 6. Results obtained with the students' datasets using two machine learning techniques (HOG + SVM
and CNN) and two motion representations (conventional MHI and OF-MHI). Method comparison: overall
accuracy (Acc.), Cohen's kappa (Cohen k), standard deviation (Std dev.), and variance (Var.).

                         HOG + SVM                                CNN
Student        Acc.   Cohen k   Std dev.   Var.        Acc.   Cohen k   Std dev.   Var.
A    MHI       1      1         0          0           0.93   0.86      0.10       0.011
     OF-MHI    1      1         0          0           0.97   0.96      0.07       0.006
B    MHI       0.83   0.79      0.11       0.013       0.84   0.80      0.11       0.013
     OF-MHI    0.86   0.83      0.09       0.009       0.78   0.73      0.14       0.021
C1   MHI       0.84   0.82      0.13       0.020       0.68   0.64      0.15       0.026
     OF-MHI    0.87   0.86      0.13       0.019       0.73   0.70      0.10       0.012
C2   MHI       0.87   0.85      0.14       0.023       0.76   0.72      0.09       0.010
     OF-MHI    0.87   0.86      0.13       0.019       0.87   0.84      0.11       0.015
D    MHI       0.96   0.90      0.15       0.025       0.90   0.80      0.20       0.044
     OF-MHI    1      1         0          0           1      1         0          0
As a result of this experiment, we observed that people with motor and speech impairments can
generate a customized gesture dataset and train a system to recognize these gestures. For gesture
recognition, the SVM-based classifier associated with the OF-MHI motion representation presented
better overall performance than the CNN-based classifier and the MHI motion representation. In
addition, some challenges and issues to be improved in the methodology and in the PGCA system
were identified, which are described in the Discussion section. For instance, we identified the need
to improve the system's usability and accessibility, to ensure the quality of the captured samples, and
to consider variations in each user's level of understanding.
Motor disability is a condition that generates very particular skills and limitations, requiring
a personalized approach. This condition was evident in the heterogeneity of the participants in
this assessment step. Therefore, the key point here is not finding an average result or going very
deeply into the subjectivities of each participant, but identifying whether the system (and, consequently,
the methodology behind it) is capable of supporting participants, in their diversity of skills and
limitations, to create, train, and use personalized gesture datasets. Results presented in terms of
gesture recognition accuracy allowed us to observe the viability of using the PGCA system with people with
very different motor skills performing similar tasks, each in their own way. Although it can give us
information about the tool's performance, we do not intend to compare the accuracy obtained
across the different datasets created, since the difficulties faced and the effort required of each
student to perform the tasks varied widely. The various attempts to use the tool by students with
cerebral palsy allowed us to see that personalized gestural interaction is a promising path to be
explored for augmentative and alternative communication for this audience, although challenges
and future improvements are needed and have been identified.
Details about results from step 3 (considering only the use of the OF-MHI motion representation)
and from interviews with special education professionals were used to improve the system and can
be found in [9].
6 DISCUSSION
Most of the relevant studies found in the literature used low-cost solutions for image acquisition,
and explored the possibilities of adapting resources already available in the computers used by most
people. Prioritizing low-cost solutions is necessary to reach a wider audience unable to afford
high-cost devices. In our research, the availability of the resource for the target audience motivated
using a simple camera for capturing basic input for the system.
During the experiments, we observed that the developed system can be used by the target audience,
that is, by people with motor and communication impairments, allowing for the execution of the
foreseen tasks in the proposed methodology. The first experiment, with volunteers who do not
have a disability, allowed us to verify the possibility of training a system with personalized
gestures, using a few samples for training. Using a public dataset in the second experiment had the
objective of allowing for the repeatability of the experiment by other researchers, besides enabling
the performance of new tests with different technologies before the system was made available for
testing with the targeted public. Finally, the third experiment brought rich insights by involving
representatives from the target audience, providing a glimpse of the routine of students with motor
and speech impairments in the school environment, and allowing us to observe the initiatives
used by the teachers to communicate with these students daily.
Considering the three mentioned experiments, results indicate that the implemented classifiers
are able to recognize gestures after being trained with customized datasets, even with a small number
of samples. Therefore, the classifiers can be applied to enable personalized gestural interaction.
Transfer Learning has proven to be efficient for our work because even a network trained on a
dataset composed of color images can be customized to successfully recognize our grayscale
images.
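The sketch below illustrates this adaptation idea: a network pretrained on RGB images is reused for grayscale motion images by replicating the single channel three times and retraining only a small classification head. It is a minimal sketch under stated assumptions (TensorFlow/Keras and MobileNetV2 are chosen here purely for illustration); it is not the network architecture used in the system.

# Hypothetical transfer-learning sketch for grayscale motion images.
import numpy as np
import tensorflow as tf

def build_classifier(num_classes, input_size=224):
    base = tf.keras.applications.MobileNetV2(
        input_shape=(input_size, input_size, 3), include_top=False, weights="imagenet")
    base.trainable = False                       # keep the pretrained features frozen
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def gray_to_rgb(gray_images):
    """gray_images: (N, H, W) array of grayscale MHI / OF-MHI samples."""
    x = gray_images.astype("float32") / 255.0
    return np.repeat(x[..., np.newaxis], 3, axis=-1)   # replicate channel: (N, H, W, 3)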
The SVM-based classifier associated with OF-MHI motion representation presented improvements
in performance for gesture recognition on most datasets. New experiments will be carried
out with the target audience, and the classifier that shows the best overall results is the natural
candidate to be adopted for launching the system's final version. Some important issues observed
during the evaluation steps are presented below.
6.1 Uncontrolled situations
In many situations, images with motion blur can provide inaccurate results, but in our work, as
motion is already represented in the form of a blur composed of shades of gray, motion blur would
most likely enter the sum of the frames. Since the classifiers presented satisfactory results, it is
possible that motion blur does not significantly interfere with the final motion representation. When
using the PGCA system, occlusions may disturb the understanding of a gesture if they interfere
considerably with the final representation of the generated motion. The scene's background must
be static, as any object or person moving behind the user will generate inaccurate or
unnecessary motion representations. Backgrounds with complex scenes negatively influenced
motion representation only in cases where users touched the table on which the camera was
positioned. For a new experiment with the target audience, an external camera mounted on a tripod
will be used to avoid this kind of situation. We also realize that lighting conditions in different
data collection environments can interfere with the system's ability to correctly capture the movements
performed by users. In the first experiment, one of the volunteers was positioned next
to a window (light source), and we noticed that the gesture representation generated when the
user was positioned with the body sideways to the window is significantly different from the
representation generated when the user was positioned with the whole body facing the window,
and this can negatively interfere with the system's performance. Thus, for the system to perform better,
it is important to keep the same lighting pattern during the dataset creation and interaction with
the system. In the experiment with the target audience, we sought to position the users in front of
a light source, either a window or an ordinary light bulb. More exhaustive tests could establish optimal
conditions for different variations of cluttered background, dark lighting, low contrast, or motion
blur.
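For reference, the sketch below shows a minimal conventional MHI update in the spirit of Bobick and Davis [21], illustrating why motion accumulates as shades of gray in the final representation. The threshold and decay values are illustrative assumptions, not the settings used in the system.

# Hypothetical minimal MHI update; parameters are illustrative only.
import numpy as np

def update_mhi(mhi, prev_frame, curr_frame, tau=255, decay=15, thresh=30):
    """mhi, prev_frame, curr_frame: 2-D uint8 grayscale arrays of equal size."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    moving = diff > thresh                                   # pixels where motion occurred
    mhi = np.where(moving, tau, np.maximum(mhi.astype(np.int16) - decay, 0))
    return mhi.astype(np.uint8)                              # recent motion bright, older motion fading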
6.2 Important improvements
The user experience (UX) refers to all experiences resulting from interactions that a user has with
a product or service [32]. Lee et al. [67] define UX for people with disabilities as an experience that
consists of aspects of the interaction between people with disabilities and products/services that are
influenced by assistive technologies. The evaluation steps described in this paper aimed to conduct
preliminary accessibility checks by the developer and the volunteers without disabilities, in order to later
perform the interface evaluation with the participation of users with disabilities and observe the
user experience. The experiments allowed us to identify, in different ways, points to be improved
in both the developed system and the proposed methodology, because, as highlighted by Ilyas et al.
[57], creating a dataset considering real situations, with gestures executed by the target audience,
is still perceived as problematic and challenging, mainly due to human variations in performing the
same gestures. The challenge is greater when considering people with motor impairment, because
disabilities can make it difficult for people to repeat gestures, and some computer vision solutions
present limited performance in the presence of involuntary body movement, or if the person
presents seizure disorders or spastic movements. In order to create the datasets, several executions
of the same gesture were captured and, later, samples that presented very different representations
(e.g., if some involuntary movement occurred during the execution of the gesture) were deleted via
the system. Even with the satisfactory recognition rate obtained in some tests, the system must
evaluate and guarantee the quality of the samples captured to compose the dataset. During the
process of creating the dataset, an image matching algorithm was developed to compare images
of the representation generated for each gesture with the first gesture considered as the base. To
compare the captured samples, a subtraction operation is performed between corresponding images,
and the remaining area is checked to see whether it is not greater than the original area of the base sample.
In addition, the centroids of the images are also compared to check whether they are in the same quadrant
of the image, or in an adjacent quadrant. Therefore, only samples considered valid by the system
and by the caregiver will be stored and become a part of the dataset. From evaluation step 3,
we noticed the need to improve the system's usability and accessibility, mainly concerning visual
aspects of the interface and user feedback. Additional features, such as a game-based interactive
interface, were developed, evaluated with the target audience, and described in [10].
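The sketch below illustrates the sample-comparison idea just described (residual area after subtraction plus a centroid quadrant check). It is a simplified illustration under assumed conventions (binarized, equally sized motion images and an area-ratio threshold), not the exact matching algorithm implemented in the system.

# Hypothetical sample-validation sketch; thresholds and conventions are assumed.
import numpy as np

def quadrant(img):
    ys, xs = np.nonzero(img)
    if len(xs) == 0:
        return None
    cy, cx = ys.mean(), xs.mean()
    h, w = img.shape
    return (int(cy >= h / 2), int(cx >= w / 2))        # (vertical half, horizontal half)

def is_valid_sample(base, candidate, area_ratio=1.0):
    base_bin = base > 0
    cand_bin = candidate > 0
    residual = np.logical_and(cand_bin, np.logical_not(base_bin))  # area left after subtraction
    if residual.sum() > area_ratio * base_bin.sum():
        return False                                    # too much unexplained motion
    qb, qc = quadrant(base_bin), quadrant(cand_bin)
    if qb is None or qc is None:
        return False
    # accept the same quadrant or a directly adjacent one
    return abs(qb[0] - qc[0]) + abs(qb[1] - qc[1]) <= 1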
6.3 Student’s level of understanding
For the PGCA system to bring benefits to both users and caregivers, it is imperative that users
become aware of the possibilities of interaction with the system: therefore, users with cognitive
impairments may not be able to use the system in its current version, as the system waits for
explicit input from the user. Such a limitation drew the attention of the researchers, as selecting
students to participate in the experiment was not a trivial task. Even in situations where there are
similar diagnoses (e.g., people with cerebral palsy), teachers themselves have doubts about each
student’s level of understanding. Using the proposed methodology, while adding more forms of
data input and processing, can make the system more inclusive for users with severe cognitive
disabilities, for the system would interpret the users and their intentions to communicate rather
than their intentions to interact with the system.
6.4 Limitations
To avoid unnecessary and incorrect answers by the system, students who perform a restricted
number of voluntary gestures, like Students A and D, need caregiver assistance to initiate and
finalize the capture process through the interface. Since these students did not have other gestures
that could be associated with the capture/start functionality, this task must be done by the caregiver.
The interface could be improved to suggest to the caregiver that he/she close the capture process,
or the system could do so automatically after a certain period of inactivity.
Student C can perform a higher number of signs. However, some of the signs generated very
similar motion representations that ended up confusing the classifier. For gesture-based interfaces,
precision (high true positive and low false positive rates) has to be assured, while maintaining
the natural feeling of interpersonal communication [69]. Even with the possibility of using a
confidence level to minimize false positives, users can create datasets composed of gestures that are
quite similar to each other, which may not be correctly recognized by the system. In these cases,
the system may recommend a new data collection for these gestures or suggest keeping only one of the
similar gestures in the dataset. Further studies are being conducted to improve overall
system accuracy by including more information in the motion representation, such as the texture of
the hand and face, in order to enrich the input samples used by the classifier.
Sessions occurred on different days, and some students performed gestures with different speed
and intensity in each session, suggesting that it is important to collect samples on different days to
generate more information to be learned by the classifier. In addition, the inclusion of a game-based
approach, perhaps using a serious game that requires the execution of similar movements, can be
a way to stimulate the students' interest and to promote engagement and motivation
to train and use the system. In [10], the authors experimented with a game-based approach to stimulate
students and to promote their engagement and motivation to train and use the system, obtaining
promising results.
The feedback and the state of the system must be improved to indicate that the system considered
a gesture as complete and that it is trying to recognize it. Currently, a progress bar indicates the
system’s status, but during the experiments, it was not perceived or understood by the users.
Changes in the system interface and different forms of feedback can be adopted, such as presenting
the messages in textual form, and reading them with a synthesized voice to facilitate understanding
by non-literate users. Since one of the users who participated in the experiment has a hearing
impairment, displaying messages and usage guidelines translated into LIBRAS can also support
users while learning and using the system.
One known limitation is that we are not comparing the proposed methodology with other
methodologies with the same purpose, because it is not yet possible to do this. In the future,
new assessments can be made, comparing different applications of the methodology, by different
professionals, in different contexts.
6.5 Future Works
For each gesture performed during the interaction with the AAC system, an MHI or OF-MHI image
is generated, as well as a text file containing the system's predicted class and its confidence level.
This information could be used in a more autonomous version of the system, in the future, as a way
to reload the dataset with new samples whose classification has shown a high confidence level.
This larger number of samples could help the classifiers generate better results, achieving
adequate accuracy. This could be important for the process of personalization, as the system could
identify, for example, that over time a particular gesture has been executed very differently from its
original execution when the system was trained. When reaching a very significant level of difference,
the system could suggest that the user retrain the system using newer samples. Furthermore, after a
long period of system use, a fairly large number of new samples could be generated. This could
make training a deep network like a CNN from scratch possible, and possibly allow the SVM classifier
to also deliver better results. In the system's current version, the caregiver could perform
new sample collections periodically, in parallel with the use of the system already trained with a first
dataset. These new samples could gradually generate a more complete dataset, possibly able to
recognize the gestures more satisfactorily.
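A minimal sketch of this idea is given below: interaction-time samples whose predicted confidence exceeds a threshold are collected as candidate additions to the training dataset, to be reviewed before retraining. The file layout, field names, and threshold are assumptions introduced for illustration; the text above only states that an image and a text file with the predicted class and confidence level are stored.

# Hypothetical sketch of confidence-based dataset growth; the file format is assumed.
import json
from pathlib import Path

def collect_high_confidence_samples(log_dir, min_confidence=0.9):
    """Assumes each gesture leaves <name>.png plus <name>.json containing
    {"predicted_class": ..., "confidence": ...} in log_dir."""
    candidates = []
    for meta_path in Path(log_dir).glob("*.json"):
        meta = json.loads(meta_path.read_text())
        if meta.get("confidence", 0.0) >= min_confidence:
            candidates.append((meta_path.with_suffix(".png"), meta["predicted_class"]))
    return candidates  # to be reviewed by the caregiver before being added to the dataset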
Our research until now aimed to identify dynamic gestural patterns, based mainly on the
execution of movements. Nevertheless, the proposed methodology can support the design of a
system with another perspective, focusing on recognizing gestures in a complex scenario, using
other technologies or devices, such as depth camera, gloves with electrical sensors, or head-mounted
displays. These alternatives could be used to generate an automatic translation system for complex
sign languages, such as LIBRAS, which demands other system requirements such as parameters for
the hand configuration, articulation point, orientation, movement, and facial expression.
The three evaluation steps, together, show that the methodology can be applied via a computing
system to support its target audience in generating a customized dataset and in using such a dataset to
enable personalized gestural recognition and interaction. To the best of our knowledge, no other
existing methodology could be used for comparison. Therefore, the present study is a necessary step
before we are able to apply the methodology to design different systems, in different contexts, by
different professionals, for different audiences, which can provide evidence to evaluate more
attributes of our methodology than its feasibility in the future.
The best perspectives for the proposed methodology are in the design of AAC tools, targeted
at a particular individual, by learning their actions, rather than adapting different users' varying
gestures through thresholds. For the next steps in our research, a personified approach could be
introduced by going beyond system interface adaptation and personalization. Another aspect to be
introduced is to conceive of intelligent assistive technology that requires minimal user intervention,
capturing samples continuously to interpret patterns of movements performed by people, especially
people with disabilities. These samples can be combined with other data (e.g., from a brain-computer
interface) to train a personified system, through machine learning, that would be more focused on
the user's individualities, and therefore able to learn and represent the user, overcoming personalization
standards.
Accessibility can be considered a prerequisite of usability [67] [106]. The user experience
observed in evaluation step 3 of the PGCA system indicates the feasibility of the system
and its potential in providing accessibility for people with motor and speech disabilities, despite
some limitations and challenges perceived. The professionals who monitored the execution of the
scheduled tasks in evaluation step 3 showed interest in using the PGCA system in the school
environment. Even so, a methodology for evaluating the usability and accessibility of the PGCA
system should be employed to better understand and evaluate the perception of the professionals
who will follow the execution of new experiments with the target audience. These evaluations
may provide results that represent the opinion of caregivers and their real intention to use the system
in the future.
7 CONCLUSION
This paper presented a methodology for supporting AAC based on personalized gestural interaction,
as well as results from the evaluation of the pilot system created from this methodology. Three
dierent evaluations were conducted using datasets: 1. created by volunteers without disabilities, 2.
using the public dataset Keck Gesture Dataset, and 3. created by students with motor and speech
impairments.
Two machine learning techniques were used to generate classifiers for gesture recognition: SVM
(with the HOG descriptor), and CNN (using Transfer Learning). Two different motion representations
were used to describe movements: conventional MHI, and Optical Flow-based MHI. The SVM-based
classifier, with the motion representation obtained by OF-MHI, presented better performance in
most tests.
The proposed methodology allows users and caregivers to create personalized gestural interaction
for communication purposes, and is promising for supporting the design of AAC systems. The biggest
challenge identified so far is related to the profile of our target audience: on the one hand, training
the system is quite dependent on the quality of the dataset created; on the other hand, creating
a dataset with quality samples depends heavily on users' comprehension capacity to know what
movements to perform, and on their capacity to perform voluntary movements with few variations,
i.e., it is necessary to guarantee a certain level of awareness and repeatability in order to perform
the same movement multiple times.
In future work, the PGCA user interface will be adjusted to minimize the effort required from
users and caregivers in acquiring samples to create their customized dataset, and new experiments
with the target audience will be conducted in order to better evaluate the system, and its potential
to support AAC. During a future stage of the research, we also intend to investigate how
uncontrolled situations can interfere with system accuracy by performing tests with datasets
created under different conditions, such as with cluttered background, dark lighting, low contrast,
and motion blur.
ACKNOWLEDGMENTS
The authors thank CAPES and CNPq for supporting this research, and especially thank the institutions,
the volunteers, teachers, and students who participated in the experiments.
REFERENCES
[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay
Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: a system for large-scale machine learning. In OSDI,
Vol. 16. 265–283.
[2]
Julio Abascal. 2008. Users with disabilities: maximum control with minimum effort. Articulated Motion and Deformable
Objects (2008), 449–456.
[3]
Malek Adjouadi, Anaelis Sesin, Melvin Ayala, and Mercedes Cabrerizo. 2004. Remote eye gaze tracking system as a
computer interface for persons with severe motor disability. In International Conference on Computers for Handicapped
Persons. Springer, 761–769.
[4]
Ali A Alani, Georgina Cosma, Aboozar Taherkhani, and TM McGinnity. 2018. Hand gesture recognition using an
adapted convolutional neural network with data augmentation. In 2018 4th International conference on information
management (ICIM). IEEE, 5–12.
[5]
Natasha Alves, Stefanie Blain, Tiago Falk, Brian Leung, Negar Memarian, and Tom Chau. 2016. Access Technologies
for Children and Youth with Severe Motor Disabilities. Paediatric Rehabilitation Engineering: From Disability to
Possibility (2016), 45.
[6]
Rui Azevedo Antunes, Luís Brito Palma, Fernando V Coito, Hermínio Duarteramos, and Paulo Gil. 2016. Intelligent
human-computer interface for improving pointing device usability and performance. In Control and Automation
(ICCA), 2016 12th IEEE International Conference on. IEEE, 714–719.
[7]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2018. Mobile Interaction for Augmentative
and Alternative Communication: a Systematic Mapping. SBC Journal on 3D Interactive Systems 9, 2 (2018), 105–118.
[8]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2018. Towards a Methodology to Support
Augmentative and Alternative Communication by means of Personalized Gestural Interaction. In Proceedings of the
17th Brazilian Symposium on Human Factors in Computing Systems. ACM, 38.
[9]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2019. Personalized Interactive Gesture
Recognition Assistive Technology. Proceedings of the 18th Brazilian Symposium on Human Factors in Computing
Systems (2019), 1–12.
[10]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2020. Personalized Gestural Interaction
Applied in a Gesture Interactive Game-based Approach for People with Disabilities. Proceedings of the 25th International
Conference on Intelligent User Interfaces (2020), 1–11.
[11]
Behrooz Ashtiani and I Scott MacKenzie. 2010. BlinkWrite2: an improved text entry method using eye blinks. In
Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications. ACM, 339–345.
[12]
Aqil Azmi, Nawaf M Alsabhan, and Majed S AlDosari. 2009. The Wiimote with SAPI: Creating an accessible low-cost,
human computer interface for the physically disabled. International Journal of Computer Science and Network Security
9, 12 (2009), 63–68.
[13]
Samy Bakheet. 2017. A Fuzzy Framework for Real-Time Gesture Spotting and Recognition. Journal of Russian Laser
Research 38, 1 (2017), 61–75.
[14]
Margrit Betke. 2008. Camera-Based Interfaces and Assistive Software for People with Severe Motion Impairments.
Technical Report. Boston University Computer Science Department.
[15]
Margrit Betke, James Gips, and Peter Fleming. 2002. The camera mouse: visual tracking of body features to provide
computer access for people with severe disabilities. IEEE Transactions on neural systems and Rehabilitation Engineering
10, 1 (2002), 1–10.
[16]
Zhen-Peng Bian, Junhui Hou, Lap-Pui Chau, and Nadia Magnenat-Thalmann. 2016. Facial position and expression-
based human–computer interface for persons with tetraplegia. IEEE journal of biomedical and health informatics 20, 3
(2016), 915–924.
[17]
Pradipta Biswas and Pat Langdon. 2011. A new input system for disabled users involving eye gaze tracker and
scanning interface. Journal of Assistive Technologies 5, 2 (2011), 58–66.
[18]
Pradipta Biswas and Pat Langdon. 2013. A new interaction technique involving eye gaze tracker and scanning system.
In Proceedings of the 2013 Conference on Eye Tracking South Africa. ACM, 67–70.
[19]
Pradipta Biswas and Pat Langdon. 2015. Multimodal intelligent eye-gaze tracking system. International Journal of
Human-Computer Interaction 31, 4 (2015), 277–294.
[20]
Pieter Blignaut. 2017. Development of a gaze-controlled support system for a person in an advanced stage of multiple
sclerosis: a case study. Universal Access in the Information Society 16, 4 (2017), 1003–1016.
[21]
Aaron F. Bobick and James W. Davis. 2001. The recognition of human movement using temporal templates. IEEE
Transactions on pattern analysis and machine intelligence 23, 3 (2001), 257–267.
[22]
Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt
Huenerfauth, Hernisa Kacorri, Tessa Verhoef, et al. 2019. Sign Language Recognition, Generation, and Translation:
An Interdisciplinary Perspective. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility.
16–31.
[23]
Dario Cazzato, Marco Leo, and Cosimo Distante. 2014. An investigation on the feasibility of uncalibrated and
unconstrained gaze tracking for human assistive applications by using head pose estimation. Sensors 14, 5 (2014),
8363–8379.
[24]
Vikash Chauhan and Tim Morris. 2001. Face and feature tracking for cursor control. In Proceedings of the Scandinavian
Conference on Image Analysis. 356–362.
[25]
Weiqin Chen. 2013. Gesture-based applications for elderly people. In International Conference on Human-Computer
Interaction. Springer, 186–195.
[26]
Fulvio Corno, Laura Farinetti, and Isabella Signorile. 2002. A cost-effective solution for eye-gaze assistive technology.
In Multimedia and Expo, 2002. ICME’02. Proceedings. 2002 IEEE International Conference on, Vol. 2. IEEE, 433–436.
[27]
Stefania Cristina and Kenneth P Camilleri. 2016. Model-based head pose-free gaze estimation for assistive communi-
cation. Computer Vision and Image Understanding 149 (2016), 157–170.
[28]
E Dall’Asta and Riccardo Roncella. 2014. A COMPARISON OF SEMIGLOBAL AND LOCAL DENSE MATCHING
ALGORITHMS FOR SURFACE RECONSTRUCTION. International Archives of the Photogrammetry, Remote Sensing &
Spatial Information Sciences 45 (2014).
[29]
James W Davis and Aaron F Bobick. 1997. The representation and recognition of human movement using temporal
templates. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on.
IEEE, 928–934.
[30]
AV Dehankar, Sanjeev Jain, and VM Thakare. 2017. Using AEPI method for hand gesture recognition in varying
background and blurred images. In Electronics, Communication and Aerospace Technology (ICECA), 2017 International
conference of, Vol. 1. IEEE, 404–409.
[31]
AV Dehankar, VM Thakare, and Sanjeev Jain. 2017. Detecting centroid for hand gesture recognition using morpho-
logical computations. In Inventive Systems and Control (ICISC), 2017 International Conference on. IEEE, 1–5.
[32]
Pieter Desmet and Paul Hekkert. 2007. Framework of product experience. International journal of design 1, 1 (2007),
57–66.
[33]
Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2015. Image super-resolution using deep convolutional
networks. IEEE transactions on pattern analysis and machine intelligence 38, 2 (2015), 295–307.
[34]
Simone Eidam, Jens Garstka, and Gabriele Peters. 2016. Towards regaining mobility through virtual presence for
patients with locked-in syndrome. In Proceedings of the 8th International Conference on Advanced Cognitive Technologies
and Applications. Rome, Italy. 120–123.
[35]
Layal El-Afifi, Mohamad Karaki, Joelle Korban, and Mohamad A al Alaoui. 2004. ’Hands-free interface’-a fast and
accurate tracking procedure for real time human computer interaction. In Signal Processing and Information Technology,
2004. Proceedings of the Fourth IEEE International Symposium on. IEEE, 517–520.
[36]
Samuel Epstein, Eric Missimer, and Margrit Betke. 2014. Using kernels for a video-based mouse-replacement interface.
Personal and Ubiquitous Computing 18, 1 (2014), 47–60.
[37]
S Yu Eroshkin, NA Kameneva, DV Kovkov, and AI Sukhorukov. 2017. Conceptual system in the modern information
management. Procedia Computer Science 103 (2017), 609–612.
[38]
Xijian Fan and Tardi Tjahjadi. 2017. A dynamic framework based on local Zernike moment and motion history image
for facial expression recognition. Pattern Recognition 64 (2017), 399–406.
[39]
Alhussein Fawzi, Horst Samulowitz, Deepak Turaga, and Pascal Frossard. 2016. Adaptive data augmentation for
image classification. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 3688–3692.
[40]
S Federici and MJ Scherer. 2012. The assistive technology assessment model and basic definitions. Assistive technology
assessment handbook (2012), 1–10.
[41]
Marcela Fejtová, Luis Figueiredo, Petr Novák, Olga Štěpánková, and Ana Gomes. 2009. Hands-free interaction with a
computer and other technologies. Universal Access in the Information Society 8, 4 (2009), 277.
[42]
Xiping Fu, Brendan McCane, Michael Albert, and Steven Mills. 2013. Action recognition based on principal geodesic
analysis. In 2013 28th International Conference on Image and Vision Computing New Zealand (IVCNZ 2013). IEEE,
259–264.
[43]
Yun Fu and Thomas S Huang. 2007. hMouse: Head tracking driven virtual computer mouse. In Applications of
Computer Vision, 2007. WACV’07. IEEE Workshop on. IEEE, 30–30.
[44]
Luke Gane, Sarah Power, Azadeh Kushki, and Tom Chau. 2011. Thermal imaging of the periorbital regions during
the presentation of an auditory startle stimulus. PloS one 6, 11 (2011), e27268.
[45]
Liliana García, Ricardo Ron-Angevin, Bertrand Loubière, Loc Renault, Gwendal Le Masson, Véronique Lespinet-Najib,
and Jean Marc André. 2017. A comparison of a Brain-Computer Interface and an Eye tracker: is there a more
appropriate technology for controlling a virtual keyboard in an ALS patient?. In International Work-Conference on
Articial Neural Networks. Springer, 464–473.
[46]
Cindy Gevarter, Mark F O’Reilly, Laura Rojeski, Nicolette Sammarco, Russell Lang, Giulio E Lancioni, and Jeff Sigafoos.
2013. Comparisons of intervention components within augmentative and alternative communication systems for
individuals with developmental disabilities: A review of the literature. Research in developmental disabilities 34, 12
(2013), 4404–4414.
[47]
Sakher Ghanem, Christopher Conly, and Vassilis Athitsos. 2017. A survey on sign language recognition using smart-
phones. In Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments.
171–176.
[48]
Francisco Gomez-Donoso, Miguel Cazorla, Alberto Garcia-Garcia, and Jose Garcia-Rodriguez. 2016. Automatic
Schaeer’s gestures recognition system. Expert Systems 33, 5 (2016), 480–488.
[49]
Magdalena González, Débora Mulet, Elisa Perez, Carlos Soria, and Vicente Mut. 2010. Vision based interface: an
alternative tool for children with cerebral palsy. In Engineering in Medicine and Biology Society (EMBC), 2010 Annual
International Conference of the IEEE. IEEE, 5895–5898.
[50]
Kristen Grauman, Margrit Betke, Jonathan Lombardi, James Gips, and Gary R Bradski. 2003. Communication via eye
blinks and eyebrow raises: Video-based human-computer interfaces. Universal Access in the Information Society 2, 4
(2003), 359–373.
[51]
John Paulin Hansen, Kristian Tørning, Anders Sewerin Johansen, Kenji Itoh, and Hirotaka Aoki. 2004. Gaze typing
compared with input by head and hand. In Proceedings of the 2004 symposium on Eye tracking research & applications.
ACM, 131–138.
[52]
Helena Hemmingsson, Gunnar Ahlsten, Helena Wandin, Patrik Rytterström, and Maria Borgestig. 2018. Eye-Gaze
Control Technology as Early Intervention for a Non-Verbal Young Child with High Spinal Cord Injury: A Case Report.
Technologies 6, 1 (2018), 12.
[53]
Alexandre Felippeto Henzen and Percy Nohama. 2017. Facial Movements Detection Using Neural Networks and
Mpeg-7 Descriptors Applied to Alternative and Augmentative Communication Systems. In VII Latin American Congress
on Biomedical Engineering CLAIB 2016, Bucaramanga, Santander, Colombia, October 26th-28th, 2016. Springer, 626–629.
[54]
Berthold KP Horn and Brian G Schunck. 1981. Determining optical flow. Artificial Intelligence 17, 1-3 (1981), 185–203.
[55]
Anthony J Hornof and Anna Cavender. 2005. EyeDraw: enabling children with severe motor impairments to draw
with their eyes. In Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 161–170.
[56]
Chin-Pan Huang, Chaur-Heh Hsieh, Kuan-Ting Lai, and Wei-Yang Huang. 2011. Human action recognition using
histogram of oriented gradient of motion history image. In Instrumentation, Measurement, Computer, Communication
and Control, 2011 First International Conference on. IEEE, 353–356.
[57]
Chaudhary Muhammad Aqdus Ilyas, Mohammad A Haque, Matthias Rehm, Kamal Nasrollahi, and Thomas B Moeslund.
2017. Facial Expression Recognition for Traumatic Brain Injured Patients. In International Conference on Computer
Vision Theory and Applications. SCITEPRESS Digital Library.
[58]
Robert JK Jacob. 1991. The use of eye movements in human-computer interaction techniques: what you look at is
what you get. ACM Transactions on Information Systems (TOIS) 9, 2 (1991), 152–169.
[59]
Zhuolin Jiang, Zhe Lin, and Larry Davis. 2012. Recognizing human actions by learning and matching shape-motion
prototype trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 3 (2012), 533–547.
[60]
Shaun K Kane, Barbara Linam-Church, Kyle Althoff, and Denise McCall. 2012. What we talk about: designing a
context-aware communication tool for people with aphasia. In Proceedings of the 14th international ACM SIGACCESS
conference on Computers and accessibility. ACM, 49–56.
[61]
Intissar Khalifa, Ridha Ejbali, and Mourad Zaied. 2018. Hand motion modeling for psychology analysis in job interview
using optical flow-history motion image: OF-HMI. In Tenth International Conference on Machine Vision (ICMV 2017),
Vol. 10696. International Society for Optics and Photonics, 106962L.
[62]
Tomasz Kocejko, Adam Bujnowski, and Jerzy Wtorek. 2009. Eye-mouse for disabled. In Human-computer systems
interaction. Springer, 109–122.
[63]
Myron W Krueger, Thomas Gionfriddo, and Katrin Hinrichsen. 1985. VIDEOPLACE—an artificial reality. In ACM
SIGCHI Bulletin, Vol. 16. ACM, 35–40.
[64]
Andrew Kurauchi, Wenxin Feng, Carlos Morimoto, and Margrit Betke. 2015. HMAGIC: head movement and gaze
input cascaded pointing. In Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to
Assistive Environments. ACM, 47.
[65]
Denis Lalanne, Laurence Nigay, Peter Robinson, Jean Vanderdonckt, Jean-François Ladry, et al. 2009. Fusion engines
for multimodal input: a survey. In Proceedings of the 2009 international conference on Multimodal interfaces. ACM,
153–160.
[66]
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics
(1977), 159–174.
[67]
Mingyu Lee, Sung H Han, Hyun K Kim, and Hanul Bang. 2017. Identifying user experience elements for people with
disabilities. Presentation at ACHI: The Eighth International Conference on Advances in ...
[68]
Wouter Lemahieu and Bart Wyns. 2011. Low cost eye tracking for human-machine interfacing. Journal of Eye
Tracking, Visual Cognition and Emotion (2011).
[69]
Marco Leo, G Medioni, M Trivedi, Takeo Kanade, and Giovanni Maria Farinella. 2017. Computer vision for assistive
technologies. Computer Vision and Image Understanding 154 (2017), 1–15.
[70]
Brian Leung and Tom Chau. 2010. A multiple camera tongue switch for a child with severe spastic quadriplegic
cerebral palsy. Disability and Rehabilitation: Assistive Technology 5, 1 (2010), 58–68.
[71]
Yongqian Liu, Yuzhu He, and Weijia Cui. 2018. An improved SVM classier based on multi-verse optimizer for
fault diagnosis of autopilot. In 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control
Conference (IAEAC). IEEE, 941–944.
[72]
Yi Liu, Bu-Sung Lee, and Martin J McKeown. 2016. Robust eye-based dwell-free typing. International Journal of
Human–Computer Interaction 32, 9 (2016), 682–694.
[73]
Bruce D Lucas and Takeo Kanade. 1981. An iterative image registration technique with an application to stereo vision.
In Proceedings of the 7th International Joint Conference on Artificial Intelligence. Vancouver, BC, Canada.
[74]
Robert Gabriel Lupu, Radu Gabriel Bozomitu, Alexandru Păsărică, and Cristian Rotariu. 2017. Eye tracking user
interface for Internet access used in assistive technology. In E-Health and Bioengineering Conference (EHB), 2017. IEEE,
659–662.
[75] I Scott MacKenzie and Behrooz Ashtiani. 2009. BlinkWrite: efficient text entry using eye blinks. Universal Access in
the Information Society 10, 1 (2009), 69–80.
[76]
Cristina Manresa-Yee, Pere Ponsa, Javier Varona, and Francisco J Perales. 2010. User experience to improve the
usability of a vision-based interface. Interacting with Computers 22, 6 (2010), 594–605.
[77]
Xianbai Mao, Liheng Wang, and Changxi Li. 2008. SVM classier for analog fault diagnosis using fractal features. In
2008 Second International Symposium on Intelligent Information Technology Application, Vol. 2. IEEE, 553–557.
[78]
Joanna Marnik. 2014. BlinkMouse-On-Screen Mouse Controlled by Eye Blinks. In Information Technologies in
Biomedicine, Volume 4. Springer, 237–248.
[79]
João MS Martins, João MF Rodrigues, and Jaime AC Martins. 2015. Low-cost natural interface based on head
movements. Procedia Computer Science 67 (2015), 312–321.
[80]
Paulo Martins, Henrique Rodrigues, Tânia Rocha, Manuela Francisco, and Leonel Morgado. 2015. Accessible options
for deaf people in e-learning platforms: technology solutions for sign language translation. Procedia Computer Science
67 (2015), 263–272.
[81]
César Mauri, Toni Granollers, Jesús Lorés, and Mabel García. 2006. Computer vision interaction for people with severe
movement restrictions. Human Technology: An Interdisciplinary Journal on Humans in ICT Environments (2006).
[82]
Negar Memarian, Tom Chau, and Anastasios N Venetsanopoulos. 2009. Application of infrared thermal imaging in
rehabilitation engineering: Preliminary results. In Science and Technology for Humanity (TIC-STH), 2009 IEEE Toronto
International Conference. IEEE, 1–5.
[83]
Negar Memarian, Anastasios N Venetsanopoulos, and Tom Chau. 2009. Infrared thermography as an access pathway
for individuals with severe motor impairments. Journal of neuroengineering and rehabilitation 6, 1 (2009), 11.
[84]
Eric Missimer and Margrit Betke. 2010. Blink and wink detection for mouse pointer control. In Proceedings of the 3rd
International Conference on Pervasive Technologies Related to Assistive Environments. ACM, 23.
[85]
Aree A Mohammed. 2014. Efficient eye blink detection method for disabled-helping domain. Eye 10, P1
(2014), P2.
[86]
Laura Montanini, Enea Cippitelli, Ennio Gambi, and Susanna Spinsante. 2015. Low complexity head tracking on
portable android devices for real time message composition. Journal on Multimodal User Interfaces 9, 2 (2015), 141–151.
[87]
Inhyuk Moon, Kyunghoon Kim, Jeicheong Ryu, and Museong Mun. 2003. Face direction-based human-computer
interface using image observation and EMG signal for the disabled. In Robotics and Automation, 2003. Proceedings.
ICRA’03. IEEE International Conference on, Vol. 1. IEEE, 1515–1520.
[88]
K Morrison and S J McKenna. 2002. Automatic visual recognition of gestures made by motor-impaired computer
users. Technology and Disability 14, 4 (2002), 197–203.
[89]
K Morrison and S J McKenna. 2002. Contact-free recognition of user-defined gestures as a means of computer access
for the physically disabled. In Workshop on Universal Access and Assistive Technology. 99–103.
[90]
Giorgos Mountrakis, Jungho Im, and Caesar Ogole. 2011. Support vector machines in remote sensing: A review. ISPRS
Journal of Photogrammetry and Remote Sensing 66, 3 (2011), 247–259.
[91]
Cosmin Munteanu, Sharon Oviatt, Gerald Penn, and Randy Gomez. 2016. Designing Speech and Multimodal
Interactions for Mobile, Wearable, and Pervasive Applications. (2016), 3612–3619.
[92]
Masoomeh Nabati and Alireza Behrad. 2015. 3D Head pose estimation and camera mouse implementation using a
monocular video camera. Signal, Image and Video Processing 9, 1 (2015), 39–44.
[93]
Rizwan Ali Naqvi, Muhammad Arsalan, and Kang Ryoung Park. 2017. Fuzzy system-based target selection for a NIR
camera-based gaze tracker. Sensors 17, 4 (2017), 862.
[94]
Saeed Nasri, Alireza Behrad, and Farbod Razzazi. 2015. A novel approach for dynamic hand gesture recognition using
contour-based similarity images. International Journal of Computer Mathematics 92, 4 (2015), 662–685.
[95]
Saeed Nasri, Alireza Behrad, and Farbod Razzazi. 2015. Spatio-temporal 3D surface matching for hand gesture
recognition using ICP algorithm. Signal, Image and Video Processing 9, 5 (2015), 1205–1220.
[96]
Farhood Negin, Pau Rodriguez, Michal Koperski, Adlen Kerboua, Jordi Gonzàlez, Jeremy Bourgeois, Emmanuelle
Chapoulie, Philippe Robert, and Francois Bremond. 2018. PRAXIS: Towards Automatic Cognitive Assessment Using
Gesture Recognition. Expert Systems with Applications (2018).
[97]
Shuo Niu, Li Liu, and D Scott McCrickard. 2018. Tongue-able interfaces: Prototyping and evaluating camera based
tongue gesture input system. Smart Health (2018).
[98]
Redwan AK Noaman, Mohd Alauddin Mohd Ali, and Nasharuddin Zainal. 2017. Enhancing pedestrian detection
using optical flow for surveillance. International Journal of Computational Vision and Robotics 7, 1-2 (2017), 35–48.
[99]
Antti Oulasvirta and Kasper Hornbæk. 2016. HCI research as problem-solving. In Proceedings of the 2016 CHI
Conference on Human Factors in Computing Systems. ACM, 4956–4967.
[100]
Kaushik Parmar, Bhavin Mehta, and Rupali Sawant. 2012. Facial-feature based Human-Computer Interface for
disabled people. In Communication, Information & Computing Technology (ICCICT), 2012 International Conference on.
IEEE, 1–5.
[101]
Rupal Patel and Deb Roy. 1998. Teachable interfaces for individuals with dysarthric speech and severe physical
disabilities. In Proceedings of the AAAI Workshop on Integrating Artificial Intelligence and Assistive Technology. Citeseer,
40–47.
[102]
Jasmina Ivšac Pavliša, Marta Ljubešić, and Ivana Jerečić. 2012. The use of AAC with young children in Croatia–from
the speech and language pathologist’s view. In KES International Symposium on Agent and Multi-Agent Systems:
Technologies and Applications. Springer, 221–230.
[103]
Lishen Pei, Mao Ye, Pei Xu, Xuezhuan Zhao, and Tao Li. 2013. Multi-class action recognition based on inverted index
of action states. In 2013 IEEE International Conference on Image Processing. IEEE, 3562–3566.
[104]
Emanuele Perini, Simone Soria, Andrea Prati, and Rita Cucchiara. 2006. FaceMouse: A human-computer interface for
tetraplegic people. In European Conference on Computer Vision. Springer, 99–108.
[105]
E Pirani and Mahesh Kolte. 2010. Gesture based educational software for children with acquired brain injuries.
International Journal in Computer Science and Engineering 2, 3 (2010), 790–794.
[106]
Franz Puhretmair and Klaus Miesenberger. 2005. Making sense of accessibility in IT Design-usable accessibility vs.
accessible usability. In 16th International Workshop on Database and Expert Systems Applications (DEXA’05). IEEE,
861–865.
[107]
David M Roy, Marilyn Panayi, Roman Erenshteyn, Richard Foulds, and Robert Fawcus. 1994. Gestural human-machine
interaction for people with severe speech and motor impairment due to cerebral palsy. In Conference companion on
Human factors in computing systems. ACM, 313–314.
[108]
David M Roy, Marilyn Panayi, Richard Foulds, Roman Erenshteyn, William S Harwin, and Robert Fawcus. 1994. The
enhancement of interaction for people with severe speech and physical impairment through the computer recognition
of gesture and manipulation. Presence: Teleoperators & Virtual Environments 3, 3 (1994), 227–235.
[109]
David Rozado, Jason Niu, and Martin Lochner. 2017. Fast Human-Computer Interaction by Combining Gaze Pointing
and Face Gestures. ACM Transactions on Accessible Computing (TACCESS) 10, 3 (2017), 10.
[110]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal
of Computer Vision 115, 3 (2015), 211–252.
[111]
Sancho Salcedo-Sanz, José Luis Rojo-Álvarez, Manel Martínez-Ramón, and Gustavo Camps-Valls. 2014. Support
vector machines in engineering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
4, 3 (2014), 234–267.
[112]
Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal
of Big Data 6, 1 (2019), 60.
[113]
Tyler Simpson, Colin Broughton, Michel JA Gauthier, and Arthur Prochazka. 2008. Tooth-click control of a hands-free
computer interface. IEEE Transactions on Biomedical Engineering 55, 8 (2008), 2050–2056.
[114]
Piotr Stawicki, Felix Gembler, Aya Rezeika, and Ivan Volosyak. 2017. A novel hybrid mental spelling application
based on eye tracking and SSVEP-based BCI. Brain sciences 7, 4 (2017), 35.
[115]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition.
2818–2826.
[116]
Kentaro Toyama. 1998. Look, ma-no hands! hands-free cursor control with real-time 3d face tracking. Workshop on
Perceptual User Interfaces (1998).
[117]
Du-Ming Tsai, Wei-Yao Chiu, and Men-Han Lee. 2015. Optical ow-motion history image (OF-MHI) for action
recognition. Signal, Image and Video Processing 9, 8 (2015), 1897–1906.
[118]
Jilin Tu, Hai Tao, and Thomas Huang. 2007. Face as mouse through visual face tracking. Computer Vision and Image
Understanding 108, 1-2 (2007), 35–40.
[119]
Outi Tuisku, Veikko Surakka, Ville Rantanen, Toni Vanhala, and Jukka Lekkala. 2013. Text entry by gazing and
smiling. Advances in Human-Computer Interaction 2013 (2013), 1.
[120]
Maryam Vafadar and Alireza Behrad. 2008. Human hand gesture recognition using motion orientation histogram
for interaction of handicapped persons with computer. In International Conference on Image and Signal Processing.
Springer, 378–385.
[121]
Javier Varona, Cristina Manresa-Yee, and Francisco J Perales. 2008. Hands-free vision-based interface for computer
accessibility. Journal of Network and Computer Applications 31, 4 (2008), 357–374.
[122]
Mrs M Vidhya, P Poornima Devi, S Priscilla Emima, and G Revathi. 2016. Implementation of Bidirectional Voice
Communication between Normal and Deaf & Dumb Person. International Journal of Advanced Research Trends in
Engineering and Technology (IJARTET) (2016).
[123]
Jun Wan, Vassilis Athitsos, Pat Jangyodsuk, Hugo Jair Escalante, Qiuqi Ruan, and Isabelle Guyon. 2014. CSMMI:
Class-specic maximization of mutual information for action and gesture recognition. IEEE Transactions on Image
Processing 23, 7 (2014), 3152–3165.
[124]
Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. 2018. Packing convolutional neural networks in the frequency
domain. IEEE transactions on pattern analysis and machine intelligence (2018).
[125]
Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data 3, 1
(2016), 9.
[126]
Krishna Ferreira Xavier, Vinícius Kruger da Costa, Rafael Cunha Cardoso, Jamir Alves Peroba, Adriano Oliveira Lima
Ferreira, Marcelo Bender Machado, Tatiana Aires Tavares, and Andréia Sias Rodrigues. 2017. VisiUMouse: An
Ubiquitous Computer Vision Technology for People with Motor Disabilities. (2017), 115–118.
J. ACM, Vol. XX, No. X, Article XXX. Publication date: X 2020.
XXX:34 R. E. O. S. Ascari, et al.
[127]
Cristina Suemay Manresa Yee, Francisco Perales López, and Javier Varona Gómez. 2009. Advanced and natural
interaction system for motion-impaired users. Ph.D. Dissertation. PhD thesis, Departament de Ciencies Matematiques i
Informatica, Universitat de les Illes Balears, Spain.
[128]
I Yoda, K Ito, and T Nakayama. 2017. Modular Gesture Interface for People with Severe Motor Dysfunction: Foot
Recognition. Studies in health technology and informatics 242 (2017), 725–732.
[129]
Thorsten O Zander, Matti Gaertner, Christian Kothe, and Roman Vilimek. 2010. Combining eye gaze input with a
brain–computer interface for touchless human–computer interaction. Intl. Journal of Human–Computer Interaction
27, 1 (2010), 38–51.
[130]
Jiajia Zhang, Kun Shao, and Xing Luo. 2018. Small sample image recognition using improved convolutional neural
network. Journal of Visual Communication and Image Representation 55 (2018), 640–647.
[131]
Shasha Zhang, Weicun Zhang, and Yunluo Li. 2016. Human Action Recognition Based on Multifeature Fusion. In
Proceedings of 2016 Chinese Intelligent Systems Conference. Springer, 183–192.
[132]
Xiaoyi Zhang, Harish Kulkarni, and Meredith Ringel Morris. 2017. Smartphone-Based Gaze Gesture Communication
for People with Motor Disabilities. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.
ACM, 2878–2889.
J. ACM, Vol. XX, No. X, Article XXX. Publication date: X 2020.