Computer vision-based methodology to improve interaction
for people with motor and speech impairment
RÚBIA E. O. SCHULTZ ASCARI,Department of Informatics - UFPR and UTFPR, Brazil
ROBERTO PEREIRA, Department of Informatics - UFPR, Brazil
LUCIANO SILVA, Department of Informatics - UFPR, Brazil
Augmentative and Alternative Communication (AAC) aims to complement or replace spoken language to
compensate for expression difficulties faced by people with speech impairments. Computing systems have been
developed to support AAC; however, partially due to technical problems, poor interfaces, and limited interaction
functions, AAC systems are not widespread, adopted, and used, therefore reaching a limited audience. This
paper proposes a methodology to support AAC for people with motor impairments, using computer vision
and machine learning techniques to allow for personalized gestural interaction. The methodology was applied
in a pilot system used by both volunteers without disabilities, and by volunteers with motor and speech
impairments, to create datasets with personalized gestures. The created datasets and a public dataset were
used to evaluate the technologies employed for gesture recognition, namely the Support Vector Machine
(SVM) and Convolutional Neural Network (using Transfer Learning), and for motion representation, namely
the conventional Motion History Image and Optical Flow-Motion History Image (OF-MHI). Results obtained
from the estimation of prediction error using K-fold cross-validation suggest SVM associated with OF-MHI
presents slightly better results for gesture recognition. Results indicate the technical feasibility of the proposed
methodology, which uses a low-cost approach, and reveals the challenges and specific needs observed during
the experiment with the target audience.
CCS Concepts: • Human-centered computing → Human computer interaction (HCI); • Social and
professional topics → People with disabilities; • Computing methodologies → Motion capture.
Additional Key Words and Phrases: Assistive Technology, Augmentative and Alternative Communication,
Computer Vision, Gesture Recognition, Accessibility
ACM Reference Format:
Rúbia E. O. Schultz Ascari, Roberto Pereira, and Luciano Silva. 2020. Computer vision-based methodology
to improve interaction for people with motor and speech impairment. J. ACM XX, X, Article XXX (X 2020),
34 pages. https://doi.org/10.1145/1122445.1122456
1 INTRODUCTION
People with disabilities very often must deal with different barriers to participate in social and
economic life, requiring support from family members and caregivers, or the aid of technical
solutions that facilitate interaction with the environment and other people. Although computers
Corresponding author: rubia@utfpr.edu.br
Authors’ addresses: Rúbia E. O. Schultz Ascari, Department of Informatics - UFPR and UTFPR, Curitiba, Brazil, rubia@utfpr.
edu.br; Roberto Pereira, Department of Informatics - UFPR, Curitiba, Brazil, rpereira@inf.ufpr.br; Luciano Silva, Department
of Informatics - UFPR, Curitiba, Brazil, luciano@ufpr.br.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2020 Association for Computing Machinery.
0004-5411/2020/X-ARTXXX $15.00
https://doi.org/10.1145/1122445.1122456
are present in many aspects of daily life, computer systems still impose barriers on people
with disabilities, failing to offer the support they could be designed to offer.
Designing systems and interfaces for Assistive Technology (AT) is particularly challenging, as
there is no "average user" on which to base solutions that would work for users with specific and
diverse needs [91]. Selecting an AT requires maximizing the flow of information and minimizing
the effort (physical and mental) needed to use it [2]. When developing AT devices, end users
and their view of what an ideal solution means must be considered, finding the balance between
functionality, performance, ease of use, and aesthetics.
Speech impairment is a condition in which the ability to produce the speech sounds necessary
to communicate with others is compromised. People with speech impairments very often have
an associated motor disability, affecting their ability to interact with other people and with the
environment. Therefore, alternatives are needed for people who are totally or partially unable
to move or control their limbs, and who cannot rely solely on verbal communication. Offering
specific resources, services, strategies, and practices, AT aims to help people with disabilities to be
socially included, and become or remain independent.
Augmentative and Alternative Communication (AAC) refers to forms of communication that
complement or replace speech to compensate for speech difficulties by using intervention strategies
and non-verbal communication systems [46]. AAC mediated by computational applications enables
users with motor and speech impairments to access a computer, using it not only to express
themselves, but as an educational or training tool as well. Such possibilities may support people’s
communicative abilities, contributing to their training and learning [7].
There are many input devices and different technologies that open up new paradigms in Human-
Computer Interaction (HCI). Systems based on multimodal interaction provide extended possibilities
for users, and are able to adjust to the users' specific needs, making systems more flexible [65].
Ordinary computers and mobile phones, for instance, are equipped with cameras that favor Computer
Vision (CV) interfaces, providing another possibility for interaction via these devices. Easy access to
camera devices has allowed for the generation of new AT resources that do not involve expensive
or customized devices to accommodate special access needs, because they are software-based,
enabling cost reduction and improved availability as envisaged by Betke et al. [15]. Non-invasive
techniques based on CV allow for non-conventional interaction methods to be considered, including
the recognition of movement of the hands [94, 95], head [92] and other body parts to perform
actions on computer systems [78].
Gesture recognition allows people to interact with machines without the need for other
devices (e.g. a mouse or keyboard). This interaction mode is capable of dealing with the particularities
and limitations of each user's performance of a movement, thus being considered "natural", and
even intuitive, as people learn gestures from childhood [25]. Although solutions in gestural
interaction have become popular, their application for AAC still requires experiments to evaluate
these technologies, their possibilities, and limitations. Examples of applications are needed to
demonstrate the technical viability of gesture recognition, and to allow for the development of
low-cost solutions that attend to a diversity of people and their physical, cognitive, social, and
economic conditions.
Users with motor impairment may present very particular postures and involuntary movements,
as well as short-term fatigue and varying motor capacities that are challenging for AAC systems.
In order to generate a computational solution that takes into account the characteristics and the
diversity of its target audience, this paper presents a methodology to support the development
of AAC for people who have motor and speech disabilities, making use of CV techniques and
machine learning to enable personalized gestural interaction. The methodology can support people,
such as users and caregivers, to generate and update a customized set of gestures that will be used
to train a gestural-based interactive AAC system. Therefore, people may create a personalized
gesture language for communication purposes, taking into account their abilities and limitations
when performing movements, thus allowing for other people to recognize these movements. This
paper presents the methodology and the results obtained from the use of machine learning and
motion representation techniques to recognize gestures using a system developed based on the
methodology. Gestures were obtained from a public dataset, and from two controlled experiments
with teachers and students with different skills.
The constructive nature of this research requires a progressive and incremental strategy where
progress is evaluated and informs further steps of research and development. In [8], we introduced
the first version of the methodology and results from an exploratory evaluation with HCI experts
where a prototype was used for gesture recognition. Now, in this paper, we present the improved
methodology and results of a system developed based on it (an evolution of the prototype proposed
in [8]), where machine learning and motion representation techniques were applied to recognize
gestures. Gestures were obtained from a public dataset and from two controlled experiments with
teachers and students with different skills. Because of the intrinsic complexity of evaluating research
of a constructive nature, different evaluation strategies are needed to evaluate the methodology
and its application via computing technology. Therefore, the main contribution of this paper is
a multiple evaluation conducted in different steps, or stages, each with a different focus, and the
results obtained from each step, highlighting challenges and necessary improvements for both
the system and the methodology. The results presented in this paper have informed the research
progress and the system evolution: functional requirements and improvements are presented in [9],
characterizing the system as an Assistive Technology for AAC, and new features for the system,
including a game-based approach, are presented and evaluated in [10].
2 RELATED WORK
The eld of AAC includes research and the development of designs in education, systems, and prac-
tices, enabling the cooperation between several areas, and, therefore, requires a multidisciplinary
approach [
102
]. In the literature, dierent initiatives can be found to make AAC systems eectively
usable, and dierent proposals involving CV techniques can be found.
Krueger et al. [
63
] were one of the rst to exemplify the use of video to recognize hand movements
as an interaction mode. Jacob [
58
] investigated appearance-based interaction techniques into
real-time applications for people with disabilities. Jacob discussed some factors and technical
considerations for using eye movements as data input in interfaces for computing systems. Since
then several studies have been developed to improve the support for people with physical disabilities
by using AT, and interaction modalities based on the recognition of body movements. Table 1
presents research on gesture recognition that applies dierent devices and mostly CV techniques
for AT.
Studies presented in Table 1 are organized according to the type of device and the part of the human
body used for tracking, indicating the target audience of each study. As for the parts of the human
body used in each study, the type "Various" was included in the "Body Parts vs. Devices used"
column to refer to the use of two or more parts of the body as a visual signal, or to the tracking of
body movements in general. In Kane et al. [60], however, CV is applied for the identification of
context and location for AAC purposes, not for the identification or recognition of any part of the
users' body.
As presented in Table 1, simple devices such as webcams were used by several different initiatives
(43% of the presented papers — 34 papers), mainly because they represent a viable alternative
for detecting and tracking movements, especially due to their low cost. The same advantage
holds for mobile devices, which are increasingly accessible.
Table 1. Examples of related studies that focus on gesture recognition, organized by the body part tracked (superscript letters indicate the target audience; see the legend below).

Mouth/Tongue (5 studies): [70]n; [97]d; [82]a; [83]a; [5]i
Head/Face (14 studies): [86]f; [16]r; [116]a; [24]a; [81]n; [118]f; [121]a; [127]a; [6]a; [126]k; [53]f; [43]a; [12]a; [113]k
Nose/Nostrils (2 studies): [76]n; [35]a
Hands (11 studies): [122]g; [48]f; [13]a; [120]a; [105]m; [40]h; [30]a; [31]a; [101]f; [88]a; [89]a
Eyes (30 studies): [85]a; [132]j; [23]a; [26]j; [62]a; [84]a; [100]a; [68]a; [78]c; [131]a; [72]a; [93]a; [44]p; [58]a; [55]i; [75]i; [11]i; [17]i; [34]p; [20]q; [74]e; [52]l; [50]i; [3]i; [41]k; [87]i; [129]i; [19]a; [45]j; [114]i
Feet (1 study): [128]a
Various (15 studies): [79]a; [96]o; [15]a; [14]a; [104]r; [49]n; [119]a; [36]k; [64]a; [27]a; [108]n; [107]n; [51]a; [18]a; [109]a
None (1 study): [60]b

Totals by device type: mobile camera 5; depth camera 7; single camera or webcam 34; thermal camera 3; eye tracker 9; others/not informed 12; more than one device 9; total 79 studies.

Target audience: people with: a - physical disability; b - aphasia; c - spinal muscular atrophy; d - deficiency of dexterity; e - neuro-motor deficiency; f - motor and speech disabilities; g - hearing and speech difficulties; h - speech difficulties; i - severe motor difficulties; j - amyotrophic lateral sclerosis; k - upper limb motor disabilities; l - high spinal cord injury; m - acquired brain injury; n - cerebral palsy; o - cortical diseases (Alzheimer); p - Total Block Syndrome; q - advanced stage of multiple sclerosis; r - tetraplegia.
More recent work has also used depth data (7 papers since 2014) from devices such as a Kinect, a BumbleBee depth sensor, a monocular
infrared depth camera, and an image range sensor. Gaze detection/tracking research has also
received increased attention (11% of papers presented in Table 1 — 9 papers), possibly because eye
movements may be the only remaining movement some people with severe disabilities can control
voluntarily.
Tracking specic regions of the human body as a form of interaction with a specic target
audience tends to generate solutions more adapted to the diversity of users and their interests.
However, the accessibility of these solutions may fail if they do not allow users to eectively
adapt or customize solutions before they begin using them. Even when adaptation mechanisms are
provided, solutions must guarantee that users will be able to nd and use them.
Research aiming at developing successful sign language recognition, generation, and translation
systems is related to our study despite having deaf and hard of hearing people as its main target audience.
People with motor impairments, in general, present difficulties in performing movements,
and the correct execution of a sizeable predefined gesture set, such as that used in sign language, is a
challenge. Even so, the contributions obtained from studies aimed at sign language recognition
using Computer Vision can undoubtedly contribute positively to the development of technologies
aimed at people with motor and speech difficulties. Non-intrusive vision-based sign language
recognition is the currently dominant approach [22]; however, for Martins et al. [80], although
existing devices can easily capture gestures and expressions, they face some problems: the vast
number of gestures and the similarity between them; different sign languages due to culture, individual
social life, and the way gestures were taught; and the fact that the sequence of gestures to express a sentence
can be difficult to calculate, because it is difficult to detect where a gesture starts and ends and
where the next one begins. Thus, there are still some critical challenges to be solved. The studies of
Martins et al. [80], Ghanem et al. [47], and Bragg et al. [22] present key backgrounds, a review of the
state-of-the-art, a set of pressing challenges, and a call to action for research in this area.
Although they employ sensors rather than cameras to track users (a magnetic tracker and electromyography),
studies from Roy et al. [107, 108] show that people with a speech disability may be
able to perform gestures that are replicable, and that can be mapped into words or concepts. Due
to physical disabilities, these gestures may not follow any standardized form, or be recognized as
iconic representations. According to the authors, people with cerebral palsy are able to perform
actions or gestures with their arms that are recognizable by observers in their family; the authors also found
that, by encouraging free expression, the number and variety of different gestures which
can be performed by individuals is much greater than previously thought. When a person has
severe limitations regarding self-expression, the knowledge that observers (e.g. caregivers, family
members) have about an individual's ability to perform movements is fundamental to create a
personalized gesture language. The research we present in this paper aims to support work that
uses such information, allowing people with disabilities to interact with a computing system by
using an interaction language composed of their own gestures.
3 A METHODOLOGY TO SUPPORT AAC VIA PERSONALIZED GESTURAL
INTERACTION
Considering the literature presented previously, we have identified that initiatives usually focus on
specific situations and characteristics, offering little or no flexibility for people and their different
contexts of use, therefore requiring that people adapt themselves to the system instead of adapting
the system to people's different needs. Designing AAC systems for people who have communication
difficulties and motor impairment is a challenge as, regardless of the origin of motor problems,
people usually have very particular postures and involuntary movements that can sometimes be
uncontrollable, making it impossible to use several interfaces.
In this research, we investigate a methodology that can be used to design systems based on
gesture recognition in which gestures and their meanings are created and configured by users
and their caregivers. The methodology was conceived to enable the recognition of patterns in
gestural interaction, captured using a camera, and to be substantiated by different cameras or
complementary input devices (e.g., brain-computer interfaces or mobile device sensors) that enable
multimodal interaction with the AAC system.
Based on the Problem-Solving perspective [99] to describe research in HCI, the problem in this
research can be understood as having a mixed nature, with characteristics of both an empirical and a
constructive nature. Its empirical nature is due to the fact that experimentation is required to
test and describe the effects of a methodology designed to support AAC based on personalized
gestural interaction. It is constructive in the sense that it aggregates information to understand
the use of an AAC computer system by people with motor and communication impairments.
Figure 1 presents a scheme for the proposed methodology, using the Business Process Modeling
Notation, a graphical notation for business process modeling [37], showing the responsibilities for
the execution of activities, as well as how work flows across functions, or how functions transfer
the responsibility for an activity.
Fig. 1. Scheme for the proposed methodology, divided into four lanes that define responsibilities for the
execution of activities.
The scheme can be understood from a macro level, but depends on a series of manipulations
and specic processes performed at a micro level, whose specic steps have already been tried
and evaluated in a previous experiment with HCI experts [
8
]. The results obtained in this previous
experiment reinforced our perception that a methodology aimed at personalized gestural interaction
is feasible, and can be applied in an assistive context, increasing the possibilities for people with
motor disabilities to communicate by means of AAC systems.
A pilot system, named Personal Gesture Communication Assistant (PGCA), was developed to
analyze and evaluate the feasibility of the methodology, its potential, and its limits. The system's
first version was designed following the proposed methodology that allows for the creation of
personalized gestural interaction for AAC, and was evaluated by HCI experts in order to identify
usability and accessibility issues, as well as to validate its requirements before testing the
system with the target audience. Previous evaluations and experiments are needed before involving
the user, so as not to take a solution with problems and errors that could have been anticipated in a
lab test¹ to the field. As a result of the evaluation, technical limitations and interaction problems
were identified, as well as suggestions for interface improvements. The evaluation activity with
experts indicated the need for improvements before an experiment in a real context was possible
and productive, helping to anticipate problems that would make it difficult for the system to be
flexible and adaptable to each user's characteristics, or even that would prevent its use by people
with different limitations.
Figure 2 presents three interfaces of the system: A. Caregiver area, where datasets are created; B.
User area, where gesture recognition is used for interacting with the system and with communication
boards; and C. Communication boards area, where new boards can be generated by selecting images.
The user's interaction with the system, and the different techniques used for gesture
recognition and motion representation, are briefly described below.
3.1 Interaction with the PGCA system
When the system is started, a camera is automatically enabled to allow the user to perform the
calibration process, which consists of positioning the user in the capture center area of the camera,
thus facilitating the standardization of postures for recording and recognizing movements. After
completing this process, the caregiver can begin customizing the system using the specific guide
(caregiver area), which assists the user to record examples of gestures for training, as well as for
later use as a way of interacting with the system. Ideally, the methodology will allow any user to
independently customize the system through gestures. For this exploratory version of the system,
caregiver assistance is needed for the initial configuration of the system, and for recording and
labeling the gestures with the user. While the support of a caregiver will always be necessary when
people have more extreme disabilities, particularly cognitive disabilities that affect intentional
interaction, the system must be designed to allow its configuration and use by users with at least
one identifiable movement. All actions performed in the system through gestural interaction are
stored in a text file (log) in order to record relevant events during the interaction with the interface.
In the caregiver area (Figure 2 - A), images are recorded representing users' movement history
for each gesture, then becoming classes that can be labeled with words that will be used for
communication or interaction purposes (e.g.: Hi, Goodbye, Bathroom, Food, Water, Confirm, Undo).
After recording several representative pictures of the same gesture (the more samples in the dataset,
the better the results), the user can train the system via a button available at the bottom of the caregiver
area. The training process expands the dataset by Data Augmentation, as well as extracts and
classifies the features. Then the user can evaluate the system accuracy.
Fig. 2. Three main interfaces of the pilot system developed: A) Caregiver area, where datasets are created; B)
User area, where gesture recognition is used for interaction through the use of communication boards; C)
Communication boards area, where new boards can be generated by selecting images.

¹ Intensive testing and evaluations before evaluating a product with the target audience are mainly a matter of ethics, as
people with disabilities cannot be treated as subjects of research.

The user area (Figure 2 - B) can be used after system training. This area represents the main
interface with AAC functions, in which gesture interaction will allow for the execution of the
following functions: 1. Detection and representation of categories of gestures by writing the
corresponding label in a text box, emitting a sound (by means of synthesized voice) referring to
the word, and presenting a related image. 2. System configuration for personalizing navigation
functions, and relating gestures to functionalities. 3. Selecting whether the mode of navigation in
the communication boards is automatic (based on time) or manual (via gestures), varying between
communication boards of alphabetical characters or of several figures. 4. Selecting images when navigating
in a communication board, where it is possible to select a letter or a figure to show its related
description, and play the corresponding sound. 5. Simulating keyboard use via communication
boards for typing characters or commands, allowing its use as input for the PGCA interface, as well
as for other applications such as Internet browsers or text editors (see the sketch after the next paragraph).
The techniques used for data augmentation, gesture recognition, and motion representation are briefly described
below.
In the communication boards area (Figure 2 - C), the caregiver can create different communication
boards composed of images. Next, the user can select images from these boards for communication
purposes.
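Function 5 above maps recognized gestures onto simulated keystrokes so that other applications (text editors, Internet browsers) can be driven from the communication boards. The paper does not name the library used for key injection; the minimal sketch below assumes the pynput package, and the gesture-to-key mapping shown is hypothetical.

```python
from pynput.keyboard import Controller, Key

keyboard = Controller()

# Hypothetical mapping from recognized gesture labels to keys, in the spirit of
# function 5 above; pynput is our choice, not necessarily the one used by PGCA.
GESTURE_TO_KEY = {"No": Key.enter, "Intelligent": Key.enter}

def act_on_gesture(label):
    """Press the key the caregiver associated with the label, or type the label."""
    if label in GESTURE_TO_KEY:
        key = GESTURE_TO_KEY[label]
        keyboard.press(key)
        keyboard.release(key)
    else:
        keyboard.type(label + " ")  # write the word into the focused application
```

In the experiments reported in Section 5.3, for example, the "No" and "Intelligent" classes were associated with the ENTER key in this spirit.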
3.2 Data Augmentation
In many classication problems, the available data is insucient to train accurate and robust
classiers, being necessary to apply the data augmentation process [
39
]. Data augmentation
transforms the base data and increases the number of training data. The transformed images are
usually produced from the original images with very little computation and are generated during
training [
4
]. The augmented data will represent a more comprehensive set of possible data points,
thus minimizing the distance between the training and validation set, as well as any future testing
sets [
112
]. In the study of Alani et al. [
4
], data augmentation is initially applied, which shifts
images both horizontally and vertically to the extent of 20% of the original dimensions randomly,
to increase the size of the dataset numerically and to add the robustness needed for a deep learning
approach.
For this research, each sample training data is augmented, creating another eight variations, by
rotating and scaling the original image, aiming to simulate small changes in camera positioning
or distances that may occur when users interact with the system. Figure 3 shows an example of a
hand gesture represented by a dynamic gesture image, where the original (central) image is used to
generate eight additional images for enlarging the dataset used by the system. Variations employed
were -10 and 10 for the angle, and 0.9 and 1.1 for scale.
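As a minimal sketch of this step, assuming OpenCV (the function name and interpolation defaults are ours), the eight variations can be produced by combining the rotation angles {-10, 0, 10} with the scale factors {0.9, 1.0, 1.1} and discarding the identity combination:

```python
import cv2

def augment(image):
    """Create eight variations of one motion-representation image by rotating
    and scaling, simulating small changes in camera position or distance."""
    h, w = image.shape[:2]
    center = (w / 2, h / 2)
    variations = []
    for angle in (-10, 0, 10):          # degrees
        for scale in (0.9, 1.0, 1.1):   # zoom factors
            if angle == 0 and scale == 1.0:
                continue                # skip the unchanged original
            M = cv2.getRotationMatrix2D(center, angle, scale)
            variations.append(cv2.warpAffine(image, M, (w, h)))
    return variations                   # eight augmented images
```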
Fig. 3. Example of the application of Data Augmentation. From a representative dynamic gesture image,
eight other variations are generated by rotating and scaling operations.
3.3 Motion representation by Conventional MHI
For the pilot system, movements performed in front of a single camera are captured and represented
as a Motion History Image (MHI). Proposed originally by Davis and Bobick [21] [29], MHI is a global
spatio-temporal representation of motion that has been applied to motion analysis and tracking for
different purposes, such as gesture recognition [120] or human action recognition [56] [131]. MHI
converts the 3D space-time information from a video sequence into a single 2D intensity image. The
movements include information such as time and space, and the MHI image reflects not only the
position of a spatial action but also the movement order. In the MHI, a high fixed intensity is assigned
to a foreground pixel (moving object), while the intensity value is decreased by a small constant for
a background pixel [117]. The intensity value in the MHI records the history of temporal changes
in each pixel location. The MHI $H_\tau(x, y, t)$ is computed from an update function $\psi(x, y, t)$ described
by Davis and Bobick [29] in Equation (1):

$$H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } \psi(x, y, t) = 1 \ (\text{foreground}) \\ \max\big(0,\, H_\tau(x, y, t-1) - \delta\big), & \text{otherwise} \end{cases} \quad (1)$$

where (x, y, t) are the spatial coordinates (x, y) of an image pixel at a given time t (in terms of image
frame number). The duration τ determines the temporal extent of the movement in terms of frames,
and δ is the decay parameter. We used τ = 3 and δ = time stamp. $\psi(x, y, t)$ is defined in Equation
(2) as described by [38]:

$$\psi(x, y, t) = \begin{cases} 1, & \text{if } D(x, y, t) \geq \xi \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where ξ is a difference threshold and $D(x, y, t)$ is an image comprised of the pixel intensity difference between frames separated by
a temporal distance Δ, defined in Equation (3):

$$D(x, y, t) = |I(x, y, t) - I(x, y, t \pm \Delta)| \quad (3)$$

where I(x, y, t) is the intensity value of pixel (x, y) at the t-th frame of the image sequence. We
used the "updateMotionHistory" function available in the OpenCV (Open Source Computer Vision)
library to calculate the MHI.
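A minimal sketch of one MHI update step is shown below, assuming the opencv-contrib build in which the motion-template functions live in the cv2.motempl module; the silhouette threshold value is our assumption, since the paper does not report it.

```python
import cv2
import numpy as np

MHI_DURATION = 3      # tau: temporal extent in frames, as reported above
DIFF_THRESHOLD = 32   # silhouette threshold; assumed, not reported in the paper

def update_mhi(prev_gray, curr_gray, mhi, timestamp):
    """One MHI update: frame differencing -> binary silhouette -> motion history.
    `mhi` is a float32 image of the same size as the frames; `timestamp` counts frames."""
    diff = cv2.absdiff(curr_gray, prev_gray)                      # D(x, y, t)
    _, silhouette = cv2.threshold(diff, DIFF_THRESHOLD, 1,
                                  cv2.THRESH_BINARY)              # psi(x, y, t)
    cv2.motempl.updateMotionHistory(silhouette.astype(np.uint8), mhi,
                                    timestamp, MHI_DURATION)      # H_tau(x, y, t)
    return mhi
```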
3.4 Motion representation by Optical Flow based MHI
In a second phase, optical flow was evaluated to aggregate velocity information to
images that represent the motion history performed in front of the camera. Optical flow [54] [73]
denotes a shift in the same scene in an image sequence at a different time instant, estimating
pixel-level movement between two images. In conventional MHI, every detected foreground pixel is
assigned a fixed intensity value τ and speed differences are not considered: a slow movement
and a fast movement of different body parts will have the same motion strength [117]. Different
proposals have been presented to add velocity information to MHI by means of optical flow, such
as Tsai et al. [117], Fan and Tjahjadi [38], and Khalifa et al. [61]. A proposal similar to [38] was used for
this research, but in our system, a labeling algorithm (the "connectedComponentsWithStats" function
available in the OpenCV library) is applied to the silhouette obtained in different frames in order
to identify connected regions. Afterward, the Lucas-Kanade optical flow [73] is calculated for the
centroid pixels of each of these regions, and this displacement value is replicated to the other
pixels of the same region, in order to speed up the tracking process via optical flow. We used the
"calcOpticalFlowPyrLK" function available in the OpenCV library to calculate the Lucas-Kanade optical
flow, with parameters winSize (31 × 31), minEigThreshold (0.001), and default values for the other
parameters. When the "Zoom in" option is checked in the system settings, the facial landmarks are
entered as points to be tracked, highlighting and improving the perception of facial movements.
The resulting intensity value indicates a history of motion speeds at that location. The optical
flow-based MHI (OF-MHI) is defined in Equation (4), described by Fan and Tjahjadi [38]:

$$E(x, y, t) = s(x, y, t) + E(x, y, t-1) \cdot \alpha \quad (4)$$

where s(x, y, t) represents the optical flow length of pixel (x, y) at time frame t, and α is the
update rate used (0 < α < 1). The motion strength is given by the flow length s(x, y, t) for each
individual pixel (x, y). The intensity of a pixel is increased if it is a foreground point. A small value
of α creates an accelerated decrease in motion strength, and only the recent short-term movements
are retained in the temporal template. Larger values of α, in turn, will originate a long-term history
in the temporal template. We used 0.85 for the α parameter.
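A sketch of one OF-MHI update, under the description above: connected regions of the binary silhouette are labeled with connectedComponentsWithStats, Lucas-Kanade flow is estimated at each region centroid, the flow length is replicated to the region's pixels as s(x, y, t), and Equation (4) accumulates the result (how the silhouette is obtained is assumed to follow Section 3.3).

```python
import cv2
import numpy as np

ALPHA = 0.85                                               # update rate, as in the text
LK_PARAMS = dict(winSize=(31, 31), minEigThreshold=0.001)  # remaining parameters left at defaults

def update_of_mhi(prev_gray, curr_gray, silhouette, of_mhi):
    """One OF-MHI step. `silhouette` is a binary uint8 image; `of_mhi` is float32."""
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(silhouette)
    s = np.zeros(curr_gray.shape, np.float32)
    if num > 1:                                            # label 0 is the background
        pts = centroids[1:].astype(np.float32).reshape(-1, 1, 2)
        nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts,
                                                     None, **LK_PARAMS)
        lengths = np.linalg.norm(nxt - pts, axis=2).ravel()
        for label in range(1, num):
            if status[label - 1][0] == 1:
                s[labels == label] = lengths[label - 1]    # replicate flow length to the region
    return s + of_mhi * ALPHA                              # Equation (4)
```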
Figure 4 shows an example of motion representation using conventional MHI and OF-MHI. Both
images are displayed in grayscale, where the darker color represents the most recent movement. In
the OF-MHI, darker tones also represent regions where higher speed movements occurred.
Fig. 4. Examples of motion representation by means of MHI (A) and OF-MHI (B).
3.5 Gesture Recognition by HOG and SVM
A multiclass Support Vector Machine (SVM) discriminant classifier using a Radial Basis Function
(RBF) kernel, and the Histogram of Oriented Gradients (HOG) feature descriptor, were used to
recognize gestures created by the user and the caregiver. Discriminant classifiers are trained to
separate classes. SVM [111] is a linear binary classifier that assigns a given sample to one of only
two possible classes [90]: it separates data into two classes by learning a hyperplane in a higher
dimensional space. To address problems with multiple classes, SVM can be adapted by applying
other methods. For our research, the "one-versus-all" method was used: given an n-class problem,
a binary model is constructed for each class; the training set consists of examples of this class
as positive labels, and examples of the other classes as negative labels. HOG is a feature descriptor
used for object detection, obtained from an image gradient histogram. HOG represents structural
edge (gradient) features, and the quantization of spatial position and orientation can suppress the
influence of translation and rotation to some extent. For the system, the HOG was extracted from
the whole MHI or OF-MHI resized to 64 × 48, as exemplified in Figure 5, generating a feature vector
of 1260 positions. To extract features using the HOG descriptor we used the "HOGDescriptor"
function available in the OpenCV library, with parameters winSize (64, 48), blockSize (16, 16),
blockStride (8, 8), cellSize (8, 8), nbins (9), derivAperture (1), winSigma (4), histogramNormType (0),
and default values for the other parameters. Subsequently, the "hog.compute" function was used,
with parameters winStride (32 × 24) and padding (0, 0).
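The pipeline can be sketched as follows; the HOG parameters are those listed above (yielding 7 × 5 blocks × 4 cells × 9 bins = 1260 features per image), while the use of scikit-learn's one-versus-all wrapper around an RBF SVC is our assumption, since the paper does not state which SVM implementation was used.

```python
import cv2
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# HOG configured as in the text: a single 64x48 window, 16x16 blocks, 8x8 stride and
# cells, 9 bins -> (7 x 5 blocks) x (4 cells) x (9 bins) = 1260 features per image.
hog = cv2.HOGDescriptor((64, 48), (16, 16), (8, 8), (8, 8), 9)

def hog_features(gray_mhi):
    """gray_mhi: an 8-bit grayscale MHI or OF-MHI loaded from the stored samples."""
    resized = cv2.resize(gray_mhi, (64, 48))
    return hog.compute(resized, winStride=(32, 24), padding=(0, 0)).ravel()

def train_classifier(images, labels):
    """One-versus-all SVM with an RBF kernel over HOG features (sketch)."""
    X = np.array([hog_features(img) for img in images])
    clf = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
    clf.fit(X, labels)
    return clf
```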
Fig. 5. Schema used by SVM-based classifier.
3.6 Gesture Recognition by CNN
The Convolutional Neural Network (CNN) features can give a good description of image content;
thus the potential of deep learning by means of CNN was also explored. Therefore, feature extraction
based on the HOG descriptor with the SVM classifier was replaced by an automatic process
performed by CNN, which works directly with images, performing the feature extraction internally.
Figure 6 shows the scheme employed by the CNN-based classifier.
Fig. 6. Schema used by CNN-based classifier.
To train a CNN from scratch, a large and varied dataset is necessary and, since in our context
the number of samples is limited because each user creates his/her own dataset,
Transfer Learning could be a viable alternative to improve the learning mechanism in one domain
by transferring information from a related domain [125]. Therefore, we used the TensorFlow [1]
Inception V3 model [115] (a codename for a deep CNN architecture, originally trained on the ImageNet
dataset [110]) as the basis to retrain a custom set of images. Afterward, we applied Transfer
Learning by retraining Inception's final layer, for 4000 steps, with new categories in order to build a
custom image classifier according to the labels and gestures captured by the system's users. The whole
MHI, or OF-MHI, resized to 64 × 48, is used as input for the network after the data augmentation
process has been executed.
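A rough tf.keras equivalent of this retraining step is sketched below; the paper used the TensorFlow Inception V3 retraining scripts, so the frozen-base-plus-new-softmax-head construction, the optimizer, and the input resizing to 299 × 299 are our assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 8  # number of gesture classes in one user's dataset (example value)

def build_transfer_model():
    """Freeze an ImageNet-pretrained Inception V3 base and train only a new
    softmax head on the MHI/OF-MHI images (resized and replicated to 3 channels)."""
    base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                             input_shape=(299, 299, 3), pooling="avg")
    base.trainable = False                       # keep the pretrained features
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```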
3.7 Apparatus
The materials used for the experiments performed in all steps reported in this paper were a laptop
with 8GB of RAM and the webcam coupled to the laptop. Data collection took place in different
environments, with varying lighting conditions. When performing gestures, users were positioned
in front of the table on which the laptop was located. People with disabilities who used a wheelchair
stayed a little further away compared to users who could sit in a simple chair closer to the table.
The volunteers who participated in the first evaluation step performed data collection in their work
environment. Students who participated in the third step of system evaluation performed data
collection in their school environment.
Regarding the algorithms employed, we chose to use well-established algorithms, although
some approaches are not novel. We employed methods available in the OpenCV library (Open
Source Computer Vision Library) because it offers high computational efficiency, and simple use of
Computer Vision and Machine Learning infrastructures [28]. Using MHI images has advantages
related to simplicity, robustness in motion representation, and low computation. Lucas-Kanade
optical flow has very fast calculations and accurate time derivatives [98], and it proved to be
effective in aggregating velocity information from the movement into the MHI. The Support Vector
Machine (SVM) has the advantage of offering a strong generalization ability, a simple architecture, as
well as the ability to classify from few samples [71] [77]. CNN has shown growing popularity, partially
due to its success in image classification and other Computer Vision fields [33] [124]. Transfer
learning can take advantage of the experience acquired by a deep CNN pre-trained with a
large dataset for a specific task, and improve the performance of gesture recognition (our task)
with a small dataset composed of a restricted number of samples (our context).
4 EVALUATING THE PGCA SYSTEM
As the target audience of this research involves groups of users considered vulnerable, the project
was submitted for evaluation by the Research Ethics Committee of the University linked to this
study. The approval from the Committee provided the legal conditions for testing the system
with users from three co-participating institutions. After HCI experts evaluated the system under
laboratory conditions, tests with users without disabilities (Step 1), tests using a public dataset (Step
2), and tests with users with speech and motor impairments (Step 3) were planned and conducted.
Figure 7 represents the evaluation of the system, showing the datasets used and the objective of
each experiment, identified by steps.
For the experiment performed with users without disabilities, teachers from one of the participating
institutions were invited. Five people accepted the invitation and participated in the first
step of data collection. The objective of this step was to evaluate whether the system would be able
to recognize personalized gestures when trained with few samples. Subsequently, aiming to identify
the best strategies for a new version of the PGCA system, two classifiers for gesture recognition
and two motion representations were evaluated according to their performance. This evaluation
step was conducted using the Keck Gesture Dataset, a public dataset available in [59].
Fig. 7. Experiments carried out to evaluate the PGCA system.
Finally, during a third step, after analyzing the results of the previous tests, improvements were
implemented in the PGCA system and an experiment with the target audience was conducted. The
main objective of the experiment was to verify whether the PGCA system would support the target
audience in generating a customized dataset, and also to analyze whether our system is robust and
eective for communication purposes, taking into consideration the possible limitations of use in
daily life by people with dierent disabilities.
5 RESULTS
This section describes the experiments conducted to evaluate the proposed system, and the main
results obtained. Images pertaining to datasets created by volunteer teachers and students are
presented only in the form of MHI or OF-MHI to maintain the participants' anonymity.
5.1 Step 1 - Evaluation of machine learning techniques using datasets created by
volunteers without disabilities
For the rst experiment, ve volunteers (people with no motor and speech impairment) were invited
to create datasets composed of six to eight dierent gestures, with labels dened by the volunteers
themselves. The researcher who conducted the experiment played the role of a caregiver. The
experiment was designed to evaluate the accuracy of the classier regarding gesture recognition.
For this evaluation step, only the Caregiver Area and the User Area were available for use in
the PGCA system. Only the traditional MHI was implemented, and the captured samples were
registered only as images. There was no possibility to store videos of performed movements. The
Caregiver Area did not provide any form to validating the captured samples, allowing the creation
of data sets of personalized gestures, training, and system evaluation. The User Area was used only
to test the recognition of the gestures for which the system was trained.
Volunteers P1 and P2 created datasets with eight distinct classes, registering twenty samples by
class. Volunteer P3 created a dataset with eight distinct classes, registering fteen samples per class.
Volunteers P4 and P5 created datasets with six distinct classes. Volunteer P4 registered twenty
samples per class, and Volunteer P5 registered fifteen samples per class. Figures 8 and 9 present
examples of samples generated by volunteers to compose each dataset. Volunteers P1 and P2 used
the option "Zoom in", available in the system settings, highlighting and improving the perception
of facial movements.
Fig. 8. Gesture samples performed by Volunteers P1, P2 and P3.
During the analysis of the data captured in evaluation step 1, the need for a change in the
methodology was identified: the methodology initially [8] performed the data augmentation process before separating
the original data into training and test data. This situation could generate an overly optimistic performance
evaluation. Therefore, the methodology was updated to divide the original dataset into training
and test data, and to perform the data augmentation process only on the training datasets. This
change was made before evaluating the system performance on the datasets created by volunteers
without disabilities.
To evaluate the performance of the classifiers, the prediction error was estimated by K-fold
cross-validation with ten folds, separating 90% of the data for training and 10% for
testing. The quantity of training data was expanded by Data Augmentation, where additional
samples were created from existing data.
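The revised procedure can be sketched as follows, reusing the hypothetical augment routine from Section 3.2; the point is that augmentation happens inside each fold, on the training portion only, so augmented copies of a test sample never leak into training. The train_and_score callback stands in for the classifiers described in Section 3.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_estimate(images, labels, train_and_score, augment, k=10):
    """Split the ORIGINAL samples first, then augment only the training folds.
    `train_and_score(X_tr, y_tr, X_te, y_te)` returns one accuracy per fold."""
    labels = np.asarray(labels)
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=k, shuffle=True).split(images, labels):
        X_train, y_train = [], []
        for i in train_idx:
            X_train.append(images[i]); y_train.append(labels[i])
            for variation in augment(images[i]):      # augmentation on training data only
                X_train.append(variation); y_train.append(labels[i])
        X_test = [images[i] for i in test_idx]
        scores.append(train_and_score(X_train, y_train, X_test, labels[test_idx]))
    return np.mean(scores), np.std(scores)
```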
Fig. 9. Gesture samples performed by Volunteers P4 and P5.
After running tests on all folds, the overall accuracy (weighted average), standard deviation,
variance, and Cohen's kappa (a statistical measure of inter-rater agreement) were calculated for
each of them. HOG + SVM and CNN were the machine learning techniques used. Results obtained
for the datasets generated by the five volunteers are presented in Table 3, where the two learning methods
are compared.
Table 3. Volunteers' datasets - machine learning method comparison: overall accuracy, Cohen's kappa, standard deviation, and variance.

             HOG + SVM                              CNN
Volunteer    Acc.    Cohen k   Std dev.  Var.       Acc.    Cohen k   Std dev.  Var.
P1 - MHI     0.981   0.979     0.04      0.00180    0.994   0.993     0.02      0.0004
P2 - MHI     0.981   0.978     0.02      0.00090    0.987   0.985     0.02      0.0007
P3 - MHI     0.975   0.971     0.07      0.00620    0.941   0.936     0.07      0.0069
P4 - MHI     0.974   0.970     0.03      0.00160    0.974   0.970     0.03      0.0016
P5 - MHI     0.988   0.986     0.02      0.00006    1       1         0         0
Typically, a perfect classification would produce a variance and standard deviation of zero, and an
accuracy and kappa value of one. According to the Landis and Koch criteria [66] for the interpretation
of the kappa value: 0.0 to 0.2 = slight agreement, 0.2 to 0.4 = fair agreement, 0.4 to 0.6 = moderate
agreement, 0.6 to 0.8 = substantial agreement, and 0.8 to 1.0 = almost perfect agreement.
In this experiment, the classifiers presented satisfactory results, since the average accuracy
obtained on all datasets was high (higher than 0.94), low standard deviation and variance were
observed, and kappa values indicated almost perfect agreement. The CNN-based classifier presented
slightly better accuracy in comparison to the SVM-based classifier in four of the five datasets used.
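One plausible reading of how the per-fold results are aggregated into the quantities reported in Table 3 is sketched below, assuming scikit-learn's metric functions; whether kappa is computed over pooled folds or averaged per fold is not stated in the text, so this is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

def summarize_folds(fold_true, fold_pred):
    """Aggregate per-fold predictions: weighted-average accuracy, Cohen's kappa,
    and the standard deviation and variance of the per-fold accuracies."""
    accs = [accuracy_score(t, p) for t, p in zip(fold_true, fold_pred)]
    sizes = [len(t) for t in fold_true]
    pooled_true = np.concatenate(fold_true)
    pooled_pred = np.concatenate(fold_pred)
    return {"accuracy": float(np.average(accs, weights=sizes)),
            "cohen_kappa": cohen_kappa_score(pooled_true, pooled_pred),
            "std_dev": float(np.std(accs)),
            "variance": float(np.var(accs))}
```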
5.2 Step 2 - Evaluation of machine learning techniques and motion representations
using Keck Gesture Dataset
In the second evaluation step, the public Keck Gesture Dataset was used to evaluate the
system's performance using two classifiers (HOG + SVM and CNN) and two distinct motion
representations (conventional MHI and OF-MHI). The Keck Gesture Dataset is composed of fourteen
distinct gestures, performed by three people in front of a static background. For each gesture,
each person performs three repetitions. Examples of motion representations generated for each
of the fourteen gesture classes available in the Keck Gesture Dataset are presented in Figure 10. For
this step, besides all features available previously in the PGCA system, a second form of motion
representation was included: the optical flow-based motion history image (OF-MHI).
Fig. 10. Samples of gestures from the Keck Gesture Dataset represented by Conventional MHI and Optical
Flow based MHI. MHI images were resized to give emphasis to moving regions.
For the evaluation of this dataset, nine samples were available for each of the fourteen classes
existing in the dataset: a total of one hundred and twenty-six original samples. The "Zoom in"
option was used to enlarge the regions where movements are performed.
Results obtained by K-fold cross-validation with nine folds are presented in Table 4. Each fold
contained one sample per class for testing, and seventy-two samples per class for training (after
running the Data Augmentation process).
There are different works aimed at recognizing gestures, actions, or images in which the Keck
Gesture Dataset was used to assess the accuracy of classifiers or methods used.
Table 4. Keck Gesture Dataset - machine learning and motion representation method comparison: overall accuracy, Cohen's kappa, standard deviation, and variance.

Keck Dataset   HOG + SVM                            CNN
               Acc.   Cohen k   Std dev.  Var.      Acc.   Cohen k   Std dev.  Var.
MHI            0.88   0.87      0.04      0.0025    0.90   0.89      0.05      0.0038
OF-MHI         0.89   0.88      0.04      0.0026    0.87   0.86      0.07      0.0060
As no study was found using precisely the same form of evaluation employed by us (the K-fold
cross-validation procedure described above), a direct comparison of the classifiers' performance was not conducted. However, we
consider it worth mentioning some relevant works and the results obtained with the Keck Gesture
Dataset. For example, in the study of Pei et al. [103], a fast inverted index-based algorithm is
introduced for multi-class action recognition; results presented using the proposed method indicate
an accuracy of up to 89.88%. Fu et al. [42] considered the action recognition problem based on
geometrical structure; the method proposed by the authors uses a low-dimensional structure on the
Grassmannian manifold to represent video sequences by using the linear structure of the tangent
space, and presented a recognition accuracy of 93.4%. Wan et al. [123] presented a class-specific
dictionary learning approach via information theory for action and gesture recognition, and the
recognition accuracy achieved is up to 95.1%. The study of Zhang et al. [130] introduced a hybrid
model based on CNN for image classification, and the results indicate an accuracy of up to 93.15%.
In our experiment, both classifiers presented satisfactory results, using the MHI as well as the OF-MHI.
The classification using the two datasets created from the Keck Gesture Dataset presented
valid accuracy, and statistical data with few variations. Next, the two classifiers and the two motion
representations were evaluated once again during the following experiment, conducted with the
target audience.
5.3 Step 3 - Evaluation of the methodology using datasets created by students with
motor and speech impairments
For evaluating the system with the target audience, several improvements were introduced in the
system: a) storage of the video referring to the movements used to create the datasets; b) a guide
for creating picture communication boards; c) new configuration options for simulating the use of
the keyboard; d) visualization in video form of the movement related to selected gestures in the
configuration screen; e) the possibility to choose different communication boards in the User Area; f)
registration of new information regarding the main actions performed on the system interface in a
log file.
For the tests with the target audience, visits were made to four schools in the co-participating
institutions: one of them is a specialized educational institution for students with disabilities, and
the others are public schools where there are students with disabilities in mainstream education.
After the researchers met with several students with disabilities registered in the participating
institutions, a first selection was made looking for the students who would have a greater comprehension
capacity, and the ability to carry out voluntary movements, according to the perception of
the teachers who accompany them daily. In the four schools, after conducting interviews with the
teachers, support teachers, or LIBRAS (Brazilian sign language) interpreters, seven students with
characteristics considered desirable for participation in the experiment were identified (i.e. people with
motor and speech disabilities and without significant cognitive limitation). One student, among those
selected, was not authorized by the family to participate.
All selected students are characterized as people with cerebral palsy, with different levels of
disabilities. Table 5 describes some of the characteristics of the participating students.
Table 5. Characteristics of students with motor and speech impairments who participated in the experiment.

Student A (M, 18 years old). Medical report: cerebral palsy due to sequelae from complications during labor. Voluntary movements: head movements.
Student B (F, 29 years old). Medical report: brain damage and discrete hydrocephalus. Voluntary movements: movements of head and hands.
Student C (M, 38 years old). Medical report: quadriplegia with athetosis component, bilateral sensorineural hearing loss. Voluntary movements: movements of head and hands.
Student D (F, 20 years old). Medical report: pseudobulbar palsy; generalized hypotonia and hyperreflexia. Voluntary movements: movements of head and hands, facial expressions.
Student E (F, 18 years old). Medical report: static encephalopathy and spastic quadriplegia. Voluntary movements: head movements, facial expressions.
Student F (M, 18 years old). Medical report: static encephalitis, epilepsy, and Rubinstein-Taybi Syndrome. Voluntary movements: movements of head and hands.

Sex: M - Male; F - Female.
Each student who participated in this experiment was accompanied by a teacher who played
the "caregiver user" role in the system, informing the gestures that the student usually uses to
communicate, and the meaning of each of these gestures. Therefore, tasks conducted with the help
of these teachers allowed us to evaluate the system regarding the creation of a dataset with
gestures personalized for each student. The tasks expected to be performed during the experiment
were: 1. creating the dataset by capturing gestures for training the system; 2. training and evaluating
the system; 3. using the system to recognize gestures; 4. using gestures to select images in the
communication board; and 5. using gestures for interacting with the text editor or Internet browser.
5.3.1 Student A. Student A uses only two head gestures in the school environment to communicate
with his/her classmates and teachers, which refer to "Yes" and "No". This student has a very preserved
capacity for comprehension. Data collection for the system training and the interaction tests with
the interface were performed during two different sessions. For this student, the system was
configured to use the "Zoom in" option in order to better capture facial movements. Student A
presented some involuntary movements during the interaction with the system, leaning his leg or
arm on the table where the computer and the camera were arranged, generating some samples
with information about the back of the room, considered as noise. The dataset created with this
student was composed of two classes; fifteen samples from each class were considered valid for
training the system. After training the system, the first step for testing interaction with the interface
looked at whether the system could recognize the gestures for which it had been trained. The
two gestures were correctly identified by the system when performed voluntarily by the student.
Subsequently, the system was configured by associating the "Yes" class with the system's confirm
option. This configuration allowed for the testing of the communication board, and whether the
student could write words related to specific requests, such as "Sleep", "Bath", "Food", and others.
Next, the system was configured by associating the "No" class with the ENTER key, and the option
"simulate keyboard" was selected, allowing the use of other applications. This specific configuration
allowed the user to select images on a communication board in order to write words directly
into a text editor, simulating the pressing of the keyboard's ENTER key when performing the "No"
gesture. Some false positives occurred, which were later corrected by adjusting the confidence level
of the recognition system.
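Adjusting the confidence level can be read as rejecting predictions whose top-class probability falls below a threshold; a sketch built on the probabilistic classifier from the Section 3.5 example is shown below, with a hypothetical threshold value.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.80   # hypothetical value; the paper does not report the one used

def recognize(clf, feature_vector):
    """Return the predicted gesture label, or None when the top-class probability
    is below the threshold, so that uncertain movements trigger no action."""
    probs = clf.predict_proba([feature_vector])[0]
    best = int(np.argmax(probs))
    if probs[best] < CONFIDENCE_THRESHOLD:
        return None
    return clf.classes_[best]
```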
5.3.2 Student B. Student B does not present significant motor impairment and uses hand and head
gestures in the school environment. However, she avoids interacting with unknown people. She
has some difficulties understanding and is very shy, demanding constant encouragement from the
teacher who accompanied her when performing the gestures. Data collection for system training
was performed during a session that happened in one day, and interaction tests with the interface
were performed during another session three days later. The "Zoom in" option was disabled in the
system. Her customized dataset was composed of five classes with thirteen samples each. After
training the system, during the interaction test, the gestures referring to "Yes", "Fast", and "Bunny"
(because of the Easter Bunny) were correctly recognized. The gestures referring to "Food" and "No"
generated some erroneous interpretations, but this did not prevent interaction with the system.
Subsequently, the system was configured by associating the "Yes" class with the system's confirm
option, and the student selected images related to handicrafts on a communication board, as this is
a subject of interest to the student, according to her teacher. This board was used to write words
associated with each figure in the system interface when the student performed the "Yes" gesture.
In addition, a communication board composed of images of vowels was used for the student to
indicate the first letter of the word "Elephant". With the teacher's help, the student made the "Yes"
gesture to select the figure referring to the letter "E" on the system interface, and the corresponding
sound was emitted. Because the student demanded a lot of intervention from the teacher, and had
little initiative to select pictures on her own, no other interaction test was performed.
5.3.3 Student C. Student C presents motor and speech impairment, and severe hearing loss. He
uses hand and head gestures in the school environment. He uses some gestures from LIBRAS as well
as signs from home, but because of motor impairment in the hands, not all LIBRAS interpreters can
understand his communication intentions. Student C has a preserved capacity for understanding,
and, for the execution of the experiment, he was accompanied by his LIBRAS interpreter and
caregiver. Data collection for system training and interaction tests were performed during two
dierent sessions. The "Zoom in" option was disabled. Initially, the dataset created by this student
was composed of ten classes, with twelve samples each. After training the system, during the
interaction test, some gestures characterized with similar movements were being confused by the
classier, such as the gestures "Yes", "I", "Mom", and "Water". The stored videos and samples related
to these gestures showed the movements for these gestures are very similar to each other, with
variations only in the position of the ngers or hand and, probably because of this, the training
performed was not enough for the system to recognize the dierence in these signs. Therefore,
during another session one month later, a second dataset was created containing only seven gestures
with seventeen samples each, leaving out the gestures "I", "Mom", and " Water ", and keeping the"
Yes "gesture. With the new dataset, most of the gestures were correctly recognized by the system,
except the gesture referring to "Bathroom" that was not recognized in some situations. Subsequently,
the system was congured by associating the "Yes" class with the system’s conrm option, allowing
for the student to test the image selection on the communication boards. Next, the system was
congured by associating the "Intelligent" class with the ENTER key, and the "simulate keyboard"
option was selected. Then, the student selected images on a communication board to write words
in a text editor, and to simulate the use of the ENTER key. It was also possible for the student to
select a gure on a communication board with keyword options in order to search in an Internet
browser, writing directly into the browser URL, and simulating the ENTER key to search for the
keyword. The interaction test was nalized after this step. However, following the same interaction
structure, other gestures could be associated with the TAB key to navigate between the search
results, and to select the desired page using the ENTER key simulation.
5.3.4 Student D. This student uses head gestures and facial expressions in the school environment,
has a well-preserved capacity for comprehension, and is able to voluntarily move the right arm
despite many spastic movements. The gesture referring to "Yes" is the raising of the eyebrows.
However, since the gesture has a lot of associated head movement, the system was not able to
correctly register the movement of this facial expression. In order to create a dataset for this student
to interact with the system, we chose to capture the movement referring to "No" (moving the
head to both sides), and the movement referring to "Hand" (moving his right arm). Data collection
was performed during one session, and the interaction tests during another session five days later.
Ten samples were considered valid for each class in the dataset. After training the system, the
two gestures were correctly identified by the system during the interaction test. Subsequently,
the system was configured by associating the "Hand" class with the system's confirm option,
and the "No" class with the ENTER key. The same interaction tests with the Internet browser
and text editor performed by Students A and C were performed by this student. However, some
involuntary movements occurred with the arm used to make selections, and, unintentionally, the
system selected items on the boards several times. According to the accompanying teacher, as the
student is accustomed to using the eyebrows to give affirmative answers, using the arm is still a
challenging task for this student, and would require more training.
5.3.5 Student E. Student E uses only small head gestures and facial expressions to expose her
communication intentions in the school environment. The teachers expressed doubts about the
student's level of understanding. Two attempts were made to collect data with this student during two
different sessions on different days. However, this student makes very restricted head movements;
for the same intention to communicate "Yes", she sometimes moved her head and sometimes she
just smiled. According to the teachers, when eating, the student puts her tongue out to indicate
she does not want a specific food. However, during the experiment, this same gesture referring to
"No" was never performed by the student, even after several attempts by the teacher, who asked
questions seeking a negative response from the student. Therefore, the experiment with Student
E was finalized. The experiment was designed to obtain images during explicit training, when
the user answers the caregiver's questions. A feature that might be added in the future is for the
system to learn from an annotated video in which the caregiver indicates the meaning of a student's
gestures and expressions.
5.3.6 Student F. Student F has progressively lost motor functions, and uses hand and head
gestures in the school environment, mainly pointing to objects of interest. According to the
teachers, the student has a well-preserved comprehension capacity. The first author, who conducted
the experiment, participated in some classes with the student, observing his gestural interaction.
However, it was not possible to create a dataset for this student, even though two attempts were made
to collect data on different dates and with the support of different teachers. During both sessions,
despite having the necessary motor conditions, the student showed no interest in executing the
gestures when requested. After the second attempt to capture his gestures, the researcher asked
the student whether he did not want to be filmed, and the student emitted a sound (considering his speech
limitations) which was understood as a "No". The data collection session was then terminated.
Taking into account the situations reported when creating customized datasets with the selected
students, it was possible to create datasets with gestures performed in a personalized manner by
four of the seven selected students. Figures 11, 12, and 13 present examples of samples generated
by students to compose each dataset.
To evaluate the classifiers' performance, the prediction error was estimated using K-fold cross-
validation with ten folds, separating 90% of the data for training and 10% for testing.
The number of samples used for composing the dataset created by each student varied according
to the number of gestures captured and considered valid by the first author, who carried
out the experiment. That is, after a series of records of movements performed by the students, only
the samples considered similar to each other (correctly representative of the same class) were used,
while the others were discarded. Results obtained by K-fold cross-validation are presented in Table 6.
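As an illustration, the sketch below shows one way the reported per-dataset metrics (overall accuracy, Cohen's kappa, standard deviation, and variance) could be computed with ten-fold cross-validation over HOG features and an SVM. It is a minimal sketch, assuming scikit-image and scikit-learn and equally sized grayscale motion images; parameter values are illustrative and not necessarily those used in the PGCA system.

# Hypothetical evaluation sketch; not the exact PGCA implementation.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate_hog_svm(images, labels, n_splits=10):
    """images: list of equally sized 2-D grayscale MHI/OF-MHI arrays; labels: class names."""
    X = np.array([hog(img, pixels_per_cell=(16, 16), cells_per_block=(2, 2)) for img in images])
    y = np.array(labels)
    accs, kappas = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):      # roughly 90% train / 10% test per fold
        clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        kappas.append(cohen_kappa_score(y[test_idx], pred))
    return np.mean(accs), np.mean(kappas), np.std(accs), np.var(accs)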
Fig. 11. Gesture samples by Students A and D, represented by MHI and OF-MHI.
Fig. 12. Student B’s gesture samples represented by MHI and OF-MHI.
Fig. 13. Student C’s gesture samples represented by MHI and OF-MHI.
During the experiment to test the PGCA system for interaction, the OF-MHI option for motion
representation and the CNN-based classifier for gesture recognition were used.
However, for each gesture captured by the system and represented in an image, the video of the
movement was also stored, which later allowed for the simulation of gesture recognition using the
two forms of motion representation and the two classifiers evaluated during previous experiments.
Table 6. Results obtained with the students' datasets using two machine learning techniques (HOG + SVM
and CNN) and two motion representations (conventional MHI and OF-MHI). Method comparison: overall
accuracy (Acc.), Cohen's kappa (Cohen k), standard deviation (Std dev.), and variance (Var.).

                         HOG + SVM                                CNN
Student        Acc.   Cohen k   Std dev.   Var.        Acc.   Cohen k   Std dev.   Var.
A    MHI       1      1         0          0           0.93   0.86      0.10       0.011
     OF-MHI    1      1         0          0           0.97   0.96      0.07       0.006
B    MHI       0.83   0.79      0.11       0.013       0.84   0.80      0.11       0.013
     OF-MHI    0.86   0.83      0.09       0.009       0.78   0.73      0.14       0.021
C1   MHI       0.84   0.82      0.13       0.020       0.68   0.64      0.15       0.026
     OF-MHI    0.87   0.86      0.13       0.019       0.73   0.70      0.10       0.012
C2   MHI       0.87   0.85      0.14       0.023       0.76   0.72      0.09       0.010
     OF-MHI    0.87   0.86      0.13       0.019       0.87   0.84      0.11       0.015
D    MHI       0.96   0.90      0.15       0.025       0.90   0.80      0.20       0.044
     OF-MHI    1      1         0          0           1      1         0          0
As a result of this experiment, we observed that people with motor and speech impairments can
generate a customized gesture dataset and train a system to recognize these gestures. For gesture
recognition, the SVM-based classifier associated with the OF-MHI motion representation presented
better overall performance than the CNN-based classifier and the MHI motion representation. In
addition, some challenges and issues to be improved in the methodology and in the PGCA system
were identified, which are described in the Discussion section. For instance, we identified the need
to improve the system's usability and accessibility, to ensure the quality of the captured samples, and
to consider variations in each user's level of understanding.
Motor disability is a condition that generates very particular skills and limitations, requiring
a personalized approach. This condition was evident in the heterogeneity of the participants in
this assessment step. Therefore, the key point here is not finding an average result or going very
deeply into the subjectivities of each participant, but identifying whether the system (and, consequently,
the methodology behind it) is capable of supporting participants, in their diversity of skills and
limitations, to create, train, and use personalized gesture datasets. Results presented in terms of
gesture recognition accuracy allowed us to observe the viability of using the PGCA system with people with
very different motor skills performing similar tasks, each in their own way. Although it can give us
information about the tool's performance, we do not intend to compare the accuracy obtained
across the different datasets created, since the difficulties faced and the effort required of each
student to perform the tasks varied widely. The various attempts to use the tool by students with
cerebral palsy allowed us to see that personalized gestural interaction is a promising path to be
explored for augmentative and alternative communication for this audience, although challenges
and future improvements are needed and have been identified.
Details about results from step 3 (considering only the use of the OF-MHI motion representation)
and from interviews with special education professionals were used to improve the system and can
be found in [9].
6 DISCUSSION
Most of the relevant studies found in the literature used low-cost solutions for image acquisition,
and explored the possibilities of adapting resources already available in the computers used by most
people. Prioritizing low-cost solutions is necessary to reach a wider audience unable to afford
high-cost devices. In our research, the availability of the resource for the target audience motivated
using a simple camera for capturing basic input for the system.
During the experiments, we observed that the developed system can be used by the target audience,
that is, by people with motor and communication impairments, allowing for the execution of the
foreseen tasks in the proposed methodology. The first experiment, with volunteers who do not
have a disability, allowed us to verify the possibility of training a system with personalized
gestures, using a few samples for training. Using a public dataset in the second experiment had the
objective of allowing for the repeatability of the experiment by other researchers, besides enabling
the performance of new tests with different technologies before the system was made available for
testing with the targeted public. Finally, the third experiment brought rich insights by involving
representatives from the target audience, providing a glimpse of the routine of students with motor
and speech impairments in the school environment, and allowing us to observe the initiatives
used by the teachers to communicate with these students daily.
Considering the three mentioned experiments, results indicate that the implemented classifiers
are able to recognize gestures after being trained with customized datasets, even with a small number
of samples. Therefore, the classifiers can be applied to enable personalized gestural interaction.
Transfer Learning has proven to be efficient for our work because even a network trained on a
dataset composed of color images can be customized to successfully recognize our grayscale
images.
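The sketch below illustrates this adaptation idea: a network pretrained on RGB images is reused for grayscale motion images by replicating the single channel three times and retraining only a small classification head. It is a minimal sketch under stated assumptions (TensorFlow/Keras and MobileNetV2 are chosen here purely for illustration); it is not the network architecture used in the system.

# Hypothetical transfer-learning sketch for grayscale motion images.
import numpy as np
import tensorflow as tf

def build_classifier(num_classes, input_size=224):
    base = tf.keras.applications.MobileNetV2(
        input_shape=(input_size, input_size, 3), include_top=False, weights="imagenet")
    base.trainable = False                       # keep the pretrained features frozen
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def gray_to_rgb(gray_images):
    """gray_images: (N, H, W) array of grayscale MHI / OF-MHI samples."""
    x = gray_images.astype("float32") / 255.0
    return np.repeat(x[..., np.newaxis], 3, axis=-1)   # replicate channel: (N, H, W, 3)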
The SVM-based classifier associated with OF-MHI motion representation presented improvements
in performance for gesture recognition on most datasets. New experiments will be carried
out with the target audience, and the classifier that shows the best overall results is the natural
candidate to be adopted for launching the system's final version. Some important issues observed
during the evaluation steps are presented below.
6.1 Uncontrolled situations
In many situations, images with motion blur can provide inaccurate results, but in our work, as
motion is already represented in the form of a blur composed of shades of gray, motion blur would
most likely enter the sum of the frames. Since the classifiers presented satisfactory results, it is
possible that motion blur does not significantly interfere with the final motion representation. When
using the PGCA system, occlusions may disturb the understanding of a gesture if they interfere
considerably with the final representation of the generated motion. The scene's background must
be static, as any object or person moving behind the user will generate inaccurate or
unnecessary motion representations. Backgrounds with complex scenes negatively influenced
motion representation only in cases where users touched the table on which the camera was
positioned. For a new experiment with the target audience, an external camera mounted on a tripod
will be used to avoid this kind of situation. We also realize that lighting conditions in different
data collection environments can interfere with the system's ability to correctly capture the movements
performed by users. In the first experiment, one of the volunteers was positioned next
to a window (light source), and we noticed that the gesture representation generated when the
user was positioned with the body sideways to the window is significantly different from the
representation generated when the user was positioned with the whole body facing the window,
and this can negatively interfere with the system's performance. Thus, for the system to perform better,
it is important to keep the same lighting pattern during the dataset creation and interaction with
the system. In the experiment with the target audience, we sought to position the users in front of
a light source, either a window or an ordinary light bulb. More exhaustive tests could establish optimal
conditions for different variations of cluttered background, dark lighting, low contrast, or motion
blur.
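For reference, the sketch below shows a minimal conventional MHI update in the spirit of Bobick and Davis [21], illustrating why motion accumulates as shades of gray in the final representation. The threshold and decay values are illustrative assumptions, not the settings used in the system.

# Hypothetical minimal MHI update; parameters are illustrative only.
import numpy as np

def update_mhi(mhi, prev_frame, curr_frame, tau=255, decay=15, thresh=30):
    """mhi, prev_frame, curr_frame: 2-D uint8 grayscale arrays of equal size."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    moving = diff > thresh                                   # pixels where motion occurred
    mhi = np.where(moving, tau, np.maximum(mhi.astype(np.int16) - decay, 0))
    return mhi.astype(np.uint8)                              # recent motion bright, older motion fading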
6.2 Important improvements
The user experience (UX) refers to all experiences resulting from interactions that a user has with
a product or service [32]. Lee et al. [67] define UX for people with disabilities as an experience that
consists of aspects of the interaction between people with disabilities and products/services that are
influenced by assistive technologies. The evaluation steps described in this paper aimed to conduct
preliminary accessibility checks by the developer and the volunteers without disabilities, in order to later
perform the interface evaluation with the participation of users with disabilities and observe the
user experience. The experiments allowed us to identify, in different ways, points to be improved
in both the developed system and the proposed methodology, because, as highlighted by Ilyas et al.
[57], creating a dataset considering real situations, with gestures executed by the target audience,
is still perceived as problematic and challenging, mainly due to human variations in performing the
same gestures. The challenge is greater when considering people with motor impairment, because
disabilities can make it difficult for people to repeat gestures, and some computer vision solutions
present limited performance in the presence of involuntary body movement, or if the person
presents seizure disorders or spastic movements. In order to create the datasets, several executions
of the same gesture were captured and, later, samples that presented very different representations
(e.g., if some involuntary movement occurred during the execution of the gesture) were deleted via
the system. Even with the satisfactory recognition rate obtained in some tests, the system must
evaluate and guarantee the quality of the samples captured to compose the dataset. During the
process of creating the dataset, an image matching algorithm was developed to compare images
of the representation generated for each gesture with the first gesture considered as the base. To
compare the captured samples, a subtraction operation is performed between corresponding images,
and the remaining area is checked to see whether it is not greater than the original area of the base sample.
In addition, the centroids of the images are also compared to check whether they are in the same quadrant
of the image, or in an adjacent quadrant. Therefore, only samples considered valid by the system
and by the caregiver will be stored and become a part of the dataset. From evaluation step 3,
we noticed the need to improve the system's usability and accessibility, mainly concerning visual
aspects of the interface and user feedback. Additional features, such as a game-based interactive
interface, were developed, evaluated with the target audience, and described in [10].
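The sketch below illustrates the sample-comparison idea just described (residual area after subtraction plus a centroid quadrant check). It is a simplified illustration under assumed conventions (binarized, equally sized motion images and an area-ratio threshold), not the exact matching algorithm implemented in the system.

# Hypothetical sample-validation sketch; thresholds and conventions are assumed.
import numpy as np

def quadrant(img):
    ys, xs = np.nonzero(img)
    if len(xs) == 0:
        return None
    cy, cx = ys.mean(), xs.mean()
    h, w = img.shape
    return (int(cy >= h / 2), int(cx >= w / 2))        # (vertical half, horizontal half)

def is_valid_sample(base, candidate, area_ratio=1.0):
    base_bin = base > 0
    cand_bin = candidate > 0
    residual = np.logical_and(cand_bin, np.logical_not(base_bin))  # area left after subtraction
    if residual.sum() > area_ratio * base_bin.sum():
        return False                                    # too much unexplained motion
    qb, qc = quadrant(base_bin), quadrant(cand_bin)
    if qb is None or qc is None:
        return False
    # accept the same quadrant or a directly adjacent one
    return abs(qb[0] - qc[0]) + abs(qb[1] - qc[1]) <= 1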
6.3 Student’s level of understanding
For the PGCA system to bring benefits to both users and caregivers, it is imperative that users
become aware of the possibilities of interaction with the system: therefore, users with cognitive
impairments may not be able to use the system in its current version, as the system waits for
explicit input from the user. Such a limitation drew the attention of the researchers, as selecting
students to participate in the experiment was not a trivial task. Even in situations where there are
similar diagnoses (e.g., people with cerebral palsy), teachers themselves have doubts about each
student’s level of understanding. Using the proposed methodology, while adding more forms of
data input and processing, can make the system more inclusive for users with severe cognitive
disabilities, for the system would interpret the users and their intentions to communicate rather
than their intentions to interact with the system.
6.4 Limitations
To avoid unnecessary and incorrect answers by the system, students who perform a restricted
number of voluntary gestures, like Students A and D, need caregiver assistance to initiate and
finalize the capture process through the interface. Since these students did not have other gestures
that could be associated with the capture/start functionality, this task must be done by the caregiver.
The interface could be improved to suggest to the caregiver that he/she close the capture process,
or the system could do so automatically after a certain period of inactivity.
Student C can perform a higher number of signs. However, some of the signs generated very
similar motion representations that ended up confusing the classifier. For gesture-based interfaces,
precision (high true positive and low false positive rates) has to be assured, while maintaining
the natural feeling of interpersonal communication [69]. Even with the possibility of using a
confidence level to minimize false positives, users can create datasets composed of gestures that are
quite similar to each other, which may not be correctly recognized by the system. In these cases,
the system may recommend a new data collection for these gestures or suggest keeping only one of the
similar gestures in the dataset. Further studies are being conducted to improve overall
system accuracy by including more information in the motion representation, such as the texture of
the hand and face, in order to enrich the input samples used by the classifier.
Sessions occurred on different days, and some students performed gestures with different speed
and intensity in each session, suggesting that it is important to collect samples on different days to
generate more information to be learned by the classifier. In addition, the inclusion of a game-based
approach, perhaps using a serious game that requires the execution of similar movements, can be
a way to stimulate the students' interest and to promote engagement and motivation
to train and use the system. In [10], the authors experimented with a game-based approach to stimulate
students and to promote their engagement and motivation to train and use the system, obtaining
promising results.
The feedback and the state of the system must be improved to indicate that the system considered
a gesture as complete and that it is trying to recognize it. Currently, a progress bar indicates the
system’s status, but during the experiments, it was not perceived or understood by the users.
Changes in the system interface and different forms of feedback can be adopted, such as presenting
the messages in textual form, and reading them with a synthesized voice to facilitate understanding
by non-literate users. Since one of the users who participated in the experiment has a hearing
impairment, displaying messages and usage guidelines translated into LIBRAS can also support
users while learning and using the system.
One known limitation is that we are not comparing the proposed methodology with other
methodologies with the same purpose, because it is not yet possible to do this. In the future,
new assessments can be made, comparing different applications of the methodology, by different
professionals, in different contexts.
6.5 Future Works
For each gesture performed during the interaction with the AAC system, an MHI or OF-MHI image
is generated, as well as a text file containing the system's predicted class and its confidence level.
This information could be used in a more autonomous version of the system, in the future, as a way
to reload the dataset with new samples whose classification has shown a high confidence level.
This larger number of samples could help the classifiers generate better results, achieving
adequate accuracy. This could be important for the process of personalization, as the system could
identify, for example, that over time a particular gesture has been executed very differently from its
original execution when the system was trained. When reaching a very significant level of difference,
the system could suggest that the user retrain the system using newer samples. Furthermore, after a
long period of system use, a fairly large number of new samples could be generated. This could
make training a deep network like a CNN from scratch possible, and possibly allow the SVM classifier
to also deliver better results. In the system's current version, the caregiver could perform
new sample collections periodically, in parallel with the use of the system already trained with a first
dataset. These new samples could gradually generate a more complete dataset, possibly able to
recognize the gestures more satisfactorily.
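A minimal sketch of this idea is given below: interaction-time samples whose predicted confidence exceeds a threshold are collected as candidate additions to the training dataset, to be reviewed before retraining. The file layout, field names, and threshold are assumptions introduced for illustration; the text above only states that an image and a text file with the predicted class and confidence level are stored.

# Hypothetical sketch of confidence-based dataset growth; the file format is assumed.
import json
from pathlib import Path

def collect_high_confidence_samples(log_dir, min_confidence=0.9):
    """Assumes each gesture leaves <name>.png plus <name>.json containing
    {"predicted_class": ..., "confidence": ...} in log_dir."""
    candidates = []
    for meta_path in Path(log_dir).glob("*.json"):
        meta = json.loads(meta_path.read_text())
        if meta.get("confidence", 0.0) >= min_confidence:
            candidates.append((meta_path.with_suffix(".png"), meta["predicted_class"]))
    return candidates  # to be reviewed by the caregiver before being added to the dataset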
Our research until now aimed to identify dynamic gestural patterns, based mainly on the
execution of movements. Nevertheless, the proposed methodology can support the design of a
system with another perspective, focusing on recognizing gestures in a complex scenario, using
other technologies or devices, such as depth camera, gloves with electrical sensors, or head-mounted
displays. These alternatives could be used to generate an automatic translation system for complex
sign languages, such as LIBRAS, which demands other system requirements such as parameters for
the hand configuration, articulation point, orientation, movement, and facial expression.
The three evaluation steps, together, show that the methodology can be applied via a computing
system to support its target audience in generating a customized dataset and in using such a dataset to
enable personalized gestural recognition and interaction. To the best of our knowledge, no other
existing methodology could be used for comparison. Therefore, the present study is a necessary step
before we are able to apply the methodology to design different systems, in different contexts, by
different professionals, for different audiences, which can provide evidence to evaluate more
attributes of our methodology than its feasibility in the future.
The best perspectives for the proposed methodology are in the design of AAC tools, targeted
at a particular individual, by learning their actions, rather than adapting different users' varying
gestures through thresholds. For the next steps in our research, a personified approach could be
introduced by going beyond system interface adaptation and personalization. Another aspect to be
introduced is to conceive of intelligent assistive technology that requires minimal user intervention,
capturing samples continuously to interpret patterns of movements performed by people, especially
people with disabilities. These samples can be combined with other data (e.g., from a brain-computer
interface) to train a personified system, through machine learning, that would be more focused on
the user's individualities, and therefore able to learn and represent the user, overcoming personalization
standards.
Accessibility can be considered a prerequisite of usability [67] [106]. The user experience
observed in evaluation step 3 of the PGCA system indicates the feasibility of the system
and its potential in providing accessibility for people with motor and speech disabilities, despite
some limitations and challenges perceived. The professionals who monitored the execution of the
scheduled tasks in evaluation step 3 showed interest in using the PGCA system in the school
environment. Even so, a methodology for evaluating the usability and accessibility of the PGCA
system should be employed to better understand and evaluate the perception of the professionals
who will follow the execution of new experiments with the target audience. These evaluations
may provide results that represent the opinion of caregivers and their real intention to use the system
in the future.
7 CONCLUSION
This paper presented a methodology for supporting AAC based on personalized gestural interaction,
as well as results from the evaluation of the pilot system created from this methodology. Three
dierent evaluations were conducted using datasets: 1. created by volunteers without disabilities, 2.
using the public dataset Keck Gesture Dataset, and 3. created by students with motor and speech
impairments.
Two machine learning techniques were used to generate classifiers for gesture recognition: SVM
(with the HOG descriptor), and CNN (using Transfer Learning). Two different motion representations
were used to describe movements: conventional MHI, and Optical Flow-based MHI. The SVM-based
classifier, with the motion representation obtained by OF-MHI, presented better performance in
most tests.
The proposed methodology allows users and caregivers to create personalized gestural interaction
for communication purposes, and is promising for supporting the design of AAC systems. The biggest
challenge identified so far is related to the profile of our target audience: on the one hand, training
the system is quite dependent on the quality of the dataset created; on the other hand, creating
a dataset with quality samples depends heavily on users' comprehension capacity to know what
movements to perform, and on their capacity to perform voluntary movements with few variations,
i.e., it is necessary to guarantee a certain level of awareness and repeatability in order to perform
the same movement multiple times.
In future work, the PGCA user interface will be adjusted to minimize the effort required from
users and caregivers in acquiring samples to create their customized dataset, and new experiments
with the target audience will be conducted in order to better evaluate the system, and its potential
to support AAC. During a future stage of the research, we also intend to investigate how
uncontrolled situations can interfere with system accuracy by performing tests with datasets
created under different conditions, such as with cluttered background, dark lighting, low contrast,
and motion blur.
ACKNOWLEDGMENTS
The authors thank CAPES and CNPq for supporting this research, and especially thank the institutions,
the volunteers, teachers, and students who participated in the experiments.
REFERENCES
[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay
Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: a system for large-scale machine learning. In OSDI,
Vol. 16. 265–283.
[2]
Julio Abascal. 2008. Users with disabilities: maximum control with minimum effort. Articulated Motion and Deformable
Objects (2008), 449–456.
[3]
Malek Adjouadi, Anaelis Sesin, Melvin Ayala, and Mercedes Cabrerizo. 2004. Remote eye gaze tracking system as a
computer interface for persons with severe motor disability. In International Conference on Computers for Handicapped
Persons. Springer, 761–769.
[4]
Ali A Alani, Georgina Cosma, Aboozar Taherkhani, and TM McGinnity. 2018. Hand gesture recognition using an
adapted convolutional neural network with data augmentation. In 2018 4th International conference on information
management (ICIM). IEEE, 5–12.
[5]
Natasha Alves, Stefanie Blain, Tiago Falk, Brian Leung, Negar Memarian, and Tom Chau. 2016. Access Technologies
for Children and Youth with Severe Motor Disabilities. Paediatric Rehabilitation Engineering: From Disability to
Possibility (2016), 45.
[6]
Rui Azevedo Antunes, Luís Brito Palma, Fernando V Coito, Hermínio Duarteramos, and Paulo Gil. 2016. Intelligent
human-computer interface for improving pointing device usability and performance. In Control and Automation
(ICCA), 2016 12th IEEE International Conference on. IEEE, 714–719.
[7]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2018. Mobile Interaction for Augmentative
and Alternative Communication: a Systematic Mapping. SBC Journal on 3D Interactive Systems 9, 2 (2018), 105–118.
[8]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2018. Towards a Methodology to Support
Augmentative and Alternative Communication by means of Personalized Gestural Interaction. In Proceedings of the
17th Brazilian Symposium on Human Factors in Computing Systems. ACM, 38.
[9]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2019. Personalized Interactive Gesture
Recognition Assistive Technology. Proceedings of the 18th Brazilian Symposium on Human Factors in Computing
Systems (2019), 1–12.
[10]
Rúbia Eliza de Oliveira Schultz Ascari, Roberto Pereira, and Luciano Silva. 2020. Personalized Gestural Interaction
Applied in a Gesture Interactive Game-based Approach for People with Disabilities. Proceedings of the 25th International
Conference on Intelligent User Interfaces (2020), 1–11.
[11]
Behrooz Ashtiani and I Scott MacKenzie. 2010. BlinkWrite2: an improved text entry method using eye blinks. In
Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications. ACM, 339–345.
[12]
Aqil Azmi, Nawaf M Alsabhan, and Majed S AlDosari. 2009. The Wiimote with SAPI: Creating an accessible low-cost,
human computer interface for the physically disabled. International Journal of Computer Science and Network Security
9, 12 (2009), 63–68.
[13]
Samy Bakheet. 2017. A Fuzzy Framework for Real-Time Gesture Spotting and Recognition. Journal of Russian Laser
Research 38, 1 (2017), 61–75.
[14]
Margrit Betke. 2008. Camera-Based Interfaces and Assistive Software for People with Severe Motion Impairments.
Technical Report. Boston University Computer Science Department.
[15]
Margrit Betke, James Gips, and Peter Fleming. 2002. The camera mouse: visual tracking of body features to provide
computer access for people with severe disabilities. IEEE Transactions on neural systems and Rehabilitation Engineering
10, 1 (2002), 1–10.
[16]
Zhen-Peng Bian, Junhui Hou, Lap-Pui Chau, and Nadia Magnenat-Thalmann. 2016. Facial position and expression-
based human–computer interface for persons with tetraplegia. IEEE journal of biomedical and health informatics 20, 3
(2016), 915–924.
[17]
Pradipta Biswas and Pat Langdon. 2011. A new input system for disabled users involving eye gaze tracker and
scanning interface. Journal of Assistive Technologies 5, 2 (2011), 58–66.
[18]
Pradipta Biswas and Pat Langdon. 2013. A new interaction technique involving eye gaze tracker and scanning system.
In Proceedings of the 2013 Conference on Eye Tracking South Africa. ACM, 67–70.
[19]
Pradipta Biswas and Pat Langdon. 2015. Multimodal intelligent eye-gaze tracking system. International Journal of
Human-Computer Interaction 31, 4 (2015), 277–294.
[20]
Pieter Blignaut. 2017. Development of a gaze-controlled support system for a person in an advanced stage of multiple
sclerosis: a case study. Universal Access in the Information Society 16, 4 (2017), 1003–1016.
[21]
Aaron F. Bobick and James W. Davis. 2001. The recognition of human movement using temporal templates. IEEE
Transactions on pattern analysis and machine intelligence 23, 3 (2001), 257–267.
[22]
Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt
Huenerfauth, Hernisa Kacorri, Tessa Verhoef, et al. 2019. Sign Language Recognition, Generation, and Translation:
An Interdisciplinary Perspective. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility.
16–31.
[23]
Dario Cazzato, Marco Leo, and Cosimo Distante. 2014. An investigation on the feasibility of uncalibrated and
unconstrained gaze tracking for human assistive applications by using head pose estimation. Sensors 14, 5 (2014),
8363–8379.
[24]
Vikash Chauhan and Tim Morris. 2001. Face and feature tracking for cursor control. In Proceedings of the Scandinavian
Conference on Image Analysis. 356–362.
[25]
Weiqin Chen. 2013. Gesture-based applications for elderly people. In International Conference on Human-Computer
Interaction. Springer, 186–195.
[26]
Fulvio Corno, Laura Farinetti, and Isabella Signorile. 2002. A cost-effective solution for eye-gaze assistive technology.
In Multimedia and Expo, 2002. ICME’02. Proceedings. 2002 IEEE International Conference on, Vol. 2. IEEE, 433–436.
[27]
Stefania Cristina and Kenneth P Camilleri. 2016. Model-based head pose-free gaze estimation for assistive communi-
cation. Computer Vision and Image Understanding 149 (2016), 157–170.
[28]
E Dall’Asta and Riccardo Roncella. 2014. A COMPARISON OF SEMIGLOBAL AND LOCAL DENSE MATCHING
ALGORITHMS FOR SURFACE RECONSTRUCTION. International Archives of the Photogrammetry, Remote Sensing &
Spatial Information Sciences 45 (2014).
[29]
James W Davis and Aaron F Bobick. 1997. The representation and recognition of human movement using temporal
templates. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on.
IEEE, 928–934.
[30]
AV Dehankar, Sanjeev Jain, and VM Thakare. 2017. Using AEPI method for hand gesture recognition in varying
background and blurred images. In Electronics, Communication and Aerospace Technology (ICECA), 2017 International
conference of, Vol. 1. IEEE, 404–409.
[31]
AV Dehankar, VM Thakare, and Sanjeev Jain. 2017. Detecting centroid for hand gesture recognition using morpho-
logical computations. In Inventive Systems and Control (ICISC), 2017 International Conference on. IEEE, 1–5.
[32]
Pieter Desmet and Paul Hekkert. 2007. Framework of product experience. International journal of design 1, 1 (2007),
57–66.
[33]
Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2015. Image super-resolution using deep convolutional
networks. IEEE transactions on pattern analysis and machine intelligence 38, 2 (2015), 295–307.
[34]
Simone Eidam, Jens Garstka, and Gabriele Peters. 2016. Towards regaining mobility through virtual presence for
patients with locked-in syndrome. In Proceedings of the 8th International Conference on Advanced Cognitive Technologies
and Applications. Rome, Italy. 120–123.
[35]
Layal El-Afifi, Mohamad Karaki, Joelle Korban, and Mohamad A al Alaoui. 2004. ’Hands-free interface’-a fast and
accurate tracking procedure for real time human computer interaction. In Signal Processing and Information Technology,
2004. Proceedings of the Fourth IEEE International Symposium on. IEEE, 517–520.
[36]
Samuel Epstein, Eric Missimer, and Margrit Betke. 2014. Using kernels for a video-based mouse-replacement interface.
Personal and Ubiquitous Computing 18, 1 (2014), 47–60.
[37]
S Yu Eroshkin, NA Kameneva, DV Kovkov, and AI Sukhorukov. 2017. Conceptual system in the modern information
management. Procedia Computer Science 103 (2017), 609–612.
[38]
Xijian Fan and Tardi Tjahjadi. 2017. A dynamic framework based on local Zernike moment and motion history image
for facial expression recognition. Pattern Recognition 64 (2017), 399–406.
[39]
Alhussein Fawzi, Horst Samulowitz, Deepak Turaga, and Pascal Frossard. 2016. Adaptive data augmentation for
image classification. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 3688–3692.
[40]
S Federici and MJ Scherer. 2012. The assistive technology assessment model and basic definitions. Assistive technology
assessment handbook (2012), 1–10.
[41]
Marcela Fejtová, Luis Figueiredo, Petr Novák, Olga Štěpánková, and Ana Gomes. 2009. Hands-free interaction with a
computer and other technologies. Universal Access in the Information Society 8, 4 (2009), 277.
[42]
Xiping Fu, Brendan McCane, Michael Albert, and Steven Mills. 2013. Action recognition based on principal geodesic
analysis. In 2013 28th International Conference on Image and Vision Computing New Zealand (IVCNZ 2013). IEEE,
259–264.
[43]
Yun Fu and Thomas S Huang. 2007. hMouse: Head tracking driven virtual computer mouse. In Applications of
Computer Vision, 2007. WACV’07. IEEE Workshop on. IEEE, 30–30.
[44]
Luke Gane, Sarah Power, Azadeh Kushki, and Tom Chau. 2011. Thermal imaging of the periorbital regions during
the presentation of an auditory startle stimulus. PloS one 6, 11 (2011), e27268.
[45]
Liliana García, Ricardo Ron-Angevin, Bertrand Loubière, Loc Renault, Gwendal Le Masson, Véronique Lespinet-Najib,
and Jean Marc André. 2017. A comparison of a Brain-Computer Interface and an Eye tracker: is there a more
appropriate technology for controlling a virtual keyboard in an ALS patient?. In International Work-Conference on
Articial Neural Networks. Springer, 464–473.
[46]
Cindy Gevarter, Mark F O’Reilly, Laura Rojeski, Nicolette Sammarco, Russell Lang, Giulio E Lancioni, and Jeff Sigafoos.
2013. Comparisons of intervention components within augmentative and alternative communication systems for
individuals with developmental disabilities: A review of the literature. Research in developmental disabilities 34, 12
(2013), 4404–4414.
[47]
Sakher Ghanem, Christopher Conly, and Vassilis Athitsos. 2017. A survey on sign language recognition using smart-
phones. In Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments.
171–176.
[48]
Francisco Gomez-Donoso, Miguel Cazorla, Alberto Garcia-Garcia, and Jose Garcia-Rodriguez. 2016. Automatic
Schaeer’s gestures recognition system. Expert Systems 33, 5 (2016), 480–488.
[49]
Magdalena González, Débora Mulet, Elisa Perez, Carlos Soria, and Vicente Mut. 2010. Vision based interface: an
alternative tool for children with cerebral palsy. In Engineering in Medicine and Biology Society (EMBC), 2010 Annual
International Conference of the IEEE. IEEE, 5895–5898.
[50]
Kristen Grauman, Margrit Betke, Jonathan Lombardi, James Gips, and Gary R Bradski. 2003. Communication via eye
blinks and eyebrow raises: Video-based human-computer interfaces. Universal Access in the Information Society 2, 4
(2003), 359–373.
[51]
John Paulin Hansen, Kristian Tørning, Anders Sewerin Johansen, Kenji Itoh, and Hirotaka Aoki. 2004. Gaze typing
compared with input by head and hand. In Proceedings of the 2004 symposium on Eye tracking research & applications.
ACM, 131–138.
[52]
Helena Hemmingsson, Gunnar Ahlsten, Helena Wandin, Patrik Rytterström, and Maria Borgestig. 2018. Eye-Gaze
Control Technology as Early Intervention for a Non-Verbal Young Child with High Spinal Cord Injury: A Case Report.
Technologies 6, 1 (2018), 12.
[53]
Alexandre Felippeto Henzen and Percy Nohama. 2017. Facial Movements Detection Using Neural Networks and
Mpeg-7 Descriptors Applied to Alternative and Augmentative Communication Systems. In VII Latin American Congress
on Biomedical Engineering CLAIB 2016, Bucaramanga, Santander, Colombia, October 26th-28th, 2016. Springer, 626–629.
[54]
Berthold KP Horn and Brian G Schunck. 1981. Determining optical flow. Artificial Intelligence 17, 1-3 (1981), 185–203.
[55]
Anthony J Hornof and Anna Cavender. 2005. EyeDraw: enabling children with severe motor impairments to draw
with their eyes. In Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 161–170.
[56]
Chin-Pan Huang, Chaur-Heh Hsieh, Kuan-Ting Lai, and Wei-Yang Huang. 2011. Human action recognition using
histogram of oriented gradient of motion history image. In Instrumentation, Measurement, Computer, Communication
and Control, 2011 First International Conference on. IEEE, 353–356.
[57]
Chaudhary Muhammad Aqdus Ilyas, Mohammad A Haque, Matthias Rehm, Kamal Nasrollahi, and Thomas B Moeslund.
2017. Facial Expression Recognition for Traumatic Brain Injured Patients. In International Conference on Computer
Vision Theory and Applications. SCITEPRESS Digital Library.
[58]
Robert JK Jacob. 1991. The use of eye movements in human-computer interaction techniques: what you look at is
what you get. ACM Transactions on Information Systems (TOIS) 9, 2 (1991), 152–169.
[59]
Zhuolin Jiang, Zhe Lin, and Larry Davis. 2012. Recognizing human actions by learning and matching shape-motion
prototype trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 3 (2012), 533–547.
[60]
Shaun K Kane, Barbara Linam-Church, Kyle Althoff, and Denise McCall. 2012. What we talk about: designing a
context-aware communication tool for people with aphasia. In Proceedings of the 14th international ACM SIGACCESS
conference on Computers and accessibility. ACM, 49–56.
[61]
Intissar Khalifa, Ridha Ejbali, and Mourad Zaied. 2018. Hand motion modeling for psychology analysis in job interview
using optical flow-history motion image: OF-HMI. In Tenth International Conference on Machine Vision (ICMV 2017),
Vol. 10696. International Society for Optics and Photonics, 106962L.
[62]
Tomasz Kocejko, Adam Bujnowski, and Jerzy Wtorek. 2009. Eye-mouse for disabled. In Human-computer systems
interaction. Springer, 109–122.
[63]
Myron W Krueger, Thomas Gionfriddo, and Katrin Hinrichsen. 1985. VIDEOPLACE—an artificial reality. In ACM
SIGCHI Bulletin, Vol. 16. ACM, 35–40.
[64]
Andrew Kurauchi, Wenxin Feng, Carlos Morimoto, and Margrit Betke. 2015. HMAGIC: head movement and gaze
input cascaded pointing. In Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to
Assistive Environments. ACM, 47.
[65]
Denis Lalanne, Laurence Nigay, Peter Robinson, Jean Vanderdonckt, Jean-François Ladry, et al. 2009. Fusion engines
for multimodal input: a survey. In Proceedings of the 2009 international conference on Multimodal interfaces. ACM,
153–160.
[66]
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics
(1977), 159–174.
[67]
Mingyu Lee, Sung H Han, Hyun K Kim, and Hanul Bang. 2017. Identifying user experience elements for people with
disabilities. Presentation at ACHI: The Eighth International Conference on Advances in ...
[68]
Wouter Lemahieu and Bart Wyns. 2011. Low cost eye tracking for human-machine interfacing. Journal of Eye
Tracking, Visual Cognition and Emotion (2011).
[69]
Marco Leo, G Medioni, M Trivedi, Takeo Kanade, and Giovanni Maria Farinella. 2017. Computer vision for assistive
technologies. Computer Vision and Image Understanding 154 (2017), 1–15.
[70]
Brian Leung and Tom Chau. 2010. A multiple camera tongue switch for a child with severe spastic quadriplegic
cerebral palsy. Disability and Rehabilitation: Assistive Technology 5, 1 (2010), 58–68.
[71]
Yongqian Liu, Yuzhu He, and Weijia Cui. 2018. An improved SVM classier based on multi-verse optimizer for
fault diagnosis of autopilot. In 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control
Conference (IAEAC). IEEE, 941–944.
[72]
Yi Liu, Bu-Sung Lee, and Martin J McKeown. 2016. Robust eye-based dwell-free typing. International Journal of
Human–Computer Interaction 32, 9 (2016), 682–694.
[73]
Bruce D Lucas and Takeo Kanade. 1981. An iterative image registration technique with an application to stereo vision.
In Proceedings of the 7th International Joint Conference on Artificial Intelligence. Vancouver, BC, Canada.
[74]
Robert Gabriel Lupu, Radu Gabriel Bozomitu, Alexandru Păsărică, and Cristian Rotariu. 2017. Eye tracking user
interface for Internet access used in assistive technology. In E-Health and Bioengineering Conference (EHB), 2017. IEEE,
659–662.
[75] I Scott MacKenzie and Behrooz Ashtiani. 2009. BlinkWrite: efficient text entry using eye blinks. Universal Access in
the Information Society 10, 1 (2009), 69–80.
[76]
Cristina Manresa-Yee, Pere Ponsa, Javier Varona, and Francisco J Perales. 2010. User experience to improve the
usability of a vision-based interface. Interacting with Computers 22, 6 (2010), 594–605.
[77]
Xianbai Mao, Liheng Wang, and Changxi Li. 2008. SVM classier for analog fault diagnosis using fractal features. In
2008 Second International Symposium on Intelligent Information Technology Application, Vol. 2. IEEE, 553–557.
[78]
Joanna Marnik. 2014. BlinkMouse-On-Screen Mouse Controlled by Eye Blinks. In Information Technologies in
Biomedicine, Volume 4. Springer, 237–248.
[79]
João MS Martins, João MF Rodrigues, and Jaime AC Martins. 2015. Low-cost natural interface based on head
movements. Procedia Computer Science 67 (2015), 312–321.
[80]
Paulo Martins, Henrique Rodrigues, Tânia Rocha, Manuela Francisco, and Leonel Morgado. 2015. Accessible options
for deaf people in e-learning platforms: technology solutions for sign language translation. Procedia Computer Science
67 (2015), 263–272.
[81]
César Mauri, Toni Granollers, Jesús Lorés, and Mabel García. 2006. Computer vision interaction for people with severe
movement restrictions. Human Technology: An Interdisciplinary Journal on Humans in ICT Environments (2006).
[82]
Negar Memarian, Tom Chau, and Anastasios N Venetsanopoulos. 2009. Application of infrared thermal imaging in
rehabilitation engineering: Preliminary results. In Science and Technology for Humanity (TIC-STH), 2009 IEEE Toronto
International Conference. IEEE, 1–5.
[83]
Negar Memarian, Anastasios N Venetsanopoulos, and Tom Chau. 2009. Infrared thermography as an access pathway
for individuals with severe motor impairments. Journal of neuroengineering and rehabilitation 6, 1 (2009), 11.
[84]
Eric Missimer and Margrit Betke. 2010. Blink and wink detection for mouse pointer control. In Proceedings of the 3rd
International Conference on Pervasive Technologies Related to Assistive Environments. ACM, 23.
[85]
Aree A Mohammed. 2014. Efficient eye blink detection method for disabled-helping domain. Eye 10, P1
(2014), P2.
[86]
Laura Montanini, Enea Cippitelli, Ennio Gambi, and Susanna Spinsante. 2015. Low complexity head tracking on
portable android devices for real time message composition. Journal on Multimodal User Interfaces 9, 2 (2015), 141–151.
[87]
Inhyuk Moon, Kyunghoon Kim, Jeicheong Ryu, and Museong Mun. 2003. Face direction-based human-computer
interface using image observation and EMG signal for the disabled. In Robotics and Automation, 2003. Proceedings.
ICRA’03. IEEE International Conference on, Vol. 1. IEEE, 1515–1520.
[88]
K Morrison and S J McKenna. 2002. Automatic visual recognition of gestures made by motor-impaired computer
users. Technology and Disability 14, 4 (2002), 197–203.
[89]
K Morrison and S J McKenna. 2002. Contact-free recognition of user-defined gestures as a means of computer access
for the physically disabled. In Workshop on Universal Access and Assistive Technology. 99–103.
[90]
Giorgos Mountrakis, Jungho Im, and Caesar Ogole. 2011. Support vector machines in remote sensing: A review. ISPRS
Journal of Photogrammetry and Remote Sensing 66, 3 (2011), 247–259.
[91]
Cosmin Munteanu, Sharon Oviatt, Gerald Penn, and Randy Gomez. 2016. Designing Speech and Multimodal
Interactions for Mobile, Wearable, and Pervasive Applications. (2016), 3612–3619.
[92]
Masoomeh Nabati and Alireza Behrad. 2015. 3D Head pose estimation and camera mouse implementation using a
monocular video camera. Signal, Image and Video Processing 9, 1 (2015), 39–44.
[93]
Rizwan Ali Naqvi, Muhammad Arsalan, and Kang Ryoung Park. 2017. Fuzzy system-based target selection for a NIR
camera-based gaze tracker. Sensors 17, 4 (2017), 862.
[94]
Saeed Nasri, Alireza Behrad, and Farbod Razzazi. 2015. A novel approach for dynamic hand gesture recognition using
contour-based similarity images. International Journal of Computer Mathematics 92, 4 (2015), 662–685.
[95]
Saeed Nasri, Alireza Behrad, and Farbod Razzazi. 2015. Spatio-temporal 3D surface matching for hand gesture
recognition using ICP algorithm. Signal, Image and Video Processing 9, 5 (2015), 1205–1220.
[96]
Farhood Negin, Pau Rodriguez, Michal Koperski, Adlen Kerboua, Jordi Gonzàlez, Jeremy Bourgeois, Emmanuelle
Chapoulie, Philippe Robert, and Francois Bremond. 2018. PRAXIS: Towards Automatic Cognitive Assessment Using
Gesture Recognition. Expert Systems with Applications (2018).
[97]
Shuo Niu, Li Liu, and D Scott McCrickard. 2018. Tongue-able interfaces: Prototyping and evaluating camera based
tongue gesture input system. Smart Health (2018).
[98]
Redwan AK Noaman, Mohd Alauddin Mohd Ali, and Nasharuddin Zainal. 2017. Enhancing pedestrian detection
using optical flow for surveillance. International Journal of Computational Vision and Robotics 7, 1-2 (2017), 35–48.
[99]
Antti Oulasvirta and Kasper Hornbæk. 2016. HCI research as problem-solving. In Proceedings of the 2016 CHI
Conference on Human Factors in Computing Systems. ACM, 4956–4967.
[100]
Kaushik Parmar, Bhavin Mehta, and Rupali Sawant. 2012. Facial-feature based Human-Computer Interface for
disabled people. In Communication, Information & Computing Technology (ICCICT), 2012 International Conference on.
IEEE, 1–5.
[101]
Rupal Patel and Deb Roy. 1998. Teachable interfaces for individuals with dysarthric speech and severe physical
disabilities. In Proceedings of the AAAI Workshop on Integrating Artificial Intelligence and Assistive Technology. Citeseer,
40–47.
[102]
Jasmina Ivšac Pavliša, Marta Ljubešić, and Ivana Jerečić. 2012. The use of AAC with young children in Croatia–from
the speech and language pathologist’s view. In KES International Symposium on Agent and Multi-Agent Systems:
Technologies and Applications. Springer, 221–230.
[103]
Lishen Pei, Mao Ye, Pei Xu, Xuezhuan Zhao, and Tao Li. 2013. Multi-class action recognition based on inverted index
of action states. In 2013 IEEE International Conference on Image Processing. IEEE, 3562–3566.
[104]
Emanuele Perini, Simone Soria, Andrea Prati, and Rita Cucchiara. 2006. FaceMouse: A human-computer interface for
tetraplegic people. In European Conference on Computer Vision. Springer, 99–108.
[105]
E Pirani and Mahesh Kolte. 2010. Gesture based educational software for children with acquired brain injuries.
International Journal in Computer Science and Engineering 2, 3 (2010), 790–794.
[106]
Franz Puhretmair and Klaus Miesenberger. 2005. Making sense of accessibility in IT Design-usable accessibility vs.
accessible usability. In 16th International Workshop on Database and Expert Systems Applications (DEXA’05). IEEE,
861–865.
[107]
David M Roy, Marilyn Panayi, Roman Erenshteyn, Richard Foulds, and Robert Fawcus. 1994. Gestural human-machine
interaction for people with severe speech and motor impairment due to cerebral palsy. In Conference companion on
Human factors in computing systems. ACM, 313–314.
[108]
David M Roy, Marilyn Panayi, Richard Foulds, Roman Erenshteyn, William S Harwin, and Robert Fawcus. 1994. The
enhancement of interaction for people with severe speech and physical impairment through the computer recognition
of gesture and manipulation. Presence: Teleoperators & Virtual Environments 3, 3 (1994), 227–235.
[109]
David Rozado, Jason Niu, and Martin Lochner. 2017. Fast Human-Computer Interaction by Combining Gaze Pointing
and Face Gestures. ACM Transactions on Accessible Computing (TACCESS) 10, 3 (2017), 10.
[110]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal
of Computer Vision 115, 3 (2015), 211–252.
[111]
Sancho Salcedo-Sanz, José Luis Rojo-Álvarez, Manel Martínez-Ramón, and Gustavo Camps-Valls. 2014. Support
vector machines in engineering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
4, 3 (2014), 234–267.
[112]
Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal
of Big Data 6, 1 (2019), 60.
[113]
Tyler Simpson, Colin Broughton, Michel JA Gauthier, and Arthur Prochazka. 2008. Tooth-click control of a hands-free
computer interface. IEEE Transactions on Biomedical Engineering 55, 8 (2008), 2050–2056.
[114]
Piotr Stawicki, Felix Gembler, Aya Rezeika, and Ivan Volosyak. 2017. A novel hybrid mental spelling application
based on eye tracking and SSVEP-based BCI. Brain sciences 7, 4 (2017), 35.
[115]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition.
2818–2826.
[116]
Kentaro Toyama. 1998. Look, ma-no hands! hands-free cursor control with real-time 3d face tracking. Workshop on
Perceptual User Interfaces (1998).
[117]
Du-Ming Tsai, Wei-Yao Chiu, and Men-Han Lee. 2015. Optical ow-motion history image (OF-MHI) for action
recognition. Signal, Image and Video Processing 9, 8 (2015), 1897–1906.
[118]
Jilin Tu, Hai Tao, and Thomas Huang. 2007. Face as mouse through visual face tracking. Computer Vision and Image
Understanding 108, 1-2 (2007), 35–40.
[119]
Outi Tuisku, Veikko Surakka, Ville Rantanen, Toni Vanhala, and Jukka Lekkala. 2013. Text entry by gazing and
smiling. Advances in Human-Computer Interaction 2013 (2013), 1.
[120]
Maryam Vafadar and Alireza Behrad. 2008. Human hand gesture recognition using motion orientation histogram
for interaction of handicapped persons with computer. In International Conference on Image and Signal Processing.
Springer, 378–385.
[121]
Javier Varona, Cristina Manresa-Yee, and Francisco J Perales. 2008. Hands-free vision-based interface for computer
accessibility. Journal of Network and Computer Applications 31, 4 (2008), 357–374.
[122]
Mrs M Vidhya, P Poornima Devi, S Priscilla Emima, and G Revathi. 2016. Implementation of Bidirectional Voice
Communication between Normal and Deaf & Dumb Person. International Journal of Advanced Research Trends in
Engineering and Technology (IJARTET) (2016).
[123]
Jun Wan, Vassilis Athitsos, Pat Jangyodsuk, Hugo Jair Escalante, Qiuqi Ruan, and Isabelle Guyon. 2014. CSMMI:
Class-specic maximization of mutual information for action and gesture recognition. IEEE Transactions on Image
Processing 23, 7 (2014), 3152–3165.
[124]
Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. 2018. Packing convolutional neural networks in the frequency
domain. IEEE transactions on pattern analysis and machine intelligence (2018).
[125]
Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data 3, 1
(2016), 9.
[126]
Krishna Ferreira Xavier, Vinícius Kruger da Costa, Rafael Cunha Cardoso, Jamir Alves Peroba, Adriano Oliveira Lima
Ferreira, Marcelo Bender Machado, Tatiana Aires Tavares, and Andréia Sias Rodrigues. 2017. VisiUMouse: An
Ubiquitous Computer Vision Technology for People with Motor Disabilities. (2017), 115–118.
J. ACM, Vol. XX, No. X, Article XXX. Publication date: X 2020.
XXX:34 R. E. O. S. Ascari, et al.
[127]
Cristina Suemay Manresa Yee, Francisco Perales López, and Javier Varona Gómez. 2009. Advanced and natural
interaction system for motion-impaired users. Ph.D. Dissertation. PhD thesis, Departament de Ciencies Matematiques i
Informatica, Universitat de les Illes Balears, Spain.
[128]
I Yoda, K Ito, and T Nakayama. 2017. Modular Gesture Interface for People with Severe Motor Dysfunction: Foot
Recognition. Studies in health technology and informatics 242 (2017), 725–732.
[129]
Thorsten O Zander, Matti Gaertner, Christian Kothe, and Roman Vilimek. 2010. Combining eye gaze input with a
brain–computer interface for touchless human–computer interaction. Intl. Journal of Human–Computer Interaction
27, 1 (2010), 38–51.
[130]
Jiajia Zhang, Kun Shao, and Xing Luo. 2018. Small sample image recognition using improved convolutional neural
network. Journal of Visual Communication and Image Representation 55 (2018), 640–647.
[131]
Shasha Zhang, Weicun Zhang, and Yunluo Li. 2016. Human Action Recognition Based on Multifeature Fusion. In
Proceedings of 2016 Chinese Intelligent Systems Conference. Springer, 183–192.
[132]
Xiaoyi Zhang, Harish Kulkarni, and Meredith Ringel Morris. 2017. Smartphone-Based Gaze Gesture Communication
for People with Motor Disabilities. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.
ACM, 2878–2889.
J. ACM, Vol. XX, No. X, Article XXX. Publication date: X 2020.