
Mining Multimodal Repositories for Speech Affecting Diseases

Joana Correia 1,2, Bhiksha Raj 1, Isabel Trancoso 2, Francisco Teixeira 2
1 Carnegie Mellon University, USA
2 INESC-ID / Instituto Superior Técnico, University of Lisbon, Portugal
joanac@cs.cmu.edu
Abstract
The motivation for this work is to contribute to the col-
lection of large in-the-wild multimodal datasets in which the
speech of the subject is affected by certain medical conditions.
Our mining effort is focused on video blogs (vlogs), and as a
proof-of-concept we have selected three target diseases: De-
pression, Parkinson’s disease, and cold.
Given the large scale of online repositories, we take advantage of existing retrieval algorithms to narrow the pool of candidate videos for a given query related to the disease (e.g., depression vlog), and on top of that we apply several filtering techniques. These techniques explore audio, video, text, and metadata cues in order to retrieve vlogs that feature a single speaker who, at some point, admits that they are currently affected by the given disease. The use of straightforward NLP techniques on the automatically transcribed data showed that distinguishing between narratives of present and past experiences is harder than distinguishing between narratives of one's own experiences and those of someone else.
The three resulting speech datasets were tested with neural networks trained on speech data collected in controlled conditions, yielding results only slightly below those achieved with the original test datasets.
Index Terms: data mining, pathological speech
1. Introduction
Speech, being a complex bio-signal that is intrinsically related
to human physiology and cognition, has the potential to pro-
vide a rich bio-marker for health, e.g. allowing a non-invasive
route to early diagnosis and monitoring of a range of conditions
including Parkinson’s disease, anxiety, depression or dementia,
just to name a few [1][2]. With the rise of speech-related machine learning applications over the last decade, there has been a growing interest in developing speech-based diagnosis-aid tools that perform non-invasive diagnosis [3][4][5][6][7].
However, one of the biggest challenges of developing
computer-aided diagnosis systems based on speech is acquiring
large amounts of training data. Often, the limited training data
available is recorded in controlled conditions, raising concerns
relating to the ecological validity of the experimental results ob-
tained. At the same time, the cost of collecting data in controlled conditions is high, and often prohibitive: from finding eligible and willing subjects, to assigning healthcare specialists, to guaranteeing the logistical and legal requirements of the data collection process.
Our motivation is to provide a proof-of-concept of a valid alternative to the traditional process of creating datasets. We argue that this can be achieved by mining medical data from in-the-wild, large-scale, multimodal repositories. We hypothesize that this type of data exists in very large quantities, and contains highly varied examples of the effects of the diseases on the subjects' speech, unbound by human experiment design. At the same time, this alternative keeps the collection cost low in terms of time and human resources. To the best of our knowledge, this is the first work attempting to automatically collect disease-specific datasets from multimodal online repositories.
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), with references UID/CEC/50021/2013 and SFRH/BD/103402/2014.
We describe the ideal video candidate for the dataset as one that features a single subject; who is talking about himself/herself; who refers to present and not past medical conditions; and which includes a spoken confirmation of their diagnosis. Video blogs (vlogs), a popular category of videos defined as a personal video log of any given experience, typically with little production and editing, usually contain most of the aforementioned characteristics. Therefore, we focused our mining efforts on them (e.g., the query is "depression vlog"), in order to help exclude other video formats also related to the target disease, such as news pieces, lectures, etc. Even then, the fraction of target videos is typically less than half the total number of retrieved videos. It is therefore necessary to filter out the videos that do not contain first-person, present-tense experiences of the target disease. To do so, we propose a multimodal analysis of the video and its metadata, using mostly off-the-shelf tools in order to test the potential of our approach.
As a proof-of-concept we have selected three target dis-
eases: Depression, Parkinson’s disease, and cold. We col-
lected and labelled a small dataset for each target disease from
YouTube, building a corpus of in-the-Wild Speech Medical
(WSM) data, with which we test our proposed filtering solution.
Additionally, we test state-of-the-art neural networks,
trained to detect pathological speech with data collected in con-
trolled environments, against the WSM Corpus, to highlight the
differences between in-the-wild pathological speech, and patho-
logical speech collected in controlled conditions.
This paper is organized as follows: Section 2 describes the simple retrieval process used to build this initial dataset from the online repository YouTube; Section 3 reports the process of filtering out unwanted videos, describing the multimodal feature extraction process and the classifiers; their performance in detecting the target videos in the WSM dataset is presented in Section 4; Section 5 describes the models and experiments performed with data collected in controlled environments, and compares those results to the ones obtained on WSM with the same models; finally, we draw some conclusions in Section 6.
2. The WSM Corpus
The depression, Parkinson’s, and cold datasets of the WSM
corpus were collected in February 2018 from the online mul-
timodal repository YouTube. The published dates of videos
ranged from January 2007 to February 2018. The language of the videos was restricted to English. The size of the WSM Corpus was limited to approximately 60 videos per dataset, because of the need for manual labeling.

Table 1: Positive class incidence per label, per disease, for the WSM Corpus.

| Dataset | Vlog | 1st Person | Present | Target topic | All |
|---|---|---|---|---|---|
| Depression | 92.2 | 73.4 | 50.0 | 56.3 | 28.1 |
| Parkinson's | 56.3 | 54.7 | 56.3 | 68.8 | 28.1 |
| Cold | 96.9 | 79.7 | 90.6 | 62.5 | 46.9 |

Table 2: Overview of the WSM Corpus.

| Dataset | Class | # Videos | Ave. duration [min] | Ave. # words/video | Ave. # words/min/video | Vocab. size | Total length [min] | Total length [words] |
|---|---|---|---|---|---|---|---|---|
| Depression | Positive | 18 | 8.85 | 1142.44 | 149.28 | 2130 | 159 | 20564 |
| Depression | Negative | 40 | 10.44 | 1370.98 | 145.78 | 4321 | 418 | 54839 |
| Depression | Overall | 58 | 9.95 | 1300.05 | 146.86 | 5096 | 577 | 75403 |
| Parkinson's | Positive | 18 | 6.73 | 948.50 | 138.78 | 2275 | 121 | 17073 |
| Parkinson's | Negative | 43 | 10.11 | 1229.19 | 103.63 | 5058 | 435 | 52855 |
| Parkinson's | Overall | 61 | 9.11 | 1146.36 | 114.00 | 5849 | 556 | 69928 |
| Cold | Positive | 30 | 15.96 | 968.23 | 149.77 | 2930 | 479 | 29047 |
| Cold | Negative | 33 | 10.07 | 1319.61 | 133.61 | 3710 | 332 | 43547 |
| Cold | Overall | 63 | 12.88 | 1152.29 | 141.30 | 5097 | 811 | 72594 |
The dataset was collected by using a combination of the official YouTube API and scraping tools to retrieve a list of results for the query "[target disease] vlog". The following information was collected for each result (some of the items are optional, as they are not required to be filled in by the uploader): the video; a unique identifier; the title; the description (optional); the transcription (automatically generated for videos in English, unless provided by a user); the channel identifier; the playlist identifier; the date published; the thumbnail; the video category (a closed set of 14 categories, e.g., "News", "Music", or "Entertainment"); the number of views; the number of thumbs up; the number of thumbs down; and the comments.
We note that the video's transcription was automatically generated by YouTube (only for videos in English), using a large-scale, semi-supervised deep neural network for acoustic modeling [8], unless a human transcription was provided by a user.
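For reference, the API-based part of this retrieval step can be sketched as follows, using the YouTube Data API v3 via the google-api-python-client package; the API key, query string, and result limit are placeholders, and the scraping component and comment/transcription download are omitted.

```python
# Minimal sketch of the API-based retrieval step, assuming the YouTube Data
# API v3 and the google-api-python-client package; API_KEY and QUERY are
# placeholders, not values from the paper.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"     # placeholder
QUERY = "depression vlog"    # "[target disease] vlog"

def search_vlogs(query=QUERY, max_results=50):
    youtube = build("youtube", "v3", developerKey=API_KEY)
    # search().list returns video ids plus basic snippet metadata
    # (title, description, channel id, publish date, thumbnails).
    response = youtube.search().list(
        q=query, part="id,snippet", type="video", maxResults=max_results
    ).execute()
    results = []
    for item in response.get("items", []):
        video_id = item["id"]["videoId"]
        # videos().list exposes statistics such as view counts and the
        # category id, used as metadata features in Section 3.
        details = youtube.videos().list(
            id=video_id, part="statistics,snippet,contentDetails"
        ).execute()["items"][0]
        results.append({
            "id": video_id,
            "title": item["snippet"]["title"],
            "description": item["snippet"]["description"],
            "published": item["snippet"]["publishedAt"],
            "statistics": details["statistics"],
        })
    return results
```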
Each video in the WSM Corpus was hand-labeled with four intermediate binary labels: 1) the video is in a vlog format; 2) the main speaker of the video talks mostly about him/herself; 3) the discourse is about present experiences or opinions; 4) the main topic of the video is related to the target disease. If all intermediate labels were positive, the video was labeled as containing in-the-wild pathological speech.
Table 1 shows the class distribution for each label, for the three datasets. Table 2 presents some statistics for each dataset, relative to the "All" label, namely: the average length of the videos, the average number of words in the video's transcription, the average number of words per minute, the dataset length in minutes and in words, and the total vocabulary size. These statistics are presented for each dataset, both overall and broken down by positive and negative presence of pathological speech.
3. Automatic Filtering of Videos with
Pathological Speech
One of the goals of this work was to distinguish between videos of subjects affected by a target disease at the time of the recording and other videos that may still be related to the target disease, e.g., news pieces, presentations, classes, or forms of artistic expression. As such, we focused on extracting features that help our classifiers automatically replicate the manual labels.
Our focus was to establish a baseline performance for this task; therefore we opted for simple, straightforward techniques, both for the feature extraction stage and for the modeling stage. We deferred to future work the replication of state-of-the-art techniques used to solve related problems, including multimodal emotion recognition [9][10][11], and techniques that perform the synchronization of features across different modalities [12][13][14].
3.1. Feature extraction
The feature extraction was performed mostly using existing
toolkits, in order to establish a baseline performance. From the
information extracted for each video, we computed the follow-
ing multimodal features:
Natural Language: Bag-of-Words (BoW) features were extracted from the video's transcription. The BoW model was used to convert a transcription into a frequency vector of tokens. In this scheme, we obtained one feature vector per transcription, in which each feature was the normalized frequency of an individual token. The length of the vector was the total size of the vocabulary of the corpus of transcriptions. This model ignores the ordering of the tokens in the transcription. In order to reduce the weight of very common words (e.g., the, a, is in English), which carry very little meaningful information about the actual content of the document, we used the term-frequency times inverse document-frequency (tf-idf) transform.
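A minimal sketch of this BoW plus tf-idf step, using scikit-learn as an illustrative implementation (the paper does not name a specific toolkit for this part):

```python
# Minimal sketch of the BoW + tf-idf features; scikit-learn is used here for
# illustration, the transcriptions list is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer

transcriptions = ["...", "..."]  # one automatic transcription per video (placeholder)

# Token counts normalized by tf-idf, so very common words such as
# "the", "a", "is" are down-weighted; the vector length equals the corpus
# vocabulary size (e.g. 5096 for the depression dataset).
vectorizer = TfidfVectorizer(lowercase=True)
bow_features = vectorizer.fit_transform(transcriptions)  # shape: (n_videos, vocab_size)
```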
Sentiment features were derived from the title, description, transcription, and top n comments of the video using the Stanford CoreNLP toolkit [15]. This tool is based on a Recursive Neural Tensor Network (RNTN). RNTNs take as input phrases of any length, and represent them through word vectors and a parse tree. They then compute vectors for higher nodes in the tree using the same tensor-based composition function. This RNTN was trained on a corpus of movie reviews [16], parsed with the Stanford parser [17]. At this early stage, and given the small dataset size, we have not yet included topic modeling or semantic word embedding models.
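One possible way to obtain such sentence-level sentiment scores is sketched below, assuming a locally running CoreNLP server accessed through the pycorenlp client; how the per-sentence scores are aggregated into the 28 sentiment features is not detailed in the paper, so the function only returns the raw scores.

```python
# Minimal sketch of sentence-level sentiment extraction with the Stanford
# CoreNLP sentiment annotator, assuming a CoreNLP server running locally and
# the pycorenlp client; the server address is an assumption.
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")  # assumed server address

def sentiment_scores(text):
    ann = nlp.annotate(text, properties={
        "annotators": "sentiment",
        "outputFormat": "json",
        "timeout": "50000",
    })
    # sentimentValue ranges from 0 (very negative) to 4 (very positive),
    # one value per sentence.
    return [int(s["sentimentValue"]) for s in ann["sentences"]]
```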
Speech: We determined the number of speakers in the video by applying speaker diarization to the audio component of each video, using the LIUM toolbox [18]. The diarization process is composed of five steps: first, music segments are removed using Viterbi decoding; next, the signal is segmented into speaker and background segments by acoustic segmentation and Hierarchical Agglomerative Clustering (HAC); then, a Gaussian Mixture Model (GMM) is trained for each cluster; the signal is then re-segmented through Viterbi decoding; finally, the system performs another HAC, using a cross-likelihood ratio measure and the trained GMMs.
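The number-of-speakers feature can then be derived from the diarization output; the sketch below assumes the usual LIUM SpkDiarization command-line interface and .seg output format, which are not detailed in the paper.

```python
# Minimal sketch of obtaining the number-of-speakers feature from LIUM
# SpkDiarization output. The jar filename, command-line flags, and .seg
# column layout are assumptions about the toolkit's typical usage, not
# details given in the paper.
import subprocess

def diarize(wav_path, seg_path, jar="lium_spkdiarization-8.4.1.jar", show="show"):
    subprocess.run([
        "java", "-Xmx2g", "-jar", jar,
        "--fInputMask=" + wav_path,
        "--sOutputMask=" + seg_path,
        "--doCEClustering", show,
    ], check=True)

def count_speakers(seg_path):
    # Each non-comment line of a .seg file ends with a cluster (speaker) label;
    # comment lines start with ";;".
    speakers = set()
    with open(seg_path) as f:
        for line in f:
            if line.strip() and not line.startswith(";;"):
                speakers.add(line.split()[-1])
    return len(speakers)
```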
Visual: Each video was segmented into scenes, using a simple comparison between pairs of consecutive frames. Scene changes were marked when the difference exceeded a preset threshold. A random frame was selected for each resulting scene. Automatic face detection, using the toolkit in [19], and the computation of color histograms were performed on the resulting frames.
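A minimal sketch of these visual features is given below, using OpenCV and the face_recognition toolkit [19]; the difference threshold is a placeholder, the first frame of each scene is used instead of a random one, and face detections are counted rather than clustered into distinct identities.

```python
# Illustrative sketch of the visual features: scene boundaries from a simple
# frame-difference threshold, face detection on one frame per scene, and an
# average 3 x 256-bin color histogram (768 dimensions, as in Section 4).
import cv2
import numpy as np
import face_recognition

def visual_features(video_path, diff_threshold=30.0):
    cap = cv2.VideoCapture(video_path)
    prev, scene_frames = None, []
    ok, frame = cap.read()
    while ok:
        # Start a new scene when the mean frame difference exceeds the threshold.
        if prev is None or cv2.absdiff(frame, prev).mean() > diff_threshold:
            scene_frames.append(frame)  # the paper samples a random frame per scene
        prev = frame
        ok, frame = cap.read()
    cap.release()

    n_scenes = len(scene_frames)
    # Count face detections over the selected frames (counting distinct faces,
    # as in the paper, would additionally require face clustering).
    n_faces = sum(len(face_recognition.face_locations(
        cv2.cvtColor(f, cv2.COLOR_BGR2RGB))) for f in scene_frames)
    # Per-channel 256-bin histograms, concatenated and averaged over frames.
    hists = [np.concatenate([cv2.calcHist([f], [c], None, [256], [0, 256]).flatten()
                             for c in range(3)]) for f in scene_frames]
    color_hist = np.mean(hists, axis=0) if hists else np.zeros(768)
    return n_scenes, n_faces, color_hist
```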
Metadata: Features derived from the collected metadata
included: a one hot vector representing the video category; the
video duration; the number of views; the number of comments;
the number of thumbs up; and the number of thumbs down at
the time of collection.
3.2. Classifiers
We use two straightforward, well-known models to predict the intermediate labels of the videos, as well as the pathological speech label: logistic regression (LR) and support vector machines (SVMs). For the SVM, we train three distinct models, with linear, polynomial of degree 3, and radial basis function (RBF) kernels.
Given the large scale nature of the online repositories, it is
our hypothesis that the amount of content available per disease
is much larger than the size of the desired dataset. As such, it
is preferable to exclude content with a low confidence measure
of containing the target disease, rather than to include it. This
translates to training models that favour a high precision over a
high recall.
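For illustration, this model family and the precision-oriented bias can be sketched with scikit-learn; the use of class weights to express the preference for precision over recall is our assumption, since the paper does not state how this preference is implemented.

```python
# Minimal sketch of the classifier family described above; the class_weight
# values are illustrative assumptions, not values from the paper.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def build_models(positive_weight=0.5):
    # A lower weight on the positive class makes false positives relatively
    # more costly, trading recall for precision.
    w = {0: 1.0 - positive_weight, 1: positive_weight}
    return {
        "LR": LogisticRegression(max_iter=1000, class_weight=w),
        "SVM-linear": SVC(kernel="linear", class_weight=w),
        "SVM-poly3": SVC(kernel="poly", degree=3, class_weight=w),
        "SVM-RBF": SVC(kernel="rbf", class_weight=w),
    }
```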
4. Filtering Results
In order to understand the contribution of each type of feature to filtering the target content, we trained a distinct classifier for each type of feature, and another one with all the features. The text component contributed 28 features that describe the sentiment in the title, description, transcription, and comments of the video, plus 5096, 5849, and 5097 BoW features for the depression, Parkinson's, and cold datasets, respectively (the number differs for each dataset, based on its vocabulary size). The speech component contributed a single feature describing the number of speakers in the video. The video component contributed a 768-dimensional feature vector describing the average color histogram of the video, plus one feature indicating the number of different faces identified in the video, and one feature indicating the number of scenes detected in the video. The metadata contributed 19 features. By concatenating the features extracted from all modalities, the final feature vectors have 5914, 6667, and 5915 dimensions for the depression, Parkinson's, and cold datasets, respectively.
In total, 540 models were trained: LR, linear SVM, polynomial SVM, and SVM-RBF, for each of the eight types of features plus one for all the features, for each of the 5 labels, per dataset in the WSM Corpus. The models were trained in a leave-one-out cross-validation fashion. Given the limited number of examples in our datasets, and the comparatively large number of features, the feature vectors were reduced in dimensionality by eliminating the features with a Pearson correlation coefficient (PCC) to the label below 0.2, thus preserving only the features that carried some linear correlation with the label.
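A sketch of this protocol is given below, with scikit-learn and SciPy as illustrative implementations; applying the PCC-based selection inside each leave-one-out fold, and using the absolute value of the coefficient, are our assumptions.

```python
# Minimal sketch of the PCC-based feature selection and leave-one-out
# evaluation; the per-fold selection and the absolute-value criterion are
# assumptions about details not stated in the paper.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def select_features(X, y, threshold=0.2):
    # Keep only features whose |PCC| with the label is at least the threshold.
    return np.array([j for j in range(X.shape[1])
                     if abs(pearsonr(X[:, j], y)[0]) >= threshold], dtype=int)

def loo_predictions(X, y):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=int)
    preds = np.zeros(len(y), dtype=int)
    for train_idx, test_idx in LeaveOneOut().split(X):
        cols = select_features(X[train_idx], y[train_idx])
        if cols.size == 0:               # fall back to all features if none survive
            cols = np.arange(X.shape[1])
        clf = SVC(kernel="rbf").fit(X[train_idx][:, cols], y[train_idx])
        preds[test_idx] = clf.predict(X[test_idx][:, cols])
    return preds
```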
The results are reported in precision and recall. We consider
that a good model will have a high precision measure, since
the goal is to maximize the rate of true positives. At the same
time, false negatives are not a major concern in this scenario:
we assume that the repository being mined has a much larger
number of target videos than the size of the desired dataset.
Tables 3, 4, and 5 summarize the performance of the best overall model (SVM-RBF), for depression, Parkinson's, and cold, respectively. The results of the remaining models are omitted for the sake of brevity. The cells highlighted in gray mark models which performed equal to or worse than simply choosing the majority class. These models performed poorly due to the limited number of features available, and the excessively low dimensionality of the feature set. The best performing models for each dataset achieve 93%, 100%, and 88% precision, and 72%, 89%, and 97% recall, for the depression, Parkinson's, and cold datasets, respectively.
Table 3: Performance of the SVM-RBF, reported in precision and recall, in detecting target content in the depression dataset of the WSM Corpus.

| Modality | Features | Vlog | 1st Person | Present | Target topic | All |
|---|---|---|---|---|---|---|
| Text | BoW | 0.98, 1.0 | 0.98, 1.0 | 0.73, 0.9375 | 0.89, 0.89 | 0.86, 0.67 |
| Text | Sentiment | 0.91, 1.0 | 0.77, 0.96 | 0.52, 0.66 | 0.52, 0.71 | 0.33, 0.17 |
| Speech | # Speakers | 0.91, 1.0 | 0.85, 0.91 | 0.56, 0.69 | 0.69, 0.94 | 0.0, 0.0 |
| Video | # Faces | 0.91, 1.0 | 0.89, 0.93 | 0.69, 0.75 | 0.72, 0.94 | 0.0, 0.0 |
| Video | # Keyframes | 0.91, 1.0 | 0.84, 0.96 | 0.56, 0.88 | 0.72, 0.97 | 0.0, 0.0 |
| Video | Color hist. | 0.91, 1.0 | 0.77, 0.98 | 0.69, 0.78 | 0.80, 0.89 | 0.75, 0.33 |
| Metadata | Metadata | 0.91, 1.0 | 0.77, 0.98 | 0.62, 1.0 | 0.60, 0.97 | 0.0, 0.0 |
| All | All | 0.981, 1.0 | 0.93, 0.96 | 0.83, 0.91 | 0.89, 0.91 | 0.93, 0.72 |
Table 4: Performance of the SVM-RBF, reported in precision and recall, in detecting target content in the Parkinson's disease dataset of the WSM Corpus.

| Modality | Features | Vlog | 1st Person | Present | Target topic | All |
|---|---|---|---|---|---|---|
| Text | BoW | 1.0, 0.86 | 0.74, 0.82 | 0.81, 1.0 | 0.91, 1.0 | 1.0, 0.89 |
| Text | Sentiment | 0.71, 0.71 | 0.69, 0.71 | 0.77, 0.49 | 0.73, 0.95 | 0.88, 0.39 |
| Speech | # Speakers | 0.48, 0.69 | 0.56, 1.0 | 0.57, 1.0 | 0.71, 1.0 | 0.0, 0.0 |
| Video | # Faces | 0.63, 0.94 | 0.58, 0.85 | 0.58, 0.89 | 0.75, 0.95 | 0.0, 0.0 |
| Video | # Keyframes | 0.55, 0.89 | 0.51, 0.82 | 0.54, 0.89 | 0.72, 0.91 | 0.0, 0.0 |
| Video | Color hist. | 0.76, 0.71 | 0.69, 0.73 | 0.70, 0.60 | 0.69, 0.95 | 0.0, 0.0 |
| Metadata | Metadata | 0.73, 0.77 | 0.49, 0.76 | 0.56, 0.77 | 0.70, 0.98 | 0.0, 0.0 |
| All | All | 0.97, 0.91 | 0.87, 0.82 | 0.80, 0.91 | 0.90, 1.0 | 1.0, 0.89 |
These tables show the contribution of each type of feature to the overall performance, as well as the performance of the model in identifying each intermediate label correctly, and the final label. The type of feature with the most impact is the text features, specifically the bag-of-words, for every dataset and for every label. In fact, for the Parkinson's and cold datasets, they convey sufficient information to achieve the best performance without any other type of feature. Overall, it is not clear which features, other than the bag-of-words, consistently contribute to the good performance of the models. Label 3 (Present) was the hardest label to estimate correctly in two out of the three datasets. The results for Label 1 (Vlog) are not reported in Table 5, because the cold dataset did not contain enough negative examples to allow model training. We note that some feature types, such as the number of speakers or the number of scenes, are seldom capable of generating a good model, probably due to the limitations of the feature extraction techniques.
5. Comparing the WSM Corpus to Datasets
Collected in Controlled Conditions
Neural networks trained with data collected in controlled conditions were used to detect pathological speech in the WSM Corpus and in their original test datasets. We only report results for the depression and cold corpora, since at the time of this work we did not have access to a dataset for Parkinson's disease detection from speech.
Table 5: Performance of the SVM-RBF, reported in precision and recall, in detecting target content in the cold dataset of the WSM Corpus.

| Modality | Features | Vlog | 1st Person | Present | Target topic | All |
|---|---|---|---|---|---|---|
| Text | BoW | NA | 1.0, 1.0 | 1.0, 1.0 | 0.95, 1.0 | 0.88, 0.97 |
| Text | Sentiment | NA | 0.81, 1.0 | 0.92, 1.0 | 0.64, 0.85 | 0.64, 0.53 |
| Speech | # Speakers | NA | 0.81, 1.0 | 0.92, 1.0 | 0.63, 1.0 | 0.70, 0.53 |
| Video | # Faces | NA | 0.81, 1.0 | 0.92, 1.0 | 0.72, 1.0 | 0.71, 0.5 |
| Video | # Keyframes | NA | 0.85, 0.98 | 0.92, 1.0 | 0.67, 0.97 | 0.56, 0.67 |
| Video | Color hist. | NA | 0.81, 1.0 | 0.92, 1.0 | 0.65, 0.93 | 0.60, 0.5 |
| Metadata | Metadata | NA | 0.81, 1.0 | 0.92, 1.0 | 0.65, 0.97 | 0.57, 0.40 |
| All | All | NA | 1.0, 1.0 | 1.0, 1.0 | 0.95, 1.0 | 0.88, 0.97 |
5.1. Controlled Conditions Datasets
5.1.1. Depression
The depression subset of the Distress Analysis Interview Cor-
pus - Wizard-of-Oz (DAIC-WOZ) [20] is an audio-visual
database of clinical interviews. It consists of 189 sessions rang-
ing between 7 and 33 minutes, 106 of which are present in the
training set, and 34 in the development set. For each of these
sessions a score is provided in the PHQ-8 [21] scale as a mea-
sure of depression. Of the 106 participants in the training parti-
tion, 30 are considered to be depressed. In the development set,
34 subjects are classified as depressed [22].
5.1.2. Cold
The Upper Respiratory Tract Infection Corpus (URTIC) [2] is a
dataset provided by the Institute of Safety Technology of the
University of Wuppertal, Germany, for the Interspeech 2017
ComParE Challenge. It contains recordings of spontaneous and
scripted speech. The training and development partitions com-
prised 210 subjects each, but only 37 had a cold. The two par-
titions include 9,505 and 9,565 chunks of 3 to 10 seconds, re-
spectively [2].
5.2. Feature Extraction
For the depression and cold datasets, we used the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), a set of 88 acoustic features designed to serve as a standard for paralinguistic analysis [23].
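For reference, the 88 eGeMAPS functionals can be extracted per segment as sketched below with the opensmile Python package; the authors' exact openSMILE configuration may differ, so treat this as an illustrative setup.

```python
# Minimal sketch of extracting the 88 eGeMAPS functionals for one speech
# segment with the opensmile Python package (an illustrative choice, not
# necessarily the configuration used by the authors).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def segment_features(wav_path, start_s, end_s):
    # Returns a 1 x 88 DataFrame of functionals for one speech segment.
    return smile.process_file(wav_path, start=start_s, end=end_s)
```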
The DAIC-WOZ and URTIC corpora were already segmented. The WSM Corpus underwent automatic diarization prior to feature extraction, using LIUM, to eliminate silent segments and to divide the speech signal into inter-pausal units. Segments that did not belong to the main speaker were discarded. The minimum segment length was set to 200 ms.
5.3. Model
We follow a simple neural network structure for the model. It consists of three layers: an input layer with 120 units, a hidden layer with 50 units, and an output layer with one unit. The first two layers share the same structure: a Fully Connected (FC) layer, followed by a Batch Normalization (BN) layer, and an Activation layer with Rectified Linear Units (ReLUs). The output layer is an FC layer with a sigmoid activation. During training, Dropout layers are also inserted before the second and third FC layers. Both the Dropout and the BN layers help prevent the model from overfitting [24][25]. These forms of regularization are important in this case, due to the limited size of the training data.
Before training the network, the training set is zero-
centered and normalized by its standard deviation. The values
of the mean and standard deviation of this set are later used to
zero-center and normalize the development set.
The model was implemented in Keras [26], and was trained with RMSProp for 100 epochs, using the default values of this algorithm together with a learning rate of 0.02. To determine the best dropout probability for each dropout layer, a random search was conducted, yielding the following values for the first and second dropout layers, respectively: 0.092 and 0.209 in the depression model, and 0.3746 and 0.5838 in the cold model.
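A Keras sketch of this architecture and training setup is given below; the 88-dimensional input (the eGeMAPS functionals) and the binary cross-entropy loss are inferred from the rest of Section 5 rather than stated explicitly.

```python
# Keras sketch of the network described above; input_dim=88 (eGeMAPS) and the
# binary cross-entropy loss are our inferences, not stated in the paper.
from tensorflow.keras import layers, models, optimizers

def build_model(input_dim=88, p_drop1=0.092, p_drop2=0.209):
    m = models.Sequential([
        layers.Dense(120, input_shape=(input_dim,)),   # first FC layer
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dropout(p_drop1),                       # before the second FC layer
        layers.Dense(50),                              # second FC layer
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dropout(p_drop2),                       # before the third FC layer
        layers.Dense(1, activation="sigmoid"),         # output layer
    ])
    m.compile(optimizer=optimizers.RMSprop(learning_rate=0.02),
              loss="binary_crossentropy", metrics=["accuracy"])
    return m

# Training for 100 epochs; the class weights described below (e.g. 0.8/0.2 for
# depression) can be passed as, e.g.,
#   model.fit(X_train, y_train, epochs=100, class_weight={1: 0.8, 0: 0.2})
```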
To compensate for the unbalanced labels on the training partitions of the depression and cold datasets, we attribute different weights to samples of the positive and negative class: 0.8/0.2 for depression, and 0.9/0.1 for cold.

Table 6: Comparison of the performance, in UAR, of the neural networks in detecting pathological speech in datasets collected in controlled environments versus data collected in-the-wild.

| Voice affecting disease | Controlled Conditions Dataset: Train (segment level) | Controlled Conditions Dataset: Development (segment level) | WSM Corpus: Development (segment level) | WSM Corpus: Development (speaker level) |
|---|---|---|---|---|
| Depression | 60.59 | 60.57 | 54.79 | 61.94 |
| Cold | 59.95 | 66.92 | 53.11 | 54.81 |
5.4. Results
The performance of the neural networks on the WSM Corpus versus on existing datasets collected in controlled conditions is summarized in Table 6. As expected, given the greater variability in recording conditions (e.g., microphones, noise), the performance of the networks decreases when faced with in-the-wild data, compared to data collected in controlled conditions. However, it is possible to improve the classification at the speaker level, relative to the segment level, by aggregating the segments of each speaker, as the last column of Table 6 shows, particularly in the case of depression. The speaker-level prediction is obtained by computing a weighted average of the segment-level predictions, in which the weighting term is given by the segment length.
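This aggregation amounts to a duration-weighted average of the segment scores, as in the sketch below; the 0.5 decision threshold is an assumption.

```python
# Minimal sketch of the speaker-level aggregation: a weighted average of the
# segment-level scores, each segment weighted by its duration; the 0.5
# decision threshold is an assumption.
import numpy as np

def speaker_level_prediction(segment_scores, segment_lengths, threshold=0.5):
    segment_scores = np.asarray(segment_scores, dtype=float)
    segment_lengths = np.asarray(segment_lengths, dtype=float)
    score = np.average(segment_scores, weights=segment_lengths)
    return score, int(score >= threshold)
```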
We hypothesize that an additional justification for the performance drop is the greater variability in the speech alterations of the speakers in the in-the-wild datasets, given that their discourse is not bounded, as it could be in a controlled environment, thus confronting the networks with unseen speech alterations.
6. Conclusion
This work established a baseline for collecting disease specific
datasets of in-the-wild data, containing instances of speech af-
fecting diseases, based on mining multimodal online reposito-
ries. We demonstrated the viability of this process for three
diseases: depression, Parkinson’s, and cold, which leads us to
believe that the process is generalizable to the collection of datasets for any disease. Given its modular nature, each component of the system can be individually improved.
The best performing models achieved a precision of 93%,
100%, and 88%, and a recall of 72%, 89%, and 97%, for the
dataset of depression, Parkinson’s, and cold respectively, in the
task of filtering videos containing speech affecting diseases.
At the same time, we compared the WSM Corpus to datasets collected in controlled conditions. The performance of the existing models decreased when faced with in-the-wild data, compared to data collected in controlled conditions. We hypothesize this is due to the greater variability in recording conditions (e.g., microphone, noise) and in the effects of speech-altering diseases on the subjects' speech.
For future work, we will focus on three problems: collect-
ing and making available larger datasets of several speech af-
fecting diseases, thus increasing the dataset resources available
for medical applications; improving the performance of each
individual module of our proposed system, replacing them with
disease specific tools; and most importantly, moving towards
a completely unsupervised data collection system, by dropping
the label requirements during the training stage.
7. References
[1] J. R. Orozco-Arroyave, E. A. Belalcazar-Bolanos, J. D. Arias-Londoño, J. F. Vargas-Bonilla, S. Skodda, J. Rusz, K. Daqrouq, F. Hönig, and E. Nöth, “Characterization methods for the detection of multiple voice disorders: Neurological, functional, and laryngeal diseases,” IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 6, pp. 1820–1828, 2015.
[2] B. Schuller, S. Steidl, A. Batliner, E. Bergelson, J. Krajewski,
C. Janott, A. Amatuni, M. Casillas, A. Seidl, M. Soderstrom et al.,
“The interspeech 2017 computational paralinguistics challenge:
Addressee, cold & snoring,” in Computational Paralinguistics
Challenge (ComParE), Interspeech 2017, 2017.
[3] K. López-de-Ipiña, J.-B. Alonso, C. M. Travieso, J. Solé-Casals, H. Egiraun, M. Faundez-Zanuy, A. Ezeiza, N. Barroso, M. Ecay-Torres, P. Martinez-Lage et al., “On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis,” Sensors, vol. 13, no. 5, pp. 6730–6745, 2013.
[4] K. López-de-Ipiña, J. B. Alonso, J. Solé-Casals, N. Barroso, P. Henriquez, M. Faundez-Zanuy, C. M. Travieso, M. Ecay-Torres, P. Martinez-Lage, and H. Eguiraun, “On automatic diagnosis of Alzheimer's disease based on spontaneous speech analysis and emotional temperature,” Cognitive Computation, vol. 7, no. 1, pp. 44–55, 2015.
[5] A. A. Dibazar, S. Narayanan, and T. W. Berger, “Feature analy-
sis for automatic detection of pathological speech,” in Engineer-
ing in Medicine and Biology, 2002. 24th Annual Conference and
the Annual Fall Meeting of the Biomedical Engineering Society
EMBS/BMES Conference, 2002. Proceedings of the Second Joint,
vol. 1. IEEE, 2002, pp. 182–183.
[6] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and
T. F. Quatieri, “A review of depression and suicide risk assessment
using speech analysis,” Speech Communication, vol. 71, pp. 10–
49, 2015.
[7] J. Correia, I. Trancoso, and B. Raj, “Detecting psychological distress in adults through transcriptions of clinical interviews,” in International Conference on Advances in Speech and Language Technologies for Iberian Languages. Springer, 2016, pp. 162–171.
[8] H. Liao, E. McDermott, and A. Senior, “Large scale deep neu-
ral network acoustic modeling with semi-supervised training data
for youtube video transcription,” in Automatic Speech Recognition
and Understanding (ASRU), 2013 IEEE Workshop on. IEEE,
2013, pp. 368–373.
[9] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nico-
laou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end
speech emotion recognition using a deep convolutional recurrent
network,” in Acoustics, Speech and Signal Processing (ICASSP),
2016 IEEE International Conference on. IEEE, 2016, pp. 5200–
5204.
[10] D. Palaz, M. Magimai.-Doss, and R. Collobert, “Analysis of
cnn-based speech recognition system using raw speech as input,”
Idiap, Tech. Rep., 2015.
[11] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and
S. Zafeiriou, “End-to-end multimodal emotion recognition using
deep neural networks,” IEEE Journal of Selected Topics in Signal
Processing, vol. 11, no. 8, pp. 1301–1309, 2017.
[12] S. Oviatt, A. DeAngeli, and K. Kuhn, “Integration and synchro-
nization of input modes during multimodal human-computer in-
teraction,” in Referring Phenomena in a Multimedia Context and
their Computational Treatment. Association for Computational
Linguistics, 1997, pp. 1–13.
[13] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of
affective computing: From unimodal analysis to multimodal fu-
sion,” Information Fusion, vol. 37, pp. 98–125, 2017.
[14] M. Vrigkas, C. Nikou, and I. A. Kakadiaris, “Identifying human
behaviors using synchronized audio-visual cues,” IEEE Transac-
tions on Affective Computing, vol. 8, no. 1, pp. 54–66, 2017.
[15] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1631–1642.
[16] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships
for sentiment categorization with respect to rating scales,” in Pro-
ceedings of the 43rd annual meeting on association for compu-
tational linguistics. Association for Computational Linguistics,
2005, pp. 115–124.
[17] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003.
[18] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, and S. Meignier, “An open-source state-of-the-art toolbox for broadcast news diarization,” in Interspeech, 2013.
[19] A. Geitgey, “Facerecog,” https://github.com/ageitgey/face_recognition, 2017.
[20] J. Gratch, R. Artstein, G. M. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault, S. Marsella et al., “The distress analysis interview corpus of human and computer interviews,” in LREC. Citeseer, 2014, pp. 3123–3128.
[21] K. Kroenke, T. W. Strine, R. L. Spitzer, J. B. Williams, J. T. Berry,
and A. H. Mokdad, “The PHQ-8 as a measure of current depres-
sion in the general population,” J Affect Disord, vol. 114, no. 1-3,
pp. 163–173, Apr 2009.
[22] M. F. Valstar, J. Gratch, B. W. Schuller, F. Ringeval, D. Lalanne,
M. Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic,
“AVEC 2016 - depression, mood, and emotion recognition
workshop and challenge,” CoRR, vol. abs/1605.01600, 2016.
[Online]. Available: http://arxiv.org/abs/1605.01600
[23] F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, and K. Truong, “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016.
[24] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
[25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[26] F. Chollet et al., “Keras,” https://github.com/keras-team/keras,
2015.