The Mexican Emotional Speech Database (MESD): elaboration and assessment based on machine learning*
Mathilde M. Duville, Luz M. Alonso-Valerdi, and David I. Ibarra-Zarate
2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Oct 31 - Nov 4, 2021, Virtual Conference
*Research supported by the Mexican National Council of Science and Technology (grant reference number: 1061809). The authors are with Tecnologico de Monterrey, Escuela de Ingenieria y Ciencias, Ave. Eugenio Garza Sada 2501, Monterrey, N.L., México, 64849 (e-mails: A00829725@itesm.mx, lm.aloval@tec.mx, david.ibarra@tec.mx).
Abstract—The Mexican Emotional Speech Database is
presented along with the evaluation of its reliability based on
machine learning analysis. The database contains 864 voice
recordings with six different prosodies: anger, disgust, fear,
happiness, neutral, and sadness. Furthermore, three voice
categories are included: female adult, male adult, and child. The
following emotion recognition accuracies were reached: 89.4%, 93.9%, and 83.3% on female, male, and child voices, respectively.
Clinical Relevance — The Mexican Emotional Speech Database contributes emotional speech data to healthcare analytics and can support objective diagnosis and disease characterization.
I. INTRODUCTION
Acoustic cues of emotional speech production are major
predictors of health conditions such as depression [1], autism
[2], or schizophrenia [3]. Developments in wireless communication and machine learning have led to smart healthcare systems designed to detect pathologies from voice signal analysis without an in-person medical visit. Physiological
signals are uploaded to a cloud computer where they can be
accessed for subjective (undertaken by a physician) or
objective (performed by a computational algorithm) analysis
[4]. Objective pathological assessments rely on healthcare big
data used for classification of diseases [5]. On the other hand,
databases of speech signals must also be used to explore the
linguistic and emotional perception that characterizes
particular pathological conditions. For instance, the development of validated stimuli for affective prosody may be useful to study the behavioral and neuronal impairments that underlie atypical emotional perception in autism [6], [7].
As emotional expression is shaped by cultural variations [8],
databases optimized for the population under study are an
urgent need. The aim of this work is to provide a Mexican
Emotional Speech Database (MESD) that contains single-
word utterances for child, female, and male voices, expressed
with six basic emotions: anger, disgust, fear, happiness,
neutral, and sadness. Two corpora were created: corpus A involves the repetition of 24 words across prosodies and voice categories, while corpus B offers utterances of words controlled for linguistic (concreteness, familiarity, and frequency of use) and emotional-semantic (valence, arousal, and discrete emotions) dimensions. Researchers, engineers,
and physicians can rely on utterances from the corpus that best suits their needs and experimental conditions.
II. VOICE RECORDINGS
A. MESD Word Corpus
Nouns and adjectives were selected from two sources: the
single-word corpus of the INTERFACE for Castilian Spanish
database [9], hereinafter named corpus A; and the Madrid Affective Database for Spanish (MADS) [10], hereinafter named corpus B. Words from corpus A recurred across emotions and
voices (child, male, female). Words from corpus B were
selected according to the following criteria: (1) subjective age of acquisition under 9 years old; (2) emotional semantic rating strictly greater than 2.5 (on a 5-point scale) for five discrete emotions (anger, disgust, fear, happiness, and sadness); (3) valence and arousal ranging from 1 to 4 or from 6 to 9 for emotional words, and greater than 4 but lower than 6 for neutral ones (9-point scale); and (4) emotions matched with regard to three linguistic features: concreteness, familiarity, and frequency-of-use ratings. Scores from males, from females, and averaged over all subjects were considered separately.
In sum, the MESD word corpus included 48 words per emotion (24 from corpus A and 24 from corpus B), that is, 288 single words to be uttered by male, female, and child voices.
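For illustration, criteria (1) to (3) could be scripted roughly as follows. This is a minimal sketch, not the authors' original selection code, and the column names (aoa, valence, arousal, and one rating column per discrete emotion) are hypothetical placeholders for the MADS norms [10].

```python
import pandas as pd

# Hypothetical column names for the MADS norms [10]; adjust to the actual file.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness"]

def select_corpus_b(norms: pd.DataFrame) -> pd.DataFrame:
    """Apply criteria (1)-(3): age of acquisition, discrete-emotion rating,
    and valence/arousal windows for emotional vs. neutral words."""
    # (1) subjective age of acquisition under 9 years old
    young = norms[norms["aoa"] < 9]

    # (2) rating strictly greater than 2.5 (5-point scale) for at least one emotion
    emotional = young[young[EMOTIONS].max(axis=1) > 2.5]

    # (3) valence and arousal both in [1, 4] or [6, 9] for emotional words...
    affect_ok = emotional[["valence", "arousal"]].apply(
        lambda c: c.between(1, 4) | c.between(6, 9)).all(axis=1)

    # ...and strictly between 4 and 6 for neutral words
    neutral_ok = young[["valence", "arousal"]].apply(
        lambda c: (c > 4) & (c < 6)).all(axis=1)

    return pd.concat([emotional[affect_ok], young[neutral_ok]]).drop_duplicates()
```

Criterion (4), the matching of linguistic features across emotions, corresponds to the iterative procedure described in the next paragraph.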
To control frequency, familiarity, and concreteness ratings, the R software (R Foundation for Statistical Computing, Vienna, Austria) was used to run a one-way ANOVA on each parameter separately, with emotion as the factor. Independence of
residuals was assessed by Durbin-Watson test. Normality and
homogeneity were assessed by Shapiro-Wilk and Bartlett
tests, respectively. When parametric assumptions were not met, the Kruskal-Wallis test was applied instead. Post-hoc tests were used to
statistically assess specific differences (Tukey after ANOVA,
Wilcoxon tests with p-value adjustment by Holm method after
Kruskal-Wallis). In case of significance, outlier values (i.e.,
ratings for frequency, familiarity or concreteness outside the
range defined by percentiles 2.5 and 97.5) were removed until
non-significance was reached. Level of significance was set at
p<0.05.
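The iterative outlier-removal procedure can be sketched as follows using scipy instead of the authors' R code; the data layout (one row per word, with an emotion column) is an assumption, and the Durbin-Watson check and post-hoc tests are omitted for brevity.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro, bartlett, f_oneway, kruskal

def match_emotions_on(norms: pd.DataFrame, rating: str, alpha: float = 0.05) -> pd.DataFrame:
    """Remove ratings outside the 2.5-97.5 percentile range until no significant
    inter-emotion difference remains for the given rating column."""
    data = norms.copy()
    for _ in range(50):  # safety bound for the sketch
        groups = [g[rating].to_numpy() for _, g in data.groupby("emotion")]
        normal = all(shapiro(g).pvalue > alpha for g in groups)
        homogeneous = bartlett(*groups).pvalue > alpha
        if normal and homogeneous:
            p = f_oneway(*groups).pvalue      # parametric case: one-way ANOVA
        else:
            p = kruskal(*groups).pvalue       # non-parametric fallback
        if p > alpha:
            break                              # emotions are matched on this rating
        lo, hi = np.percentile(data[rating], [2.5, 97.5])
        data = data[data[rating].between(lo, hi)]  # drop outliers and retest
    return data
```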
B. Participants and Ethical Considerations
Participants were volunteers and non-professional actors: four adult males (mean age = 22.75, SD = 2.06), four adult females (mean age = 22.25, SD = 2.50), and eight children (five girls and three boys; mean age = 9.87, SD = 1.12). They were included in the study if they had grown up in Mexico in a Mexican cultural environment (Mexican academic and family environments). Participants were excluded if they presented any pathology impairing emotional behavior, hearing, or speech, or any illness affecting voice timbre. No participant had lived outside Mexico for more than two weeks during the last four years. Written informed consent was obtained from all participants and from the children's parents.
Recordings were conducted in accordance with the
Declaration of Helsinki and approved on July 14th, 2020 by the
Ethical Committee of the School of Medicine of Tecnologico
de Monterrey (register number within the National Committee
of Bioethics CONBIOETICA 19 CEI 011-2016-10-17) under
the following number: P000409-autismoEEG2020-CEIC-
CR002.
C. Material and Procedures for Voice Recording
Recordings were carried out in a professional recording
studio. A microphone Sennheiser e835 with a flat frequency
response (100 Hz to 10 kHz), and a Focusrite Scarlett 2i4 audio
interface connected to the microphone with an XLR cable and
to a computer were used. Audio files were recorded in the
digital audio workstation REAPER (Rapid Environment for
Audio Production, Engineering, and Recording), and stored as
a sequence of 24-bit with a sample rate of 48000Hz.
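As a side note, a recording's conformity with this format can be checked programmatically; the following minimal sketch assumes the soundfile Python library and is not part of the authors' workflow.

```python
import soundfile as sf

def check_mesd_format(path: str) -> bool:
    """Return True if the file is 24-bit PCM sampled at 48 kHz."""
    info = sf.info(path)
    return info.samplerate == 48000 and info.subtype == "PCM_24"
```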
Adult sessions lasted one hour and child sessions lasted 30 minutes. Each adult uttered words from corpora A and B
(48 words per emotion). Four children uttered words from
corpus A and 4 children uttered the ones from corpus B (24
words per emotion). The order of corpora was counterbalanced
across adult sessions. Emotions were randomly distributed in
both adult and child sessions. After familiarizing themselves with the word set, participants were required to utter each word
with the corresponding intended emotional intonation: anger,
disgust, fear, happiness, neutral, or sadness. Participants were
asked to wait at least five seconds between two utterances in order to focus before each utterance.
III. EMOTION RECOGNITION
Before extracting acoustic features, each word was
excerpted from the continuous recording of each session to
generate an audio file for each individual word.
A. Acoustic Features Extraction and Data Normalization
Praat and Matlab R2019b were used to extract the features
detailed in Table I. The normality of the resulting 30 acoustic features was assessed by the Shapiro-Wilk test. As the features were not normally distributed, a min-max normalization was applied as described in (1):
$x_{normalized} = \frac{x - \min_k}{\max_k - \min_k}$    (1)

where $x$ is the feature value to be normalized, $\max_k$ is the highest value of acoustic feature vector $k$, and $\min_k$ is the lowest.
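As an illustration of this stage, the sketch below extracts a few of the Table I features with librosa and applies (1) column-wise; it is not the authors' Praat/Matlab pipeline, and librosa's pitch and energy estimators only approximate the Praat measures.

```python
import numpy as np
import librosa

def example_features(path: str) -> np.ndarray:
    """Extract a small subset of the Table I features from one utterance."""
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=600, sr=sr)   # pitch contour (Hz)
    rms = librosa.feature.rms(y=y)[0]                       # frame-wise RMS energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # 13 MFCCs
    return np.hstack([np.nanmean(f0), np.nanstd(f0), rms.mean(), mfcc.mean(axis=1)])

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Rescale each feature column of X to [0, 1] as in (1)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)   # min_k and max_k per feature vector k
    return (X - mins) / (maxs - mins)
```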
B. Support Vector Machine (SVM) Predictive Model
Matlab R2019b was used to carry out a supervised learning
analysis using a cubic SVM classifier. Hyper-parameters were
adjusted to a box constraint level (soft-margin penalty) at 10.
The multiclass method (one-vs-one or one-vs-all) and the kernel scale parameters were set to "auto"; that is, both were automatically optimized according to the dataset. 77% of the data was used for training and 23% for validation. A stratified train/test split was used and repeated 10 times: data were randomly split before each repetition so that training and validation sets each contained an equal number of words per emotion. Specifically, training data included 222
observations (37 per emotion), and 66 observations (11 per
emotion) were used for validation. Accuracy, recall, precision
and F-score were computed in accordance with the resulting
confusion matrix [12].
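A comparable setup can be sketched with scikit-learn; the paper used Matlab's cubic SVM, so the degree-3 polynomial kernel, gamma="scale", and explicit one-vs-one shape below are stand-ins rather than an exact reproduction of the original classifier.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(X: np.ndarray, y: np.ndarray, n_repeats: int = 10) -> np.ndarray:
    """Stratified 77/23 train/validation splits, repeated; returns the mean
    accuracy, precision, recall, and F-score over the repetitions."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.23, stratify=y, random_state=seed)
        clf = SVC(kernel="poly", degree=3, C=10, gamma="scale",
                  decision_function_shape="ovo")   # cubic SVM, soft margin C = 10
        clf.fit(X_tr, y_tr)
        y_pred = clf.predict(X_va)
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_va, y_pred, average="macro", zero_division=0)
        scores.append((accuracy_score(y_va, y_pred), prec, rec, f1))
    return np.asarray(scores).mean(axis=0)
```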
C. Adult Voices
Male and female voices were considered as two separate
datasets. The input data for training were the 30 normalized
acoustic features extracted from each utterance after a
dimensionality reduction based on Principal Component
Analysis, which explained 95% of the variance.

TABLE I. EXTRACTED ACOUSTIC FEATURES FOR EMOTION RECOGNITION

Prosodic features:
- Fundamental frequency or pitch (Hz): mean and standard deviation over the entire waveform.
- Speech rate: number of syllables per second.
- Root mean square energy (V): square root of the mean energy.
- Intensity (dB): mean and standard deviation over the entire waveform.

Voice quality features:
- Jitter (%): jitter local, the average absolute difference between two consecutive periods, divided by the average period; and jitter ppq5, the 5-point period perturbation quotient, i.e., the average absolute difference between a period and the average of it and its four closest neighbors, divided by the average period.
- Shimmer (%): shimmer local, the average absolute difference between the amplitudes of two consecutive periods, divided by the average amplitude; and shimmer apq5, the 5-point amplitude perturbation quotient, i.e., the average absolute difference between the amplitude of a period and the average of the amplitudes of it and its four closest neighbors, divided by the average amplitude.
- Mean harmonics-to-noise ratio (dB): mean over the entire waveform of ten times the base-10 logarithm of the ratio between the percentage of the signal composed of harmonics and the percentage composed of noise.

Spectral features:
- Formants (Hz): F1, F2, F3, mean and bandwidth at center.
- Mel-frequency cepstral coefficients: coefficients 1 to 13.

A classification analysis was conducted on utterances from each actor
independently. The final version of the MESD was created by
selecting for each emotion the utterances from the actor
leading to the highest F-score during validation. This process
is described in Fig. 1. A classification analysis was applied on
the final 288-utterance dataset.
Figure 1. Process to select utterances for adult voices.
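The selection process of Fig. 1 might be expressed as follows. This sketch reuses the stand-in SVM settings from the previous example, fits the PCA per actor, and assumes flat NumPy arrays of features, emotion labels, and actor identifiers; none of these choices is confirmed by the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def per_emotion_f_scores(X, y, emotions, n_repeats=10):
    """Mean per-emotion F-score over repeated stratified 77/23 splits."""
    X = PCA(n_components=0.95).fit_transform(X)      # keep 95% of the variance
    runs = []
    for seed in range(n_repeats):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.23, stratify=y, random_state=seed)
        clf = SVC(kernel="poly", degree=3, C=10, gamma="scale").fit(X_tr, y_tr)
        runs.append(f1_score(y_va, clf.predict(X_va), labels=emotions,
                             average=None, zero_division=0))
    return dict(zip(emotions, np.mean(runs, axis=0)))

def best_actor_per_emotion(features, labels, actors, emotions):
    """For each emotion, keep the actor whose utterances reach the highest F-score."""
    scores = {a: per_emotion_f_scores(features[actors == a], labels[actors == a], emotions)
              for a in np.unique(actors)}
    return {e: max(scores, key=lambda a: scores[a][e]) for e in emotions}
```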
D. Child Voices
A k-means clustering analysis was applied on features
extracted for each emotion separately (24 observations per
participant, leading to 6 datasets of 192 observations). This
approach allowed the identification of the most representative combinations of utterances from actors who uttered corpus A with those from actors who uttered corpus B; namely, it helped select the most relevant sets of 48 utterances per emotion to be used as input for the subsequent SVM-based classification. The squared Euclidean distance metric and the k-means++ algorithm for cluster-center initialization were used. The optimal number of clusters was assessed by computing silhouette scores. The number of clusters that led to the highest
average silhouette score was selected, namely, 2 clusters.
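A sketch of this cluster-number selection, using scikit-learn's k-means (k-means++ initialization, squared Euclidean distance) and the average silhouette criterion; the range of candidate k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X: np.ndarray, k_candidates=range(2, 9)):
    """Fit k-means for several k and keep the model with the highest
    average silhouette score (the paper reports an optimum of k = 2)."""
    best = max(
        (KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
         for k in k_candidates),
        key=lambda model: silhouette_score(X, model.labels_))
    return best.n_clusters, best
```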
In each cluster, utterances of words coming from corpus A
(4 participants) and corpus B (4 participants) were considered
separately. For utterances from both corpora, the number of
observations for individual participants in each cluster was
computed. Pairs of participants (one who uttered words from corpus A and one who uttered words from corpus B) were formed in each cluster by taking, for each corpus, the participant with the highest number of observations in that cluster. As a result, each pair of participants contributed 288 utterances (48 per emotion, including 24 words from each corpus).
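The pairing rule could be coded as below; the sketch assumes one cluster label per utterance and a mapping from each child participant to the corpus (A or B) they uttered.

```python
import numpy as np
from collections import Counter

def pair_participants(cluster_ids, participant_ids, corpus_of):
    """For each cluster, pair the corpus-A participant and the corpus-B
    participant with the most observations in that cluster."""
    pairs = []
    for c in np.unique(cluster_ids):
        counts = Counter(participant_ids[cluster_ids == c])
        best_a = max((p for p in counts if corpus_of[p] == "A"), key=counts.get)
        best_b = max((p for p in counts if corpus_of[p] == "B"), key=counts.get)
        pairs.append((best_a, best_b))
    return pairs
```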
Then, a classification analysis was carried out on data from
each resulting pair. The input data for training were the 30
normalized acoustic features extracted from each utterance,
after a dimensionality reduction based on Principal
Component Analysis that explained 95% of the variance. The
final version of the MESD was created by selecting for each
emotion the utterances from the pair leading to the highest F-
score during validation. The resulting set of 288 utterances was
used to evaluate the accuracy and F-score for emotion
recognition on the final version of MESD.
IV. RESULTS
The MESD is freely available at:
http://dx.doi.org/10.17632/cy34mh68j9.1
A. MESD Word Corpus: Corpus B
No inter-emotion difference was found for frequency of use, familiarity, or concreteness ratings after outliers were removed; namely, for words used for child, male, and female utterances, the statistical analysis yielded non-significant p-values (p>0.05) for each parameter.
B. Female Adult Voice
Fig. 2 presents the accuracies and F-scores reached for
each emotion and their mean values. It is important to note that
the most representative female participants for each emotion
resulting from the single-actor classification were: (1)
participant 2 for anger, (2) participant 2 for disgust, (3)
participant 1 for fear, (4) participant 6 for happiness, (5)
participant 2 for neutral, and (6) participant 1 for sadness.
Figure 2. SVM classifier outcome: accuracy and F-score on female voices.
C. Male Adult Voice
Fig. 3 presents the accuracies and F-scores reached for
each emotion and their mean values.
Figure 3. SVM classifier outcome: accuracy and F-score on male voices.
The most representative male participants for each emotion
resulting from the single-actor classification were: (1)
participant 3 for anger, (2) participant 12 for disgust, (3)
participant 3 for fear, (4) participant 3 for happiness, (5)
participant 3 for neutral, and (6) participant 12 for sadness.
D. Child Voice
The most representative pairs of child participants for each
emotion resulting from the single-pair classification were: (1)
participants 16 and 5 for anger, (2) participants 9 and 15 for
disgust, (3) participants 16 and 5 for fear, (4) participants 17 and 15 for happiness, (5) participants 16 and 7 for neutral, and (6) participants 16 and 5 for sadness. Fig. 4 presents the accuracies
and F-scores reached for each emotion and their mean values.
Figure 4. SVM classifier outcome: accuracy and F-score on child voices.
V. DISCUSSION
The MESD contributes to the production of reliable
emotional speech data available for healthcare analytics. To
date, linguistic affective stimuli adapted to Mexican Spanish are very scarce [13], and very few existing databases target child voices [14]. The current database
presents several advantages: (1) a word corpus controlled for
emotional semantic and linguistic parameters was provided
[10]; and (2) the MESD includes single-word utterances that, contrary to sentences, do not embed variations of emotional information throughout the utterance [15]. Furthermore, the
cognitive processing of emotional words does not involve
prediction, integration, and syntactic unification processes that
may interplay with the understanding of emotional
information [16]. Concreteness, familiarity, and frequency of use of words from corpus B were controlled to attenuate trade-off effects between linguistic and emotional processing when the MESD is used as stimuli for emotional perception. Nevertheless, the word recurrence that characterizes utterances of nouns and adjectives from corpus A may be appropriate for comparisons across prosodies and voice categories based on analyses sensitive to phonetic content. Finally, for words from both corpora, the MESD provides 24 utterances from a single speaker for each emotional prosody, which guarantees speaker homogeneity within each emotional intonation pattern. In conclusion, the MESD is a reliable source of emotional utterances that can be applied to
(1) big data for smart healthcare, (2) the characterization of normal and pathological emotional prosody processing and expression, and (3) the exploration of normal or pathological processing and expression of acoustic-linguistic information.
ACKNOWLEDGMENT
We thank the “Instituto Estatal de la Juventud (INJUVE)”,
and Norberto E. Naal-Ruiz (https://orcid.org/0000-0002-
1203-8925). We acknowledge the Evaluation and Language
resources Distribution Agency (ELDA) S.A.S., for sharing the
“Emotional speech synthesis database, ELRA catalogue
(http://catalog.elra.info), ISLRN: 477-238-467-792-9, ELRA
ID: ELRA-S0329”.
REFERENCES
[1] S. Shinohara et al., “Evaluation of the Severity of Major Depression
Using a Voice Index for Emotional Arousal,” Sensors, vol. 20, no. 18,
p. 5041, Sep. 2020, doi: 10.3390/s20185041.
[2] D. J. Hubbard, D. J. Faso, P. F. Assmann, and N. J. Sasson,
“Production and perception of emotional prosody by adults with
autism spectrum disorder: Affective prosody in ASD,” Autism Res.,
vol. 10, no. 12, pp. 1991–2001, Dec. 2017, doi: 10.1002/aur.1847.
[3] D. Chakraborty et al., “Prediction of Negative Symptoms of
Schizophrenia from Emotion Related Low-Level Speech Signals,” in
2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Calgary, AB, Apr. 2018, pp. 6024–6028. doi:
10.1109/ICASSP.2018.8462102.
[4] M. Alhussein and G. Muhammad, “Automatic Voice Pathology
Monitoring Using Parallel Deep Models for Smart Healthcare,” IEEE
Access, vol. 7, pp. 46474–46479, 2019, doi:
10.1109/ACCESS.2019.2905597.
[5] M. S. Hossain and G. Muhammad, “Healthcare Big Data Voice
Pathology Assessment Framework,” IEEE Access, vol. 4, p. 10, 2016.
[6] J. S. Mulcahy, M. Davies, L. Quadt, H. D. Critchley, and S. N.
Garfinkel, “Interoceptive awareness mitigates deficits in emotional
prosody recognition in Autism,” Biol. Psychol., vol. 146, p. 107711,
Sep. 2019, doi: 10.1016/j.biopsycho.2019.05.011.
[7] R. Lindström et al., “Atypical perceptual and neural processing of
emotional prosodic changes in children with autism spectrum
disorders,” Clin. Neurophysiol., vol. 129, no. 11, pp. 2411–2420,
Nov. 2018, doi: 10.1016/j.clinph.2018.08.018.
[8] P. Laukka et al., “The expression and recognition of emotions in the
voice across five nations: A lens model analysis based on acoustic
features.,” J. Pers. Soc. Psychol., vol. 111, no. 5, pp. 686–705, Nov.
2016, doi: 10.1037/pspi0000066.
[9] V. Hozjan, Z. Kacic, A. Moreno, A. Bonafonte, and A. Nogueiras,
“Interface databases: Design and collection of a multilingual
emotional speech database,” in Proc. 3rd Int. Conf. Lang. Resour. Eval. (LREC 2002), pp. 2024–2028.
[10] J. A. Hinojosa et al., “Affective norms of 875 Spanish words for five
discrete emotional categories and two emotional dimensions,” Behav.
Res. Methods, vol. 48, no. 1, pp. 272–284, Mar. 2016, doi:
10.3758/s13428-015-0572-5.
[11] Z.-T. Liu, Q. Xie, M. Wu, W.-H. Cao, Y. Mei, and J.-W. Mao,
“Speech emotion recognition based on an improved brain emotion
learning model,” Neurocomputing, vol. 309, pp. 145–156, Oct. 2018,
doi: 10.1016/j.neucom.2018.05.005.
[12] A. Tharwat, “Classification assessment methods,” Appl. Comput.
Inform., vol. ahead-of-print, no. ahead-of-print, Aug. 2020, doi:
10.1016/j.aci.2018.08.003.
[13] S.-O. Caballero-Morales, “Recognition of Emotions in Mexican
Spanish Speech: An Approach Based on Acoustic Modelling of
Emotion-Specific Vowels,” Sci. World J., vol. 2013, pp. 1–13, 2013,
doi: 10.1155/2013/162093.
[14] H. Pérez-Espinosa, J. Martínez-Miranda, I. Espinosa-Curiel, J.
Rodríguez-Jacobo, L. Villaseñor-Pineda, and H. Avila-George,
“IESC-Child: An Interactive Emotional Children’s Speech Corpus,”
Comput. Speech Lang., vol. 59, pp. 55–74, Jan. 2020, doi:
10.1016/j.csl.2019.06.006.
[15] K. Hammerschmidt and U. Jürgens, “Acoustical Correlates of
Affective Prosody,” J. Voice, vol. 21, no. 5, pp. 531–540, Sep. 2007,
doi: 10.1016/j.jvoice.2006.03.002.
[16] J. A. Hinojosa, E. M. Moreno, and P. Ferré, “Affective
neurolinguistics: towards a framework for reconciling language and
emotion,” Lang. Cogn. Neurosci., vol. 35, no. 7, pp. 813–839, Sep.
2020, doi: 10.1080/23273798.2019.1620957.