
The Mexican Emotional Speech Database (MESD): elaboration and assessment based on machine learning

Authors: Mathilde M. Duville, Luz M. Alonso-Valerdi, and David I. Ibarra-Zarate

2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Oct 31 - Nov 4, 2021, Virtual Conference

Abstract

The Mexican Emotional Speech Database (MESD) is presented along with the evaluation of its reliability based on machine learning analysis. The database contains 864 voice recordings with six different prosodies: anger, disgust, fear, happiness, neutral, and sadness. Furthermore, three voice categories are included: female adult, male adult, and child. The following emotion recognition accuracies were reached: 89.4%, 93.9%, and 83.3% for female, male, and child voices, respectively.

Clinical Relevance — The Mexican Emotional Speech Database contributes to healthcare emotional speech data and can be used to support objective diagnosis and disease characterization.
I. INTRODUCTION
Acoustic cues of emotional speech production are major
predictors of health conditions such as depression [1], autism
[2], or schizophrenia [3]. Developments in wireless
communication and machine learning engineering led to smart
healthcare systems designed to detect pathologies from voice
signal analysis without medical visitation. Physiological
signals are uploaded to a cloud computer where they can be
accessed for subjective (undertaken by a physician) or
objective (performed by a computational algorithm) analysis
[4]. Objective pathological assessments rely on healthcare big
data used for classification of diseases [5]. On the other hand,
databases of speech signals must also be used to explore the
linguistic and emotional perception that characterizes
particular pathological conditions. For instance, the
development of validated stimuli for affective prosody may be
useful to study the behavioural and neuronal impairments that
define the atypical emotional perception of autistic individuals [6], [7].
As emotional expression is shaped by cultural variations [8],
databases optimized for the population under study are
urgently needed. The aim of this work is to provide a Mexican
Emotional Speech Database (MESD) that contains single-
word utterances for child, female, and male voices, expressed
with six basic emotions: anger, disgust, fear, happiness,
neutral, and sadness. Two corpora were created: corpus A involves the repetition of 24 words across prosodies and voice categories, and corpus B offers utterances of words controlled for linguistic (concreteness, familiarity, and frequency of use) and emotional semantic (valence, arousal, and discrete emotions) dimensions. Researchers, engineers, and physicians can rely on utterances from the corpus that best suits their needs and experimental conditions.
*Research supported by the Mexican National Council of Science and Technology (grant reference number: 1061809).
Mathilde M. Duville is with Tecnologico de Monterrey, Escuela de Ingenieria y Ciencias, Ave. Eugenio Garza Sada 2501, Monterrey, N.L., México, 64849 (e-mail: A00829725@itesm.mx).
Luz M. Alonso-Valerdi is with Tecnologico de Monterrey, Escuela de Ingenieria y Ciencias, Ave. Eugenio Garza Sada 2501, Monterrey, N.L., México, 64849 (e-mail: lm.aloval@tec.mx).
David I. Ibarra-Zarate is with Tecnologico de Monterrey, Escuela de Ingenieria y Ciencias, Ave. Eugenio Garza Sada 2501, Monterrey, N.L., México, 64849 (e-mail: david.ibarra@tec.mx).
II. VOICE RECORDINGS
A. MESD Word Corpus
Nouns and adjectives were selected from two sources: the single-word corpus of the INTERFACE for Castilian Spanish database [9], hereinafter corpus A; and the Madrid Affective Database for Spanish (MADS) [10], from which corpus B was created. Words from corpus A recurred across emotions and
voices (child, male, female). Words from corpus B were
selected according to the following criteria: (1) subjective age of acquisition under 9 years old; (2) emotional semantic rating strictly greater than 2.5 (on a 5-point scale) for five discrete emotions (anger, disgust, fear, happiness, and sadness); (3) valence and arousal ranging from 1 to 4 or from 6 to 9 for emotional words, and greater than 4 but lower than 6 for neutral ones (9-point scale). Finally, (4) emotions were matched with respect to three linguistic features: concreteness, familiarity, and frequency of use ratings. Scores from males, from females, and averaged over all subjects were considered separately.
In sum, the MESD corpus included 48 words per emotion (24 from corpus A and 24 from corpus B); that is, 288 single words were used for further utterance by male, female, and child voices.
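For illustration only, the corpus-B selection can be expressed as a filter over the published MADS norms [10]. The Python sketch below is not the authors' code; the file name mads_norms.csv and the column names (age_of_acquisition, valence, arousal, and one column per discrete emotion) are hypothetical placeholders for whatever export of the norms is used.

```python
# Illustrative sketch (not the authors' code) of the corpus-B selection criteria
# applied to the MADS norms [10] with pandas. The file mads_norms.csv and the
# column names are hypothetical placeholders.
import pandas as pd

def select_words(norms: pd.DataFrame, emotion: str) -> pd.DataFrame:
    """Return candidate words for one emotion according to criteria (1)-(3)."""
    mask = norms["age_of_acquisition"] < 9                                 # criterion (1)
    if emotion == "neutral":
        mask &= norms["valence"].between(4, 6, inclusive="neither")        # criterion (3), neutral band
        mask &= norms["arousal"].between(4, 6, inclusive="neither")
    else:
        mask &= norms[emotion] > 2.5                                       # criterion (2), 5-point rating
        mask &= norms["valence"].between(1, 4) | norms["valence"].between(6, 9)  # criterion (3)
        mask &= norms["arousal"].between(1, 4) | norms["arousal"].between(6, 9)
    return norms[mask]

norms = pd.read_csv("mads_norms.csv")        # hypothetical export of the MADS norms
angry_candidates = select_words(norms, "anger")
```

Criterion (4), the matching of emotions on the linguistic features, is handled by the statistical procedure described next.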
To control frequency, familiarity, and concreteness ratings, R software (R Foundation for Statistical Computing, Vienna, Austria) was used to run a one-way ANOVA on each parameter separately with emotion as factor. Independence of residuals was assessed by the Durbin-Watson test. Normality and homogeneity of variance were assessed by the Shapiro-Wilk and Bartlett tests, respectively. When parametric assumptions were not met, the Kruskal-Wallis test was applied. Post-hoc tests were used to statistically assess specific differences (Tukey after ANOVA, Wilcoxon tests with p-value adjustment by the Holm method after Kruskal-Wallis). In case of significance, outlier values (i.e., ratings for frequency, familiarity, or concreteness outside the range defined by percentiles 2.5 and 97.5) were removed until non-significance was reached. The level of significance was set at p < 0.05.
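A minimal sketch of this matching loop, assuming the ratings for one linguistic parameter are grouped by emotion, is given below. The authors used R; SciPy stands in here, and the Durbin-Watson check and post-hoc tests are omitted for brevity.

```python
# Minimal sketch of the matching procedure (the authors used R): test for
# inter-emotion differences on one linguistic parameter and trim extreme
# ratings (outside the 2.5th-97.5th percentiles) until p > 0.05.
# The data layout is an assumption made for illustration.
import numpy as np
from scipy import stats

def match_across_emotions(ratings_per_emotion, alpha=0.05, max_iter=50):
    """ratings_per_emotion: dict emotion -> 1-D array of ratings
    (e.g., frequency of use) for the words selected for that emotion."""
    groups = {k: np.asarray(v, dtype=float) for k, v in ratings_per_emotion.items()}
    for _ in range(max_iter):
        samples = list(groups.values())
        normal = all(stats.shapiro(g).pvalue > alpha for g in samples)   # Shapiro-Wilk
        homogeneous = stats.bartlett(*samples).pvalue > alpha            # Bartlett
        if normal and homogeneous:
            p = stats.f_oneway(*samples).pvalue      # one-way ANOVA
        else:
            p = stats.kruskal(*samples).pvalue       # Kruskal-Wallis alternative
        if p > alpha:
            return groups                            # emotions are matched
        # Remove outliers outside the 2.5-97.5 percentile range of each group
        groups = {
            k: g[(g >= np.percentile(g, 2.5)) & (g <= np.percentile(g, 97.5))]
            for k, g in groups.items()
        }
    return groups
```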
B. Participants and Ethical Considerations
Participants were volunteers and non-professional actors: 4 adult males (mean age = 22.75, SD = 2.06), 4 adult females (mean age = 22.25, SD = 2.50), and 8 children (5 girls and 3 boys, mean age = 9.87, SD = 1.12). They were included in the study if they had grown up in Mexico in a Mexican cultural environment (Mexican academic education and family environments). Participants were excluded if they presented any pathology impairing emotional behavior, hearing, or speech, or any illness affecting voice timbre. No participant had lived outside Mexico for more than 2 weeks in the last 4 years. Written informed consent was obtained from all participants and from children's parents.
Recordings were conducted in accordance with the
Declaration of Helsinki and approved on July 14th, 2020 by the
Ethical Committee of the School of Medicine of Tecnologico
de Monterrey (register number within the National Committee
of Bioethics CONBIOETICA 19 CEI 011-2016-10-17) under
the following number: P000409-autismoEEG2020-CEIC-
CR002.
C. Material and Procedures for Voice Recording
Recordings were carried out in a professional recording studio. A Sennheiser e835 microphone with a flat frequency response (100 Hz to 10 kHz) was used, connected via an XLR cable to a Focusrite Scarlett 2i4 audio interface, which was in turn connected to a computer. Audio files were recorded in the digital audio workstation REAPER (Rapid Environment for Audio Production, Engineering, and Recording) and stored as 24-bit audio with a sampling rate of 48 kHz.
Adult sessions lasted 1 hour and child sessions 30 minutes. Each adult uttered words from corpora A and B (48 words per emotion). Four children uttered words from corpus A and four children uttered those from corpus B (24 words per emotion). The order of corpora was counterbalanced across adult sessions. Emotions were randomly distributed in both adult and child sessions. After familiarizing themselves with the word dataset, participants were required to utter each word with the corresponding intended emotional intonation: anger, disgust, fear, happiness, neutral, or sadness. Participants were asked to wait at least 5 seconds between two utterances in order to focus before each utterance.
III. EMOTION RECOGNITION
Before extracting acoustic features, each word was
excerpted from the continuous recording of each session to
generate an audio file for each individual word.
A. Acoustic Features Extraction and Data Normalization
Praat and Matlab R2019b were used to extract the features detailed in Table I. The Gaussian distribution of the resulting 30 acoustic features was assessed by the Shapiro-Wilk test. Considering the lack of normality, a min-max normalization was applied as described in (1):
x_normalized = (x − min_k) / (max_k − min_k)     (1)

where x is the feature value to be normalized, max_k is the highest value of acoustic feature vector k, and min_k is the lowest.
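The following sketch illustrates per-utterance feature extraction and the normalization of Eq. (1) in Python. The authors used Praat and Matlab R2019b; librosa stands in here for only a subset of the features in Table I (pitch, energy, MFCCs), while jitter, shimmer, harmonics-to-noise ratio, and formants would normally come from Praat and are omitted. The audio path is a placeholder.

```python
# Illustrative feature extraction and Eq. (1) normalization (not the authors'
# Praat/Matlab pipeline). Only a subset of Table I features is computed here.
import numpy as np
import librosa

def extract_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)                      # keep the original 48 kHz rate
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)     # fundamental frequency contour
    rms = librosa.feature.rms(y=y)                           # root mean square energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # Mel Frequency Cepstral Coefficients
    # Summarize the time-varying contours into per-utterance statistics
    return np.hstack([np.nanmean(f0), rms.mean(), mfcc.mean(axis=1)])

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Column-wise min-max normalization, Eq. (1): (x - min_k) / (max_k - min_k)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)
```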
B. Support Vector Machine (SVM) Predictive Model
Matlab R2019b was used to carry out a supervised learning analysis using a cubic SVM classifier. Hyper-parameters were adjusted to a box constraint level (soft-margin penalty) of 10. The multiclass method (one-vs-one or one-vs-all) and the kernel scale parameters were set to "auto", that is, the algorithm was automatically optimized for both parameters according to the dataset. 77% of the data was used for training and 23% for validation. A stratified train/test split cross-validation method was used and repeated 10 times. Data were therefore randomly split before each repetition so that each division (training and validation) contained an equal number of words per emotion. Specifically, training data included 222 observations (37 per emotion), and 66 observations (11 per emotion) were used for validation. Accuracy, recall, precision, and F-score were computed from the resulting confusion matrix [12].
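As an illustration of this evaluation protocol (not the authors' Matlab implementation), the sketch below uses scikit-learn: a degree-3 polynomial kernel approximates the cubic SVM, C = 10 matches the box constraint, and SVC's default one-vs-one scheme handles the multiclass problem. X and y are placeholders for the normalized feature matrix and the emotion labels.

```python
# Sketch of the repeated stratified 77%/23% evaluation with a cubic-kernel SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_svm(X, y, n_repeats=10, test_size=66 / 288):
    """Repeated stratified splits with a degree-3 polynomial SVM (C = 10)."""
    splitter = StratifiedShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=0)
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        clf = SVC(kernel="poly", degree=3, C=10, gamma="scale")   # cubic kernel, box constraint 10
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], y_pred)
        prec, rec, f1, _ = precision_recall_fscore_support(
            y[test_idx], y_pred, average="macro", zero_division=0)
        scores.append((acc, prec, rec, f1))
    return np.mean(scores, axis=0)   # mean accuracy, precision, recall, F-score
```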
C. Adult Voices
Male and female voices were considered as two separate
datasets. The input data for training were the 30 normalized
acoustic features extracted from each utterance after a
dimensionality reduction based on Principal Component
Analysis, explaining the 95% of the variance. A classification
TABLE I. EXTRACTED ACOUSTIC FEATURES FOR EMOTION RECOGNITION

Prosodic: fundamental frequency (pitch, Hz), speech rate, root mean square energy (V), intensity (dB)
Voice quality: jitter (%), shimmer (%), mean harmonics-to-noise ratio (dB)
Spectral: formants (Hz), Mel Frequency Cepstral Coefficients
A classification analysis was conducted on utterances from each actor independently. The final version of the MESD was created by selecting, for each emotion, the utterances from the actor leading to the highest F-score during validation. This process is described in Fig. 1. A classification analysis was then applied to the final 288-utterance dataset.
Figure 1. Process to select utterances for adult voices.
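A hedged sketch of this per-actor selection follows: features are reduced with PCA retaining 95% of the variance, a classifier is evaluated per actor (for instance with the routine sketched in Section III.B), and for each emotion the actor with the highest validation F-score is kept. The data structures and the train_and_predict callable are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the per-actor selection for adult voices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score

def best_actor_per_emotion(features_by_actor, labels_by_actor, train_and_predict):
    """features_by_actor: dict actor_id -> (n_utterances, n_features) array.
    labels_by_actor: dict actor_id -> emotion label per utterance.
    train_and_predict: callable (X, y) -> (y_true, y_pred) on validation data."""
    best = {}                                             # emotion -> (F-score, actor_id)
    for actor, X in features_by_actor.items():
        X_red = PCA(n_components=0.95).fit_transform(X)   # retain 95% of the variance
        y_true, y_pred = train_and_predict(X_red, labels_by_actor[actor])
        for emotion in np.unique(y_true):
            f1 = f1_score(y_true, y_pred, labels=[emotion], average="macro", zero_division=0)
            if f1 > best.get(emotion, (-1.0, None))[0]:
                best[emotion] = (f1, actor)
    # For each emotion, the selected actor's utterances enter the final MESD
    return {emotion: actor for emotion, (f1, actor) in best.items()}
```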
D. Child Voices
A k-means clustering analysis was applied to the features extracted for each emotion separately (24 observations per participant, leading to 6 datasets of 192 observations). This approach allowed the identification of the most representative combinations of utterances from actors who uttered corpus A with those who uttered corpus B. Namely, it helped to select the most relevant sets of 48 utterances per emotion that would be used as input for the subsequent SVM-based classification. The squared Euclidean distance metric and the k-means++ algorithm for cluster center initialization were used. The optimal number of clusters was determined by computing silhouette scores. The number of clusters that led to the highest average silhouette score was selected, namely, 2 clusters.
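This clustering step can be sketched as follows with scikit-learn (again, not the authors' Matlab code): k-means with k-means++ initialization and the squared Euclidean objective, with the number of clusters chosen by the highest average silhouette score. X_emotion is a placeholder for one emotion's 192 x n_features matrix (8 children x 24 words).

```python
# Sketch of k-means with silhouette-based selection of the number of clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_one_emotion(X_emotion, candidate_k=range(2, 8)):
    best_k, best_score, best_labels = None, -1.0, None
    for k in candidate_k:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
        labels = km.fit_predict(X_emotion)            # squared Euclidean objective by default
        score = silhouette_score(X_emotion, labels)   # mean silhouette over all utterances
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels                        # the paper reports k = 2 as optimal
```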
In each cluster, utterances of words coming from corpus A (4 participants) and corpus B (4 participants) were considered separately. For utterances from both corpora, the number of observations for each individual participant in the cluster was computed. Pairs of participants (one who uttered words from corpus A and one who uttered words from corpus B) were formed in each cluster by selecting, for each corpus, the participant with the highest number of observations. As a result, each pair of participants contributed 288 utterances (48 per emotion, including 24 words from each corpus).
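A small illustrative sketch of this pairing step follows, under the assumption that each utterance carries a cluster label, a child identifier, and the corpus (A or B) that child uttered; all names are hypothetical.

```python
# Illustrative pairing of corpus-A and corpus-B children by cluster membership counts.
from collections import Counter

def pair_participants(cluster_labels, child_ids, corpus_of):
    """cluster_labels, child_ids: one entry per utterance.
    corpus_of: dict child_id -> 'A' or 'B'."""
    pairs = []
    for cluster in sorted(set(cluster_labels)):
        counts = Counter(cid for lab, cid in zip(cluster_labels, child_ids) if lab == cluster)
        # Within the cluster, take the corpus-A child and the corpus-B child
        # with the most observations.
        best_a = max((c for c in counts if corpus_of[c] == "A"), key=counts.get, default=None)
        best_b = max((c for c in counts if corpus_of[c] == "B"), key=counts.get, default=None)
        if best_a is not None and best_b is not None:
            pairs.append((best_a, best_b))
    return pairs   # each pair contributes 48 utterances per emotion (24 per corpus)
```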
Then, a classification analysis was carried out on data from each resulting pair. The input data for training were the 30 normalized acoustic features extracted from each utterance, after a dimensionality reduction based on Principal Component Analysis that explained 95% of the variance. The final version of the MESD was created by selecting, for each emotion, the utterances from the pair leading to the highest F-score during validation. The resulting set of 288 utterances was used to evaluate the accuracy and F-score for emotion recognition on the final version of the MESD.
IV. RESULTS
The MESD is freely available at:
http://dx.doi.org/10.17632/cy34mh68j9.1
A. MESD Word Corpus: Corpus B
No inter-emotion difference was found for frequency of use, familiarity, or concreteness ratings after outliers were removed. Namely, for words used for child, male, and female utterances, statistical analysis yielded non-significant p-values (p > 0.05) for each parameter.
B. Female Adult Voice
Fig. 2 presents the accuracies and F-scores reached for
each emotion and their mean values. It is important to note that
the most representative female participants for each emotion
resulting from the single-actor classification were: (1)
participant 2 for anger, (2) participant 2 for disgust, (3)
participant 1 for fear, (4) participant 6 for happiness, (5)
participant 2 for neutral, and (6) participant 1 for sadness.
Figure 2. SVM classifier outcome: accuracy and F-score on female voices.
C. Male Adult Voice
Fig. 3 presents the accuracies and F-scores reached for
each emotion and their mean values.
Figure 3. SVM classifier outcome: accuracy and F-score on male voices.
The most representative male participants for each emotion
resulting from the single-actor classification were: (1)
participant 3 for anger, (2) participant 12 for disgust, (3)
participant 3 for fear, (4) participant 3 for happiness, (5)
participant 3 for neutral, and (6) participant 12 for sadness.
D. Child Voice
The most representative pairs of child participants for each emotion resulting from the single-pair classification were: (1) participants 16 and 5 for anger, (2) participants 9 and 15 for disgust, (3) participants 16 and 5 for fear, (4) participants 17 and 15 for happiness, (5) participants 16 and 7 for neutral, and (6) participants 16 and 5 for sadness. Fig. 4 presents the accuracies and F-scores reached for each emotion and their mean values.
Figure 4. SVM classifier outcome: accuracy and F-score on child voices.
V. DISCUSSION
The MESD contributes to the production of reliable emotional speech data available for healthcare analytics. To date, linguistic affective stimuli adapted to Mexican Spanish are very scarce [13]. Besides, very few current databases target child voices [14]. The current database presents several advantages: (1) a word corpus controlled for emotional semantic and linguistic parameters is provided [10]; and (2) the MESD includes single-word utterances that, contrary to sentences, do not embed variations of emotional information throughout the utterance [15]. Furthermore, the cognitive processing of emotional words does not involve the prediction, integration, and syntactic unification processes that may interfere with the understanding of emotional information [16]. Concreteness, familiarity, and frequency of words from corpus B were controlled to attenuate trade-off effects between linguistic and emotional processing when using the MESD as stimuli for emotional perception. Nevertheless, the word recurrence that characterizes utterances of nouns and adjectives from corpus A may be appropriate for comparisons across prosodies and voice categories based on analyses sensitive to phonetic content. Finally, for words from both corpus A and corpus B, the MESD provides 24 utterances from a single speaker for each emotional prosody, which guarantees the homogeneity of speaker perception within emotional intonation patterns. In conclusion, the MESD is a reliable source of emotional utterances that can be applied to (1) big data for smart healthcare, (2) the characterization of normal and pathological emotional prosody processing and expression, and (3) the exploration of normal or pathological processing and expression of acoustic linguistic information.
ACKNOWLEDGMENT
We thank the “Instituto Estatal de la Juventud (INJUVE)” and Norberto E. Naal-Ruiz (https://orcid.org/0000-0002-1203-8925). We acknowledge the Evaluation and Language resources Distribution Agency (ELDA) S.A.S. for sharing the “Emotional speech synthesis database, ELRA catalogue (http://catalog.elra.info), ISLRN: 477-238-467-792-9, ELRA ID: ELRA-S0329”.
REFERENCES
[1] S. Shinohara et al., “Evaluation of the Severity of Major Depression Using a Voice Index for Emotional Arousal,” Sensors, vol. 20, no. 18, p. 5041, Sep. 2020, doi: 10.3390/s20185041.
[2] D. J. Hubbard, D. J. Faso, P. F. Assmann, and N. J. Sasson, “Production and perception of emotional prosody by adults with autism spectrum disorder: Affective prosody in ASD,” Autism Res., vol. 10, no. 12, pp. 1991–2001, Dec. 2017, doi: 10.1002/aur.1847.
[3] D. Chakraborty et al., “Prediction of Negative Symptoms of Schizophrenia from Emotion Related Low-Level Speech Signals,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Apr. 2018, pp. 6024–6028, doi: 10.1109/ICASSP.2018.8462102.
[4] M. Alhussein and G. Muhammad, “Automatic Voice Pathology Monitoring Using Parallel Deep Models for Smart Healthcare,” IEEE Access, vol. 7, pp. 46474–46479, 2019, doi: 10.1109/ACCESS.2019.2905597.
[5] M. S. Hossain and G. Muhammad, “Healthcare Big Data Voice Pathology Assessment Framework,” IEEE Access, vol. 4, p. 10, 2016.
[6] J. S. Mulcahy, M. Davies, L. Quadt, H. D. Critchley, and S. N. Garfinkel, “Interoceptive awareness mitigates deficits in emotional prosody recognition in Autism,” Biol. Psychol., vol. 146, p. 107711, Sep. 2019, doi: 10.1016/j.biopsycho.2019.05.011.
[7] R. Lindström et al., “Atypical perceptual and neural processing of emotional prosodic changes in children with autism spectrum disorders,” Clin. Neurophysiol., vol. 129, no. 11, pp. 2411–2420, Nov. 2018, doi: 10.1016/j.clinph.2018.08.018.
[8] P. Laukka et al., “The expression and recognition of emotions in the voice across five nations: A lens model analysis based on acoustic features,” J. Pers. Soc. Psychol., vol. 111, no. 5, pp. 686–705, Nov. 2016, doi: 10.1037/pspi0000066.
[9] V. Hozjan, Z. Kacic, A. Moreno, A. Bonafonte, and A. Nogueiras, “Interface databases: Design and collection of a multilingual emotional speech database,” in Proc. 3rd Int. Conf. Lang. Resour. Eval. (LREC 2002), 2002, pp. 2024–2028.
[10] J. A. Hinojosa et al., “Affective norms of 875 Spanish words for five discrete emotional categories and two emotional dimensions,” Behav. Res. Methods, vol. 48, no. 1, pp. 272–284, Mar. 2016, doi: 10.3758/s13428-015-0572-5.
[11] Z.-T. Liu, Q. Xie, M. Wu, W.-H. Cao, Y. Mei, and J.-W. Mao, “Speech emotion recognition based on an improved brain emotion learning model,” Neurocomputing, vol. 309, pp. 145–156, Oct. 2018, doi: 10.1016/j.neucom.2018.05.005.
[12] A. Tharwat, “Classification assessment methods,” Appl. Comput. Inform., vol. ahead-of-print, no. ahead-of-print, Aug. 2020, doi: 10.1016/j.aci.2018.08.003.
[13] S.-O. Caballero-Morales, “Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels,” Sci. World J., vol. 2013, pp. 1–13, 2013, doi: 10.1155/2013/162093.
[14] H. Pérez-Espinosa, J. Martínez-Miranda, I. Espinosa-Curiel, J. Rodríguez-Jacobo, L. Villaseñor-Pineda, and H. Avila-George, “IESC-Child: An Interactive Emotional Children’s Speech Corpus,” Comput. Speech Lang., vol. 59, pp. 55–74, Jan. 2020, doi: 10.1016/j.csl.2019.06.006.
[15] K. Hammerschmidt and U. Jürgens, “Acoustical Correlates of Affective Prosody,” J. Voice, vol. 21, no. 5, pp. 531–540, Sep. 2007, doi: 10.1016/j.jvoice.2006.03.002.
[16] J. A. Hinojosa, E. M. Moreno, and P. Ferré, “Affective neurolinguistics: towards a framework for reconciling language and emotion,” Lang. Cogn. Neurosci., vol. 35, no. 7, pp. 813–839, Sep. 2020, doi: 10.1080/23273798.2019.1620957.
... "***" p < 0.001, NS: Non significance. [48][49][50] , available in Mendeley Data at http:// doi. org/ 10. 17632/ cy34m h68j9.5. ...
Article
Full-text available
Emotional content is particularly salient, but situational factors such as cognitive load may disturb the attentional prioritization towards affective stimuli and interfere with their processing. In this study, 31 autistic and 31 typically developed children volunteered to assess their perception of affective prosodies via event-related spectral perturbations of neuronal oscillations recorded by electroencephalography under attentional load modulations induced by Multiple Object Tracking or neutral images. Although intermediate load optimized emotion processing by typically developed children, load and emotion did not interplay in children with autism. Results also outlined impaired emotional integration emphasized in theta, alpha and beta oscillations at early and late stages, and lower attentional ability indexed by the tracking capacity. Furthermore, both tracking capacity and neuronal patterns of emotion perception during task were predicted by daily-life autistic behaviors. These findings highlight that intermediate load may encourage emotion processing in typically developed children. However, autism aligns with impaired affective processing and selective attention, both insensitive to load modulations. Results were discussed within a Bayesian perspective that suggests atypical updating in precision between sensations and hidden states, towards poor contextual evaluations. For the first time, implicit emotion perception assessed by neuronal markers was integrated with environmental demands to characterize autism.
Preprint
Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, making comparing different models and methods difficult. 2) No commonly used benchmark covers numerous corpus and languages for researchers to refer to, making reproduction a burden. In this paper, we propose EmoBox, an out-of-the-box multilingual multi-corpus speech emotion recognition toolkit, along with a benchmark for both intra-corpus and cross-corpus settings. For intra-corpus settings, we carefully designed the data partitioning for different datasets. For cross-corpus settings, we employ a foundation SER model, emotion2vec, to mitigate annotation errors and obtain a test set that is fully balanced in speakers and emotions distributions. Based on EmoBox, we present the intra-corpus SER results of 10 pre-trained speech models on 32 emotion datasets with 14 languages, and the cross-corpus SER results on 4 datasets with the fully balanced test sets. To the best of our knowledge, this is the largest SER benchmark, across language scopes and quantity scales. We hope that our toolkit and benchmark can facilitate the research of SER in the community.
Article
Full-text available
Background Socio-emotional impairments are among the diagnostic criteria for autism spectrum disorder (ASD), but the actual knowledge has substantiated both altered and intact emotional prosodies recognition. Here, a Bayesian framework of perception is considered suggesting that the oversampling of sensory evidence would impair perception within highly variable environments. However, reliable hierarchical structures for spectral and temporal cues would foster emotion discrimination by autistics. Methods Event-related spectral perturbations (ERSP) extracted from electroencephalographic (EEG) data indexed the perception of anger, disgust, fear, happiness, neutral, and sadness prosodies while listening to speech uttered by (a) human or (b) synthesized voices characterized by reduced volatility and variability of acoustic environments. The assessment of mechanisms for perception was extended to the visual domain by analyzing the behavioral accuracy within a non-social task in which dynamics of precision weighting between bottom-up evidence and top-down inferences were emphasized. Eighty children (mean 9.7 years old; standard deviation 1.8) volunteered including 40 autistics. The symptomatology was assessed at the time of the study via the Autism Diagnostic Observation Schedule, Second Edition, and parents’ responses on the Autism Spectrum Rating Scales. A mixed within-between analysis of variance was conducted to assess the effects of group (autism versus typical development), voice, emotions, and interaction between factors. A Bayesian analysis was implemented to quantify the evidence in favor of the null hypothesis in case of non-significance. Post hoc comparisons were corrected for multiple testing. Results Autistic children presented impaired emotion differentiation while listening to speech uttered by human voices, which was improved when the acoustic volatility and variability of voices were reduced. Divergent neural patterns were observed from neurotypicals to autistics, emphasizing different mechanisms for perception. Accordingly, behavioral measurements on the visual task were consistent with the over-precision ascribed to the environmental variability (sensory processing) that weakened performance. Unlike autistic children, neurotypicals could differentiate emotions induced by all voices. Conclusions This study outlines behavioral and neurophysiological mechanisms that underpin responses to sensory variability. Neurobiological insights into the processing of emotional prosodies emphasized the potential of acoustically modified emotional prosodies to improve emotion differentiation by autistics. Trial registration BioMed Central ISRCTN Registry, ISRCTN18117434. Registered on September 20, 2020.
Article
Full-text available
The relevance of affective information triggers cognitive prioritisation, dictated by both the attentional load of the relevant task, and socio-emotional abilities. This dataset provides electroencephalographic (EEG) signals related to implicit emotional speech perception under low, intermediate, and high attentional demands. Demographic and behavioural data are also provided. Specific social-emotional reciprocity and verbal communication characterise Autism Spectrum Disorder (ASD) and may influence the processing of affective prosodies. Therefore, 62 children and their parents or legal guardians participated in data collection, including 31 children with high autistic traits (x̄age=9.6-year-old, σage=1.5) who previously received a diagnosis of ASD by a medical specialist, and 31 typically developed children (x̄age=10.2-year-old, σage=1.2). Assessments of the scope of autistic behaviours using the Autism Spectrum Rating Scales (ASRS, parent report) are provided for every child. During the experiment, children listened to task-irrelevant affective prosodies (anger, disgust, fear, happiness, neutral and sadness) while answering three visual tasks: neutral image viewing (low attentional load), one-target 4-disc Multiple Object Tracking (MOT; intermediate), one-target 8-disc MOT (high). The EEG data recorded during all three tasks and the tracking capacity (behavioural data) from MOT conditions are included in the dataset. Particularly, the tracking capacity was computed as a standardised index of attentional abilities during MOT, corrected for guessing. Beforehand, children answered the Edinburgh Handedness Inventory, and resting-state EEG activity of children was recorded for 2 minutes with eyes open. Those data are also provided. The present dataset can be used to investigate the electrophysiological correlates of implicit emotion and speech perceptions and their interaction with attentional load and autistic traits. Besides, resting-state EEG data may be used to characterise inter-individual heterogeneity at rest and, in turn, associate it with attentional capacities during MOT and with autistic behavioural patterns. Finally, tracking capacity may be useful to explore dynamic and selective attentional mechanisms under emotional constraints.
Article
Full-text available
Artificial voices are nowadays embedded into our daily lives with latest neural voices approaching human voice consistency (naturalness). Nevertheless, behavioral, and neuronal correlates of the perception of less naturalistic emotional prosodies are still misunderstood. In this study, we explored the acoustic tendencies that define naturalness from human to synthesized voices. Then, we created naturalness-reduced emotional utterances by acoustic editions of human voices. Finally, we used Event-Related Potentials (ERP) to assess the time dynamics of emotional integration when listening to both human and synthesized voices in a healthy adult sample. Additionally, listeners rated their perceptions for valence, arousal, discrete emotions, naturalness, and intelligibility. Synthesized voices were characterized by less lexical stress (i.e., reduced difference between stressed and unstressed syllables within words) as regards duration and median pitch modulations. Besides, spectral content was attenuated toward lower F2 and F3 frequencies and lower intensities for harmonics 1 and 4. Both psychometric and neuronal correlates were sensitive to naturalness reduction. (1) Naturalness and intelligibility ratings dropped with emotional utterances synthetization, (2) Discrete emotion recognition was impaired as naturalness declined, consistent with P200 and Late Positive Potentials (LPP) being less sensitive to emotional differentiation at lower naturalness, and (3) Relative P200 and LPP amplitudes between prosodies were modulated by synthetization. Nevertheless, (4) Valence and arousal perceptions were preserved at lower naturalness, (5) Valence (arousal) ratings correlated negatively (positively) with Higuchi's fractal dimension extracted on neuronal data under all naturalness perturbations, (6) Inter-Trial Phase Coherence (ITPC) and standard deviation measurements revealed high inter-individual heterogeneity for emotion perception that is still preserved as naturalness reduces. Notably, partial between-participant synchrony (low ITPC), along with high amplitude dispersion on ERPs at both early and late stages emphasized miscellaneous emotional responses among subjects. In this study, we highlighted for the first time both behavioral and neuronal basis of emotional perception under acoustic naturalness alterations. Partial dependencies between ecological relevance and emotion understanding outlined the modulation but not the annihilation of emotional integration by synthetization.
Article
Full-text available
Recently, the relationship between emotional arousal and depression has been studied. Focusing on this relationship, we first developed an arousal level voice index (ALVI) to measure arousal levels using the Interactive Emotional Dyadic Motion Capture database. Then, we calculated ALVI from the voices of depressed patients from two hospitals (Ginza Taimei Clinic (H1) and National Defense Medical College hospital (H2)) and compared them with the severity of depression as measured by the Hamilton Rating Scale for Depression (HAM-D). Depending on the HAM-D score, the datasets were classified into a no depression (HAM-D < 8) and a depression group (HAM-D ≥ 8) for each hospital. A comparison of the mean ALVI between the groups was performed using the Wilcoxon rank-sum test and a significant difference at the level of 10% (p = 0.094) at H1 and 1% (p = 0.0038) at H2 was determined. The area under the curve (AUC) of the receiver operating characteristic was 0.66 when categorizing between the two groups for H1, and the AUC for H2 was 0.70. The relationship between arousal level and depression severity was indirectly suggested via the ALVI.
Article
Full-text available
Standard neurocognitive models of language processing have tended to obviate the need for incorporating emotion processes, while affective neuroscience theories have typically been concerned with the way in which people communicate their emotions, and have often simply not addressed linguistic issues. Here, we summarise evidence from temporal and spatial brain imaging studies that have investigated emotion effects on lexical, semantic and morphosyntactic aspects of language during the comprehension of single words and sentences. The evidence reviewed suggests that emotion is represented in the brain as a set of semantic features in a distributed sensory, motor, language and affective network. Also, emotion interacts with a number of lexical, semantic and syntactic features in different brain regions and timings. This is in line with the proposals of interactive neurocognitive models of language processing, which assume the interplay between different representational levels during on-line language comprehension.
Article
Full-text available
Recent advancements in wireless communication and machine learning technologies aid in the development of an accurate and affordable healthcare facility. In this paper, we propose a smart healthcare framework in a mobile platform using deep learning. In the framework, a smart phone records a voice signal of a client and sends it to a cloud server. The cloud server processes the signal and classifies it as normal or pathological using a parallel convolutional neural network model. The decision on the signal is then transferred to the doctor for prescription. Two publicly available databases were used in the experiments, where voice samples were played in front of a smart phone. Experimental results show the suitability of the proposed framework in the healthcare framework.
Presentation
Full-text available
This paper introduces a detailed explanation with numerical examples many classification assessment methods or classification measures such as: Accuracy, sensitivity, specificity, ROC curve, Precision-Recall curve, AUC score and many other metrics. In this paper, many details about the ROC curve, PR curve, and Detection Error Trade-off (DET) curve. Moreover, many details about some measures which are suitable for imbalanced data are explained. Your comments are highly appreciated. The link to the original paper is : https://www.sciencedirect.com/science/article/pii/S2210832718301546
Article
Full-text available
Classification techniques have been applied to many applications in various fields of sciences. There are several ways of evaluating classification algorithms. The analysis of such metrics and its significance must be interpreted correctly for evaluating different learning algorithms. Most of these measures are scalar metrics and some of them are graphical methods. This paper introduces a detailed overview of the classification assessment measures with the aim of providing the basics of these measures and to show how it works to serve as a comprehensive source for researchers who are interested in this field. This overview starts by highlighting the definition of the confusion matrix in binary and multi-class classification problems. Many classification measures are also explained in details, and the influence of balanced and imbalanced data on each metric is presented. An illustrative example is introduced to show (1) how to calculate these measures in binary and multi-class classification problems, and (2) the robustness of some measures against balanced and imbalanced data. Moreover, some graphical measures such as Receiver operating characteristics (ROC), Precision-Recall, and Detection error trade-off (DET) curves are presented with details. Additionally, in a step-by-step approach, different numerical examples are demonstrated to explain the preprocessing steps of plotting ROC, PR, and DET curves.
Article
In this paper, we describe the process that we used to create a new corpus of children’s emotional speech. We used a Wizard of Oz (WoZ) setting to induce different emotional reactions in children during speech-based interactions with two robots. We recorded the speech spoken in Mexican Spanish by 174 children (both sexes) between 6 and 11 years of age. The recordings were manually segmented and transcribed. The segments were then labeled with two types of emotional-related paralinguistic information: emotion and attitude. The corpus contained 2093 min of audio recordings (34.88 h) divided into 19,793 speech segments. The Interactive Emotional Children’s Speech Corpus (IESC-Child) can be a valuable resource for researchers studying affective reactions in speech communication during child-computer interactions in Spanish and for creating models to recognize acoustic paralinguistic information. IESC-Child is available to the research community upon request.
Article
The sensing of internal bodily signals, a process known as interoception, contributes to subjective emotional feeling states that can guide empathic understanding of the emotions of others. Individuals with Autism Spectrum Conditions (ASC) typically show an attenuated intuitive capacity to recognise and interpret other peoples' emotional signals. Here we test directly if differences in interoceptive processing relate to the ability to perceive emotional signals from the intonation of speech (affective prosody) in ASC adults. We employed a novel prosody paradigm to compare emotional prosody recognition in ASC individuals and a group of neurotypical controls. Then, in a larger group of ASC individuals, we tested how recognition of affective prosody related to objective, subjective and metacognitive (awareness) psychological dimensions of interoception. ASC individuals showed reduced recognition of affective prosody compared to controls. Deficits in performance on the prosody task were mitigated by greater interoceptive awareness, so that ASC individuals were better able to judge the prosodic emotion if they had better insight into their own interoceptive abilities. This data links the ability to access interoceptive representations consciously to the recognition of emotional expression in others, suggesting a crossmodal target for interventions to enhance interpersonal skills.
Article
Objective: The present study explored the processing of emotional speech prosody in school-aged children with autism spectrum disorders (ASD) but without marked language impairments (children with ASD [no LI]). Methods: The mismatch negativity (MMN)/the late discriminative negativity (LDN), reflecting pre-attentive auditory discrimination processes, and the P3a, indexing involuntary orienting to attention-catching changes, were recorded to natural word stimuli uttered with different emotional connotations (neutral, sad, scornful and commanding). Perceptual prosody discrimination was addressed with a behavioral sound-discrimination test. Results: Overall, children with ASD (no LI) were slower in behaviorally discriminating prosodic features of speech stimuli than typically developed control children. Further, smaller standard-stimulus event related potentials (ERPs) and MMN/LDNs were found in children with ASD (no LI) than in controls. In addition, the amplitude of the P3a was diminished and differentially distributed on the scalp in children with ASD (no LI) than in control children. Conclusions: Processing of words and changes in emotional speech prosody is impaired at various levels of information processing in school-aged children with ASD (no LI). Significance: The results suggest that low-level speech sound discrimination and orienting deficits might contribute to emotional speech prosody processing impairments observed in ASD.