SPEECH EMOTION RECOGNITION USING SEMANTIC INFORMATION
Panagiotis Tzirakis1, Anh Nguyen1, Stefanos Zafeiriou1, Björn W. Schuller1,2
1GLAM – Group on Language, Audio, & Music, Imperial College London, UK
2EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
email: panagiotis.tzirakis12@imperial.ac.uk
ABSTRACT
Speech emotion recognition is a crucial problem manifesting in a
multitude of applications such as human computer interaction and
education. Although several advancements have been made in recent years, especially with the advent of Deep Neural Networks
(DNN), most of the studies in the literature fail to consider the se-
mantic information in the speech signal. In this paper, we propose
a novel framework that can capture both the semantic and the par-
alinguistic information in the signal. In particular, our framework is
comprised of a semantic feature extractor, that captures the semantic
information, and a paralinguistic feature extractor, that captures the
paralinguistic information. Both semantic and paralinguistic features
are then combined into a unified representation using a novel attention
mechanism. The unified feature vector is passed through an LSTM
to capture the temporal dynamics in the signal, before the final pre-
diction. To validate the effectiveness of our framework, we use the
popular SEWA dataset of the AVEC challenge series and compare
with the three winning papers. Our model provides state-of-the-art
results in the valence and liking dimensions. 1
Index Terms— emotion recognition, deep learning, semantic, paralinguistic, audiotextual information
1. INTRODUCTION
Automatic affect recognition is a vital component in human-to-
human communication, affecting, among others, our social interaction and perception [1]. In order to accomplish a natural interaction between human and machine, intelligent systems need to recognise the emotional state of individuals. However, the task is challenging, as human emotions lack temporal boundaries and different
individuals express emotions in different ways [2]. In addition,
emotions are expressed through multiple modalities. Over the past
two decades, a plethora of systems have been proposed that utilise
several modalities such as physiological signals, facial expression,
speech, and text [3, 4, 5, 6, 7]. To achieve an accurate emotion
recognition system, it is important to consider multiple modalities,
as complementary information exists among them [3].
Current studies exploit Deep Neural Networks (DNNs) to model
affect using multiple modalities [8, 9, 10]. Two modalities that have
been extensively used for the emotion recognition task are speech
and text [11, 12]. Whereas the speech signal provides low-level
characteristics of the emotions (e. g., prosody), text provides high-
level (semantic) information (e. g., the words “love” and “like” carry
strong emotional content). To this end, several systems have shown
1Code available here: https://github.com/glam-imperial/
semantic_speech_emotion_recognition
that by integrating both modalities, strong performance gains can be
obtained [11].
However, one may argue that the textual information is redun-
dant, as it is already included in the speech signal, and as such se-
mantic information can be captured using only the speech modality.
To this end, we propose an audiotextual training framework, where
the text modality is used during training, but discarded during eval-
uation. In particular, we train Word2Vec [13] and Speech2Vec [14]
models, and align their two embedding spaces such that Speech2Vec
features are as close as possible with the Word2Vec ones [15]. In ad-
dition to the semantic information, we capture low-level character-
istics of the speech signal by training a convolution recurrent neural
network. The semantic and paralinguistic features are combined to a
unified representation and passed through a long short-term memory
(LSTM) module that captures the temporal dynamics in the signal,
before the final prediction.
To test the effectiveness of our model, we utilise the Sentiment
Analysis in the Wild (SEWA) dataset, which has been used in the Audio/Visual Emotion Challenge (AVEC) series since 2017 [16]. The dataset
provides three continuous affect dimensions: arousal, valence, and
likability. Although the arousal and valence dimensions are eas-
ily integrated in a single network during the training phase of the
models, the likability dimension can cause convergence and gen-
eralisation difficulties [16, 17]. To this end, we propose to use a
novel ‘disentangled’ attention mechanism to fuse the semantic and
paralinguistic features such that the information required per affect
dimension is disentangled. Our approach provides training stabil-
ity, and, at the same time, increases the generalisability of the net-
work during evaluation. We compare our framework with the three
best performing papers of the competition [18, 19, 17] in terms of
concordance correlation coefficient (ρc) [20, 21], and show that our
method provides state-of-the-art results for the valence and likability
dimensions.
In summary, the main contributions of the paper are the follow-
ing: (a) propose to use the acoustic speech signal to capture semantic
information that exists in the text modality, (b) show how to disentangle the information in the network per affect dimension for stable
training and generalisability during the evaluation phase, and (c) pro-
duce state-of-the-art results in the valence and likability dimensions
using the SEWA dataset.
2. RELATED WORK
Several studies have been proposed in the literature for speech emo-
tion recognition [22, 23, 24]. For example, Trigeorgis et al. [22]
utilised a convolutional neural network to capture the spatial information in the signal, and a recurrent neural network for the temporal ones.
Fig. 1. Our proposed model is comprised of two networks: (a) the semantic feature extractor, that extracts high-level features containing semantic information of the input, and (b) the paralinguistic feature extractor, that extracts low-level features containing paralinguistic information of the signal. Both feature vectors are passed through a fusion layer, that combines the information and extracts a unified representation of the input, that is then passed through an LSTM model for the final prediction.
In a similar study, Tzirakis et al. [23] showed that utilising
a deeper architecture with a longer input window produces better results. In another study, Neumann et al. [25] proposed an attentive
convolutional neural network (ACNN) that combines CNNs with at-
tention.
In the past ten years, a plethora of models have been proposed
that incorporate more than one modality for the emotion recognition
task [8, 26, 9]. In particular, Tzirakis et al. [8] use both audio and
visual information for continuous emotion recognition. Although
this study produced good results, it utilises both modalities for the
training and evaluation of the model. In a more recent study, Al-
banie et al. [9] transfer the knowledge from the visual information
(facial expressions) to the speech model. In another study, Han et
al. [26] proposed an implicit fusion strategy for audiovisual emotion
recognition. In this study, both the audio and visual modalities are used for training the model, but only one of them for its evaluation.
3. PROPOSED METHOD
Our cross-modal framework can leverage the semantic (high-level)
information (Sec. 3.1) and the paralinguistic (low-level) dynamics
in the speech signal (Sec. 3.2). The low- and high-level feature sets
are fused together using a novel attention fusion strategy (Sec. 3.3)
before being fed to a one-layer LSTM module that captures the temporal dynamics in the signal for the final frame-level prediction.
Fig. 1 depicts the proposed method.
3.1. Semantic Feature Extractor
To capture the semantic information in the speech signal, we train
Word2Vec and Speech2Vec models. The first model uses the text
information to extract a semantic vector representation from a given word, whereas the second one uses the speech signal. We align their embedding spaces, similar to [15], to obtain semantically richer speech representations. Mathematically, we define the speech embedding matrix $S = [s_1, s_2, \ldots, s_m] \in \mathbb{R}^{m \times d_s}$ to contain $m$ vocabulary words of dimension $d_s$, and the text embedding matrix $T = [t_1, t_2, \ldots, t_n] \in \mathbb{R}^{n \times d_t}$ to contain $n$ vocabulary words of dimension $d_t$. Our goal is to learn a linear mapping $W \in \mathbb{R}^{d_t \times d_s}$ such that $WS$ is most similar to $T$.
To this end, we learn an initial proxy of $W$ via domain-adversarial training. The adversarial training is a two-player game in which the generator tries, by computing $W$, to deceive the discriminator from correctly identifying the embedding space, making $WS$ and $T$ as similar as possible. Mathematically, the discriminator tries to minimise the following objective:
$$\mathcal{L}_D(\theta_D \mid W) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{speech} = 1 \mid W s_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{speech} = 0 \mid t_i), \quad (1)$$
where $\theta_D$ are the parameters of the discriminator, and $P_{\theta_D}(\text{speech} = 1 \mid z)$ is the probability that the vector $z$ originates from the speech embedding space.
On the other hand, the generator tries to minimise the following
objective:
$$\mathcal{L}_G(W \mid \theta_D) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{speech} = 0 \mid W s_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{speech} = 1 \mid t_i). \quad (2)$$
A limitation of the above formulation is that all embedding vec-
tors are treated equally during training. However, words with higher
frequency would have better embedding quality in the vector space
than less frequent words. To this end, we use the frequent words to
create a dictionary that specifies which speech embedding vectors
correspond to which text embedding vectors, and refine W:
$$W = \underset{W}{\operatorname{argmin}} \; \lVert W S_r - T_r \rVert_F, \quad (3)$$
where $S_r$ is a matrix built from $k$ speech vectors of $S$ and $T_r$ is a matrix built from $k$ vectors of $T$. The solution of Eq. 3 is obtained from the singular value decomposition of $S_r T_r^T$, i.e., $\mathrm{SVD}(S_r T_r^T) = U \Sigma V^T$.
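The closed-form refinement of Eq. 3 can be sketched in a few lines of NumPy, assuming the $k$ dictionary-aligned embedding pairs have already been selected and that embeddings are stored as rows (the exact transpose convention is an assumption):

```python
# Minimal sketch of the SVD-based refinement of Eq. 3 (orthogonal Procrustes),
# assuming embeddings are stored as rows and k dictionary pairs are given.
import numpy as np

def refine_mapping(S_r: np.ndarray, T_r: np.ndarray) -> np.ndarray:
    """S_r: (k, d_s) speech embeddings, T_r: (k, d_t) text embeddings of the same k words.
    Returns W of shape (d_t, d_s) such that S_r @ W.T best matches T_r in the Frobenius norm."""
    U, _, Vt = np.linalg.svd(T_r.T @ S_r)  # SVD of the cross-covariance of the two embedding sets
    return U @ Vt                          # orthogonal mapping from the speech to the text space

# Toy usage with random embeddings (k = 1000, d_s = d_t = 300).
S_r, T_r = np.random.randn(1000, 300), np.random.randn(1000, 300)
W = refine_mapping(S_r, T_r)
aligned_speech = S_r @ W.T   # Speech2Vec embeddings mapped into the Word2Vec space
```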
Layer         Kernel/Stride   Channels   Activation
Convolution   8 / 1           50         ReLU
Max-pooling   10 / 10         —          —
Convolution   6 / 1           125        ReLU
Max-pooling   5 / 5           —          —
Convolution   6 / 1           125        ReLU
Max-pooling   5 / 5           —          —

Table 1. Paralinguistic feature extractor. Shown are the layer type, kernel/stride size, channel size, and activation function.
3.2. Paralinguistic Feature Extractor
Our paralinguistic feature extraction network is comprised of three
1-D CNN layers with a rectified linear unit (ReLU) as activation
function, and max-pooling operations in-between. Both the convolution and pooling operations are performed in the time domain, using the raw waveform as input. Inspired by our previous work [23], we perform convolutions with a small kernel size and a stride of one, and use a large kernel and stride for the max-pooling operations. Table 1 shows the
architecture of the network.
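A minimal PyTorch sketch of the architecture in Table 1 is given below; the temporal pooling of the final feature maps into a single vector is an assumption made for illustration only.

```python
# Minimal sketch (assumed details) of the paralinguistic extractor in Table 1:
# three 1-D convolutions (kernels 8/6/6, stride 1) with ReLU and max-pooling (10/5/5).
import torch
import torch.nn as nn

class ParalinguisticExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 50, kernel_size=8, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=10, stride=10),
            nn.Conv1d(50, 125, kernel_size=6, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=5, stride=5),
            nn.Conv1d(125, 125, kernel_size=6, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=5, stride=5),
        )

    def forward(self, waveform):
        # waveform: (batch, samples) raw audio; add a channel dimension for Conv1d.
        feats = self.net(waveform.unsqueeze(1))   # (batch, 125, time)
        return feats.mean(dim=-1)                 # (batch, 125); pooling choice is an assumption

# Example: a batch of two 10-sec clips sampled at 22 050 Hz.
x_p = ParalinguisticExtractor()(torch.randn(2, 220500))
print(x_p.shape)   # torch.Size([2, 125])
```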
3.3. Fusion Strategies
Our last step is to fuse the semantic ($\mathbf{x}_s \in \mathbb{R}^{d_s}$) and paralinguistic ($\mathbf{x}_p \in \mathbb{R}^{d_p}$) speech features before feeding them to the LSTM. This is performed with two strategies: (i) concatenation, and (ii) a ‘disentangled’ attention mechanism.
Concatenation. The first approach is a standard feature-level fusion, i.e., a simple concatenation of the feature vectors. Mathematically, $\mathbf{x}_{\text{fusion}} = [\mathbf{x}_s, \mathbf{x}_p]$.
Disentangled attention mechanism. For our second approach, we propose using an attention mechanism to fuse the two modalities. To this end, we perform a linear projection for each of the feature sets such that they lie in the same vector space (with dimension $d_u$):
$$\tilde{\mathbf{x}}_s = W_s \mathbf{x}_s + \mathbf{b}_s, \qquad \tilde{\mathbf{x}}_p = W_p \mathbf{x}_p + \mathbf{b}_p, \quad (4)$$
where $W_s \in \mathbb{R}^{d_u \times d_s}$ and $W_p \in \mathbb{R}^{d_u \times d_p}$ are projection matrices for the semantic and paralinguistic feature sets, respectively.
We fuse these features using an attention mechanism, i.e.,
$$\mathrm{Attention}(\tilde{\mathbf{x}}_s, \tilde{\mathbf{x}}_p) = \alpha_s \tilde{\mathbf{x}}_s + \alpha_p \tilde{\mathbf{x}}_p, \qquad \alpha_i = \mathrm{softmax}\!\left(\frac{\tilde{\mathbf{x}}_i \mathbf{q}_i}{\sqrt{d_u}}\right), \quad (5)$$
where $\mathbf{q}_i \in \mathbb{R}^{d_u}$ is a learnable vector that attends to the different features.
At this point, we use three fully-connected (FC) layers with linear activation and the same dimensionality on top of the output obtained from the first attention layer, i.e.,
$$\mathbf{a} = W_a \tilde{\mathbf{x}}_{sp} + \mathbf{b}_a, \quad \mathbf{v} = W_v \tilde{\mathbf{x}}_{sp} + \mathbf{b}_v, \quad \mathbf{l} = W_l \tilde{\mathbf{x}}_{sp} + \mathbf{b}_l, \quad (6)$$
where $\{W_a, W_v, W_l\} \in \mathbb{R}^{d_u \times d_u}$ are projection matrices, and $\tilde{\mathbf{x}}_{sp}$ denotes the output of Eq. 5.
We choose to use three FC layers such that the information flow per emotional dimension (i.e., arousal, valence, and liking) in the network is disentangled. The intuition is that each of these additional dense projections can learn features best suited to one dimension of our emotion space. In the case of a higher number of output dimensions, more FC layers can be used.
To fuse the information of the ‘disentangled’ vector spaces, we apply attention layers so that the dimension-specific feature sets can attend to one another and produce an enriched fused feature for the final prediction. In particular, we first apply attention on $\mathbf{a}$ and $\mathbf{l}$, and then on the result together with $\mathbf{v}$, i.e.,
$$\mathbf{z} = \mathrm{Attention}(\mathbf{a}, \mathbf{l}), \qquad \mathbf{x}_{\text{fusion}} = \mathrm{Attention}(\mathbf{z}, \mathbf{v}). \quad (7)$$
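The following is a minimal PyTorch sketch of the complete fusion block of Eqs. (4)-(7); the dimensionality $d_u$, the use of a single learnable query per attention layer, and the variable names are illustrative assumptions.

```python
# Minimal sketch (assumed shapes and names) of the disentangled attention fusion of
# Eqs. (4)-(7): project both feature sets to d_u, attend, branch into three
# dimension-specific projections, then fuse them again with attention.
import math
import torch
import torch.nn as nn

def attention(a: torch.Tensor, b: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Eq. (5): softmax over scaled scores of the two inputs against a learnable query q."""
    d_u = a.size(-1)
    scores = torch.stack([a @ q, b @ q], dim=-1) / math.sqrt(d_u)   # (batch, 2)
    alpha = torch.softmax(scores, dim=-1)
    return alpha[..., :1] * a + alpha[..., 1:] * b

class DisentangledFusion(nn.Module):
    def __init__(self, d_s: int, d_p: int, d_u: int = 128):
        super().__init__()
        self.proj_s = nn.Linear(d_s, d_u)            # Eq. (4)
        self.proj_p = nn.Linear(d_p, d_u)
        self.fc_a = nn.Linear(d_u, d_u)              # Eq. (6): arousal branch
        self.fc_v = nn.Linear(d_u, d_u)              #          valence branch
        self.fc_l = nn.Linear(d_u, d_u)              #          liking branch
        self.q1 = nn.Parameter(torch.randn(d_u))     # queries for the attention layers
        self.q2 = nn.Parameter(torch.randn(d_u))
        self.q3 = nn.Parameter(torch.randn(d_u))

    def forward(self, x_s, x_p):
        x_sp = attention(self.proj_s(x_s), self.proj_p(x_p), self.q1)   # Eq. (5)
        a, v, l = self.fc_a(x_sp), self.fc_v(x_sp), self.fc_l(x_sp)     # Eq. (6)
        z = attention(a, l, self.q2)                                    # Eq. (7)
        return attention(z, v, self.q3)                                 # fused representation

# Example: semantic features of size 100 and paralinguistic features of size 125.
fused = DisentangledFusion(d_s=100, d_p=125)(torch.randn(4, 100), torch.randn(4, 125))
print(fused.shape)   # torch.Size([4, 128])
```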
4. DATASET
We test the performance of our proposed framework on a time-
continuous emotion recognition dataset for real-world environ-
ments. In particular, as outlined, we utilise the Sentiment Analysis
in the Wild (SEWA) dataset that was used in the AVEC 2017 chal-
lenge [16]. The dataset consists of ‘in-the-wild’ audiovisual recordings that were captured with web-cameras and microphones from 32 pairs of subjects (i.e., 64 participants) who watched a 90-sec commercial video and discussed it with their partner for a maximum of 3 min. It provides three modalities, namely, audio, visual, and text, for three emotional dimensions: arousal, valence, and liking. The dataset is split into 3 partitions: training (17 pairs), development (7 pairs), and test (8 pairs), and was annotated by 6 German-speaking annotators (3 female, 3 male).
5. EXPERIMENTS
5.1. Experimental Setup
For training the models, we utilised the Adam optimisation method [27] and a fixed learning rate of $10^{-4}$ throughout all experiments. We used a mini-batch of 25 samples with a sequence length of 300, and dropout [28] with $p = 0.5$ for all layers except the recurrent ones to regularise our network. This step is important, as our models have a large number of parameters, and not regularising the network makes it prone to overfitting on the training data. In addition, the LSTM network used in the training phase is trained with a dropout of 0.5 and a gradient norm clipping of 5.0. Finally, we segment the raw waveform into 10-sec long sequences with a sampling rate of 22 050 Hz. Hence, each sequence corresponds to a 220 500-dimensional vector.
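A small sketch of this setup (segmentation and optimisation settings, with a placeholder model) is shown below; only the numerical values are taken from the text, everything else is illustrative.

```python
# Minimal sketch of the setup in Sec. 5.1 (values from the text; model/data are placeholders).
import torch

SAMPLE_RATE = 22050
SEGMENT_SAMPLES = SAMPLE_RATE * 10          # 10-sec segments -> 220 500 samples each

def segment_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Split a 1-D waveform into non-overlapping 10-sec segments, dropping the remainder."""
    n = waveform.numel() // SEGMENT_SAMPLES
    return waveform[: n * SEGMENT_SAMPLES].view(n, SEGMENT_SAMPLES)

model = torch.nn.Linear(SEGMENT_SAMPLES, 3)                       # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)         # fixed learning rate of 1e-4

segments = segment_waveform(torch.randn(SAMPLE_RATE * 35))        # e.g., a 35-sec recording
loss = model(segments).mean()                                     # dummy forward pass / loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient norm clipping of 5.0
optimizer.step()
```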
5.2. Objective Function
Our objective function is based on the Concordance Correlation
Coefficient (ρc) that was also used in the AVEC 2017 challenge.
$\rho_c$ evaluates the level of agreement between the predictions and the gold standard by scaling their correlation coefficient with their mean square difference. Mathematically, the concordance loss $\mathcal{L}_c$ can be defined as follows:
$$\mathcal{L}_c = 1 - \rho_c = 1 - \frac{2\sigma_{xy}^2}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}, \quad (8)$$
where $\mu_x = \mathrm{E}(x)$, $\mu_y = \mathrm{E}(y)$, $\sigma_x^2 = \mathrm{var}(x)$, $\sigma_y^2 = \mathrm{var}(y)$, and $\sigma_{xy}^2 = \mathrm{cov}(x, y)$.
Our end-to-end network is trained to predict the arousal, valence, and liking dimensions, and as such, we define the overall loss as $\mathcal{L} = (\mathcal{L}_c^a + \mathcal{L}_c^v + \mathcal{L}_c^l)/3$, where $\mathcal{L}_c^a$, $\mathcal{L}_c^v$, and $\mathcal{L}_c^l$ are the concordance losses of the arousal, valence, and liking dimensions, respectively, contributing equally to the overall loss.
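A minimal sketch of the concordance loss of Eq. (8) and the equally weighted overall loss is given below, assuming the predictions and gold standard are stored as (time, 3) tensors; the exact tensor layout is an assumption.

```python
# Minimal sketch (assumed tensor shapes) of the concordance loss in Eq. (8) and the
# averaged multi-dimensional loss used for training.
import torch

def concordance_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """1 - CCC between prediction and gold standard; both are 1-D tensors over time."""
    mu_x, mu_y = pred.mean(), gold.mean()
    var_x, var_y = pred.var(unbiased=False), gold.var(unbiased=False)
    cov_xy = ((pred - mu_x) * (gold - mu_y)).mean()
    ccc = 2.0 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)
    return 1.0 - ccc

def total_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """pred, gold: (time, 3) tensors for arousal, valence, and liking, weighted equally."""
    losses = [concordance_loss(pred[:, d], gold[:, d]) for d in range(3)]
    return sum(losses) / 3.0

# Example with random predictions and annotations over 300 time steps.
print(total_loss(torch.randn(300, 3), torch.randn(300, 3)))
```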
5.3. Ablation Study
5.3.1. Comparing Vector Spaces
We test the performance of both the semantic and paralinguistic
networks, trained independently, and trained jointly, to show the
beneficial properties of our proposed framework. Table 2 depicts
the results in terms of ρcon the development set of the SEWA
dataset. We observe that Word2Vec produces slightly better results
than Speech2Vec. However, after aligning their embedding spaces,
the aligned Speech2Vec has higher performance than Word2Vec, in-
dicating both that the refinement process makes the speech embeddings similar to the word ones, and that paralinguistic information exists in
the model. Finally, the paralinguistic network, although it produces
worse results than the aligned Speech2Vec model, provides the best
results for the arousal dimension.
Model                 Arousal   Valence   Liking   Avg
Word2Vec              .434      .513      .208     .385
Speech2Vec            .433      .470      .182     .362
Aligned Speech2Vec    .453      .452      .257     .387
Paralinguistic        .508      .436      .154     .366

Table 2. SEWA dataset results (in terms of ρc) of the Word2Vec, Speech2Vec, aligned Speech2Vec, and paralinguistic models on the development set.
5.3.2. Fusion Strategies
We further explore the effectiveness of the attention fusion strategy compared to the simple concatenation. For our experiments, we utilised both the semantic and paralinguistic deep network models of the proposed method. Table 3 depicts the results in terms of ρc on the development set of the SEWA dataset. We observe that the attention method performs better than concatenation on all emotional dimensions, indicating the effectiveness of our approach of modelling the three emotional dimensions by projecting them to three different spaces before fusing them together with attention.
Fusion strategy          Arousal   Valence   Liking   Avg
Concatenation            .427      .428      .306     .387
Disentangled attention   .499      .497      .311     .435

Table 3. SEWA dataset results (in terms of ρc) of the two fusion strategies (i.e., concatenation and disentangled attention) on the development set.
Method              Arousal        Valence        Liking
Baseline [16]       .225 (.344)    .224 (.351)    -.020 (.081)
Dang et al. [18]    .344 (.494)    .346 (.507)    — (—)
Huang et al. [19]   .583 (.584)    .487 (.585)    — (—)
Chen et al. [17]    .422 (.524)    .405 (.504)    .054 (.273)
Proposed            .429 (.499)    .503 (.497)    .312 (.311)

Table 4. SEWA dataset test results (in terms of ρc) of our proposed fusion model compared with the winning models of AVEC 2017. Performances on the development set are shown in parentheses. A dash is inserted where results could not be obtained.
5.4. Results
We compare our proposed framework with the winning papers of the
AVEC 2017 challenge. As our model utilises only the audio modal-
ity during evaluation, we show, for fairness of comparison, the re-
sults of these studies using the audio information. Table 4 depicts
the results. First, we observe that our approach provides the best results in the valence dimension by a large margin, and the second best in the arousal one. We should note, however, that the network from Huang et al. [19] was pretrained on 300 hours of a spontaneous English speech recognition corpus before being fine-tuned on the SEWA dataset. In addition to the features of the network, they also utilise several hand-engineered features. Second, we observe that our approach provides the highest performance in the likability dimension. Our method is able to generalise on this dimension, in contrast to Chen et al. [17], whose performance drops significantly relative to the development set. Finally, we should
note the high generalisation capability of our approach to model all
three emotional dimensions, indicating the effectiveness of the pro-
posed disentangled attention mechanism strategy.
6. CONCLUSIONS
In this paper, we propose a training framework using audio and
text information for speech emotion recognition. In particular, we
use Word2Vec and Speech2Vec models, and align their embedding
spaces for accurate semantic feature extraction using only the speech
signal. We combine the semantic and paralinguistic features using
a novel attention fusion strategy that first disentangles the informa-
tion per emotional dimension, and then combines it using attention.
The proposed model is evaluated on the SEWA dataset and produces
state-of-the-art results on the valence and liking dimensions, when
compared with the best performing papers submitted to the AVEC
2017 challenge.
In future work, we intend to use a single network to simulta-
neously capture the semantic and the paralinguistic information in
the speech signal. This will simplify the model and, at the same time, reduce its number of parameters. Additionally,
we intend to investigate the performance of the proposed method on
categorical emotion recognition datasets.
7. ACKNOWLEDGEMENTS
The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged.
REFERENCES
[1] R. Picard, Affective Computing, MIT Press, 1997.
[2] C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, “Features
and classifiers for emotion recognition from speech: a survey
from 2000 to 2011,” Artificial Intelligence Review, pp. 155–
177, 2015.
[3] P. Tzirakis, S. Zafeiriou, and B. Schuller, “Chapter 18 - Real-
world automatic continuous affect recognition from audiovi-
sual signals,” in Multimodal Behavior Analysis in the Wild, pp.
387–406. Elsevier, 2019.
[4] D. Kollias, P. Tzirakis, Mihalis A Nicolaou, A. Papaioannou,
G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou, “Deep affect
prediction in-the-wild: Aff-wild database and challenge, deep
architectures, and beyond,” International Journal of Computer
Vision, pp. 907–929, 2019.
[5] B. Schuller, “Speech emotion recognition: two decades in a
nutshell, benchmarks, and ongoing trends,” Communications
of the ACM, pp. 90–99, 2018.
[6] L. Stappen, A. Baird, G. Rizos, P. Tzirakis, X. Du, F. Hafner,
L. Schumann, A. Mallol-Ragolta, B. Schuller, I. Lefter, et al.,
“Muse 2020 challenge and workshop: Multimodal sentiment
analysis, emotion-target engagement and trustworthiness de-
tection in real-life media: Emotional car reviews in-the-wild,” in Proc. ACM International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop, 2020, pp. 35–44.
[7] J. Zhang, Z. Yin, P. Chen, and S. Nichele, “Emotion recogni-
tion using multi-modal data and machine learning techniques:
A tutorial and review,” Information Fusion, pp. 103–126, 2020.
[8] P. Tzirakis, G. Trigeorgis, M. Nicolaou, B. Schuller, and
S. Zafeiriou, “End-to-end multimodal emotion recognition us-
ing deep neural networks,” IEEE Journal of Selected Topics in
Signal Processing, pp. 1301–1309, 2017.
[9] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emo-
tion recognition in speech using cross-modal transfer in the
wild,” in Proc. ACM Multimedia, 2018, pp. 292–301.
[10] P. Tzirakis, S. Zafeiriou, and B. Schuller, “End2You–The Im-
perial Toolkit for Multimodal Profiling by End-to-End Learn-
ing,” arXiv preprint arXiv:1802.01115, 2018.
[11] S. Yoon, S. Byun, and K. Jung, “Multimodal speech emotion recognition using audio and text,” in Proc. IEEE Spoken Language Technology Workshop, 2018, pp. 112–118.
[12] M. V. Mäntylä, D. Graziotin, and M. Kuutila, “The evolution of sentiment analysis—A review of research topics, venues, and top cited papers,” Computer Science Review, pp. 16–32, 2018.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Advances in neural information processing systems (NeurIPS), 2013, pp. 3111–3119.
[14] Y.-A. Chung and J. Glass, “Speech2vec: A sequence-
to-sequence framework for learning word embeddings from
speech,” Proc. Interspeech, pp. 811–815, 2018.
[15] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, “Unsu-
pervised cross-modal alignment of speech and text embedding
spaces,” in Proc. Advances in neural information processing
systems (NeurIPS), 2018, pp. 7354–7364.
[16] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie,
S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pan-
tic, “Avec 2017: Real-life depression, and affect recognition
workshop and challenge,” in Proc. ACM Multimedia Work-
shop, 2017, pp. 3–9.
[17] S. Chen, Q. Jin, J. Zhao, and S. Wang, “Multimodal multi-task learning for dimensional and continuous emotion recognition,” in Proc. ACM Multimedia Workshops, 2017, pp. 19–26.
[18] T. Dang, B. Stasak, Z. Huang, S. Jayawardena, M. Atcheson, M. Hayat, P. Le, V. Sethu, R. Goecke, and J. Epps, “Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in AVEC 2017,” in Proc. ACM Multimedia Workshops, 2017, pp. 27–35.
[19] J. Huang, Y. Li, J. Tao, Z. Lian, Z. Wen, M. Yang, and J. Yi, “Continuous multimodal emotion prediction based on long short term memory recurrent neural network,” in Proc. ACM Multimedia Workshops, 2017, pp. 11–18.
[20] B. Schuller, S. Steidl, A. Batliner, P. Marschik, H. Baumeister, F. Dong, S. Hantke, F. Pokorny, et al., “The INTERSPEECH 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats,” in Proc. Interspeech, 2018, pp. 122–126.
[21] P. Tzirakis, J. Chen, S. Zafeiriou, and B. Schuller, “End-to-end multimodal affect recognition in real-world environments,” Information Fusion, vol. 68, pp. 46–53, 2021.
[22] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, Mihalis A.
Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? end-
to-end speech emotion recognition using a deep convolutional
recurrent network,” in Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.
5200–5204.
[23] P. Tzirakis, J. Zhang, and B. Schuller, “End-to-end speech
emotion recognition using deep neural networks,” in Proc.
IEEE International Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP), 2018, pp. 5089–5093.
[24] L. Tarantino, P. Garner, and A. Lazaridis, “Self-attention for
speech emotion recognition,” Proc. Interspeech 2019, pp.
2578–2582, 2019.
[25] M. Neumann and N. T. Vu, “Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech,” Proc. Interspeech 2017, pp. 1263–1267, 2017.
[26] J. Han, Z. Zhang, Z. Ren, and B. Schuller, “Implicit fusion by joint audiovisual training for emotion recognition in mono modality,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5861–5865.
[27] D. Kingma and J. Ba, “Adam: A method for stochastic opti-
mization,” arXiv preprint arXiv:1412.6980, 2014.
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural
networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthi-ness detection by means of more comprehensively integrating the audiovisual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audiovisual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild , which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic , in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust , in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CaR , the first of its kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34· UAR + 0.66·F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.
Article
Full-text available
Automatic understanding of human affect using visual signals is of great importance in everyday human–machine interactions. Appraising human emotional states, behaviors and reactions displayed in real-world settings, can be accomplished using latent continuous dimensions (e.g., the circumplex model of affect). Valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion) constitute popular and effective representations for affect. Nevertheless, the majority of collected datasets this far, although containing naturalistic emotional states, have been captured in highly controlled recording conditions. In this paper, we introduce the Aff-Wild benchmark for training and evaluating affect recognition algorithms. We also report on the results of the First Affect-in-the-wild Challenge (Aff-Wild Challenge) that was recently organized in conjunction with CVPR 2017 on the Aff-Wild database, and was the first ever challenge on the estimation of valence and arousal in-the-wild. Furthermore, we design and extensively train an end-to-end deep neural architecture which performs prediction of continuous emotion dimensions based on visual cues. The proposed deep learning architecture, AffWildNet, includes convolutional and recurrent neural network layers, exploiting the invariant properties of convolutional features, while also modeling temporal dynamics that arise in human behavior via the recurrent layers. The AffWildNet produced state-of-the-art results on the Aff-Wild Challenge. We then exploit the AffWild database for learning features, which can be used as priors for achieving best performances both for dimensional, as well as categorical emotion recognition, using the RECOLA, AFEW-VA and EmotiW 2017 datasets, compared to all other methods designed for the same goal. The database and emotion recognition models are available at http://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge.
Conference Paper
Full-text available
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Conference Paper
Full-text available
The INTERSPEECH 2018 Computational Paralinguistics Challenge addresses four different problems for the first time ina research competition under well-defined conditions: In the Atypical Affect Sub-Challenge, four basic emotions annotatedin the speech of handicapped subjects have to be classified; in the Self-Assessed Affect Sub-Challenge, valence scores given by the speakers themselves are used for a three-class classification problem; in the Crying Sub-Challenge, three types of infant vocalisations have to be told apart; and in the Heart Beats Sub-Challenge, three different types of heart beats have to be determined. We describe the Sub Challenges, their conditions, and baseline feature extraction and classifiers, which include data-learnt (supervised) feature representations by end-to-end learning, the 'usual’ ComParE and BoAW features, and deep unsupervised representation learning using the auDeep toolkit for the first time in the challenge series.
Article
Automatic affect recognition in real-world environments is an important task towards a natural interaction between humans and machines. The recent years, several advancements have been accomplished in determining the emotional states with the use of Deep Neural Networks (DNNs). In this paper, we propose an emotion recognition system that utilizes the raw text, audio and visual information in an end-to-end manner. To capture the emotional states of a person, robust features need to be extracted from the various modalities. To this end, we utilize Convolutional Neural Networks (CNNs) and propose a novel transformer-based architecture for the text modality that can robustly capture the semantics of sentences. We develop an audio model to process the audio channel, and adopt a variation of a high resolution network (HRNet) to process the visual modality. To fuse the modality-specific features, we propose novel attention-based methods. To capture the temporal dynamics in the signal, we utilize Long Short-Term Memory (LSTM) networks. Our model is trained on the SEWA dataset of the AVEC 2017 research sub-challenge on emotion recognition, and produces state-of-the-art results in the text, visual and multimodal domains, and comparable performance in the audio case when compared with the winning papers of the challenge that use several hand-crafted and DNN features. Code is available at: https://github.com/glam-imperial/multimodal-affect-recognition.
Article
In recent years, the rapid advances in machine learning (ML) and information fusion has made it possible to endow machines/computers with the ability of emotion understanding, recognition, and analysis. Emotion recognition has attracted increasingly intense interest from researchers from diverse fields. Human emotions can be recognized from facial expressions, speech, behavior (gesture/posture) or physiological signals. However, the first three methods can be ineffective since humans may involuntarily or deliberately conceal their real emotions (so-called social masking). The use of physiological signals can lead to more objective and reliable emotion recognition. Compared with peripheral neurophysiological signals, electroencephalogram (EEG) signals respond to fluctuations of affective states more sensitively and in real time and thus can provide useful features of emotional states. Therefore, various EEG-based emotion recognition techniques have been developed recently. In this paper, the emotion recognition methods based on multi-channel EEG signals as well as multi-modal physiological signals are reviewed. According to the standard pipeline for emotion recognition, we review different feature extraction (e.g., wavelet transform and nonlinear dynamics), feature reduction, and ML classifier design methods (e.g., k-nearest neighbor (KNN), naive Bayesian (NB), support vector machine (SVM) and random forest (RF)). Furthermore, the EEG rhythms that are highly correlated with emotions are analyzed and the correlation between different brain areas and emotions is discussed. Finally, we compare different ML and deep learning algorithms for emotion recognition and suggest several open problems and future research directions in this exciting and fast-growing area of AI.
Conference Paper
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.