SPEECH EMOTION RECOGNITION USING SEMANTIC INFORMATION
Panagiotis Tzirakis1, Anh Nguyen1, Stefanos Zafeiriou1, Björn W. Schuller1,2
1GLAM – Group on Language, Audio, & Music, Imperial College London, UK
2EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
email: panagiotis.tzirakis12@imperial.ac.uk
ABSTRACT
Speech emotion recognition is a crucial problem manifesting in a
multitude of applications such as human computer interaction and
education. Although several advancements have been made in recent years, especially with the advent of Deep Neural Networks
(DNN), most of the studies in the literature fail to consider the se-
mantic information in the speech signal. In this paper, we propose
a novel framework that can capture both the semantic and the par-
alinguistic information in the signal. In particular, our framework is
comprised of a semantic feature extractor, that captures the semantic
information, and a paralinguistic feature extractor, that captures the
paralinguistic information. Both semantic and paralinguistic features
are then combined into a unified representation using a novel attention
mechanism. The unified feature vector is passed through an LSTM
to capture the temporal dynamics in the signal, before the final pre-
diction. To validate the effectiveness of our framework, we use the
popular SEWA dataset of the AVEC challenge series and compare
with the three winning papers. Our model provides state-of-the-art
results in the valence and liking dimensions. 1
Index Terms— emotion recognition, deep learning, semantic, paralinguistic, audiotextual information
1. INTRODUCTION
Automatic affect recognition is a vital component in human-to-
human communication, affecting, among others, our social interaction and perception [1]. In order to accomplish a natural interaction between human and machine, intelligent systems need to recognise the emotional state of individuals. However, the task is challenging, as human emotions lack temporal boundaries and different
individuals express emotions in different ways [2]. In addition,
emotions are expressed through multiple modalities. Over the past
two decades, a plethora of systems have been proposed that utilise
several modalities such as physiological signals, facial expression,
speech, and text [3, 4, 5, 6, 7]. To achieve an accurate emotion
recognition system, it is important to consider multiple modalities,
as complementary information exists among them [3].
Current studies exploit Deep Neural Networks (DNNs) to model
affect using multiple modalities [8, 9, 10]. Two modalities that have
been extensively used for the emotion recognition task are speech
and text [11, 12]. Whereas the speech signal provides low-level
characteristics of the emotions (e. g., prosody), text provides high-
level (semantic) information (e. g., the words “love” and “like” carry
strong emotional content). To this end, several systems have shown
1Code available here: https://github.com/glam-imperial/
semantic_speech_emotion_recognition
that by integrating both modalities, strong performance gains can be
obtained [11].
However, one may argue that the textual information is redun-
dant, as it is already included in the speech signal, and as such se-
mantic information can be captured using only the speech modality.
To this end, we propose an audiotextual training framework, where
the text modality is used during training, but discarded during eval-
uation. In particular, we train Word2Vec [13] and Speech2Vec [14]
models, and align their two embedding spaces such that Speech2Vec
features are as close as possible with the Word2Vec ones [15]. In ad-
dition to the semantic information, we capture low-level character-
istics of the speech signal by training a convolution recurrent neural
network. The semantic and paralinguistic features are combined to a
unified representation and passed through a long short-term memory
(LSTM) module that captures the temporal dynamics in the signal,
before the final prediction.
To test the effectiveness of our model, we utilise the Sentiment
Analysis in the Wild (SEWA) dataset, which has been used in the Audio/Visual Emotion Challenge (AVEC) series since 2017 [16]. The dataset
provides three continuous affect dimensions: arousal, valence, and
likability. Although the arousal and valence dimensions are eas-
ily integrated in a single network during the training phase of the
models, the likability dimension can cause convergence and gen-
eralisation difficulties [16, 17]. To this end, we propose to use a
novel ‘disentangled’ attention mechanism to fuse the semantic and
paralinguistic features such that the information required per affect
dimension is disentangled. Our approach provides training stabil-
ity, and, at the same time, increases the generalisability of the net-
work during evaluation. We compare our framework with the three
best performing papers of the competition [18, 19, 17] in terms of
concordance correlation coefficient (ρc) [20, 21], and show that our
method provides state-of-the-art results for the valence and likability
dimensions.
In summary, the main contributions of the paper are the follow-
ing: (a) propose to use the acoustic speech signal to capture semantic
information that exists in the text modality, (b) show how to disentangle the information in the network per affect dimension for stable
training and generalisability during the evaluation phase, and (c) pro-
duce state-of-the-art results in the valence and likability dimensions
using the SEWA dataset.
2. RELATED WORK
Several studies have been proposed in the literature for speech emo-
tion recognition [22, 23, 24]. For example, Trigeorgis et al. [22]
utilised a convolutional neural network to capture the spatial information in the signal, and a recurrent neural network for the temporal ones.
Fig. 1. Our proposed model is comprised of two networks: (a) the semantic feature extractor, that extracts high-level features containing semantic information of the input, and (b) the paralinguistic feature extractor, that extracts low-level features containing paralinguistic information of the signal. Both feature vectors are passed through a fusion layer, that combines the information and extracts a unified representation of the input, that is then passed through an LSTM model for the final prediction.
In a similar study, Tzirakis et al. [23] showed that utilising
a deeper architecture with a longer input window produces better results. In another study, Neumann et al. [25] proposed an attentive
convolutional neural network (ACNN) that combines CNNs with at-
tention.
In the past ten years, a plethora of models have been proposed
that incorporate more than one modality for the emotion recognition
task [8, 26, 9]. In particular, Tzirakis et al. [8] use both audio and
visual information for continuous emotion recognition. Although
this study produced good results, it utilises both modalities for the
training and evaluation of the model. In a more recent study, Al-
banie et al. [9] transfer the knowledge from the visual information
(facial expressions) to the speech model. In another study, Han et
al. [26] proposed an implicit fusion strategy for audiovisual emotion
recognition. In this study, both the audio and visual modalities are used for training the model, but only one of them for its evaluation.
3. PROPOSED METHOD
Our cross-modal framework can leverage the semantic (high-level)
information (Sec. 3.1) and the paralinguistic (low-level) dynamics
in the speech signal (Sec. 3.2). The low- and high-level feature sets
are fused together using a novel attention fusion strategy (Sec. 3.3)
before being fed to a one-layer LSTM module that captures the temporal dynamics in the signal for the final frame-level prediction.
Fig. 1 depicts the proposed method.
3.1. Semantic Feature Extractor
To capture the semantic information in the speech signal, we train
Word2Vec and Speech2Vec models. The first model uses the text
information to extract a semantic vector representation from a given word, whereas the second one uses the speech signal. We align their embedding spaces, similar to [15], to obtain semantically richer speech representations. Mathematically, we define the speech embedding matrix $S = [s_1, s_2, \ldots, s_m] \in \mathbb{R}^{m \times d_s}$ to contain $m$ vocabulary words of dimension $d_s$, and the text embedding matrix $T = [t_1, t_2, \ldots, t_n] \in \mathbb{R}^{n \times d_t}$ to contain $n$ vocabulary words of dimension $d_t$. Our goal is to learn a linear mapping $W \in \mathbb{R}^{d_t \times d_s}$ such that $WS$ is most similar to $T$.
To this end, we learn an initial proxy of $W$ via domain-adversarial training. The adversarial training is a two-player game in which the generator tries, by computing $W$, to deceive the discriminator from correctly identifying the embedding space, making $WS$ and $T$ as similar as possible. Mathematically, the discriminator tries to minimise the following objective:
$$\mathcal{L}_D(\theta_D \mid W) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{speech} = 1 \mid W s_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{speech} = 0 \mid t_i), \quad (1)$$
where $\theta_D$ are the parameters of the discriminator, and $P_{\theta_D}(\text{speech} = 1 \mid z)$ is the probability that the vector $z$ originates from the speech embedding space.
On the other hand, the generator tries to minimise the following
objective:
$$\mathcal{L}_G(W \mid \theta_D) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\text{speech} = 0 \mid W s_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\text{speech} = 1 \mid t_i). \quad (2)$$
A limitation of the above formulation is that all embedding vec-
tors are treated equally during training. However, words with higher
frequency would have better embedding quality in the vector space
than less frequent words. To this end, we use the frequent words to
create a dictionary that specifies which speech embedding vectors
correspond to which text embedding vectors, and refine W:
$$W = \underset{W}{\operatorname{argmin}} \; \lVert W S_r - T_r \rVert_F, \quad (3)$$
where $S_r$ is a matrix built from $k$ speech vectors of $S$ and $T_r$ is a matrix built from $k$ vectors of $T$. The solution of Eq. 3 is obtained from the singular value decomposition of $S_r T_r^T$, i.e., $\mathrm{SVD}(S_r T_r^T) = U \Sigma V^T$.
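The closed-form refinement of Eq. 3 can be sketched in a few lines of NumPy, assuming the $k$ dictionary-aligned embedding pairs have already been selected and that embeddings are stored as rows (the exact transpose convention is an assumption):

```python
# Minimal sketch of the SVD-based refinement of Eq. 3 (orthogonal Procrustes),
# assuming embeddings are stored as rows and k dictionary pairs are given.
import numpy as np

def refine_mapping(S_r: np.ndarray, T_r: np.ndarray) -> np.ndarray:
    """S_r: (k, d_s) speech embeddings, T_r: (k, d_t) text embeddings of the same k words.
    Returns W of shape (d_t, d_s) such that S_r @ W.T best matches T_r in the Frobenius norm."""
    U, _, Vt = np.linalg.svd(T_r.T @ S_r)  # SVD of the cross-covariance of the two embedding sets
    return U @ Vt                          # orthogonal mapping from the speech to the text space

# Toy usage with random embeddings (k = 1000, d_s = d_t = 300).
S_r, T_r = np.random.randn(1000, 300), np.random.randn(1000, 300)
W = refine_mapping(S_r, T_r)
aligned_speech = S_r @ W.T   # Speech2Vec embeddings mapped into the Word2Vec space
```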
Layer         Kernel/Stride   Channels   Activation
Convolution   8 / 1           50         ReLU
Max-pooling   10 / 10         —          —
Convolution   6 / 1           125        ReLU
Max-pooling   5 / 5           —          —
Convolution   6 / 1           125        ReLU
Max-pooling   5 / 5           —          —

Table 1. Paralinguistic feature extractor. Shown are the layer type, kernel/stride size, channel size, and activation function.
3.2. Paralinguistic Feature Extractor
Our paralinguistic feature extraction network is comprised of three
1-D CNN layers with a rectified linear unit (ReLU) as activation
function, and max-pooling operations in-between. Both the convolution and pooling operations are performed in the time domain, using the raw waveform as input. Inspired by our previous work [23], we perform convolutions with a small kernel size and a stride of one, and use a large kernel and stride for the max-pooling operations. Table 1 shows the
architecture of the network.
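A minimal PyTorch sketch of the architecture in Table 1 is given below; the temporal pooling of the final feature maps into a single vector is an assumption made for illustration only.

```python
# Minimal sketch (assumed details) of the paralinguistic extractor in Table 1:
# three 1-D convolutions (kernels 8/6/6, stride 1) with ReLU and max-pooling (10/5/5).
import torch
import torch.nn as nn

class ParalinguisticExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 50, kernel_size=8, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=10, stride=10),
            nn.Conv1d(50, 125, kernel_size=6, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=5, stride=5),
            nn.Conv1d(125, 125, kernel_size=6, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=5, stride=5),
        )

    def forward(self, waveform):
        # waveform: (batch, samples) raw audio; add a channel dimension for Conv1d.
        feats = self.net(waveform.unsqueeze(1))   # (batch, 125, time)
        return feats.mean(dim=-1)                 # (batch, 125); pooling choice is an assumption

# Example: a batch of two 10-sec clips sampled at 22 050 Hz.
x_p = ParalinguisticExtractor()(torch.randn(2, 220500))
print(x_p.shape)   # torch.Size([2, 125])
```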
3.3. Fusion Strategies
Our last step is to fuse the semantic ($\mathbf{x}_s \in \mathbb{R}^{d_s}$) and paralinguistic ($\mathbf{x}_p \in \mathbb{R}^{d_p}$) speech features before feeding them to the LSTM. This is performed with two strategies: (i) concatenation, and (ii) a ‘disentangled’ attention mechanism.
Concatenation. The first approach is a standard feature-level fusion, i.e., a simple concatenation of the feature vectors. Mathematically, $\mathbf{x}_{\text{fusion}} = [\mathbf{x}_s, \mathbf{x}_p]$.
Disentangled attention mechanism. For our second approach, we propose using an attention mechanism to fuse the two modalities. To this end, we perform a linear projection for each of the feature sets such that they lie in the same vector space (with dimension $d_u$):
$$\tilde{\mathbf{x}}_s = W_s \mathbf{x}_s + \mathbf{b}_s, \qquad \tilde{\mathbf{x}}_p = W_p \mathbf{x}_p + \mathbf{b}_p, \quad (4)$$
where $W_s \in \mathbb{R}^{d_u \times d_s}$ and $W_p \in \mathbb{R}^{d_u \times d_p}$ are projection matrices for the semantic and paralinguistic feature sets, respectively.
We fuse these features using an attention mechanism, i.e.,
$$\mathrm{Attention}(\tilde{\mathbf{x}}_s, \tilde{\mathbf{x}}_p) = \alpha_s \tilde{\mathbf{x}}_s + \alpha_p \tilde{\mathbf{x}}_p, \qquad \alpha_i = \mathrm{softmax}\!\left(\frac{\tilde{\mathbf{x}}_i \mathbf{q}_i}{\sqrt{d_u}}\right), \quad (5)$$
where $\mathbf{q}_i \in \mathbb{R}^{d_u}$ is a learnable vector that attends to the different features.
At this point, we use three fully-connected (FC) layers with linear activation and the same dimensionality on top of the output obtained from the first attention layer, i.e.,
$$\mathbf{a} = W_a \tilde{\mathbf{x}}_{sp} + \mathbf{b}_a, \quad \mathbf{v} = W_v \tilde{\mathbf{x}}_{sp} + \mathbf{b}_v, \quad \mathbf{l} = W_l \tilde{\mathbf{x}}_{sp} + \mathbf{b}_l, \quad (6)$$
where $\{W_a, W_v, W_l\} \in \mathbb{R}^{d_u \times d_u}$ are projection matrices, and $\tilde{\mathbf{x}}_{sp}$ denotes the output of Eq. 5.
We choose to use three FC layers such that the information flow per emotional dimension (i.e., arousal, valence, and liking) in the network is disentangled. The intuition is that each of these additional dense projections can learn features best suited to one dimension of our emotion space. In the case of a higher number of output dimensions, more FC layers can be used.
To fuse the information of the ‘disentangled’ vector spaces, we apply attention layers so that the dimension-specific feature sets can attend to one another and produce an enriched fused feature for the final prediction. In particular, we first apply attention on $\mathbf{a}$ and $\mathbf{l}$, and then on the result together with $\mathbf{v}$, i.e.,
$$\mathbf{z} = \mathrm{Attention}(\mathbf{a}, \mathbf{l}), \qquad \mathbf{x}_{\text{fusion}} = \mathrm{Attention}(\mathbf{z}, \mathbf{v}). \quad (7)$$
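The following is a minimal PyTorch sketch of the complete fusion block of Eqs. (4)-(7); the dimensionality $d_u$, the use of a single learnable query per attention layer, and the variable names are illustrative assumptions.

```python
# Minimal sketch (assumed shapes and names) of the disentangled attention fusion of
# Eqs. (4)-(7): project both feature sets to d_u, attend, branch into three
# dimension-specific projections, then fuse them again with attention.
import math
import torch
import torch.nn as nn

def attention(a: torch.Tensor, b: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Eq. (5): softmax over scaled scores of the two inputs against a learnable query q."""
    d_u = a.size(-1)
    scores = torch.stack([a @ q, b @ q], dim=-1) / math.sqrt(d_u)   # (batch, 2)
    alpha = torch.softmax(scores, dim=-1)
    return alpha[..., :1] * a + alpha[..., 1:] * b

class DisentangledFusion(nn.Module):
    def __init__(self, d_s: int, d_p: int, d_u: int = 128):
        super().__init__()
        self.proj_s = nn.Linear(d_s, d_u)            # Eq. (4)
        self.proj_p = nn.Linear(d_p, d_u)
        self.fc_a = nn.Linear(d_u, d_u)              # Eq. (6): arousal branch
        self.fc_v = nn.Linear(d_u, d_u)              #          valence branch
        self.fc_l = nn.Linear(d_u, d_u)              #          liking branch
        self.q1 = nn.Parameter(torch.randn(d_u))     # queries for the attention layers
        self.q2 = nn.Parameter(torch.randn(d_u))
        self.q3 = nn.Parameter(torch.randn(d_u))

    def forward(self, x_s, x_p):
        x_sp = attention(self.proj_s(x_s), self.proj_p(x_p), self.q1)   # Eq. (5)
        a, v, l = self.fc_a(x_sp), self.fc_v(x_sp), self.fc_l(x_sp)     # Eq. (6)
        z = attention(a, l, self.q2)                                    # Eq. (7)
        return attention(z, v, self.q3)                                 # fused representation

# Example: semantic features of size 100 and paralinguistic features of size 125.
fused = DisentangledFusion(d_s=100, d_p=125)(torch.randn(4, 100), torch.randn(4, 125))
print(fused.shape)   # torch.Size([4, 128])
```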
4. DATASET
We test the performance of our proposed framework on a time-
continuous emotion recognition dataset for real-world environ-
ments. In particular, as outlined, we utilise the Sentiment Analysis
in the Wild (SEWA) dataset that was used in the AVEC 2017 chal-
lenge [16]. The dataset consists of ‘in-the-wild’ audiovisual recordings that were captured with web-cameras and microphones from 32 pairs of subjects (i.e., 64 participants) who watched a 90-sec commercial video and discussed it with their partner for a maximum of 3 min. It provides three modalities, namely, audio, visual, and text, for three emotional dimensions: arousal, valence, and liking. The dataset is split into 3 partitions: training (17 pairs), development (7 pairs), and test (8 pairs), and was annotated by 6 German-speaking annotators (3 female, 3 male).
5. EXPERIMENTS
5.1. Experimental Setup
For training the models, we utilised the Adam optimisation method [27] and a fixed learning rate of $10^{-4}$ throughout all experiments. We used a mini-batch of 25 samples with a sequence length of 300, and dropout [28] with $p = 0.5$ for all layers except the recurrent ones to regularise our network. This step is important, as our models have a large number of parameters, and not regularising the network makes it prone to overfitting on the training data. In addition, the LSTM network used in the training phase is trained with a dropout of 0.5 and a gradient norm clipping of 5.0. Finally, we segment the raw waveform into 10-sec long sequences with a sampling rate of 22 050 Hz. Hence, each sequence corresponds to a 220 500-dimensional vector.
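A small sketch of this setup (segmentation and optimisation settings, with a placeholder model) is shown below; only the numerical values are taken from the text, everything else is illustrative.

```python
# Minimal sketch of the setup in Sec. 5.1 (values from the text; model/data are placeholders).
import torch

SAMPLE_RATE = 22050
SEGMENT_SAMPLES = SAMPLE_RATE * 10          # 10-sec segments -> 220 500 samples each

def segment_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Split a 1-D waveform into non-overlapping 10-sec segments, dropping the remainder."""
    n = waveform.numel() // SEGMENT_SAMPLES
    return waveform[: n * SEGMENT_SAMPLES].view(n, SEGMENT_SAMPLES)

model = torch.nn.Linear(SEGMENT_SAMPLES, 3)                       # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)         # fixed learning rate of 1e-4

segments = segment_waveform(torch.randn(SAMPLE_RATE * 35))        # e.g., a 35-sec recording
loss = model(segments).mean()                                     # dummy forward pass / loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient norm clipping of 5.0
optimizer.step()
```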
5.2. Objective Function
Our objective function is based on the Concordance Correlation
Coefficient (ρc) that was also used in the AVEC 2017 challenge.
$\rho_c$ evaluates the level of agreement between the predictions and the gold standard by scaling their correlation coefficient with their mean square difference. Mathematically, the concordance loss $\mathcal{L}_c$ can be defined as follows:
$$\mathcal{L}_c = 1 - \rho_c = 1 - \frac{2\sigma_{xy}^2}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}, \quad (8)$$
where $\mu_x = \mathrm{E}(x)$, $\mu_y = \mathrm{E}(y)$, $\sigma_x^2 = \mathrm{var}(x)$, $\sigma_y^2 = \mathrm{var}(y)$, and $\sigma_{xy}^2 = \mathrm{cov}(x, y)$.
Our end-to-end network is trained to predict the arousal, valence, and liking dimensions, and as such, we define the overall loss as $\mathcal{L} = (\mathcal{L}_c^a + \mathcal{L}_c^v + \mathcal{L}_c^l)/3$, where $\mathcal{L}_c^a$, $\mathcal{L}_c^v$, and $\mathcal{L}_c^l$ are the concordance losses of the arousal, valence, and liking dimensions, respectively, contributing equally to the overall loss.
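A minimal sketch of the concordance loss of Eq. (8) and the equally weighted overall loss is given below, assuming the predictions and gold standard are stored as (time, 3) tensors; the exact tensor layout is an assumption.

```python
# Minimal sketch (assumed tensor shapes) of the concordance loss in Eq. (8) and the
# averaged multi-dimensional loss used for training.
import torch

def concordance_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """1 - CCC between prediction and gold standard; both are 1-D tensors over time."""
    mu_x, mu_y = pred.mean(), gold.mean()
    var_x, var_y = pred.var(unbiased=False), gold.var(unbiased=False)
    cov_xy = ((pred - mu_x) * (gold - mu_y)).mean()
    ccc = 2.0 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)
    return 1.0 - ccc

def total_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """pred, gold: (time, 3) tensors for arousal, valence, and liking, weighted equally."""
    losses = [concordance_loss(pred[:, d], gold[:, d]) for d in range(3)]
    return sum(losses) / 3.0

# Example with random predictions and annotations over 300 time steps.
print(total_loss(torch.randn(300, 3), torch.randn(300, 3)))
```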
5.3. Ablation Study
5.3.1. Comparing Vector Spaces
We test the performance of both the semantic and paralinguistic
networks, trained independently, and trained jointly, to show the
beneficial properties of our proposed framework. Table 2 depicts
the results in terms of ρcon the development set of the SEWA
dataset. We observe that Word2Vec produces slightly better results
than Speech2Vec. However, after aligning their embedding spaces,
the aligned Speech2Vec has higher performance than Word2Vec, in-
dicating both that the refinement process makes the speech embeddings similar to the word ones, and that paralinguistic information exists in
the model. Finally, the paralinguistic network, although it produces
worse results than the aligned Speech2Vec model, provides the best
results for the arousal dimension.
Model                 Arousal   Valence   Liking   Avg
Word2Vec              .434      .513      .208     .385
Speech2Vec            .433      .470      .182     .362
Aligned Speech2Vec    .453      .452      .257     .387
Paralinguistic        .508      .436      .154     .366

Table 2. SEWA dataset results (in terms of ρc) of the Word2Vec, Speech2Vec, aligned Speech2Vec, and paralinguistic models on the development set.
5.3.2. Fusion Strategies
We further explore the effectiveness of the attention fusion strategy compared to the simple concatenation. For our experiments, we utilised both the semantic and paralinguistic deep network models of the proposed method. Table 3 depicts the results in terms of ρc on the development set of the SEWA dataset. We observe that the attention method performs better than concatenation on all emotional dimensions, indicating the effectiveness of our approach of modelling the three emotional dimensions by projecting them to three different spaces before fusing them together with attention.
Fusion strategy          Arousal   Valence   Liking   Avg
Concatenation            .427      .428      .306     .387
Disentangled attention   .499      .497      .311     .435

Table 3. SEWA dataset results (in terms of ρc) of the two fusion strategies (i.e., concatenation and disentangled attention) on the development set.
Method              Arousal        Valence        Liking
Baseline [16]       .225 (.344)    .224 (.351)    -.020 (.081)
Dang et al. [18]    .344 (.494)    .346 (.507)    — (—)
Huang et al. [19]   .583 (.584)    .487 (.585)    — (—)
Chen et al. [17]    .422 (.524)    .405 (.504)    .054 (.273)
Proposed            .429 (.499)    .503 (.497)    .312 (.311)

Table 4. SEWA dataset test results (in terms of ρc) of our proposed fusion model compared with the winning models of AVEC 2017. Performances on the development set are shown in parentheses. A dash is inserted where results could not be obtained.
5.4. Results
We compare our proposed framework with the winning papers of the
AVEC 2017 challenge. As our model utilises only the audio modal-
ity during evaluation, we show, for fairness of comparison, the re-
sults of these studies using the audio information. Table 4 depicts
the results. First, we observe that our approach provides the best results in the valence dimension by a large margin, and the second best in the arousal one. We should note, however, that the network from Huang et al. [19] was pretrained on 300 hours of a spontaneous English speech recognition corpus before being fine-tuned on the SEWA dataset. In addition to the features of the network, they also utilise several hand-engineered features. Second, we observe that our approach provides the highest performance in the likability dimension. Our method is able to generalise on this dimension, in contrast to Chen et al. [17], whose performance drops significantly relative to the development set. Finally, we should
note the high generalisation capability of our approach to model all
three emotional dimensions, indicating the effectiveness of the pro-
posed disentangled attention mechanism strategy.
6. CONCLUSIONS
In this paper, we propose a training framework using audio and
text information for speech emotion recognition. In particular, we
use Word2Vec and Speech2Vec models, and align their embedding
spaces for accurate semantic feature extraction using only the speech
signal. We combine the semantic and paralinguistic features using
a novel attention fusion strategy that first disentangles the informa-
tion per emotional dimension, and then combines it using attention.
The proposed model is evaluated on the SEWA dataset and produces
state-of-the-art results on the valence and liking dimensions, when
compared with the best performing papers submitted to the AVEC
2017 challenge.
In future work, we intend to use a single network to simulta-
neously capture the semantic and the paralinguistic information in
the speech signal. This will simplify the model and, at the same time, reduce its number of parameters. Additionally,
we intend to investigate the performance of the proposed method on
categorical emotion recognition datasets.
7. ACKNOWLEDGEMENTS
The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged.
REFERENCES
[1] R. Picard, Affective Computing, MIT Press, 1997.
[2] C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, “Features
and classifiers for emotion recognition from speech: a survey
from 2000 to 2011,” Artificial Intelligence Review, pp. 155–
177, 2015.
[3] P. Tzirakis, S. Zafeiriou, and B. Schuller, “Chapter 18 - Real-
world automatic continuous affect recognition from audiovi-
sual signals,” in Multimodal Behavior Analysis in the Wild, pp.
387–406. Elsevier, 2019.
[4] D. Kollias, P. Tzirakis, Mihalis A Nicolaou, A. Papaioannou,
G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou, “Deep affect
prediction in-the-wild: Aff-wild database and challenge, deep
architectures, and beyond,” International Journal of Computer
Vision, pp. 907–929, 2019.
[5] B. Schuller, “Speech emotion recognition: two decades in a
nutshell, benchmarks, and ongoing trends,” Communications
of the ACM, pp. 90–99, 2018.
[6] L. Stappen, A. Baird, G. Rizos, P. Tzirakis, X. Du, F. Hafner,
L. Schumann, A. Mallol-Ragolta, B. Schuller, I. Lefter, et al.,
“Muse 2020 challenge and workshop: Multimodal sentiment
analysis, emotion-target engagement and trustworthiness de-
tection in real-life media: Emotional car reviews in-the-wild,” in Proc. ACM International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop, 2020, pp. 35–44.
[7] J. Zhang, Z. Yin, P. Chen, and S. Nichele, “Emotion recogni-
tion using multi-modal data and machine learning techniques:
A tutorial and review,” Information Fusion, pp. 103–126, 2020.
[8] P. Tzirakis, G. Trigeorgis, M. Nicolaou, B. Schuller, and
S. Zafeiriou, “End-to-end multimodal emotion recognition us-
ing deep neural networks,” IEEE Journal of Selected Topics in
Signal Processing, pp. 1301–1309, 2017.
[9] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emo-
tion recognition in speech using cross-modal transfer in the
wild,” in Proc. ACM Multimedia, 2018, pp. 292–301.
[10] P. Tzirakis, S. Zafeiriou, and B. Schuller, “End2You–The Im-
perial Toolkit for Multimodal Profiling by End-to-End Learn-
ing,” arXiv preprint arXiv:1802.01115, 2018.
[11] S. Yoon, S. Byun, and K. Jung, “Multimodal speech emotion recognition using audio and text,” in Proc. IEEE Spoken Language Technology Workshop, 2018, pp. 112–118.
[12] M. V. Mäntylä, D. Graziotin, and M. Kuutila, “The evolution of sentiment analysis—A review of research topics, venues, and top cited papers,” Computer Science Review, pp. 16–32, 2018.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Advances in neural information processing systems (NeurIPS), 2013, pp. 3111–3119.
[14] Y.-A. Chung and J. Glass, “Speech2vec: A sequence-
to-sequence framework for learning word embeddings from
speech,” Proc. Interspeech, pp. 811–815, 2018.
[15] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, “Unsu-
pervised cross-modal alignment of speech and text embedding
spaces,” in Proc. Advances in neural information processing
systems (NeurIPS), 2018, pp. 7354–7364.
[16] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie,
S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pan-
tic, “Avec 2017: Real-life depression, and affect recognition
workshop and challenge,” in Proc. ACM Multimedia Work-
shop, 2017, pp. 3–9.
[17] S. Chen, Q. Jin, J. Zhao, and S. Wang, “Multimodal multi-task learning for dimensional and continuous emotion recognition,” in Proc. ACM Multimedia Workshops, 2017, pp. 19–26.
[18] T. Dang, B. Stasak, Z. Huang, S. Jayawardena, M. Atcheson, M. Hayat, P. Le, V. Sethu, R. Goecke, and J. Epps, “Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in AVEC 2017,” in Proc. ACM Multimedia Workshops, 2017, pp. 27–35.
[19] J. Huang, Y. Li, J. Tao, Z. Lian, Z. Wen, M. Yang, and J. Yi, “Continuous multimodal emotion prediction based on long short term memory recurrent neural network,” in Proc. ACM Multimedia Workshops, 2017, pp. 11–18.
[20] B. Schuller, S. Steidl, A. Batliner, P. Marschik, H. Baumeister, F. Dong, S. Hantke, F. Pokorny, et al., “The INTERSPEECH 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats,” in Proc. Interspeech, 2018, pp. 122–126.
[21] P. Tzirakis, J. Chen, S. Zafeiriou, and B. Schuller, “End-to-end multimodal affect recognition in real-world environments,” Information Fusion, vol. 68, pp. 46–53, 2021.
[22] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, Mihalis A.
Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? end-
to-end speech emotion recognition using a deep convolutional
recurrent network,” in Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2016, pp.
5200–5204.
[23] P. Tzirakis, J. Zhang, and B. Schuller, “End-to-end speech
emotion recognition using deep neural networks,” in Proc.
IEEE International Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP), 2018, pp. 5089–5093.
[24] L. Tarantino, P. Garner, and A. Lazaridis, “Self-attention for
speech emotion recognition,” Proc. Interspeech 2019, pp.
2578–2582, 2019.
[25] M. Neumann and N. T. Vu, “Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech,” Proc. Interspeech 2017, pp. 1263–1267, 2017.
[26] J. Han, Z. Zhang, Z. Ren, and B. Schuller, “Implicit fusion by joint audiovisual training for emotion recognition in mono modality,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5861–5865.
[27] D. Kingma and J. Ba, “Adam: A method for stochastic opti-
mization,” arXiv preprint arXiv:1412.6980, 2014.
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural
networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthi-ness detection by means of more comprehensively integrating the audiovisual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audiovisual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild , which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic , in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust , in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CaR , the first of its kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34· UAR + 0.66·F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.
Article
Full-text available
Automatic understanding of human affect using visual signals is of great importance in everyday human–machine interactions. Appraising human emotional states, behaviors and reactions displayed in real-world settings, can be accomplished using latent continuous dimensions (e.g., the circumplex model of affect). Valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion) constitute popular and effective representations for affect. Nevertheless, the majority of collected datasets this far, although containing naturalistic emotional states, have been captured in highly controlled recording conditions. In this paper, we introduce the Aff-Wild benchmark for training and evaluating affect recognition algorithms. We also report on the results of the First Affect-in-the-wild Challenge (Aff-Wild Challenge) that was recently organized in conjunction with CVPR 2017 on the Aff-Wild database, and was the first ever challenge on the estimation of valence and arousal in-the-wild. Furthermore, we design and extensively train an end-to-end deep neural architecture which performs prediction of continuous emotion dimensions based on visual cues. The proposed deep learning architecture, AffWildNet, includes convolutional and recurrent neural network layers, exploiting the invariant properties of convolutional features, while also modeling temporal dynamics that arise in human behavior via the recurrent layers. The AffWildNet produced state-of-the-art results on the Aff-Wild Challenge. We then exploit the AffWild database for learning features, which can be used as priors for achieving best performances both for dimensional, as well as categorical emotion recognition, using the RECOLA, AFEW-VA and EmotiW 2017 datasets, compared to all other methods designed for the same goal. The database and emotion recognition models are available at http://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge.
Conference Paper
Full-text available
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Conference Paper
Full-text available
The INTERSPEECH 2018 Computational Paralinguistics Challenge addresses four different problems for the first time ina research competition under well-defined conditions: In the Atypical Affect Sub-Challenge, four basic emotions annotatedin the speech of handicapped subjects have to be classified; in the Self-Assessed Affect Sub-Challenge, valence scores given by the speakers themselves are used for a three-class classification problem; in the Crying Sub-Challenge, three types of infant vocalisations have to be told apart; and in the Heart Beats Sub-Challenge, three different types of heart beats have to be determined. We describe the Sub Challenges, their conditions, and baseline feature extraction and classifiers, which include data-learnt (supervised) feature representations by end-to-end learning, the 'usual’ ComParE and BoAW features, and deep unsupervised representation learning using the auDeep toolkit for the first time in the challenge series.
Article
Automatic affect recognition in real-world environments is an important task towards a natural interaction between humans and machines. The recent years, several advancements have been accomplished in determining the emotional states with the use of Deep Neural Networks (DNNs). In this paper, we propose an emotion recognition system that utilizes the raw text, audio and visual information in an end-to-end manner. To capture the emotional states of a person, robust features need to be extracted from the various modalities. To this end, we utilize Convolutional Neural Networks (CNNs) and propose a novel transformer-based architecture for the text modality that can robustly capture the semantics of sentences. We develop an audio model to process the audio channel, and adopt a variation of a high resolution network (HRNet) to process the visual modality. To fuse the modality-specific features, we propose novel attention-based methods. To capture the temporal dynamics in the signal, we utilize Long Short-Term Memory (LSTM) networks. Our model is trained on the SEWA dataset of the AVEC 2017 research sub-challenge on emotion recognition, and produces state-of-the-art results in the text, visual and multimodal domains, and comparable performance in the audio case when compared with the winning papers of the challenge that use several hand-crafted and DNN features. Code is available at: https://github.com/glam-imperial/multimodal-affect-recognition.
Article
In recent years, the rapid advances in machine learning (ML) and information fusion has made it possible to endow machines/computers with the ability of emotion understanding, recognition, and analysis. Emotion recognition has attracted increasingly intense interest from researchers from diverse fields. Human emotions can be recognized from facial expressions, speech, behavior (gesture/posture) or physiological signals. However, the first three methods can be ineffective since humans may involuntarily or deliberately conceal their real emotions (so-called social masking). The use of physiological signals can lead to more objective and reliable emotion recognition. Compared with peripheral neurophysiological signals, electroencephalogram (EEG) signals respond to fluctuations of affective states more sensitively and in real time and thus can provide useful features of emotional states. Therefore, various EEG-based emotion recognition techniques have been developed recently. In this paper, the emotion recognition methods based on multi-channel EEG signals as well as multi-modal physiological signals are reviewed. According to the standard pipeline for emotion recognition, we review different feature extraction (e.g., wavelet transform and nonlinear dynamics), feature reduction, and ML classifier design methods (e.g., k-nearest neighbor (KNN), naive Bayesian (NB), support vector machine (SVM) and random forest (RF)). Furthermore, the EEG rhythms that are highly correlated with emotions are analyzed and the correlation between different brain areas and emotions is discussed. Finally, we compare different ML and deep learning algorithms for emotion recognition and suggest several open problems and future research directions in this exciting and fast-growing area of AI.
Conference Paper
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.