Speech Emotion Recognition using Time Distributed CNN and LSTM
Beenaa Salian1,*, Omkar Narvade1,**, Rujuta Tambewagh1,***, and Smita Bharne1,****
1Ramrao Adik Institute of Technology, Navi Mumbai, India
*e-mail: beenaasalian09@gmail.com
**e-mail: nomkar99@gmail.com
***e-mail: rujuta.tambewagh@gmail.com
****e-mail: smita.bharne@rait.ac.in
Abstract. Speech has several distinguishing characteristic features, which have made it a state-of-the-art tool for extracting valuable information from audio samples. Our aim is to develop an emotion recognition system using these speech features that can accurately and efficiently recognize emotions through audio analysis. In this article, we employ a hybrid neural network comprising four blocks of time distributed convolutional layers followed by a Long Short Term Memory layer to achieve this. The audio samples for the speech dataset are collectively assembled from the RAVDESS, TESS and SAVEE audio datasets and are further augmented by injecting noise. Mel spectrograms are computed from the audio samples and used to train the neural network. We have achieved a testing accuracy of about 89.26%.
1 Introduction
As humans, our thoughts are best articulated using speech. Therefore, in this increasingly technology-driven world, the next step forward would be extending this understanding to machines. Although Speech Emotion Recognition (SER) has been around for almost a decade, it has regained attention due to recent developments in this field (e.g. voice-based virtual assistants such as Siri and Alexa, automated help-center assistance, self-driving cars, etc.). Although there is a rising demand for voice-controlled technologies, the recognition of emotion from speech remains the main challenge in human-machine interaction. Despite significant progress in speech recognition, we are still a long way from determining underlying emotions from the speaker's audio signals, since the machine does not understand the speaker's emotional state. Because speech is the simplest and most effective means of communication for humans, a computer must be able to grasp the user's mood in this increasingly technology-driven world. We look at which aspects of speech are the most effective in discriminating various emotions. An improved speech emotion identification model must be able to recognise seven primary emotions: anger, disgust, happiness, sorrow, neutral, surprise, and fear. We must develop a robust deep learning model which can accurately and efficiently classify emotions from speech alone.
There are various hybrid model implementations in the SER domain. The most commonly used hybrid model is the CNN-LSTM model, where the CNN learns local correlations and the LSTM learns long-term dependencies from the learned local features. In [1], 13 MFCCs (Mel Frequency Cepstral Coefficients) along with their 13 velocity and 13 acceleration components are used as features, and a 1D CNN and LSTM model is used for classification; the EMODB dataset is used. To compute MFCCs, the Discrete Cosine Transform (DCT) is applied. The DCT is needed to decorrelate the filter bank coefficients, but it is a linear operation and therefore discards any useful non-linear information present in the speech signal. In [4], three different features (MFCCs, magnitude spectrograms, and log-mel spectrograms) are compared across several architectures, such as CNN, BLSTM, and CNN-LSTM, to determine which architecture and feature combination is best for speech emotion recognition. All of the models were tested on two different datasets, EMODB and IEMOCAP, where four emotions are classified and the length of the audio files is kept at 3 seconds. In that article, log-mel spectrograms combined with the CNN+LSTM architecture are shown to perform most effectively. The aim of [3] is to find the relation between the duration of the speech segment and the emotion recognition rate. The analysis is performed using a CNN model with two convolution layers, where magnitude spectrograms are taken as features, and system performance is analyzed using speech sequences of lengths from 0.25 s to 1.5 s. It is observed that as the length of the speech signal increases, the accuracy of the system increases.
2 Proposed Methodology
2.1 Speech Corpus
The datasets used here cover people of different ages, ranging from young to old, both genders, and a variety of accents.
2.1.1 Toronto emotional speech set (TESS)
200 special words are spoken using the carrier phrase "Say the word -" by two actresses, aged 26 and 64 years. The audio is recorded with each actress portraying one of the seven emotions.
2.1.2 Ryerson Audio-Visual Database of Emotional
Speech and Song (RAVDESS)
RAVDESS consists of gender-balanced audio samples from 12 female and 12 male actors. The audio samples cover seven different emotions, each expressed at two levels of intensity, high and low.
2.1.3 Surrey Audio-Visual Expressed Emotion (SAVEE)
The audio samples in this dataset are recorded by four native English speakers, identified as DC, JE, JK, and KL. This results in a total of 120 utterances per speaker, covering the 7 emotions.
2.1.4 Augmentation
The prediction accuracy of any deep learning model is largely dependent on the amount and the diversity of data available during training. A common method to increase the diversity of a dataset is to augment the data artificially. To generate synthetic data for audio, we can apply noise injection. We have used the numpy library to add noise to our existing dataset. Table 1 shows the final dataset after augmentation, and a sketch of the noise-injection step follows the table.
Table 1. Final Dataset after Augmentation

       Code      0      1      2      3      4      5      6   Total
       Org     652    652    655    652    652    652    808    4723
       Aug     652    652    655    652    652    652    808    4723
       Total  1304   1304   1310   1304   1304   1304   1616    9446
Emotion codes: 0: Happy, 1: Sad, 2: Angry, 3: Disgust, 4: Fear, 5: Surprise, 6: Neutral
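As an illustration of the noise-injection step described above, the following is a minimal numpy sketch; the noise level and the function name are assumptions, not values reported by the authors.

```python
import numpy as np

def inject_noise(signal, noise_factor=0.005, seed=None):
    """Augment an audio signal by adding white Gaussian noise."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 1.0, size=signal.shape)
    # Scale the noise relative to the signal's amplitude range
    return signal + noise_factor * np.max(np.abs(signal)) * noise
```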
2.2 Feature Extraction
There is a large amount of varied information present in speech signals, most of which does not aid our objective of recognizing emotions from audio files. Hence, we extract the relevant information from these speech signals and provide it as the feature vector input to our classifier. We use Mel spectrograms as the extracted feature, which is then used to train our classifier. The steps for feature extraction can be seen in Figure 1.
Figure 1. Feature Extraction Steps
2.2.1 Preprocessing
Every audio sample in our dataset is first sampled at 21,500 Hz with an offset of 0.5 seconds, and according to our calculations, the average length of the audio samples in our dataset is around 3 seconds. Further, z-score standardization (based on the standard normal distribution) is performed on the samples, and then we adjust these audio files to be approximately 3 seconds long, i.e. 650,000 samples per audio. The adjustment is made by truncating a file if its length is greater than the average, or by padding it with zeroes if it is shorter.
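A minimal sketch of this preprocessing step, assuming librosa and numpy are used; the target length is taken here simply as 3 seconds times the sampling rate, and the function name is a placeholder.

```python
import numpy as np
import librosa

SAMPLE_RATE = 21500            # sampling rate used above
OFFSET = 0.5                   # seconds skipped at the start of each file
TARGET_LEN = 3 * SAMPLE_RATE   # fix every sample to roughly 3 seconds

def preprocess(path):
    # Load the file at the chosen sampling rate, skipping the first 0.5 s
    signal, _ = librosa.load(path, sr=SAMPLE_RATE, offset=OFFSET)
    # Z-score standardization (zero mean, unit variance)
    signal = (signal - signal.mean()) / (signal.std() + 1e-9)
    # Truncate or zero-pad so every audio file has the same length
    if len(signal) > TARGET_LEN:
        return signal[:TARGET_LEN]
    return np.pad(signal, (0, TARGET_LEN - len(signal)))
```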
2.2.2 Time Window Framing
We assume that the properties of a non-stationary speech signal remain constant over a very short period of time. These short time periods are termed 'frames'; they are long enough to contain vital characteristics yet short enough to be considered stationary. Here, the window size is 23 ms with an overlap of 50%. Framing begins with the first N = 256 samples; the second frame starts after a hop of M = 128 samples and overlaps the first frame by N - M samples, and this is repeated for the entire signal. The overlap smoothens the transitions between the frames.
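A short numpy sketch of this framing step, using the N = 256 and M = 128 values given above; the function name is a placeholder.

```python
import numpy as np

def frame_signal(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames of N samples with hop M."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])   # shape: (n_frames, frame_len)
```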
2.2.3 Applying Hamming Window
The signal is further passed through a Hamming window to smooth the signal and to make sure that the ends of neighbouring windows match. It also leads to better signal clarity and helps in reducing spectral leakage. The window function is given by (1):

w(k) = 0.54 - 0.46 cos(2πk / (N - 1))    (1)

where w(k) is the window function and 0 ≤ k ≤ N - 1.
2.2.4 Fast Fourier Transform
The Fast Fourier Transform takes a sequence of discrete signal amplitudes as input and converts it into its frequency constituents, as given by (2). We perform an N-point FFT on every frame in the signal to calculate the overall frequency spectrum; applied frame by frame, this is also termed the Short-Time Fourier Transform (STFT).

X_k = Σ_{n=0}^{N-1} x_n e^{-i2πkn/N},   k = 0, 1, 2, ..., N-1,   N = 512    (2)
2.2.5 Periodogram Estimate
To compute the periodogram estimate, we square the absolute value of the result of the complex Fourier transform. This estimate helps us identify which frequencies are present in every frame extracted from the audio sample. The speech frame's periodogram-based power spectral estimate is given by

P_i(k) = (1/N) |S_i(k)|^2    (3)
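A numpy sketch tying Eqs. (1) to (3) together; it assumes the frames produced by the framing sketch above.

```python
import numpy as np

N_FFT = 512   # N-point FFT, as in Eq. (2)

def power_spectrum(frames, n_fft=N_FFT):
    """Hamming-window each frame, take an N-point FFT and return the
    periodogram-based power spectral estimate of Eq. (3)."""
    window = np.hamming(frames.shape[1])               # Eq. (1)
    spectrum = np.fft.rfft(frames * window, n=n_fft)   # Eq. (2), real-input FFT
    return (np.abs(spectrum) ** 2) / n_fft             # Eq. (3)
```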
2.2.6 Applying Mel-Scale Filterbanks
Figure 2. Mel Filterbank
The Mel scale is a non-linear transformation based on how the human auditory system perceives an audio sample. To compute the Mel spectrogram, we apply the mel-spaced filterbank seen in Figure 2, which is a set of overlapping triangular filters. The start of one filter overlaps with the centre of the previous one, while its end overlaps with the centre of the succeeding one, and so on. Each filter has a response of 1 at its centre, decreasing towards the ends, where it is 0. The filters are closely spaced and narrow at lower frequencies; as the frequency increases, the filters get wider and become less discriminative, i.e. less sensitive to variations in frequency. The Mel scale tells us how to space the filterbanks and gives an estimate of how wide to make them as the frequency increases. To convert f hertz to m mels, we apply

m = 2595 × log10(1 + f/700)    (4)
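A sketch of the full feature-extraction chain using librosa's built-in Mel filterbanks; the number of Mel bands and the decibel conversion are assumptions, since they are not stated above.

```python
import librosa

def mel_spectrogram(signal, sr=21500, n_fft=512, hop=128, n_mels=128):
    """Apply Mel-spaced triangular filterbanks to the framed, windowed
    power spectrum and return a log-scaled Mel spectrogram."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels,
                                         window="hamming")
    return librosa.power_to_db(mel)   # log scale, common before a CNN
```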
3 Classifier
As seen in Figure 3, a hybrid neural network model consisting of a Time Distributed CNN followed by an LSTM layer has been proposed. CNNs have proven to be a breakthrough in image recognition and various other computer vision tasks, while Long Short Term Memory networks have proven to be very useful for analyzing sequential data. Using the two in succession therefore helps the model learn both short-term and long-term feature dependencies, taking advantage of the strengths of both networks.
Figure 3. Classifier Model
Figure 4. Learning Module
3.1 Time Distributed Convolution Layers
The main idea of the time distributed convolutional layers is to apply a rolling window over our input feature, i.e. the Mel spectrogram. This yields a sequence of images, and this sequence is provided as input to the first layer of our neural network. As seen in Figure 4, this part of the model is subdivided into four Learning Modules (LM), each of which consists of a time distributed convolutional layer, batch normalization, an activation function, a max pooling layer, and a dropout layer. The Exponential Linear Unit (ELU) is used as the activation function in these four LMs, followed by a max pooling layer, which reduces the size of the feature maps and thereby further minimises the number of trainable parameters. The dropout regularization helps avoid overfitting by randomly dropping out a few neurons from the layer. The rolling-window step is sketched below.
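A sketch of that rolling-window step, assuming the Mel spectrogram is a 2-D array (Mel bands × time frames); the window and hop sizes are assumptions chosen only to illustrate how a sequence of overlapping images is produced.

```python
import numpy as np

def rolling_windows(mel_spec, win=128, hop=64):
    """Cut a Mel spectrogram (n_mels x n_frames) into a sequence of
    overlapping 'images' along the time axis."""
    n_mels, n_frames = mel_spec.shape
    starts = range(0, n_frames - win + 1, hop)
    seq = np.stack([mel_spec[:, s:s + win] for s in starts])
    # Add a channel axis so each window looks like a grayscale image
    return seq[..., np.newaxis]   # shape: (n_windows, n_mels, win, 1)
```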
3.2 Long Short Term Memory
Speech signals are time-varying, and spectrograms also have an explicit time axis, so it is well worth exploring these temporal properties of speech audio. Long Short Term Memory has proven to be very useful for analyzing sequential data, and it can therefore help identify and extract the global temporal features from the Mel spectrogram. An LSTM layer with 256 nodes is added as a learning layer in the model, followed by a dense fully connected layer with 'softmax' as its activation function.
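A condensed Keras sketch of this classifier, following the layer widths, kernel size, and optimizer settings reported in Sections 3 and 4; the input shape, dropout rate, and pooling size are assumptions.

```python
from tensorflow.keras import layers, models, optimizers

def learning_module(x, filters):
    """One LM: TimeDistributed Conv -> BatchNorm -> ELU -> MaxPool -> Dropout."""
    x = layers.TimeDistributed(layers.Conv2D(filters, (3, 3), padding="same"))(x)
    x = layers.TimeDistributed(layers.BatchNormalization())(x)
    x = layers.TimeDistributed(layers.Activation("elu"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)
    x = layers.TimeDistributed(layers.Dropout(0.2))(x)   # dropout rate assumed
    return x

def build_model(input_shape=(6, 128, 128, 1), n_classes=7):
    inputs = layers.Input(shape=input_shape)   # sequence of 6 spectrogram windows
    x = inputs
    for filters in (64, 64, 128, 128):         # first two LMs 64 maps, last two 128
        x = learning_module(x, filters)
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.LSTM(256)(x)                    # global temporal features
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```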
4 Experiments
The compiled dataset is split into train and test sets, where 80% of the total is used for training the model and the remaining 20% for evaluating the model's performance. Hence, we train with 7557 audio samples and test on 1889 audio samples. After applying a rolling window on the Mel spectrogram, a sequence of 6 overlapping images is generated for each audio file, and this is provided as input to the first Learning Module (LM). As seen in Figure 3, there are four LMs in our model, and each one consists of a time distributed convolutional layer, a batch normalization layer, an activation function layer, a dropout layer, and lastly the max pooling layer. The convolutional layers have a kernel size of 3 × 3. The layers in the first two blocks have 64 feature maps, and the subsequent two have 128 feature maps. The activation function used in all four blocks is the Exponential Linear Unit, as this function tends to converge faster and provide better accuracy. After the four LM blocks, the resultant output is flattened and provided as input to an LSTM layer, followed by a fully connected layer with a softmax activation function, which produces an output that maps to the predicted emotion for every audio sample. The optimizer for this model is Stochastic Gradient Descent (SGD) with a learning rate of 0.01. We applied early stopping with a patience of 15 epochs, monitoring the validation accuracy.
5 Results
The model is trained for 100 epochs and compiled with categorical cross-entropy as the loss function. Using 'ModelCheckpoint', we monitored and saved the weights that give the maximum validation accuracy. The best model saved this way provides an accuracy of 90.64% on the training set and 89.26% on the testing set. A sketch of this training setup is given below.
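Continuing the Keras sketch above, the training setup could look as follows; the batch size, checkpoint filename, and the x_train/y_train arrays (rolling-window sequences and one-hot labels) are assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model = build_model()

callbacks = [
    # Stop if validation accuracy does not improve for 15 consecutive epochs
    EarlyStopping(monitor="val_accuracy", patience=15, mode="max"),
    # Keep only the weights that achieve the best validation accuracy
    ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                    save_best_only=True, mode="max"),
]

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=100, batch_size=64,   # batch size assumed
                    callbacks=callbacks)
```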
5.1 Performance Metrics
Table 2. Performance of the model

       Emotion     Precision (%)   Recall (%)   F1-score (%)
       Neutral          92             90            91
       Happy            82             87            84
       Sad              92             81            86
       Angry            90             92            91
       Fear             91             92            91
       Disgust          84             96            90
       Surprise         95             87            91
Table 2 summarises our model's performance in terms of various metrics. Precision measures the ability of our model to identify the correctly predicted positives out of all predicted positives and is given by Eq. (5). Our model has the highest precision of 95% for the surprise emotion and the lowest precision of 82% for the happy emotion.

Precision = TruePositive / (TruePositive + FalsePositive)    (5)
Recall measures the model's ability to identify the correct positives out of all the existing positives in the test dataset and is given by Eq. (6). We can see from Table 2 that our model has the best recall score of 96% for the disgust emotion and the worst recall score of 81% for the sad emotion.

Recall = TruePositive / (TruePositive + FalseNegative)    (6)
The F1 score is the harmonic mean of precision and recall and is used to measure an emotion's overall performance; it is given by Eq. (7). The best F1 score, 91%, is obtained for the surprise, neutral, angry, and fear emotions, and the worst F1 score, 84%, for the happy emotion.

F1-score = 2 × (Recall × Precision) / (Recall + Precision)    (7)
Model accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. Our model has an accuracy of 90.64% on the training set and 89.26% on the validation set.
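The per-emotion scores in Table 2 can be computed from the model predictions with scikit-learn; a brief sketch follows (the variables continue the training sketch above, and the label order follows the emotion codes of Table 1).

```python
import numpy as np
from sklearn.metrics import classification_report

y_true = np.argmax(y_test, axis=1)                  # one-hot labels -> class ids
y_pred = np.argmax(model.predict(x_test), axis=1)   # softmax outputs -> class ids

labels = ["Happy", "Sad", "Angry", "Disgust", "Fear", "Surprise", "Neutral"]
print(classification_report(y_true, y_pred, target_names=labels))
```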
Table 3. Comparison with Existing Systems

       Parameters             Proposed Model                 System [4] pt. 1
       Architecture           Time Distributed CNN + LSTM    CNN
       Dataset                RAVDESS + SAVEE + TESS         EmoDB
       Features Used          Log Mel-Scale Filterbanks      Mel Spectrogram
       Dataset Distribution   7720 and 1450                  271 and 68
       Accuracy               89.26%                         78.16%

       Parameters             Proposed Model                 System [4] pt. 2
       Architecture           Time Distributed CNN + LSTM    Bi-LSTM
       Dataset                RAVDESS + SAVEE + TESS         IEMOCAP
       Features Used          Log Mel-Scale Filterbanks      Mel Frequency Cepstral Coefficient
       Dataset Distribution   7720 and 1450                  4424 and 1107
       Accuracy               89.26%                         46.21%
In the tables above, we have compared our system to two pre-existing systems presented in [4], based on various parameters.
5.2 Loss and Accuracy Curves
A learning curve is a diagnostic tool that shows the performance of the model over time. We have plotted the loss and accuracy curves for the training of our model. Figure 5 shows a steep decline in training loss for the first 20 epochs, followed by a steady decline up to the 100th epoch, and Figure 6 shows a sharp increase in accuracy for the first 20 epochs, followed by a steady increase up to the 100th epoch.
Figure 5. Loss Curves
Figure 6. Accuracy Curves
5.3 Confusion Matrix
The confusion matrix helps us evaluate the performance of our neural network on the test data and analyse how accurate our recognition model is with respect to specific emotions (a short sketch for computing such a matrix is given after Figure 7). From the confusion matrix seen in Figure 7, we can conclude that:
1. The model performs exceptionally well while predicting the fear and disgust emotions, which score 251/274 and 249/259 respectively.
2. On the other hand, emotions like happy and sad score 220/254 and 208/258, so there is scope for improvement for these emotions.
3. The model often predicts the sad emotion as the neutral emotion. By addressing this problem, we could improve the performance of the system to a great extent.
Figure 7. Confusion Matrix
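A brief scikit-learn sketch for computing and plotting a confusion matrix like the one in Figure 7, continuing the variables from the metrics sketch above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.show()
```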
6 Conclusion
This research presents a hybrid neural network strategy for detecting underlying emotions in audio samples. In terms of performance, a Time Distributed CNN + LSTM model trained on a large gender-balanced dataset of speakers with various accents and nationalities outperforms other models. Neutral, angry, fear, disgust, and surprise all had a testing accuracy of 90% or higher. The model's performance could be improved even further through targeted training, by focusing on emotions such as happy and sad, which have accuracy rates of around 84% and 86%, respectively.
References
[1] S. Basu, J. Chakraborty and M. Aftabuddin, "Emotion recognition from speech using convolutional neural network with recurrent neural network architecture," 2017 2nd International Conference on Communication and Electronics Systems (ICCES), 2017, pp. 333-336, doi: 10.1109/CESYS.2017.8321292.
[2] J. Umamaheswari and A. Akila, "An Enhanced Human Speech Emotion Recognition Using Hybrid of PRNN and KNN," 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), 2019, pp. 177-183, doi: 10.1109/COMITCon.2019.8862221.
[3] B. Puterka and J. Kacur, "Time Window Analysis for Automatic Speech Emotion Recognition," 2018 International Symposium ELMAR, Zadar, 2018, pp. 143-146, doi: 10.23919/ELMAR.2018.85.
[4] S. K. Pandey, H. S. Shekhawat and S. R. M. Prasanna, "Deep Learning Techniques for Speech Emotion Recognition: A Review," 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA), 2019, pp. 1-6, doi: 10.1109/RADIOELEK.2019.8733432.
[5] A. B. Abdul Qayyum, A. Arefeen and C. Shahnaz, "Convolutional Neural Network (CNN) Based Speech-Emotion Recognition," 2019 IEEE International Conference on Signal Processing, Information, Communication Systems (SPICSCON), 2019, pp. 122-125, doi: 10.1109/SPICSCON48833.2019.9065172.
[6] M. S. Likitha, S. R. R. Gupta, K. Hasitha and A. U. Raju, "Speech based human emotion recognition using MFCC," 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), 2017, pp. 2257-2260, doi: 10.1109/WiSPNET.2017.8300161.
[7] B. Puterka, J. Kacur and J. Pavlovicova, "Windowing for Speech Emotion Recognition," 2019 International Symposium ELMAR, 2019, pp. 147-150, doi: 10.1109/ELMAR.2019.8918885.
[8] M. K. Pichora-Fuller and K. Dupuis, "Toronto emotional speech set (TESS)," 2010.
[9] S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLoS ONE 13(5): e0196391, 2018, doi: 10.1371/journal.pone.0196391.
[10] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, "Combining frame and turn-level information for robust recognition of emotions within speech," Proceedings of Interspeech, pp. 2249-2252, 2007.