Fig 3. Examples of the eight RAVDESS emotions. Still frame examples of the eight emotions contained in the RAVDESS, in speech and song. https://doi.org/10.1371/journal.pone.0196391.g003

Source publication
Article
Full-text available
The RAVDESS is a validated multimodal database of emotional speech and song. The database is gender balanced, consisting of 24 professional actors vocalizing lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry,...

Citations

... Because IEMOCAP is a relatively small dataset (7 h long), a model trained on it may be susceptible to overfitting. Unseen test datasets (RAVDESS [29] and EmoDB [30]) were applied to the models trained with different datasets (IEMOCAP, ESD [31], and IEMOCAP + ESD) to assess this (Tables 8 and 9). The models trained with a single dataset (IEMOCAP or ESD) are susceptible to overfitting, while the model trained with the combined dataset (IEMOCAP + ESD) is resilient to it. ...
... ESD is a 29-h dataset in English and Mandarin with five emotions [31]. RAVDESS is an 89-min dataset in English with eight emotions [29]. EmoDB is a 24-min dataset in German with seven emotions [30]. ...
... All the SER datasets [25,29–31] used in this work are recorded by actors, not by ordinary people. Because emotions from actors may be over-emphasized compared to the emotions ordinary people express in real life, models trained on acted datasets might suffer from accuracy degradation in real-life applications. ...
Article
Full-text available
A speech emotion recognition (SER) model for noisy environments is proposed that uses four band-pass filtered speech waveforms as the model input instead of simplified input features such as MFCCs (Mel Frequency Cepstral Coefficients). The four waveforms retain the entire information of the original noisy speech, while the simplified features keep only partial information. This information reduction at the model input may cause accuracy degradation in noisy environments. A normalized loss function is used during training to preserve the high-frequency details of the original noisy speech waveform. A multi-decoder Wave-U-Net model performs the denoising operation, and the Wave-U-Net output waveform is fed to an emotion classifier; in this way, the number of parameters is reduced from 4.2 M used in training to 2.8 M for inference. The Wave-U-Net model consists of an encoder, a 2-layer LSTM, six decoders, and skip-nets; of the six decoders, four denoise the four band-pass filtered waveforms, one denoises the pitch-related waveform, and one generates the emotion classifier input waveform. This work shows much less accuracy degradation than other SER works in noisy environments; compared to the accuracy on clean speech, the accuracy degradation is 3.8% at 0 dB SNR in this work, while it is larger than 15% in other SER works. The accuracy degradation of this work at SNRs of 0 dB, −3 dB, and −6 dB is 3.8%, 5.2%, and 7.2%, respectively.
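The four-band input described in this abstract can be pictured with a short filtering sketch. The paper's actual filter design and band edges are not given in the excerpt above, so the Butterworth filters and cutoff frequencies below are assumptions; this is only a minimal illustration of splitting a waveform into four band-passed copies, not the authors' implementation.

```python
# Minimal sketch: split a speech waveform into four band-passed copies.
# Band edges and filter order are assumptions; the cited paper's actual
# filter design is not specified in the excerpt above.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def split_into_bands(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return four band-pass filtered versions of `waveform`."""
    # Hypothetical band edges in Hz (must stay below sr/2).
    bands = [(50, 500), (500, 1500), (1500, 3000), (3000, 7000)]
    filtered = []
    for low, high in bands:
        sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
        filtered.append(sosfiltfilt(sos, waveform))
    return np.stack(filtered)  # shape: (4, num_samples)

# Example: build a four-band input for a denoising/emotion model.
noisy = np.random.randn(16000)          # 1 s of placeholder audio
band_inputs = split_into_bands(noisy)
print(band_inputs.shape)                # (4, 16000)
```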
... It acquires the local information over a short period and spectral band to characterize the impact of emotion on speech [39–42]. The EMODB [43] and RAVDESS [44] datasets are widely used for SER because of their distinctiveness, public availability, easy access, and availability of male/female voice samples. Neural networks have shown high efficiency and precision and can be effectively utilized for signal processing applications. ...
... It includes audio recordings from 12 male and 12 female actors at a 48 kHz sampling rate, covering eight emotions: calm, sad, angry, happy, fear, neutral, disgust, and surprise [44]. ...
Article
Full-text available
Affective computing is crucial in various Human–Computer Interaction (HCI) and multimedia systems for comprehensive emotional assessment and response. Existing Speech Emotion Recognition (SER) systems provide limited performance due to inadequate frequency- and time-domain representation, poor correlation between global and local features, and contextual dependencies of components. Traditional SER techniques often result in poor accuracy due to spectral leakage, low-frequency resolution problems, and poor depiction of emotional speech's pitch, intonation, and voice timbre. This paper presents a novel two-way feature extraction (TWFR) based SER system using a 2D-CNN and a 1D-CNN to improve the distinctiveness of emotional speech. The first set of features, a 2-D representation of the wavelet packet decomposition (WPD) coefficients, is given to a 2-D Deep Convolutional Neural Network (DCNN). The second set comprises various time-domain, spectral, and voice-quality features given to a 1D-DCNN. The features from the last layers of the 2D-DCNN and 1D-DCNN are concatenated and provided to a fully connected layer, followed by a softmax classifier for SER. The results of the TWFR-based SER scheme are assessed on the EMODB and RAVDESS datasets using recall, precision, accuracy, and F1-score. The proposed TWFR-based SER shows an overall accuracy of 98.48% on EMODB and 98.71% on RAVDESS. It improves the depiction of the speech's pitch, intonation, and voice timbre in the spectral and time domains and outperforms the current state of the art.
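As a rough illustration of the first feature branch, the sketch below forms a 2-D array from wavelet packet decomposition coefficients, the kind of input a 2-D DCNN could consume. The wavelet family ('db4'), decomposition level (4), and frame length are assumptions, not values taken from the cited paper.

```python
# Sketch: form a 2-D representation from wavelet packet decomposition (WPD)
# coefficients as one possible input to a 2-D CNN branch. Wavelet family
# and level are assumptions, not taken from the cited work.
import numpy as np
import pywt

def wpd_image(signal: np.ndarray, wavelet: str = "db4", level: int = 4) -> np.ndarray:
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, mode="symmetric", maxlevel=level)
    leaves = wp.get_level(level, order="natural")   # 2**level leaf nodes
    rows = [node.data for node in leaves]            # equal-length coefficient rows
    return np.stack(rows)                            # shape: (2**level, n_coeffs)

frame = np.random.randn(4096)                        # placeholder speech frame
print(wpd_image(frame).shape)                        # (16, number_of_leaf_coefficients)
```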
... Lately, researchers have widely utilized the RAVDESS dataset to achieve their goals [78]. The dataset comprises audio and video clips of 24 actors. ...
Article
Full-text available
Analyzing emotions extracted from diverse media sources like video and audio presents a significant challenge in identifying human mental health indicators. This is particularly crucial in studying the mental well-being of Transgender communities, who often encounter difficulty in accessing emergency services due to discrimination. This study employed Speech Emotion Recognition (SER) for identifying emotions in audio signals and a dynamic Facial Expression Recognition (FER) mechanism for analyzing emotions in video frames. The SER model utilizes deep learning techniques and Explainable AI (XAI) to optimize the audio feature vector, achieving remarkable accuracy on the RAVDESS dataset. Benchmark metrics employed to evaluate the performance of the SER model show that it surpasses previously employed Machine Learning (ML) models. Similarly, the FER model (VGG-Face), based on transfer learning, proved efficient at analyzing YouTube video frames while maintaining high accuracy. Both the SER and FER models are then applied to assess the mental health of transgender individuals. To consolidate our findings, we adopt a late-fusion-based approach to combine classified emotions, employing various voting methods such as primary, Condorcet, non-Condorcet Preferential, and valence-arousal. The SER method reveals that anger, sadness, and disgust are dominant emotions, while the FER method indicates an equal presence of negativity and neutrality. Our consolidated results based on the late-fusion approaches highlighted the dominance of the anger emotion. In essence, this study offers valuable insights into the mental well-being of marginalized communities, providing a foundation for the development of tailored programs to support their mental health needs.
... However, the use of self-supervised learning (SSL) speech models (e.g., HuBERT [8] and Wav2Vec2 [9]) has emerged as the state-of-the-art approach in SER, producing the best performance on the IEMOCAP dataset [9]. Another very important acted dataset is RAVDESS [14]. RAVDESS has also been used to benchmark many different SER systems [15,16]. ...
... In an effort to address this challenge, [16] showed promising results by combining four of the most famous SER datasets (RAVDESS [14], TESS [19], CREMA-D [17], and IEMOCAP [11]). In this work, in addition to these four datasets, we added seven more (MELD [35], ASVP-ESD [23], EmoV-DB [24], EmoFilm [25], SAVEE [18], JL-Corpus [26], and ESD [27]) for a total of eleven SER datasets. ...
... SAVEE [18]: British-English corpus with 480 utterances from four speakers, spanning seven emotions in 30 minutes. RAVDESS [14]: Focuses on the speech part with 1440 utterances from 24 actors, covering eight emotions over 1.5 hours. CREMA-D [17]: Audio-visual dataset with 7442 stimuli from 91 actors, spanning six emotions over 5.3 hours. ...
Preprint
Full-text available
Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements on specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and present a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols for assessing the generalization of SER. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.
... The IEMOCAP dataset [25] includes 10,039 clips from 10 actors (5 males, 5 females) in both scripted and spontaneous conversations, annotated with 4 emotion classes. The RAVDESS [26] dataset comprises 1,440 recordings from 24 actors (12 males, 12 females), each articulating two sentences across 8 emotional states. Each recording features one speaker expressing a single emotion, with equal distribution of male and female voices across the dataset. ...
Preprint
Full-text available
The advent of modern deep learning techniques has given rise to advancements in the field of Speech Emotion Recognition (SER). However, most systems prevalent in the field fail to generalize to speakers not seen during training. This study focuses on handling the challenges of multilingual SER, specifically on unseen speakers. We introduce CAMuLeNet, a novel architecture leveraging co-attention-based fusion and multitask learning to address this problem. Additionally, we benchmark pretrained encoders of Whisper, HuBERT, Wav2Vec2.0, and WavLM using 10-fold leave-speaker-out cross-validation on five existing multilingual benchmark datasets (IEMOCAP, RAVDESS, CREMA-D, EmoDB, and CaFE) and release a novel dataset for SER in the Hindi language (BhavVani). CAMuLeNet shows an average improvement of approximately 8% over all benchmarks on unseen speakers determined by our cross-validation strategy.
... Each filename is unique and consists of a 7-part numerical identifier. The parts are structured in the following sequence: modality, vocal channel, emotion, emotional intensity, statement, repetition, and actor [29]. In the RAVDESS dataset, both speech audio and singing audio files are available. ...
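The 7-part identifier described above can be parsed mechanically. The sketch below follows the coding tables published with the RAVDESS (e.g. the third field encoding emotion, 01 = neutral through 08 = surprised, and odd/even actor numbers indicating male/female); it is an illustration rather than code from the citing work, so verify the mappings against the official RAVDESS documentation.

```python
# Sketch: parse the 7-part RAVDESS filename identifier described above.
# The code tables follow the dataset's published naming convention; check
# them against the official RAVDESS documentation before relying on them.

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_filename(filename: str) -> dict:
    """Split e.g. '03-01-06-01-02-01-12.wav' into its labelled parts."""
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    return {
        "modality": {"01": "full-AV", "02": "video-only", "03": "audio-only"}[modality],
        "vocal_channel": {"01": "speech", "02": "song"}[channel],
        "emotion": EMOTIONS[emotion],
        "intensity": {"01": "normal", "02": "strong"}[intensity],
        "statement": statement,
        "repetition": repetition,
        "actor": int(actor),
        "actor_sex": "male" if int(actor) % 2 == 1 else "female",
    }

print(parse_ravdess_filename("03-01-06-01-02-01-12.wav"))
# {'modality': 'audio-only', 'vocal_channel': 'speech', 'emotion': 'fearful', ...}
```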
Article
Full-text available
Human error is a label assigned to an event that has negative effects or does not produce a desired result, and emotions play an important role in how humans think and behave. Detecting feelings early may therefore decrease human error. The human voice is one of the most powerful tools that can be used for emotion recognition. This study aims to reduce human error by building a system that detects positive or negative emotions of a user (such as happiness, sadness, fear, and anger) through the analysis of the proposed vocal emotion component using Convolutional Neural Networks. Applying the proposed method to an emotional voice database (RAVDESS), using Librosa for voice processing and PyTorch, with happy/angry emotion classification, yields better accuracy (98%) than the literature with regard to deciding whether to deny or allow a user access to sensitive operations, or to send a warning to the system administrator before system resources are accessed.
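The abstract reports a CNN built with Librosa and PyTorch for the happy/angry decision but does not give the architecture, so the tiny 1-D CNN below is only an assumed illustration of that kind of classifier; the MFCC settings, layer widths, and pooling choice are hypothetical.

```python
# Hypothetical minimal PyTorch CNN for a binary (happy vs. angry) decision
# over MFCC features; layer sizes and MFCC settings are assumptions, not
# the architecture of the cited study.
import librosa
import numpy as np
import torch
import torch.nn as nn

class TinyEmotionCNN(nn.Module):
    def __init__(self, n_mfcc: int = 40, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # pool over time frames
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_mfcc, frames)
        return self.classifier(self.features(x).squeeze(-1))

# MFCCs from a placeholder waveform, then a forward pass.
y = np.random.randn(48000).astype(np.float32)             # 1 s at 48 kHz
mfcc = librosa.feature.mfcc(y=y, sr=48000, n_mfcc=40)      # (40, frames)
logits = TinyEmotionCNN()(torch.from_numpy(mfcc).float().unsqueeze(0))
print(logits.shape)                                        # torch.Size([1, 2])
```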
... The standardized validation protocols embedded in the dataset aid in its comparative benchmarking and reproducibility across studies. Additionally, the spectrum of emotions conveyed by speech in this dataset encompasses an array of affective states, including, but not limited to, tranquility, satisfaction, elation, melancholy, indignation, trepidation, astonishment, and revulsion [25]. Each expression was associated with a distinct level of emotional intensity, ranging from neutral to normal to strong. ...
Article
Full-text available
Speech emotion recognition (SER) is a technology that can be applied to distance education to analyze speech patterns and evaluate speakers’ emotional states in real time. It provides valuable insights and can be used to enhance students’ learning experiences by enabling the assessment of their instructors’ emotional stability, a factor that significantly impacts the effectiveness of information delivery. Students demonstrate different engagement levels during learning activities, and assessing this engagement is important for controlling the learning process and improving e-learning systems. An important aspect that may influence student engagement is their instructors’ emotional state. Accordingly, this study used deep learning techniques to create an automated system for recognizing instructors’ emotions in their speech when delivering distance learning. This methodology entailed integrating transformer, convolutional neural network, and long short-term memory architectures into an ensemble to enhance SER. Feature extraction from audio data used Mel-frequency cepstral coefficients; chroma; a Mel spectrogram; the zero-crossing rate; spectral contrast, centroid, bandwidth, and roll-off; and the root mean square, with subsequent data augmentation such as adding noise, time stretching, and shifting the audio data. Several transformer blocks were incorporated, and a multi-head self-attention mechanism was employed to identify the relationships between the input sequence segments. The preprocessing and data augmentation methodologies significantly enhanced the precision of the results, with accuracy rates of 96.3%, 99.86%, 96.5%, and 85.3% for the Ryerson Audio–Visual Database of Emotional Speech and Song, Berlin Database of Emotional Speech, Surrey Audio–Visual Expressed Emotion, and Interactive Emotional Dyadic Motion Capture datasets, respectively. Furthermore, it achieved 83% accuracy on another dataset created for this study, the Saudi Higher-Education Instructor Emotions dataset. The results demonstrate the considerable accuracy of this model in detecting emotions in speech data across different languages and datasets.
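A hedged sketch of how such a Librosa feature set and the noise/stretch/shift augmentations might be computed is shown below; pooling each feature by its mean over time and the specific augmentation parameters are assumptions, not details taken from the study.

```python
# Sketch: pool a set of Librosa features into one vector per utterance and
# apply simple augmentations (noise, time stretch, shift). The mean-over-time
# aggregation and augmentation parameters are assumptions.
import librosa
import numpy as np

def extract_features(y: np.ndarray, sr: int) -> np.ndarray:
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.melspectrogram(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.rms(y=y),
    ]
    return np.concatenate([f.mean(axis=1) for f in feats])  # one vector per clip

def augment(y: np.ndarray, sr: int):
    return [
        y + 0.005 * np.random.randn(len(y)),            # additive noise
        librosa.effects.time_stretch(y, rate=0.9),       # time stretching
        np.roll(y, sr // 10),                            # shift by 100 ms
    ]

y, sr = np.random.randn(48000).astype(np.float32), 48000  # placeholder clip
print(extract_features(y, sr).shape)                       # (192,) with these settings
```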
... All of our code and models are available under an open-source license. We select two datasets without emotion annotation, LJSpeech (single speaker, 24 hours) and LibriTTS-R [22,23] (2,456 speakers, 585 hours), as well as three datasets which contain an emotion label along with each utterance: Emotional Speech Database (ESD) [24], Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [25], and Toronto Emotional Speech Set (TESS) [26]. All used datasets are publicly available; an overview of the properties of the emotional speech datasets can be seen in Table 1. ...
Preprint
Full-text available
In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.
... [13], achieving high accuracy on the RAVDESS [14] and EMO-DB [15] datasets. The MAHCNN [16] model demonstrated superior capabilities in speech feature extraction and classification, which are beneficial for monitoring emotional dynamics in educational settings. ...
Article
Full-text available
Enhancing the naturalness and rhythmicity of generated audio in end-to-end speech synthesis is crucial. The current state-of-the-art (SOTA) model, VITS, utilizes a conditional variational autoencoder architecture. However, it faces challenges, such as limited robustness, due to training solely on text and spectrum data from the training set. Particularly, the posterior encoder struggles with mid- and high-frequency feature extraction, impacting waveform reconstruction. Existing efforts mainly focus on prior encoder enhancements or alignment algorithms, neglecting improvements to spectrum feature extraction. In response, we propose BERTIVITS, a novel model integrating BERT into VITS. Our model features a redesigned posterior encoder with residual connections and utilizes pre-trained models to enhance spectrum feature extraction. Compared to VITS, BERTIVITS shows significant subjective MOS score improvements (0.16 in English, 0.36 in Chinese) and objective Mel-Cepstral coefficient reductions (0.52 in English, 0.49 in Chinese). BERTIVITS is tailored for single-speaker scenarios, improving speech synthesis technology for applications like post-class tutoring or telephone customer service.
... With the advancements in deep learning, some studies [7,8] aim to learn emotion representations from audio signals using neural networks [9]. However, compared to other common downstream tasks like Automatic Speech Recognition (ASR), SER datasets [10,11] are relatively small. Therefore, leveraging self-supervised pretraining models to learn representations from a large volume of unlabeled speech data, and subsequently using these models either as feature extractors or by directly fine-tuning the entire model, has become a common solution. ...
Preprint
Full-text available
Speech Self-Supervised Learning (SSL) has demonstrated considerable efficacy in various downstream tasks. Nevertheless, prevailing self-supervised models often overlook the incorporation of emotion-related prior information, thereby neglecting the potential enhancement of emotion task comprehension through emotion prior knowledge in speech. In this paper, we propose emotion-aware speech representation learning with intensity knowledge. Specifically, we extract frame-level emotion intensities using an established speech-emotion understanding model. Subsequently, we propose a novel emotional masking strategy (EMS) to incorporate emotion intensities into the masking process. We selected two representative models based on Transformer and CNN architectures, namely MockingJay and Non-autoregressive Predictive Coding (NPC), and conducted experiments on the IEMOCAP dataset. Experiments have demonstrated that the representations derived from our proposed method outperform the original models on the SER task.