Normalized confusion matrix for three-class sentiment analysis.
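The row-normalized matrix in a figure like this can be reproduced directly from true and predicted labels; the snippet below is a minimal, generic sketch with placeholder arrays (not the study's data), assuming the three classes negative/neutral/positive.

```python
# Minimal sketch: row-normalized confusion matrix for three sentiment classes.
# y_true / y_pred are placeholder values, not results from the article.
import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["negative", "neutral", "positive"]          # assumed class order

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])      # placeholder ground truth
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 1, 2, 0])      # placeholder predictions

# normalize="true" divides each row by its class support, so every row sums to 1
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2], normalize="true")
for name, row in zip(CLASSES, cm):
    print(name, np.round(row, 2))
```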

Source publication
Article
Full-text available
Understanding sentiment and emotion in speech is a challenging task in human multimodal language research. However, in certain cases, such as telephone calls, only audio data can be obtained. In this study, we independently evaluated sentiment analysis and emotion recognition from speech using recent self-supervised learning models—specifically...
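As a rough, hedged illustration of the kind of pipeline such a study implies (utterance-level features from a pre-trained self-supervised speech model, probed with a lightweight classifier), the sketch below uses torchaudio's WAV2VEC2_BASE bundle; the model choice, mean pooling, and logistic-regression probe are assumptions for illustration, not the authors' exact setup.

```python
# Sketch only: SSL speech embeddings as features for a sentiment/emotion probe.
# WAV2VEC2_BASE, mean pooling, and LogisticRegression are illustrative choices.
import torch
import torchaudio
from sklearn.linear_model import LogisticRegression

bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()

def utterance_embedding(path: str) -> torch.Tensor:
    """Load a mono clip, resample to the bundle rate, mean-pool the last SSL layer."""
    wav, sr = torchaudio.load(path)                     # (channels, time)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        features, _ = ssl_model.extract_features(wav)   # list of per-layer tensors
    return features[-1].mean(dim=1).squeeze(0)          # (hidden_dim,)

# Hypothetical usage with a labelled corpus (train_files / train_labels are placeholders):
# X = torch.stack([utterance_embedding(f) for f in train_files]).numpy()
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```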

Similar publications

Chapter
Full-text available
The demand for intelligent devices that can classify spoken utterances has been driving speech research. Speech recognition models face a challenging task given the nature of language, in which there are no clear boundaries between words, a word's acoustic onset and ending are affected by neighboring words, and various...
Article
Full-text available
Noise annoyance has been often reported as one of the main adverse effects of noise exposure on human health, and there is consensus that it relates to several factors going beyond the mere energy content of the signal. Research has historically focused on a limited set of sound sources (e.g., transport and industrial noise); only more recently is...
Article
Full-text available
Audio signals play a crucial role in our perception of our surroundings. People rely on sound to assess motion, distance, direction, and environmental conditions, aiding in danger avoidance and decision making. However, in real-world environments, during the acquisition and transmission of audio signals, we often encounter various types of noises t...
Article
Full-text available
Superficially, read and spontaneous speech—the two main kinds of training data for automatic speech recognition—appear complementary but equal: pairs of texts and acoustic signals. Yet, spontaneous speech is typically harder to recognize. This is usually explained by different kinds of variation and noise, but there is a more fundamental...
Conference Paper
Full-text available
This paper presents the machine learning approach to the automated classification of a dog's emotional state based on the processing and recognition of audio signals. It offers helpful information for improving human-machine interfaces and developing more precise tools for classifying emotions from acoustic data. The presented model demonstrates an...

Citations

... In customer service, real-time sentiment analysis prioritizes high-emotion calls, improving overall satisfaction, while market research benefits from data-driven decisions derived from understanding consumer sentiments in various contexts [5]. Additionally, healthcare, education, entertainment, security, and internal organizational dynamics all leverage voice-based sentiment analysis for purposes ranging from early detection of emotional disorders to enhancing workplace satisfaction and productivity [6]. ...
Article
Full-text available
INTRODUCTION: Language serves as the primary conduit for human expression, extending its reach into various communication mediums like email and text messaging, where emoticons are frequently employed to convey nuanced emotions. In the digital landscape of long-distance communication, the detection and analysis of emotions assume paramount importance. However, this task is inherently challenging due to the subjectivity inherent in emotions, lacking a universal consensus for quantification or categorization. OBJECTIVES: This research proposes a novel speech recognition model for emotion analysis, leveraging diverse machine learning techniques along with a three-layer feature extraction approach. This research also sheds light on the robustness of models on balanced and imbalanced datasets. METHODS: The proposed three-layered feature extractor uses the chroma, MFCC, and Mel methods and passes these features to classifiers such as K-Nearest Neighbour, Gradient Boosting, Multi-Layer Perceptron, and Random Forest. RESULTS: Among the classifiers in the framework, Multi-Layer Perceptron (MLP) emerges as the top-performing model, showcasing remarkable accuracies of 99.64%, 99.43%, and 99.31% in the Balanced TESS Dataset, Imbalanced TESS (Half) Dataset, and Imbalanced TESS (Quarter) Dataset, respectively. K-Nearest Neighbour (KNN) follows closely as the second-best classifier, surpassing MLP's accuracy only in the Imbalanced TESS (Half) Dataset at 99.52%. CONCLUSION: This research contributes valuable insights into effective emotion recognition through speech, shedding light on the nuances of classification in imbalanced datasets.
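To make the three-layer feature idea concrete, a hedged sketch is given below: chroma, MFCC, and mel-spectrogram statistics are concatenated per file and fed to an MLP. The librosa settings, hidden-layer size, and the wav_files/labels placeholders are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a chroma + MFCC + mel feature extractor feeding an MLP classifier.
# Paths, labels, and hyperparameters are placeholders, not the paper's setup.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)      # 12 values
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)       # 40 values
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)         # 128 values
    return np.concatenate([chroma, mfcc, mel])

# Hypothetical usage (wav_files and labels are placeholders for an annotated corpus):
# X = np.stack([extract_features(f) for f in wav_files])
# clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X, labels)
```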
... Among the various studies that have gone in this direction, [54] introduces a robust multi-depth network for accurately recognizing facial expressions by leveraging a multi-depth network and a multirate-based 3D CNN, demonstrating its efficacy in understanding human emotions through facial expressions, while ref. [55] utilizes self-supervised learning models, including universal speech representations with speaker-aware pre-training, to evaluate different model sizes across sentiment and emotion tasks. ...
Article
Full-text available
The perception of sound greatly impacts users’ emotional states, expectations, affective relationships with products, and purchase decisions. Consequently, assessing the perceived quality of sounds through jury testing is crucial in product design. However, the subjective nature of jurors’ responses may limit the accuracy and reliability of jury test outcomes. This research explores the utility of facial expression analysis in jury testing to enhance response reliability and mitigate subjectivity. Some quantitative indicators allow the research hypothesis to be validated, such as the correlation between jurors’ emotional responses and valence values, the accuracy of jury tests, and the disparities between jurors’ questionnaire responses and the emotions measured by FER (facial expression recognition). Specifically, analysis of attention levels during different statuses reveals a discernible decrease in attention levels, with 70 percent of jurors exhibiting reduced attention levels in the ‘distracted’ state and 62 percent in the ‘heavy-eyed’ state. On the other hand, regression analysis shows that the correlation between jurors’ valence and their choices in the jury test increases when considering the data where the jurors are attentive. The correlation highlights the potential of facial expression analysis as a reliable tool for assessing juror engagement. The findings suggest that integrating facial expression recognition can enhance the accuracy of jury testing in product design by providing a more dependable assessment of user responses and deeper insights into participants’ reactions to auditory stimuli.
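The correlation check described above (FER-measured valence against jury-test choices, computed with and without filtering for attentive jurors) can be sketched as follows; the arrays and the attention mask below are placeholder values, not the study's data.

```python
# Sketch: correlate FER valence with jury choices, overall and for attentive jurors only.
# All values are placeholders, not measurements from the study.
import numpy as np
from scipy.stats import pearsonr

valence = np.array([0.6, 0.1, -0.3, 0.8, 0.2, -0.5])          # placeholder FER valence scores
choice = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])             # placeholder jury-test preferences
attentive = np.array([True, True, False, True, True, True])   # placeholder attention flags

r_all, _ = pearsonr(valence, choice)
r_attentive, _ = pearsonr(valence[attentive], choice[attentive])
print(f"correlation (all jurors): {r_all:.2f}; attentive only: {r_attentive:.2f}")
```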
... For increased reliability, background subtraction is extended to a bigram or trigram framework together with NLP methods such as a POS tagger and a bag-of-words model. Atmaja et al. [6] reviewed current advances in question classification and the strategies employed to divide queries into two categories depending on the user's behavior and the query logs. Models are trained and assessed using ML techniques, whether supervised or unsupervised, and the results are compared. ...
... The most noticeable case currently is the use of speech signal analysis (instead of cough) to predict the presence of a Coronavirus Disease 2019 (COVID-19) signature in vowel sounds [4][5][6]. Voice analysis is useful not only for physical health diagnosis (e.g., dysphonia detection [7], lung sound analysis [8]) but also for mental health diagnosis (e.g., monitoring of emotions [9][10][11][12]). ...
Article
Research on diagnosing diseases based on voice signals, including cough-related diseases, is rapidly increasing. When training deep learning models on cough sound signals, it is necessary to have a standard input by segmenting recordings of several coughs into individual cough signals. Segmenting coughs could also be used to monitor trends in cough-related disease by counting the number of coughs. Previous research has developed methods to segment cough signals from non-cough signals. This research evaluates methods for segmenting several cough signals from a single audio file into several audio files each containing a single cough. We evaluate three different methods: manual segmentation as a baseline and two automatic segmentation methods, a hysteresis comparator and a root mean square (RMS) method. The two automatic segmentation methods obtained precisions of 73% (hysteresis) and 70% (RMS), compared to 49% by manual segmentation. The agreement of listening tests counting the number of correct single-cough segmentations shows fair to moderate correlations for the automatic segmentation methods, comparable with manual segmentation.
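The RMS-based variant can be approximated by thresholding frame-level RMS energy and writing each active region to its own file; the threshold, frame sizes, and minimum duration below are assumptions for illustration, not the parameters evaluated in the paper.

```python
# Sketch of RMS-threshold cough segmentation: one output file per detected segment.
# Threshold, frame/hop sizes, and minimum duration are assumed values.
import librosa
import soundfile as sf

def segment_by_rms(path: str, threshold: float = 0.02, min_len_s: float = 0.1):
    y, sr = librosa.load(path, sr=None)
    hop = 256
    rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=hop)[0]
    active = rms > threshold                              # boolean mask per frame
    segments, start = [], None
    for i, flag in enumerate(active):                     # collect contiguous active runs
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * hop, i * hop))
            start = None
    if start is not None:
        segments.append((start * hop, len(y)))
    kept = [(s, e) for s, e in segments if (e - s) / sr >= min_len_s]
    for k, (s, e) in enumerate(kept):                     # one single-cough file per segment
        sf.write(f"cough_{k:03d}.wav", y[s:e], sr)
    return kept
```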
... Several modal fusion techniques have been well developed in various domains in recent years. For example, multimodal emotion recognition [28][29][30], multimodal interaction [31], and lip-sync fusion [32][33][34][35] have seen significant advancements. In 2023, Qin et al. [36] provided a comprehensive review of identification techniques and applications in unimodal identity recognition. ...
Article
Full-text available
With the rapid development of multimedia technology, personnel verification systems have become increasingly important in the security field and identity verification. However, unimodal verification systems have performance bottlenecks in complex scenarios, thus triggering the need for multimodal feature fusion methods. The main problem with audio–visual multimodal feature fusion is how to effectively integrate information from different modalities to improve the accuracy and robustness of the system for individual identity. In this paper, we focus on how to improve multimodal person verification systems and how to combine audio and visual features. In this study, we use pretrained models to extract the embeddings from each modality and then perform fusion model experiments based on these embeddings. The baseline approach in this paper involves taking the fusion feature and passing it through a fully connected (FC) layer. Building upon this baseline, we propose three fusion models based on attentional mechanisms: attention, gated, and inter–attention. These fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. On the VoxCeleb1 dataset, the best system performance achieved in this study was an equal error rate (EER) of 0.23% and a detection cost function (minDCF) of 0.011. On the evaluation set of NIST SRE19, the EER was 2.60% and the minDCF was 0.283. On the evaluation set of the CNC-AV set, the EER was 11.30% and the minDCF was 0.443. These experimental results strongly demonstrate that the proposed fusion method can significantly improve the performance of multimodal character verification systems.
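One of the attention-style fusions described here can be sketched as a gated combination of the two modality embeddings; the embedding and projection dimensions below are illustrative assumptions, not the paper's architecture.

```python
# Sketch of a gated audio-visual fusion over pre-extracted embeddings.
# Dimensions (192 audio, 512 visual, 256 fused) are assumptions for illustration.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, audio_dim: int = 192, visual_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, out_dim)
        self.visual_proj = nn.Linear(visual_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, audio_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        a = self.audio_proj(audio_emb)
        v = self.visual_proj(visual_emb)
        g = self.gate(torch.cat([a, v], dim=-1))   # per-dimension weight in [0, 1]
        return g * a + (1 - g) * v                 # fused embedding for verification scoring

# fused = GatedFusion()(torch.randn(8, 192), torch.randn(8, 512))  # batch of 8 embedding pairs
```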
... An STM is used in various sentiment detection modes, which can carry the mixed emotions and sentiment of heavy multimedia traffic. An STM can be viewed as a sentiment detection approach that facilitates two essential components: sentiment detection and the transition between these sentiments [1,2]. The S-FSN model should be able to detect sentiments and reduce the heavy multimedia traffic produced by various file transfers. ...
... Although SA and emotion recognition are closely related areas, the prevailing perspective among researchers views sentiment as part of emotion. Emotion recognition involves a wider and more in-depth understanding of emotional states and sensitivities, whereas SA concentrates on a more basic categorization of sentiment polarity [242,243]. For instance, when analyzing customer feedback, one might encounter expressions like ''I love your product'' (emotion) or ''Your customer service is good'' (sentiment). ...
Article
Full-text available
Analyzing and understanding the sentiments of social media documents on Twitter, Facebook, and Instagram has become a very important task at present. Analyzing the sentiment of these documents gives meaningful knowledge about the user opinions, which will help understand the overall view on these platforms. The problem of sentiment analysis (SA) can be regarded as a classification problem in which the text is classified as positive, negative, or neutral. This paper aims to give an intensive, but not exhaustive, review of the main concepts of SA and the state-of-the-art techniques; other aims are to make a comparative study of their performances, the main applications of SA as well as the limitations and the future directions for SA. Based on our analysis, researchers have utilized three main approaches for SA, namely lexicon/rules, machine learning (ML), and deep learning (DL). The performance of lexicon/rules-based models typically falls within the range of 55–85%. ML models, on the other hand, generally exhibit performance ranging from 55% to 90%, while DL models tend to achieve higher performance, ranging from 70% to 95%. These ranges are estimated and may be higher or lower depending on various factors, including the quality of the datasets, the chosen model architecture, the preprocessing techniques employed, as well as the quality and coverage of the lexicon utilized. Moreover, to further enhance models’ performance, researchers have delved into the implementation of hybrid models and optimization techniques which have demonstrated an ability to enhance the overall performance of SA models.
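As a toy instance of the ML approach the review surveys, the sketch below fits a TF-IDF plus linear-classifier pipeline on placeholder texts and labels; it only makes the taxonomy concrete and does not reproduce any reported accuracy range.

```python
# Toy ML-based sentiment classifier: TF-IDF features + logistic regression.
# The texts and labels are placeholders, not data from the review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I love your product", "Your customer service is good", "This update is terrible"]
labels = ["positive", "positive", "negative"]            # placeholder annotations

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["the service was terrible"]))         # classify an unseen review
```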
... This is because emotions may be communicated via speech in various ways, including changes in tone, pitch, loudness, rhythm, and other speech-related characteristics. Currently, there are multiple studies [13][14][15][16] that focus on identifying and analyzing speech features to detect the emotions of individuals. These studies aimed to effectively classify the detected features and accurately determine the emotional state of the speaker. ...
Article
Full-text available
Understanding and identifying emotional cues in human speech is a crucial aspect of human-computer communication. The application of computer technology in dissecting and deciphering emotions, along with the extraction of relevant emotional characteristics from speech, forms a significant part of this process. The objective of this study was to architect an innovative framework for speech emotion recognition predicated on spectrograms and semantic feature transcribers, aiming to bolster performance precision by acknowledging the conspicuous inadequacies in extant methodologies and rectifying them. To procure invaluable attributes for speech detection, this investigation leveraged two divergent strategies. Primarily, a wholly convolutional neural network model was engaged to transcribe speech spectrograms. Subsequently, a cutting-edge Mel-frequency cepstral coefficient feature abstraction approach was adopted and integrated with Speech2Vec for semantic feature encoding. These dual forms of attributes underwent individual processing before they were channeled into a long short-term memory network and a comprehensive connected layer for supplementary representation. By doing so, we aimed to bolster the sophistication and efficacy of our speech emotion detection model, thereby enhancing its potential to accurately recognize and interpret emotion from human speech. The proposed mechanism underwent a rigorous evaluation process employing two distinct databases: RAVDESS and EMO-DB. The outcome displayed a predominant performance when juxtaposed with established models, registering an impressive accuracy of 94.8% on the RAVDESS dataset and a commendable 94.0% on the EMO-DB dataset. This superior performance underscores the efficacy of our innovative system in the realm of speech emotion recognition, as it outperforms current frameworks in accuracy metrics.
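The MFCC-plus-recurrent branch of such a design can be sketched as below; the spectrogram CNN and the Speech2Vec semantic encoder are omitted, and the feature size, hidden size, and number of emotion classes are assumptions rather than the paper's configuration.

```python
# Sketch of an MFCC -> LSTM -> fully connected branch for emotion classification.
# Sizes (40 MFCCs, 128 hidden units, 7 classes) are assumed for illustration.
import torch
import torch.nn as nn
import librosa

class MfccLstmBranch(nn.Module):
    def __init__(self, n_mfcc: int = 40, hidden: int = 128, n_classes: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, frames, n_mfcc)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])                               # logits per emotion class

def mfcc_frames(path: str, n_mfcc: int = 40) -> torch.Tensor:
    y, sr = librosa.load(path, sr=None)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)
    return torch.tensor(m, dtype=torch.float32).unsqueeze(0)

# logits = MfccLstmBranch()(mfcc_frames("clip.wav"))        # "clip.wav" is a placeholder path
```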
... However, speaker recognition PTM embeddings have received much less attention for SER than speech SSL PTM embeddings such as wav2vec2.0 [6] and Unispeech-SAT [7]. Pappagari et al. [8] showed the association between speaker and emotion recognition, focusing on the usefulness of speaker recognition PTM embeddings for SER. ...
Preprint
Full-text available
Speech emotion recognition (SER) is a field that has drawn a lot of attention due to its applications in diverse fields. A current trend in methods used for SER is to leverage embeddings from pre-trained models (PTMs) as input features to downstream models. However, the use of embeddings from speaker recognition PTMs hasn't garnered much focus in comparison to other PTM embeddings. To fill this gap and in order to understand the efficacy of speaker recognition PTM embeddings, we perform a comparative analysis of five PTM embeddings. Among all, x-vector embeddings performed the best, possibly because their training for speaker recognition captures various components of speech such as tone, pitch, etc. Our modeling approach, which utilizes x-vector embeddings and mel-frequency cepstral coefficients (MFCC) as input features, is the most lightweight approach while achieving accuracy comparable to previous state-of-the-art (SOTA) methods on the CREMA-D benchmark.
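The input-feature idea (an x-vector speaker embedding concatenated with MFCC statistics) can be sketched with a publicly available x-vector model; the SpeechBrain checkpoint named below, the 16 kHz resampling, and the downstream usage are assumptions, not necessarily the preprint's exact setup.

```python
# Sketch: concatenate an x-vector speaker embedding with utterance-level MFCC stats.
# The SpeechBrain checkpoint and feature sizes are assumptions for illustration.
import numpy as np
import torch
import librosa
from speechbrain.pretrained import EncoderClassifier

xvector_model = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

def ser_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)                    # assume the model expects 16 kHz
    wav = torch.tensor(y).unsqueeze(0)                      # (1, time)
    xvec = xvector_model.encode_batch(wav).squeeze().detach().numpy()
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    return np.concatenate([xvec, mfcc])                     # speaker embedding + 40 MFCC means

# Hypothetical usage (crema_d_files is a placeholder list of CREMA-D clips):
# X = np.stack([ser_features(f) for f in crema_d_files])
```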