A Study on Speech Emotion Recognitions on Machine Learning Algorithms

Shanmuk Srinivas Amiripalli a,1, Potnuru Likhitha a, Sisankita Patnaik a, Suresh Babu K a, Rampay Venkatarao b

a Dept of CSE, GIT, GITAM University, Visakhapatnam, Andhra Pradesh, India
b Dept of CSE, Wollega University, Nekemte, Ethiopia
Abstract. Speech emotion detection has become highly relevant in today's digital culture. The RAVDESS, TESS, and SAVEE datasets were used to train the model in our project. To determine the precision of each algorithm on each dataset, we examined ten separate Machine Learning algorithms. We then cleaned the datasets by applying a mask to eliminate unnecessary background noise, and ran all ten algorithms on the cleaned speech data to improve accuracy. We compared the accuracies of all ten algorithms to identify the best one. Finally, using that algorithm, we counted the number of sound files associated with each of the emotions described in those datasets.
Keywords. Scikit-learn, MLPClassifier, Logistic Regression, Naïve Bayes, XGB,
LightGBM, Stochastic Gradient Descent, Support Vector Machine.
1. Introduction
In classification, a set of data is categorized into classes, and it can be performed
not only on structured data but also on unstructured data. Each data point of datasets is
predicted into which class it falls under. These classes can be referred as targets, labels
or categories. The task of the Classification predictive model is to approximate the
mapping function from input variables to discrete output variables. Example for Binary
Classification: While flipping a coin, the chances of getting head or tail can be
categorized into two classes. Example for Multiclass Classification: There may have a
3-class classification problem of an animal set to classify as lion, tiger or leopard with a
total of 100 instances. The classifier, in this case, needs training data to understand how
the given input variables are related to the class. Once the classifier is trained
accurately, it can detect the output of the particular testing data. This concept was used
to get the accuracy of the dataset. Speech is one of the most natural means for us to
communicate ourselves as humans. We depend on it so often that we can deduce its
significance from other forms of contact, such as emails and instant messaging, in
which we often use sentiment to convey the messages' contents. Since feelings are so
central in conversation, sensing and interpreting them is crucial in today's world of
remote communication.
1 Corresponding author: Shanmuk Srinivas Amiripalli, Dept of CSE, GIT, GITAM University, Visakhapatnam, India. Email: shanmuk39@gmail.com

Recent Trends in Intensive Computing, M. Rajesh et al. (Eds.). © 2021 The authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0). doi:10.3233/APC210225
Emotions are hard to detect because they are subjective; there is no agreed-upon procedure for quantifying or categorizing them. An SER system is a collection of methodologies for processing and classifying speech signals in order to detect the emotions they carry. This type of interface can be used in several applications, including interactive voice-based assistants and caller-agent interaction analysis. In this study, we examine the acoustic properties of audio files to detect the genuine emotions in recorded expressions. The process of recognizing human emotion from expression and affective states is known as Speech Emotion Recognition, or SER. It takes advantage of the fact that the voice reflects the speaker's underlying feeling through tone and pitch, and it involves identifying people's emotions from the tone of voice in their expressions. People do not all talk at the same volume; it varies with their moods and circumstances. This is the same phenomenon that allows creatures like dogs and horses to understand human feelings. SER is popular in call centres: call centre workers do not speak to every customer in the same way, and their manner of speaking differs depending on the customer. Speech emotion recognition systems enable workers to understand consumer emotions through speech, so that they can improve and better deliver services to their customers.
2. Literature Survey
The work in [1-3] compares the speaker-based speech emotion classification accuracy, and the time taken to construct the model, of Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) classifiers. The classification was performed with the WEKA toolkit, and the features were extracted with PRAAT. A basic SER module structure was chosen to compare the described classifiers. Confusion matrices, classification precision and build time were used to evaluate the supervised learning algorithms' efficiency. Although MLP outperformed SVM in overall emotion classification, SVM trained more quickly. MLP and SVM had acceptance rates of 78.69% and 76.82%, respectively. In MLP, the highest emotion identification was for depression (89%), with pleasure and anxiety being the most frequently confused emotions, while in SVM the highest emotion recognition was for indignation (87.4%), with disgust and fear being the most frequently confused emotions.

In [4-7], wavelet packet techniques were used to recognize speech sentiment. The wavelet packet coefficients were examined at five decomposition stages, analyzed, and used as inputs to Support Vector Machine (SVM) classifiers. The findings showed that, on seven emotional states in two languages, German and Chinese, this wavelet packet strategy increased efficiency by 4.5 percent and 16.9 percent, respectively, compared with a baseline without these features. The two datasets reached final success averages of 61.9 percent and 62.2 percent, respectively. It was therefore concluded that wavelet packet coefficient features outperform Mel-Frequency Cepstral Coefficient (MFCC) features.

ANN-based decision fusion for SER was introduced in [7-9]. SVM, k-NN, Gaussian Mixture Model, HMM, ANN and other sequence classification methods were compared to determine the most efficient tool for classifying speech emotions, and SVM was reported to give the best results. Some suspected that an ANN could achieve ideal results, but it was unclear which kind was best suited to SER. To identify the various feelings, four separate ANNs were used: Probabilistic Neural Network (PNN), Radial Basis Function (RBF) network, Back Propagation (BP) network and Elman network. At the decision stage, voting schemes were used to fuse the recognitions, utilizing statistical and spectral characteristics. Principal Component Analysis (PCA) was used to reduce the dimensionality of the super vectors built from the spectral features. However, since PCA is a dimensionality reduction method rather than a pattern recognition method, it caused some issues, and the proposed decision fusion was introduced as a way to avoid them. According to the results, the proposed decision fusion was successful and the dimensionality reduction was feasible.

The research in [10-12] proposes a speaker-independent approach for categorizing emotional vocal sounds. The approach divides the process of recognizing emotions into two stages. The first stage entails a coarse encoding and grouping of six emotional states to determine which pair of emotions has the highest likelihood. Low-level encoding procedures were suggested at this stage, and the extracted features were combined to produce the acoustic vectors that best describe the emotional states. Second, utilizing the Sequential Floating Forward Selection (SFFS) algorithm, modern encoding strategies were used to define a special collection of acoustic features for each pair of emotions that can be used to discriminate between them. In total, 72 high-level acoustic features were used.
3. Proposed Classification Algorithm
Classification is a supervised learning technique in Machine Learning that separates a dataset into groups. Speech Emotion Recognition, face identification, handwriting recognition and text classification are among the most important classification problems. A problem may be binary or multiclass, and Machine Learning offers many classification algorithms. The following algorithms were applied to the RAVDESS, TESS, and SAVEE datasets.
First, we take the audio datasets as input and extract features from the speech files. We declare one dictionary for the emotions present in the dataset and another for the emotions we want to observe. We load the dataset and split it into two subsets, 75% of the data for training and 25% for testing. We then initialize a classifier and train the model to predict the emotion of each speech file; finally, the model outputs the predicted emotion.

We took three datasets, namely RAVDESS, SAVEE and TESS, which consist of different emotions. We computed the accuracies of ten different classifiers (Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, KNN, Decision Tree, Random Forest, Support Vector Machine, MLPC, XGBoost and LightGBM) for each dataset and compared them to determine which classifier is the most accurate [13-16]. The accuracy was calculated before and after masking of the datasets. From these accuracies, we found that the MLP Classifier is more accurate than the others. Before masking, the MLPC accuracies for RAVDESS, SAVEE, and TESS were 70%, 100% and 80%, respectively; after masking, they were 75.60%, 100% and 84%, respectively [17-21]. A sketch of this pipeline, including a simple noise mask, is shown below.
Figure 1. Block diagram of SER

Figure 2. Architecture of SER
Algorithm
Step 1: Start.
Step 2: Import all required packages, libraries and modules.
Step 3: Consider five standard features of an audio file, and declare a dictionary containing the emotions in the dataset along with a list of the emotions to observe.
Step 4: Take RAVDESS, TESS and SAVEE as the input datasets for this research.
Step 5: Configure the experiment with 75% of the data for training and 25% for testing.
Step 6: Initialize a classifier model and fit the model.
Step 7: Compute the accuracy of ten different classifiers and pick the one with the highest accuracy (see the sketch after these steps).
Step 8: Using that classifier, count the files for each emotion in each dataset.
Step 9: Draw the classification report and confusion matrix for each dataset.
Step 10: Stop.
4. Results and Discussion
Since accuracy tends to decrease as the number of observed emotions increases, we did not consider all emotions. RAVDESS contains eight emotions, whereas TESS and SAVEE contain seven, so we took the four emotions common to all three datasets: happiness, fear, disgust and neutral. Below are the counts of files for each considered emotion in each dataset; a sketch of how such counts can be derived for RAVDESS follows.
Figure 3. Confusion matrix of the MLP Classifier using (a) the RAVDESS dataset, (b) the TESS dataset, (c) the SAVEE dataset
Table 1. Number of files per observed emotion in each dataset

Emotion     RAVDESS   TESS   SAVEE
HAPPY       45        103    20
FEARFUL     41        107    14
DISGUST     56        90     12
NEUTRAL     26        100    29
Figure 4. Count of files of each emotion in each dataset
5. Conclusion
In the above research, the major observation is that the MLP Classifier outperforms every other classifier we considered. However, to improve the model's accuracy, the noise in the dataset must be cleaned. By considering four emotions (neutral, happy, fearful and disgust), masking improved the model's accuracy from 71% to 76% for RAVDESS and from 80% to 84% for SAVEE, while the accuracy for TESS remained constant at 100% before and after masking. The model can be used to predict the emotions of the audio files in these datasets, as well as of any other sample audio file with the .wav extension, as sketched below.
References
[1]. Idris, I., & Salam, M. S. H. (2014, December). Emotion detection with hybrid voice quality and
prosodic features using neural network. In 2014 4th World Congress on Information and
Communication Technologies (WICT 2014) (pp. 205-210). IEEE.
[2]. Wang, K., An, N., & Li, L. (2014, September). Speech emotion recognition based on wavelet
packet coefficient model. In The 9th International Symposium on Chinese Spoken Language
Processing (pp. 478-482). IEEE.
[3]. Xu, L., Xu, M., & Yang, D. (2009). ANN based decision fusion for speech emotion recognition.
In Tenth Annual Conference of the International Speech Communication Association.
[4]. Atassi, H., & Esposito, A. (2008, November). A speaker independent approach to the
classification of emotional vocal expressions. In 2008 20th IEEE international conference on tools
with artificial intelligence (Vol. 2, pp. 147-152). IEEE.
[5]. Huang, C., Jin, Y., Zhao, Y., Yu, Y., & Zhao, L. (2009, September). Speech emotion recognition
based on re-composition of two-class classifiers. In 2009 3rd International Conference on
Affective Computing and Intelligent Interaction and Workshops (pp. 1-3). IEEE.
[6]. Zhang, J., Yin, Z., Chen, P., & Nichele, S. (2020). Emotion recognition using multi-modal data
and machine learning techniques: A tutorial and review. Information Fusion, 59, 103-126.
[7]. Kerkeni, L., Serrestou, Y., Mbarki, M., Raoof, K., & Mahjoub, M. A. (2018, January). Speech
Emotion Recognition: Methods and Cases Study. In ICAART (2) (pp. 175-182).
[8]. Basu, S., Chakraborty, J., Bag, A., & Aftabuddin, M. (2017, March). A review on emotion
recognition using speech. In 2017 International Conference on Inventive Communication and
Computational Technologies (ICICCT) (pp. 109-114). IEEE.
[9]. Tarunika, K., Pradeeba, R. B., & Aruna, P. (2018, July). Applying machine learning techniques
for speech emotion recognition. In 2018 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT) (pp. 1-5). IEEE.
[10]. Kerkeni, L., Serrestou, Y., Mbarki, M., Raoof, K., Mahjoub, M. A., & Cleder, C. (2019).
Automatic speech emotion recognition using machine learning. In Social media and machine
learning. IntechOpen.
[11]. Amiripalli, S. S., Bobba, V., & Potharaju, S. P. (2019). A novel trimet graph optimization (TGO) topology for wireless networks. doi:10.1007/978-981-13-0617-4_8.
[12]. Potharaju, S. P., & Sreedevi, M. (2018). A novel cluster of quarter feature selection based on
symmetrical uncertainty. Gazi University Journal of Science, 31(2), 456-470.
[13]. Amiripalli, S. S., & Bobba, V. (2018). Research on network design and analysis of TGO topology. International Journal of Networking and Virtual Organisations, 19(1), 72-86.
[14]. Potharaju, S. P., & Sreedevi, M. (2017). A Novel Clustering Based Candidate Feature Selection
Framework Using Correlation Coefficient for Improving Classification Performance. Journal of
Engineering Science & Technology Review, 10(6).
[15]. Amiripalli, S. S., & Bobba, V. (2019). An Optimal TGO Topology Method for a Scalable and Survivable Network in IOT Communication Technology. Wireless Personal Communications, 107(2), 1019-1040.
[16]. Potharaju, S. P., Sreedevi, M., & Amiripalli, S. S. (2019). An Ensemble Feature Selection
Framework of Sonar Targets Using Symmetrical Uncertainty and Multi-Layer Perceptron (SU-
MLP). In Cognitive Informatics and Soft Computing (pp. 247-256). Springer, Singapore.
[17]. Amiripalli, S. S., & Bobba, V. (2020). A Fibonacci based TGO methodology for survivability in ZigBee topologies. International Journal of Scientific & Technology Research, 9(2), 878-881.
[18]. Ramiah Chowdary, P., Challa, Y., & Jitendra, M. S. N. V. (2019). Identification of MITM attack by utilizing artificial intelligence mechanism in cloud environments. Journal of Physics: Conference Series, 1228(1), 012044.
[19]. Thota, J. R., Kothuru, M., Shanmuk Srinivas, A., & Jitendra, M. S. N. V. (2020). Monitoring diabetes occurrence probability using classification technique with a UI. International Journal of Scientific and Technology Research, 9(4), 38-41.
[20]. Potharaju, S. P., Sreedevi, M., Ande, V. K., & Tirandasu, R. K. (2019). Data mining approach for
accelerating the classification accuracy of cardiotocography. Clinical Epidemiology and Global
Health, 7(2), 160-164.
[21]. Jitendra, M. S. N. V., & Radhika, Y. (2020). A review: Music feature extraction from an audio signal. International Journal of Advanced Trends in Computer Science and Engineering, 9(2), 973-980.