Speech emotion recognition using optimized genetic
algorithm-extreme learning machine
Musatafa Abbas Abbood Albadr¹ · Sabrina Tiun¹ · Masri Ayob¹ · Fahad Taha AL-Dhief² · Khairuddin Omar¹ · Mhd Khaled Maen³
Received: 23 February 2021 / Revised: 17 May 2021 / Accepted: 21 February 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Automatic Emotion Speech Recognition (ESR) is an active research field in Human-Computer Interface (HCI). Typically, an ESR system consists of two main parts: the Front-End (feature extraction) and the Back-End (classification). However, most previous ESR systems have focused on the feature extraction part only and ignored the classification part, even though classification is an essential part of ESR systems: its role is to map the features extracted from audio samples to their corresponding emotion. Moreover, most ESR systems have been evaluated under the Subject Independent (SI) scenario only. Therefore, in this paper, we focus on the Back-End (classification), where we adopt our recently developed Extreme Learning Machine (ELM) variant, called Optimized Genetic Algorithm-Extreme Learning Machine (OGA-ELM). In addition, we use the Mel Frequency Cepstral Coefficients (MFCC) method to extract the features from the speech utterances. This work demonstrates the significance of the classification part in ESR systems, where it improves the ESR performance in terms of accuracy. The performance of the proposed model was evaluated on the Berlin Emotional Speech (BES) dataset, which consists of 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). Four evaluation scenarios were conducted: Subject Dependent (SD), SI, Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The highest accuracies achieved by the OGA-ELM were 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. In addition, the proposed ESR system showed a fast execution time in all experiments.
Keywords Emotion speech recognition · Optimized genetic algorithm-extreme learning machine · Mel frequency cepstral coefficients
https://doi.org/10.1007/s11042-022-12747-w
*Musatafa Abbas Abbood Albadr
mustafa_abbas1988@yahoo.com
Extended author information available on the last page of the article
Published online: 19 March 2022
Multimedia Tools and Applications (2022) 81:23963–23989
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1 Introduction
Human beings use different forms of communication, such as facial expressions, gestures, speech, and body language. These forms of communication convey the emotional states and messages of the speakers [24]. In this regard, people have the natural ability to comprehend speakers' emotions through their speech signals. A robust emotion recognition system aims to identify a person's emotional state automatically from the user's voice. The speech signal contains the linguistic information of the speaker as well as other information such as age, origin, gender, and emotional state [53]. Such systems have had numerous potential impacts on HCI [7, 15, 22]. Furthermore, automatic ESR systems have been applied in several real-time applications to analyse and detect emotions, such as detecting the emotions of callers in call centers [43], diagnosing mental disorders [37], and detecting Parkinson's and Alzheimer's diseases [39, 62]. ESR systems are also used to provide assistance in various other applications (e.g., the development of educational and learning environments, lie detection systems, games software, and entertainment) [46].
In general, ESR systems consist of two main stages. The first stage is the front-end, which extracts the feature vectors from the samples of a speech utterance. The second stage is the back-end, which recognises the emotion based on certain sets of feature vectors, algorithms and models. Figure 1 depicts the general overview of the ESR system. Among the most common feature extraction methods used in the ESR field are Linear Predictive Coding (LPC), Cepstrum Coefficients derived from LPC (LPCC), MFCC, and Perceptual Linear Prediction (PLP) [12, 40, 41, 52]. Of these methods, MFCC is the most popular feature extraction approach in speech applications generally and has been reported to have the highest identification accuracy [28, 48, 51, 61]. The classification process, in turn, is an essential part of any ESR system; its role is to map the features extracted from audio samples to their corresponding emotion. Several classifiers have been used in the literature, for instance deep learning [10], Support Vector Machine (SVM) [33], and ELM [42].
Recently, ELMs have emerged as a modern framework for machine learning [6, 20, 25, 47, 54, 56]. ELMs are a type of feed-forward neural network characterised by random initialisation of their hidden layer weights, combined with a fast training algorithm. The effectiveness of this random initialisation and quick training makes them very appealing for large-scale data analysis.
Fig. 1 General overview of the emotion speech recognition system
In the last decades, the ELM algorithm has gained high significance among machine learning algorithms [32]. This is because ELM has unique characteristics such as good generalization, classification capability, and extremely fast training. In addition, ELM is an efficient solution for Single-hidden Layer Feedforward Networks (SLFNs), and it has proved its efficiency and effectiveness in several applications. In particular, ELM has obtained better and faster generalization than SVM and back propagation (BP)-based neural networks (NNs) [3, 29, 31, 35]. Consequently, many researchers have used the ELM in ESR. For example, the authors of [58] presented a new particle swarm optimization assisted biogeography-based algorithm for feature selection, while the ELM classifier was used for the classification part in order to distinguish the emotions. The simulations were conducted using the BES dataset. Different evaluation experiments were conducted: Subject Dependent (SD), Subject Independent (SI), Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The highest recognition accuracy was 90.31%, 99.47%, 98.94%, and 92.98% for SI, SD, GD-Female and GD-Male, respectively. However, despite the merits of this work, the optimization of the ELM with respect to its random weights was ignored, which leads to non-optimal classification performance of the ELM.
Another attempt is the work of [27], where the authors proposed a dynamic framework that exploits the advantages of auditory-based empirical features and complementary spectrogram-based statistical features. Furthermore, a Kernel Extreme Learning Machine (KELM) was used to recognize emotions. To validate this framework, they conducted experiments on the BES dataset. The experimental results demonstrated that the proposed framework outperformed the existing state-of-the-art method and achieved an accuracy of 92.90%. However, this work ignored the optimization of the KELM in terms of the input weights of the hidden layer.
A further attempt was made by [49], where the authors proposed Hybrid Spectral Features (HSF), which combine the LPC, MFCC and PSD parameters. In addition, the ELM was used as the pattern classifier to recognize emotions. The evaluation experiments were conducted on an emotional speech dataset consisting of nine emotions: neutral, calm, sad, surprise, happy, anxiety, anger, fear, and boredom. In their experiments, the highest overall recognition accuracy was 82.22%. Despite the superiority of the proposed method over the benchmark, this work also ignored the optimization of the ELM in terms of the input weights of the hidden layer.
The methods in [38, 44] address speech emotion recognition using the BES dataset. The method in [38] presented a new feature fusion (i.e., MFCC and Prosody) and two different feature selection approaches (i.e., SFS and SFFS). Four classifiers were used in this method: Linear Discriminant Analysis (LDA), Regularized Discriminant Analysis (RDA), SVM, and K-Nearest Neighbour (KNN). The experimental results showed that the RDA and SVM classifiers with SFFS features obtained the highest accuracy of 92.70%. The method in [44] proposed a new feature extraction approach called Acoustic Analysis Methods and Statistical Feature Selection (AAMSFS). In addition, it used three different classifiers: Multilayer Perceptron (MLP), SVM, and k-NN. Based on the experimental results, the proposed AAMSFS with the SVM classifier achieved an accuracy of up to 84.62%.
Another attempt was made by [36], where the authors adopted a cascaded normalization method. They combined nonlinear value level, linear speaker level, and feature vector level normalization in order to decrease the effects of the speaker and to maximize class separability. Additionally, the ELM was applied to distinguish emotions. The evaluation experiments were conducted on part of the recently collected Turkish Emotional Speech (TES) dataset with four emotion classes: joy, neutral, sadness, and anger. The experimental results showed the superiority of the ELM over the SVM: the ELM achieved an overall accuracy of 79.00%, while the SVM achieved 77.30%. However, most ESR studies have ignored the optimization of the ELM in terms of the input weights of the hidden layer [2, 3, 6]. The drawback of the ELM is that it lacks a specific technique for selecting the weights of the input-hidden layer. In other words, there is no method to ensure that the trained ELM is the most suitable one for the classification process. To solve this issue, an optimisation method should be combined with the ELM to determine the optimal weights that yield the best classification performance. Table 1 summarizes the related works, including the strengths and weaknesses of each method.
Based on the studies above, the limitations of emotion speech recognition systems can be summarized as follows:
• Most previous studies of emotion speech recognition have focused on the feature extraction part and ignored the classification part.
• Few studies have evaluated their methods under different scenarios such as Subject Independent (SI), Subject Dependent (SD), and Gender Dependent Female (GD-Female). In other words, most systems are evaluated under the SI scenario only.
• The accuracy rate of emotion recognition from speech is still not encouraging.
• Accuracy, recall and precision are the measures mostly used to evaluate emotion speech recognition systems, while other evaluation measures such as F-measure, G-mean, and execution time are ignored.
Based on the facts mentioned above, this study uses the MFCC features and a recent ELM optimization named OGA-ELM [4]. In [4], we proposed OGA-ELM for Language Identification (LID) using i-vector features, whereas in this work we apply OGA-ELM with MFCC features to emotion speech recognition. The main contributions of this work are as follows:
• Improve the ESR and achieve higher accuracy.
• Evaluate the proposed method under four different scenarios: SD, SI, GD-Female, and GD-Male.
• Evaluate the performance of the OGA-ELM in ESR using several evaluation measures: accuracy, recall, precision, F-measure, G-mean, and execution time.
• Demonstrate the effectiveness of the classification part in the ESR application.
The rest of this study is organized as follows: Section 2 describes the proposed method; Section 3 presents and discusses the experimental results; and Section 4 presents the conclusions and future work.
Table 1 Summary of related work

| Ref | Dataset | Features | Classifier | Result | Strengths | Weaknesses |
|---|---|---|---|---|---|---|
| [58] | BES with 7 emotions | Higher Order Spectral (HOS) features and Particle Swarm Optimization assisted Biogeography-based Optimization (PSOBBO) for feature selection | ELM | Accuracy of 90.31% (SI), 99.47% (SD), 98.94% (GD-Female), and 92.98% (GD-Male) | The results proved that the proposed HOS-PSOBBO-ELM outperformed some previous studies. | 1. The optimization of ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 2. The results need more improvement in terms of accuracy. |
| [27] | BES with 7 emotions | Deep Complementary Feature Extraction (DCF) | KELM | Accuracy of 92.90% (SI) | The proposed DCF-KELM outperformed the CNN-BLSTM. | Shared by [27] and [49]: 1. These works were evaluated based on only one scenario, which is SI. 2. The optimization of both KELM and ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 3. The results of these methods are still not encouraging and need more improvement. |
| [49] | BES with 7 emotions | Hybrid Spectral Features (HSF), which combine the LPC, MFCC and PSD parameters | ELM | Accuracy of 82.22% (SI) | The proposed HSF-ELM outperformed some previous studies. | Same as for [27]. |
| [38] | BES with 7 emotions | Feature fusion (MFCC and Prosody), and two different feature selection approaches (SFS and SFFS) | LDA, RDA, SVM and KNN | Accuracy of 92.70% (SI) | The experimental results showed that the RDA and SVM classifiers with SFFS features give the best emotion recognition rate. | Shared by [38] and [44]: 1. These works were evaluated based on only one scenario, which is SI. 2. The results of these methods are still not encouraging and need more improvement. |
| [44] | BES with 7 emotions | AAMSFS | MLP, SVM, and k-NN | Accuracy of 84.62% (SI) | The results showed that the AAMSFS and SVM achieved the best accuracy rate. | Same as for [38]. |
| [36] | TES with 4 emotions | Combining nonlinear value level, linear speaker level, and feature vector level normalization | ELM | Accuracy of 79.00% (SI) | The experimental results proved the superiority of the ELM over SVM. | 1. The work was evaluated based on only one scenario, which is SI. 2. The work was evaluated based on only 4 emotions. 3. The optimization of ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 4. The results are not encouraging and need more improvement. |
2 Method
The general diagram of the proposed ESR system using the OGA-ELM method is illustrated in Fig. 2. It consists of four phases that are used to build the ESR system based on the speech signal. The first phase is the speech dataset covering different human emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). The second phase is the pre-processing of the speech signal samples. In the third phase, the MFCC technique is used to extract the needed features from the utterances. Finally, in the fourth phase, the extracted features are fed into the OGA-ELM classifier in order to identify human emotions based on the speech signal. The OGA-ELM is based on the Optimised Genetic Algorithm (OGA) [4], where the Genetic Algorithm (GA) was chosen and optimized in order to improve the classification performance of the ELM. The GA was selected because it is one of the most popular optimization algorithms, mainly due to its ease of implementation and its support in many libraries [23, 57]. Additionally, the GA has a good global search capability and is considered one of the essential technologies associated with modern intelligent computation [59]. Moreover, the GA is resource-friendly, as it has been reported to find better solutions faster than other optimization algorithms [13, 26]. The four phases of the proposed ESR system are discussed in the following subsections.
2.1 Dataset
In this work, the BES (Berlin Emotional Speech) dataset [14] was selected for evaluation purposes. The BES dataset is a standard dataset that is frequently used by emotion classification researchers [18, 27, 45, 58]. It contains 533 emotional speech utterances from 10 professional German actors (5 males and 5 females), covering 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). The actors were asked to express 10 sentences with these 7 emotions. The audio files range from 1 to 8 s in duration, but in this study we used a fixed duration (see subsection 2.2). Further details about the BES dataset are provided in [14, 17]. Table 2 provides the details of the BES dataset. This study used 80% of the dataset for training and the remaining 20% for testing in all the evaluation scenarios.
2.2 Pre-processing
This section discusses the pre-processing used in this study. The BES dataset consists of utterances (the term utterance refers to the speech signal) of different durations in a range of 1–8 s. Therefore, this study applied a two-step pre-processing. The first step is to read the utterances from the .wav files. The second step is to fix the duration of all utterances to 1 s, which corresponds to 19,608 samples. The output of the pre-processing step is the utterance vector of (19,608 × 1) samples, which is the input to the feature extraction process.
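The second pre-processing step can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the fixed length is obtained, so truncating from the start of the signal and zero-padding short signals are assumptions, and `fix_length` is a hypothetical helper name.

```python
import numpy as np

TARGET_LEN = 19_608  # fixed utterance length in samples, as stated in the text


def fix_length(signal, target_len=TARGET_LEN):
    """Return a (target_len, 1) column vector.

    Assumption (not stated in the paper): longer signals are truncated
    from the start, shorter signals are zero-padded at the end.
    """
    signal = np.asarray(signal, dtype=np.float64).ravel()
    if signal.size >= target_len:
        fixed = signal[:target_len]
    else:
        fixed = np.pad(signal, (0, target_len - signal.size))
    return fixed.reshape(-1, 1)
```

Any loader that returns the samples of a .wav file as a one-dimensional array could feed this helper; the returned column vector has the (19,608 × 1) shape expected by the feature extraction step.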
2.3 Features extraction
The MFCC [19, 21] feature extraction for ESR in this study begins with segmentation, where the utterance vector (19,608 × 1) obtained from the pre-processing step is divided into 25 ms frames with a 10 ms frame shift. This is followed by the extraction of thirteen MFCCs and the application of vocal tract length normalisation. Subsequently, cepstral mean and variance normalisation is performed together with RASTA filtering. Figure 3 illustrates the entire process of extracting the MFCC features. The stages are as follows:
Pre-emphasis: The first stage in MFCC feature extraction, which aims to boost the amount of energy in the high frequencies.
Windowing: Segments the utterance into frames.
Fast Fourier Transform (FFT): Converts the time domain signal to a frequency domain signal, because the features exist in the frequency domain when dealing with speech data.
Magnitude: Calculates the power spectrum of each frame.
Vocal Tract Length Normalisation (VTLN): Compensates for the fact that speakers have vocal tracts of different sizes.
Mel-Filter Bank: Approximates how much energy occurs in each frequency band.
Log: Ensures that the high and low frequencies are separated, to simulate the human hearing system.
RASTA Filtering: In this step, the values of the first four frames of the array resulting from the previous step are set to zero to avoid a significant initial spike arising from the 'dc' offset level in each band. Each row of the remaining array is then band-pass filtered using a filter with a sharp spectral zero at the zero frequency, which suppresses any constant or slowly varying component in each row.
Fig. 2 Diagram of the proposed ESR system. Pipeline shown: speech signal (utterances with different durations) → pre-processing in two steps (19608 × 1) → feature extraction, MFCC with 13 cepstral coefficients (13 × 42) → reshape of the MFCC to a one-row vector (1 × 546) → classification with OGA-ELM → identified emotion (e.g., happiness, disgust, anxiety)
Table 2 Description of the BES dataset [14]

| Emotion | Anxiety | Disgust | Happiness | Boredom | Neutral | Sadness | Anger |
|---|---|---|---|---|---|---|---|
| Number of all utterances | 68 | 46 | 71 | 81 | 79 | 62 | 126 |
| Number of female utterances | 37 | 36 | 51 | 51 | 51 | 44 | 80 |
| Number of male utterances | 31 | 10 | 20 | 30 | 28 | 18 | 46 |
| Emotion label | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Discrete Cosine Transform (DCT): Converts the log Mel spectrum back to the time domain.
Cepstral Mean and Variance Normalisation (CMVN): Reduces the effects of convolutional channel distortion, noise, and speaker variations by forcing all utterances to have zero mean and unit variance.
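The first four stages of the list (pre-emphasis, windowing, FFT, and magnitude) can be sketched with NumPy as below. This is a rough illustration, not the authors' implementation: the pre-emphasis coefficient 0.97 and the Hamming window are common defaults that the text does not specify, and the FFT size of 2048 is an assumption chosen so that the number of frequency bins (1025) matches the size annotated in Fig. 3. The later stages (VTLN, Mel-filter bank, log, RASTA, DCT, CMVN) are not covered.

```python
import numpy as np


def power_spectrum_frames(x, frame_len=1103, frame_shift=441,
                          n_fft=2048, pre_emph=0.97):
    """Pre-emphasis, Hamming-windowed framing, FFT and power spectrum.

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    pre_emph, the Hamming window and n_fft are assumed values.
    """
    x = np.asarray(x, dtype=np.float64).ravel()
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n - 1]
    y = np.concatenate(([x[0]], x[1:] - pre_emph * x[:-1]))
    # Number of full frames that fit into the signal
    n_frames = (y.size - frame_len) // frame_shift + 1
    # Row i of idx holds the sample indices of frame i
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = y[idx] * np.hamming(frame_len)
    # Real FFT of each zero-padded frame, then squared magnitude
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return (np.abs(spectrum) ** 2).T
```

For a 19,608-sample utterance with the frame size and frame shift used in this study, the function returns a (1025 × 42) array, one column per frame.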
The output of the MFCC is an array of size (13 × number of frames) for each utterance, i.e., (13 × 42). The MFCC features (13 × 42) of each utterance are then reshaped to a one-row vector (1 × 546), which is the input to the classification step. Table 3 describes the values of the MFCC variables used in this study. Because the frame size and the frame shift in samples depend on the sampling rate, this study set the sampling rate to 44,100 Hz instead of 16 kHz. The reason is to increase the frame size in samples and decrease the number of frames [11, 34]. The frame size and the frame shift in samples are calculated as shown in Eqs. (1) and (2), while the number of frames is calculated as shown in Eq. (3).
$$N_w = 10^{-3}\, T_w \times \text{Sampling rate} \tag{1}$$

where $N_w$ is the frame size in samples, $T_w$ is the frame size in time (25 ms), and the sampling rate is 44,100 Hz.

The frame shift in samples is:

$$N_s = 10^{-3}\, T_s \times \text{Sampling rate} \tag{2}$$

where $N_s$ is the frame shift size in samples and $T_s$ is the frame shift size in time (10 ms).

The number of frames in each utterance is given in Eq. (3):

$$\text{number of frames} = \frac{\text{length of utterance in samples} - N_w}{N_s} + 1 \tag{3}$$

With a length of utterance of 19,608 samples, $N_w = 1103$, and $N_s = 441$, the number of frames is 42. As a result, there will be an array of size ($N_w$ × number of frames), i.e., (1103 × 42).
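The arithmetic of Eqs. (1) to (3) can be checked with a few lines of Python. Two details are assumptions, since the text does not state them: the frame size is rounded to the nearest integer (1102.5 becomes 1103), and the quotient in Eq. (3) is rounded down so that only complete frames are counted. The function names are hypothetical.

```python
def ms_to_samples(duration_ms, sampling_rate):
    """Eqs. (1)-(2): 1e-3 * T * sampling rate, rounded half up (assumption)."""
    return (duration_ms * sampling_rate + 500) // 1000


def number_of_frames(n_samples, frame_size, frame_shift):
    """Eq. (3), keeping only complete frames (floor division is an assumption)."""
    return (n_samples - frame_size) // frame_shift + 1


nw = ms_to_samples(25, 44_100)               # frame size in samples
ns = ms_to_samples(10, 44_100)               # frame shift in samples
n_frames = number_of_frames(19_608, nw, ns)  # frames per utterance
n_features = 13 * n_frames                   # length of the reshaped MFCC vector
print(nw, ns, n_frames, n_features)          # 1103 441 42 546
```

The last value, 546, is the length of the one-row feature vector (1 × 546) that is passed to the classifier.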
2.4 Classification
2.4.1 Review of ELM
The basic ELM algorithm for training SLFNs was proposed by [30]. The main idea behind ELM is that the hidden layer weights and biases are generated randomly. The output weights are then calculated using the least-squares solution defined by the outputs of the hidden layer and the targets. An overview of the ELM structure and the training algorithm is shown in Fig. 4. A brief description of the ELM follows.
Given a set of $N$ distinct samples $(X_j, t_j)$, where $X_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T \in R^m$, the SLFN is mathematically modelled as in Eq. (4).
Fig. 3 Block diagram of the process of extracting MFCC features. Stages shown, in order: utterance in samples → pre-emphasis filter → windowing → Fast Fourier Transform (FFT) → magnitude → Vocal Tract Length Normalization (VTLN) → Mel-filter bank → log → RASTA filtering → Discrete Cosine Transform → Cepstral Mean and Variance Normalization → MFCC features. Annotated array sizes: (19608 × 1) for the utterance, (1103 × 42) after windowing, (1025 × 42) for the spectrum, (20 × 42) after the Mel-filter bank, and (13 × 42) for the MFCC features
Table 3 Values of the MFCC variables used in this study

| Variable | Value |
|---|---|
| Sampling rate | 44,100 Hz |
| Utterance duration before pre-processing | In a range of 1–8 seconds |
| Utterance duration after pre-processing | One second duration, which is 19,608 samples |
| Frame size in time | 25 milliseconds |
| Frame shift in time | 10 milliseconds |
| Frame size in samples (Nw) | 1103 |
| Frame shift in samples (Ns) | 441 |
| Number of frames of one utterance | 42 |
| MFCC features for one utterance before reshape | (13 × 42) |
| MFCC features for one utterance after reshape | (1 × 546) |
$$\sum_{i=1}^{L} \beta_i\, g_i(X_j) = \sum_{i=1}^{L} \beta_i\, g(W_i \cdot X_j + b_i) = o_j, \quad j = 1, \ldots, N \tag{4}$$
where
$W_i = [W_{i1}, W_{i2}, \ldots, W_{in}]^T$ is the weight vector connecting the $i$th hidden node and the input nodes;
$\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node and the output nodes;
$b_i$ is the threshold of the $i$th hidden node;
$W_i \cdot X_j$ is the inner product of $W_i$ and $X_j$, and the output nodes are chosen to be linear;
$L$ is the number of hidden layer nodes. A standard SLFN with $L$ hidden nodes and activation function $g(x)$ can approximate these $N$ samples with zero error.
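Thus, $\sum_{j=1}^{N} \lVert o_j - t_j \rVert = 0$; that is, there exist $\beta_i$, $W_i$ and $b_i$ such that Eq. (5) holds.

$$\sum_{i=1}^{L} \beta_i\, g(W_i \cdot X_j + b_i) = t_j, \quad j = 1, \ldots, N \tag{5}$$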
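The $N$ equations above can be written compactly as:

$$H\beta = T \tag{6}$$

where

$$H(W_1, \ldots, W_L, b_1, \ldots, b_L, X_1, \ldots, X_N) = \begin{bmatrix} g(W_1 \cdot X_1 + b_1) & \cdots & g(W_L \cdot X_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(W_1 \cdot X_N + b_1) & \cdots & g(W_L \cdot X_N + b_L) \end{bmatrix}_{N \times L}, \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}$$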
Fig. 4 Diagram of the ELM [6]
Following [30], $H$ is called the hidden layer output matrix of the neural network; the $i$th column of $H$ is the output of the $i$th hidden node with respect to the inputs $X_1, \ldots, X_N$. If the activation function $g$ is infinitely differentiable, the required number of hidden nodes satisfies $L \le N$, and Eq. (6) becomes a linear system. Furthermore, the output weights $\beta$ can be determined analytically by finding a least-squares solution as follows:

$$\beta = H^{\dagger} T \tag{7}$$

where $H^{\dagger}$ is the Moore–Penrose generalised inverse of $H$. Thus, the output weights are calculated using a mathematical transformation without going through a lengthy training phase.
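The basic ELM training described by Eqs. (4) to (7) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function names are hypothetical, and the sigmoid activation and the sampling ranges ([−1, 1] for weights, [0, 1] for biases) are taken from the parameter settings reported later in the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_elm(X, T, n_hidden, rng):
    """Train a basic ELM. X is (N, n), T is (N, m).

    Returns random input weights W (L, n), biases b (L,) and the
    least-squares output weights beta (L, m).
    """
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_features))  # random, never trained
    b = rng.uniform(0.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W.T + b)          # hidden layer output matrix, (N, L)
    beta = np.linalg.pinv(H) @ T      # Eq. (7): beta = pinv(H) T
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Network output o_j of Eq. (4) for every row of X."""
    return sigmoid(X @ W.T + b) @ beta
```

With a random generator such as `np.random.default_rng(0)`, a feature matrix `X` and a target matrix `T`, the call `train_elm(X, T, 100, rng)` returns the three parameter arrays; the predicted class of a sample can then be read from the largest entry of its output row if the targets are one-hot encoded, which is itself an assumption about the target encoding.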
The absence of a specific approach for determining the input-hidden layer weights is a major drawback of the ELM, which makes it prone to local minima. This means that, for the given training data, there is no way to ensure that the trained ELM is the most appropriate one for performing the classification. Overcoming this drawback requires integrating an optimisation approach with the ELM, so that the optimal weights can be identified and the ELM's best performance can be achieved. The following subsection explains the concept of OGA-ELM.
2.4.2 OGA-ELM
This study adopted the OGA-ELM from [4], which is derived from the OGA, to classify the emotion speech signal dataset into seven emotion classes. It uses a single selection criterion, and the input weights and the biases of the hidden nodes are tuned using the selection, crossover and mutation operations. Table 4 shows the parameters of the ELM and OGA used in this study's experiments.
$N$ is a collection of feature samples $(X_j, t_j)$, where $X_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T \in R^m$, where:
$X_j$ is the input, i.e., the features extracted using MFCC;
$t_j$ is the true value (expected output).
At the beginning of OGA-ELM, the values of the input weights and the thresholds of the hidden nodes are randomly initialised and encoded as chromosomes:

$$ch = \{w_{11}, w_{12}, \ldots, w_{1n}, w_{21}, w_{22}, \ldots, w_{2n}, \ldots, w_{L1}, w_{L2}, \ldots, w_{Ln}, b_1, \ldots, b_L\}$$

where:
$w_{ij}$ is the weight connecting the $i$th hidden node and the $j$th input node, $w_{ij} \in [-1, 1]$;
$b_i$ is the bias of the $i$th hidden node, $b_i \in [0, 1]$;
$n$ is the number of input nodes; and
$L$ is the number of hidden nodes.
The chromosome dimensionality is $(1 + n) \times L$, that is, $(1 + n) \times L$ parameters need to be optimised.
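The chromosome encoding above can be illustrated with a short NumPy sketch. The function names are hypothetical, and storing the weights row by row (all weights of hidden node 1, then hidden node 2, and so on) followed by the biases simply mirrors the ordering written in the chromosome definition.

```python
import numpy as np


def random_chromosome(n_inputs, n_hidden, rng):
    """Flat chromosome: L*n input weights in [-1, 1], then L biases in [0, 1]."""
    weights = rng.uniform(-1.0, 1.0, size=n_hidden * n_inputs)
    biases = rng.uniform(0.0, 1.0, size=n_hidden)
    return np.concatenate([weights, biases])


def decode_chromosome(ch, n_inputs, n_hidden):
    """Split a chromosome back into W (L, n) and b (L,)."""
    split = n_hidden * n_inputs
    W = ch[:split].reshape(n_hidden, n_inputs)
    b = ch[split:]
    return W, b
```

For the 546-dimensional MFCC vectors of this study and, for example, 100 hidden nodes, each chromosome holds (1 + 546) × 100 = 54,700 values.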
The fitness function of OGA-ELM, which is used to maximise the accuracy, is calculated as shown in Eq. (8).
Table 4 Parameters of the ELM and OGA [4]

| ELM parameter | Value | OGA parameter | Value |
|---|---|---|---|
| ch | Combined bias and input weight | Number of iterations | 500 |
| ρ | Output weight matrix | Population size | 100 |
| Input weight | −1 to 1 | Crossover | Arithmetical |
| Value of the biases | 0–1 | Mutation | Uniform |
| Input node numbers | Input attributes | Population of the crossover (POPC) | Refers to the crossover population, which is 70% of the population. |
| Hidden node numbers | (100–300), with step or increment of 25 | Population of the mutation (POPM) | Refers to the mutation population, which is 30% of the population. |
| Output neuron (m) | Class value | Gamma value | 0.4 |
| Activation function | Sigmoid | Selection criteria | Random |
| Regularization factor (C) | −5 | | |
$$f(ch) = \sqrt{\frac{\sum_{j=1}^{N} \left\lVert \sum_{k=1}^{L} \rho_k\, g(w_k \cdot x_j + b_k) - t_j \right\rVert_2^2}{N}} \tag{8}$$

where:
$\rho$ is the matrix of the output weights;
$t_j$ is the expected output; and
$N$ is the number of training samples.
The output weights are computed as:

$$\rho = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T \tag{9}$$
t// iteration number
PP ={ch
1
,ch
2
,..ch
100
} Randomly initial the input weights and biases
Train the ELM and calculate the fitness value of each variables according to Eq. (8)
While (not termination condition) do
t+1
t
While |POPC| ≤ |70% P| do
Based on random selection criteria selects a pair of parents for crossover
Mate the parents to create children (
Child
1
and
Child
2
)
end while
end while
POPC {} // initialise the crossover population
POPC {
Child
1,
Child
2
}
POPM {} // initialise the mutation population
While |POPM| ≤ |30% P| do
randomly, select a parent for mutation
perform mutation to create a child (
Child
M
)
POPM
Child
M
end while
P {POPC, POPM} // Merge POPC and POPM to get the next genration
Train the ELM and calculate the fitness value of each variables according to Eq. (8)
Sort the P based on their fitness values.
Initial population:
Evaluation:
Genetic operators:
Get the optimal weights thresholds between input layer and hidden layer
5: Begin the parameter optimization OGA
6:
8:
9:
10:
11:
12:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
End
28:
1: Start
2: Load the Dataset
3: Divide the Dataset to Training set and Testing set
Select the activation function and determine the hidden layer nodes L4:
7:
13:
Calculate the output matrix H of the hidden layer Eq. (10)
Calculate the output weights ρ according to Eq. (9)
Save the predicting ELM model
The prediction results of crop evapotranspiration
Calculate the average accuracy rate
30:
31:
32:
33:
34:
End the parameter optimization OGA
29:
Fig. 5 Pseudocode of the OGA-ELM [4]
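$$H = \begin{bmatrix} g(w_1 \cdot X_1 + b_1) & \cdots & g(w_L \cdot X_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot X_N + b_1) & \cdots & g(w_L \cdot X_N + b_L) \end{bmatrix}_{N \times L}, \quad \rho = \begin{bmatrix} \rho_1^T \\ \vdots \\ \rho_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} \tag{10}$$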
T = [t1, t2, …, tN]^T refers to the expected output vector of the training set, and ρ = [ρ1, ρ2, …, ρL]^T is the output weights matrix. I is the identity matrix and C is the regularization factor, which can be obtained by cross validation in the training process. H in Eq. (10) is the hidden layer output matrix of the ELM network; the ith column of H is the output of the ith hidden layer node with respect to the input nodes. The activation function g is infinitely differentiable when the desired number of hidden nodes is L ≤ N. A detailed explanation of the OGA-ELM is provided in the following steps:
First, generate the initial population (P) randomly, P = {ch1, ch2, …, ch100}.
Second, calculate the fitness value of each chromosome (ch) of the population using Eq. (8).
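The first step and the hidden layer output of Eq. (10) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MATLAB implementation used in the experiments: the chromosome layout (all input-hidden weights followed by the hidden biases), the sigmoid activation, and the helper names `init_population`, `decode`, and `hidden_output` are hypothetical. The bounds [−1, 1] and [0, 1] are the ones given for the genes later in this section, and the fitness of Eq. (8) is deliberately left out.

```python
import math
import random


def init_population(pop_size, n_inputs, n_hidden, rng):
    """Create pop_size random chromosomes.

    Assumed layout: n_inputs * n_hidden weights in [-1, 1],
    followed by n_hidden biases in [0, 1].
    """
    population = []
    for _ in range(pop_size):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs * n_hidden)]
        biases = [rng.uniform(0.0, 1.0) for _ in range(n_hidden)]
        population.append(weights + biases)
    return population


def decode(chromosome, n_inputs, n_hidden):
    """Split a chromosome into per-node weight vectors w_j and biases b_j."""
    split = n_inputs * n_hidden
    w = [chromosome[j * n_inputs:(j + 1) * n_inputs] for j in range(n_hidden)]
    b = chromosome[split:]
    return w, b


def sigmoid(z):
    """Assumed activation function g."""
    return 1.0 / (1.0 + math.exp(-z))


def hidden_output(samples, w, b, g=sigmoid):
    """Return H (N x L) with H[i][j] = g(w_j . X_i + b_j), as in Eq. (10)."""
    return [
        [g(sum(wk * xk for wk, xk in zip(w_j, x)) + b_j) for w_j, b_j in zip(w, b)]
        for x in samples
    ]
```

Each row of the returned matrix corresponds to one training sample and each column to one hidden node, which matches the N × L shape of H in Eq. (10).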
Fig. 6 Flowchart of the OGA-ELM [4], consisting of an ELM Training block and a Parameter Optimization block
Third, the chromosomes are arranged based on their fitness values f(ch). Next, a random selection criterion is used to select a pair of parents from the present population for the crossover operation, which creates a pair of new children for the new population. The random selection criterion randomly picks a chromosome from the population to be used in one of the two operations, crossover or mutation. Under this criterion, every chromosome of the population has an equal chance of being chosen.
Fourth, the arithmetic crossover is applied to exchange information between the two previously selected parents. The new children obtained by the crossover operations are saved into the Population of the Crossover (POPC) until it reaches 70% of the population. The arithmetic crossover is defined by the following formulae:
\[ \mathrm{Child}_1 = \alpha \cdot x + (1 - \alpha) \cdot y \tag{11} \]
\[ \mathrm{Child}_2 = \alpha \cdot y + (1 - \alpha) \cdot x \tag{12} \]
The children are subject to the boundaries of the genes: [−1, 1] for the input-hidden layer weights and [0, 1] for the hidden layer biases. If the value of a gene exceeds the upper bound, it is set to the upper bound; if it falls below the lower bound, it is set to the lower bound. α is a randomly generated array with the size of the chromosome, and each of its values is drawn from the range (−γ, 1 + γ), which is (−0.4, 1.4). x and y represent the first and second selected parents.
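A minimal Python sketch of the arithmetic crossover of Eqs. (11) and (12), including the clipping to the gene bounds, is given below. The function name `arithmetic_crossover` and the representation of the bounds as per-gene lists are assumptions made for illustration; γ = 0.4 follows from the stated range (−0.4, 1.4).

```python
import random


def arithmetic_crossover(x, y, lower, upper, rng, gamma=0.4):
    """Arithmetic crossover of Eqs. (11)-(12) with per-gene alpha and clipping.

    x, y         : parent chromosomes (lists of floats of equal length)
    lower, upper : per-gene bounds, e.g. -1/1 for weights and 0/1 for biases
    """
    child1, child2 = [], []
    for xi, yi, lo, hi in zip(x, y, lower, upper):
        alpha = rng.uniform(-gamma, 1.0 + gamma)  # one alpha value per gene
        c1 = alpha * xi + (1.0 - alpha) * yi      # Eq. (11)
        c2 = alpha * yi + (1.0 - alpha) * xi      # Eq. (12)
        child1.append(min(max(c1, lo), hi))       # clip to [lo, hi]
        child2.append(min(max(c2, lo), hi))
    return child1, child2
```

Because α may fall outside [0, 1], a child gene can lie outside the interval spanned by its parents, which is why the clipping step is needed.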
Table 5 Description of the BES dataset used in the SI scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of all utterances 68 46 71 81 79 62 126
Number of training utterances 54 37 57 65 63 50 101
Number of testing utterances 14 9 14 16 16 12 25
Emotion label 1 2 3 4 5 6 7
Table 6 Overall results of the OGA-ELM in the SI scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
100 81 611 25 25 93.26 76.42 76.42 76.42 76.42 381.9471
125 76 606 30 30 91.91 71.70 71.70 71.70 71.70 395.5644
150 69 599 37 37 90.03 65.09 65.09 65.09 65.09 410.5236
175 73 603 33 33 91.11 68.87 68.87 68.87 68.87 426.4971
200 64 594 42 42 88.68 60.38 60.38 60.38 60.38 438.4431
225 78 608 28 28 92.45 73.58 73.58 73.58 73.58 451.8474
250 69 599 37 37 90.03 65.09 65.09 65.09 65.09 465.4381
275 67 597 39 39 89.49 63.21 63.21 63.21 63.21 478.1978
300 76 606 30 30 91.91 71.70 71.70 71.70 71.70 490.1058
Fifth, the random selection criterion is also used to choose a chromosome from the present population before mutation is applied. Mutation alters randomly selected genes of the chromosome. This work utilises uniform mutation, which substitutes the value of a selected gene with a uniform random value chosen between the gene's user-specified lower and upper bounds ([−1, 1] for the input-hidden layer weights and [0, 1] for the hidden layer biases). The new child obtained from mutation is saved into the Population of the Mutation (POPM) until the POPM reaches 30% of the population.
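The uniform mutation described above can be sketched as follows. The text does not state how many genes are altered per mutation, so the per-gene selection probability `rate` is an assumed parameter, and the name `uniform_mutation` is hypothetical.

```python
import random


def uniform_mutation(parent, lower, upper, rng, rate=0.1):
    """Return a mutated copy of parent; the parent itself is left unchanged.

    Each gene is selected with probability `rate` (an assumed parameter) and
    replaced by a uniform random value from its own [lower, upper] range.
    If no gene is selected, one gene is picked at random and resampled.
    """
    child = list(parent)
    selected = [i for i in range(len(child)) if rng.random() < rate]
    if not selected:
        selected = [rng.randrange(len(child))]
    for i in selected:
        child[i] = rng.uniform(lower[i], upper[i])
    return child
```

Since the new value is drawn inside the gene's own bounds, no clipping is required after mutation.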
After the selection, crossover, and mutation operations are completed, a new population is created by merging the POPC and the POPM. The next iteration continues with this new population, and the process is repeated. The iterative process stops when either the results have converged or the number of iterations exceeds the maximum limit. The pseudocode and flowchart of the OGA-ELM are shown in Figs. 5 and 6, respectively.
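The iterative process can be sketched in Python as below. The sketch takes the fitness, crossover, and mutation operators as callbacks so that it stays independent of the ELM. The names `next_generation` and `run_ga`, the strict `<` comparison used to reach exactly 70% and 30%, and the `patience` rule used as a convergence test are assumptions; the limit of 500 iterations is the value reported for the experiments.

```python
import random


def next_generation(population, crossover, mutate, rng, crossover_share=0.7):
    """Build the next population from POPC (crossover) and POPM (mutation)."""
    size = len(population)
    n_popc = int(round(crossover_share * size))
    popc, popm = [], []
    while len(popc) < n_popc:
        p1, p2 = rng.sample(population, 2)    # random selection, equal chance
        popc.extend(crossover(p1, p2))        # two children per crossover
    popc = popc[:n_popc]                      # drop a surplus child if n_popc is odd
    while len(popm) < size - n_popc:
        popm.append(mutate(rng.choice(population)))
    return popc + popm                        # merge POPC and POPM


def run_ga(population, fitness, crossover, mutate, rng, max_iter=500, patience=50):
    """Iterate until max_iter or until the best fitness stops improving."""
    best, best_fit, stale = None, float("-inf"), 0
    for _ in range(max_iter):
        scored = sorted(population, key=fitness, reverse=True)
        top_fit = fitness(scored[0])
        if top_fit > best_fit:
            best, best_fit, stale = scored[0], top_fit, 0
        else:
            stale += 1
        if stale >= patience:                 # assumed convergence criterion
            break
        population = next_generation(scored, crossover, mutate, rng)
    return best, best_fit
```

With a population of 100 chromosomes, `next_generation` produces 70 children by crossover and 30 by mutation, so the population size stays constant across iterations.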
3 Experimental results and discussion
Fig. 7 Confusion matrix for the best results of the OGA-ELM in the SI scenario
Table 7 Best results of the OGA-ELM in the SI scenario for each class (measures in %)
Measure Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 93.40 95.28 90.57 97.17 96.23 96.23 83.96
Precision 100.00 83.33 66.67 93.33 83.33 78.57 61.76
Recall 50.00 55.56 57.14 87.50 93.75 91.67 84.00
F-Measure 66.67 66.67 61.54 90.32 88.24 84.62 71.19
G-Mean 70.71 68.04 61.72 90.37 88.39 84.87 72.03
tp 7 5 8 14 15 11 21
tn 92 96 88 89 87 91 68
fp 0 1 4 1 3 3 13
fn 7 4 6 2 1 1 4
Several experiments were conducted based on four different scenarios: Subject Dependent (SD), Subject Independent (SI), Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). In the SI scenario, we considered neither the content of the utterance (the sentence spoken) nor the gender of the speaker (male or female), and we used the whole BES dataset. In the SD scenario, we considered the content of the utterance and ignored the gender. Since the BES dataset contains 10 different sentences, in this scenario we separated the BES dataset into 10 sub-datasets based on emotions and sentences. Finally, in both the GD-Male and GD-Female scenarios, we considered the gender and ignored the content of the utterance. Thus, the BES dataset for both scenarios was separated based on emotions and gender: the GD-Male scenario used only the utterances recorded by males, whilst the GD-Female scenario used only the utterances recorded by females.
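The partitioning of the dataset into the four scenarios can be illustrated with a short Python sketch. The record format (a dict with the keys `emotion`, `sentence`, and `gender`) and the function name `build_scenarios` are hypothetical; how the metadata is obtained from the BES recordings is not covered here.

```python
from collections import defaultdict


def build_scenarios(utterances):
    """Group utterance records into the four evaluation scenarios.

    Each record is assumed to be a dict with the keys 'emotion',
    'sentence' (e.g. 'a02') and 'gender' ('male' or 'female').
    """
    scenarios = {"SI": list(utterances), "GD-Male": [], "GD-Female": []}
    by_sentence = defaultdict(list)
    for u in utterances:
        by_sentence[u["sentence"]].append(u)      # SD: one sub-dataset per sentence
        key = "GD-Male" if u["gender"] == "male" else "GD-Female"
        scenarios[key].append(u)                  # GD: split by speaker gender
    scenarios["SD"] = dict(by_sentence)
    return scenarios
```

The SI entry keeps every utterance, the SD entry maps each sentence code to its own sub-dataset, and the two GD entries hold the utterances of male and female speakers, respectively.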
The OGA-ELM was applied in several experiments in each scenario, based on the scenario's dataset, with a varying number of hidden neurons, and each experiment had 500 iterations. All the experiments were implemented in MATLAB R2019a on a PC with a 3.20 GHz Core i7 processor, 16 GB RAM, and a 1 TB SSD (Windows 10). The evaluation in this study is based on [50], where varying measures were applied; [50] was selected because it tackles the classifier evaluation issue and presents effective measurements. The performance of learning algorithms can be evaluated in numerous ways in Supervised Machine Learning (SML). Furthermore, the confusion matrix of the recognized examples of each class, according to their correct recognition rate, is presented in order to evaluate the quality of the classification.
Table 8 Description of the BES dataset used in the SD scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of all utterances 7 9 6 8 9 7 14
Number of training utterances 6 7 5 6 7 6 11
Number of testing utterances 1 2 1 2 2 1 3
Sentence code a02 a02 a02 a02 a02 a02 a02
Emotion label 1 2 3 4 5 6 7
Table 9 Best overall results of the OGA-ELM in the SD scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
100 12 72 0 0 100.00 100.00 100.00 100.00 100.00 253.8953
125 10 70 2 2 95.24 83.33 83.33 83.33 83.33 261.1620
150 11 71 1 1 97.62 91.67 91.67 91.67 91.67 270.1585
175 8 68 4 4 90.48 66.67 66.67 66.67 66.67 276.5632
200 9 69 3 3 92.86 75.00 75.00 75.00 75.00 282.2085
225 11 71 1 1 97.62 91.67 91.67 91.67 91.67 290.1031
250 9 69 3 3 92.86 75.00 75.00 75.00 75.00 296.8769
275 10 70 2 2 95.24 83.33 83.33 83.33 83.33 301.2103
300 8 68 4 4 90.48 66.67 66.67 66.67 66.67 304.7360
Therefore, numerous evaluation measurements were utilized to evaluate the proposed OGA-ELM approach. The evaluation measurements rely on the ground truth: the model is applied to predict the answer on the evaluation dataset, and the predicted target is then compared with the actual answer. The proposed OGA-ELM approach was evaluated in terms of accuracy, precision, recall, F-measure, and G-mean. Eqs. (13–17) [1, 5, 8] define these evaluation measurements.
Fig. 8 Confusion matrix for the best results of the OGA-ELM in the SD scenario
Table 10 Best results of the OGA-ELM in the SD scenario for each class (measures in %)
Measure Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Precision 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Recall 100.00 100.00 100.00 100.00 100.00 100.00 100.00
F-Measure 100.00 100.00 100.00 100.00 100.00 100.00 100.00
G-Mean 100.00 100.00 100.00 100.00 100.00 100.00 100.00
tp 1 2 1 2 2 1 3
tn 11 10 11 10 10 11 9
fp 0 0 0 0 0 0 0
fn 0 0 0 0 0 0 0
\[ \text{accuracy} = \frac{tp + tn}{tp + tn + fn + fp} \tag{13} \]
\[ \text{precision} = \frac{tp}{tp + fp} \tag{14} \]
\[ \text{recall} = \frac{tp}{tp + fn} \tag{15} \]
\[ \text{F-Measure} = \frac{2\,(\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \tag{16} \]
\[ \text{G-Mean} = \sqrt{\text{recall} \times \text{precision}} \tag{17} \]
where tn refers to true-negative, tp refers to true-positive, fn refers to false-negative, and fp refers to false-positive.
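Eqs. (13)–(17) can be evaluated directly from the four counts, as in the following Python sketch. The function name `evaluate`, the returned dictionary, and the convention of returning 0 when a denominator is zero are assumptions; the values are reported as percentages rounded to two decimals, as in the tables.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return the measures of Eqs. (13)-(17) as percentages."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)              # Eq. (13)
    precision = tp / (tp + fp) if tp + fp else 0.0          # Eq. (14)
    recall = tp / (tp + fn) if tp + fn else 0.0             # Eq. (15)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0  # Eq. (16)
    g_mean = math.sqrt(precision * recall)                  # Eq. (17)
    return {name: round(100 * value, 2) for name, value in [
        ("accuracy", accuracy), ("precision", precision), ("recall", recall),
        ("f_measure", f_measure), ("g_mean", g_mean)]}
```

For example, `evaluate(21, 68, 13, 4)` uses the counts of the Anger class in Table 7 and reproduces the Anger column of that table.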
The results of the four scenarios (SI, SD, GD-Male, and GD-Female) are presented and discussed separately in the following sub-sections.
3.1 Subject Independent (SI) Scenario
This section presents and discusses the performance of the OGA-ELM in the SI scenario. In the SI scenario, we considered neither the content of the utterance (the sentence spoken) nor the gender of the speaker (male or female), and we used the whole BES dataset. Table 5 describes the dataset used in this scenario. 80% of the dataset (427 utterances) was used as the training set, while the remaining 20% (106 utterances) was used as the testing set.
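The text does not say how the 80/20 split was drawn. One reading that reproduces the per-class counts of Table 5 is to take 80% of each emotion class, rounded to the nearest integer, for training and the rest for testing. The sketch below checks this assumption; the function name `split_counts` is hypothetical.

```python
def split_counts(class_totals, train_share_percent=80):
    """Per-class (train, test) sizes, rounding the training share to the nearest integer."""
    result = {}
    for label, total in class_totals.items():
        train = (total * train_share_percent + 50) // 100   # integer round-half-up
        result[label] = (train, total - train)
    return result


# Class totals from Table 5 (SI scenario).
table5 = {"Anxiety": 68, "Disgust": 46, "Happiness": 71, "Boredom": 81,
          "Neutral": 79, "Sadness": 62, "Anger": 126}
counts = split_counts(table5)
```

Under this assumption the training and testing sizes sum to 427 and 106 utterances, in agreement with the figures quoted above.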
Table 11 Description of the dataset used in the GD-Female scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of all utterances 37 36 51 51 51 44 80
Number of training utterances 30 27 41 41 41 35 64
Number of testing utterances 7 7 10 10 10 9 16
Emotion label 1 2 3 4 5 6 7
Table 12 Best overall results of the OGA-ELM in the GD-Female scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
100 62 407 7 7 97.10 89.86 89.86 89.86 89.86 332.5520
125 56 401 13 13 94.62 81.16 81.16 81.16 81.16 340.8604
150 52 397 17 17 92.96 75.36 75.36 75.36 75.36 351.3987
175 41 386 28 28 88.41 59.42 59.42 59.42 59.42 360.3810
200 53 398 13 13 93.37 76.81 76.81 76.81 76.81 372.7795
225 55 400 14 14 94.20 79.71 79.71 79.71 79.71 381.8452
250 50 395 19 19 92.13 72.46 72.46 72.46 72.46 393.0418
275 53 398 13 13 93.37 76.81 76.81 76.81 76.81 405.2617
300 48 393 21 21 91.30 69.57 69.57 69.57 69.57 413.9911
The best experimental results of the proposed OGA-ELM in the SI scenario were obtained with 100 hidden neurons, where the overall accuracy is 93.26%. The other evaluation measurements reached 76.42%, 76.42%, 76.42%, and 60.89% for precision, recall, F-measure, and G-mean, respectively. The overall results of the evaluation measurements are shown in Table 6. Fig. 7 shows the confusion matrix of the OGA-ELM for the best results, and Table 7 lists the results of the evaluation measures for each class.
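The overall counts in Table 6 appear to be the per-class one-vs-rest counts of Table 7 summed over the seven classes. The following sketch, in which the name `pooled_counts` is hypothetical, performs this pooling on the SI values.

```python
# Per-class (tp, tn, fp, fn) for the SI scenario, copied from Table 7.
per_class = {
    "Anxiety":   (7, 92, 0, 7),
    "Disgust":   (5, 96, 1, 4),
    "Happiness": (8, 88, 4, 6),
    "Boredom":   (14, 89, 1, 2),
    "Neutral":   (15, 87, 3, 1),
    "Sadness":   (11, 91, 3, 1),
    "Anger":     (21, 68, 13, 4),
}


def pooled_counts(counts):
    """Sum one-vs-rest counts over all classes (micro-averaging)."""
    return tuple(sum(c[i] for c in counts.values()) for i in range(4))


tp, tn, fp, fn = pooled_counts(per_class)
overall_accuracy = round(100 * (tp + tn) / (tp + tn + fp + fn), 2)
```

Two consequences follow from this pooling. Every misclassified utterance is counted once as a false negative for its true class and once as a false positive for the predicted class, so the pooled fp and fn are equal and the overall precision and recall coincide. In addition, the pooled accuracy counts the true negatives of every class, so the overall accuracy of 93.26% is higher than the share of correctly classified test utterances, 81 of 106 (76.42%).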
3.2 Subject Dependent (SD) Scenario
This section presents and discusses the best experimental results of the OGA-ELM in the SD scenario. In the SD scenario, we considered the content of the utterance (the sentence spoken) and ignored the gender (male or female). Since the BES dataset contains 10 different sentences, we separated the BES dataset into 10 sub-datasets based on emotions and sentences. The accuracy of the OGA-ELM in the SD scenario ranged from 94.81% to 100.00%. As mentioned earlier, only the highest performance is reported in this study. The highest performance of the OGA-ELM in the SD scenario was obtained with the sentence code “a02”, whose sentence is “Das will sie am Mittwoch abgeben”. Table 8 describes the dataset used in the SD scenario. 80% of the dataset (48 utterances) was used as the training set, while the remaining 20% (12 utterances) was used as the testing set.
Fig. 9 Confusion matrix for the best results of the OGA-ELM in the GD-Female scenario
Table 13 Best results of the OGA-ELM in the GD-Female scenario for each class (measures in %)
Measure Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 98.55 97.10 94.20 98.55 98.55 97.10 95.65
Precision 100.00 85.71 70.00 90.00 90.00 88.89 100.00
Recall 87.50 85.71 87.50 100.00 100.00 88.89 84.21
F-Measure 93.33 85.71 77.78 94.74 94.74 88.89 91.43
G-Mean 93.54 85.71 78.26 94.87 94.87 88.89 91.77
tp 7 6 7 9 9 8 16
tn 61 61 58 59 59 59 50
fp 1 1 1 0 0 1 3
fn 0 1 3 1 1 1 0
The best experimental results of the proposed OGA-ELM in the SD scenario were obtained with 100 hidden neurons, where the overall accuracy is 100.00%. The other evaluation measurements (precision, recall, F-measure, and G-mean) also reached 100.00%. The overall results of the evaluation measurements are shown in Table 9. Fig. 8 shows the confusion matrix of the OGA-ELM for the best results, and Table 10 lists the results of the evaluation measures for each class.
3.3 Gender Dependent Female (GD-Female) Scenario
This section presents and discusses the best experimental results of the OGA-ELM in the GD-Female scenario. In the GD-Female scenario, we considered the gender and ignored the content of the utterance (the sentence spoken). The GD-Female scenario used only the utterances of the BES dataset recorded by females. Table 11 describes the dataset used in the GD-Female scenario. 80% of the dataset (281 utterances) was used as the training set, while the remaining 20% (69 utterances) was used as the testing set.
The best experimental results of the proposed OGA-ELM in the GD-Female scenario were obtained with 100 hidden neurons, where the overall accuracy is 97.10%. The other evaluation measurements reached 89.86%, 89.86%, 89.86%, and 81.32% for precision, recall, F-measure, and G-mean, respectively. The overall results of the evaluation measurements are shown in Table 12. Fig. 9 shows the confusion matrix of the OGA-ELM for the best results, and Table 13 lists the results of the evaluation measures for each class.
Table 14 Description of the dataset used in the GD-Male scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of all utterances 31 10 20 30 28 18 46
Number of training utterances 25 8 16 24 22 14 37
Number of testing utterances 6 2 4 6 6 4 9
Emotion label 1 2 3 4 5 6 7
Table 15 Best overall results of the OGA-ELM in the GD-Male scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
100 32 217 5 5 96.14 86.49 86.49 86.49 86.49 276.6538
125 28 213 9 9 93.05 75.68 75.68 75.68 75.68 280.4774
150 31 216 6 6 95.37 83.78 83.78 83.78 83.78 285.0864
175 24 209 13 13 89.96 64.86 64.86 64.86 64.86 289.9731
200 27 212 10 10 92.28 72.97 72.97 72.97 72.97 294.3590
225 29 214 8 8 93.82 78.38 78.38 78.38 78.38 301.1567
250 27 212 10 10 92.28 72.97 72.97 72.97 72.97 307.5201
275 31 216 6 6 95.37 83.78 83.78 83.78 83.78 315.0164
300 26 211 11 11 91.51 70.27 70.27 70.27 70.27 320.1097
3.4 Gender Dependent Male (GD-Male) Scenario
This section presents and discusses the best experimental results of the OGA-ELM in the GD-Male scenario. In the GD-Male scenario, we considered the gender and ignored the content of the utterance (the sentence spoken). The GD-Male scenario used only the utterances of the BES dataset recorded by males. Table 14 describes the dataset used in the GD-Male scenario. 80% of the dataset (146 utterances) was used as the training set, while the remaining 20% (37 utterances) was used as the testing set.
The best experimental results of the proposed OGA-ELM in the GD-Male scenario were obtained with 100 hidden neurons, where the overall accuracy is 96.14%. The other evaluation measurements reached 86.49%, 86.49%, 86.49%, and 75.55% for precision, recall, F-measure, and G-mean, respectively. The overall results of the evaluation measurements are shown in Table 15. Fig. 10 shows the confusion matrix of the OGA-ELM for the best results, and Table 16 lists the results of the evaluation measures for each class.
Fig. 10 Confusion matrix for the best results of the OGA-ELM in the GD-Male scenario
Table 16 Best results of the OGA-ELM in the GD-Male scenario for each class (measures in %)
Measure Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 100.00 97.30 100.00 94.59 97.30 94.59 89.19
Precision 100.00 100.00 100.00 83.33 100.00 100.00 69.23
Recall 100.00 50.00 100.00 83.33 83.33 50.00 100.00
F-Measure 100.00 66.67 100.00 83.33 90.91 66.67 81.82
G-Mean 100.00 70.71 100.00 83.33 91.29 70.71 83.21
tp 6 1 4 5 5 2 9
tn 31 35 33 30 31 33 24
fp 0 0 0 1 0 0 4
fn 0 1 0 1 1 2 0
Based on all the above-mentioned experimental results (Tables 5–16), we can make a critical observation. The OGA could create suitable weights and biases for the single hidden layer of the ELM in order to minimize classification errors. Avoiding unsuitable weights and biases keeps the ELM from getting stuck in local maxima of the weights and biases. Consequently, the OGA-ELM performed strongly in the four different scenarios, with an accuracy of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively.
Furthermore, the proposed OGA-ELM is compared with some recent works [9, 16, 18, 27, 45, 55, 58, 60] in terms of accuracy in the four scenarios (SI, SD, GD-Male, and GD-Female). All these methods used the BES dataset in their experiments. Table 17 presents the accuracy of the proposed OGA-ELM and of the previous works.
Based on the results in Table 17, the OGA-ELM outperformed all the other previous works in the SI, SD, and GD-Male scenarios. Only in the GD-Female scenario was the highest accuracy of the proposed OGA-ELM slightly lower than that of the work in [58], which achieved 98.94% against 97.10% for the proposed OGA-ELM. This indicates that generating suitable weights and biases for the ELM minimizes classification errors. Moreover, avoiding unsuitable weights and biases keeps the ELM from getting stuck in local maxima of the weights and biases. This suggests that the OGA-ELM is a reliable model for ESR. The best results are shown in Table 17.
Table 17 Comparison of accuracy (%) between the proposed OGA-ELM and previous works
Reference Emotions SI SD GD-Male GD-Female
ELM [58] 7 90.31 99.47 92.98 98.94
KELM [27] 7 92.90 – – –
SMO [18] 7 75.50 – – –
SVM [45] 3 88.33 – – –
SVM [9] 7 – – 94.90 85.77
SVM [16] 7 – 82.10 – –
SVM [55] 6 88.80 – – –
SVM [60] 7 70.59 – – –
OGA-ELM 7 93.26 100.00 96.14 97.10
Table 18 Best overall results of the proposed OGA-ELM, basic ELM, NN, and SVM in the SI scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Approach tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
OGA-ELM 81 611 25 25 93.26 76.42 76.42 76.42 76.42 381.9471
Basic ELM 46 576 60 60 83.83 43.40 43.40 43.40 43.40 22.1431
NN 43 573 63 63 83.02 40.57 40.57 40.57 40.57 25.0304
SVM 42 572 64 64 82.75 39.62 39.62 39.62 39.62 22.9662
In addition, numerous experiments were performed with the basic ELM, a Feedforward Neural Network (NN), and an SVM in the four scenarios: SI, SD, GD-Male, and GD-Female. Owing to the page limit, we report only the highest performance of the ELM, NN, and SVM in each scenario in terms of accuracy, recall, precision, G-mean, F-measure, true positives, true negatives, false positives, false negatives, and execution time. Tables 18, 19, 20 and 21 provide the experimental results of the proposed OGA-ELM, basic ELM, NN, and SVM in the SI, SD, GD-Male, and GD-Female scenarios, respectively. The best performance of the proposed OGA-ELM was an accuracy of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. The best performance of the basic ELM was an accuracy of 83.83%, 85.71%, 83.01%, and 84.68% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. Further, the best performance of the NN was an accuracy of 83.02%, 83.33%, 81.47%, and 83.44% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. The best performance of the SVM was an accuracy of 82.75%, 85.71%, 82.24%, and 85.51% for the SI, SD, GD-Male, and GD-Female scenarios, respectively.
Based on the results in Tables 18, 19, 20 and 21, the OGA-ELM outperformed the basic ELM, NN, and SVM in the four scenarios: SI, SD, GD-Male, and GD-Female. This indicates that generating suitable weights and biases for the ELM minimizes classification errors. Thus, the OGA-ELM performed strongly in the four scenarios compared with some previous works (see Table 17) and with the basic ELM, NN, and SVM (see Tables 18, 19, 20 and 21). The best performance of the OGA-ELM was an accuracy of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. However, the original ELM, NN, and SVM algorithms outperformed the proposed OGA-ELM in terms of execution time, because the OGA-ELM is based on a GA, which needs more time to obtain the best values of the input weights and biases.
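The trade-off between accuracy and execution time can be quantified from the reported figures. The sketch below compares the OGA-ELM with the basic ELM using the best-run values of Tables 18–21; the dictionary layout and the function name `trade_off` are assumptions made for illustration.

```python
# Best-run accuracy (%) and execution time (s) from Tables 18-21.
results = {
    "SI":        {"OGA-ELM": (93.26, 381.9471), "Basic ELM": (83.83, 22.1431)},
    "SD":        {"OGA-ELM": (100.00, 253.8953), "Basic ELM": (85.71, 10.0748)},
    "GD-Male":   {"OGA-ELM": (96.14, 276.6538), "Basic ELM": (83.01, 13.0750)},
    "GD-Female": {"OGA-ELM": (97.10, 332.5520), "Basic ELM": (84.68, 15.1266)},
}


def trade_off(scenario):
    """Return (accuracy gain in percentage points, execution-time ratio)."""
    acc_o, time_o = results[scenario]["OGA-ELM"]
    acc_b, time_b = results[scenario]["Basic ELM"]
    return round(acc_o - acc_b, 2), round(time_o / time_b, 1)
```

In the SI scenario, for instance, the OGA-ELM gains 9.43 percentage points of accuracy over the basic ELM at roughly 17 times the execution time.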
Table 19 Best overall results of the proposed OGA-ELM, basic ELM, NN, and SVM in the SD scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Approach tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
OGA-ELM 12 72 0 0 100.00 100.00 100.00 100.00 100.00 253.8953
Basic ELM 6 66 6 6 85.71 50.00 50.00 50.00 50.00 10.0748
NN 5 65 7 7 83.33 41.67 41.67 41.67 41.67 12.5962
SVM 6 66 6 6 85.71 50.00 50.00 50.00 50.00 11.0365
Table 20 Best overall results of the proposed OGA-ELM, basic ELM, NN, and SVM in the GD-Male scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Approach tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
OGA-ELM 32 217 5 5 96.14 86.49 86.49 86.49 86.49 276.6538
Basic ELM 15 200 22 22 83.01 40.54 40.54 40.54 40.54 13.0750
NN 13 198 24 24 81.47 35.14 35.14 35.14 35.14 15.6759
SVM 14 199 23 23 82.24 37.84 37.84 37.84 37.84 14.0021
4 Conclusion
In this study, we have proposed an enhanced ESR system based on the conventional MFCC features and our previously developed ELM variant, named OGA-ELM. The OGA-ELM was evaluated in four different scenarios (SI, SD, GD-Male, and GD-Female) using the BES dataset. The outcome indicated the superiority of the OGA-ELM over some previous works (see Table 17) and over the basic ELM, NN, and SVM (see Tables 18, 19, 20 and 21) in the four scenarios. The best performance of the OGA-ELM was an accuracy of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. Since the current study considered only offline ESR, future work is to create an ESR system that can handle online feature extraction and classification in real time. Such a system could be used to analyse and detect the emotion of callers in call centers. In further future work, the proposed OGA-ELM can be applied to other applications such as voice pathology detection, accent classification, and speaker identification. Finally, other optimisation approaches for the ELM will be explored in order to generate the most suitable weights and biases for the ELM, which leads to minimizing classification errors.
Acknowledgements This project was funded by the Universiti Kebangsaan Malaysia under Dana Impak
Perdana grant (Research code: GUP-2020-063).
References
1. Albadr MA, Tiun S, Ayob M, al-Dhief F (2020) Genetic algorithm based on natural selection theory for
optimization problems. Symmetry 12(11):1758
2. Albadr MAA, Tiun S (2020) Spoken language identification based on particle swarm optimisation–extreme
learning machine approach. Circ Syst Signal Process 1–27
3. Albadr MAA, Tiun S, al-Dhief FT, Sammour MAM (2018) Spoken language identification based on the
enhanced self-adjusting extreme learning machine approach. PLoS One 13(4):e0194770
4. Albadr MAA, Tiun S, Ayob M, al-Dhief FT (2019) Spoken language identification based on optimised
genetic algorithm–extreme learning machine approach. Int J Speech Technol 22(3):711–727
5. Albadr MAA, Tiun S, Ayob M, al-Dhief FT, Omar K, Hamzah FA (2020) Optimised genetic algorithm-
extreme learning machine approach for automatic COVID-19 detection. PLoS One 15(12):e0242899
6. Albadr MAA, Tiun S (2017) Extreme learning machine: a review. Int J Appl Eng Res 12(14):4610–4623
7. Al-Dhief FT et al (2020) A survey of voice pathology surveillance systems based on internet of things and
machine learning algorithms. IEEE Access 8:64514–64533
8. Al-Dhief FT et al (2020) Voice pathology detection using machine learning technique. In 2020 IEEE 5th
international symposium on telecommunication technologies (ISTT). IEEE
9. Alonso JB, Cabrera J, Medina M, Travieso CM (2015) New approach in quantification of emotional
intensity from the speech signal: emotional temperature. Expert Syst Appl 42(24):9554–9564
10. Badshah AM et al (2017) Speech emotion recognition from spectrograms with deep convolutional neural
network. In: 2017 international conference on platform technology and service (PlatCon). IEEE
11. Baroi OL et al (2019) Effects of different environmental noises and sampling frequencies on the performance of MFCC and PLP based Bangla isolated word recognition system. In: 2019 1st international conference on advances in Science, engineering and robotics technology (ICASERT). IEEE
12. Basu S et al (2017) A review on emotion recognition using speech. In: 2017 international conference on inventive communication and computational technologies (ICICCT). IEEE
13. Bi W, Xu Y, Wang H (2020) Comparison of searching behaviour of three evolutionary algorithms applied
to water distribution system design optimization. Water 12(3):695
14. Burkhardt F et al (2005) A database of German emotional speech. In: Ninth European Conference on
Speech Communication and Technology
Table 21 Best overall results of the proposed OGA-ELM, basic ELM, NN, and SVM in the GD-Female scenario (measures in %; Time(s) is the execution time for training and testing in seconds)
Approach tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Time(s)
OGA-ELM 62 407 7 7 97.10 89.86 89.86 89.86 89.86 332.5520
Basic ELM 32 377 37 37 84.68 46.38 46.38 46.38 46.38 15.1266
NN 29 374 40 40 83.44 42.03 42.03 42.03 42.03 17.8584
SVM 34 379 35 35 85.51 49.28 49.28 49.28 49.28 16.0334
15. Calvo RA, D'Mello S (2010) Affect detection: an interdisciplinary review of models, methods, and their
applications. IEEE Trans Affect Comput 1(1):18–37
16. Cao H, Verma R, Nenkova A (2015) Speaker-sensitive emotion recognition via ranking: studies on acted
and spontaneous speech. Comput Speech Lang 29(1):186–202
17. Chavhan Y, Dhore M, Yesaware P (2010) Speech emotion recognition using support vector machine. Int J
Comput Appl 1(20):6–9
18. Choudhury AR et al (2018) Emotion recognition from speech signals using excitation source and spectral
features. In: 2018 IEEE applied signal processing conference (ASPCON). IEEE
19. Dendukuri LS, Hussain SJ (2019) Statistical feature set calculation using Teager energy operator on
emotional speech signals. In: 2019 international conference on wireless communications signal processing
and networking (WiSPNET). IEEE
20. Deng C, Huang GB, Xu J, Tang JX (2015) Extreme learning machines: new trends and applications.
Science China Inf Sci 58(2):1–16
21. Dogra A, Kaul A, Sharma R (2019) Automatic recognition of dialects of Himachal Pradesh using MFCC & GMM. In: 2019 5th international conference on signal processing, computing and control (ISPCC). IEEE
22. El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification
schemes, and databases. Pattern Recogn 44(3):572–587
23. Fortin F-A et al (2012) DEAP: evolutionary algorithms made easy. J Mach Learn Res 13(1):2171–2175
24. Gangamohan P, Kadiri SR, Yegnanarayana B (2016) Analysis of emotional speech—A review, in Toward
Robotic Socially Believable Behaving Systems-Volume I, Springer, p. 205–238
25. Ghasemi J, Esmaily J, Moradinezhad R (2020) Intrusion detection system using an optimized kernel extreme learning machine and efficient features. Sādhanā 45(1):1–9
26. Gogna A, Tayal A (2012) Comparative analysis of evolutionary algorithms for image enhancement. Int J
Met 2(1):80–100
27. Guo L, Wang L, Dang J, Liu Z, Guan H (2019) Exploration of complementary features for speech emotion
recognition based on kernel extreme learning machine. IEEE Access 7:75798–75809
28. Han W et al (2006) An efficient MFCC extraction method in speech recognition. In: 2006 IEEE international
symposium on circuits and systems. IEEE
29. Huang G-B, Chen L, Siew CK (2006) Universal approximation using incremental constructive feedforward
networks with random hidden nodes. IEEE Trans Neural Netw 17(4):879–892
30. Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications.
Neurocomputing 70(1):489–501
31. Huang G-B et al (2011) Extreme learning machine for regression and multiclass classification. IEEE Trans
Syst, Man Cybern, Part B (Cybernetics) 42(2):513–529
32. Huang G-B et al (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans
Syst, Man Cybern, Part B (Cybernetics) 42(2):513–529
33. Jain M et al (2020) Speech emotion recognition using support vector machine. arXiv preprint
arXiv:2002.07590
34. Juvela L et al (2018) Speech waveform synthesis from MFCC sequences with generative adversarial
networks. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP).
IEEE
35. Kaya H, Karpov AA (2018) Efficient and effective strategies for cross-corpus acoustic emotion recognition.
Neurocomputing 275:1028–1034
36. Kaya H, Karpov AA, Salah AA (2016) Robust acoustic emotion recognition based on cascaded normalization
and extreme learning machines. In: International symposium on neural networks. Springer
37. Kostoulas T, Mporas I, Kocsis O, Ganchev T, Katsaounos N, Santamaria JJ, Jimenez-Murcia S, Fernandez-
Aranda F, Fakotakis N (2012) Affective speech interface in serious games for supporting therapy of mental
disorders. Expert Syst Appl 39(12):11072–11079
38. Kuchibhotla S, Vankayalapati HD, Anne KR (2016) An optimal two stage feature selection for speech
emotion recognition using acoustic features. Int J Speech Technol 19(4):657–667
39. Lopez-de-Ipiña K et al (2015) On automatic diagnosis of Alzheimer’s disease based on spontaneous speech
analysis and emotional temperature. Cogn Comput 7(1):44–55
40. Mar LL, Pa WP (2019) Depression detection from speech emotion recognition. Seventeenth International
Conference on Computer Applications (ICCA 2019)
41. Muda L, Begam M, Elamvazuthi I (2010) Voice recognition algorithms using mel frequency cepstral
coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083
42. Murugappan M et al (2020) Emotion classification in Parkinson's disease EEG using RQA and ELM. In:
2020 16th IEEE international colloquium on Signal Processing & its Applications (CSPA). IEEE
43. Neiberg D, Elenius K (2008) Automatic recognition of anger in spontaneous speech. In: Ninth Annual
Conference of the International Speech Communication Association
44. Özseven T (2019) A novel feature selection method for speech emotion recognition. Appl Acoust 146:320–326
45. Pakyurek M, Atmis M, Kulac S, Uludag U (2020) Extraction of novel features based on histograms of
MFCCs used in emotion classification from generated original speech dataset. Elektronika ir
Elektrotechnika 26(1):46–51
46. Petrushin VA (2000) Emotion recognition in speech signal: experimental study, development, and application.
In: Sixth International Conference on Spoken Language Processing
47. Poorna S, Nair G (2019) Multistage classification scheme to enhance speech emotion recognition. Int J
Speech Technol 22(2):327–340
48. Renanti MD, Buono A, Kusuma WA (2013) Infant cries identification by using codebook as feature
matching, and MFCC as feature extraction. J Theoretical Appl Inform Technol 56(3)
49. Shah AF, Anto PB (2017) Hybrid spectral features for speech emotion recognition. In: 2017 international
conference on innovations in information, embedded and communication systems (ICIIECS). IEEE
50. Sokolova M, Japkowicz N, Szpakowicz S (2006) Beyond accuracy, F-score and ROC: a family of
discriminant measures for performance evaluation. In: Australasian joint conference on artificial intelligence.
Springer
51. Trang H, Loc TH, Nam HBH (2014) Proposed combination of PCA and MFCC feature extraction in speech
recognition system. In: 2014 International Conference on Advanced Technologies for Communications
(ATC 2014). IEEE
52. Tripathi A, Singh U, Bansal G, Gupta R, Singh AK (2020) A review on emotion detection and classification
using speech. Available at SSRN 3601803
53. Tzinis E, Potamianos A (2017) Segment-based speech emotion recognition using recurrent neural networks.
In: 2017 seventh international conference on affective computing and intelligent interaction (ACII). IEEE
54. van Heeswijk M (2015) Advances in extreme learning machines
55. Wang K et al (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput
6(1):69–75
56. Wang Y, Cao F, Yuan Y (2011) A study on effectiveness of extreme learning machine. Neurocomputing
74(16):2483–2490
57. Wilhelmstötter F (2021) Jenetics Library User’s Manual 6.2. [Online]. Available: https://jenetics.io
58. Yogesh C et al (2017) A new hybrid PSO assisted biogeography-based optimization for emotion and stress
recognition from speech signal. Expert Syst Appl 69:149–158
59. Yu F et al (2016) Improved roulette wheel selection-based genetic algorithm for TSP. In: 2016 international
conference on network and information systems for computers (ICNISC). IEEE
60. Zaidan NA, Salam MS (2016) MFCC global features selection in improving speech emotion recognition
rate. In: Advances in machine learning and signal processing. Springer, p. 141–153
61. Zhang X, Sun J, Luo Z (2014) One-against-all weighted dynamic time warping for language-independent
and speaker-dependent speech recognition in adverse conditions. PLoS One 9(2):e85458
62. Zhao S et al (2014) Automatic detection of expressed emotion in Parkinson's disease. In: 2014 IEEE
international conference on acoustics, speech and signal processing (ICASSP). IEEE
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Affiliations
Musatafa Abbas Abbood Albadr (1), Sabrina Tiun (1), Masri Ayob (1), Fahad Taha AL-Dhief (2), Khairuddin Omar (1), Mhd Khaled Maen (3)
(1) CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
(2) School of Electrical Engineering, Department of Communication Engineering, Universiti Teknologi Malaysia, UTM Johor Bahru, Johor, Malaysia
(3) Department of Information and Technology, Uppsala University, Uppsala, Sweden
Multimedia Tools and Applications (2022) 81:23963–23989