Speech emotion recognition using optimized genetic algorithm-extreme learning machine

Musatafa Abbas Abbood Albadr¹ · Sabrina Tiun¹ · Masri Ayob¹ · Fahad Taha AL-Dhief² · Khairuddin Omar¹ · Mhd Khaled Maen³
Received: 23 February 2021 / Revised: 17 May 2021 / Accepted: 21 February 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Automatic Emotion Speech Recognition (ESR) is an active research field in Human-Computer Interface (HCI). An ESR system typically consists of two main parts: a Front-End (feature extraction) and a Back-End (classification). However, most previous ESR systems have focused on the feature extraction part and neglected the classification part, even though classification is essential: its role is to map the features extracted from audio samples to the corresponding emotion. Moreover, most ESR systems have been evaluated under the Subject Independent (SI) scenario only. Therefore, this paper focuses on the Back-End (classification) and adopts our recently developed Extreme Learning Machine (ELM) variant, the Optimized Genetic Algorithm-Extreme Learning Machine (OGA-ELM). In addition, we use the Mel Frequency Cepstral Coefficients (MFCC) method to extract features from the speech utterances. This work demonstrates the significance of the classification part in ESR systems, which improves ESR performance in terms of accuracy. The proposed model was evaluated on the Berlin Emotional Speech (BES) dataset, which consists of 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). Four evaluation scenarios were conducted: Subject Dependent (SD), SI, Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The OGA-ELM achieved its highest accuracy of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. In addition, the proposed ESR system showed a fast execution time in identifying the emotions in all experiments.
Keywords Emotion speech recognition · Optimized genetic algorithm-extreme learning machine · Mel frequency cepstral coefficients
https://doi.org/10.1007/s11042-022-12747-w
*Musatafa Abbas Abbood Albadr
mustafa_abbas1988@yahoo.com
Extended author information available on the last page of the article
Published online: 19 March 2022
Multimedia Tools and Applications (2022) 81:23963–23989
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1 Introduction
Human beings communicate through facial expressions, gestures, speech, and body language. These forms of communication convey the emotional states and messages of the speakers [24]. People have a natural ability to comprehend a speaker's emotions from the speech signal. A robust emotion recognition system aims to identify a person's emotional state automatically from the user's voice. The speech signal contains the linguistic information of the speaker as well as other information such as age, origin, gender, and emotional state [53]. Such systems have had numerous potential impacts on HCI [7,15,22]. Furthermore, automatic ESR systems have been applied in several real-time applications to analyze and detect emotions, such as detecting the emotions of callers in call centers [43], diagnosing mental disorders [37], and detecting Parkinson's and Alzheimer's diseases [39,62]. ESR systems are also used to provide assistance in many other applications (e.g., educational development, learning environments, lie detection systems, game software, and entertainment) [46].
In general, ESR systems are built in two main stages. The first stage is the front-end, which extracts feature vectors from the samples of a speech utterance. The second stage is the back-end, which recognises the emotion based on certain sets of feature vectors, algorithms, and models. Figure 1 depicts the general overview of the ESR system. Among the most common feature extraction methods used in the ESR field are Linear Predictive Coding (LPC), Cepstrum Coefficients derived from LPC (LPCC), MFCC, and Perceptual Linear Prediction (PLP) [12,40,41,52]. Of these methods, MFCC is the most popular feature extraction approach in speech applications generally and has been reported to have the highest identification accuracy [28,48,51,61]. The classification process is an essential part of any ESR system; its role is to map the features extracted from audio samples to the corresponding emotion. Several classifiers have been used in the literature, for instance deep learning [10], Support Vector Machine (SVM) [33], and ELM [42].
Recently, ELMs have emerged, becoming a modern framework for machine learning [6,
20,25,47,54,56]. ELMs are a type of feed-forward neural network characterised by random
initialisation of their hidden layer weights, combined with a fast training algorithm. The
effectiveness (without blindness) of this random initialisation and quick training makes them
very appealing for large-scale data analysis.
Fig. 1 General overview of the emotion speech recognition system
In recent decades, the ELM algorithm has gained high significance among machine learning algorithms [32]. This is because ELM has unique characteristics such as good generalization, classification capability, and extremely fast training. In addition, ELM is an efficient solution for Single-hidden Layer Feedforward Networks (SLFNs), and it has proved its efficiency and effectiveness in several applications. ELM has obtained better and faster generalization than SVM and back propagation (BP)-based neural networks (NNs) [3,29,31,35]. Consequently, many researchers have used ELM in ESR. For example, the authors of [58] presented a new particle swarm optimization assisted biogeography-based algorithm for feature selection, while the ELM classifier was used in the classification part to distinguish the emotions. The simulations were conducted using the BES dataset. Different evaluation experiments were conducted: Subject Dependent (SD), Subject Independent (SI), Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The highest recognition accuracy was 90.31%, 99.47%, 98.94%, and 92.98% for SI, SD, GD-Female, and GD-Male, respectively. However, despite the strength of this work, the optimization of the ELM with respect to its random weights was ignored, which leads to non-optimal classification performance of the ELM.
Another attempt is the work of [27], where the authors proposed a dynamic framework that exploits the advantages of auditory-based empirical features and complementary spectrogram-based statistical features. A Kernel Extreme Learning Machine (KELM) was used to recognize emotions. To validate this framework, they conducted experiments on the BES dataset. The experimental results demonstrated that the proposed framework outperformed the existing state-of-the-art method and achieved an accuracy of 92.90%. However, this work ignored the optimization of the KELM in terms of the input weights of the hidden layer.
A further attempt was made by [49], where the authors proposed Hybrid Spectral Features (HSF), which combine the LPC, MFCC, and PSD parameters. The ELM was used as the pattern classifier to recognize emotions. The evaluation experiments were conducted on an emotional speech dataset consisting of nine emotions: neutral, calm, sad, surprise, happy, anxiety, anger, fear, and boredom. In their experiments, the highest overall recognition accuracy was 82.22%. Despite the superiority of the proposed method over the benchmark, this work also ignored the optimization of the ELM in terms of the input weights of the hidden layer.
The methods in [38,44] address speech emotion recognition using the BES dataset. The method in [38] presented a new feature fusion (i.e., MFCC and prosody) and two different feature selection approaches (i.e., SFS and SFFS). Four classifiers were used in this method: Linear Discriminant Analysis (LDA), Regularized Discriminant Analysis (RDA), SVM, and K-Nearest Neighbour (KNN). The experimental results showed that the RDA and SVM classifiers with SFFS features obtained the highest accuracy of 92.70%. The method in [44] proposed a new feature extraction approach called Acoustic Analysis Methods and Statistical Feature Selection (AAMSFS), combined with three different classifiers: Multilayer Perceptron (MLP), SVM, and k-NN. Based on the experimental results, the proposed AAMSFS with the SVM classifier achieved an accuracy of up to 84.62%.
Another attempt was made by [36], where the authors used a cascaded normalization method. They combined nonlinear value level, linear speaker level, and feature vector level normalization in order to decrease speaker effects and to maximize class separability. Additionally, the ELM was applied to distinguish emotions. The evaluation experiments were conducted on part of the recently collected Turkish Emotional Speech (TES) dataset with four emotion classes: joy, neutral, sadness, and anger. The results showed the superiority of the ELM over the SVM: the ELM achieved an overall accuracy of 79.00%, while the SVM achieved 77.30%. However, most ESR studies have ignored the optimization of the ELM in terms of the input weights of the hidden layer [2,3,6]. The drawback of the ELM is that it needs a technique for selecting the input-hidden layer weights. In other words, there is no method to ensure that the trained ELM is the most suitable for the classification process. To solve this issue, an optimisation method should be combined with the ELM to determine the weights that yield the best classification performance. Table 1 summarizes the related works, including the strengths and weaknesses of each method.
Based on the studies above, the limitations of emotion speech recognition systems can be summarized as follows:
• Most previous studies of emotion speech recognition have focused on the feature extraction part and ignored the classification part.
• Few studies have evaluated their methods under different scenarios such as Subject Independent (SI), Subject Dependent (SD), and Gender Dependent Female (GD-Female). In other words, most systems are evaluated under the SI scenario only.
• The accuracy of speech emotion recognition systems is still not encouraging.
• Accuracy, recall, and precision are the measures most often used to evaluate emotion speech recognition systems, while other evaluation measures such as F-measure, G-mean, and execution time are ignored.
Based on the points above, this study uses MFCC features and one of the recent ELM optimizations, named OGA-ELM [4]. In [4], we proposed OGA-ELM for Language Identification (LID) using i-vector features. In this work, we apply OGA-ELM with MFCC features to emotion speech recognition. The main contributions of this work are as follows:
• Improve ESR performance and achieve higher accuracy.
• Evaluate the proposed method under four different scenarios: SD, SI, GD-Female, and GD-Male.
• Evaluate the performance of the OGA-ELM in ESR using several evaluation measures: accuracy, recall, precision, F-measure, G-mean, and execution time.
• Demonstrate the effectiveness of the classification part in the ESR application.
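The evaluation measures named in the contributions can be illustrated with a minimal sketch. It assumes a one-vs-rest view in which one emotion class is treated as "positive", and it assumes that G-mean is the geometric mean of recall (sensitivity) and specificity; the text does not give these definitions at this point, and the function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common measures from one-vs-rest confusion counts.

    Assumption: G-mean = sqrt(recall * specificity); the paper's exact
    definition is not stated in this section.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    print(binary_metrics(tp=8, fp=2, fn=2, tn=88))
```

For a seven-class problem such a function would be applied once per emotion and the results averaged; the averaging scheme is likewise an assumption.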
The rest of this study is organized as follows: Section 2 describes the proposed method; Section 3 presents and discusses the experimental results; and Section 4 gives the conclusions and future work.
Table 1 Summary of related work

| Ref | Dataset | Features | Classifier | Result | Strengths | Weaknesses |
|---|---|---|---|---|---|---|
| [58] | BES with 7 emotions | Higher Order Spectral (HOS) features and Particle Swarm Optimization assisted Biogeography-based Optimization (PSOBBO) for feature selection | ELM | Accuracy of 90.31% (SI), 99.47% (SD), 98.94% (GD-Female), and 92.98% (GD-Male) | The results showed that the proposed HOS-PSOBBO-ELM outperformed some previous studies. | 1. The optimization of the ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 2. The results need more improvement in terms of accuracy. |
| [27] | BES with 7 emotions | Deep Complementary Feature Extraction (DCF) | KELM | Accuracy of 92.90% (SI) | The proposed DCF-KELM outperformed the CNN-BLSTM. | 1. These works were evaluated based on only one scenario, which is SI. 2. The optimization of both KELM and ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 3. The results of these methods are still not encouraging and need more improvement. |
| [49] | BES with 7 emotions | Hybrid Spectral Features (HSF), which combine the LPC, MFCC, and PSD parameters | ELM | Accuracy of 82.22% (SI) | The proposed HSF-ELM outperformed some previous studies. | 1. These works were evaluated based on only one scenario, which is SI. 2. The optimization of both KELM and ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 3. The results of these methods are still not encouraging and need more improvement. |
| [38] | BES with 7 emotions | Feature fusion (MFCC and prosody) and two different feature selection approaches (SFS and SFFS) | LDA, RDA, SVM, and KNN | Accuracy of 92.70% (SI) | The experimental results showed that the RDA and SVM classifiers with SFFS features give the best emotion recognition rate. | 1. These works were evaluated based on only one scenario, which is SI. 2. The results of these methods are still not encouraging and need more improvement. |
| [44] | BES with 7 emotions | AAMSFS | MLP, SVM, and k-NN | Accuracy of 84.62% (SI) | The results showed that AAMSFS with SVM achieved the best accuracy rate. | 1. These works were evaluated based on only one scenario, which is SI. 2. The results of these methods are still not encouraging and need more improvement. |
| [36] | TES with 4 emotions | Combination of nonlinear value level, linear speaker level, and feature vector level normalization | ELM | Accuracy of 79.00% (SI) | The experimental results showed the superiority of the ELM over the SVM. | 1. The work was evaluated based on only one scenario, which is SI. 2. The work was evaluated based on only 4 emotions. 3. The optimization of the ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 4. The results are not encouraging and need more improvement. |
2 Method
The general diagram of the proposed ESR system using the OGA-ELM method is illustrated in Fig. 2. The system consists of four phases that are used to build the ESR system based on the speech signal. The first phase is the speech dataset covering different human emotions: neutral, happiness, boredom, anxiety, sadness, anger, and disgust. The second phase is the pre-processing of the speech signal samples. In the third phase, the MFCC technique is used to extract the needed features from the utterances. Finally, in the fourth phase, the extracted features are fed into the OGA-ELM classifier in order to identify human emotions from the speech signal. The OGA-ELM is based on the Optimised Genetic Algorithm (OGA) [4], where the Genetic Algorithm (GA) was chosen and optimized in order to improve the classification performance of the ELM. The GA was selected because it is one of the most popular optimization algorithms, mainly due to its ease of implementation and its support in many libraries [23,57]. Additionally, the GA has a good global search capability and is considered one of the essential technologies associated with modern intelligent computation [59]. The GA is also resource-friendly, as it finds better solutions faster than other optimization algorithms [13,26]. The four phases of the proposed ESR system are discussed in the following subsections.
2.1 Dataset
In this work, the Berlin Emotional Speech (BES) dataset [14] was selected for evaluation purposes. The BES dataset is a standard dataset that is frequently used by emotion classification researchers [18,27,45,58]. It contains 533 emotional speech utterances from 10 professional German actors (5 males and 5 females), with 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). The actors were asked to express 10 sentences with these 7 emotions. The audio files range from 1 to 8 s in duration, but in this study we used a fixed duration (see subsection 2.2). Further details about the BES dataset are provided in [14,17]. Table 2 provides the details of the BES dataset. This study used 80% of the dataset for training and the remaining 20% for testing in all the evaluation scenarios.
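The 80%/20% partition can be sketched as follows. This is only an illustration: the text does not say whether the split is random, stratified by emotion, or grouped by speaker, so the seeded random shuffle below is an assumption, and `split_80_20` is a hypothetical helper name.

```python
import random


def split_80_20(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and testing parts.

    Assumption: a plain random split; the paper does not describe how the
    80/20 partition was drawn for each scenario.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # 533 utterance indices, matching the dataset size stated in the text.
    train, test = split_80_20(range(533))
    print(len(train), len(test))
```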
2.2 Pre-processing
This section describes the pre-processing used in this study. The BES dataset consists of utterances of different durations (the term utterance refers to the speech signal), ranging from 1 to 8 s. Therefore, this study applied a two-step pre-processing. The first step reads the utterances from files with the (.wav) extension. The second step fixes the duration of all utterances to 1 s, which is 19,608 samples. The output of the pre-processing step is the utterance vector (19,608 × 1) in samples, which is the input to the feature extraction process.
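The second pre-processing step can be sketched as below. The text only states that every utterance ends up with 19,608 samples; keeping the first samples of longer utterances and zero-padding shorter ones are assumptions, and `fix_length` is a hypothetical helper name.

```python
def fix_length(samples, target_len=19608):
    """Return a copy of `samples` with exactly `target_len` values.

    Assumptions: longer utterances are truncated from the start of the
    signal and shorter ones are padded with trailing zeros.
    """
    clipped = list(samples[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


if __name__ == "__main__":
    long_utterance = [0.1] * 30000
    short_utterance = [0.1] * 5000
    print(len(fix_length(long_utterance)), len(fix_length(short_utterance)))
```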
2.3 Features extraction
The MFCC [19,21] feature extraction for ESR in this study begins with segmentation, where the utterance vector (19,608 × 1) obtained from the pre-processing step is divided into 25 ms frames with a 10 ms frame shift. This is followed by the computation of thirteen MFCCs and the application of vocal tract length normalisation. Subsequently, cepstral mean and variance normalisation is performed together with RASTA filtering. Figure 3 illustrates the entire process of extracting the MFCC features. The steps are as follows:
Pre-emphasis: This is the first stage in MFCC feature extraction; its aim is to boost the amount of energy in the high frequencies.
Windowing: Windowing segments the utterance into frames.
Fast Fourier Transform (FFT): The FFT converts the time-domain signal into a frequency-domain signal, because the features of speech data exist in the frequency domain.
Magnitude: This step calculates the power spectrum of each frame.
Vocal Tract Length Normalisation (VTLN): VTLN compensates for the fact that speakers have vocal tracts of different sizes.
Mel-Filter Bank: The Mel-filter bank approximates how much energy occurs in each frequency region.
Log: This step ensures that the high and low frequencies are separated, to simulate the human hearing system.
RASTA Filtering: In this step, the values of the first four frames of the array resulting from the previous step are set to zero to avoid a significant initial spike arising from the DC offset level in each band. Each row of the remaining array is then band-pass filtered using a filter with a sharp spectral zero at zero frequency, which suppresses any constant or slowly varying component in each row.
Fig. 2 Diagram of the proposed ESR system: speech signal (utterances with different durations) → pre-processing, first and second steps (19,608 × 1) → feature extraction, MFCC with 13 cepstral coefficients (13 × 42) → reshaping of the MFCC to a one-row vector (1 × 546) → classification with OGA-ELM → identified emotion (e.g., happiness, disgust, anxiety)
Table 2 Description of the BES dataset [14]

| Emotion | Anxiety | Disgust | Happiness | Boredom | Neutral | Sadness | Anger |
|---|---|---|---|---|---|---|---|
| Number of all utterances | 68 | 46 | 71 | 81 | 79 | 62 | 126 |
| Number of female utterances | 37 | 36 | 51 | 51 | 51 | 44 | 80 |
| Number of male utterances | 31 | 10 | 20 | 30 | 28 | 18 | 46 |
| Emotion label | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Discrete Cosine Transform (DCT): In the DCT step, the log Mel spectrum is converted back to the time domain.
Cepstral Mean and Variance Normalisation (CMVN): The purpose of CMVN is to decrease the effects of convolutional channel distortion, noise, and speaker variation by forcing all utterances to have zero mean and unit variance.
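Two of the steps listed above, pre-emphasis and CMVN, are simple enough to sketch directly. The first-order filter y[n] = x[n] − α·x[n−1] with α = 0.97 is a common choice and an assumption here, since the text does not give the filter coefficient; applying CMVN separately to each cepstral coefficient across the frames of one utterance is likewise an assumption. The function names are hypothetical.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies with y[n] = x[n] - alpha * x[n-1].

    Assumption: alpha = 0.97 (a conventional value, not taken from the paper).
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def cmvn(coeff_track):
    """Normalise one coefficient track to zero mean and unit variance."""
    n = len(coeff_track)
    mean = sum(coeff_track) / n
    variance = sum((c - mean) ** 2 for c in coeff_track) / n
    std = math.sqrt(variance) or 1.0  # guard against a constant track
    return [(c - mean) / std for c in coeff_track]


if __name__ == "__main__":
    print(pre_emphasis([1.0, 1.0, 1.0]))
    print(cmvn([1.0, 2.0, 3.0]))
```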
The output of the MFCC is an array of size (13 × number of frames) for each utterance, i.e., (13 × 42). The MFCC features (13 × 42) of each utterance are then reshaped into a one-row vector (1 × 546), which is the input of the classification step. Table 3 describes the values of the MFCC variables used in this study. Because the frame size and frame shift in samples depend on the sampling rate, this study set the sampling rate to 44,100 Hz instead of 16 kHz. The reason is to increase the frame size in samples and decrease the number of frames [11,34]. The frame size and frame shift in samples are calculated as shown in Eqs. (1) and (2), and the number of frames is calculated as shown in Eq. (3).
$$N_w = 10^{-3} \times T_w \times \text{Sampling rate} \tag{1}$$
$N_w$: frame size in samples.
$T_w$: frame size (25 ms).
Sampling rate: 44,100.
And the frame shift in samples:
$$N_s = 10^{-3} \times T_s \times \text{Sampling rate} \tag{2}$$
$N_s$: frame shift size in samples.
$T_s$: frame shift size (10 ms).
The number of frames in each utterance is given in Eq. (3):
$$\text{number of frames} = \frac{\text{length of utterance in samples} - \text{frame size in samples } (N_w)}{\text{frame shift in samples } (N_s)} + 1 \tag{3}$$
Length of utterance in samples = 19,608; $N_w$ = 1103; $N_s$ = 441; and number of frames = 42.
As a result, there will be an array of size ($N_w$ × number of frames), i.e., (1103 × 42).
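The arithmetic of Eqs. (1) to (3) can be checked with a short sketch. Rounding the frame size half up (1102.5 becomes 1103) and taking the integer part of the division in Eq. (3) are assumptions that reproduce the values reported in the text; `frame_params` is a hypothetical helper name.

```python
def frame_params(sampling_rate=44100, frame_ms=25, shift_ms=10,
                 n_samples=19608, n_mfcc=13):
    """Return (N_w, N_s, number of frames, flattened feature length).

    Assumptions: round-half-up for Eqs. (1) and (2), floor division in
    Eq. (3). Integer arithmetic avoids floating-point rounding surprises.
    """
    n_w = (frame_ms * sampling_rate + 500) // 1000   # Eq. (1)
    n_s = (shift_ms * sampling_rate + 500) // 1000   # Eq. (2)
    n_frames = (n_samples - n_w) // n_s + 1          # Eq. (3)
    feature_len = n_mfcc * n_frames                  # 13 x 42 reshaped to one row
    return n_w, n_s, n_frames, feature_len


if __name__ == "__main__":
    print(frame_params())  # (1103, 441, 42, 546)
```

With the defaults, the result matches the frame size, frame shift, number of frames, and reshaped feature length listed in Table 3.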
2.4 Classification
2.4.1 Review of ELM
The basic ELM algorithm for training SLFNs was proposed by [30]. The main idea behind ELM is that the hidden layer weights and biases are generated randomly. The output weights are then calculated using the least-squares solution defined by the outputs of the hidden layer and the targets. An overview of the ELM structure and the training algorithm is shown in Fig. 4. A brief description of the ELM follows.
Given a set of $N$ distinct samples $(X_j, t_j)$, where $X_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T \in R^m$, the SLFN is modelled mathematically by Eq. (4).
Fig. 3 Block diagram of the process of extracting MFCC features: utterance in samples (19,608 × 1) → pre-emphasis filter → windowing → Fast Fourier Transform (FFT) → magnitude → Vocal Tract Length Normalization (VTLN) → Mel-filter bank → log → RASTA filtering → Discrete Cosine Transform → Cepstral Mean and Variance Normalization → MFCC features (13 × 42)
Table 3 Values of the MFCC variables used in this study

| Variable | Value |
|---|---|
| Sampling rate | 44,100 Hz |
| Utterance duration before pre-processing | In a range of 1–8 seconds |
| Utterance duration after pre-processing | One second, which is 19,608 samples |
| Frame size in time | 25 milliseconds |
| Frame shift in time | 10 milliseconds |
| Frame size in samples (Nw) | 1103 |
| Frame shift in samples (Ns) | 441 |
| Number of frames of one utterance | 42 |
| MFCC features for one utterance before reshape | (13 × 42) |
| MFCC features for one utterance after reshape | (1 × 546) |
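$$\sum_{i=1}^{L} \beta_i g_i(X_j) = \sum_{i=1}^{L} \beta_i \, g(W_i \cdot X_j + b_i) = o_j, \quad j = 1, \ldots, N \tag{4}$$
where
$W_i = [W_{i1}, W_{i2}, \ldots, W_{in}]^T$ = weight vector that connects the $i$th hidden node and the input nodes;
$\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ = weight vector that connects the $i$th hidden node and the output nodes;
$b_i$ = threshold of the $i$th hidden node;
$W_i \cdot X_j$ = inner product of $W_i$ and $X_j$; the output nodes are chosen to be linear;
$L$ = number of hidden layer nodes.
A standard SLFN with $L$ hidden nodes and activation function $g(x)$ can approximate these $N$ samples with zero error. Thus, $\sum_{j=1}^{N} \| o_j - t_j \| = 0$, that is, there exist $\beta_i$, $W_i$, and $b_i$ such that Eq. (5) holds.
$$\sum_{i=1}^{L} \beta_i \, g(W_i \cdot X_j + b_i) = t_j, \quad j = 1, \ldots, N \tag{5}$$
The $N$ equations above can be written compactly as:
$$H\beta = T \tag{6}$$
where
$$H(W_1, \ldots, W_L, b_1, \ldots, b_L, X_1, \ldots, X_N) = \begin{bmatrix} g(W_1 \cdot X_1 + b_1) & \cdots & g(W_L \cdot X_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(W_1 \cdot X_N + b_1) & \cdots & g(W_L \cdot X_N + b_L) \end{bmatrix}_{N \times L}$$
$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}$$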
Fig. 4 Diagram of the ELM [6]
Following [30], $H$ is called the hidden layer output matrix of the neural network; the $i$th column of $H$ is the output of the $i$th hidden node with respect to the inputs $X_1, \ldots, X_N$. If the activation function $g$ is infinitely differentiable, the required number of hidden nodes satisfies $L \le N$. Equation (6) then becomes a linear system, and the output weights $\beta$ can be determined analytically by finding a least-squares solution as follows:
$$\beta = H^{\dagger} T \tag{7}$$
where $H^{\dagger}$ is the Moore–Penrose generalised inverse of $H$. Thus, the output weights are calculated using a mathematical transformation without going through a lengthy training phase.
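The training procedure of Eqs. (4) to (7) can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the sigmoid activation and the initialisation ranges (weights in [−1, 1], biases in [0, 1]) follow the parameter ranges given later in the paper, while the function names and the toy data are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_elm(X, T, n_hidden, rng):
    """Train a basic ELM: random hidden layer, analytic output weights.

    X: (N, n) inputs, T: (N, m) targets (e.g. one-hot emotion labels).
    """
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_features))  # input weights
    b = rng.uniform(0.0, 1.0, size=n_hidden)                 # hidden biases
    H = sigmoid(X @ W.T + b)                                  # (N, L) matrix H
    beta = np.linalg.pinv(H) @ T                              # Eq. (7)
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Network output o_j of Eq. (4) for every row of X."""
    return sigmoid(X @ W.T + b) @ beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 4))                # toy data, not MFCC features
    labels = rng.integers(0, 3, size=10)
    T = np.eye(3)[labels]                       # one-hot targets
    W, b, beta = train_elm(X, T, n_hidden=20, rng=rng)
    print(predict_elm(X, W, b, beta).argmax(axis=1), labels)
```

Because the hidden layer is never updated, training reduces to a single pseudo-inverse, which is the source of the fast training time described above.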
The absence of a specific approach for determining the input-hidden layer weights is a major drawback of ELM, which subjects it to local minima. This means that, for the given training data, there is no way to ensure that the trained ELM is the most appropriate for performing the classification. Overcoming this drawback requires integrating an optimisation approach with the ELM so that the optimal weights can be identified and the ELM's best performance can be achieved. The following subsection explains the concept of OGA-ELM.
2.4.2 OGA-ELM
This study adopted the OGA-ELM from [4], which is derived from the OGA, to classify the emotion speech dataset into seven emotion classes. It uses a single selection criterion, and the values of the input weights and the biases of the hidden nodes are tuned using the selection, crossover, and mutation operations. Table 4 shows the parameters of the ELM and OGA that were used in this study's experiments.
$N$ is a collection of feature samples $(X_j, t_j)$, where $X_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T \in R^m$, where:
$X_j$ is the input, i.e., the features extracted by MFCC;
$t_j$ is the true value (expected output).
At the beginning of OGA-ELM, the values of the input weights and the thresholds of the hidden nodes are randomly initialised and represented as chromosomes:
$$ch = \{w_{11}, w_{12}, \ldots, w_{1n}, w_{21}, w_{22}, \ldots, w_{2n}, \ldots, w_{L1}, w_{L2}, \ldots, w_{Ln}, b_1, \ldots, b_L\}$$
where:
$w_{ij}$: the weight that connects the $i$th hidden node and the $j$th input node, $w_{ij} \in [-1, 1]$;
$b_i$: the bias of the $i$th hidden node, $b_i \in [0, 1]$;
$n$: the number of input nodes; and
$L$: the number of hidden nodes.
$(1 + n) \times L$ is the chromosome dimensionality, that is, the number of parameters that need to be optimised.
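The chromosome layout described above can be sketched as a flat list holding the L weight rows followed by the L biases. The helper names `random_chromosome` and `decode_chromosome` are hypothetical; only the ordering and the value ranges come from the text.

```python
import random


def random_chromosome(n_inputs, n_hidden, rng):
    """Create one chromosome: n*L weights in [-1, 1], then L biases in [0, 1]."""
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs * n_hidden)]
    biases = [rng.uniform(0.0, 1.0) for _ in range(n_hidden)]
    return weights + biases


def decode_chromosome(ch, n_inputs, n_hidden):
    """Split a chromosome back into the weight rows w_i and the bias list b."""
    W = [ch[i * n_inputs:(i + 1) * n_inputs] for i in range(n_hidden)]
    b = ch[n_inputs * n_hidden:]
    return W, b


if __name__ == "__main__":
    rng = random.Random(0)
    # 546 MFCC inputs and 100 hidden nodes give (1 + 546) * 100 genes.
    ch = random_chromosome(n_inputs=546, n_hidden=100, rng=rng)
    print(len(ch))
```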
The fitness function of OGA-ELM is calculated as shown in Eq. (8) to maximise the accuracy.
Table 4 Parameters of the ELM and OGA [4]

| ELM parameter | Value | OGA parameter | Value |
|---|---|---|---|
| ch | Combined bias and input weight | Number of iterations | 500 |
| ρ | Output weight matrix | Population size | 100 |
| Input weight | −1 to 1 | Crossover | Arithmetical |
| Value of the biases | 0–1 | Mutation | Uniform |
| Input node numbers | Input attributes | Population of the crossover (POPC) | The crossover population, which is 70% of the population |
| Hidden node numbers | 100–300, with a step (increment) of 25 | Population of the mutation (POPM) | The mutation population, which is 30% of the population |
| Output neuron (m) | Class value | Gamma value | 0.4 |
| Activation function | Sigmoid | Selection criteria | Random |
| Regularization factor (C) | 5 | | |
$$f(ch) = \sqrt{\frac{\sum_{j=1}^{N} \left\| \sum_{k=1}^{L} \rho_k \, g(w_k \cdot x_j + b_k) - t_j \right\|_2^2}{N}} \tag{8}$$
where:
$\rho$ = matrix of the output weights;
$t_j$ = expected output; and
$N$ = number of training samples.
The output weights are given by:
$$\rho = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T \tag{9}$$
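Equations (8) and (9) can be sketched with NumPy as follows. The regularisation factor C = 5 is taken from Table 4; the function names are hypothetical. Since Eq. (8) is a root-mean-square error, the sketch assumes that lower values indicate a closer fit to the targets.

```python
import numpy as np


def output_weights(H, T, C=5.0):
    """Regularised output weights, Eq. (9): rho = H^T (I/C + H H^T)^-1 T."""
    n_samples = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(n_samples) / C + H @ H.T, T)


def fitness_rmse(H, T, rho):
    """Fitness of Eq. (8): root of the mean squared residual norm."""
    residual = H @ rho - T                      # row j is sum_k rho_k g(.) - t_j
    return float(np.sqrt(np.sum(residual ** 2) / H.shape[0]))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    H = rng.random((6, 4))                      # toy hidden-layer outputs
    T = rng.random((6, 2))                      # toy targets
    rho = output_weights(H, T)
    print(rho.shape, fitness_rmse(H, T, rho))
```

In an OGA-ELM setting, `H` would be computed from the weights and biases decoded from a chromosome, so each chromosome receives its own fitness value.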
Start
Load the dataset
Divide the dataset into a training set and a testing set
Select the activation function and determine the number of hidden layer nodes L
Begin the parameter optimization (OGA)
    Initial population:
        t ← 1 // iteration number
        P = {ch_1, ch_2, ..., ch_100} // randomly initialise the input weights and biases
    Evaluation:
        Train the ELM and calculate the fitness value of each chromosome according to Eq. (8)
    While (not termination condition) do
        t ← t + 1
        Genetic operators:
        POPC ← {} // initialise the crossover population
        While |POPC| < |70% P| do
            Based on the random selection criterion, select a pair of parents for crossover
            Mate the parents to create children (Child_1 and Child_2)
            POPC ← {Child_1, Child_2}
        end while
        POPM ← {} // initialise the mutation population
        While |POPM| < |30% P| do
            Randomly select a parent for mutation
            Perform mutation to create a child (Child_M)
            POPM ← Child_M
        end while
        P ← {POPC, POPM} // merge POPC and POPM to get the next generation
        Train the ELM and calculate the fitness value of each chromosome according to Eq. (8)
        Sort P based on the fitness values
    end while
    Get the optimal weights and thresholds between the input layer and the hidden layer
End the parameter optimization (OGA)
Calculate the output matrix H of the hidden layer, Eq. (10)
Calculate the output weights ρ according to Eq. (9)
Save the predicting ELM model
Obtain the emotion prediction results
Calculate the average accuracy rate
End
Fig. 5 Pseudocode of the OGA-ELM [4]
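$$ H = \begin{bmatrix} g(w_1 \cdot X_1 + b_1) & \cdots & g(w_L \cdot X_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot X_N + b_1) & \cdots & g(w_L \cdot X_N + b_L) \end{bmatrix}_{N \times L}, \quad \rho = \begin{bmatrix} \rho_1^T \\ \vdots \\ \rho_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} \tag{10} $$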
T = [t_1, t_2, …, t_N]^T refers to the expected output vector of the training set, and ρ = [ρ_1, ρ_2, …, ρ_L]^T
is the output weights matrix. I is the identity matrix and C is the regularization factor,
which can be obtained by cross-validation in the training process. H in Eq. (10) is the hidden
layer output matrix of the ELM network; in H, the i-th column is the output of the i-th hidden
layer node with respect to the input nodes. The activation function g is infinitely differentiable when the
desired number of hidden nodes is L ≤ N. A detailed explanation of the OGA-ELM is
provided in the following steps:
First, generate the initial population (P) randomly, P = {ch_1, ch_2, …, ch_100}.
Second, calculate the fitness value for each chromosome (ch) of the population using Eq. (8).
Fig. 6 Flowchart of the OGA-ELM [4]
Third, the chromosomes are arranged based on their fitness values f(ch). Next,
the random selection criterion is used to select a pair of parents from the present population
for the crossover operation, which creates a pair of new children for the new population.
The random selection criterion refers to the process that randomly picks a chromosome
from the population to be used in one of the two operations: crossover or mutation. Under
the random selection criterion, every chromosome of the population has an
equal chance of being chosen.
Fourth, the arithmetic crossover is applied to exchange information between the two
previously selected parents. The new children obtained by crossover operations are saved into
the Population of the Crossover (POPC) until it reaches 70% of the population. The
arithmetic crossover is represented by the following formulae:
$$ Child_1 = \alpha \cdot x + (1 - \alpha) \cdot y \tag{11} $$
$$ Child_2 = \alpha \cdot y + (1 - \alpha) \cdot x \tag{12} $$
subject to the boundaries (lower and upper bounds of [-1, 1] for the input-hidden layer weights
and [0, 1] for the hidden layer biases). If the value of a gene exceeds
the upper bound, it is set equal to the upper bound; if the
value of a gene falls below the lower bound, it is set equal to the
lower bound. α is a randomly generated array with the size of the chromosome, and each
value of this array is randomly generated in the range from -gamma to gamma + 1, which is (-0.4,
1.4). x and y represent the first and second selected parents.
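The arithmetic crossover of Eqs. (11) and (12), including the clamping to the gene bounds, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `arithmetic_crossover` and the three-gene example chromosome (two input weights and one bias) are hypothetical and are not taken from the authors' implementation.

```python
import random


def arithmetic_crossover(x, y, lower, upper, gamma=0.4, rng=random):
    """Eqs. (11)-(12) with a per-gene alpha drawn from (-gamma, 1 + gamma)."""
    child1, child2 = [], []
    for xi, yi, lo, hi in zip(x, y, lower, upper):
        alpha = rng.uniform(-gamma, 1.0 + gamma)
        c1 = alpha * xi + (1.0 - alpha) * yi
        c2 = alpha * yi + (1.0 - alpha) * xi
        # Genes that leave the allowed range are set to the violated bound.
        child1.append(min(max(c1, lo), hi))
        child2.append(min(max(c2, lo), hi))
    return child1, child2


# Hypothetical chromosome: two input weights in [-1, 1] and one bias in [0, 1].
lower = [-1.0, -1.0, 0.0]
upper = [1.0, 1.0, 1.0]
parent_x = [0.9, -0.5, 0.2]
parent_y = [-0.7, 0.8, 0.9]

c1, c2 = arithmetic_crossover(parent_x, parent_y, lower, upper, rng=random.Random(1))
print(c1, c2)
```

Because alpha may fall outside [0, 1], a child can land beyond both parents, which is why the clamping step is needed.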
Fifth, the random selection criterion is also used to choose a chromosome from
the present population before implementing mutation. Mutation is applied to alter
randomly selected genes of the chromosome. This work utilises uniform mutation. The uniform
Table 5 Description of the BES dataset used in the SI scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of All Utterances 68 46 71 81 79 62 126
Number of Training Utterances 54 37 57 65 63 50 101
Number of Testing Utterances 14 9 14 16 16 12 25
Emotion Label 1 2 3 4 5 6 7
Table 6 Overall results of the OGA-ELM in the SI scenario
No. of neurons  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
100 81 611 25 25 93.26 76.42 76.42 76.42 76.42 381.9471
125 76 606 30 30 91.91 71.70 71.70 71.70 71.70 395.5644
150 69 599 37 37 90.03 65.09 65.09 65.09 65.09 410.5236
175 73 603 33 33 91.11 68.87 68.87 68.87 68.87 426.4971
200 64 594 42 42 88.68 60.38 60.38 60.38 60.38 438.4431
225 78 608 28 28 92.45 73.58 73.58 73.58 73.58 451.8474
250 69 599 37 37 90.03 65.09 65.09 65.09 65.09 465.4381
275 67 597 39 39 89.49 63.21 63.21 63.21 63.21 478.1978
300 76 606 30 30 91.91 71.70 71.70 71.70 71.70 490.1058
mutation substitutes the value of each selected gene with a uniform random value chosen
between the gene's user-specified lower and upper bounds ([-1, 1] for the input-hidden layer weights
and [0, 1] for the hidden layer biases). The new child obtained from mutation is
saved into the Population of the Mutation (POPM) until the POPM reaches 30% of the
population.
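A uniform mutation of this kind might look as follows in Python. The paper does not state how many genes are selected per chromosome, so the per-gene probability `rate` is an assumption of this sketch, as are the function name `uniform_mutation` and the example chromosome.

```python
import random


def uniform_mutation(chromosome, lower, upper, rate=0.1, rng=random):
    """Replace each selected gene with a uniform draw from its [lower, upper] range.

    `rate` (probability that a gene is selected) is a hypothetical parameter.
    """
    child = list(chromosome)  # the parent is left unchanged
    for i, (lo, hi) in enumerate(zip(lower, upper)):
        if rng.random() < rate:
            child[i] = rng.uniform(lo, hi)
    return child


# Hypothetical chromosome: two input weights in [-1, 1] and one bias in [0, 1].
lower = [-1.0, -1.0, 0.0]
upper = [1.0, 1.0, 1.0]
parent = [0.9, -0.5, 0.2]

child = uniform_mutation(parent, lower, upper, rate=0.5, rng=random.Random(3))
print(child)
```

Since every new value is drawn inside the gene's own bounds, no clamping step is needed after mutation.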
After the selection, mutation, and crossover operations are completed, a new population is
created by merging the POPM and POPC. The next iteration continues
with this new population, and the process is repeated. The iterative process
stops when either the results have converged or the number of iterations exceeds the
maximum limit. The pseudocode and flowchart of the OGA-ELM are shown in Figs. 5 and 6,
respectively.
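The generation step described above (70% of the new population from crossover, 30% from mutation, then merging and sorting) can be sketched as below. This is an illustrative outline only: `toy_fitness` is a stand-in for Eq. (8) rather than the ELM training error, and the mutation `rate`, the three-gene chromosome, and the 20-generation demo loop are assumptions chosen to keep the example small.

```python
import random


def clip(value, lo, hi):
    return min(max(value, lo), hi)


def next_generation(population, lower, upper, rng, gamma=0.4, rate=0.1):
    """Build one new population: 70% crossover children, 30% mutation children."""
    size = len(population)
    n_cross = round(0.7 * size)

    popc = []  # crossover population (POPC)
    while len(popc) < n_cross:
        x, y = rng.sample(population, 2)  # random selection, equal chance for all
        alpha = [rng.uniform(-gamma, 1 + gamma) for _ in x]
        child1 = [clip(a * xi + (1 - a) * yi, lo, hi)
                  for a, xi, yi, lo, hi in zip(alpha, x, y, lower, upper)]
        child2 = [clip(a * yi + (1 - a) * xi, lo, hi)
                  for a, xi, yi, lo, hi in zip(alpha, x, y, lower, upper)]
        popc.extend([child1, child2])
    popc = popc[:n_cross]  # guard against overshoot when n_cross is odd

    popm = []  # mutation population (POPM)
    while len(popm) < size - n_cross:
        parent = rng.choice(population)
        popm.append([rng.uniform(lo, hi) if rng.random() < rate else g
                     for g, lo, hi in zip(parent, lower, upper)])

    return popc + popm  # merge POPC and POPM


def toy_fitness(ch):
    """Stand-in for Eq. (8): lower is better. NOT the ELM training error."""
    return sum(g * g for g in ch)


rng = random.Random(0)
lower = [-1.0, -1.0, 0.0]  # two input weights in [-1, 1], one bias in [0, 1]
upper = [1.0, 1.0, 1.0]
population = [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
              for _ in range(100)]

for generation in range(20):  # the paper reports 500 iterations; 20 keeps the demo quick
    population = next_generation(population, lower, upper, rng)
    population.sort(key=toy_fitness)  # sort P by fitness, best first

best = population[0]
print(best, toy_fitness(best))
```

In the actual method, each fitness evaluation would train an ELM with the weights and biases encoded in the chromosome, which is why the OGA-ELM takes longer to run than the basic ELM.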
3 Experimental results and discussion
Several experiments were conducted based on four different scenarios: Subject
Dependent (SD), Subject Independent (SI), Gender Dependent Female (GD-Female),
Fig. 7 The confusion matrix for the best results of OGA-ELM in SI scenario
Table 7 Best results of the OGA-ELM in the SI scenario for each class
Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 93.40 95.28 90.57 97.17 96.23 96.23 83.96
Precision 100.00 83.33 66.67 93.33 83.33 78.57 61.76
Recall 50.00 55.56 57.14 87.50 93.75 91.67 84.00
F-Measure 66.67 66.67 61.54 90.32 88.24 84.62 71.19
G-Mean 70.71 68.04 61.72 90.37 88.39 84.87 72.03
tp 7 5 8 14 15 11 21
tn 92 96 88 89 87 91 68
fp 0 1 4 1 3 3 13
fn 7 4 6 2 1 1 4
and Gender Dependent Male (GD-Male). In the SI scenario, we disregarded both the
content of the utterance (the sentence spoken) and the gender of the speaker (male or female),
and used the whole BES dataset. In the SD scenario, we took the
content of the utterance into account and ignored the gender.
Since the BES dataset contains 10 different sentences, in this
scenario we separated the BES dataset into 10 sub-datasets based on emotions and
sentences. Finally, in both the GD-Male and GD-Female scenarios, we took the
gender into account and ignored the content of the utterance.
Thus, the BES dataset for both scenarios was separated based on emotions
and gender. For the GD-Male scenario, the dataset used only the
utterances recorded by males, whilst for the GD-Female scenario the
dataset used only the utterances recorded by females.
The OGA-ELM was applied in several experiments in each scenario, based on the scenario's
dataset, with a varying number of hidden neurons; each experiment ran for 500 iterations.
All the experiments were implemented in MATLAB R2019a
on a PC with a 3.20 GHz Core i7 processor, 16 GB of RAM, and a 1 TB SSD
(Windows 10). The evaluation in this study is based on [50], and varying
measures were applied. [50] was selected because it tackles the classifier evaluation issue
by presenting effective measurements. The performance of learning algorithms in
Supervised Machine Learning (SML) can be evaluated in numerous ways.
Furthermore, the confusion matrix of the recognized examples for each class,
according to their correct classification rate, is presented in order to evaluate the quality of the
classification.
Therefore, numerous evaluation measurements were utilized to evaluate the
proposed OGA-ELM approach. The evaluation measurements rely on the ground truth, which
Table 8 Description of the BES dataset used in the SD scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of All Utterances 7 9 6 8 9 7 14
Number of Training Utterances 6 7 5 6 7 6 11
Number of Testing Utterances 1 2 1 2 2 1 3
Sentence Code a02 a02 a02 a02 a02 a02 a02
Emotion Label 1 2 3 4 5 6 7
Table 9 Best overall results of the OGA-ELM in the SD scenario
No. of neurons  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
100 12 72 0 0 100.00 100.00 100.00 100.00 100.00 253.8953
125 10 70 2 2 95.24 83.33 83.33 83.33 83.33 261.1620
150 11 71 1 1 97.62 91.67 91.67 91.67 91.67 270.1585
175 8 68 4 4 90.48 66.67 66.67 66.67 66.67 276.5632
200 9 69 3 3 92.86 75.00 75.00 75.00 75.00 282.2085
225 11 71 1 1 97.62 91.67 91.67 91.67 91.67 290.1031
250 9 69 3 3 92.86 75.00 75.00 75.00 75.00 296.8769
275 10 70 2 2 95.24 83.33 83.33 83.33 83.33 301.2103
300 8 68 4 4 90.48 66.67 66.67 66.67 66.67 304.7360
entails applying the model to predict the answer on the evaluation dataset, followed by
a comparison between the predicted target and the actual answer. The proposed OGA-ELM
approach was evaluated in terms of accuracy, precision, recall,
F-measure, and G-mean. Eqs. (13)–(17) [1, 5, 8] define these evaluation
measurements.
$$ accuracy = \frac{tp + tn}{tp + tn + fn + fp} \tag{13} $$
$$ precision = \frac{tp}{tp + fp} \tag{14} $$
$$ recall = \frac{tp}{tp + fn} \tag{15} $$
Fig. 8 The confusion matrix for the best results of OGA-ELM in SD scenario
Table 10 Best results of the OGA-ELM in the SD scenario for each class
Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Precision 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Recall 100.00 100.00 100.00 100.00 100.00 100.00 100.00
F-Measure 100.00 100.00 100.00 100.00 100.00 100.00 100.00
G-Mean 100.00 100.00 100.00 100.00 100.00 100.00 100.00
tp 1 2 1 2 2 1 3
tn 11 10 11 10 10 11 9
fp 0 0 0 0 0 0 0
fn 0 0 0 0 0 0 0
$$ F\text{-}Measure = \frac{2 (precision \times recall)}{precision + recall} \tag{16} $$
$$ G\text{-}Mean = \sqrt{recall \times precision} \tag{17} $$
where
tn refers to true-negative; tp refers to true-positive; fn refers to false-negative; and fp refers
to false-positive.
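As a worked check of Eqs. (13)–(17), the short Python sketch below sums the per-class counts reported in Table 7 and recomputes the overall measures. The function name `metrics` is chosen for this example only, and the observation that the overall counts equal the sums of the per-class counts is inferred from the tables rather than stated by the authors.

```python
import math


def metrics(tp, tn, fp, fn):
    """Evaluation measures of Eqs. (13)-(17), returned as fractions in [0, 1]."""
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * (precision * recall) / (precision + recall)
    g_mean = math.sqrt(recall * precision)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "g_mean": g_mean}


# Per-class counts from Table 7 (SI scenario, 100 hidden neurons), in the order
# Anxiety, Disgust, Happiness, Boredom, Neutral, Sadness, Anger.
tp = [7, 5, 8, 14, 15, 11, 21]
tn = [92, 96, 88, 89, 87, 91, 68]
fp = [0, 1, 4, 1, 3, 3, 13]
fn = [7, 4, 6, 2, 1, 1, 4]

overall = metrics(sum(tp), sum(tn), sum(fp), sum(fn))
print({name: round(100 * value, 2) for name, value in overall.items()})
```

The summed counts are tp = 81, tn = 611, fp = 25, and fn = 25, which match the 100-neuron row of Table 6 and give an accuracy of 93.26% and a precision of 76.42%. Because every misclassified utterance counts once as a false positive and once as a false negative, the summed fp and fn are equal, so the overall precision and recall coincide. The sketch does not guard against a class with no predicted positives, where tp + fp would be zero.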
The results of the four different scenarios (SI, SD, GD-Male, and GD-Female)
are presented and discussed separately in the following sub-sections.
3.1 Subject Independent (SI) Scenario
This section presents and discusses the performance of the OGA-ELM in the SI scenario. In the SI
scenario, we disregarded both the content of the utterance (the sentence spoken) and
the gender of the speaker (male or female), and used the whole BES dataset. Table 5 provides the description
of the dataset used in this scenario. 80% of the dataset, which is equal to 427
utterances, was used as the training dataset, while the remaining 20% of the dataset, which is equal
to 106 utterances, was used as the testing dataset.
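The per-class counts in Table 5 are consistent with assigning 80% of each emotion class, rounded to the nearest integer, to the training set. The following Python sketch reproduces the counts under that assumption; the rounding rule is an observation made here, not a procedure described by the authors.

```python
# Number of utterances per emotion in the SI scenario (Table 5).
counts = {"Anxiety": 68, "Disgust": 46, "Happiness": 71, "Boredom": 81,
          "Neutral": 79, "Sadness": 62, "Anger": 126}

# Assumed rule: 80% of each class, rounded to the nearest integer, for training.
train = {emotion: round(0.8 * n) for emotion, n in counts.items()}
test = {emotion: n - train[emotion] for emotion, n in counts.items()}

print(train)
print(test)
print(sum(counts.values()), sum(train.values()), sum(test.values()))
```

The totals are 533 utterances overall, 427 for training, and 106 for testing, in agreement with the figures quoted above.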
The best experimental results of the proposed OGA-ELM in the SI scenario were obtained with
100 hidden neurons, where the overall accuracy is 93.26%. The other evaluation
measurements achieved 76.42%, 76.42%, 76.42%, and 60.89% for precision, recall, F-
measure and G-mean, respectively. The overall results of the evaluation measurements are
Table 11 Description of the dataset used in the GD-Female scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of All Utterances 37 36 51 51 51 44 80
Number of Training Utterances 30 27 41 41 41 35 64
Number of Testing Utterances 7 7 10 10 10 9 16
Emotion Label 1 2 3 4 5 6 7
Table 12 Best overall results of the OGA-ELM in the GD-Female scenario
No. of neurons  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
100 62 407 7 7 97.10 89.86 89.86 89.86 89.86 332.5520
125 56 401 13 13 94.62 81.16 81.16 81.16 81.16 340.8604
150 52 397 17 17 92.96 75.36 75.36 75.36 75.36 351.3987
175 41 386 28 28 88.41 59.42 59.42 59.42 59.42 360.3810
200 53 398 13 13 93.37 76.81 76.81 76.81 76.81 372.7795
225 55 400 14 14 94.20 79.71 79.71 79.71 79.71 381.8452
250 50 395 19 19 92.13 72.46 72.46 72.46 72.46 393.0418
275 53 398 13 13 93.37 76.81 76.81 76.81 76.81 405.2617
300 48 393 21 21 91.30 69.57 69.57 69.57 69.57 413.9911
shown in Table 6. In addition, Fig. 7 shows the confusion matrix of the OGA-ELM for the best
results, whilst Table 7 presents the results of the evaluation measures for each class.
3.2 Subject Dependent (SD) Scenario
This section presents and discusses the best experimental results of the OGA-ELM in the SD scenario.
In the SD scenario, we took the content of the utterance (the sentence spoken) into account and
ignored the gender (male or female). Since the BES dataset contains 10 different
sentences, in the SD scenario we separated the BES dataset into 10 sub-datasets based on
emotions and sentences. The accuracy results of the OGA-ELM experiments in the SD scenario
were in the range of 94.81%–100.00%. As mentioned earlier, only the highest performance is
reported in this study. The highest performance of the OGA-ELM in the SD scenario was
obtained with the sentence code "a02", whose sentence is "Das will sie am Mittwoch abgeben." Table 8
provides the description of the dataset used in the SD scenario. 80% of the dataset,
which is equal to 48 utterances, was used as the training dataset, while the remaining 20% of the
dataset, which is equal to 12 utterances, was used as the testing dataset.
Fig. 9 The confusion matrix for the best results of OGA-ELM in GD-Female scenario
Table 13 Best results of the OGA-ELM in the GD-Female scenario for each class
Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 98.55 97.10 94.20 98.55 98.55 97.10 95.65
Precision 100.00 85.71 70.00 90.00 90.00 88.89 100.00
Recall 87.50 85.71 87.50 100.00 100.00 88.89 84.21
F-Measure 93.33 85.71 77.78 94.74 94.74 88.89 91.43
G-Mean 93.54 85.71 78.26 94.87 94.87 88.89 91.77
tp 7 6 7 9 9 8 16
tn 61 61 58 59 59 59 50
fp 1 1 1 0 0 1 3
fn 0 1 3 1 1 1 0
The best experimental results of the proposed OGA-ELM in the SD scenario were
obtained with 100 hidden neurons, where the overall accuracy is 100.00%. The
other evaluation measurements achieved 100.00%, 100.00%, 100.00%, and
100.00% for precision, recall, F-measure and G-mean, respectively. The overall results
of the evaluation measurements are shown in Table 9. Also, Fig. 8 shows the
confusion matrix of the OGA-ELM for the best results, whilst Table 10 presents the
results of the evaluation measures for each class.
3.3 Gender Dependent Female (GD-Female) Scenario
This section presents and discusses the best experimental results of the OGA-ELM in the
GD-Female scenario. In the GD-Female scenario, we took the gender (male and
female) into account and ignored the content of the utterance (the sentence spoken). The GD-
Female scenario used only the utterances of the BES dataset recorded by
females. Table 11 provides the description of the dataset used in the
GD-Female scenario. 80% of the dataset, which is equal to 281 utterances, was used as
the training dataset, while the remaining 20% of the dataset, which is equal to 69
utterances, was used as the testing dataset.
The best experimental results of the proposed OGA-ELM in the GD-Female scenario
were obtained with 100 hidden neurons, where the overall accuracy is 97.10%. The
other evaluation measures achieved 89.86%, 89.86%, 89.86%, and 81.32%
for precision, recall, F-measure and G-mean, respectively. The overall results of the
evaluation measurements are shown in Table 12. In addition, Fig. 9 shows the confusion
matrix of the OGA-ELM for the best results, whilst Table 13 presents the results of the
evaluation measures for each class.
Table 14 Description of the dataset used in the GD-Male scenario
Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Number of All Utterances 31 10 20 30 28 18 46
Number of Training Utterances 25 8 16 24 22 14 37
Number of Testing Utterances 6 2 4 6 6 4 9
Emotion Label 1 2 3 4 5 6 7
Table 15 Best overall results of the OGA-ELM in the GD-Male scenario
No. of neurons  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
100 32 217 5 5 96.14 86.49 86.49 86.49 86.49 276.6538
125 28 213 9 9 93.05 75.68 75.68 75.68 75.68 280.4774
150 31 216 6 6 95.37 83.78 83.78 83.78 83.78 285.0864
175 24 209 13 13 89.96 64.86 64.86 64.86 64.86 289.9731
200 27 212 10 10 92.28 72.97 72.97 72.97 72.97 294.3590
225 29 214 8 8 93.82 78.38 78.38 78.38 78.38 301.1567
250 27 212 10 10 92.28 72.97 72.97 72.97 72.97 307.5201
275 31 216 6 6 95.37 83.78 83.78 83.78 83.78 315.0164
300 26 211 11 11 91.51 70.27 70.27 70.27 70.27 320.1097
3.4 Gender Dependent Male (GD-Male) Scenario
This section presents and discusses the best experimental results of the OGA-ELM in the GD-Male
scenario. In the GD-Male scenario, we took the gender (male and female) into account and ignored
the content of the utterance (the sentence spoken). The GD-Male scenario used only the
utterances of the BES dataset recorded by males. Table 14 provides the
description of the dataset used in the GD-Male scenario. 80% of the dataset, which is
equal to 146 utterances, was used as the training dataset, while the remaining 20% of the dataset,
which is equal to 37 utterances, was used as the testing dataset.
The best experimental results of the proposed OGA-ELM in the GD-Male scenario were
obtained with 100 hidden neurons, where the overall accuracy is 96.14%. The other
evaluation measurements achieved 86.49%, 86.49%, 86.49%, and 75.55% for precision,
recall, F-measure and G-mean, respectively. The overall results of the evaluation
measurements are shown in Table 15. Also, Fig. 10 shows the confusion matrix of the OGA-ELM for the
best results, whilst Table 16 presents the results of the evaluation measures for each class.
Based on all the above-mentioned experimental results (Tables 5–16), we can make a
critical observation. The OGA could create suitable weights and biases for the single hidden
Fig. 10 The confusion matrix for the best results of OGA-ELM in GD-Male scenario
Table 16 Best results of the OGA-ELM in the GD-Male scenario for each class
Anxiety Disgust Happiness Boredom Neutral Sadness Anger
Accuracy 100.00 97.30 100.00 94.59 97.30 94.59 89.19
Precision 100.00 100.00 100.00 83.33 100.00 100.00 69.23
Recall 100.00 50.00 100.00 83.33 83.33 50.00 100.00
F-Measure 100.00 66.67 100.00 83.33 90.91 66.67 81.82
G-Mean 100.00 70.71 100.00 83.33 91.29 70.71 83.21
tp 6 1 4 5 5 2 9
tn 31 35 33 30 31 33 24
fp 0 0 0 1 0 0 4
fn 0 1 0 1 1 2 0
layer of the ELM in order to minimize classification errors. Avoiding unsuitable
weights and biases prevents the ELM from getting stuck in local maxima of weights and
biases. Consequently, the OGA-ELM performed well in the four
different scenarios, with an accuracy of 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD,
GD-Male, and GD-Female scenarios, respectively.
Furthermore, the proposed OGA-ELM approach is compared with some recent works
[9, 16, 18, 27, 45, 55, 58, 60] in terms of accuracy based on the four different scenarios (i.e., SI,
SD, GD-Male, and GD-Female). All these methods used the BES dataset
in their experiments. Table 17 presents the accuracy comparison between the proposed
OGA-ELM and these previous works.
Based on the results in Table 17, the OGA-ELM
outperformed all the other previous works in the SI, SD, and GD-Male scenarios. Only in the
GD-Female scenario was the highest accuracy of the proposed OGA-ELM slightly lower than that of the
work in [58], which achieved 98.94% while the proposed OGA-ELM achieved 97.10%. This
indicates that generating suitable weights and biases for the ELM leads to fewer
classification errors. Moreover, avoiding unsuitable weights and biases prevents the ELM
from getting stuck in local maxima of weights and biases. This suggests that the
OGA-ELM is a reliable model for ESR. The best results are shown in Table 17.
In addition, numerous experiments were performed based on the basic ELM, Feedforward
Neural Network (NN), and SVM in the four different scenarios: SI, SD, GD-Male, and GD-
Female. Due to the page limit, we report only the highest performance of the ELM, NN,
and SVM in each scenario in terms of accuracy, recall, precision, G-mean, F-measure, true
positives, true negatives, false positives, false negatives, and execution time. Tables 18, 19, 20 and
21 provide the experimental results of the proposed OGA-ELM, basic ELM, NN, and SVM
in the SI, SD, GD-Male, and GD-Female scenarios. The best performance of the proposed OGA-
Table 17 Comparison of accuracy (%) between the proposed OGA-ELM and other previous works
Reference  No. of emotions  SI  SD  GD-Male  GD-Female
ELM [58] 7 90.31 99.47 92.98 98.94
KELM [27] 7 92.90 – –
SMO [18] 7 75.50 – –
SVM [45] 3 88.33 – –
SVM [9] 7 – – 94.90 85.77
SVM [16] 7 82.10 – –
SVM [55] 6 88.80 – –
SVM [60] 7 70.59 – –
OGA-ELM 7 93.26 100.00 96.14 97.10
Table 18 Best overall results of the proposed OGA-ELM, basic ELM, NN and SVM in the SI scenario
Approach  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
OGA-ELM 81 611 25 25 93.26 76.42 76.42 76.42 76.42 381.9471
Basic ELM 46 576 60 60 83.83 43.40 43.40 43.40 43.40 22.1431
NN 43 573 63 63 83.02 40.57 40.57 40.57 40.57 25.0304
SVM 42 572 64 64 82.75 39.62 39.62 39.62 39.62 22.9662
ELM achieved an accuracy of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-
Male, and GD-Female scenarios, respectively. The best performance of the basic ELM
was an accuracy of 83.83%, 85.71%, 83.01%, and 84.68% for the SI, SD, GD-Male, and GD-
Female scenarios, respectively. Further, the best performance of the NN was an
accuracy of 83.02%, 83.33%, 81.47%, and 83.44% for the SI, SD, GD-Male, and GD-Female
scenarios, respectively. The best performance of the SVM was an accuracy of 82.75%,
85.71%, 82.24%, and 85.51% for the SI, SD, GD-Male, and GD-Female scenarios, respectively.
Based on the results in Tables 18, 19, 20 and 21, the OGA-
ELM outperformed the basic ELM, NN and SVM in the four different scenarios: SI, SD, GD-
Male, and GD-Female. This indicates that generating suitable weights and biases for the
ELM leads to fewer classification errors. Thus, the OGA-ELM
performed well in the four different scenarios compared with some previous works (see
Table 17) and with the basic ELM, NN and SVM (see Tables 18, 19, 20 and 21). The best performance of the
OGA-ELM was an accuracy of 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD, GD-
Male, and GD-Female scenarios, respectively. The original ELM, NN, and SVM algorithms
outperform the proposed OGA-ELM in terms of execution time because the OGA-ELM is
based on a GA, which needs more time to obtain the best values of the input weights and biases.
4 Conclusion
In this study, we have proposed an enhanced ESR system that is based on the conventional
MFCC features and our previously developed ELM variant, named OGA-ELM. The OGA-
ELM was evaluated in four different scenarios (SI, SD, GD-Male, and GD-Female) using
the BES dataset. The outcome indicated the superiority of the OGA-
ELM over some previous works (see Table 17) and over the basic ELM, NN, and SVM (see Tables 18,
19, 20 and 21) in the four different scenarios. The best performance of the OGA-ELM
was an accuracy of 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD, GD-Male, and
Table 19 Best overall results of the proposed OGA-ELM, basic ELM, NN and SVM in the SD scenario
Approach  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
OGA-ELM 12 72 0 0 100.00 100.00 100.00 100.00 100.00 253.8953
Basic ELM 6 66 6 6 85.71 50.00 50.00 50.00 50.00 10.0748
NN 5 65 7 7 83.33 41.67 41.67 41.67 41.67 12.5962
SVM 6 66 6 6 85.71 50.00 50.00 50.00 50.00 11.0365
Table 20 Best overall results of the proposed OGA-ELM, basic ELM, NN and SVM in the GD-Male scenario
Approach  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
OGA-ELM 32 217 5 5 96.14 86.49 86.49 86.49 86.49 276.6538
Basic ELM 15 200 22 22 83.01 40.54 40.54 40.54 40.54 13.0750
NN 13 198 24 24 81.47 35.14 35.14 35.14 35.14 15.6759
SVM 14 199 23 23 82.24 37.84 37.84 37.84 37.84 14.0021
GD-Female scenarios, respectively. Since the current study considered only offline ESR,
future work will aim to create an ESR system that can handle the online execution of
feature extraction and classification in real time. Such a system
could be used to analyse and detect the caller's emotion in call centers. In addition,
the proposed OGA-ELM could be applied in various applications such as voice
pathology detection, accent classification, and speaker identification. Finally, other
optimisation approaches for the ELM will be explored in order to generate the most suitable
weights and biases for the ELM, which leads to minimizing classification errors.
Acknowledgements This project was funded by the Universiti Kebangsaan Malaysia under the Dana Impak
Perdana grant (Research code: GUP-2020-063).
References
1. Albadr MA, Tiun S, Ayob M, al-Dhief F (2020) Genetic algorithm based on natural selection theory for
optimization problems. Symmetry 12(11):1758
2. Albadr MAA, Tiun S (2020) Spoken language identification based on particle swarm optimisation–extreme
learning machine approach. Circ Syst Signal Process 1–27
3. Albadr MAA, Tiun S, al-Dhief FT, Sammour MAM (2018) Spoken language identification based on the
enhanced self-adjusting extreme learning machine approach. PLoS One 13(4):e0194770
4. Albadr MAA, Tiun S, Ayob M, al-Dhief FT (2019) Spoken language identification based on optimised
genetic algorithm–extreme learning machine approach. Int J Speech Technol 22(3):711–727
5. Albadr MAA, Tiun S, Ayob M, al-Dhief FT, Omar K, Hamzah FA (2020) Optimised genetic algorithm-
extreme learning machine approach for automatic COVID-19 detection. PLoS One 15(12):e0242899
6. Albadr MAA, Tiun S (2017) Extreme learning machine: a review. Int J Appl Eng Res 12(14):4610–4623
7. Al-Dhief FT et al (2020) A survey of voice pathology surveillance systems based on internet of things and
machine learning algorithms. IEEE Access 8:64514–64533
8. Al-Dhief FT et al (2020) Voice pathology detection using machine learning technique. In: 2020 IEEE 5th
international symposium on telecommunication technologies (ISTT). IEEE
9. Alonso JB, Cabrera J, Medina M, Travieso CM (2015) New approach in quantification of emotional
intensity from the speech signal: emotional temperature. Expert Syst Appl 42(24):9554–9564
10. Badshah AM et al (2017) Speech emotion recognition from spectrograms with deep convolutional neural
network. In: 2017 international conference on platform technology and service (PlatCon). IEEE
11. Baroi OL et al (2019) Effects of different environmental noises and sampling frequencies on the
performance of MFCC and PLP based Bangla isolated word recognition system. In: 2019 1st international
conference on advances in science, engineering and robotics technology (ICASERT). IEEE
12. Basu S et al (2017) A review on emotion recognition using speech. In: 2017 international conference on
inventive communication and computational technologies (ICICCT). IEEE
13. Bi W, Xu Y, Wang H (2020) Comparison of searching behaviour of three evolutionary algorithms applied
to water distribution system design optimization. Water 12(3):695
14. Burkhardt F et al (2005) A database of German emotional speech. In: Ninth European Conference on
Speech Communication and Technology
Table 21 Best overall results of the proposed OGA-ELM, basic ELM, NN and SVM in the GD-Female scenario
Approach  tp  tn  fp  fn  Accuracy  Precision  Recall  F-Measure  G-Mean  Execution (training/testing) time (s)
OGA-ELM 62 407 7 7 97.10 89.86 89.86 89.86 89.86 332.5520
Basic ELM 32 377 37 37 84.68 46.38 46.38 46.38 46.38 15.1266
NN 29 374 40 40 83.44 42.03 42.03 42.03 42.03 17.8584
SVM 34 379 35 35 85.51 49.28 49.28 49.28 49.28 16.0334
15. Calvo RA, D'Mello S (2010) Affect detection: an interdisciplinary review of models, methods, and their
applications. IEEE Trans Affect Comput 1(1):18–37
16. Cao H, Verma R, Nenkova A (2015) Speaker-sensitive emotion recognition via ranking: studies on acted
and spontaneous speech. Comput Speech Lang 29(1):186–202
17. Chavhan Y, Dhore M, Yesaware P (2010) Speech emotion recognition using support vector machine. Int J
Comput Appl 1(20):6–9
18. Choudhury AR et al (2018) Emotion recognition from speech signals using excitation source and spectral
features. In: 2018 IEEE applied signal processing conference (ASPCON). IEEE
19. Dendukuri LS, Hussain SJ (2019) Statistical feature set calculation using Teager energy operator on
emotional speech signals. In: 2019 international conference on wireless communications signal processing
and networking (WiSPNET). IEEE
20. Deng C, Huang GB, Xu J, Tang JX (2015) Extreme learning machines: new trends and applications.
Science China Inf Sci 58(2):1–16
21. Dogra A, Kaul A, Sharma R (2019) Automatic recognition of dialects of Himachal Pradesh using MFCC
& GMM. In: 2019 5th international conference on signal processing, computing and control (ISPCC). IEEE
22. El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification
schemes, and databases. Pattern Recogn 44(3):572–587
23. Fortin F-A et al (2012) DEAP: evolutionary algorithms made easy. J Mach Learn Res 13(1):2171–2175
24. Gangamohan P, Kadiri SR, Yegnanarayana B (2016) Analysis of emotional speech – A review. In: Toward
Robotic Socially Believable Behaving Systems - Volume I. Springer, pp 205–238
25. Ghasemi J, Esmaily J, Moradinezhad R (2020) Intrusion detection system using an optimized kernel
extreme learning machine and efficient features. Sādhanā 45(1):1–9
26. Gogna A, Tayal A (2012) Comparative analysis of evolutionary algorithms for image enhancement. Int J
Met 2(1):80–100
27. Guo L, Wang L, Dang J, Liu Z, Guan H (2019) Exploration of complementary features for speech emotion
recognition based on kernel extreme learning machine. IEEE Access 7:75798–75809
28. Han W et al (2006) An efficient MFCC extraction method in speech recognition. In: 2006 IEEE
international symposium on circuits and systems. IEEE
29. Huang G-B, Chen L, Siew CK (2006) Universal approximation using incremental constructive feedforward
networks with random hidden nodes. IEEE Trans Neural Netw 17(4):879892
30. Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications.
Neurocomputing 70(1):489501
31. Huang G-B et al (2011) Extreme learning machine for regression and multiclass classification. IEEE Trans
Syst, Man Cybern, Part B (Cybernetics) 42(2):513529
32. Huang G-B et al (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans
Systems, Man, Cybernetics, Part B (Cybernetics) 42(2)513529
33. Jain M et al (2020) Speech emotion recognition using support vector machine. arXiv preprint arXiv:
2002.07590
34. Juvela L et al (2018) Speech waveform synthesis from MFCC sequences with generative adversarial
networks. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP).
IEEE
35. Kaya H, Karpov AA (2018) Efficient and effective strategies for cross-corpus acoustic emotion recognition.
Neurocomputing 275:10281034
36. Kaya H, Karpov AA, Salah AA (2016) Robust acoustic emotion recognition based on cascaded normal-
ization and extreme learning machines. In: international symposium on neural networks. Springer, 2016
37. Kostoulas T, Mporas I, Kocsis O, Ganchev T, Katsaounos N, Santamaria JJ, Jimenez-Murcia S, Fernandez-
Aranda F, Fakotakis N (2012) Affective speech interface in serious games for supporting therapy of mental
disorders. Expert Syst Appl 39(12):1107211079
38. Kuchibhotla S, Vankayalapati HD, Anne KR (2016) An optimal two stage feature selection for speech
emotion recognition using acoustic features. Int J Speech Technol 19(4):657667
39. Lopez-de-Ipiña K et al (2015) On automatic diagnosis of Alzheimers disease based on spontaneous speech
analysis and emotional temperature. Cogn Comput 7(1):4455
40. Mar LL, Pa WP (2019) Depression detection from speech emotion recognition. Seventeenth International
Conference on Computer Applications (ICCA 2019)
41. Muda L, Begam M, Elamvazuthi I (2010) Voice recognition algorithms using mel frequency cepstral
coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083
42. Murugappan M et al (2020) Emotion classification in Parkinson's disease EEG using RQA and ELM. In:
2020 16th IEEE international colloquium on Signal Processing & its Applications (CSPA). IEEE
43. Neiberg D, Elenius K (2008) Automatic recognition of anger in spontaneous speech. In: Ninth Annual
Conference of the International Speech Communication Association
23988 Multimedia Tools and Applications (2022) 81:23963–23989
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
44. Özseven T (2019) A novel feature selection method for speech emotion recognition. Appl Acoust 146:320
326
45. Pakyurek M, Atmis M, Kulac S, Uludag U (2020) Extraction of novel features based on histograms of
MFCCs used in emotion classification from generated original speech dataset. Elektronika ir
Elektrotechnika 26(1):4651
46. Petrushin VA (2000) Emotion recognition in speech signal: experimental study, development, and appli-
cation. In: Sixth International Conference on Spoken Language Processing
47. Poorna S, Nair G (2019) Multistage classification scheme to enhance speech emotion recognition. Int J
Speech Technol 22(2):327340
48. Renanti MD, Buono A, Kusuma WA (2013) Infant cries identification by using codebook as feature
matching, and mfcc as feature extraction. J Theoretical Appl Inform Technol 56(3)
49. Shah AF and Anto PB (2017) Hybrid spectral features for speech emotion recognition. In: 2017 interna-
tional conference on innovations in information, embedded and communication systems (ICIIECS). IEEE
50. Sokolova M, Japkowicz N, Szpakowicz S (2006) Beyond accuracy, F-score and ROC: a family of
discriminant measures for performance evaluation. In Australasian joint conference on artificial intelligence.
2006. Springer
51. Trang H, Loc TH, Nam HBH (2014) Proposed combination of PCA and MFCC feature extraction in speech
recognition system. In: 2014 International Conference on Advanced Technologies for Communications
(ATC 2014). IEEE
52. Tripathi A, Singh U, Bansal G, Gupta R, Singh AK (2020) A review onemotion detection and classification
using speech. Available at SSRN 3601803
53. Tzinis E, Potamianos A (2017) Segment-based speech emotion recognition using recurrent neural networks.
In: 2017 seventh international conference on affective computing and intelligent interaction (ACII). IEEE
54. van Heeswijk M (2015) Advances in extreme learning machines
55. Wang K et al (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput
6(1):6975
56. Wang Y, Cao F, Yuan Y (2011) A study on effectiveness of extreme learning machine. Neurocomputing
74(16):24832490
57. Wilhelmstötter F (2021) Jenetics Library Users Manual 6.2. [Online]. Available: https://jenetics.io
58. Yogesh C et al (2017) A new hybrid PSO assisted biogeography-based optimization for emotion and stress
recognition from speech signal. Expert Syst Appl 69:149158
59. Yu F et al (2016) Improved roulette wheel selection-based genetic algorithm for TSP. In: 2016 international
conference on network and information Systems for Computers (ICNISC), IEEE
60. Zaidan NA, Salam MS (2016) MFCC global features selection in improving speech emotion recognition
rate. In: Advances in machine learning and signal processing. Springer, p. 141153
61. Zhang X, Sun J, Luo Z (2014) One-against-all weighted dynamic time warping for language-independent
and speaker-dependent speech recognition in adverse conditions. PLoS One 9(2):e85458
62. Zhao S et al (2014) Automatic detection of expressed emotion in Parkinson's disease. In: 2014 IEEE
international conference on acoustics, speech and signal processing (ICASSP), IEEE
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Affiliations
Musatafa Abbas Abbood Albadr¹, Sabrina Tiun¹, Masri Ayob¹, Fahad Taha AL-Dhief², Khairuddin Omar¹, Mhd Khaled Maen³

¹ CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
² School of Electrical Engineering, Department of Communication Engineering, Universiti Teknologi Malaysia, UTM Johor Bahru, Johor, Malaysia
³ Department of Information and Technology, Uppsala University, Uppsala, Sweden
... The Random Vector Functional Link (RVFL) [14] is utilized to establish the foundation of the ELM, resulting in a highly efficient and adaptable system [12,13]. Research shows that engineering applications use ELM regularly [15][16][17]. However, it should be noted that there are indeed obstacles associated with ELM [18,19], such as the need for many hidden nodes to provide better generalization and the requirement to choose appropriate activation functions. ...
... To mitigate the impact of poorly conditioned matrices on the accuracy of results, the input weights and biases employed in ELM are selected randomly. Consequently, the resulting matrix may not accurately represent the total column rank [17,20,21]. Therefore, this work utilizes a GWO algorithm to improve ELM's conditioning and ensure optimal solutions are attained. ...
... The ELM initially assigns random weights and biases to the input layer and then subsequently computes the output layer weights based on these randomly generated values. The algorithm under consideration exhibits a higher rate of learning and superior performance in comparison to conventional neural network algorithms [17,21]. In Fig 2, you can see a typical Single-layer Neural Network (SLNN). ...
Article
Pulse repetition interval modulation (PRIM) is integral to radar identification in modern electronic support measure (ESM) and electronic intelligence (ELINT) systems. Various distortions, including missing pulses, spurious pulses, unintended jitters, and noise from radar antenna scans, often hinder the accurate recognition of PRIM. This research introduces a novel three-stage approach for PRIM recognition, emphasizing the innovative use of PRI sound. A transfer learning-aided deep convolutional neural network (DCNN) is initially used for feature extraction. This is followed by an extreme learning machine (ELM) for real-time PRIM classification. Finally, a gray wolf optimizer (GWO) refines the network’s robustness. To evaluate the proposed method, we develop a real experimental dataset consisting of the sounds of six common PRI patterns. We utilized eight pre-trained DCNN architectures for evaluation, with VGG16 and ResNet50V2 notably achieving recognition accuracies of 97.53% and 96.92%, respectively. Integrating ELM and GWO further improved these accuracy rates to 98.80% and 97.58%. This research advances radar identification by offering an enhanced method for PRIM recognition, emphasizing the potential of PRI sound to address real-world distortions in ESM and ELINT systems.
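The ELM training procedure described in the excerpts above (random input weights and biases, followed by a computed set of output weights) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any cited work: the function names, the sigmoid activation, the hidden-layer size and the toy two-blob data are all assumptions made for the example, and the output weights are obtained with the Moore-Penrose pseudo-inverse as in the basic ELM formulation.

```python
import numpy as np


def hidden_output(X, W, b):
    """Sigmoid activations of the hidden layer for inputs X."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def train_elm(X, T, n_hidden, rng):
    """Fit a single-hidden-layer ELM.

    The input weights W and biases b are drawn at random and never trained;
    only the output weights beta are computed, by least squares.
    """
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = hidden_output(X, W, b)
    beta = np.linalg.pinv(H) @ T  # Moore-Penrose pseudo-inverse solution
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Return the raw output-layer scores for inputs X."""
    return hidden_output(X, W, b) @ beta


rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated Gaussian blobs, 50 points each.
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
labels = np.repeat([0, 1], 50)
T = np.eye(2)[labels]  # one-hot targets

W, b, beta = train_elm(X, T, n_hidden=20, rng=rng)
pred = predict_elm(X, W, b, beta).argmax(axis=1)
train_accuracy = (pred == labels).mean()
print("training accuracy:", train_accuracy)
```

Since the hidden layer is fixed after random initialization, training reduces to one linear least-squares solve, which is where the speed advantage mentioned in the excerpts comes from. Optimizers such as GA or GWO are typically wrapped around this step to search for better values of `W` and `b`.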
... There are critical studies in the field of brain-inspired intelligence to realize high-level intelligence, high accuracy, high robustness, and low power consumption in comparison with state-of-the-art artificial intelligence works [9][10][11][12]. Over the past few decades, various domains such as COVID-19 detection [13,14], emotion speech recognition [15], voice pathology detection [16][17][18], speaker gender identification [19], machine translation [20], language identification [21][22][23], and diabetic retinopathy detection [24], have demonstrated the effectiveness of DM (Data Mining) and ML (Machine Learning) techniques. Consequently, there has been a notable surge in efforts to utilise DM and ML algorithms for BC diagnosis [25,26]. ...
... The most favourable statistical results are highlighted in bold font in these tables. Equations (13)–(15) [48] were employed to calculate the Mean, RMSE, and STD. ...
Article
Recently, significant attention has been given to using machine learning (ML) and data mining algorithms for diagnosing breast cancer (BC). However, many of these efforts still require improvement: they lack proper statistical evaluation, use inadequate assessment metrics, or both. In this context, the extreme learning machine (ELM) algorithm, known for its effectiveness, emerges as a promising approach for data classification. Therefore, this study proposes the ELM algorithm and statistically evaluates its performance in diagnosing BC. The ELM algorithm offers several advantages: it can eliminate overfitting, address binary and multi-class classification issues, and exhibit performance comparable to a kernel-based support vector machine with a neural network structure. To evaluate the ELM algorithm’s performance, two BC datasets, namely the Wisconsin breast cancer database (WBCD) and the Wisconsin diagnostic breast cancer (WDBC) dataset, were used. The experimental results demonstrated the excellent performance of the ELM algorithm. Using the WBCD dataset, the ELM algorithm achieved an average accuracy of 92.06%, precision of 80.25%, recall of 96.60%, F-measure of 87.56%, G-mean of 87.99%, MCC of 82.67%, and specificity of 90.27%. Similarly, using the WDBC dataset, it achieved an average accuracy of 94.52%, precision of 92.28%, recall of 93.09%, F-measure of 92.66%, G-mean of 92.67%, MCC of 88.32%, and specificity of 95.42%. These results highlight the ELM algorithm’s reliability as a classifier for diagnosing BC and its potential for addressing healthcare-related problems in other applications.
... e) The sklearn.metrics module in the Scikit-learn library serves as a powerful tool for evaluating machine learning models. It caters to classification and regression tasks, providing an accurate understanding of model performance and guiding informed optimization decisions. Frequently used metrics in sklearn.metrics include [34][35][36][37][38]. In addition to these basic metrics, sklearn.metrics provides specialized metrics such as the confusion matrix, a tabular representation of the true and predicted labels that helps in visualizing the performance of the classification model. ...
... Decision tree algorithm [37] Entropy, as a measure of randomness, is a guide for decision tree algorithms to evaluate the extent of information inclusion. The entropy calculation includes parts of events at child nodes and aims to maintain simplicity by constructing a tree with uniform branches at each level. ...
Article
The motivation behind this study stems from identifying contemporary challenges associated with prosecuting electronic financial crimes. It highlights ongoing efforts to identify and address credit card fraud, which remains a widespread problem in the financial industry. Traditional methods are no longer able to keep up with modern methods of tracking the behavior of credit card users and detecting suspicious cases. Artificial intelligence technology offers promising solutions to quickly detect and prevent future fraud by credit card users. Datasets used to detect financial anomalies are affected by imbalances in financial transactions, and this study aims to address the imbalance of financial fraud datasets using adversarial algorithm techniques and compare them with the most commonly used methods in the scientific literature. The results showed that the function of the adversarial algorithm is consistent in several ways, including allowing researchers and interested parties to determine data growth rates, which helps bring the dataset closer to real-time data from financial markets and banks. This study proposes a hybrid machine learning model consisting of three machine learning algorithms (decision trees, logistic regression, and Naive Bayes) and calculates performance metrics such as accuracy, specificity, precision, and F1 score. Experimental results reveal varying degrees of accuracy in fraud detection. Model testing using the SMOTE method recorded an accuracy of 98.1% and an F-score of 98.3%. On the other hand, the oversampling and undersampling test methods showed similar performance, recording accuracies of 94.3% and 95.3% and F-scores of 94.7% and 95.1%, respectively. Finally, the GAN method excelled, receiving a test score and accuracy of 99.9%, as well as exceptional precision, recall, and F1 score. 
As a result, we conclude that the GAN method is able to balance the data set, which in turn is reflected in the performance of the model in training and the accuracy of predictions when tested. Historical transaction analysis identifies behavioral patterns and adapts to evolving fraud techniques. This approach enhances transaction security and protects against potential financial losses due to fraud. This contribution allows financial institutions and companies to proactively combat fraudulent activities.
... During the last decades, the productivity, effectiveness, and efficiency of the ML and DM approaches have been demonstrated in many domains such as language identification [9][10][11], COVID-19 detection [12,13], emotion speech recognition [14], voice pathology detection [15][16][17], and diabetic retinopathy detection [18]. Therefore, enormous efforts have recently been made to use ML and DM approaches to diagnose BC [19][20][21]. ...
... Each run's outcomes were evaluated based on several assessment metrics, including Precision (Pre), Accuracy (Acc), F-Measure (F-M), Recall (Rec), G-Mean (G-M), MCC, and Specificity (Spe). The mathematical computation of these evaluation metrics is given in Eqs. (11)–(17) [50][51][52][53][54]. ...
Article
The utilisation of DM (Data Mining) and ML (Machine Learning) approaches in BC (Breast Cancer) diagnosis has recently gained a lot of attention. However, most of these works still need enhancement, since they were assessed using insufficient evaluation metrics, were not statistically assessed, or both. OSELM (Online Sequential Extreme Learning Machine) is one of the most effective and well-known ML approaches and is regarded as an efficient and reputable technique for classifying data; however, it has not been applied to the BC diagnosis problem. Consequently, this research proposes the OSELM approach in order to enhance the accuracy rate of BC diagnosis. The OSELM technique (a) can be applied to both multi-class and binary classification, (b) prevents overfitting, and (c) has an ability comparable to kernel-based SVM (Support Vector Machine) while operating with a neural-network structure. In this research, two different BC datasets (WDBC (Wisconsin Diagnostic Breast Cancer) and WBCD (Wisconsin Breast Cancer Database)) were utilised to evaluate the performance of the OSELM approach. The experimental outcomes revealed the outstanding performance of the proposed OSELM approach, which attained an average precision of 94.09%, recall of 95.57%, accuracy of 96.13%, G-Mean of 94.82%, F-Measure of 94.80%, specificity of 96.51%, and MCC of 91.76% using the WDBC dataset. It also attained an average precision of 95.08%, recall of 98.89%, accuracy of 97.89%, G-Mean of 96.96%, F-Measure of 96.93%, specificity of 97.41%, and MCC of 95.39% using the WBCD dataset. This indicates that the OSELM approach is a reliable technique for BC diagnosis and might be suitable for solving related problems in other healthcare applications. Besides, it can serve as a valuable decision-support tool for oncologists, providing additional information and insights to aid in their diagnoses and treatment plans.
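Two of the metrics reported above, specificity and the Matthews correlation coefficient (MCC), are less familiar than accuracy or precision. The sketch below shows their standard definitions in terms of confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from any of the cited studies.

```python
import math


def specificity(tn, fp):
    """True-negative rate: the share of actual negatives classified as negative."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal total is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical confusion counts, chosen only for illustration.
tp, tn, fp, fn = 50, 40, 5, 5
print("specificity:", specificity(tn, fp))
print("MCC:", mcc(tp, tn, fp, fn))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix symmetrically, so it remains informative when the classes are imbalanced.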
... The current study has considered performance metrics namely accuracy, recall, F1-score and precision for assessing the performance of the proposed system. The corresponding mathematical equations are presented in this section [39][40][41][42]. ...
Article
In the fast-paced technological era, online financial transactions have gained widespread use, as they offer customers significant benefits for easily transferring money through smartphones. Nevertheless, fraudulent transactions put individuals' money at risk, so suitable approaches are required to detect such deceits. Concurrently, with the progress of ML (Machine Learning) approaches, existing works have attempted to distinguish fraudulent from normal transactions. However, these studies fell short in accuracy, and only limited attention has been given to detecting generalized fraudulent transactions. Considering this, the current study uses an IoT fraud dataset and proposes DEGA (Deep Emended Genetic Algorithm) to attain better performance in detecting fraudulent and normal transactions. This model employs a competitive approach that integrates new crossover and selection methods. These are intended to improve the global search ability and to partition the chromosomes into losers and winners, which ensures high-quality parents for selection. Besides, a dynamic mutation function is also proposed to enhance the model’s search ability. Subsequently, the study proposes EL-UXGB (Efficient Loss-Updated eXtreme Gradient Boosting), wherein dual sigmoid loss functions are proposed to resolve imbalanced label cases. The overall performance of this study is assessed through analysis that confirms its effectiveness in detecting fraudulent transactions.
... This model aims to reduce complexity and enhance classifier accuracy. Abbas et al. [29] proposed an Optimized Genetic Algorithm Extreme Learning Machine for SER using the Berlin Emotional Speech dataset. They present four different scenarios (subject dependent, subject independent, gender dependent female, and gender dependent male) with impressive accuracies. ...
Article
Recognizing speech emotions is indeed a crucial aspect of human–computer interaction. However, developing a model that can accurately process multiple languages is a challenging task. The feature selection process plays a vital role in multilingual speech emotion recognition because it helps to remove irrelevant features from each language, ultimately enhancing the performance of the model. This research aims to address this task in a more precise way. It achieves this by employing Grid Search based Principal Component Analysis and an ensemble voting classifier for multilingual speech emotion recognition. Here we describe three essential steps for recognizing emotion from a multilingual dataset. The first step involves extracting features from speech signals, such as MFCC, root-mean-square, ZCR, flux, roll-off, centroid, bandwidth, chroma, and fundamental frequency. The second step entails the selection of an essential feature subset by removing redundant and unnecessary features using Principal Component Analysis. We also utilize the Grid Search technique to determine the feature subset that would yield the highest accuracy. The third step applies SVM and Random Forest, which are widely recognized classifiers. Additionally, we propose an ensemble voting classifier. Our study compares the performance of these classifiers on three distinct corpora (RAVDESS, EMOVO, and SUBESCO) with and without the feature selection strategy. The accuracies for the RAVDESS, EMOVO, and SUBESCO datasets are 74.30%, 79.66%, and 87.64%, respectively. After comparing our proposed approach with other approaches mentioned in the literature survey, it became evident that our approach outperforms the rest.
Article
The early screening of depression is highly beneficial for patients to obtain better diagnosis and treatment. While the effectiveness of utilizing voice data for depression detection has been demonstrated, the issue of insufficient dataset size remains unresolved. Therefore, we propose an artificial intelligence method to effectively identify depression. The wav2vec 2.0 voice-based pre-training model was used as a feature extractor to automatically extract high-quality voice features from raw audio. Additionally, a small fine-tuning network was used as a classification model to output depression classification results. Subsequently, the proposed model was fine-tuned on the DAIC-WOZ dataset and achieved excellent classification results. Notably, the model demonstrated outstanding performance in binary classification, attaining an accuracy of 0.9649 and an RMSE of 0.1875 on the test set. Similarly, impressive results were obtained in multi-classification, with an accuracy of 0.9481 and an RMSE of 0.3810. The wav2vec 2.0 model was first used for depression recognition and showed strong generalization ability. The method is simple, practical, and applicable, which can assist doctors in the early screening of depression.
Article
Emotion recognition systems from speech signals are realized with the help of acoustic or spectral features. Acoustic analysis is the extraction of digital features from speech files using digital signal processing methods. Another method is the analysis of time-frequency images of speech using image processing. The size of the features obtained by acoustic analysis is in the thousands. Therefore, classification complexity increases and causes variation in classification accuracy. In feature selection, features unrelated to emotions are extracted from the feature space and are expected to contribute to the classifier performance. Traditional feature selection methods are mostly based on statistical analysis. Another feature selection method is the use of metaheuristic algorithms to detect and remove irrelevant features from the feature set. In this study, we compare the performance of metaheuristic feature selection algorithms for speech emotion recognition. For this purpose, a comparative analysis was performed on four different datasets, eight metaheuristics and three different classifiers. The results of the analysis show that the classification accuracy increases when the feature size is reduced. For all datasets, the highest accuracy was achieved with the support vector machine. The highest accuracy for the EMO-DB, EMOVA, eNTERFACE’05 and SAVEE datasets is 88.1%, 73.8%, 73.3% and 75.7%, respectively.
Article
Full-text available
Coronavirus disease (COVID-19) is an ongoing global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Chest Computed Tomography (CT) is an effective method for detecting lung illnesses, including COVID-19. However, the CT scan is expensive and time-consuming. Therefore, this work focuses on detecting COVID-19 using chest X-ray images, because X-ray imaging is widely available, faster, and cheaper than a CT scan. Many machine learning approaches, such as deep learning, neural networks, and support vector machines, have used X-ray images to detect COVID-19. Although the performance of those approaches is acceptable in terms of accuracy, they require high computational time and more memory space. Therefore, this work employs an Optimised Genetic Algorithm-Extreme Learning Machine (OGA-ELM) with three selection criteria (i.e., random, K-tournament, and roulette wheel) to detect COVID-19 using X-ray images. The most crucial strengths of the Extreme Learning Machine (ELM) are: (i) its high capability of avoiding overfitting; (ii) its usability for binary and multi-class classification; and (iii) its ability to work as a kernel-based support vector machine with the structure of a neural network. These advantages make the ELM efficient in achieving excellent learning performance. ELMs have been applied successfully in many domains, including medical domains such as breast cancer detection, pathological brain detection, and ductal carcinoma in situ detection, but have not yet been tested on detecting COVID-19. Hence, this work aims to identify the effectiveness of employing OGA-ELM in detecting COVID-19 using chest X-ray images. To reduce the dimensionality of the histogram of oriented gradients features, we use principal component analysis. The performance of OGA-ELM is evaluated on a benchmark dataset containing 188 chest X-ray images with two classes: healthy and COVID-19 infected. The experimental results show that the OGA-ELM achieves 100.00% accuracy with fast computation time. This demonstrates that OGA-ELM is an efficient method for detecting COVID-19 using chest X-ray images.
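The three selection criteria named in the abstract above (random, K-tournament, and roulette wheel) are standard genetic algorithm operators. The following is a minimal Python sketch of their textbook forms, not the authors' OGA-ELM code. The function names, the default `k=3`, and the assumption that fitness values are non-negative with a positive total are illustrative choices, not details taken from the paper.

```python
import random


def random_selection(population, fitnesses, rng):
    """Pick a parent uniformly at random, ignoring fitness."""
    return rng.choice(population)


def k_tournament_selection(population, fitnesses, rng, k=3):
    """Draw k distinct candidates and return the fittest of them."""
    candidates = rng.sample(range(len(population)), k)
    winner = max(candidates, key=lambda i: fitnesses[i])
    return population[winner]


def roulette_wheel_selection(population, fitnesses, rng):
    """Pick a parent with probability proportional to its fitness.

    Assumes non-negative fitness values with a positive sum.
    """
    spin = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if spin < running:
            return individual
    return population[-1]  # guard against floating-point round-off


if __name__ == "__main__":
    rng = random.Random(42)
    population = ["a", "b", "c", "d"]
    fitnesses = [0.1, 0.4, 0.3, 0.2]
    print(random_selection(population, fitnesses, rng))
    print(k_tournament_selection(population, fitnesses, rng))
    print(roulette_wheel_selection(population, fitnesses, rng))
```

Random selection applies no selection pressure, a tournament's pressure grows with `k`, and the roulette wheel's pressure depends on how spread out the fitness values are.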
Conference Paper
Recent research has shown that voice pathology detection systems can contribute effectively to the assessment of voice disorders and provide early detection of voice pathologies. These systems use machine learning techniques, which are considered very promising tools for detecting voice pathologies. However, most of the proposed systems for voice disorder detection used a limited database. Furthermore, a low accuracy rate is still one of the most challenging issues for these techniques. This paper presents a voice pathology detection system using the Online Sequential Extreme Learning Machine (OSELM) to classify the voice signal as healthy or pathological. In this work, the voice features are extracted using Mel-Frequency Cepstral Coefficients (MFCC). The voice samples for the vowel /a/ were collected equally from the Saarbrücken voice database (SVD). The proposed method is evaluated with three widely used measurements: accuracy, sensitivity and specificity. The obtained results show that the maximum accuracy, sensitivity and specificity are 85%, 87% and 87%, respectively. According to the experimental results, the OSELM algorithm is able to differentiate healthy and pathological voices effectively.
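The three measurements named in the abstract above have standard definitions in terms of confusion-matrix counts. The sketch below uses those definitions, assuming "pathological" is the positive class. The function name, label encoding, and sample labels are invented for illustration and are not data from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels.

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),  # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),       # share of positives that were detected
        "specificity": tn / (tn + fp),       # share of negatives that were detected
    }


if __name__ == "__main__":
    # Hypothetical labels: 1 = pathological, 0 = healthy.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(y_true, y_pred))
```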
Article
The metaheuristic genetic algorithm (GA) is based on the natural selection process and falls under the umbrella category of evolutionary algorithms (EA). Genetic algorithms are typically used to generate high-quality solutions to search and optimization problems by relying on bio-inspired operators such as selection, crossover, and mutation. However, the GA still has some drawbacks and needs improvement to attain greater control of exploitation and exploration when creating a new population, and to handle the randomness involved in the population at solution initialization. Furthermore, the mutation is imposed upon the new chromosomes and hence prevents the achievement of an optimal solution. Therefore, this study presents a new GA that is centered on the natural selection theory and aims to improve the control of exploitation and exploration. The proposed algorithm is called genetic algorithm based on natural selection theory (GABONST). Two assessments of the GABONST are carried out: (i) applying fifteen renowned benchmark test functions and comparing the results with the conventional GA, enhanced ameliorated teaching learning-based optimization (EATLBO), Bat and Bee algorithms; and (ii) applying the GABONST to language identification (LID) by integrating it with the extreme learning machine (ELM), named GABONST-ELM. The ELM is considered one of the most useful learning models for classification and regression analysis. The results are generated on the LID dataset, which is derived from eight separate languages. In terms of the statistical assessment, the GABONST algorithm is capable of producing good-quality solutions and has better control of exploitation and exploration than the conventional GA, EATLBO, Bat, and Bee algorithms. Additionally, the obtained results indicate that (GABONST-ELM)-LID performs effectively, with accuracy reaching up to 99.38%.
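For orientation, the conventional GA loop that the abstract above uses as its baseline (selection, crossover, mutation) can be sketched as follows. This is a generic textbook GA on a toy bit-counting objective, not GABONST. The objective, the elitism step, and every parameter value are assumptions made for illustration.

```python
import random


def one_max(bits):
    """Toy fitness: the number of 1-bits in the chromosome."""
    return sum(bits)


def tournament(population, fitness, rng, k=3):
    """Selection: the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover producing one child."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]


def run_ga(fitness, n_bits=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = [best]  # elitism: carry the current best over unchanged
        while len(children) < pop_size:
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            child = crossover(parent_a, parent_b, rng)
            children.append(mutate(child, mutation_rate, rng))
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = run_ga(one_max)
    print("best fitness per generation:", history)
```

Because the best individual is copied into each new generation unchanged, the best fitness in `history` never decreases. A plain GA without elitism does not give that guarantee.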
Article
Incorporating cloud technology with the Internet of Things (IoT) is significant for obtaining better performance in a seamless, continuous, and ubiquitous framework. IoT has many applications in the healthcare sector, one of which is voice pathology monitoring. Unfortunately, voice pathology has not gained much attention, although there is an urgent need in this area due to the shortage of research on and diagnosis of lethal diseases. Most researchers working on voice pathology only differentiate whether a voice is normal (healthy) or pathological, and current studies on detecting a specific disease such as laryngeal cancer are lacking. In this paper, we present an extensive review of the state-of-the-art techniques and studies of IoT frameworks and machine learning algorithms used in healthcare in general and in voice pathology surveillance systems in particular. Furthermore, this paper presents applications, challenges and key issues of both IoT and machine learning algorithms in healthcare. Finally, this paper highlights some open issues of IoT in healthcare that warrant further research and investigation in order to provide easy, comfortable and effective diagnosis and treatment of disease for both patients and doctors.
Article
The determination and classification of natural language based on specified content and data set involves a process known as spoken language identification (LID). To initiate the process, useful features of the given data need to be extracted first in a mature process where the standard LID features have been previously developed by employing the use of MFCC, SDC, GMM and the i-vector-based framework. Nevertheless, optimisation of the learning process is still required to enable a comprehensive capturing of the extracted features’ embedded knowledge. The training of a single hidden layer neural network can be done using the extreme learning machine (ELM), which is an effective learning model for conducting classification and regression analysis. Nevertheless, the learning process of this model is not entirely effective (i.e. optimised) due to the random selection of weights within the input hidden layer. This study employs ELM as the LID learning model centred upon the extraction of the standard features. The enhanced self-adjusting extreme learning machine (ESA–ELM) is one of the ELM’s optimisation techniques which has been chosen as the benchmark and is enhanced by adopting a new alternative optimisation approach (PSO) instead of (EATLBO) in terms of achieving high performance. The improved ESA–ELM is named particle swarm optimisation–extreme learning machine (PSO–ELM). The generated results are based on LID with the same benchmarked data set derived from eight languages, which indicated the superior performance of the particle swarm optimisation–extreme learning machine LID (PSO–ELM LID) with an accuracy of 98.75% in comparison with the ESA–ELM LID which only achieved 96.25%.
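The abstract above names particle swarm optimisation (PSO) as the optimiser that replaces EATLBO. A minimal sketch of the standard global-best PSO update is given below for readers unfamiliar with it. It minimises a toy sphere function rather than tuning ELM weights, and the inertia and acceleration coefficients are common illustrative values, not settings reported in the paper.

```python
import random


def sphere(x):
    """Toy objective to minimise: sum of squares, with its optimum at the origin."""
    return sum(v * v for v in x)


def pso(objective, dim=3, n_particles=15, iterations=60,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm
    history = [gbest_val]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            value = objective(pos[i])
            if value < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], value
                if value < gbest_val:
                    gbest, gbest_val = pos[i][:], value
        history.append(gbest_val)
    return gbest, gbest_val, history


if __name__ == "__main__":
    best_position, best_value, _ = pso(sphere)
    print(best_position, best_value)
```

In a PSO-ELM style setup, each particle's position would presumably encode a candidate set of ELM input weights and biases, with validation error as the objective. That mapping is an assumption here, since the abstract does not spell it out.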
Article
Over the past few decades, various evolutionary algorithms (EAs) have been applied to the optimization design of water distribution systems (WDSs). An important research area is to compare the performance of these EAs, thereby offering guidance for the selection of the appropriate EAs for practical implementations. Such comparisons are mainly based on the final solution statistics and, hence, are unable to provide knowledge on how different EAs reach the final optimal solutions and why different EAs perform differently in identifying optimal solutions. To this end, this paper aims to compare the real-time searching behaviour of three widely used EAs: genetic algorithms (GAs), the differential evolution (DE) algorithm and ant colony optimization (ACO). These three EAs are applied to five WDS benchmarking case studies with different scales and complexities, and a set of five metrics is used to measure their run-time searching quality and convergence properties. Results show that the run-time metrics can effectively reveal the underlying searching mechanisms associated with each EA, which goes significantly beyond the knowledge gained from the traditional end-of-run solution statistics. It is observed that the DE is able to identify better solutions if moderate and large computational budgets are allowed, owing to its great ability to maintain the balance between exploration and exploitation. However, if the computational resources are rather limited or the decision has to be made in a very short time (e.g., real-time WDS operation), the GA can be a good choice, as it can always identify better solutions than the DE and ACO at the early searching stages. Based on the results, the ACO performs the worst for the five case studies considered. The outcome of this study is guidance for algorithm selection based on the available computational resources, as well as knowledge of the EAs' underlying searching behaviours.
Article
It is clear that the learning speed of feedforward neural networks is in general far slower than required and it has been a major bottleneck in their applications for past decades. Two key reasons behind may be: (1) the slow gradient-based learning algorithms are extensively used to train neural networks, and (2) all the parameters of the networks are tuned iteratively by using such learning algorithms. Unlike these conventional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden layer feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed. The experimental results based on a few artificial and real benchmark function approximation and classification problems including very large complex applications show that the new algorithm can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feedforward neural networks.
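The two steps described in the abstract above (random hidden nodes, then analytically determined output weights) can be sketched in a few lines of standard-library Python. ELM is usually described with a Moore-Penrose pseudoinverse. To stay dependency-free, this sketch instead solves the ridge-regularised normal equations, which is a substitution made here and labelled as such. The sigmoid activation, the weight range, the `ridge` value, and the toy OR data are likewise illustrative assumptions.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b (a square, b a matrix) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hidden_layer(x_rows, weights, biases):
    """Sigmoid activations: one row per sample, one column per hidden node."""
    return [[1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(w_j, x_row)) + b_j)))
             for w_j, b_j in zip(weights, biases)]
            for x_row in x_rows]


def train_elm(x_rows, t_rows, n_hidden, ridge=1e-6, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(x_rows[0]), len(t_rows[0])
    # Step 1: hidden-node weights and biases are drawn at random and never tuned.
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    h = hidden_layer(x_rows, weights, biases)
    # Step 2: output weights beta solve (H^T H + ridge * I) beta = H^T T in one shot.
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    htt = [[sum(row[i] * t[k] for row, t in zip(h, t_rows)) for k in range(n_out)]
           for i in range(n_hidden)]
    beta = solve_linear_system(hth, htt)
    return weights, biases, beta


def predict_elm(model, x_rows):
    weights, biases, beta = model
    h = hidden_layer(x_rows, weights, biases)
    return [[sum(h_i * beta[i][k] for i, h_i in enumerate(h_row))
             for k in range(len(beta[0]))] for h_row in h]


if __name__ == "__main__":
    # Toy data: the logical OR of two binary inputs.
    x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    t = [[0.0], [1.0], [1.0], [1.0]]
    model = train_elm(x, t, n_hidden=8)
    print([round(row[0], 3) for row in predict_elm(model, x)])
```

Training involves no iterative weight updates. The only fitted parameters are the output weights, obtained from a single linear solve, which is the source of the speed advantage the abstract claims over gradient-based training.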