
Speech emotion recognition using optimized genetic algorithm-extreme learning machine

July 2022
Multimedia Tools and Applications 81(11):1-27


DOI:10.1007/s11042-022-12747-w

Authors:

Musatafa Albadr

Basrah University for Oil and Gas, Al-Basrah, Iraq

Sabrina Tiun

Universiti Kebangsaan Malaysia

Masri Ayob

Universiti Kebangsaan Malaysia

Fahad Taha Al-Dhief

Universiti Teknologi Malaysia





Musatafa Abbas Abbood Albadr, Sabrina Tiun, Masri Ayob, Fahad Taha AL-Dhief, Khairuddin Omar, Mhd Khaled Maen

Received: 23 February 2021 / Revised: 17 May 2021 / Accepted: 21 February 2022

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract

Automatic Emotion Speech Recognition (ESR) is an active research field in the Human-Computer Interface (HCI). Typically, an ESR system consists of two main parts: the Front-End (feature extraction) and the Back-End (classification). However, most previous ESR systems have focused on the feature extraction part only and neglected the classification part, even though classification is essential: its role is to map the features extracted from audio samples to the corresponding emotion. Moreover, most ESR systems have been evaluated under the Subject Independent (SI) scenario only. Therefore, this paper focuses on the Back-End (classification), where we adopt our recently developed Extreme Learning Machine (ELM) variant, the Optimized Genetic Algorithm-Extreme Learning Machine (OGA-ELM). In addition, we use the Mel Frequency Cepstral Coefficients (MFCC) method to extract features from the speech utterances. This work demonstrates the significance of the classification part in ESR systems, where it improves ESR performance in terms of accuracy. The proposed model was evaluated on the Berlin Emotional Speech (BES) dataset, which consists of 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). Four evaluation scenarios were conducted: Subject Dependent (SD), SI, Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The OGA-ELM achieved its highest accuracies of 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. In addition, the proposed ESR system showed a fast execution time in all experiments.

Keywords: Emotion speech recognition · Optimized genetic algorithm-extreme learning machine · Mel frequency cepstral coefficients

https://doi.org/10.1007/s11042-022-12747-w

*Musatafa Abbas Abbood Albadr

mustafa_abbas1988@yahoo.com

Extended author information available on the last page of the article

Published online: 19 March 2022

Multimedia Tools and Applications (2022) 81:23963–23989

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

1 Introduction

Human beings communicate through speech, facial expressions, gestures, and body language. These forms of communication convey the emotional states and messages of the speakers [24]. In this regard, people have a natural ability to comprehend a speaker's emotions from the speech signal. A robust emotion recognition system aims to identify a person's emotional state automatically from the user's voice. The speech signal contains the linguistic information of the speaker as well as other information such as age, origin, gender, and emotional state [53]. Such systems have had numerous potential impacts on HCI [7,15,22]. Furthermore, automatic ESR systems have been applied in several real-time applications to analyze and detect emotions, such as detecting the emotions of callers in call centers [43], diagnosing mental disorders [37], and detecting Parkinson's and Alzheimer's diseases [39,62]. ESR systems are also used to assist in various other applications (e.g., educational and learning environments, lie detection systems, game software, and entertainment) [46].

In general, ESR systems are built in two main stages. The first stage is the front-end, which extracts the feature vectors from the samples of a speech utterance. The second stage is the back-end, which recognises the emotion based on certain sets of feature vectors, algorithms and models. Figure 1 depicts the general overview of the ESR system. Among the most common feature extraction methods used in the ESR field are Linear Predictive Coding (LPC), Cepstrum Coefficients derived from LPC (LPCC), MFCC, and Perceptual Linear Prediction (PLP) [12,40,41,52]. Of these methods, MFCC is the most popular feature extraction approach in speech applications generally and has been reported to have the highest identification accuracy [28,48,51,61]. The classification process, in turn, is an essential part of any ESR system; its role is to map the features extracted from audio samples to the corresponding emotion. Several classifiers are identified in the literature, for instance deep learning [10], Support Vector Machine (SVM) [33], and ELM [42].

Recently, ELMs have emerged as a modern framework for machine learning [6,20,25,47,54,56]. ELMs are a type of feed-forward neural network characterised by random initialisation of their hidden layer weights, combined with a fast training algorithm. The effectiveness (without blindness) of this random initialisation and quick training makes them very appealing for large-scale data analysis.

Fig. 1 General overview of the emotion speech recognition system


In recent decades, the ELM algorithm has gained great significance among machine learning algorithms [32]. This is because ELM has unique characteristics such as good generalization, classification capability, and extremely fast training. In addition, ELM is an efficient solution for Single-hidden Layer Feedforward Networks (SLFNs), where it has proved its efficiency and effectiveness in several applications. The ELM has thus obtained better and faster generalization than SVM and back propagation (BP)-based neural networks (NNs) [3,29,31,35]. Consequently, many researchers have used the ELM in ESR. For example, the authors of [58] presented a new particle swarm optimization assisted biogeography-based algorithm for feature selection, while the ELM classifier was used to distinguish the emotions. The simulations were conducted using the BES dataset. Different evaluation experiments were conducted: Subject Dependent (SD), Subject Independent (SI), Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The highest recognition accuracy was 90.31%, 99.47%, 98.94%, and 92.98% for SI, SD, GD-Female and GD-Male, respectively. Unfortunately, despite the merits of this work, the optimization of the ELM with respect to its random weights was ignored, which leads to non-optimal ELM classification performance.

Another attempt is the work of [27], where the authors proposed a dynamic framework that exploits the advantages of auditory-based empirical features and complementary spectrogram-based statistical features. A Kernel Extreme Learning Machine (KELM) was used to recognize emotions. To validate this framework, they conducted experiments on the BES dataset. The experimental results demonstrated that their proposed framework outperformed the existing state-of-the-art method and achieved an accuracy of 92.90%. However, this work ignored the optimization of the KELM in terms of the input weights of the hidden layer.

A further attempt was made by [49], where the authors proposed Hybrid Spectral Features (HSF), which combine the LPC, MFCC and PSD parameters. The ELM was used as the pattern classifier to recognize emotions. The evaluation experiments were conducted on an emotional speech dataset consisting of nine emotions: neutral, calm, sad, surprise, happy, anxiety, anger, fear, and boredom. In their experiments, the highest overall recognition accuracy was 82.22%. Despite the superiority of their proposed method over the benchmark, this work also ignored the optimization of the ELM in terms of the input weights of the hidden layer.

The methods in [38,44] address speech emotion recognition using the BES dataset. The method in [38] presented a new feature fusion (i.e., MFCC and Prosody) and two different feature selection approaches (i.e., SFS and SFFS). Four classifiers were used in this method: Linear Discriminant Analysis (LDA), Regularized Discriminant Analysis (RDA), SVM, and K-Nearest Neighbour (KNN). The experimental results showed that the RDA and SVM classifiers with SFFS features obtained the highest accuracy of 92.70%. The method in [44] proposed a new feature extraction approach called Acoustic Analysis Methods and Statistical Feature Selection (AAMSFS). It used three different classifiers: Multilayer Perceptron (MLP), SVM, and k-NN. Based on the experimental results, the proposed AAMSFS with the SVM classifier achieved an accuracy of up to 84.62%.


Another attempt was made by [36], where the authors adopted a cascaded normalization method. They combined nonlinear value level, linear speaker level, and feature vector level normalization in order to decrease speaker effects and to maximize class separability. Additionally, the ELM was applied to distinguish emotions. The evaluation experiments were conducted on part of the recently collected Turkish Emotional Speech (TES) dataset with four emotion classes: joy, neutral, sadness, and anger. The experimental results showed the superiority of the ELM over the SVM: the ELM obtained an overall accuracy of 79.00%, while the SVM obtained 77.30%. However, most ESR studies have ignored the optimization of the ELM in terms of the input weights of the hidden layer [2,3,6]. The drawback of the ELM is that it needs some technique for selecting the input-hidden layer weights. In other words, there is no method to ensure that the trained ELM is the most suitable one for the classification process. To solve this issue, an optimisation method should be combined with the ELM to determine the optimal weights that give the best classification performance. Table 1 summarizes the related works, including the strengths and weaknesses of each method.

Based on the studies above, the limitations of emotion speech recognition systems can be summarized as follows:

• Most previous studies of emotion speech recognition have focused on the feature extraction part and ignored the classification part.

• Few studies have evaluated their methods under different scenarios such as Subject Independent (SI), Subject Dependent (SD), and Gender Dependent Female (GD-Female). In other words, most systems are evaluated under the SI scenario only.

• The accuracy of speech-based emotion recognition systems is still not encouraging.

• Accuracy, recall and precision are the measures mostly used to evaluate emotion speech recognition systems, whereas other evaluation measures such as F-measure, G-mean, and execution time are ignored.

Based on the points above, this study uses MFCC features and a recent ELM optimization named OGA-ELM [4]. In [4], we proposed OGA-ELM for Language Identification (LID) using i-vector features, whereas in this work we apply OGA-ELM with MFCC features to emotion speech recognition. The main contributions of this work are as follows:

• Improve ESR and achieve higher accuracy.

• Evaluate the proposed method under four different scenarios: SD, SI, GD-Female, and GD-Male.

• Evaluate the performance of the OGA-ELM in ESR using several evaluation measures: accuracy, recall, precision, F-measure, G-mean, and execution time.

• Demonstrate the effectiveness of the classification part in the ESR application.

The rest of this study is organized as follows: Section 2 describes the proposed method; Section 3 presents and discusses the experimental results; and Section 4 gives the conclusions and future work.


Table 1 Summary of related work

| Ref | Dataset | Features | Classifier | Result | Strengths | Weaknesses |
|---|---|---|---|---|---|---|
| [58] | BES with 7 emotions | Higher Order Spectral (HOS) features and Particle Swarm Optimization assisted Biogeography-based Optimization (PSOBBO) for feature selection | ELM | Accuracy of 90.31% (SI), 99.47% (SD), 98.94% (GD-Female), and 92.98% (GD-Male) | The results showed that the proposed HOS-PSOBBO-ELM outperformed some previous studies. | 1. The optimization of the ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 2. The results need more improvement in terms of accuracy. |
| [27] | BES with 7 emotions | Deep Complementary Feature Extraction (DCF) | KELM | Accuracy of 92.90% (SI) | The proposed DCF-KELM outperformed the CNN-BLSTM. | (Shared by [27] and [49]) 1. These works were evaluated under only one scenario, SI. 2. The optimization of both KELM and ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 3. The results of these methods are still not encouraging and need more improvement. |
| [49] | BES with 7 emotions | Hybrid Spectral Features (HSF), which combine the LPC, MFCC and PSD parameters | ELM | Accuracy of 82.22% (SI) | The proposed HSF-ELM outperformed some previous studies. | Same as for [27]. |
| [38] | BES with 7 emotions | Feature fusion (MFCC and Prosody) and two different feature selection approaches (SFS and SFFS) | LDA, RDA, SVM and KNN | Accuracy of 92.70% (SI) | The experimental results showed that the RDA and SVM classifiers with SFFS features give the best emotion recognition rate. | (Shared by [38] and [44]) 1. These works were evaluated under only one scenario, SI. 2. The results of these methods are still not encouraging and need more improvement. |
| [44] | BES with 7 emotions | AAMSFS | MLP, SVM, and k-NN | Accuracy of 84.62% (SI) | The results showed that AAMSFS with SVM achieved the best accuracy rate. | Same as for [38]. |
| [36] | TES with 4 emotions | Combination of nonlinear value level, linear speaker level, and feature vector level normalization | ELM | Accuracy of 79.00% (SI) | The experimental results showed the superiority of the ELM over SVM. | 1. The work was evaluated under only one scenario, SI. 2. The work was evaluated on only 4 emotions. 3. The optimization of the ELM with respect to the random input-hidden layer weights and hidden layer biases was ignored. 4. The results are not encouraging and need more improvement. |


2 Method

The proposed ESR system using the OGA-ELM method is illustrated in Fig. 2. It consists of four phases that together form an ESR system based on the speech signal. The first phase is the speech dataset covering different human emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). The second phase is the pre-processing of the speech signal samples. In the third phase, the MFCC technique is used to extract the needed features from the utterances. Finally, in the fourth phase, the extracted features are fed into the OGA-ELM classifier to identify human emotions from the speech signal. The OGA-ELM is based on the Optimised Genetic Algorithm (OGA) [4], in which the Genetic Algorithm (GA) was chosen and optimised in order to improve the classification performance of the ELM. The GA was selected because it is one of the most popular optimization algorithms, mainly due to its ease of implementation and its support in many libraries [23,57]. Additionally, the GA has a good global search capability and is considered one of the essential technologies of modern intelligent computation [59]. The GA is also resource-friendly, as it finds better solutions faster than other optimization algorithms [13,26]. The four phases of the proposed ESR system are discussed in the following subsections.

2.1 Dataset

In this work, the BES (Berlin Emotional Speech) dataset [14] was selected for evaluation purposes. The BES dataset is a standard dataset that is frequently used by emotion classification researchers [18,27,45,58]. It contains 533 emotional speech utterances from 10 professional German actors (5 males and 5 females), covering 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). The actors were asked to express 10 sentences with these 7 emotions. The audio files range from 1 to 8 s in duration, but this study uses a fixed duration (see subsection 2.2). A detailed description of the BES dataset is provided in [14,17], and Table 2 summarises it. This study used 80% of the dataset for training and the remaining 20% for testing in all evaluation scenarios.

2.2 Pre-processing

This section describes the pre-processing used in this study. Since the BES dataset consists of utterances of different durations (the term utterance refers to the speech signal) in the range of 1–8 s, this study applied a two-step pre-processing. The first step reads the utterances from the (.wav) files. The second step fixes the duration of all utterances to 1 s, which is 19,608 samples. The output of the pre-processing step is the utterance vector (19,608 × 1) in samples, which is the input to the feature extraction step.
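The second pre-processing step can be illustrated with a minimal sketch. The text does not say which part of an utterance is kept, so the choices below (keep the first 19,608 samples, zero-pad shorter signals at the end) are assumptions, and `fix_length` is a hypothetical helper name rather than the authors' code. Reading the `.wav` file itself is omitted; the function takes an already loaded one-dimensional signal.

```python
import numpy as np

TARGET_LEN = 19608  # samples per utterance after pre-processing (value from the text)

def fix_length(signal, target_len=TARGET_LEN):
    """Return the utterance as a (target_len, 1) column vector.

    Assumption: longer utterances are truncated to their first target_len
    samples and shorter ones are zero-padded at the end.
    """
    signal = np.asarray(signal, dtype=np.float64).ravel()
    if signal.size >= target_len:
        fixed = signal[:target_len]
    else:
        fixed = np.pad(signal, (0, target_len - signal.size))
    return fixed.reshape(-1, 1)

# Example with a synthetic signal standing in for a loaded utterance.
utterance = fix_length(np.ones(30000))
print(utterance.shape)
```

Every utterance then has the same shape, which is what allows the later feature matrix to have a fixed number of frames.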

2.3 Features extraction

The MFCC [19,21] feature extraction for ESR in this study begins with segmentation, where the utterance vector (19,608 × 1) obtained from the pre-processing step is divided into 25 ms frames with a 10 ms frame shift. This is followed by the computation of thirteen MFCCs and the application of vocal tract length normalisation. Subsequently, cepstral mean and variance normalisation is performed together with RASTA filtering. Figure 3 illustrates the entire process of extracting the MFCC features. The steps are as follows:

Pre-emphasis: This is the first stage in MFCC feature extraction; its aim is to boost the amount of energy in the high frequencies.

Windowing: Windowing segments the utterance into frames.

Fast Fourier Transform (FFT): The FFT converts the time-domain signal into a frequency-domain signal, because the features of speech data exist in the frequency domain.

Magnitude: This step calculates the power spectrum of each frame.

Vocal Tract Length Normalisation (VTLN): VTLN compensates for the fact that speakers have vocal tracts of different sizes.

Mel-Filter Bank: The Mel-filter bank approximates how much energy occurs in each region of the spectrum.

Log: The goal of this step is to make sure that the high and low frequencies are separated, to simulate the human hearing system.

RASTA Filtering: In this step, the values of the first four frames of the array resulting from the previous step are set to zero, to avoid a significant initial spike arising from the 'dc' offset level in each band. Each row of the remaining array is then band-pass filtered using a filter with a sharp spectral zero at zero frequency, which suppresses any constant or slowly varying component in each row.

Fig. 2 Dialogue of the proposed ESR system (stages shown: speech signal utterance with different duration; pre-processing in a first and a second step, output (19608 × 1); feature extraction, MFCC with 13 cepstral coefficients (13 × 42), reshaped to a one-row vector (1 × 546); classification with OGA-ELM; identified emotion, e.g., happiness, disgust, anxiety)

Table 2 Description of the BES dataset [14]

| Emotion | Anxiety | Disgust | Happiness | Boredom | Neutral | Sadness | Anger |
|---|---|---|---|---|---|---|---|
| Number of all utterances | 68 | 46 | 71 | 81 | 79 | 62 | 126 |
| Number of female utterances | 37 | 36 | 51 | 51 | 51 | 44 | 80 |
| Number of male utterances | 31 | 10 | 20 | 30 | 28 | 18 | 46 |
| Emotion label | 1 | 2 | 3 | 4 | 5 | 6 | 7 |


Discrete Cosine Transform (DCT): In the DCT step, the log Mel spectrum is converted back to the time domain.

Cepstral Mean and Variance Normalisation (CMVN): The purpose of CMVN is to reduce the effects of convolutional channel distortion, noise, and speaker variation by forcing all utterances to have zero mean and unit variance.
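As an illustration of the last step in this list, the sketch below applies cepstral mean and variance normalisation to a (13 × 42) array. It assumes the statistics are computed per cepstral coefficient over the frames of one utterance, which is a common convention but is not spelled out in the text; the random array only stands in for real MFCCs, and `cmvn` is a hypothetical helper name.

```python
import numpy as np

def cmvn(mfcc, eps=1e-8):
    """Normalise each cepstral coefficient (row) to zero mean and unit variance.

    Assumption: statistics are taken over the frames (columns) of one utterance.
    eps guards against division by zero for a constant row.
    """
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True)
    return (mfcc - mean) / (std + eps)

rng = np.random.default_rng(0)
mfcc = rng.normal(loc=3.0, scale=2.0, size=(13, 42))  # stand-in for real MFCCs
normalised = cmvn(mfcc)
print(normalised.shape)
```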

The output of the MFCC is an array of size (13 × number of frames) for each utterance, i.e., (13 × 42). The MFCC features (13 × 42) of each utterance are then reshaped into a one-row vector (1 × 546), which is the input to the classification step. Table 3 lists the values of the MFCC variables used in this study. Because the frame size and frame shift in samples depend on the sampling rate, this study set the sampling rate to 44,100 Hz instead of 16 kHz in order to increase the frame size in samples and decrease the number of frames [11,34]. The frame size and frame shift in samples are calculated as shown in Eqs. (1) and (2), and the number of frames is calculated as shown in Eq. (3).

$$N_w = 10^{-3}\, T_w \times \text{Sampling rate} \tag{1}$$

$N_w$: frame size in samples.

$T_w$: frame size (25 ms).

Sampling rate: 44,100 Hz.

The frame shift in samples is:

$$N_s = 10^{-3}\, T_s \times \text{Sampling rate} \tag{2}$$

$N_s$: frame shift size in samples.

$T_s$: frame shift size (10 ms).

The number of frames in each utterance is given by Eq. (3):

$$\text{number of frames} = \frac{\text{length of utterance in samples} - N_w}{N_s} + 1 \tag{3}$$

With length of utterance in samples = 19,608, $N_w$ = 1103 and $N_s$ = 441, the number of frames = 42. As a result, there will be an array of size ($N_w$ × number of frames), i.e., (1103 × 42).
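The arithmetic of Eqs. (1) to (3) can be checked with a short sketch. Two details are assumptions made here to reproduce the reported values: the frame size is rounded half up (25 ms at 44,100 Hz is 1102.5 samples) and only whole frames are counted in Eq. (3). The function names are illustrative.

```python
import math

def samples_from_ms(duration_ms, sampling_rate):
    """Eqs. (1) and (2): duration in ms to samples, rounded half up (assumption)."""
    return math.floor(duration_ms * sampling_rate / 1000 + 0.5)

def number_of_frames(n_samples, n_w, n_s):
    """Eq. (3), counting only whole frames (assumption: the quotient is floored)."""
    return (n_samples - n_w) // n_s + 1

n_w = samples_from_ms(25, 44100)            # frame size in samples
n_s = samples_from_ms(10, 44100)            # frame shift in samples
frames = number_of_frames(19608, n_w, n_s)
feature_length = 13 * frames                # length of the reshaped MFCC vector
print(n_w, n_s, frames, feature_length)     # 1103 441 42 546
```

The last value matches the (1 × 546) one-row vector obtained by reshaping the (13 × 42) MFCC array.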

2.4 Classification

2.4.1 Review of ELM

The basic ELM algorithm for training SLFNs was proposed by [30]. The main idea behind ELM is that the hidden layer weights and biases are generated randomly. The output weights are then calculated using the least-squares solution defined by the outputs of the hidden layer and the targets. An overview of the ELM structure and the training algorithm is shown in Fig. 4. A brief description of the ELM follows.

Given a set of N distinct samples $(X_j, t_j)$, where $X_j = [x_{j1}, x_{j2}, \dots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \dots, t_{jm}]^T \in R^m$, the SLFN is mathematically modelled as in Eq. (4).


Fig. 3 Block diagram of the process of extracting MFCC features (stages shown: utterance in samples, pre-emphasis filter, windowing, fast Fourier transform (FFT), magnitude, vocal tract length normalization (VTLN), Mel-filter bank, log, RASTA filtering, discrete cosine transform, cepstral mean and variance normalization, MFCC features; the array size goes from (19608 × 1) at the input to (13 × 42) at the output)

Table 3 Values of the MFCC variables used in this study

| Variable | Value |
|---|---|
| Sampling rate | 44,100 Hz |
| Utterance duration before pre-processing | In the range of 1–8 s |
| Utterance duration after pre-processing | 1 s (19,608 samples) |
| Frame size in time | 25 ms |
| Frame shift in time | 10 ms |
| Frame size in samples (Nw) | 1103 |
| Frame shift in samples (Ns) | 441 |
| Number of frames of one utterance | 42 |
| MFCC features for one utterance before reshape | (13 × 42) |
| MFCC features for one utterance after reshape | (1 × 546) |


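$$\sum_{i=1}^{L} \beta_i\, g_i(X_j) = \sum_{i=1}^{L} \beta_i\, g(W_i \cdot X_j + b_i) = o_j, \quad j = 1, \dots, N \tag{4}$$

where:

$W_i = [W_{i1}, W_{i2}, \dots, W_{in}]^T$ = weight vector connecting the ith hidden node and the input nodes;

$\beta_i = [\beta_{i1}, \beta_{i2}, \dots, \beta_{im}]^T$ = weight vector connecting the ith hidden node and the output nodes;

$b_i$ = threshold of the ith hidden node;

$W_i \cdot X_j$ = inner product of $W_i$ and $X_j$; the output nodes are chosen to be linear;

$L$ = number of hidden layer nodes. A standard SLFN with $L$ hidden nodes and activation function $g(x)$ can approximate the N samples with zero error.

Thus, $\sum_{j=1}^{N} \| o_j - t_j \| = 0$, that is, there exist $\beta_i$, $W_i$ and $b_i$ such that Eq. (5) holds:

$$\sum_{i=1}^{L} \beta_i\, g(W_i \cdot X_j + b_i) = t_j, \quad j = 1, \dots, N \tag{5}$$

The N equations above can be written compactly as:

$$H\beta = T \tag{6}$$

where:

$$H(W_1, \dots, W_L, b_1, \dots, b_L, X_1, \dots, X_N) = \begin{bmatrix} g(W_1 \cdot X_1 + b_1) & \cdots & g(W_L \cdot X_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(W_1 \cdot X_N + b_1) & \cdots & g(W_L \cdot X_N + b_L) \end{bmatrix}_{N \times L}$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}$$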

Fig. 4 Diagram of the ELM [6]


Following [30], $H$ is called the hidden layer output matrix of the neural network; the ith column of $H$ is the output of the ith hidden node with respect to the inputs $X_1, \dots, X_N$. If the activation function $g$ is infinitely differentiable, the required number of hidden nodes satisfies $L \le N$. Equation (6) then becomes a linear system, and the output weights $\beta$ can be determined analytically by finding a least-squares solution as follows:

$$\beta = H^{\dagger} T \tag{7}$$

where $H^{\dagger}$ is the Moore–Penrose generalised inverse of $H$. Thus, the output weights are calculated using a mathematical transformation, without going through a lengthy training phase.
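The training procedure of Eqs. (4) to (7) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic two-class data, not the authors' implementation: the sigmoid activation and the sampling ranges for the weights ([−1, 1]) and biases ([0, 1]) follow the settings reported later in this paper, while the toy data, the number of hidden nodes and the function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden, rng):
    """X: (N, n) inputs, T: (N, m) one-hot targets. Returns (W, b, beta)."""
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_features))  # random input weights
    b = rng.uniform(0.0, 1.0, size=n_hidden)                 # random hidden biases
    H = sigmoid(X @ W.T + b)          # hidden layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ T      # Eq. (7): Moore-Penrose least-squares solution
    return W, b, beta

def predict_elm(X, W, b, beta):
    return sigmoid(X @ W.T + b) @ beta

# Synthetic, well separated two-class data (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
labels = np.repeat([0, 1], 50)
T = np.eye(2)[labels]                 # one-hot targets

W, b, beta = train_elm(X, T, n_hidden=20, rng=rng)
accuracy = np.mean(predict_elm(X, W, b, beta).argmax(axis=1) == labels)
print(beta.shape, accuracy)
```

Only `beta` is fitted; `W` and `b` stay at their random values, which is what makes training a single linear solve.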

The absence of a specific approach for determining the input-hidden layer weights is a major drawback of the ELM, which subjects it to local minima. This means that, for the given training data, there is no way to ensure that the trained ELM is the most appropriate one for performing the classification. Overcoming this drawback requires integrating an optimisation approach with the ELM, so that the optimal weights can be identified and the ELM's best performance can be achieved. The following subsection explains the concept of OGA-ELM.

2.4.2 OGA-ELM

This study adopted the OGA-ELM from [4], which is derived from the OGA, to classify the emotional speech signals into seven emotion classes. It uses a single selection criterion, and the input weights and the biases of the hidden nodes are tuned using the selection, crossover and mutation operations. Table 4 shows the parameters of the ELM and OGA used in this study's experiments.

Consider a collection of N feature samples $(X_j, t_j)$, where $X_j = [x_{j1}, x_{j2}, \dots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \dots, t_{jm}]^T \in R^m$.

Where:

$X_j$ is the input, i.e., the features extracted with MFCC;

$t_j$ is the true value (expected output).

At the beginning of OGA-ELM, the input weights and the thresholds of the hidden nodes are randomly initialised and encoded as chromosomes:

$$ch = \{w_{11}, w_{12}, \dots, w_{1n}, w_{21}, w_{22}, \dots, w_{2n}, \dots, w_{L1}, w_{L2}, \dots, w_{Ln}, b_1, \dots, b_L\}$$

Where:

$w_{ij}$: the weight connecting the ith hidden node and the jth input node, $w_{ij} \in [-1, 1]$;

$b_i$: the bias of the ith hidden node, $b_i \in [0, 1]$;

$n$: the number of input nodes; and

$L$: the number of hidden nodes.

The chromosome dimensionality is $(1 + n) \times L$, that is, $(1 + n) \times L$ parameters need to be optimised.
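A minimal sketch of this encoding, assuming the chromosome is stored as a flat NumPy vector with the weights laid out row by row (one row per hidden node) followed by the biases; the function names are hypothetical. With n = 546 input features and, for example, L = 100 hidden nodes, the chromosome has (1 + 546) × 100 = 54,700 genes.

```python
import numpy as np

def random_chromosome(n_inputs, n_hidden, rng):
    """Draw weights in [-1, 1] and biases in [0, 1] and concatenate them."""
    weights = rng.uniform(-1.0, 1.0, size=n_hidden * n_inputs)
    biases = rng.uniform(0.0, 1.0, size=n_hidden)
    return np.concatenate([weights, biases])

def decode_chromosome(ch, n_inputs, n_hidden):
    """Split a chromosome back into the (L, n) weight matrix and the (L,) biases."""
    W = ch[: n_hidden * n_inputs].reshape(n_hidden, n_inputs)
    b = ch[n_hidden * n_inputs:]
    return W, b

rng = np.random.default_rng(0)
ch = random_chromosome(n_inputs=546, n_hidden=100, rng=rng)
W, b = decode_chromosome(ch, n_inputs=546, n_hidden=100)
print(ch.shape, W.shape, b.shape)  # (54700,) (100, 546) (100,)
```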

The fitness function of OGA-ELM is calculated as shown in Eq. (8) to maximise the accuracy.


Table 4 Parameters of the ELM and OGA [4]

| ELM parameter | Value | OGA parameter | Value |
|---|---|---|---|
| ch | Combined bias and input weight | Number of iterations | 500 |
| ρ | Output weight matrix | Population size | 100 |
| Input weight | −1 to 1 | Crossover | Arithmetical |
| Value of the biases | 0–1 | Mutation | Uniform |
| Input node numbers | Input attributes | Population of the crossover (POPC) | The crossover population, which is 70% of the population |
| Hidden node numbers | 100–300, in increments of 25 | Population of the mutation (POPM) | The mutation population, which is 30% of the population |
| Output neuron (m) | Class value | Gamma value | 0.4 |
| Activation function | Sigmoid | Selection criteria | Random |
| Regularization factor (C) | −5 | | |


$$f(ch) = \sqrt{\frac{\sum_{j=1}^{N} \left\| \sum_{k=1}^{L} \rho_k \, g(w_k \cdot x_j + b_k) - t_j \right\|_2^2}{N}} \qquad (8)$$

Where:

$\rho$ = matrix of the output weights;

$t_j$ = expected output; and

$N$ = number of training samples.

The output weights are computed as:

$$\rho = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T \qquad (9)$$

Start
Load the dataset
Divide the dataset into a training set and a testing set
Select the activation function and determine the number of hidden layer nodes L
Begin the parameter optimization (OGA)
  Initial population:
    t = 0 // iteration number
    P = {ch_1, ch_2, ..., ch_100} // randomly initialise the input weights and biases
  Evaluation:
    Train the ELM and calculate the fitness value of each chromosome according to Eq. (8)
  Genetic operators:
    While (not termination condition) do
      t = t + 1
      POPC = {} // initialise the crossover population
      While |POPC| ≤ 70% of |P| do
        Based on the random selection criterion, select a pair of parents for crossover
        Mate the parents to create children (Child_1 and Child_2)
        POPC = POPC ∪ {Child_1, Child_2}
      end while
      POPM = {} // initialise the mutation population
      While |POPM| ≤ 30% of |P| do
        Randomly select a parent for mutation
        Perform mutation to create a child (Child)
        POPM = POPM ∪ {Child}
      end while
      P = {POPC, POPM} // merge POPC and POPM to get the next generation
      Train the ELM and calculate the fitness value of each chromosome according to Eq. (8)
      Sort P based on the fitness values
    end while
    Get the optimal weights and thresholds between the input layer and the hidden layer
End the parameter optimization (OGA)
Calculate the output matrix H of the hidden layer, Eq. (10)
Calculate the output weights ρ according to Eq. (9)
Save the predicting ELM model
Produce the prediction results for the testing set
Calculate the average accuracy rate
End

Fig. 5 Pseudocode of the OGA-ELM [4]


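$$H = \begin{bmatrix} g(w_1 \cdot X_1 + b_1) & \cdots & g(w_L \cdot X_1 + b_L) \\ \vdots & \cdots & \vdots \\ g(w_1 \cdot X_N + b_1) & \cdots & g(w_L \cdot X_N + b_L) \end{bmatrix}_{N \times L}, \quad \rho = \begin{bmatrix} \rho_1^T \\ \vdots \\ \rho_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} \qquad (10)$$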

$T = [t_1, t_2, \ldots, t_N]^T$ refers to the expected output of the training set, and $\rho = [\rho_1, \rho_2, \ldots, \rho_L]^T$ is the output weight matrix. $I$ is the identity matrix and $C$ is the regularization factor, which can be obtained by cross-validation during training. $H$ in Eq. (10) is the hidden layer output matrix of the ELM network; the $i$th column of $H$ is the output of the $i$th hidden node with respect to the inputs $X_1, \ldots, X_N$. The activation function $g$ is infinitely differentiable, and the desired number of hidden nodes satisfies $L \le N$. The OGA-ELM is explained in detail in the following steps:

First, generate the initial population (P) randomly, $P = \{ch_1, ch_2, \ldots, ch_{100}\}$.

Second, calculate the fitness value of each chromosome (ch) in the population using Eq. (8).
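The fitness evaluation in this step can be sketched in Python with NumPy. This is a minimal sketch, not the authors' MATLAB implementation: it assumes Eq. (8) is read as the root-mean-square training error, uses the sigmoid activation listed in Table 4, and treats the function names and the value of `C` as illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(X, W, b):
    """H (N x L): entry (j, k) is g(w_k . x_j + b_k), as in Eq. (10)."""
    return sigmoid(X @ W.T + b)

def output_weights(H, T, C):
    """rho = H^T (I/C + H H^T)^(-1) T, as in Eq. (9)."""
    n_samples = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(n_samples) / C + H @ H.T, T)

def fitness(W, b, X, T, C):
    """Root-mean-square training error of the ELM defined by (W, b); assumed reading of Eq. (8)."""
    H = hidden_output(X, W, b)
    rho = output_weights(H, T, C)
    residual = H @ rho - T
    return float(np.sqrt(np.sum(residual ** 2) / X.shape[0]))

# Toy usage with random data (hypothetical sizes: 20 samples, 5 features, 3 classes, 8 hidden nodes)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
T = np.eye(3)[rng.integers(0, 3, size=20)]  # one-hot targets
W = rng.uniform(-1.0, 1.0, size=(8, 5))     # input weights in [-1, 1]
b = rng.uniform(0.0, 1.0, size=8)           # biases in [0, 1]
print(fitness(W, b, X, T, C=1.0))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical choice of this sketch; it evaluates the same expression as Eq. (9).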

Fig. 6 OGA-ELM's flowchart [4] (panels: Parameter Optimization, ELM Training)


Third, the chromosomes are sorted by their fitness values f(ch). Next, the random selection criterion is used to select a pair of parents from the current population for the crossover operation, which creates a pair of new children for the new population. The random selection criterion randomly picks a chromosome from the population to be used in one of the two operations, crossover or mutation, and every chromosome in the population has an equal chance of being chosen.

Fourth, the arithmetic crossover is applied to exchange information between the two previously selected parents. The new children obtained by crossover are saved into the Population of the Crossover (POPC) until it reaches 70% of the population. The arithmetic crossover is given by:

$$\mathrm{Child}_1 = \alpha \cdot x + (1 - \alpha) \cdot y \qquad (11)$$

$$\mathrm{Child}_2 = \alpha \cdot y + (1 - \alpha) \cdot x \qquad (12)$$

The children are subject to the gene boundaries: [−1, 1] for the input-hidden layer weights and [0, 1] for the hidden layer biases. If a gene value exceeds the upper bound, it is set to the upper bound; if it falls below the lower bound, it is set to the lower bound. Here α is a randomly generated array with the size of the chromosome, each value of which is drawn from the range (−gamma, 1 + gamma), that is, (−0.4, 1.4), and x and y are the first and second selected parents.
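The arithmetic crossover of Eqs. (11) and (12), including the clamping to the gene bounds, can be sketched as follows. The function and argument names are hypothetical; the per-gene α range and the bounds follow the description above.

```python
import random

GAMMA = 0.4  # gamma value listed in Table 4

def clip(value, low, high):
    """Clamp a gene value to its [low, high] bounds."""
    return max(low, min(high, value))

def arithmetic_crossover(x, y, lower, upper, rng=random):
    """Create two children from parents x and y, one random alpha per gene."""
    child1, child2 = [], []
    for xi, yi, lo, hi in zip(x, y, lower, upper):
        alpha = rng.uniform(-GAMMA, 1.0 + GAMMA)  # alpha in (-0.4, 1.4)
        child1.append(clip(alpha * xi + (1.0 - alpha) * yi, lo, hi))
        child2.append(clip(alpha * yi + (1.0 - alpha) * xi, lo, hi))
    return child1, child2

# Two weight genes bounded by [-1, 1] and one bias gene bounded by [0, 1]
lower = [-1.0, -1.0, 0.0]
upper = [1.0, 1.0, 1.0]
c1, c2 = arithmetic_crossover([0.9, -0.8, 0.1], [-0.7, 0.6, 0.95], lower, upper,
                              rng=random.Random(1))
```

Because α may fall outside [0, 1], a child can land outside the segment between its parents, which is why the clamping step is needed.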

Fifth, the random selection criterion is also used to choose a chromosome from the current population for mutation. Mutation alters randomly selected genes of the chromosome. This work utilises uniform mutation. The uniform

Table 5 Description of the BES dataset used in the SI scenario

Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Number of All Utterances 68 46 71 81 79 62 126

Number of Training Utterance 54 37 57 65 63 50 101

Number of Testing Utterance 14 9 14 16 16 12 25

Emotion Label 1 2 3 4 5 6 7

Table 6 The overall results of the OGA-ELM in the SI scenario

No. of neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Execution time, training/testing (s)

100 81 611 25 25 93.26 76.42 76.42 76.42 76.42 381.9471

125 76 606 30 30 91.91 71.70 71.70 71.70 71.70 395.5644

150 69 599 37 37 90.03 65.09 65.09 65.09 65.09 410.5236

175 73 603 33 33 91.11 68.87 68.87 68.87 68.87 426.4971

200 64 594 42 42 88.68 60.38 60.38 60.38 60.38 438.4431

225 78 608 28 28 92.45 73.58 73.58 73.58 73.58 451.8474

250 69 599 37 37 90.03 65.09 65.09 65.09 65.09 465.4381

275 67 597 39 39 89.49 63.21 63.21 63.21 63.21 478.1978

300 76 606 30 30 91.91 71.70 71.70 71.70 71.70 490.1058


mutation substitutes the value of each selected gene with a uniform random value drawn between the gene's user-specified lower and upper bounds ([−1, 1] for the input-hidden layer weights and [0, 1] for the hidden layer biases). The new child obtained from mutation is saved into the Population of the Mutation (POPM) until the POPM reaches 30% of the population.
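Uniform mutation as described above can be sketched in a few lines. The text does not state how many genes are altered per mutation, so the `mutation_rate` parameter (the fraction of genes replaced) is an assumption of this sketch, as are the function and argument names.

```python
import random

def uniform_mutation(parent, lower, upper, mutation_rate=0.1, rng=random):
    """Return a child in which randomly chosen genes are redrawn uniformly within their bounds."""
    child = list(parent)  # copy so the parent is left unchanged
    n_mutations = max(1, round(mutation_rate * len(child)))  # assumed: at least one gene
    for i in rng.sample(range(len(child)), n_mutations):
        child[i] = rng.uniform(lower[i], upper[i])
    return child

# Eight weight genes bounded by [-1, 1] followed by two bias genes bounded by [0, 1]
parent = [0.5, -0.5, 0.25, 0.75, 0.1, -0.9, 0.3, 0.0, 0.6, 0.9]
lower = [-1.0] * 8 + [0.0] * 2
upper = [1.0] * 10
child = uniform_mutation(parent, lower, upper, mutation_rate=0.2, rng=random.Random(3))
```

Since each new value is drawn directly from the gene's own bounds, no clamping is needed after mutation.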

After the selection, crossover, and mutation operations are completed, a new population is created by merging the POPC and POPM. The next iteration continues with this new population, and the process is repeated. The iterative process stops when either the results have converged or the maximum number of iterations is exceeded. The pseudocode and flowchart of the OGA–ELM are shown in Figs. 5 and 6, respectively.
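The overall loop can be sketched as a generic optimiser that takes the fitness as a callable. This is a simplified illustration, not the implementation from [4]: it assumes a lower fitness is better (as for an error measure), fills the POPC and POPM to exactly 70% and 30% of the population so that the population size stays constant, uses a hypothetical `mutation_rate`, stops only on the iteration limit, and keeps track of the best chromosome seen so far, which the description above does not mention.

```python
import random

def oga_optimise(fitness, lower, upper, pop_size=100, iterations=500,
                 gamma=0.4, mutation_rate=0.1, seed=0):
    """Minimise `fitness` over box-bounded chromosomes with random parent selection."""
    rng = random.Random(seed)
    dim = len(lower)

    def clip(value, i):
        return max(lower[i], min(upper[i], value))

    def crossover(x, y):
        alphas = [rng.uniform(-gamma, 1.0 + gamma) for _ in range(dim)]
        c1 = [clip(a * xi + (1.0 - a) * yi, i)
              for i, (a, xi, yi) in enumerate(zip(alphas, x, y))]
        c2 = [clip(a * yi + (1.0 - a) * xi, i)
              for i, (a, xi, yi) in enumerate(zip(alphas, x, y))]
        return c1, c2

    def mutate(x):
        child = list(x)
        for i in rng.sample(range(dim), max(1, round(mutation_rate * dim))):
            child[i] = rng.uniform(lower[i], upper[i])
        return child

    # Initial population and evaluation
    population = [[rng.uniform(lower[i], upper[i]) for i in range(dim)]
                  for _ in range(pop_size)]
    population.sort(key=fitness)
    best = population[0]
    n_cross = round(0.7 * pop_size)  # size of POPC
    n_mut = pop_size - n_cross       # size of POPM
    for _ in range(iterations):
        popc = []
        while len(popc) < n_cross:
            x, y = rng.sample(population, 2)  # random selection of a pair of parents
            popc.extend(crossover(x, y))
        popm = [mutate(rng.choice(population)) for _ in range(n_mut)]
        population = sorted(popc[:n_cross] + popm, key=fitness)  # merge and sort
        if fitness(population[0]) < fitness(best):  # assumed best-so-far tracking
            best = population[0]
    return best

def sphere(ch):
    """Toy fitness used only to exercise the loop."""
    return sum(v * v for v in ch)

lower = [-1.0, -1.0, -1.0, 0.0]
upper = [1.0, 1.0, 1.0, 1.0]
best = oga_optimise(sphere, lower, upper, pop_size=20, iterations=15, seed=1)
```

In the OGA-ELM setting, `fitness` would decode a chromosome into input weights and biases, train the ELM, and return the value of Eq. (8).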

3 Experimental results and discussion

Several experiments were conducted based on four different scenarios: Subject Dependent (SD), Subject Independent (SI), Gender Dependent Female (GD-Female),

Fig. 7 The confusion matrix for the best results of OGA-ELM in SI scenario

Table 7 The best results of the OGA-ELM in the SI scenario for each class

Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Accuracy 93.40 95.28 90.57 97.17 96.23 96.23 83.96

Precision 100.00 83.33 66.67 93.33 83.33 78.57 61.76

Recall 50.00 55.56 57.14 87.50 93.75 91.67 84.00

F-Measure 66.67 66.67 61.54 90.32 88.24 84.62 71.19

G-Mean 70.71 68.04 61.72 90.37 88.39 84.87 72.03

tp 7 5 8 14 15 11 21

tn 92 96 88 89 87 91 68

fp 0 1 4 1 3 3 13

fn 7 4 6 2 1 1 4


and Gender Dependent Male (GD-Male). In the SI scenario, we considered neither the content of the utterance (the sentence spoken) nor the gender of the speaker (male or female), and used the whole BES dataset. In the SD scenario, we considered the content of the utterance and ignored the gender. Since the BES dataset contains 10 different sentences, in this scenario we separated it into 10 sub-datasets based on emotions and sentences. Finally, in the GD-Male and GD-Female scenarios, we considered the gender and ignored the content of the utterance. Thus, the BES dataset for these two scenarios is separated based on emotions and gender: the GD-Male scenario uses only the utterances recorded by males, whilst the GD-Female scenario uses only the utterances recorded by females.

The OGA-ELM was applied in several experiments in each scenario, based on the scenario's dataset, with a varying number of hidden neurons; each experiment ran for 500 iterations. All the experiments were implemented in MATLAB R2019a on a PC with a 3.20 GHz Core i7 processor, 16 GB RAM and a 1 TB SSD (Windows 10). The evaluation in this study follows [50], which addresses the classifier evaluation issue and presents effective measures. The performance of supervised machine learning (SML) algorithms can be evaluated in numerous ways. Furthermore, the confusion matrix of the recognized examples for each class is presented in order to evaluate the quality of the classification.

Therefore, numerous evaluation measures were utilized to evaluate the proposed OGA-ELM approach. The evaluation measures rely on the ground truth, which

Table 8 Description of the BES dataset used in the SD scenario

Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Number of All Utterances 7 9 6 8 9 7 14

Number of Training Utterance 6 7 5 6 7 6 11

Number of Testing Utterance 1 2 1 2 2 1 3

Sentence Code a02 a02 a02 a02 a02 a02 a02

Emotion Label 1 2 3 4 5 6 7

Table 9 The best overall results of the OGA-ELM in the SD scenario

No. of neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Execution time, training/testing (s)

100 12 72 0 0 100.00 100.00 100.00 100.00 100.00 253.8953

125 10 70 2 2 95.24 83.33 83.33 83.33 83.33 261.1620

150 11 71 1 1 97.62 91.67 91.67 91.67 91.67 270.1585

175 8 68 4 4 90.48 66.67 66.67 66.67 66.67 276.5632

200 9 69 3 3 92.86 75.00 75.00 75.00 75.00 282.2085

225 11 71 1 1 97.62 91.67 91.67 91.67 91.67 290.1031

250 9 69 3 3 92.86 75.00 75.00 75.00 75.00 296.8769

275 10 70 2 2 95.24 83.33 83.33 83.33 83.33 301.2103

300 8 68 4 4 90.48 66.67 66.67 66.67 66.67 304.7360


entails applying the model to predict the answer on the evaluation dataset and then comparing the predicted target with the actual answer. The measures used to evaluate the proposed OGA-ELM approach are accuracy, precision, recall, F-measure, and G-mean, given in Eqs. (13–17) [1, 5, 8].

$$\text{accuracy} = \frac{tp + tn}{tp + tn + fn + fp} \qquad (13)$$

$$\text{precision} = \frac{tp}{tp + fp} \qquad (14)$$

$$\text{recall} = \frac{tp}{tp + fn} \qquad (15)$$

Fig. 8 The confusion matrix for the best results of OGA-ELM in SD scenario

Table 10 The best results of the OGA-ELM in the SD scenario for each class

Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Accuracy 100.00 100.00 100.00 100.00 100.00 100.00 100.00

Precision 100.00 100.00 100.00 100.00 100.00 100.00 100.00

Recall 100.00 100.00 100.00 100.00 100.00 100.00 100.00

F-Measure 100.00 100.00 100.00 100.00 100.00 100.00 100.00

G-Mean 100.00 100.00 100.00 100.00 100.00 100.00 100.00

tp 1 2 1 2 2 1 3

tn 11 10 11 10 10 11 9

fp 0 0 0 0 0 0 0

fn 0 0 0 0 0 0 0


$$\text{F-Measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (16)$$

$$\text{G-Mean} = \sqrt{\text{recall} \times \text{precision}} \qquad (17)$$

where tn refers to true negatives, tp refers to true positives, fn refers to false negatives, and fp refers to false positives.
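Eqs. (13–17) can be checked with a few lines of Python. The function name `evaluation_measures` is hypothetical, and the sketch assumes non-zero denominators; the two examples use the counts reported in Table 6 (100 hidden neurons) and in the Anger column of Table 7.

```python
import math

def evaluation_measures(tp, tn, fp, fn):
    """Return the measures of Eqs. (13)-(17) as percentages rounded to two decimals."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * precision)
    measures = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }
    return {name: round(100 * value, 2) for name, value in measures.items()}

overall_si = evaluation_measures(tp=81, tn=611, fp=25, fn=25)  # Table 6, 100 neurons
anger_si = evaluation_measures(tp=21, tn=68, fp=13, fn=4)      # Table 7, Anger
print(overall_si["accuracy"], overall_si["precision"])  # 93.26 76.42
print(anger_si["f_measure"], anger_si["g_mean"])        # 71.19 72.03
```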

The results of the four scenarios (SI, SD, GD-Male, and GD-Female) are presented and discussed separately in the following sub-sections.

3.1 Subject Independent (SI) Scenario

This section presents and discusses the performance of the OGA-ELM in the SI scenario. In this scenario, we considered neither the content of the utterance (the sentence spoken) nor the gender of the speaker (male or female), and used the whole BES dataset. Table 5 describes the dataset used in this scenario. 80% of the dataset (427 utterances) was used for training, and the remaining 20% (106 utterances) was used for testing.
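The split sizes quoted above can be reproduced from the class totals in Table 5. The sketch below assumes that the 20% test share is taken per emotion class and rounded to the nearest integer; the paper does not state this explicitly, but the assumption reproduces the testing counts listed in Table 5. The function name `split_sizes` is hypothetical.

```python
def split_sizes(class_counts, test_fraction=0.2):
    """Return {emotion: (n_train, n_test)} with the test share rounded per class."""
    sizes = {}
    for emotion, total in class_counts.items():
        n_test = round(total * test_fraction)
        sizes[emotion] = (total - n_test, n_test)
    return sizes

# Class totals from Table 5 (SI scenario)
bes_si = {"Anxiety": 68, "Disgust": 46, "Happiness": 71, "Boredom": 81,
          "Neutral": 79, "Sadness": 62, "Anger": 126}
sizes = split_sizes(bes_si)
n_train = sum(train for train, _ in sizes.values())
n_test = sum(test for _, test in sizes.values())
print(n_train, n_test)  # 427 106
```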

The best results of the proposed OGA-ELM in the SI scenario were obtained with 100 hidden neurons, where the overall accuracy is 93.26%. The other evaluation measures reached 76.42%, 76.42%, 76.42%, and 60.89% for precision, recall, F-measure and G-mean, respectively. The overall results of the evaluation measures are

Table 11 Description of the dataset used in the GD-Female scenario

Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Number of All Utterances 37 36 51 51 51 44 80

Number of Training Utterance 30 29 41 41 41 35 64

Number of Testing Utterance 7 7 10 10 10 9 16

Emotion Label 1 2 3 4 5 6 7

Table 12 The best overall results of the OGA-ELM in the GD-Female scenario

No. of neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Execution time, training/testing (s)

100 62 407 7 7 97.10 89.86 89.86 89.86 89.86 332.5520

125 56 401 13 13 94.62 81.16 81.16 81.16 81.16 340.8604

150 52 397 17 17 92.96 75.36 75.36 75.36 75.36 351.3987

175 41 386 28 28 88.41 59.42 59.42 59.42 59.42 360.3810

200 53 398 16 16 93.37 76.81 76.81 76.81 76.81 372.7795

225 55 400 14 14 94.20 79.71 79.71 79.71 79.71 381.8452

250 50 395 19 19 92.13 72.46 72.46 72.46 72.46 393.0418

275 53 398 16 16 93.37 76.81 76.81 76.81 76.81 405.2617

300 48 393 21 21 91.30 69.57 69.57 69.57 69.57 413.9911


shown in Table 6. In addition, Fig. 7 shows the confusion matrix of the OGA-ELM for the best results, whilst Table 7 gives the results of the evaluation measures for each class.
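The overall counts in Table 6 and the per-class counts in Table 7 can be related by a simple check: summing tp, tn, fp and fn over the seven classes of Table 7 gives the 100-neuron row of Table 6. The paper does not name this aggregation, so reading it as micro-averaging is an interpretation of this sketch. Because every misclassified utterance counts once as a false positive for the predicted class and once as a false negative for the true class, the summed fp and fn are equal, which is why the overall precision and recall coincide.

```python
# (tp, tn, fp, fn) per class, copied from Table 7
per_class = {
    "Anxiety": (7, 92, 0, 7),
    "Disgust": (5, 96, 1, 4),
    "Happiness": (8, 88, 4, 6),
    "Boredom": (14, 89, 1, 2),
    "Neutral": (15, 87, 3, 1),
    "Sadness": (11, 91, 3, 1),
    "Anger": (21, 68, 13, 4),
}

# Sum each count over all classes
tp, tn, fp, fn = (sum(counts[i] for counts in per_class.values()) for i in range(4))
print(tp, tn, fp, fn)  # 81 611 25 25

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
```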

3.2 Subject Dependent (SD) Scenario

This section presents and discusses the best results of the OGA-ELM in the SD scenario. In this scenario, we considered the content of the utterance (the sentence spoken) and ignored the gender of the speaker. Since the BES dataset contains 10 different sentences, we separated it into 10 sub-datasets based on emotions and sentences. The accuracy of the OGA-ELM across the SD experiments ranged from 94.81% to 100.00%. As mentioned earlier, only the highest performance is reported in this study, which was obtained with sentence code "a02" ("Das will sie am Mittwoch abgeben"). Table 8 describes the dataset used in the SD scenario. 80% of the dataset (48 utterances) was used for training, and the remaining 20% (12 utterances) was used for testing.

Fig. 9 The confusion matrix for the best results of OGA-ELM in GD-Female scenario

Table 13 The best results of the OGA-ELM in the GD-Female scenario for each class

Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Accuracy 98.55 97.10 94.20 98.55 98.55 97.10 95.65

Precision 87.50 85.71 87.50 100.00 100.00 88.89 84.21

Recall 100.00 85.71 70.00 90.00 90.00 88.89 100.00

F-Measure 93.33 85.71 77.78 94.74 94.74 88.89 91.43

G-Mean 93.54 85.71 78.26 94.87 94.87 88.89 91.77

tp 7 6 7 9 9 8 16

tn 61 61 58 59 59 59 50

fp 1 1 1 0 0 1 3

fn 0 1 3 1 1 1 0


The best results of the proposed OGA-ELM in the SD scenario were obtained with 100 hidden neurons, where the overall accuracy is 100.00%. The other evaluation measures reached 100.00%, 100.00%, 100.00%, and 100.00% for precision, recall, F-measure and G-mean, respectively. The overall results of the evaluation measures are shown in Table 9. In addition, Fig. 8 shows the confusion matrix of the OGA-ELM for the best results, whilst Table 10 gives the results of the evaluation measures for each class.

3.3 Gender Dependent Female (GD-Female) Scenario

This section presents and discusses the best results of the OGA-ELM in the GD-Female scenario. In this scenario, we considered the gender of the speaker and ignored the content of the utterance (the sentence spoken); only the utterances of the BES dataset recorded by females were used. Table 11 describes the dataset used in the GD-Female scenario. 80% of the dataset (281 utterances) was used for training, and the remaining 20% (69 utterances) was used for testing.

The best results of the proposed OGA-ELM in the GD-Female scenario were obtained with 100 hidden neurons, where the overall accuracy is 97.10%. The other evaluation measures reached 89.86%, 89.86%, 89.86%, and 81.32% for precision, recall, F-measure and G-mean, respectively. The overall results of the evaluation measures are shown in Table 12. In addition, Fig. 9 shows the confusion matrix of the OGA-ELM for the best results, whilst Table 13 gives the results of the evaluation measures for each class.

Table 14 Description of the dataset used in the GD-Male scenario

Emotion Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Number of All Utterances 31 10 20 30 28 18 46

Number of Training Utterance 25 8 16 24 22 14 37

Number of Testing Utterance 6 2 4 6 6 4 9

Emotion Label 1 2 3 4 5 6 7

Table 15 The best overall results of the OGA-ELM in the GD-Male scenario

No. of neurons tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Execution time, training/testing (s)

100 32 217 5 5 96.14 86.49 86.49 86.49 86.49 276.6538

125 28 213 9 9 93.05 75.68 75.68 75.68 75.68 280.4774

150 31 216 6 6 95.37 83.78 83.78 83.78 83.78 285.0864

175 24 209 13 13 89.96 64.86 64.86 64.86 64.86 289.9731

200 27 212 10 10 92.28 72.97 72.97 72.97 72.97 294.3590

225 29 214 8 8 93.82 78.38 78.38 78.38 78.38 301.1567

250 27 212 10 10 92.28 72.97 72.97 72.97 72.97 307.5201

275 31 216 6 6 95.37 83.78 83.78 83.78 83.78 315.0164

300 26 211 11 11 91.51 70.27 70.27 70.27 70.27 320.1097


3.4 Gender Dependent Male (GD-Male) Scenario

This section presents and discusses the best results of the OGA-ELM in the GD-Male scenario. In this scenario, we considered the gender of the speaker and ignored the content of the utterance (the sentence spoken); only the utterances of the BES dataset recorded by males were used. Table 14 describes the dataset used in the GD-Male scenario. 80% of the dataset (146 utterances) was used for training, and the remaining 20% (37 utterances) was used for testing.

The best results of the proposed OGA-ELM in the GD-Male scenario were obtained with 100 hidden neurons, where the overall accuracy is 96.14%. The other evaluation measures reached 86.49%, 86.49%, 86.49%, and 75.55% for precision, recall, F-measure and G-mean, respectively. The overall results of the evaluation measures are shown in Table 15. In addition, Fig. 10 shows the confusion matrix of the OGA-ELM for the best results, whilst Table 16 gives the results of the evaluation measures for each class.

Based on all the experimental results above (Tables 5–16), we can make a critical observation. The OGA could create suitable weights and biases for the single hidden

Fig. 10 The confusion matrix for the best results of OGA-ELM in GD-Male scenario

Table 16 The best results of the OGA-ELM in the GD-Male scenario for each class

Anxiety Disgust Happiness Boredom Neutral Sadness Anger

Accuracy 100.00 97.30 100.00 94.59 97.30 94.59 89.19

Precision 100.00 100.00 100.00 83.33 100.00 100.00 69.23

Recall 100.00 50.00 100.00 83.33 83.33 50.00 100.00

F-Measure 100.00 66.67 100.00 83.33 90.91 66.67 81.82

G-Mean 100.00 70.71 100.00 83.33 91.29 70.71 83.21

tp 6 1 4 5 5 2 9

tn 31 35 33 30 31 33 24

fp 0 0 0 1 0 0 4

fn 0 1 0 1 1 2 0


layer of the ELM in order to minimize classification errors. Avoiding unsuitable weights and biases keeps the ELM from getting stuck in local optima of the weights and biases. Consequently, the OGA-ELM performed well in the four scenarios, with an accuracy of 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively.

Furthermore, the proposed OGA-ELM is compared with some recent works [9, 16, 18, 27, 45, 55, 58, 60] in terms of accuracy in the four scenarios (SI, SD, GD-Male, and GD-Female). All these methods used the BES dataset in their experiments. Table 17 compares the accuracy of the proposed OGA-ELM with that of the previous works.

Based on the results in Table 17, the OGA-ELM outperformed all the previous works in the SI, SD, and GD-Male scenarios. Only in the GD-Female scenario was the highest accuracy of the proposed OGA-ELM slightly lower than that of [58], which achieved 98.94% against 97.10% for the OGA-ELM. These results support the claim that generating suitable weights and biases for the ELM, and thereby avoiding local optima, reduces classification errors, which suggests that the OGA-ELM is a reliable model for ESR. The best results are shown in Table 17.

In addition, numerous experiments were performed with the basic ELM, a Feedforward Neural Network (NN), and an SVM in the four scenarios: SI, SD, GD-Male, and GD-Female. Due to the page limit, we report only the highest performance of the ELM, NN, and SVM in each scenario in terms of accuracy, recall, precision, G-mean, F-measure, true positives, true negatives, false positives, false negatives, and execution time. Tables 18, 19, 20 and 21 provide the results of the proposed OGA-ELM, basic ELM, NN, and SVM in the SI, SD, GD-Male, and GD-Female scenarios, respectively. The best performance of the proposed OGA-

Table 17 Comparison of accuracy (%) between the proposed OGA-ELM and previous works

| Reference | No. of emotions | Result based on SI | Result based on SD | Result based on GD-Male | Result based on GD-Female |
|---|---|---|---|---|---|
| ELM [58] | 7 | 90.31 | 99.47 | 92.98 | 98.94 |
| KELM [27] | 7 | 92.90 | – | – | – |
| SMO [18] | 7 | 75.50 | – | – | – |
| SVM [45] | 3 | 88.33 | – | – | – |
| SVM [9] | 7 | – | – | 94.90 | 85.77 |
| SVM [16] | 7 | – | 82.10 | – | – |
| SVM [55] | 6 | 88.80 | – | – | – |
| SVM [60] | 7 | 70.59 | – | – | – |
| OGA-ELM | 7 | 93.26 | 100.00 | 96.14 | 97.10 |

Table 18 The best overall results of the proposed OGA-ELM, basic ELM, NN and SVM in the SI scenario

Approach tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Execution time, training/testing (s)

OGA-ELM 81 611 25 25 93.26 76.42 76.42 76.42 76.42 381.9471

Basic ELM 46 576 60 60 83.83 43.40 43.40 43.40 43.40 22.1431

NN 43 573 63 63 83.02 40.57 40.57 40.57 40.57 25.0304

SVM 42 572 64 64 82.75 39.62 39.62 39.62 39.62 22.9662


ELM reached an accuracy of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. The best performance of the basic ELM reached 83.83%, 85.71%, 83.01%, and 84.68%; that of the NN reached 83.02%, 83.33%, 81.47%, and 83.44%; and that of the SVM reached 82.75%, 85.71%, 82.24%, and 85.51% for the same scenarios, respectively.

Based on the results in Tables 18, 19, 20 and 21, the OGA-ELM outperformed the basic ELM, NN and SVM in all four scenarios. This supports the claim that generating suitable weights and biases for the ELM reduces classification errors. Thus, the OGA-ELM performed well in the four scenarios compared with some previous works (see Table 17) and with the basic ELM, NN and SVM (see Tables 18, 19, 20 and 21). However, the original ELM, NN, and SVM outperformed the proposed OGA-ELM in terms of execution time (for example, 22.14–25.03 s against 381.95 s in the SI scenario; see Table 18), because the OGA-ELM is based on a GA, which needs more time to find the best values of the input weights and biases.

4 Conclusion

In this study, we have proposed an enhanced ESR system based on conventional MFCC features and our previously developed ELM variant, the OGA-ELM. The OGA-ELM was evaluated on the BES dataset in four scenarios: SI, SD, GD-Male, and GD-Female. The outcome indicated the superiority of the OGA-ELM over some previous works (see Table 17) and over the basic ELM, NN, and SVM (see Tables 18, 19, 20 and 21) in the four scenarios. The best performance of the OGA-ELM reached an accuracy of 93.26%, 100.00%, 96.14% and 97.10% for the SI, SD, GD-Male, and

Table 19 The best overall results of the proposed OGA-ELM, basic ELM, NN and SVM in the SD scenario

Approach tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Execution time, training/testing (s)

OGA-ELM 12 72 0 0 100.00 100.00 100.00 100.00 100.00 253.8953

Basic ELM 6 66 6 6 85.71 50.00 50.00 50.00 50.00 10.0748

NN 5 65 7 7 83.33 41.67 41.67 41.67 41.67 12.5962

SVM 6 66 6 6 85.71 50.00 50.00 50.00 50.00 11.0365

Table 20 The best overall results of the proposed OGA-ELM, basic ELM, NN and SVM in the GD-Male scenario

Approach tp tn fp fn Accuracy Precision Recall F-Measure G-Mean Execution time, training/testing (s)

OGA-ELM 32 217 5 5 96.14 86.49 86.49 86.49 86.49 276.6538

Basic ELM 15 200 22 22 83.01 40.54 40.54 40.54 40.54 13.0750

NN 13 198 24 24 81.47 35.14 35.14 35.14 35.14 15.6759

SVM 14 199 23 23 82.24 37.84 37.84 37.84 37.84 14.0021


GD-Female scenarios, respectively. Since the current study considered only offline ESR, future work will create an ESR system that performs feature extraction and classification online and in real time, so that it can be used to analyse and detect the caller's emotion in call centers. In addition, the proposed OGA-ELM can be applied in other applications such as voice pathology detection, accent classification, and speaker identification. Finally, other optimisation approaches for the ELM will be explored in order to generate the most suitable weights and biases and thereby minimize classification errors.

Acknowledgements This project was funded by the Universiti Kebangsaan Malaysia under Dana Impak

Perdana grant (Research code: GUP-2020-063).

References

1. Albadr MA, Tiun S, Ayob M, al-Dhief F (2020) Genetic algorithm based on natural selection theory for

optimization problems. Symmetry 12(11):1758

2. Albadr MAA, Tiun S (2020) Spoken language identification based on particle swarm optimisation–extreme

learning machine approach. Circ Syst Signal Process 1–27

3. Albadr MAA, Tiun S, al-Dhief FT, Sammour MAM (2018) Spoken language identification based on the

enhanced self-adjusting extreme learning machine approach. PLoS One 13(4):e0194770

4. Albadr MAA, Tiun S, Ayob M, al-Dhief FT (2019) Spoken language identification based on optimised

genetic algorithm–extreme learning machine approach. Int J Speech Technol 22(3):711–727

5. Albadr MAA, Tiun S, Ayob M, al-Dhief FT, Omar K, Hamzah FA (2020) Optimised genetic algorithm-

extreme learning machine approach for automatic COVID-19 detection. PLoS One 15(12):e0242899

6. Albadra MAA, Tiuna S (2017) Extreme learning machine: a review. Int J Appl Eng Res 12(14):4610–4623

7. Al-Dhief FT et al (2020) A survey of voice pathology surveillance systems based on internet of things and

machine learning algorithms. IEEE Access 8:64514–64533

8. Al-Dhief FT et al (2020) Voice pathology detection using machine learning technique. In 2020 IEEE 5th

international symposium on telecommunication technologies (ISTT). IEEE

9. Alonso JB, Cabrera J, Medina M, Travieso CM (2015) New approach in quantification of emotional

intensity from the speech signal: emotional temperature. Expert Syst Appl 42(24):9554–9564

10. Badshah AM et al (2017) Speech emotion recognition from spectrograms with deep convolutional neural

network. In: 2017 international conference on platform technology and service (PlatCon). IEEE

11. Baroi OL et al (2019) Effects of different environmental noises and sampling frequencies on the performance of MFCC and PLP based Bangla isolated word recognition system. In: 2019 1st international conference on advances in Science, engineering and robotics technology (ICASERT). IEEE

12. Basu S et al (2017) A review on emotion recognition using speech. In: 2017 international conference on inventive communication and computational technologies (ICICCT). IEEE

13. Bi W, Xu Y, Wang H (2020) Comparison of searching behaviour of three evolutionary algorithms applied

to water distribution system design optimization. Water 12(3):695

14. Burkhardt F et al (2005) A database of German emotional speech. In: Ninth European Conference on

Speech Communication and Technology

Table 21 Illustration of the best overall result of the proposed OGA-ELM, basic ELM, NN and SVM in the GD-Female scenario

| Approach | tp | tn | fp | fn | Accuracy (%) | Precision (%) | Recall (%) | F-Measure (%) | G-Mean (%) | Execution Training / Testing Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| OGA-ELM | 62 | 407 | 7 | 7 | 97.10 | 89.86 | 89.86 | 89.86 | 89.86 | 332.5520 |
| Basic ELM | 32 | 377 | 37 | 37 | 84.68 | 46.38 | 46.38 | 46.38 | 46.38 | 15.1266 |
| NN | 29 | 374 | 40 | 40 | 83.44 | 42.03 | 42.03 | 42.03 | 42.03 | 17.8584 |
| SVM | 34 | 379 | 35 | 35 | 85.51 | 49.28 | 49.28 | 49.28 | 49.28 | 16.0334 |
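The derived columns of Table 21 can be checked from the confusion counts alone. The Python sketch below is an illustration under stated assumptions, not the authors' evaluation code: the function name `metrics` is hypothetical, and the G-Mean is assumed to be the geometric mean of precision and recall, which is the definition that reproduces the tabulated values (a recall/specificity G-Mean would give different numbers).

```python
from math import sqrt


def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F-measure, G-mean) in percent.

    Hypothetical helper for illustration only. The G-mean is ASSUMED to be
    sqrt(precision * recall), chosen because it matches Table 21.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = sqrt(precision * recall)  # assumed definition
    values = (accuracy, precision, recall, f_measure, g_mean)
    return tuple(round(100 * v, 2) for v in values)


# Confusion counts (tp, tn, fp, fn) copied from Table 21, GD-Female scenario.
rows = {
    "OGA-ELM": (62, 407, 7, 7),
    "Basic ELM": (32, 377, 37, 37),
    "NN": (29, 374, 40, 40),
    "SVM": (34, 379, 35, 35),
}

for name, counts in rows.items():
    print(name, metrics(*counts))
```

Because fp equals fn in every row, precision and recall coincide, so the F-Measure and the assumed G-Mean collapse to the same value; this is why four of the table's columns are identical.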

23987Multimedia Tools and Applications (2022) 81:23963–23989

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

15. Calvo RA, D'Mello S (2010) Affect detection: an interdisciplinary review of models, methods, and their

applications. IEEE Trans Affect Comput 1(1):18–37

16. Cao H, Verma R, Nenkova A (2015) Speaker-sensitive emotion recognition via ranking: studies on acted

and spontaneous speech. Comput Speech Lang 29(1):186–202

17. Chavhan Y, Dhore M, Yesaware P (2010) Speech emotion recognition using support vector machine. Int J

Comput Appl 1(20):6–9

18. Choudhury AR et al (2018) Emotion recognition from speech signals using excitation source and spectral

features. In: 2018 IEEE applied signal processing conference (ASPCON). IEEE

19. Dendukuri LS, Hussain SJ (2019) Statistical feature set calculation using Teager energy operator on

emotional speech signals. In: 2019 international conference on wireless communications signal processing

and networking (WiSPNET). IEEE

20. Deng C, Huang GB, Xu J, Tang JX (2015) Extreme learning machines: new trends and applications.

Science China Inf Sci 58(2):1–16

21. Dogra A, Kaul A, Sharma R (2019) Automatic recognition of dialects of Himachal Pradesh using MFCC & GMM. In: 2019 5th international conference on signal processing, computing and control (ISPCC). IEEE

22. El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification

schemes, and databases. Pattern Recogn 44(3):572–587

23. Fortin F-A et al (2012) DEAP: evolutionary algorithms made easy. J Mach Learn Res 13(1):2171–2175

24. Gangamohan P, Kadiri SR, Yegnanarayana B (2016) Analysis of emotional speech—A review, in Toward

Robotic Socially Believable Behaving Systems-Volume I, Springer, p. 205–238

25. Ghasemi J, Esmaily J, Moradinezhad R (2020) Intrusion detection system using an optimized kernel

extreme learning machine and efficient features. Sādhanā 45(1):1–9

26. Gogna A, Tayal A (2012) Comparative analysis of evolutionary algorithms for image enhancement. Int J

Met 2(1):80–100

27. Guo L, Wang L, Dang J, Liu Z, Guan H (2019) Exploration of complementary features for speech emotion

recognition based on kernel extreme learning machine. IEEE Access 7:75798–75809

28. Han W et al (2006) An efficient MFCC extraction method in speech recognition. In: 2006 IEEE international symposium on circuits and systems. IEEE

29. Huang G-B, Chen L, Siew CK (2006) Universal approximation using incremental constructive feedforward

networks with random hidden nodes. IEEE Trans Neural Netw 17(4):879–892

30. Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications.

Neurocomputing 70(1):489–501

31. Huang G-B et al (2011) Extreme learning machine for regression and multiclass classification. IEEE Trans

Syst, Man Cybern, Part B (Cybernetics) 42(2):513–529

32. Huang G-B et al (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans

Systems, Man, Cybernetics, Part B (Cybernetics) 42(2):513–529

33. Jain M et al (2020) Speech emotion recognition using support vector machine. arXiv preprint arXiv:2002.07590

34. Juvela L et al (2018) Speech waveform synthesis from MFCC sequences with generative adversarial

networks. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP).

IEEE

35. Kaya H, Karpov AA (2018) Efficient and effective strategies for cross-corpus acoustic emotion recognition.

Neurocomputing 275:1028–1034

36. Kaya H, Karpov AA, Salah AA (2016) Robust acoustic emotion recognition based on cascaded normalization and extreme learning machines. In: international symposium on neural networks. Springer

37. Kostoulas T, Mporas I, Kocsis O, Ganchev T, Katsaounos N, Santamaria JJ, Jimenez-Murcia S, Fernandez-

Aranda F, Fakotakis N (2012) Affective speech interface in serious games for supporting therapy of mental

disorders. Expert Syst Appl 39(12):11072–11079

38. Kuchibhotla S, Vankayalapati HD, Anne KR (2016) An optimal two stage feature selection for speech

emotion recognition using acoustic features. Int J Speech Technol 19(4):657–667

39. Lopez-de-Ipiña K et al (2015) On automatic diagnosis of Alzheimer’s disease based on spontaneous speech

analysis and emotional temperature. Cogn Comput 7(1):44–55

40. Mar LL, Pa WP (2019) Depression detection from speech emotion recognition. Seventeenth International

Conference on Computer Applications (ICCA 2019)

41. Muda L, Begam M, Elamvazuthi I (2010) Voice recognition algorithms using mel frequency cepstral

coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083

42. Murugappan M et al (2020) Emotion classification in Parkinson's disease EEG using RQA and ELM. In:

2020 16th IEEE international colloquium on Signal Processing & its Applications (CSPA). IEEE

43. Neiberg D, Elenius K (2008) Automatic recognition of anger in spontaneous speech. In: Ninth Annual

Conference of the International Speech Communication Association


44. Özseven T (2019) A novel feature selection method for speech emotion recognition. Appl Acoust 146:320–326

45. Pakyurek M, Atmis M, Kulac S, Uludag U (2020) Extraction of novel features based on histograms of

MFCCs used in emotion classification from generated original speech dataset. Elektronika ir

Elektrotechnika 26(1):46–51

46. Petrushin VA (2000) Emotion recognition in speech signal: experimental study, development, and application. In: Sixth International Conference on Spoken Language Processing

47. Poorna S, Nair G (2019) Multistage classification scheme to enhance speech emotion recognition. Int J

Speech Technol 22(2):327–340

48. Renanti MD, Buono A, Kusuma WA (2013) Infant cries identification by using codebook as feature

matching, and mfcc as feature extraction. J Theoretical Appl Inform Technol 56(3)

49. Shah AF, Anto PB (2017) Hybrid spectral features for speech emotion recognition. In: 2017 international conference on innovations in information, embedded and communication systems (ICIIECS). IEEE

50. Sokolova M, Japkowicz N, Szpakowicz S (2006) Beyond accuracy, F-score and ROC: a family of

discriminant measures for performance evaluation. In: Australasian joint conference on artificial intelligence. Springer

51. Trang H, Loc TH, Nam HBH (2014) Proposed combination of PCA and MFCC feature extraction in speech

recognition system. In: 2014 International Conference on Advanced Technologies for Communications

(ATC 2014). IEEE

52. Tripathi A, Singh U, Bansal G, Gupta R, Singh AK (2020) A review on emotion detection and classification

using speech. Available at SSRN 3601803

53. Tzinis E, Potamianos A (2017) Segment-based speech emotion recognition using recurrent neural networks.

In: 2017 seventh international conference on affective computing and intelligent interaction (ACII). IEEE

54. van Heeswijk M (2015) Advances in extreme learning machines

55. Wang K et al (2015) Speech emotion recognition using Fourier parameters. IEEE Trans Affect Comput

6(1):69–75

56. Wang Y, Cao F, Yuan Y (2011) A study on effectiveness of extreme learning machine. Neurocomputing

74(16):2483–2490

57. Wilhelmstötter F (2021) Jenetics Library User’s Manual 6.2. [Online]. Available: https://jenetics.io

58. Yogesh C et al (2017) A new hybrid PSO assisted biogeography-based optimization for emotion and stress

recognition from speech signal. Expert Syst Appl 69:149–158

59. Yu F et al (2016) Improved roulette wheel selection-based genetic algorithm for TSP. In: 2016 international

conference on network and information Systems for Computers (ICNISC), IEEE

60. Zaidan NA, Salam MS (2016) MFCC global features selection in improving speech emotion recognition

rate. In: Advances in machine learning and signal processing. Springer, p. 141–153

61. Zhang X, Sun J, Luo Z (2014) One-against-all weighted dynamic time warping for language-independent

and speaker-dependent speech recognition in adverse conditions. PLoS One 9(2):e85458

62. Zhao S et al (2014) Automatic detection of expressed emotion in Parkinson's disease. In: 2014 IEEE

international conference on acoustics, speech and signal processing (ICASSP), IEEE

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Affiliations

Musatafa Abbas Abbood Albadr & Sabrina Tiun & Masri Ayob & Fahad Taha AL-Dhief & Khairuddin Omar & Mhd Khaled Maen

CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia

School of Electrical Engineering, Department of Communication Engineering, Universiti Teknologi Malaysia, UTM Johor Bahru, Johor, Malaysia

Department of Information and Technology, Uppsala University, Uppsala, Sweden



A preview of this full-text is provided by Springer Nature.

Learn more

Content available from Multimedia Tools and Applications

This content is subject to copyright. Terms and conditions apply.

Enhanced PRIM recognition using PRI sound and deep learning techniques

Article

Full-text available

May 2024
PLOS ONE

Pulse repetition interval modulation (PRIM) is integral to radar identification in modern electronic support measure (ESM) and electronic intelligence (ELINT) systems. Various distortions, including missing pulses, spurious pulses, unintended jitters, and noise from radar antenna scans, often hinder the accurate recognition of PRIM. This research introduces a novel three-stage approach for PRIM recognition, emphasizing the innovative use of PRI sound. A transfer learning-aided deep convolutional neural network (DCNN) is initially used for feature extraction. This is followed by an extreme learning machine (ELM) for real-time PRIM classification. Finally, a gray wolf optimizer (GWO) refines the network’s robustness. To evaluate the proposed method, we develop a real experimental dataset consisting of sound of six common PRI patterns. We utilized eight pre-trained DCNN architectures for evaluation, with VGG16 and ResNet50V2 notably achieving recognition accuracies of 97.53% and 96.92%. Integrating ELM and GWO further optimized the accuracy rates to 98.80% and 97.58. This research advances radar identification by offering an enhanced method for PRIM recognition, emphasizing the potential of PRI sound to address real-world distortions in ESM and ELINT systems.

Extreme Learning machine algorithm for breast Cancer diagnosis

Article

Full-text available

Jun 2024
MULTIMED TOOLS APPL

Recently, significant attention has been given to using machine learning (ML) and data mining algorithms for diagnosing breast cancer (BC). However, many of these efforts still require improvement. Either they need proper statistical evaluation, or they use inadequate assessment metrics or both. In this context, the extreme learning machine (ELM) algorithm, known for its effectiveness, emerges as a promising approach for data classification. Therefore, this study proposes the ELM algorithm and statistically evaluates its performance in diagnosing BC. The ELM algorithm offers several advantages: it can eliminate overfitting, address binary and multi-class classification issues, and exhibit performance comparable to a kernel-based support vector machine with a neural network structure. To evaluate the ELM algorithm’s performance, two BC datasets, namely the Wisconsin breast cancer database (WBCD) and the Wisconsin diagnostic breast cancer (WDBC) were used. The experimental results demonstrated the excellent performance of the ELM algorithm. Using the WBCD dataset, the ELM algorithm achieved an average accuracy of 92.06%, precision of 80.25%, recall of 96.60%, F-measure of 87.56%, G-mean of 87.99%, MCC of 82.67%, and specificity of 90.27%. Similarly, using the WDBC dataset, it achieved an average accuracy of 94.52%, precision of 92.28%, recall of 93.09%, F-measure of 92.66%, G-mean of 92.67%, MCC of 88.32%, and specificity of 95.42%. These results highlight the ELM algorithm’s reliability as a classifier for diagnosing BC and its potential for addressing healthcare-related problems in other applications.

Improving credit card fraud detection using machine learning and GAN technology

Article

Full-text available

Apr 2024

The motivation behind this study stems from identifying contemporary challenges associated with prosecuting electronic financial crimes. Highlights ongoing efforts to identify and address credit card fraud and fraud as there are many credit card fraud issues in the financial industry. Traditional methods are no longer able to keep up with modern methods of tracking the behavior of credit card users and detecting suspicious cases. Artificial intelligence technology offers promising solutions to quickly detect and prevent future fraud by credit card users. Datasets used to detect financial anomalies are affected by imbalances in financial transactions, and this study aims to address the imbalance of financial fraud datasets using adversarial algorithm techniques and compare them with the most commonly used methods in the scientific literature.The results showed that the function of the adversarial algorithm is consistent in several ways, including allowing researchers and interested parties to determine data growth rates, which helps bring the dataset closer to real-time data from financial markets and banks. This study proposes a hybrid machine learning model consisting of three machine learning algorithms: decision trees, logistic regression, and Naive Bayes algorithm, and calculates performance metrics such as accuracy, specificity, precision, and F1 score. Experimental results reveal varying degrees of accuracy in fraud detection. Model testing using the SMOTE method recorded an accuracy of 98.1% and an F-score of 98.3%. On the other hand, the oversampling and under sampling test methods showed similar performance, with the two methods recording an accuracy of 94.3 and 95.3 and an F-score of 94.7 and 95.1, respectively. Finally, the GAN method excelled, receiving a test score and accuracy of 99.9%, as well as exceptional precision, recall, and F1 score. 
As a result, we conclude that the GAN method is able to balance the data set, which in turn is reflected in the performance of the model in training and the accuracy of predictions when tested. Historical transaction analysis identifies behavioral patterns and adapts to evolving fraud techniques. This approach enhances transaction security and protects against potential financial losses due to fraud. This contribution allows financial institutions and companies to proactively combat fraudulent activities.

Online sequential extreme learning machine approach for breast cancer diagnosis

Article

Full-text available

Mar 2024
NEURAL COMPUT APPL

The utilisation of DM (Data Mining) and ML (Machine Learning) approaches in the BC (Breast Cancer) diagnosis has recently gained a lot of consideration. However, most of these works still need enhancement since either they were assessed utilising insufficient evaluation-metrics, or they weren’t statistically-assessed, or both. Lately, one-of-the-most effective and well-known ML approaches is OSELM (Online Sequential Extreme Learning Machine), it has seen as an efficient and reputable technique for classifying-data, however it has not been implemented in BC diagnosis problem. Consequently, this research proposes the OSELM approach in-order-to enhance the rate of accuracy for the BC diagnosis. The OSELM technique has the ability to (a) capability to be applied on both (multi-class and binary) classification, (b) prevent overfitting, as well as (c) It has a comparable ability to kernel-based SVM (Support Vector Machine) and operates with a neural-network-structure. In this research, two different BC datasets (WDBC (Wisconsin Diagnostic Breast Cancer) and WBCD (Wisconsin Breast Cancer Database)) were utilised to evaluate the OSELM approach performance. The experiments outcomes have revealed the outstanding-performance of the proposed OSELM approach, which attained an average of precision 94.09%, recall 95.57%, accuracy 96.13%, G-Mean 94.82%, F-Measure 94.80%, specificity 96.51%, and MCC 91.76% using WDBC dataset. Besides, attained an average of precision 95.08%, recall 98.89%, accuracy 97.89%, G-Mean 96.96%, F-Measure 96.93%, specificity 97.41%, and MCC 95.39% using WBCD dataset. This indicates that the OSELM approach is a reliable technique for the BC diagnosis and might be suitable for solving other-applications-related issues in the sector of healthcare. Besides, it can serve as a valuable decision-support tool for oncologists, providing additional information and insights to aid in their diagnoses and treatment plans.

Efficient loss updated XGBoost with deep emended genetic algorithm for detecting online fraudulent transactions

Article

Full-text available

Apr 2024
MULTIMED TOOLS APPL

In the fast-paced technological era, online financial transactions have gained widespread use as it offers significant merits to customers for easy transfer of money through smart phones. Nevertheless, fraudulent transactions put individual’s money into risk, for which, suitable approaches are required to detect such deceits. Concurrently, with the progress of ML (Machine Learning) approaches, existing works have bidden to identify the fraudulent and normal transactions. However, studies lacked in accordance with accuracy rate and only limited focus has been provided for detection of generalized fraudulent transactions. Considering this, the current study considers IoT fraud dataset and proposes DEGA (Deep Emended Genetic Algorithm) to attain better performance for detecting fraudulent and normal transactions. This model employs a competitive approach, integrating, new crossover and selection methods. This intend to improvise the ability of global search and partition the chromosomes into losers and winners. This ensures high quality parent for selection. Besides, a dynamic-mutation function is also proposed for enhancing the model’s searching ability. Subsequently, the study proposes EL-UXGB (Efficient Loss-Updated eXtreme Gradient Boosting) wherein dual sigmoid loss functions are proposed to resolve the imbalanced label cases. The overall performance of this study is assessed through analysis that confirms its effectiveness in detecting fraudulent transactions.

An Improved MSER using Grid Search based PCA and Ensemble Voting Technique

Article

Full-text available

Mar 2024
MULTIMED TOOLS APPL

Recognizing speech emotions is indeed a crucial aspect of human–computer interaction. However, developing a model that can accurately process multiple languages is one of the challenging tasks. The feature selection process plays a vital role in multilingual speech emotion recognition because it helps to reduce irrelevant features from each language, ultimately enhancing the performance of the model. This research aims to address this task in a more precise way. It achieves this by employing Grid Search based Principal Component Analysis and an ensemble voting classifier for multilingual speech emotion recognition. Here we mention three essential steps of recognizing emotion from a multilingual dataset. The first step involves feature extraction from speech signals, such as MFCC, root-mean-square, ZCR, flux, roll-off, Centroid, bandwidth, chroma, and fundamental frequency. The second step entails the selection of an essential feature subset by removing redundant and unnecessary features using Principal Component Analysis. We also utilize the Grid Search technique to determine the feature subset that would yield the highest accuracy. The third step encompasses SVM and Random Forest, that are widely recognized classifiers. Additionally, we propose an ensemble voting classifier. Our study compares the performance of these classifiers on three distinct corpora—RAVDESS, EMOVO, and SUBESCO with and without the feature selection strategy. The accuracy for RAVDESS EMOVO and SUBESCO dataset 74.30%, 79.66%, 87.64%, respectively. After comparing our proposed approach with other approaches mentioned in the literature survey, it became evident that our approach outperforms the rest.

Depression recognition using voice-based pre-training model

Article

Full-text available

Jun 2024

The early screening of depression is highly beneficial for patients to obtain better diagnosis and treatment. While the effectiveness of utilizing voice data for depression detection has been demonstrated, the issue of insufficient dataset size remains unresolved. Therefore, we propose an artificial intelligence method to effectively identify depression. The wav2vec 2.0 voice-based pre-training model was used as a feature extractor to automatically extract high-quality voice features from raw audio. Additionally, a small fine-tuning network was used as a classification model to output depression classification results. Subsequently, the proposed model was fine-tuned on the DAIC-WOZ dataset and achieved excellent classification results. Notably, the model demonstrated outstanding performance in binary classification, attaining an accuracy of 0.9649 and an RMSE of 0.1875 on the test set. Similarly, impressive results were obtained in multi-classification, with an accuracy of 0.9481 and an RMSE of 0.3810. The wav2vec 2.0 model was first used for depression recognition and showed strong generalization ability. The method is simple, practical, and applicable, which can assist doctors in the early screening of depression.

Comparative Performance Analysis of Metaheuristic Feature Selection Methods for Speech Emotion Recognition

Article

Full-text available

Apr 2024

Emotion recognition systems from speech signals are realized with the help of acoustic or spectral features. Acoustic analysis is the extraction of digital features from speech files using digital signal processing methods. Another method is the analysis of time-frequency images of speech using image processing. The size of the features obtained by acoustic analysis is in the thousands. Therefore, classification complexity increases and causes variation in classification accuracy. In feature selection, features unrelated to emotions are extracted from the feature space and are expected to contribute to the classifier performance. Traditional feature selection methods are mostly based on statistical analysis. Another feature selection method is the use of metaheuristic algorithms to detect and remove irrelevant features from the feature set. In this study, we compare the performance of metaheuristic feature selection algorithms for speech emotion recognition. For this purpose, a comparative analysis was performed on four different datasets, eight metaheuristics and three different classifiers. The results of the analysis show that the classification accuracy increases when the feature size is reduced. For all datasets, the highest accuracy was achieved with the support vector machine. The highest accuracy for the EMO-DB, EMOVA, eNTERFACE’05 and SAVEE datasets is 88.1%, 73.8%, 73.3% and 75.7%, respectively.

Rapid detection of loss on ignition for unburned carbon powder in fly ash triboelectric separation based on image recognition and machine learning

Article

Apr 2024
ADV POWDER TECHNOL

Speech Emotion Recognition for Electricity Customer Service Based on CBGRU and Multihead Self-Attention Mechanism

Conference Paper

Dec 2023

Optimised genetic algorithm-extreme learning machine approach for automatic COVID-19 detection

Article

Full-text available

Dec 2020
PLOS ONE

The coronavirus disease (COVID-19), is an ongoing global pandemic caused by severe acute respiratory syndrome. Chest Computed Tomography (CT) is an effective method for detecting lung illnesses, including COVID-19. However, the CT scan is expensive and time-consuming. Therefore, this work focus on detecting COVID-19 using chest X-ray images because it is widely available, faster, and cheaper than CT scan. Many machine learning approaches such as Deep Learning, Neural Network, and Support Vector Machine; have used X-ray for detecting the COVID-19. Although the performance of those approaches is acceptable in terms of accuracy, however, they require high computational time and more memory space. Therefore, this work employs an Optimised Genetic Algorithm-Extreme Learning Machine (OGA-ELM) with three selection criteria (i.e., random, K-tournament, and roulette wheel) to detect COVID-19 using X-ray images. The most crucial strength factors of the Extreme Learning Machine (ELM) are: (i) high capability of the ELM in avoiding overfit-ting; (ii) its usability on binary and multi-type classifiers; and (iii) ELM could work as a kernel-based support vector machine with a structure of a neural network. These advantages make the ELM efficient in achieving an excellent learning performance. ELMs have successfully been applied in many domains, including medical domains such as breast cancer detection, pathological brain detection, and ductal carcinoma in situ detection, but not yet tested on detecting COVID-19. Hence, this work aims to identify the effectiveness of employing OGA-ELM in detecting COVID-19 using chest X-ray images. In order to reduce the dimensionality of a histogram oriented gradient features, we use principal component analysis. The performance of OGA-ELM is evaluated on a benchmark dataset containing 188 chest X-ray images with two classes: a healthy and a COVID-19 infected. 
The experimental result shows that the OGA-ELM achieves 100.00% accuracy with fast computation time. This demonstrates that OGA ELM is an efficient method for COVID-19 detecting using chest X-ray images.

Voice Pathology Detection Using Machine Learning Technique

Conference Paper

Full-text available

Dec 2020

Recent proposed researches have witnessed that voice pathology detection systems can effectively contribute to the voice disorders assessment and provide early detection of voice pathologies. These systems used machine learning techniques which are considered as very promising tools in the detection of voice pathologies. However, most proposed systems in the detection of voice disorder utilized limited database. Furthermore, low accuracy rate is still the one of the most challenging issues for these techniques. This paper presents a voice pathology detection system using Online Sequential Extreme Learning Machine (OSELM) to classify the voice signal into healthy or pathological. In this work, the voice features are extracted by using Mel-Frequency Cepstral Coefficient (MFCC). The voice samples for the vowel /a/ were collected equally from Saarbrücken voice database (SVD). The proposed method is evaluated by three widely used measurements which are accuracy, sensitivity and specificity. The obtained results show that the maximum accuracy, sensitivity and specificity are 85%, 87% and 87%, respectively. According to the experimental results, the performance of OSELM algorithm is able to differentiate healthy and pathological voices effectively.

Genetic Algorithm Based on Natural Selection Theory for Optimization Problems

Article

Oct 2020

The genetic algorithm (GA) is a metaheuristic based on the natural selection process and falls under the umbrella category of evolutionary algorithms (EAs). Genetic algorithms are typically used to generate high-quality solutions to search and optimization problems by relying on bio-inspired operators such as selection, crossover, and mutation. However, the GA still suffers from some drawbacks and needs to be improved to attain greater control of exploitation and exploration, both in creating a new population and in the randomness involved in initializing the population of solutions. Furthermore, mutation is imposed on the new chromosomes, which can prevent the achievement of an optimal solution. Therefore, this study presents a new GA that is centered on natural selection theory and aims to improve the control of exploitation and exploration. The proposed algorithm is called the genetic algorithm based on natural selection theory (GABONST). Two assessments of GABONST are carried out: (i) applying it to fifteen well-known benchmark test functions and comparing the results with the conventional GA, enhanced ameliorated teaching learning-based optimization (EATLBO), and the Bat and Bee algorithms; and (ii) applying it to language identification (LID) by integrating GABONST with the extreme learning machine (ELM), named GABONST-ELM. The ELM is considered one of the most useful learning models for classification and regression analysis. The results are generated on an LID dataset derived from eight separate languages. In terms of the statistical assessment, GABONST produces good-quality solutions and has better control of exploitation and exploration than the conventional GA, EATLBO, Bat, and Bee algorithms. Additionally, the results indicate that GABONST-ELM LID performs effectively, with accuracy reaching up to 99.38%.
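
To make the selection, crossover, and mutation operators concrete, here is a minimal sketch of a conventional GA, the baseline this abstract compares against. It is not GABONST. The toy OneMax fitness, truncation selection, one-point crossover, bit-flip mutation, elitism, and all parameter values are assumptions chosen only to keep the example short.

```python
import random


def one_max(bits):
    """Toy fitness: the number of 1-bits in the chromosome."""
    return sum(bits)


def evolve(fitness, n_bits=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]   # selection: keep the better half
        children = [population[0][:]]           # elitism: carry the best over unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)    # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history
```

Because the elite individual is copied unchanged, the best fitness in `history` never decreases. The mutation rate is the knob that trades exploration against exploitation in this sketch.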

Emotion Classification in Parkinson's Disease EEG using RQA and ELM

Conference Paper

Feb 2020

A Survey of Voice Pathology Surveillance Systems Based on Internet of Things and Machine Learning Algorithms

Article

Apr 2020

Integrating cloud technology with the Internet of Things (IoT) is important for obtaining better performance in a seamless, continuous, and ubiquitous framework. IoT has many applications in the healthcare sector; one of these is voice pathology monitoring. Unfortunately, voice pathology has not gained much attention, although there is an urgent need in this area due to the shortage of research on and diagnosis of lethal diseases. Most researchers working on voice pathology focus only on differentiating whether a voice is normal (healthy) or pathological, and current studies on detecting a specific disease, such as laryngeal cancer, are lacking. In this paper, we present an extensive review of the state-of-the-art techniques and studies of IoT frameworks and machine learning algorithms used in healthcare in general and in voice pathology surveillance systems in particular. Furthermore, this paper presents applications, challenges, and key issues of both IoT and machine learning algorithms in healthcare. Finally, this paper highlights some open issues of IoT in healthcare that warrant further research and investigation in order to provide easy, comfortable, and effective diagnosis and treatment of disease for both patients and doctors.

Spoken Language Identification Based on Particle Swarm Optimisation–Extreme Learning Machine Approach

Article

Sep 2020
CIRC SYST SIGNAL PR

The determination and classification of natural language based on specified content and data set involves a process known as spoken language identification (LID). To initiate the process, useful features of the given data must first be extracted; this is a mature process in which the standard LID features have been developed using MFCC, SDC, GMM and the i-vector-based framework. Nevertheless, optimisation of the learning process is still required to capture the knowledge embedded in the extracted features comprehensively. A single-hidden-layer neural network can be trained using the extreme learning machine (ELM), which is an effective learning model for classification and regression analysis. However, the learning process of this model is not fully optimised because the weights between the input and hidden layers are selected randomly. This study employs ELM as the LID learning model, based on the extraction of the standard features. The enhanced self-adjusting extreme learning machine (ESA–ELM), one of the ELM’s optimisation techniques, is chosen as the benchmark and is improved by adopting particle swarm optimisation (PSO) in place of EATLBO to achieve higher performance. The improved ESA–ELM is named particle swarm optimisation–extreme learning machine (PSO–ELM). The results are based on LID with the same benchmark data set derived from eight languages, and indicate the superior performance of PSO–ELM LID, with an accuracy of 98.75%, compared with ESA–ELM LID, which achieved only 96.25%.

Comparison of Searching Behaviour of Three Evolutionary Algorithms Applied to Water Distribution System Design Optimization

Article

Mar 2020

Over the past few decades, various evolutionary algorithms (EAs) have been applied to the optimization design of water distribution systems (WDSs). An important research area is to compare the performance of these EAs, thereby offering guidance for the selection of appropriate EAs for practical implementations. Such comparisons are mainly based on the final solution statistics and, hence, are unable to provide knowledge on how different EAs reach the final optimal solutions and why different EAs perform differently in identifying optimal solutions. To this end, this paper aims to compare the real-time searching behaviour of three widely used EAs: genetic algorithms (GAs), the differential evolution (DE) algorithm, and ant colony optimization (ACO). These three EAs are applied to five WDS benchmarking case studies with different scales and complexities, and a set of five metrics is used to measure their run-time searching quality and convergence properties. Results show that the run-time metrics can effectively reveal the underlying searching mechanisms associated with each EA, which goes significantly beyond the knowledge gained from traditional end-of-run solution statistics. It is observed that the DE is able to identify better solutions if moderate or large computational budgets are allowed, owing to its ability to maintain the balance between exploration and exploitation. However, if the computational resources are rather limited or the decision has to be made in a very short time (e.g., real-time WDS operation), the GA can be a good choice, as it can always identify better solutions than the DE and ACO at the early searching stages. Based on the results, the ACO performs the worst for the five case studies considered. This study offers guidance for algorithm selection based on the available computational resources, as well as insight into the EAs’ underlying searching behaviours.

Extreme Learning Machine: Theory and Applications

Article

Dec 2006
NEUROCOMPUTING

It is clear that the learning speed of feedforward neural networks is in general far slower than required and it has been a major bottleneck in their applications for past decades. Two key reasons may be: (1) the slow gradient-based learning algorithms are extensively used to train neural networks, and (2) all the parameters of the networks are tuned iteratively by using such learning algorithms. Unlike these conventional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden layer feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed. The experimental results based on a few artificial and real benchmark function approximation and classification problems including very large complex applications show that the new algorithm can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feedforward neural networks.
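
The two steps described in this abstract, randomly chosen hidden nodes followed by an analytic solution for the output weights, can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the paper's code. ELM output weights are usually computed with the Moore-Penrose pseudoinverse of the hidden-layer matrix; the sketch swaps in ridge-regularised normal equations so that a small linear solver is enough. The names `solve`, `train_elm`, and `predict_elm`, the sigmoid activation, the uniform weight range, and the `ridge` value are all assumptions.

```python
import math
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]  # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hidden_layer(X, W, bias):
    """Sigmoid activations H[i][j] = g(w_j . x_i + b_j)."""
    return [[1.0 / (1.0 + math.exp(-(sum(wk * xk for wk, xk in zip(w, x)) + b)))
             for w, b in zip(W, bias)] for x in X]


def train_elm(X, labels, n_classes, n_hidden=20, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    # Step 1: hidden weights and biases are drawn at random and never tuned.
    W = [[rng.uniform(-1, 1) for _ in X[0]] for _ in range(n_hidden)]
    bias = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    H = hidden_layer(X, W, bias)
    T = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    # Step 2: output weights in closed form, (H^T H + ridge * I) beta = H^T T.
    HtH = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    HtT = [[sum(h[i] * t[c] for h, t in zip(H, T)) for c in range(n_classes)]
           for i in range(n_hidden)]
    return W, bias, solve(HtH, HtT)


def predict_elm(model, X):
    W, bias, beta = model
    H = hidden_layer(X, W, bias)
    scores = [[sum(h[i] * beta[i][c] for i in range(len(beta)))
               for c in range(len(beta[0]))] for h in H]
    # The predicted class is the output node with the largest score.
    return [max(range(len(s)), key=s.__getitem__) for s in scores]
```

Training involves no iterative gradient updates: the only fitted parameters are the output weights, obtained from a single linear solve. This is where the speed advantage claimed in the abstract comes from.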

A Review on Emotion Detection and Classification using Speech

Article

Jan 2020

Statistical Feature Set Calculation using Teager Energy Operator on Emotional Speech Signals

Conference Paper

Mar 2019
