Article
Speech Emotion Recognition Based on Modified ReliefF
Guo-Min Li, Na Liu * and Jun-Ao Zhang
College of Communication and Information Engineering, Xi’an University of Science and Technology,
Xi'an 710600, China
* Correspondence: ly092022@stu.xust.edu.cn
Abstract: As the key of human–computer natural interaction, the research of emotion recognition is of great significance to the development of computer intelligence. In view of the issue that the current emotional feature dimension is too high, which affects the classification performance, this paper proposes a modified ReliefF feature selection algorithm to screen out feature subsets with smaller dimensions and better performance from high-dimensional features to further improve the efficiency and accuracy of emotion recognition. In the modified algorithm, the selection range of random samples is adjusted; the correlation between features is measured by the maximum information coefficient, and the distance measurement method between samples is established based on the correlation. The experimental results on the eNTERFACE'05 and SAVEE speech emotional datasets show that the features filtered based on the modified algorithm significantly reduce the data dimensions and effectively improve the accuracy of emotion recognition.
Keywords: emotion recognition; feature selection; modified ReliefF; maximum information coefficient
1. Introduction
Affective computing enables computers to better recognize and express emotions. As the main research direction in the field of affective computing, emotion recognition is widely used in many fields, such as intelligent medicine [1], remote education, and human-computer interaction [12]. Speech, as a major form of expressing emotion, contains rich emotional information. Therefore, speech emotion recognition (SER) has been a research focus in affective computing. SER refers to the technology of extracting emotional features from speech signals through computer processing to judge the type of human emotion, and it includes preprocessing, feature extraction, and emotion classification. The focus of SER is to select and extract suitable features, and the quality of the features determines the final accuracy of emotion recognition.
The features commonly used in SER mainly include sound quality features, spectral features, prosodic features, and the corresponding statistical characteristics, such as the maximum, average, range, variance, etc. [3]. Prosodic features [4] describe the variation of speech, mainly including pitch frequency, speech energy, duration, etc. Spectral features describe the association between vocal movement and vocal channel change, mainly including cepstral features (such as the Mel frequency cepstral coefficients, MFCC [5,6]) and linear spectral features (such as the linear prediction coefficients, LPC [7]). Sound quality features [8] reflect the vibration properties of sound and describe the clarity and identifiability of speech, including bandwidth, formant frequency, etc. In [9], the prosodic parameters of four types of emotions, such as anger, sadness, happiness, and boredom, in an emotional database were studied and analyzed. In [10,11], prosodic features, such as energy, formants, and pitch, were extracted for speech emotion recognition. In [12], Fourier parameter features were proposed from emotional speech signals for SER. In [13,14], the emotion
recognition rate was improved by concatenating Mel frequency cepstral coefficients with other feature sets, including energy, formant, pitch, and bandwidth. In [15], the Mel frequency magnitude coefficient was extracted from speech signals, and a multiclass SVM was used as the classifier to classify the emotions.
To date, scholars have proposed many effective emotional features, but these features often have high dimensions with a large amount of redundancy. Using high-dimensional features directly for emotion analysis prolongs model training and degrades recognition performance. Therefore, it is necessary to select features to improve the efficiency and effectiveness of the model.
Feature selection refers to screening out a subset from an existing feature set. The subset meets certain criteria while retaining the classification ability of the original features as much as possible, removing irrelevant features, reducing the data dimension, and improving model efficiency [16]. Feature selection methods are divided into wrapper [17] and filter [18] methods. The wrapper approach directly uses classification performance as the evaluation criterion of feature importance, and the subset selected by this strategy is eventually used to construct the classification model. The filter approach mainly uses distance, dependence, and other measurement criteria to calculate the information implicit in the features and assigns each feature a corresponding weight according to the calculation results. According to the weights, the important features under the criterion can be selected. This approach obtains the implicit information of the features directly by mathematical calculation without involving a classifier, so it has high computational efficiency and can quickly eliminate non-critical features and remove noisy features from the data.
In the process of speech emotion recognition, principal component analysis (PCA) was used for filter feature selection to eliminate irrelevant features and improve the classification accuracy in [19,20]. The feature selection method based on maximal relevance and minimal redundancy (MRMR) was used to evaluate the emotional features in [21], which ensures the accuracy of emotional classification and effectively optimizes the feature set. The feature selection method based on CFS was used to evaluate the features and select the feature subset with high correlation with the category in [22]; it performs well on multiple emotional datasets. In [23,24], the ReliefF algorithm was used to screen speech emotional features, which effectively reduced the feature dimension while ensuring the recognition rate.
The above feature selection methods have their own advantages. In contrast, the ReliefF algorithm has the characteristics of high efficiency and high precision. It can assign corresponding weights to the features according to how well each feature discriminates between categories for a limited set of samples. Therefore, many scholars have carried out related research combining the ReliefF algorithm with specific problems. For example, when studying the feature selection problem of hand gesture recognition, in [25], the Minkowski distance was used to replace the Euclidean distance to improve the selection of nearest neighbor samples in the ReliefF algorithm. In [26], the maximal information coefficient was used to replace the Euclidean distance to select the nearest neighbor samples, and the improved ReliefF algorithm was combined with a wrapper algorithm to automatically find the optimal feature subset. In [27], the features were sorted according to the classification performance of each feature, and the features with better performance were selected by setting a threshold; then, the ReliefF algorithm was used to perform a secondary screening of the features to achieve dimensionality reduction.
In conclusion, existing research combines the ReliefF algorithm with other methods, thus expanding its application scope and solving the feature selection problem of specific scenarios. However, the ReliefF algorithm itself has defects, such as instability caused by the randomness of the selected samples and redundancy among attributes. Therefore, in this paper, a modified ReliefF algorithm is proposed and applied to speech emotion recognition. The purpose is to select the optimal feature subset from the high-dimensional speech emotional features, reduce the feature dimensions, and improve the efficiency and accuracy of emotion recognition. The modified algorithm updates the selection range of random samples and the distance measurement between attributes. The block diagram of the speech emotion recognition system is shown in Figure 1.
Figure 1. Block diagram of the speech emotion recognition system.
2. Feature Extraction
2.1. Preprocessing
After the speech signal is digitized, it needs to be preprocessed to improve the quality
of the speech. Speech is a non-stationary signal, but it can be regarded as a stationary
signal in a small time period [28]. In order to obtain a short-term stable speech signal, it
needs to be divided into frames, and there is a part of overlap between adjacent frames,
which is called the frame shift. Multiplying the speech signal s(n) by a window function w(n) gives the framed speech:

$$s_w(n) = s(n)\,w(n), \qquad (1)$$
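As an illustration only (not code from the paper), the framing and windowing of (1) can be sketched in Python; the 400-sample frame length, 160-sample frame shift, and Hamming window are assumed values, not settings reported by the authors.

```python
import numpy as np

def frame_signal(s, frame_len=400, frame_shift=160, window="hamming"):
    """Split a speech signal into overlapping frames and apply a window, as in Eq. (1)."""
    # Choose the window function w(n); a Hamming window is a common default.
    w = np.hamming(frame_len) if window == "hamming" else np.ones(frame_len)
    n_frames = 1 + max(0, (len(s) - frame_len) // frame_shift)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_shift
        # s_w(n) = s(n) * w(n) within each frame
        frames[i] = s[start:start + frame_len] * w
    return frames
```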
2.2. Short Energy
Short energy, also called frame energy, is closely related to the human emotional state. When people are emotionally excited, their speech contains more energy; when people are depressed, their speech contains less energy. Suppose the ith frame of the speech signal is x_i(m) and the frame length is N; its short energy is:

$$E_i = \sum_{m=1}^{N} x_i^2(m), \qquad (2)$$
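A minimal sketch of the per-frame energy of (2), assuming frames produced by a framing routine such as the one above:

```python
import numpy as np

def short_time_energy(frames):
    """Short (frame) energy E_i = sum_m x_i(m)^2, as in Eq. (2).
    `frames` is an (n_frames, N) array of windowed frames."""
    return np.sum(frames ** 2, axis=1)
```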
2.3. Pitch Frequency
Pitch frequency is an influential feature parameter in SER, which represents the fundamental frequency of vocal cord vibration during vocalization. When a person is in a calm state, the pitch is relatively stable. When a person is in a happy or angry state, the pitch frequency becomes higher, and when a person is in a low mood, the pitch frequency correspondingly becomes lower. Usually, the autocorrelation function method is used to estimate the pitch frequency. Suppose the ith frame of speech is x_i(m) and the frame length is N; its short-time autocorrelation function is:

$$R_i(k) = \sum_{m=1}^{N-k} x_i(m)\, x_i(m+k), \qquad (3)$$

where k represents the time delay. If the speech signal is periodic, its autocorrelation function is also periodic, with the same period as the speech. At integer multiples of the period, the autocorrelation function has a maximum value, so the pitch period is estimated accordingly, and the inverse of the pitch period is the pitch frequency.
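The autocorrelation-based pitch estimate can be sketched as follows; the 60–500 Hz search range is an assumed typical value for speech, not a parameter taken from the paper.

```python
import numpy as np

def pitch_autocorr(frame, fs, f_min=60.0, f_max=500.0):
    """Estimate the pitch frequency of one frame from its short-time
    autocorrelation R(k) = sum_m x(m) x(m+k), as in Eq. (3)."""
    n = len(frame)
    # Autocorrelation for non-negative lags k = 0 .. n-1
    r = np.correlate(frame, frame, mode="full")[n - 1:]
    # Restrict the lag search to a plausible pitch range
    k_min = int(fs / f_max)
    k_max = min(int(fs / f_min), n - 1)
    if k_max <= k_min or r[0] <= 0:
        return 0.0                      # treat as unvoiced
    k_peak = k_min + np.argmax(r[k_min:k_max])
    return fs / k_peak                  # pitch period -> pitch frequency
```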
2.4. Formant
Formant reflects the physical characteristics of the vocal tract during vocalization.
Different emotional speech causes different changes in the vocal tract, and the position of the formant frequency changes accordingly. The linear prediction method is usually used to estimate the formant parameters, and the transfer function of the vocal tract is expressed as:

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \qquad (4)$$

where a_k represents the linear prediction coefficients and p represents the model order. Suppose z_k = r_k e^{j\theta_k} is a root of A(z); then the corresponding formant frequency is:

$$F_k = \frac{\theta_k}{2\pi T}, \qquad (5)$$

where T is the sampling period.
2.5. Fbank and MFCC
Fbank and MFCC are feature sets established by imitating the human auditory system. The human ear's perception of frequency is not linear: at low frequencies, the perception of sound is proportional to the frequency of the sound, but as the frequency increases, the perception has a nonlinear relationship with frequency. On this basis, the Mel frequency is introduced:

$$f_{Mel} = 2595\,\lg\!\left(1 + \frac{f}{700}\right), \qquad (6)$$

where f_Mel denotes the perceived frequency in Mel and f denotes the real frequency in Hz. Applying the discrete cosine transform to the Fbank features yields the Mel frequency cepstral coefficients. Both MFCC and Fbank coefficients are commonly used feature parameters in the field of emotion recognition. The extraction process is shown in Figure 2.
Figure 2. Extraction process of MFCC and Fbank.
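The Mel mapping of (6) and the Fbank/MFCC pipeline of Figure 2 can be sketched as follows; the use of librosa and the filter-bank size of 26 are assumptions for illustration, while the 13 MFCCs match the feature set used in Section 4.

```python
import numpy as np
import librosa  # assumed available

def hz_to_mel(f):
    """Mel scale of Eq. (6): f_Mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def fbank_and_mfcc(y, fs, n_mels=26, n_mfcc=13):
    """Log Mel filter-bank energies (Fbank) and their DCT (MFCC)."""
    mel_spec = librosa.feature.melspectrogram(y=y, sr=fs, n_mels=n_mels)
    fbank = np.log(mel_spec + 1e-10)                      # Fbank features
    mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=n_mfcc, n_mels=n_mels)
    return fbank, mfcc
```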
3. Feature Selection
3.1. ReliefF
The ReliefF algorithm was proposed by Kononenko to overcome the limitation that the Relief algorithm can only handle two-class problems [29]. The main idea is that the smaller the distance between samples of the same category and the larger the distance between samples of different categories, the more obvious the feature's effect on classification and the greater its weight. Conversely, the larger the distance between samples of the same category and the smaller the distance between samples of different categories, the weaker the feature's effect on classification and the smaller its weight. The steps of the ReliefF algorithm are:
(1) Initialize the weight vector w and the number of sampling times m;
(2) Select a sample R randomly, and find its k nearest neighbors of the same class and k nearest neighbors of each different class, respectively. The distance between R and each neighbor X_i on feature f_r is calculated as in (7):

$$diff(f_r, R, X_i) = \frac{|R(f_r) - X_i(f_r)|}{\max(f_r) - \min(f_r)}, \qquad (7)$$
(3) Update the weight of feature f_r:

$$w_{f_r} = w_{f_r} - \sum_{j=1}^{k}\frac{diff(f_r, R, H_j)}{mk} + \sum_{C \neq class(R)}\frac{P(C)}{1 - P(class(R))}\sum_{j=1}^{k}\frac{diff(f_r, R, M_j(C))}{mk}, \qquad (8)$$
where diff(f_r, R, H_j) represents the distance difference on feature f_r between R and its jth neighbor H_j (j = 1, 2, ..., k) of the same category, diff(f_r, R, M_j(C)) represents the distance difference on feature f_r between R and its jth neighbor M_j(C) (j = 1, 2, ..., k) of a different category C, P(C) is the proportion of the samples of category C among all samples, and P(class(R)) is the proportion of the category to which sample R belongs.
(4) Repeat the above steps m times, and the weight is averaged to obtain the final weight
vector w.
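For reference, a compact sketch of the standard ReliefF update of (7) and (8); features are assumed numeric, the neighbor search uses summed per-feature differences, and this is an illustrative reading rather than the authors' code.

```python
import numpy as np

def relieff(X, y, m=60, k=30, rng=None):
    """ReliefF feature weights following Eqs. (7)-(8).
    X: (n_samples, n_features) feature matrix, y: class labels."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalizer of Eq. (7)
    classes, counts = np.unique(y, return_counts=True)
    priors = dict(zip(classes, counts / n))
    w = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        R = X[i]
        diff_all = np.abs(X - R) / span             # Eq. (7) for every sample
        dist = diff_all.sum(axis=1)                 # used only to rank neighbors
        for c in classes:
            idx = np.where(y == c)[0]
            idx = idx[idx != i]                      # exclude R itself
            nearest = idx[np.argsort(dist[idx])[:k]]
            contrib = diff_all[nearest].sum(axis=0) / (m * k)
            if c == y[i]:
                w -= contrib                         # same-class neighbors H_j
            else:                                    # different-class neighbors M_j(C)
                w += priors[c] / (1.0 - priors[y[i]]) * contrib
    return w
```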
3.2. Modified ReliefF
To ensure the stability of the feature selection algorithm, samples are drawn evenly from each category, and the sampling range is restricted to the first G samples with the smallest Euclidean distance to the center of the corresponding category. In addition, when the weight is updated, the maximal information coefficient (MIC) is used to measure the correlation between features, and the distance measurement method between sample features is established based on it.
MIC is a statistical method used to measure the degree of dependence between variables [30]. Its essence is normalized mutual information, which has higher accuracy and universality. The mutual information of variables x and y is expressed as:

$$I(x; y) = \sum_{x}\sum_{y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)}, \qquad (9)$$

where p(x, y) represents the joint probability of x and y, and p(x) and p(y) represent their probability densities, respectively. Then, the maximum information coefficient is:

$$MIC(x; y) = \max \frac{I(x; y)}{\log_2 \min(a, b)}, \qquad (10)$$
where a and b represent the number of grid cells into which the x and y axes of the scatter plot of vectors x and y are divided, with a × b < M^{0.6} (M is the number of samples). The size of the MIC value reflects the degree of correlation between features. The maximum information coefficient MIC(f_r, f_n) between the rth-dimension feature f_r and the nth-dimension feature f_n is denoted s_{rn}; then, the correlation coefficient matrix between features is expressed as:

$$s = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1N} \\ s_{21} & s_{22} & \cdots & s_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ s_{N1} & s_{N2} & \cdots & s_{NN} \end{bmatrix}, \qquad (11)$$
where N represents the total feature dimension. The distance measure between samples X_i and X_j on feature f_r is defined as:

$$dist(f_r, X_i, X_j) = diff(f_r, X_i, X_j) + \frac{1}{N-1}\sum_{n=1,\,n \neq r}^{N} s_{rn}\, diff(f_n, X_i, X_j), \qquad (12)$$
Then, the distance between X_i and X_j is:

$$dist(X_i, X_j) = \frac{1}{N}\sum_{r=1}^{N} dist(f_r, X_i, X_j), \qquad (13)$$
The specific process of the modified ReliefF algorithm is as follows:
(1) Calculate the sample center of category l, and sort all samples in this category according to their distance to the category center;
(2) Randomly select a sample R from the first G samples closest to the center of the category, and repeat this m times;
(3) For the current sample, find k neighbor samples of the same category and k neighbor samples of each different category, and calculate the distances between samples;
(4) Update the weight according to the ratio of the distance to the heterogeneous neighbors to the distance to the similar neighbors, so that a feature with a large heterogeneous distance and a small homogeneous distance is assigned a larger weight, and vice versa a smaller weight:

$$w_{f_r} = \frac{D^{M}_{f_r}}{D^{H}_{f_r}}, \qquad (14)$$
where D^M_{f_r} denotes the mean distance between sample R and the heterogeneous neighbors on feature f_r, and D^H_{f_r} denotes the mean distance between sample R and the similar neighbors on feature f_r:

$$D^{H}_{f_r} = \sum_{j=1}^{k}\frac{dist(f_r, R, H_j)}{mk}, \qquad (15)$$

$$D^{M}_{f_r} = \sum_{C \neq class(R)}\frac{P(C)}{1 - P(class(R))}\sum_{j=1}^{k}\frac{dist(f_r, R, M_j(C))}{mk}, \qquad (16)$$
(5) Repeat the above process for all L categories and calculate the mean of the feature weights:

$$w_{f_r} = \frac{1}{L}\sum_{l=1}^{L} w^{l}_{f_r}, \qquad (17)$$
After the feature weights are obtained, the features are sorted in descending order according to the weights to obtain a feature set F_IR.
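Putting steps (1)–(5) together, one illustrative implementation might look like the sketch below; m = 10 and k = 30 match the settings reported in Section 4, while G = 50 is an assumed value.

```python
import numpy as np

def modified_relieff(X, y, s, m=10, k=30, G=50):
    """Modified ReliefF sketch following Eqs. (12)-(17).
    s is the MIC correlation matrix; G limits sampling to the G samples
    nearest each class center."""
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    classes, counts = np.unique(y, return_counts=True)
    priors = dict(zip(classes, counts / n))
    w = np.zeros(d)
    for l in classes:
        idx_l = np.where(y == l)[0]
        center = X[idx_l].mean(axis=0)
        # sampling range: the G samples closest to the class center
        pool = idx_l[np.argsort(np.linalg.norm(X[idx_l] - center, axis=1))[:G]]
        D_H = np.zeros(d)
        D_M = np.zeros(d)
        for i in np.random.choice(pool, size=m, replace=True):
            R = X[i]
            diff_all = np.abs(X - R) / span
            # Eq. (12): correlation-weighted per-feature distances
            dist_all = diff_all + (diff_all @ s - diff_all) / (d - 1)
            total = dist_all.mean(axis=1)                               # Eq. (13)
            for c in classes:
                idx_c = np.where(y == c)[0]
                idx_c = idx_c[idx_c != i]
                nearest = idx_c[np.argsort(total[idx_c])[:k]]
                contrib = dist_all[nearest].sum(axis=0) / (m * k)
                if c == y[i]:
                    D_H += contrib                                      # Eq. (15)
                else:
                    D_M += priors[c] / (1 - priors[y[i]]) * contrib     # Eq. (16)
        w += D_M / (D_H + 1e-12)                                        # Eq. (14)
    return w / len(classes)                                             # Eq. (17)
```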
The modified ReliefF algorithm considers the discrimination of different features for the categories and the correlation between features in limited samples. In addition, the classification performance of each individual feature can be used directly as a weight to sort the features; the resulting feature set F_R is called the performance-related features here. The features with better performance can be screened by setting a threshold. The two feature sets are fused, and the fusion features are selected in combination with the model classification results to obtain a feature vector that can fully express the emotional state. The fusion features are expressed as in (18):

$$F_F = W_R F_R + W_{IR} F_{IR}, \qquad (18)$$

where W_R represents the proportion of the features reordered by classification performance in the fusion features, and W_{IR} represents the proportion of the features reordered based on the modified ReliefF weights.
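Equation (18) describes the fusion at the level of feature orderings; one possible reading (an interpretation, since the paper gives no pseudocode) is to min–max normalize the two per-feature scores and combine them with the proportions W_R and W_IR before re-ranking:

```python
import numpy as np

def fuse_rankings(perf_scores, relieff_weights, w_r=2, w_ir=8):
    """Illustrative fusion of Eq. (18): combine normalized per-feature
    classification scores (F_R) and modified-ReliefF weights (F_IR)
    with proportions W_R and W_IR, then rank features by the fused score."""
    def norm(v):
        return (v - v.min()) / (v.max() - v.min() + 1e-12)
    fused = w_r * norm(perf_scores) + w_ir * norm(relieff_weights)
    return np.argsort(fused)[::-1]     # feature indices, best first
```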
4. Experiment and Results Analysis
The proposed method is validated on the eNTERFACE'05 dataset [31] and the SAVEE dataset [32]. The eNTERFACE'05 dataset was recorded by 42 subjects, with a total of 1287 audio files; the audio sampling frequency is 48 kHz, and the average duration is about 3 s. It includes six fundamental emotions: anger, disgust, fear, happiness, sadness, and surprise. The SAVEE dataset was obtained by recording 120 emotional utterances from each of four subjects, with a sampling rate of 44.1 kHz. It includes seven types of emotions: anger, fear, joy, sadness, disgust, surprise, and neutrality, with a total of 480 speech files. From each dataset, 80% of each emotion was selected as training data and the remaining 20% as testing data.
Emotional features were extracted, including energy, the first formant, pitch frequency, 13-order MFCC, delta and delta-delta MFCC, Fbank coefficients, and their statistical features (maximum, mean, variance, skewness, kurtosis, etc.), for a total of 235 dimensions. Support vector machine (SVM) and random forest (RF) classifiers were used for emotion recognition.
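An illustrative evaluation pipeline with scikit-learn for scoring a ranked feature subset of a given dimension; the classifier hyperparameters shown are assumed defaults, not values reported in the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_subset(X, y, ranking, dim, seed=0):
    """Train SVM and RF on the top-`dim` ranked features and report test accuracy.
    X: utterance-level feature matrix (e.g., 235-dim), y: emotion labels."""
    cols = ranking[:dim]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, cols], y, test_size=0.2, stratify=y, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    accs = {}
    for name, clf in [("SVM", SVC(kernel="rbf")),
                      ("RF", RandomForestClassifier(n_estimators=100, random_state=seed))]:
        clf.fit(X_tr, y_tr)
        accs[name] = clf.score(X_te, y_te)
    return accs
```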
Take the features of the eNTERFACE'05 dataset as an example to illustrate the necessity of feature selection. The emotion recognition accuracy of the original, i.e., unsorted, features under different dimensions is shown in Figure 3. Here, for ease of observation, the feature dimension is sampled at intervals of 10; that is, the value range is 1:10:235.
Figure 3. Classification accuracy of original features with feature dimension.
It can be seen that as the dimension increases, the amount of information contained in the feature set increases too, so the recognition accuracy curve generally shows an upward trend, but the curve is not monotonically increasing. For example, when the feature dimension increases from 70 to 90, the recognition rate correspondingly increases from 52% to 56%, while the features with dimensions between 90 and 100 lead to a 2% reduction in the recognition rate, and the recognition rate is basically stable between dimensions 160 and 190. This suggests that not all features are beneficial to classification, and there may be adverse or irrelevant features. When screening features, favorable features should be retained as much as possible, and unfavorable and irrelevant features should be eliminated.
The feature weights are calculated by each feature selection algorithm, and the features are re-sorted according to the weights. The re-sorted features constitute feature subsets of different dimensions. The recognition rates of the following feature subsets are compared and analyzed: (1) original features; (2) features re-sorted with the PCA method; (3) features re-sorted with the ReliefF method; (4) features re-sorted with the MIC method; (5) features re-sorted with the MRMR method; (6) features re-sorted with the CFS method; and (7) fusion features based on the modified ReliefF method (Proposed). The ReliefF algorithm is repeated 60 times with 30 nearest neighbors; the modified ReliefF algorithm is repeated 10 times in each category, also with 30 nearest neighbors; the weight of the performance-related features in the fusion features is 2, and the weight of the modified ReliefF re-sorted features is 8.
The relationship between recognition accuracy and the dimension of the feature subsets was analyzed, as shown in Figures 4 and 5. It can be seen from the figures that the performance of the features varies across datasets and classifiers, but in general, as the feature dimension increases, the accuracy of emotion recognition increases accordingly. Unlike the recognition rate curve of the original features, once the recognition rate based on feature selection increases to a certain extent, it slowly declines or fluctuates within a certain range. The feature dimension with the highest accuracy is the dimension of the optimal feature subset. Among the methods, the fusion features based on the modified ReliefF algorithm perform better, with higher recognition accuracy and lower feature dimensions.
Table 1 shows the highest classification accuracy of the feature subsets selected by the various methods, and Table 2 shows the minimum feature dimensions required for each method to achieve its final recognition accuracy. It can be seen from the tables that the filtered features have better recognition performance, and the fusion features based on the modified algorithm perform best among all features. The average recognition rate on both datasets and the required feature dimensions improve to varying degrees compared with the original features. In particular, for the eNTERFACE'05 dataset, the fusion features reach the final recognition accuracy through SVM and RF when the feature dimension is only 8.47% of the total dimensions. For the SAVEE dataset, the fusion features achieve the final recognition accuracy at a dimension of 40 with SVM and 70 with RF, accounting for 16.95% and 29.66% of the total dimensions, respectively.
Figure 4. Accuracy comparison of feature subsets under different dimensions (SVM): (a) eNTERFACE'05 dataset; (b) SAVEE dataset.
Figure 5. Accuracy comparison of feature subsets under different dimensions (RF): (a) eNTERFACE'05 dataset; (b) SAVEE dataset.
Table 1. Highest recognition rate for each method (%).

Dataset        Features            SVM      RF
eNTERFACE'05   Original Features   75.87    76.26
               PCA                 78.59    63.81
               ReliefF             78.59    80.15
               MIC                 78.98    80.93
               MRMR                71.98    73.93
               CFS                 78.98    80.54
               Proposed            80.54    82.87
SAVEE          Original Features   71.87    77.08
               PCA                 71.87    67.70
               ReliefF             77.08    78.12
               MIC                 75.00    79.16
               MRMR                75.00    77.08
               CFS                 72.91    78.12
               Proposed            81.25    80.21
Table 2. Minimum feature dimensions required for each method.

Dataset        Features            SVM      RF
eNTERFACE'05   Original Features   236      236
               PCA                 80       236
               ReliefF             60       30
               MIC                 140      20
               MRMR                236      236
               CFS                 170      40
               Proposed            20       20
SAVEE          Original Features   236      236
               PCA                 100      236
               ReliefF             100      150
               MIC                 110      120
               MRMR                140      220
               CFS                 150      200
               Proposed            40       70
The recognition accuracy of the fusion features based on modified ReliefF for each type of emotion was analyzed and compared with the recognition accuracy of the original features. The results are shown in Figures 6 and 7. In general, the fusion features can distinguish each type of emotional state well, and in most cases, the fusion features perform better than the original features. It can be seen from Figure 6 that for the eNTERFACE'05 dataset, the recognition accuracy of the fusion features for "angry" and "surprise" reaches more than 90% through SVM, which is 4.65% and 6.97% higher than the original features, respectively. The best accuracy for the "surprise" state reaches 100% through RF. Moreover, the features selected by the modified method greatly improve the recognition performance for the "disgust" state, with an accuracy 18.61% higher than that of the original features.
From Figure 7, for the SAVEE dataset, the recognition accuracy of the fusion features for each type of emotion is better than that of the original features through SVM. Among them, the recognition accuracy for "disgust", "fear", "surprise", and "neutral" reaches more than 90%, while the recognition accuracy of the original features for these emotional categories is only about 80%. With the RF classifier, the fusion features effectively improve the recognition performance for "sadness" and "surprise", whose recognition accuracy is 16.70% and 16.66% higher than the original features, respectively.
Figure 6. Recognition results on eNTERFACE'05: (a) SVM classifier; (b) RF classifier.
Figure 7. Recognition results on SAVEE: (a) SVM classifier; (b) RF classifier.
5. Conclusions
The quality of emotional features determines the accuracy of emotion recognition.
The focus of this paper is to screen out the key features that are most discriminative for
emotions from high-dimensional features and remove irrelevant features, reducing the
model burden and improving recognition efficiency.
This paper puts forward a modified feature selection algorithm to choose optimal speech emotion features. SVM and RF classifiers are applied in the experimental analysis on the eNTERFACE'05 and SAVEE datasets. The results show that the fusion features based on the modified algorithm can effectively solve the problem of high feature dimension in speech emotion recognition and obtain better emotion classification results with fewer feature dimensions. On the eNTERFACE'05 dataset, the final recognition rate of the original features can be achieved by selecting as few as 20 of the 236 features, a 91.52% reduction in feature dimension; the best classification accuracy of the modified method through SVM, 80.54%, is 4.67% higher than the original features, while the best classification accuracy through RF, 82.87%, is 6.61% higher than the original features. On the SAVEE dataset, the final recognition rate of the original features can be achieved by selecting as few as 40 of the 236 features, an 83.05% reduction in feature dimension; the best classification accuracy of the modified method through SVM, 81.25%, is 9.38% higher than the original features, while the best classification accuracy through RF, 80.21%, is 3.13% higher than the original features.
At present, this paper mainly classifies emotions based on traditional emotional features. The next step is to study how to effectively integrate traditional features with deep features to further improve the effect of emotion recognition. In addition, this method can also be applied to feature selection problems in various fields, such as pattern recognition.
Author Contributions: Conceptualization, G.-M.L. and N.L.; methodology, N.L.; software, N.L.; validation, G.-M.L., N.L. and J.-A.Z.; formal analysis, G.-M.L.; investigation, N.L.; resources, N.L.; data curation, N.L.; writing–original draft preparation, N.L.; writing–review and editing, G.-M.L., N.L. and J.-A.Z.; visualization, N.L.; supervision, J.-A.Z.; project administration, J.-A.Z.; funding acquisition, G.-M.L. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by Shaanxi Science and Technology Plan, grant number
2021GY-338.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets (eNTERFACE05 and SAVEE datasets) used in this paper
are available at http://www.enterface.net/results/ (accessed on 1 September 2022) and
http://kahlan.eps.surrey.ac.uk/savee/Database.html (accessed on 1 September 2022).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Kallipolitis, A.; Galliakis, M.; Menychtas, A. Affective analysis of patients in homecare video-assisted telemedicine using computational intelligence. Neural Comput. Appl. 2020, 32, 17125–17136.
2. Chowdary, K.D.; Hemanth, D.J. Human emotion recognition using intelligent approaches: A review. Intell. Decis. Technol. 2020, 13, 417–433.
3. Wani, T.M.; Gunawan, T.S.; Qadri, S. A comprehensive review of speech emotion recognition systems. IEEE Access 2021, 9, 47795–47814.
4. Abdel-Hamid, L.; Shaker, N.H.; Emara, I. Analysis of linguistic and prosodic features of bilingual Arabic-English speakers for speech emotion recognition. IEEE Access 2020, 8, 72957–72970.
5. Lin, L.; Tan, L. Multi-distributed speech emotion recognition based on Mel frequency cepstogram and parameter transfer. Chin. J. Electron. 2022, 31, 155–167.
6. Kacur, J.; Puterka, B.; Pavlovicova, J.; Oravec, M. On the speech properties and feature extraction methods in speech emotion recognition. Sensors 2021, 21, 1888.
7. Mohammad, O.A.; Elhadef, M. Arabic speech emotion recognition method based on LPC and PPSD. In Proceedings of the 2021 2nd International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 19–21 January 2021.
8. Kawitzky, D.; Allister, T. The effect of formant bio-feedback on the feminization of voice in transgender women. J. Voice 2020, 34, 53–67.
9. Lee, J. Generating robotic speech prosody for human robot interaction: A preliminary study. Appl. Sci. 2021, 11, 3468.
10. Luengo, I.; Navas, E.; Hernáez, I. Automatic emotion recognition using prosodic parameters. In Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005.
11. Lugger, M.; Janoir, M.E.; Yang, B. Combining classifiers with diverse feature sets for robust speaker independent emotion recognition. In Proceedings of the European Signal Processing Conference, Nice, France, 31 August–4 September 2015; pp. 1225–1229.
12. Wang, K.; An, N.; Bing, N.L.; Zhang, Y.; Li, L. Speech emotion recognition using Fourier parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75.
13. Ozseven, T. Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition. Appl. Acoust. 2018, 142, 70–77.
14. Anusha, K.; Hima, B.V.; Anil, K.B. Feature extraction algorithms to improve the speech emotion recognition rate. Int. J. Speech Technol. 2020, 23, 45–55.
15. Ancilin, J.; Milton, A. Improved speech emotion recognition with Mel frequency magnitude coefficient. Appl. Acoust. 2021, 179, 108046.
16. Iqbal, M.; Muneeb, A.M.; Noman, M.; Manzoor, E. Review of feature selection methods for text classification. Int. J. Adv. Comput. Res. 2020, 10, 22777970.
17. Yadaiah, V.; Vivekanandam, D.R.; Jatothu, R. A fuzzy logic based soft computing approach in CBIR system using incremental filtering feature selection to identify patterns. Int. J. Appl. Eng. Res. 2018, 13, 2432–2442.
18. Qiu, C.Y. A novel multi-swarm particle swarm optimization for feature selection. Genet. Program. Evolvable Mach. 2019, 20, 503–529.
19. Jagtap, S.B.; Desai, K.R. Study of effect of PCA on speech emotion recognition. Int. Res. J. Eng. Technol. 2019, 6, 2442–2447.
20. Padmaja, J.N.; Rao, R.R. Analysis of speaker independent emotion recognition system using principle component analysis (PCA) and Gaussian mixture models (GMM). Int. J. Eng. Technol. Sci. Res. 2017, 4, 767–778.
21. Soumyajit, S.; Manosij, G.; Soulib, G.; Shibaprasad, S.; Pawan, K.S. Feature selection for facial emotion recognition using cosine similarity-based harmony search algorithm. Appl. Sci. 2020, 10, 2816.
22. Farooq, M.; Hussain, F.; Baloch, N.K.; Raja, F.R.; Yu, H.; Zikria, Y.B. Impact of feature selection algorithm on speech emotion recognition using deep convolutional neural network. Sensors 2020, 20, 6008.
23. Sugan, N.; Satya, S.S.N.; Lakshmi, S.K. Speech emotion recognition using cepstral features extracted with novel triangular filter banks based on bark and ERB frequency scales. Digit. Signal Process. 2020, 104, 102763.
24. Er, M. A novel approach for classification of speech emotions based on deep and acoustic features. IEEE Access 2020, 8, 221640–221653.
25. Madni, M.; Vijaya, C. Hand gesture recognition using semi vectorial multilevel segmentation method with improved ReliefF algorithm. Int. J. Intell. Eng. Syst. 2021, 14, 447–457.
26. Ge, Q.; Zhang, G.B.; Zhang, X.F. Automatic feature selection algorithm based on ReliefF with maximum information coefficient and SVM interaction. J. Comput. Appl. 2021, 42, 3046–3053.
27. Pan, L.Z.; Yin, Z.M.; She, S.G. Emotion recognition based on physiological signal fusion and FCA-ReliefF. Comput. Meas. Control 2020, 28, 179–183.
28. Saxena, A.; Khanna, A.; Gupta, D. Emotion recognition and detection methods: A comprehensive survey. J. Artif. Intell. Syst. 2020, 2, 53–79.
29. Naofal, H.M.A.; Adiwijaya, A.; Astuti, W. Comparative analysis of ReliefF-SVM and CFS-SVM for microarray data classification. Int. J. Electr. Comput. Eng. 2021, 11, 3393–3402.
30. Zheng, K.; Wang, X.; Wu, B. Feature subset selection combining maximal information entropy and maximal information coefficient. Appl. Intell. 2020, 50, 487–501.
31. Veni, S.; Anand, R.S.; Mohan, D. Feature fusion in multimodal emotion recognition system for enhancement of human-machine interaction. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Tamil Nadu, India, 11–12 December 2020; p. 1084.
32. Aouani, H.; Ayed, Y.B. Emotion recognition in speech using MFCC with SVM, DSVM and auto-encoder. Int. Res. J. Mod. Eng. Technol. Sci. 2021, 3, 573–578.