Multimedia Tools and Applications
https://doi.org/10.1007/s11042-022-13964-z
Dynamic scaling factor based differential evolution
with multi-layer perceptron for gene selection
from pathway information of microarray data
Pintu Kumar Ram¹ · Pratyay Kuila¹
Received: 6 September 2021 / Revised: 7 April 2022 / Accepted: 13 September 2022
©The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Microarray data contain a high volume of genes, each with multiple expression values, and a small number of samples. Therefore, the selection of genes from microarray data is an extremely challenging and important issue for analyzing the biological behavior of features. In this context, a dynamic scaling factor based differential evolution (DE) with a multi-layer perceptron (MLP) is designed for the selection of genes from the pathway information of microarray data. First, DE is employed to select a relevant and smaller set of genes. Then an MLP is used to build a classifier model over the selected genes. A suitable and efficient vector representation is designed for the DE. The fitness function is derived separately as the T-score, the classification accuracy, and a weighted sum of both. Simulation and further analysis are performed in terms of sensitivity, specificity, accuracy and F-score. Moreover, statistical and biological analyses are also conducted.
Keywords Differential evolution · Microarray data · Pathway · T-score · Biological significance
1 Introduction
1.1 Background and motivation
Human beings and other species may generally be affected by various diseases. Sometimes a disease spreads rapidly throughout the body. If it is not detected and diagnosed at an early stage, it can seriously affect the human body and may claim lives.
Pintu Kumar Ram
rampintu570@gmail.com
Pratyay Kuila
pratyay kuila@yahoo.com
1Department of Computer Science & Engineering, National Institute of Technology Sikkim,
Ravangla, 737139, Sikkim, India
With the proliferation of artificial intelligence (AI) and machine learning techniques, the genomic data of such identified diseases may be utilized to diagnose and detect unknown instances of these diseases. Microarray technology allows the expression of thousands of genes to be studied simultaneously to detect diseases such as cancer. Normally, it operates on the gene expression patterns that are involved in the formation of diseased and non-diseased cells. Thus, the analysis of microarray-based gene expression data for disease diagnosis has become a hot topic among researchers [19, 34].
In microarray technology, a large number of gene expression values are fabricated on a single glass slide or thin silicon chip. The data are arranged in the form of a matrix, where the rows stand for samples and the columns represent the features/genes. To form a microarray chip, different patient samples are first collected, labeled with dye and fabricated on the chip. The data are then available as a matrix with a large number of gene expression values. The basic concept of microarray chip formation and extraction of microarray data is shown in Fig. 1. This attracts many researchers to select and analyze genes from microarray data for disease diagnosis [7, 32, 37]. Owing to the structure of the data, microarray data tend to contain a high volume of genes and a smaller number of samples. The high volume of genes with fewer samples makes the data difficult to use for diagnosing diseases. Moreover, the presence of noisy and redundant genes/features makes it challenging for researchers to classify diseased and non-diseased cells. Therefore, reducing the number of genes and then building an efficient model from small samples to accurately diagnose diseases is challenging and important.
Note that microarray data generally do not express the biological behavior of the genes. In order to understand the biological behavior of microarray data, pathways are identified. A pathway is a set of genes with similar biological behavior. Pathway-based information plays a crucial role in disease classification. It is important to incorporate biological pathway information and classify the samples using the differentially expressed genes/features that are associated with the diseases. To identify the pathways, many standard databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) are utilized [1, 9, 35].
Fig. 1 Basic concept of microarray chip formation and extraction of microarray data
Researchers are also increasingly attracted to the selection of features for analyzing pathway markers from microarray data [22, 38].
Evolutionary algorithms (EAs) [28] are drawing enormous attention from the research community for their capability to generate feasible and near-optimal solutions for many complex problems [12, 14, 16]. However, an inherent challenge of employing EAs is the proper tuning of the parameters to balance exploration and exploitation in the search space. In this paper, a dynamic scaling factor based differential evolution (DE) technique is employed for the gene selection problem from microarray data. Our contributions in this paper are as follows.
1.2 Author’s contribution
In this article, a differential evolution (DE)-based approach for pathway-based gene analysis is proposed. DE is employed to find the pertinent features among a large number of redundant features. The selected set of features has some biological behavior; by observing this behavior, the disease of a particular species can be predicted. Thus, it helps to correctly diagnose the disease and take the necessary steps as per the requirement. The major contributions of this article are as follows:
– A dynamic scaling factor based DE is used to find the relevant features from the pathway genes.
– The scaling factor (F) of the DE is dynamically updated to balance the exploration and exploitation of the DE.
– The vectors are efficiently encoded with real values, ensuring that each vector provides a complete solution to the problem.
– A fitness function is derived to measure the quality of each vector. Here, three different cases are considered to evaluate the vectors. In the first case the T-score and in the second case the classification accuracy (CA) is used as the objective function. In the third case, the fitness function is derived by a weighted sum approach (WSA) using both the T-score and CA.
– A multi-layer perceptron (MLP) is used to obtain the classification accuracy (CA). Moreover, the MLP is also employed to build the classifier model on the genes selected by the DE.
– The proposed algorithm is simulated using standard data sets and its performance is compared with various existing approaches such as particle swarm optimization (PSO), the genetic algorithm (GA) and the gravitational search algorithm (GSA).
– Further, statistical analysis is performed to show the significance of the algorithm over the existing algorithms (PSO, GA and GSA).
1.3 Structure of the article
The remaining parts of the article are arranged as follows. The works associated with the proposed work are discussed in Section 2. In Section 3, the problem formulation, system model, and preprocessing are given. In Section 4, an overview of the DE algorithm is presented. The proposed method is described in Section 5. The data analysis and simulation results, including the analysis of variance (ANOVA) and the biological analysis, are narrated in Section 6. The work is concluded in Section 7.
2 Associated works
The analysis of features/genes from microarray data has always been of interest to researchers in the field of medical science. A large number of feature selection problems have been studied in the literature. Several state-of-the-art approaches are discussed as follows.
In [6], the authors used information gain to remove the noisy genes, and an SVM classifier was employed over the filtered gene subset to classify the cancerous samples without incorporating any evolutionary algorithm. In the same year, Salem et al. [31] proposed a new technique to differentiate human cancerous diseases based on gene expression profiles. They used a filter and wrapper approach. In the filter stage, information gain is used to select a non-redundant feature set from the large volume of data, and in the wrapper stage, a traditional genetic algorithm is employed to select the best chromosome. The chromosomes are initialized as binary strings (0, 1), where 1 represents a selected feature and 0 a non-selected feature. Only accuracy is used as the fitness function to measure the goodness of a chromosome. At the end, genetic programming is used to classify the feature subset.
Rani et al. [26] proposed an approach based on mutual information (MI) and a genetic algorithm (GA) to classify cancer diseases from microarray-based gene expression data. They deployed the technique in two stages. First, they used mutual information to select the best features. Second, those features are used as the input of a genetic algorithm to obtain the best and optimal feature sets for better classification. In addition, only an SVM classifier is employed for classification. However, this is inefficient for exploring the behavior of the features; various classifiers, rather than a single classifier, are required to find the behavior of the features.
Mabarti [20] designed an approach based on the concept of minimum redundancy and maximum relevance with a genetic algorithm. Here, chromosomes are randomly initialized and each chromosome is evaluated using the minimum redundancy maximum relevance approach. Further, a C4.5 classifier is used to obtain the classification outcomes. Ghosh et al. [8] introduced a recursive memetic algorithm for feature selection from microarray data with the dual task of maximizing the accuracy and minimizing the number of features. Here, chromosomes are randomly initialized and each chromosome is evaluated using accuracy as the primary objective. If it does not meet the criterion, then weight values are assigned to the accuracy and the number of selected features simultaneously. They also compared the performance of their approach with the traditional memetic algorithm and the genetic algorithm.
Han et al. [10] modeled the extraction of the best features from gene expression data using gene-to-class sensitivity information. For this, they used k-means clustering to search for hidden patterns in the data sets at the initial level. Afterwards, binary particle swarm optimization (BPSO) was used to select the best feature set. In the same year, in [41], the authors presented a model that jointly uses a filter and a wrapper approach to select the features. Here, the F-statistic is used to filter the meaningful features, followed by maximum relevance binary particle swarm optimization (MRBPSO) as a wrapper method to obtain the best feature subset. Also, Mandal et al. [21] proposed a particle swarm optimization (PSO) technique based on inferred pathway activity for the analysis of features from gene expression data. The particles are initialized as binary bits, i.e., 0 and 1. To measure the quality of a particle, a single objective (T-score) is used as
the fitness. A support vector machine (SVM) classifier is used to obtain the outcome of the observed solutions. Further, the biological behavior of the selected features is analyzed. However, multiple objectives as well as multiple classifiers are required to optimize the approach. Moreover, Prasad et al. [24] proposed a recursive PSO technique to select a minimum number of features from a large data set. Initially, they used a filter approach to extract a subset of features based on ranking. Afterwards, PSO is deployed over the extracted features. The particles are randomly initialized as 0 and 1. To measure the performance of the particles, the accuracy obtained with a support vector machine classifier is used as the fitness function.
Zhang et al. [40] designed a model for feature selection from microarray data using information gain with an improved binary krill herd algorithm. The data set contains irrelevant features that impact the performance of the system. Therefore, information gain is applied over the data set to obtain the relevant features. A higher information gain score indicates a highly relevant feature, whereas a lower score indicates a less relevant one. Afterwards, the binary krill herd approach is employed over the features with high information gain scores. Each individual/krill is initialized as a random binary bit. Here, accuracy using the k-nearest neighbor approach is used as the fitness function. Ram et al. [25] suggested feature selection from microarray data based on the gravitational search algorithm. Here, the agents are randomly initialized as 0 and 1. The quality of each agent is determined by the accuracy using 5-fold cross validation with SVM. In addition, biological analysis is performed on the selected features.
Xu et al. [39] introduced a model for cancer classification based on the behavioral analysis of pathways, using the gene expression values and the interactions between genes from microarray data. In [29], the authors developed an approach using a multiobjective graph-theoretic method to select features from microarray data. It works in two stages: the maximum community or cluster of features is selected, and the Fisher score or node centrality of the features existing in the community is measured. In addition, spectral clustering over a protein-protein network via an affinity matrix using attributed graph embedding is proposed in [4]. Bakhshandeh et al. [3] aimed to detect the irrelevant features in a subset of features. Therefore, they proposed the symmetric uncertainty class-feature association map (SU-CFAM) method. Initially, they generated a similarity matrix using symmetric uncertainty, which is based on either the correlation between features or between a feature and the class. Later, they created clusters of features using a community detection algorithm. Further, the adjacency matrix of all the clusters is constructed and then the final subset of features is extracted. If two features are highly correlated, then they become redundant; also, if a feature and the class label are weakly correlated, then the feature becomes irrelevant.
In the literature [21, 25, 31], feature selection is performed for high-dimensional problems. In comparison, the suggested method uses a novel approach to enhance the performance for high-dimensional feature problems, as follows.
– In contrast to [8, 20, 26, 31], the proposed approach uses a dynamic scaling factor to maintain the exploration and exploitation of the search space in the DE while searching for better solutions.
– A novel fitness function is derived. This is in contrast to the fitness functions used in [20, 24, 40].
– The vector is designed in an efficient manner. This is in contrast to [21, 25, 26, 31].
– In contrast to [20, 21, 26], the proposed approach uses multiple classifiers due to the variance in the size of the features.
3 Problem representation and system model
3.1 System model and problem formulation
Let us assume a microarray dataset D = {c_ij | 1 ≤ i ≤ S, 1 ≤ j ≤ N} with S samples (rows) and N features or genes (columns). Generally, microarray data has a high number of dimensions or features/genes. The S samples are divided into two classes C_a and C_b, with S = |C_a| + |C_b|.
A pathway can be described as a set of samples with a selected number of genes having similar biological significance. A pathway P_i^T can be represented as P_i^T = {a_ij | 1 ≤ i ≤ S, 1 ≤ j ≤ Z_i}, Z_i < N. Given high-dimensional microarray data, the problem is to find a relevant and smaller set of features/genes to be used for further classification. The terminology used in the following sections is shown in Table 1.
3.2 Preprocessing
3.2.1 Information index classification
For data preprocessing, i.e., extracting genes from the large data matrix, the Information Index Classification (IIC) method is used. The dataset exists in matrix form, where the samples (with multiple classes) are the rows and the genes are the columns. For each column, the IIC value is calculated as in (1), where μ_gm and μ_gn represent the mean of the g-th gene in the m-th and n-th class respectively, σ_gm and σ_gn represent the standard deviation of the g-th gene for the m-th and n-th class respectively, and c denotes the number of classes.
IIC(g) = \sum_{m=1}^{c} \sum_{n=1,\, n \neq m}^{c} \left[ \frac{1}{2} \, \frac{|\mu_{gm} - \mu_{gn}|}{\sigma_{gm} + \sigma_{gn}} + \frac{1}{2} \ln\!\left( \frac{\sigma_{gm}^{2} + \sigma_{gn}^{2}}{2\,\sigma_{gm}\,\sigma_{gn}} \right) \right]    (1)
Table 1 Terminologies

Notations   Descriptions
μ_gn        The mean of the g-th gene for the n-th class.
σ_gn        The standard deviation of the g-th gene for the n-th class.
φ           Weight parameter.
g_pq        The expression level of the p-th sample in the q-th gene.
g_q         The expression level of the q-th gene over all samples.
C_a, C_b    Class a and Class b.
D           Microarray dataset.
N           Number of features.
P_i^T       The i-th pathway.
P_size      Population size in the DE.
S           Number of samples.
W           Total number of pathways.
Z_i         Number of features in P_i^T.
V_i         The i-th solution vector.
After calculating the IIC over the data matrix, the genes are sorted in decreasing order based on their IIC value. Now, x% of the genes (x < 10%) with the highest IIC values are selected, since high IIC values are highly representative of the feature set. Thus, N features (e.g., N = 1000 out of 12000) are selected for the further classification procedure.
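As an illustration, the IIC filter of (1) and the top-x% selection could be sketched as follows, assuming a NumPy matrix X of shape (samples × genes) and a label vector y; the small epsilon added to the denominators and the function names are implementation assumptions, not part of the paper.

```python
import numpy as np

def iic_scores(X, y, eps=1e-12):
    """IIC value of every gene (column of X) as in (1), summed over ordered class pairs."""
    scores = np.zeros(X.shape[1])
    classes = np.unique(y)
    for m in classes:
        for n in classes:
            if m == n:
                continue
            mu_m, mu_n = X[y == m].mean(axis=0), X[y == n].mean(axis=0)
            sd_m, sd_n = X[y == m].std(axis=0), X[y == n].std(axis=0)
            scores += 0.5 * np.abs(mu_m - mu_n) / (sd_m + sd_n + eps)
            scores += 0.5 * np.log((sd_m ** 2 + sd_n ** 2 + eps) / (2 * sd_m * sd_n + eps))
    return scores

def select_top_genes(X, y, fraction=0.08):
    """Column indices of the top x% of genes (x < 10%) ranked by decreasing IIC value."""
    k = max(1, int(fraction * X.shape[1]))
    return np.argsort(iic_scores(X, y))[::-1][:k]
```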
3.2.2 Normalization
After selecting the genes with high IIC values, these genes are imported into the pathway database (http://david.abcc.ncifcrf.gov/tools.jsp) to collect the pathway information. Each pathway contains a number of genes. The min-max normalization approach is used for each pathway and its corresponding genes. The min-max normalization is computed by (2), where g_pq is the expression level of the p-th sample in the q-th gene and g_q represents the expression level of the q-th gene over all samples.
\text{Normalize}(g_{pq}) = \frac{g_{pq} - \min(g_q)}{\max(g_q) - \min(g_q)}    (2)
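A minimal sketch of this gene-wise min-max normalization, assuming each pathway is a NumPy matrix P of shape (samples × genes); the guard for constant genes is an added assumption.

```python
import numpy as np

def min_max_normalize(P):
    """Normalize each gene (column) of a pathway matrix to [0, 1] as in (2)."""
    g_min, g_max = P.min(axis=0), P.max(axis=0)
    span = np.where(g_max > g_min, g_max - g_min, 1.0)  # avoid dividing by zero for constant genes
    return (P - g_min) / span
```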
4 An overview of differential evolution
Differential evolution (DE) [2, 18] is a population-based evolutionary technique which is broadly used to solve complex optimization problems. It consists of four steps, as in conventional evolutionary algorithms: initialization of the population of vectors, mutation based on vector differences, crossover, and selection. It starts with a randomly generated population of solution vectors with a predefined population size. In the population, each vector represents an individual solution. After initialization of the population, the fitness value of each individual vector is computed. Then the iterative process starts with mutation, crossover and selection to find better solutions. In each iteration (or generation), the vectors are updated until the termination condition. At the end, the final solution is identified on the basis of the fitness value. A pictorial representation of the proposed DE is given in Fig. 2.
Fig. 2 Flowchart of the differential evolution algorithm

To perform the DE operations, various schemes are available. A scheme is represented as "DE/x/y/z". Here, DE stands for differential evolution, and x represents the vector selected for the mutation operation; it can be a random vector or the best vector of the population. y stands for the number of difference vectors involved in the mutation, and z stands for the crossover method (it may be binomial, exponential, etc.). Some well-known DE schemes are DE/RAND/1/BIN, DE/RAND/2/BIN, DE/BEST/1/EXP, DE/BEST/2/EXP, etc. Here, BIN and EXP represent binomial and exponential crossover, respectively.
In the mutation operation (assume DE/BEST/1/BIN), a target vector (TV) that needs to be mutated is fixed. Then two vectors are randomly selected from the population to create a difference vector. The generated difference vector and the best vector are used in the mutation to generate a mutant or donor vector (DV). Next, crossover is performed between the DV and the TV to produce an offspring known as the trial vector (TLV). Then selection is performed between the TLV and the TV based on their fitness values; fitness is evaluated for both vectors. The TLV replaces the TV if the fitness of the TLV is better than that of the TV; otherwise, the TV remains in the population. The process repeats until a reasonable solution is found or a stopping criterion is met.
5 Proposed model
In this work, the microarray data is first preprocessed. The KEGG tool is used to extract the pathways. The proposed DE-based approach is then utilized over the data to select a relevant and smaller set of genes. The proposed DE uses three different cases of fitness. The features selected by the DE are then handed over to a classifier (e.g., MLP) to build the model. The overall framework of the proposed model is shown in Fig. 3 and discussed below.

Phase 1: Initially, the large data is preprocessed to identify the non-redundant features. Here, information index classification (IIC) (discussed in Section 3.2.1) is employed to select a smaller subset of features from the data set with a large volume of features. Given the microarray data set of N genes, a smaller number (say N_a, N_a < N) of genes or features are extracted by using the IIC.
Phase 2: After preprocessing, the extracted N_a genes are imported into the public domain database, the DAVID tool (http://david.abcc.ncifcrf.gov/tools.jsp), to extract the pathways. Here, the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database is used, as in [1, 9, 35]. Let us assume W pathways Φ = {P_1^T, P_2^T, ..., P_W^T} are extracted.
Phase 3: Next, normalization (discussed in Section 3.2.2) is performed over the extracted pathway information.
Phase 4: Now, the proposed dynamic scaling factor based DE is employed on the extracted pathways (Φ) to select a smaller number of genes with a high accuracy value. The phases of the proposed DE are discussed in detail in Sections 5.1 to 5.6.
Phase 5: The features selected by the proposed DE are then handed over to the classifier to build the classifier model, as discussed in Section 5.7.
5.1 Vector initialization
A vector should always produce a valid solution. Here, the vectors have the same length as the number of pathways identified in the preprocessing step through KEGG, i.e., W. Let the i-th vector V_i be represented as V_i = {v_i1, v_i2, ..., v_iW}.
Fig. 3 Proposed model based on the pathway scheme
Each element v_ik, 1 ≤ k ≤ W, is initialized by a randomly generated number rand(0,1), 0 ≤ rand(0,1) ≤ 1.0, i.e., v_ik = rand(0,1). The element value v_ik of the vector V_i indicates whether the pathway P_k^T is selected or not: P_k^T is selected only if v_ik > 0.5. A population of P_size vectors is randomly generated as given in Algorithm 1.
Illustration 1. Assume a set of 10 pathways, Φ = {P_1^T, P_2^T, ..., P_10^T}, as shown in Fig. 4. Therefore, the size of the vector is 10. Now, a random number rand(0,1) is generated for each element of the vector as mentioned above. Let us assume the generated numbers are as shown in Fig. 4. It can be observed that the pathway P_2^T is selected as v_i2 > 0.5. Similarly, the pathways P_4^T, P_6^T, P_7^T, and P_10^T are also selected.
Algorithm 1 Generate Population(POP).
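The vector encoding and pathway decoding described above can be sketched as follows; the function names, the random-seed handling and the use of NumPy are illustrative assumptions, not the authors' Algorithm 1.

```python
import numpy as np

def generate_population(p_size, W, rng=None):
    """P_size real-valued vectors, one element in [0, 1] per pathway."""
    rng = np.random.default_rng(rng)
    return rng.random((p_size, W))

def selected_pathways(vector, threshold=0.5):
    """Indices k with v_ik > 0.5, i.e. the pathways encoded as selected."""
    return np.flatnonzero(vector > threshold)

# Example matching Illustration 1: a population over W = 10 pathways.
pop = generate_population(p_size=50, W=10, rng=42)
print(selected_pathways(pop[0]))
```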
5.2 Fitness computation
The quality of the vectors is computed by the derived fitness function. Here, three cases are considered to evaluate the vectors. In the first case the T-score and in the second case the classification accuracy (CA) is used as the objective function. In the third case, the fitness function is derived by a weighted sum approach (WSA) using both the T-score and CA.
1. Case 1 (T-score): The T-score is applicable for observing the variation of the data points of an observation. Basically, it focuses on the mean of a distribution and measures how much a data point deviates from that mean. Therefore, the T-score is computed over the expressions of the genes contained in the selected pathways. The objective is represented by (3):

Minimize  T(P_{scheme}) = \frac{\mu_a - \mu_b}{\sqrt{\frac{\sigma_a^2}{S_a} + \frac{\sigma_b^2}{S_b}}}    (3)

where P_scheme is the pathway scheme, i.e., the final set of pathways selected by the DE, μ_x represents the mean and σ_x the standard deviation of the samples of class x ∈ {a, b}, and S_x indicates the number of samples of the corresponding class. The pseudo-code to calculate the fitness using the T-score is given in Algorithm 2.
2. Case 2 (Classification Accuracy (CA)): After selecting the pathways, the k-fold cross validation technique is used with the MLP classifier to evaluate the fitness value. Normally, the value of k is 10 or less. A higher value of k is less biased but has higher variability; a smaller value tends towards a simple validation-set approach, while a very high value leads towards LOOCV (leave-one-out cross validation). Here, 5-fold cross validation is used. In this approach, the pathway matrix is randomly partitioned with respect to the samples into 5 subsets, comprising training and testing subsets. Four of the five subsets are used for training and one subset is used for testing, as shown in Fig. 5.
Fig. 4 A random vector initialization for ten pathways
Algorithm 2 Fitness Case1.
Here, the MLP is employed on the four training subsets. This is repeated 5 times and the mean accuracy is taken. The objective is represented by (4):

Maximize  CA = \frac{\sum_{i=1}^{5} Ac_i}{5}    (4)

The CA can be calculated using an algorithm similar to Algorithm 2.
3. Case 3 (Weighted Sum Approach (WSA)): In this case, the fitness function is derived using both parameters, the T-score and CA, given in (3) and (4) respectively. Here, a weighted sum approach is used to combine the objectives as follows (a sketch of this fitness evaluation is given after Fig. 5):

Maximize  WSA = \varphi_1 \times (1 - T(P_{scheme})) + \varphi_2 \times CA    (5)

where φ_1 and φ_2 are the weight parameters, φ_1 + φ_2 = 1, 0 ≤ φ_1, φ_2 ≤ 1. The parameters φ_1 and φ_2 are tested with different combinations of values to fix the final vector.
Fig. 5 Fitness for CA using 5-fold cross validation
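As referenced in Case 3, the three fitness cases can be sketched as follows, assuming X_sel holds the expression values of the genes in the selected pathways and y holds binary 0/1 class labels. The pooled two-sample form of the T-score, the absolute value, the MLP hidden-layer size and the φ values are illustrative assumptions, and scikit-learn stands in for the authors' R implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def t_score(X_sel, y, a, b):
    """Case 1: two-sample T-statistic over the expressions of the selected pathway genes."""
    Xa, Xb = X_sel[y == a], X_sel[y == b]
    return abs(Xa.mean() - Xb.mean()) / np.sqrt(Xa.std() ** 2 / Xa.shape[0]
                                                + Xb.std() ** 2 / Xb.shape[0])

def classification_accuracy(X_sel, y):
    """Case 2: mean accuracy of a 5-fold cross-validated MLP, as in (4)."""
    clf = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic", max_iter=500)
    return cross_val_score(clf, X_sel, y, cv=5).mean()

def wsa_fitness(X_sel, y, a, b, phi1=0.4, phi2=0.6):
    """Case 3: weighted sum of (1 - T) and CA as in (5), with phi1 + phi2 = 1."""
    return phi1 * (1.0 - t_score(X_sel, y, a, b)) + phi2 * classification_accuracy(X_sel, y)
```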
5.3 Mutation
Mutation is conducted for each target vector (TV) of the population. To mutate a target vector, a donor vector (DV) is generated for it. There are various mutation and crossover schemes in the literature, as mentioned in Section 4. Here, the DE/RAND/1/BIN scheme is used to illustrate the phases. The donor vector DV_i(g) at the g-th generation is generated as follows:

\vec{DV}_i(g) = \vec{X}_r(g) + F \cdot \{\vec{X}_s(g) - \vec{X}_t(g)\}    (6)

Here, first a random vector X_r(g) is selected from the population, and then two other random vectors X_s(g) and X_t(g) are selected such that r ≠ s ≠ t (as per the DE/RAND/1/BIN scheme). In (6), (X_s(g) − X_t(g)) is known as the difference vector. In the case of the DE/BEST/1/BIN scheme, instead of a random vector X_r(g), the best vector X_BEST(g) is selected for the mutation. In the simulation, the DE/RAND/1/BIN scheme is used.
F is the scaling factor. Generally, F belongs to the range [0.4, 1.0]. In this work, F is dynamically changed, as discussed in Section 5.6. Now DV_i(g) is employed to create the child vector by the crossover operation described in the following section.
Illustration 2. Let us assume a population of eight vectors as shown in Fig. 6. Let X_1 be the target vector and let the three randomly selected vectors be X_3, X_5 and X_7. Now the donor vector is generated by (6).
Fig. 6 An example of the mutation operation. The 3rd, 5th and 7th vectors are selected for mutation. Only the computation of the first component of the donor vector is shown
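A minimal sketch of the DE/RAND/1 donor-vector construction in (6); clipping the donor back into the [0, 1] encoding range is an added assumption not stated in the paper.

```python
import numpy as np

def mutate(pop, i, F, rng):
    """Donor vector for the i-th target: X_r + F * (X_s - X_t) with distinct r, s, t != i."""
    r, s, t = rng.choice([k for k in range(len(pop)) if k != i], size=3, replace=False)
    donor = pop[r] + F * (pop[s] - pop[t])
    return np.clip(donor, 0.0, 1.0)  # keep every element inside the [0, 1] encoding range
```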
5.4 Crossover
The crossover is accomplished between a target vector TV_i(g) = {v_i1(g), v_i2(g), ..., v_iW(g)} and the corresponding donor vector DV_i(g) = {d_i1(g), d_i2(g), ..., d_iW(g)} to produce an offspring vector TLV_i(g) = {t_i1(g), t_i2(g), ..., t_iW(g)}. Here, binomial (BIN) crossover is applied with a predefined crossover rate C_r. The j-th element of the TLV vector is generated as follows:

t_{ij}(g) = \begin{cases} d_{ij}(g), & \text{if } rand_j \le C_r \\ v_{ij}(g), & \text{otherwise} \end{cases}    (7)
Illustration 3. It can be observed from Fig. 7 that each element of the TLV is selected from either the DV or the TV based on the corresponding random number. For example, the first element of the TLV is the same as the first element of the DV because the random number (rand_1) is less than C_r.
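A minimal sketch of the binomial crossover in (7); forcing at least one donor component is a common DE convention added here as an assumption, not something stated in the paper.

```python
import numpy as np

def binomial_crossover(target, donor, Cr, rng):
    """Trial vector: take the donor component wherever rand_j <= Cr, otherwise keep the target's."""
    take_donor = rng.random(len(target)) <= Cr
    take_donor[rng.integers(len(target))] = True  # guarantee the trial differs from the target
    return np.where(take_donor, donor, target)
```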
5.5 Selection
In the selection phase, it is decided which one among the target vector and the newly generated child vector will survive in the next generation. The decision is taken based on the fitness values as follows:

\vec{TV}_i(g+1) = \begin{cases} \vec{TLV}_i(g), & \text{if } Fitness(\vec{TLV}_i(g)) \ge Fitness(\vec{TV}_i(g)) \\ \vec{TV}_i(g), & \text{otherwise} \end{cases}    (8)

The mutation, crossover and selection operations are iterated until the termination criterion is met. Here, the termination criterion is a predefined number of iterations.
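The greedy selection of (8) then reduces to a single comparison; in the sketch below, fitness may be any of the objective functions sketched earlier (the function names are assumed for illustration).

```python
def select(target, trial, fitness):
    """Keep the trial vector only if it is at least as fit as the target, as in (8)."""
    f_target, f_trial = fitness(target), fitness(trial)
    return (trial, f_trial) if f_trial >= f_target else (target, f_target)
```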
5.6 Updating the dynamic scaling factor (F)
Fig. 7 Crossover operation

The parameter F plays a crucial role in the DE algorithm. In conventional DE, the value of F is fixed for every iteration. An inherent drawback of any population-based stochastic evolutionary algorithm is premature convergence [33]. Moreover, an evolutionary algorithm is considered efficient if it can balance the exploration and exploitation of the search space. In this regard, many researchers have suggested a dynamic scaling factor [30, 36]. Here, a dynamic scaling factor is also employed to overcome these issues. The scaling factor F is dynamically updated with a new value for each solution to control the trade-off between exploration and exploitation in the search space. Initially, F is randomly generated in the range [-0.8, 0.8], and this range is gradually reduced to [-0.4, 0.4]. It can be observed that the tendency towards exploration is higher in the initial iterations and exploitation increases afterwards. Hence, the mechanism balances exploration and exploitation.
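The paper gives the initial range [-0.8, 0.8] and the final range [-0.4, 0.4] but not the exact schedule, so the sketch below assumes a linear shrinkage of the half-range over the iterations and ties together the helper functions sketched in Sections 5.1 to 5.5; pathways is assumed to be a list of gene-column index arrays, one per extracted KEGG pathway.

```python
import numpy as np

def dynamic_F(iteration, max_iter, rng, start=0.8, end=0.4):
    """Draw F from a symmetric range whose half-width shrinks (here: linearly) from 0.8 to 0.4."""
    half = start - (start - end) * iteration / max_iter
    return rng.uniform(-half, half)

def run_de(X, y, a, b, pathways, p_size=50, max_iter=100, Cr=0.8, seed=0):
    """Skeleton of the proposed DE loop (dynamic F, DE/RAND/1/BIN, greedy selection)."""
    rng = np.random.default_rng(seed)
    pop = generate_population(p_size, len(pathways), rng)

    def fit(v):
        sel = selected_pathways(v)
        if sel.size == 0:
            return -np.inf                      # an empty pathway selection is never kept
        genes = np.unique(np.concatenate([pathways[k] for k in sel]))
        return wsa_fitness(X[:, genes], y, a, b)

    fitness = np.array([fit(v) for v in pop])
    for g in range(max_iter):
        F = dynamic_F(g, max_iter, rng)
        for i in range(p_size):
            donor = mutate(pop, i, F, rng)
            trial = binomial_crossover(pop[i], donor, Cr, rng)
            f_trial = fit(trial)
            if f_trial >= fitness[i]:           # selection, as in (8)
                pop[i], fitness[i] = trial, f_trial
    return pop[np.argmax(fitness)]              # fittest vector = final pathway selection
```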
The pseudo code of the proposed DE is given in Algorithm 3 and the corresponding flowchart is given in Fig. 8. After termination, the fittest vector of the population is returned as the final solution vector.
Remark 5.1 A vector can be generated in O(W) time. Therefore, the initial population can be generated in O(P_size × W) time. Then, in the iterative process, a donor vector can be generated for each target vector in O(W) time (line 9 of Algorithm 3), and the crossover also takes O(W) time (lines 11 to 17). The selection operation requires the computation of the fitness value of the new child vector; the fitness of the child vector can be computed in O(W × S) time, where S is the sample size. The subsequent selection takes O(1) time to identify the better of the target and child vectors (lines 19 to 23). Therefore, the overall time complexity of the DE can be computed as O(P_size × W) + O(I × P_size × (W + W·S)), i.e., O(I × P_size × W × S).
5.7 Machine learning classifier
In recent times, machine learning has had a deep impact on the field of data science. It has the ability to deal with experiences, observations and instructions, in the form of data, for correct prediction. There are many machine learning algorithms, which are used in fields such as data mining, face recognition, handwriting recognition and bioinformatics [13, 15]. Here, the proposed work employs the multilayer perceptron (MLP) neural network classifier.
Initially, the MLP with 5-fold cross validation is used to compute the objective function in the fitness evaluation, as shown in Section 5.2.
Fig. 8 Flowchart of the dynamic scaling factor based DE
Algorithm 3 Proposed differential evolution.
Then, the MLP is further applied over the genes selected by the DE to build a classifier model. The performance of the classifier model is evaluated in the simulation analysis phase in terms of sensitivity (SN), specificity (SP), accuracy (AC), and F-score (FS). The MLP consists of several layers, namely the input layer (IL), hidden layers (HLs) and output layer (OL). It has the potential to handle complex datasets and obtain the desired output with maximum accuracy. The layers are connected with each other. In general, after feeding the data to the input layer, it goes towards the hidden layer through a combination of weights and biases, and is then passed through an activation function. In this case, the sigmoid activation function is used, which converges towards the desired output. The overall flow of the MLP is depicted in Fig. 9. Here, I_n represents the inputs of the input-layer neurons, and w_n and β_n represent the weights. A single hidden layer with two neurons is used. Each hidden neuron computes the weighted sum of its inputs plus a bias value, after which the activation function is applied to the summed value. Hence, the output is either output 1 (O_1) or output 2 (O_2).
Fig. 9 Multilayer perceptron neural network
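A small sketch of building and evaluating such an MLP (one hidden layer of two sigmoid neurons) on the genes selected by the DE, assuming binary 0/1 labels; scikit-learn is used purely for illustration, since the paper reports an R implementation, and specificity would need a custom scorer (it is computed from the confusion matrix in Section 6.2).

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate

def evaluate_classifier(X_sel, y, folds=10):
    """Cross-validated MLP with one hidden layer of two sigmoid neurons on the selected genes."""
    clf = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic", max_iter=1000)
    scores = cross_validate(clf, X_sel, y, cv=folds, scoring=("accuracy", "recall", "f1"))
    return {name: vals.mean() for name, vals in scores.items() if name.startswith("test_")}
```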
6 Data analysis and simulation results
6.1 Overview of Datasets
In this article, three real-life gene datasets are used for the simulation analysis. The datasets can be fetched from the website www.biolab.si/supp/bi-cancer/. An overview of the considered datasets follows.
– Prostate: This dataset is for prostate tumors. It consists of 102 samples with a total of 12533 genes or features per sample. The samples are split into two classes: a normal class with 50 samples and a tumor class with 52 samples.
– DLBCL: This dataset is for B-cell lineage malignancies. It consists of two different B-cell classes: diffuse large B-cell lymphoma (DLBCL) and follicular lymphoma (FL). It contains a total of 77 samples with 7070 genes. The DLBCL class consists of 58 samples and the remaining 19 samples belong to the FL class.
– Child ALL: This is an acute lymphoblastic leukemia gene set. It contains 8280 genes with 110 samples, which are divided into two classes based on before and after treatment, regardless of the type of treatment. The first 50 samples are before therapy and the other 60 samples are after therapy.
6.2 Simulation environment
The simulation has been done on a system with an Intel i7 8th-generation processor, 8 GB of RAM and Windows 10 as the operating system. The proposed DE is implemented using the R language and the simulation results are plotted using MATLAB. The proposed DE algorithm is evaluated with three cases of fitness functions: T-score, classification accuracy (CA) and WSA. In the rest of the paper, the DE with the T-score fitness function is denoted as DETS and, similarly, DECA and DEWSA.
CA is computed using the multi-layer perceptron. Also, the scaling factor (F) of the DE is dynamically updated in each iteration. For the sake of comparison, similar existing works using PSO [21], GA [31] and GSA [25] are also executed. Here, microarray-based gene expression data is taken for the experimental analysis. After executing the proposed DE, the Wilcoxon rank sum test [5] is applied to obtain a P-value for each pathway. Then, the top 50% of pathways are extracted based on the ascending order of P-values and evaluated by 10-fold cross validation with various machine learning classifiers (k-nearest neighbor (k-NN), Naïve Bayes (NB), support vector machine (SVM) and multi-layer perceptron (MLP)) to obtain the sensitivity (SN), specificity (SP), accuracy (AC), and F-score (FS), respectively, by (9)-(13), which are derived from the confusion matrix [23].
\text{Accuracy} = \frac{\tau_p + \tau_n}{\tau_p + \tau_n + f_p + f_n}    (9)

F\text{-score} = \frac{2 \times P \times R}{P + R}    (10)

\text{Sensitivity or Recall } (R) = \frac{\tau_p}{\tau_p + f_n}    (11)

\text{Specificity} = \frac{\tau_n}{\tau_n + f_p}    (12)

\text{Precision } (P) = \frac{\tau_p}{\tau_p + f_p}    (13)
Here, τ_p stands for true positive, τ_n for true negative, f_p for false positive, and f_n for false negative.
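For illustration, the metrics in (9)-(13) can be computed directly from the binary confusion matrix as follows; scikit-learn's confusion_matrix and 0/1 labels are assumed.

```python
from sklearn.metrics import confusion_matrix

def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and F-score of a binary prediction, as in (9)-(13)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)                                       # recall, (11)
    specificity = tn / (tn + fp)                                       # (12)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                         # (9)
    precision = tp / (tp + fp)                                         # (13)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)  # (10)
    return sensitivity, specificity, accuracy, f_score
```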
The parameters used in the proposed work are listed in Table 2. Note that the considered parameters are the same as those taken in [17], [11] and [27] for PSO, GA and GSA, respectively.
6.3 Simulation results
The outcomes of the simulation in terms of sensitivity (SN), specificity (SP), accuracy (AC), and F-score (FS) for the different classifiers are given in Tables 3, 4, 5 and 6, which describe the comparative analysis of 10-fold cross validation with the MLP, k-NN, SVM and NB classifiers respectively. It can be observed that the proposed approach (DEWSA) performs better than DECA, DETS and the existing techniques such as PSO [21], GA [31] and GSA [25] in terms of SN, SP, AC and FS for all the data sets. DECA performs similarly to PSO, and DETS behaves similarly to GA and GSA. The rationale behind these outcomes is that the vector operates on subsets of pathways, each containing a different number of features/genes with similar behavior. Hence, it reduces the computation effort and the performance of the system is enhanced. Also, the weight parameters φ_1, φ_2 and the scaling factor (F) are tuned efficiently in each generation. The dynamic update of the scaling factor (F) helps the proposed DE-based work to reach a better solution by balancing the exploration and exploitation of the search space. On the other hand, the MLP classifier gives better outcomes than the other classifiers (SVM, k-NN and NB) because the MLP has a strong capability to handle complex solutions.
Moreover, the comparative analysis of iteration versus average fitness is plotted for all datasets, as shown in Fig. 10. It can be observed that the proposed approach dominates all the other approaches due to the efficient design of the solution vector and the dynamic update of the scaling factor (F) in each generation.
Table 2 Parameters setup

Parameter                 Proposed       PSO [17]                 GA [11]    GSA [27]
Iteration                 100            100                      100        100
Population size           50             50                       50         50
Crossover rate (C_r), F   0.8, Dynamic   NA                       0.8, NA    NA
Mutation rate (M_rate)    NA             NA                       0.1        NA
c1, c2, w, α, G0          NA             1.4, 1.4, 0.79, NA, NA   NA         20, 100
Table 3 Simulation result of 50% pathway for SN, SP, AC and FS by MLP
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.89 0.90 0.91 0.88 0.88 0.88 0.90 0.89 0.89 0.88 0.89 0.88
DECA 0.88 0.88 0.89 0.88 0.86 0.84 0.88 0.86 0.87 0.85 0.86 0.85
DETS 0.85 0.86 0.85 0.84 0.80 0.81 0.81 0.82 0.73 0.76 0.76 0.76
PSO[21] 0.88 0.89 0.89 0.88 0.84 0.82 0.85 0.86 0.85 0.83 0.82 0.83
GA[31] 0.84 0.83 0.80 0.82 0.78 0.76 0.79 0.74 0.72 0.72 0.70 0.72
GSA[25] 0.83 0.80 0.82 0.81 0.74 0.74 0.76 0.75 0.70 0.73 0.71 0.73
It should be noted that the number of features plays a vital role in the performance of the classifiers (MLP, SVM, k-NN, and NB). After the selection of fewer, relevant pathways by the DE, the classifiers are used to generate the classification model. Here, the classifiers are applied while varying the percentage of selected pathways. The resulting accuracies are plotted against the percentage of pathways for the different classifiers in Fig. 11. It can be observed that the accuracy of the SVM, k-NN and NB classifiers varies as the percentage of pathways is varied. The behavior of the MLP for different data sizes can be seen in Fig. 11(a): the accuracy of the MLP varies comparably less than that of SVM, k-NN and NB for all datasets. SVM can deal with large datasets and its risk of overfitting is low, but an important factor for SVM is the choice of kernel function; here, a linear kernel is used. The NB classifier can deal with both small and large amounts of data. Thus, from Fig. 11(b) and (d), it can be seen that the accuracy increases with the feature size, i.e., a larger number of features produces higher accuracy. k-NN is not efficient for large datasets; moreover, it needs feature scaling to accurately predict the instances, and it is quite sensitive to noisy data and missing values. For this reason, the noisy data has already been pruned using the filter approach. From Fig. 11(c), it can be noticed that a smaller number of features provides higher accuracy. Thus, after applying the different classifiers to different feature set sizes, it can be noticed that the MLP performs better than the other classifiers.
Table 4 Simulation result of 50% pathway for SN, SP, AC and FS by k-NN
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.88 0.88 0.86 0.84 0.84 0.86 0.86 0.86 0.86 0.84 0.85 0.84
DECA 0.84 0.83 0.84 0.84 0.82 0.82 0.80 0.82 0.82 0.80 0.81 0.80
DETS 0.79 0.74 0.78 0.74 0.66 0.67 0.73 0.75 0.72 0.65 0.72 0.72
PSO[21] 0.81 0.80 0.84 0.82 0.78 0.78 0.79 0.80 0.78 0.77 0.77 0.76
GA[31] 0.78 0.73 0.74 0.72 0.69 0.70 0.72 0.72 0.71 0.70 0.72 0.71
GSA[25] 0.76 0.75 0.72 0.70 0.64 0.68 0.69 0.69 0.70 0.68 0.70 0.71
Table 5 Simulation result of 50% pathway for SN, SP, AC and FS by SVM
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.85 0.84 0.84 0.82 0.82 0.80 0.81 0.84 0.74 0.72 0.72 0.72
DECA 0.81 0.80 0.80 0.78 0.78 0.76 0.78 0.80 0.72 0.70 0.69 0.70
DETS 0.78 0.76 0.74 0.74 0.72 0.70 0.74 0.75 0.70 0.68 0.65 0.65
PSO [21] 0.80 0.80 0.78 0.78 0.74 0.75 0.76 0.76 0.70 0.70 0.65 0.70
GA [31] 0.76 0.74 0.73 0.74 0.70 0.70 0.72 0.72 0.68 0.69 0.65 0.65
GSA [25] 0.75 0.75 0.72 0.70 0.70 0.68 0.65 0.64 0.70 0.68 0.68 0.70
6.4 Analysis of variance (ANOVA)
Analysis of variance (ANOVA) is a technique to compare the means of samples and to determine whether they are equivalent or not. It involves two hypotheses: the null hypothesis (H_null) and the alternate hypothesis (H_alt). The respective hypotheses are defined as

H_{null}: \mu_{DEWSA} = \mu_{PSO} = \mu_{GA} = \mu_{GSA}    (14)

H_{alt}: \mu_{DEWSA} \neq \mu_{PSO} \neq \mu_{GA} \neq \mu_{GSA}    (15)

The null hypothesis is accepted if the means of all the samples are equal; otherwise, the alternate hypothesis is accepted. Normally, the output of ANOVA depends on the F-statistic, the F-critical value and the P-value. If the value of the F-statistic is greater than the F-critical value and the value of α (chosen by the user) is greater than the P-value, then the null hypothesis is rejected; otherwise, it is accepted. In this paper, the ANOVA statistical test is performed between DEWSA, PSO, GA, and GSA.
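A minimal sketch of this one-way ANOVA over the per-algorithm accuracy samples, using SciPy for illustration; comparing the p-value with α is equivalent to comparing the F-statistic with the F-critical value.

```python
from scipy import stats

def anova_on_accuracies(acc_dewsa, acc_pso, acc_ga, acc_gsa, alpha=0.05):
    """One-way ANOVA over the accuracy samples of the four algorithms (H_null: equal means)."""
    f_stat, p_value = stats.f_oneway(acc_dewsa, acc_pso, acc_ga, acc_gsa)
    return f_stat, p_value, p_value < alpha  # True -> reject H_null: the means differ
```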
Here, ten accuracy samples of each algorithm are taken. The value of alpha (α) is chosen as 0.05, a standard significance level. The input for the ANOVA test is given in Table 7 and the output of the ANOVA is shown in Table 8. It is found that the F-statistic is greater than the F-critical value and the chosen value of alpha is larger than the P-value. Therefore, the null hypothesis is rejected, and it can be concluded that the means of the accuracy samples of the selected algorithms differ.
Table 6 Simulation result of 50% pathway for SN, SP, AC and FS by NB
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.75 0.78 0.78 0.76 0.80 0.78 0.78 0.80 0.72 0.70 0.70 0.68
DECA 0.72 0.76 0.76 0.74 0.77 0.76 0.76 0.78 0.70 0.68 0.68 0.68
DETS 0.68 0.72 0.72 0.70 0.71 0.72 0.72 0.72 0.67 0.64 0.62 0.64
PSO[21] 0.70 0.74 0.74 0.73 0.75 0.74 0.74 0.76 0.69 0.66 0.65 0.65
GA[31] 0.65 0.70 0.70 0.70 0.70 0.68 0.70 0.68 0.65 0.65 0.64 0.64
GSA[25] 0.68 0.70 0.68 0.68 0.74 0.72 0.72 0.70 0.64 0.65 0.65 0.65
Fig. 10 Iteration vs. average fitness for (a) Prostate, (b) DLBCL and (c) Child ALL data. DEWSA beats the other approaches in all data sets due to its novelty
However, ANOVA can only indicate that differences are statistically significant; it cannot show which samples or groups are distinct from the others. Therefore, the least significant difference (LSD) post-hoc test is performed to identify the groups that differ significantly from the other groups. The LSD test results are shown in Table 9. The LSD post-hoc test states that two groups do not differ significantly from each other if the confidence interval contains zero. From Table 9, it is found that the intervals, i.e., the lower and upper bounds of the mean differences, do not contain zero for DEWSA versus PSO, GA and GSA. Thus, the condition is satisfied in our case. Therefore, the ANOVA test followed by the LSD post-hoc test clearly shows statistically significant differences between the accuracy samples of the different algorithms.
6.5 Biological importance
In this section, the biological significance of the selected relevant pathway genes is analyzed. The best proposed technique (DEWSA) is executed ten times and a set of ten genes is obtained; a gene that is repeated at least five times is selected as a better gene. Afterwards, the heat-map of the selected features for each data set is plotted. The related genes as well as the related diseases of the selected genes are explored using the gene database www.disgenet.org.
Fig. 11 Accuracy by varying the % of the pathways for the (a) MLP, (b) SVM, (c) k-NN and (d) NB classifiers
The gene symbols and their short descriptions are presented in Tables 10, 11 and 12. Due to the large number of pathway genes, only some allied genes, with their specific PubMed citation (SPCIN) [1], are studied.
Table 10 shows the results for the Prostate data. It can be seen that only five genes are found after executing the proposed (DEWSA) technique. The related diseases as well as the allied genes, with their short descriptions and symbols, are also given in Table 10.
Table 7 Input for ANOVA test
Factors Count Sum Mean SD 95% interval(Lower and Upper Bound)
DEWSA 10 8.74 .8740 .05502 (.8346,.9134)
PSO 10 7.79 .7790 .08212 (.7203,.8377)
GA 10 6.96 .6960 .06637 (.6485,.7435)
GSA 10 7.21 .7210 .07216 (.7104,.8224)
Table 8 Output for ANOVA test
Groups Sum of Squares df Mean Square F-statistic P-value F-critical
Between Groups 0.124 3 0.039334 4.32461 0.031471 3.46724
Within Groups 0.271 34 0.005325
Total 0.326 37
Table 9 LSD Post-hoc test
Difference of levels Difference of Means Standard Error 95% interval(Lower and Upper Bound)
DEWSA - PSO .09500 .03235 .0342,.1462
DEWSA - GA .11600 .03235 .0258,.1334
DEWSA - GSA .09800 .03235 .0217,.1423
Table 10 Biological analysis for Prostate data
Selected gene Symbol Description Related Disease Allied Genes(SPCIN)
37639at , HPN Hepsin Prostate carcinoma RAFI(7),IGFI(69),KLK3(776)
41288at CALM1 Calmodulin Carcinogenesis RAFI(24),KLK3(5),IGFI(43)
31527at RPS2 Ribosomal protein Prostatic disease KLK3(11),IGFI(1)
39939at COL4A6 Collagen Tumor Progression KLK3(14),PXN(3)
38634at RBP1 Retinol protein Malignant prostate INS(31),FLNA(2),KLK3(781)
Table 11 Biological analysis for DLBCL data
Selected gene Symbol Description Related Disease Allied Genes(SPCIN)
X56494at PKML Pyruvate Kinase Anemia TP53(32),AKTI(2)
X16983at ITGA4 Integrin alpha 4 B-lymphoma MAPK3(2),FASLG(1)
D87119at TRIB2 Tribbles homolog 2 T-lymphoma HRAS(1),AKTI(9),
X62078at GM2A Ganglioside activator Lymphoma FASLG(17),MAPK3(5),
Table 12 Biological analysis for Child ALL data
Selected gene Symbol Description Related Disease Allied Genes(SPCIN)
38464at GCS1 Glucosidase 1 Acute leukemia CDKI(4),PLKI(4)
39994at CCR1 Chemokine Cardiovascular disease CHUK(4),CXCR2(11)
32264at GZMM Granzyme M Carcinogenesis PLKI(15),CXCR2(5)
36651at ACP2 Acid phosphatase 2 Tumor progression ITGA6(1),PLKI(9)
Fig. 12 Heat-maps for (a) Prostate data, (b) DLBCL data and (c) Child ALL data. The x-axes represent the features and the y-axes represent the different samples

The heat-maps of the selected genes are plotted as shown in Fig. 12(a). A heat-map shows the genes on the x-axis against the samples of the classes on the y-axis. The expression level of the genes is indicated with different colors: green, red and black. The green color describes a low expression value, the red color describes a high expression value, and
the black color describes the absence of expression values. Lower expression implies normal samples and higher expression implies tumor samples. The selected genes (41288at, 31527at, 39939at) show high expression and the genes (37639at, 38634at) show low expression.
It can be observed from Table 11 that only four genes are selected after executing the process on the DLBCL dataset. It can also be noticed from Fig. 12(b) that the genes (X56494at, X16983at) show high expression while D87119at and X62078at show low expression. Similarly, four genes are selected for the Child ALL data as shown in Table 12. It can be seen from Fig. 12(c) that the genes (38464at, 39994at, 32264at and 36651at) show low expression.
7 Conclusion
In this work, a model based on a dynamic scaling factor differential evolution (DE) algorithm with a multi-layer perceptron is designed to select the relevant pathway genes from a high volume of gene expression data. The vectors are efficiently represented. Here, two objectives, the T-score and the classification accuracy (CA), are considered to compute the fitness function. These objectives are used in three separate cases. In case one, the T-score is taken as the objective function and the approach is denoted as DETS. In case two, the classification accuracy (CA) is considered as the objective function and it is denoted as DECA. In case three, a weighted sum approach (WSA) combining both objectives, i.e., the T-score and the classification accuracy (CA), is considered and it is denoted as DEWSA. After execution, it is found that the proposed approach (DEWSA) performs better than DETS, DECA and the other existing approaches (PSO, GA and GSA) in terms of sensitivity, specificity, accuracy and F-score. It can be observed that DEWSA, DECA and PSO behave similarly, while DETS, GA and GSA behave similarly for all datasets (Prostate, DLBCL and Child ALL) when applying the different classifiers (MLP, k-NN, SVM and NB). DEWSA, DECA and PSO achieve higher values of sensitivity, specificity, accuracy and F-score when using the MLP classifier, with a variance of 1%-2% for the Prostate data and 2%-2.5% for the DLBCL and Child ALL datasets; with k-NN the margin is 3%-4%, with NB 1.5%-2% for all data sets, and with SVM 3%-4% for Prostate and DLBCL and 1.5%-2% for the Child ALL dataset. On the other hand, DETS, GA and GSA perform similarly, with lower values of sensitivity, specificity and accuracy for all datasets with all classifiers. Note that the outcomes vary due to the dimension of the genes/features. Hence, it is concluded that DEWSA performs considerably better than the other approaches (DETS, DECA, PSO, GA and GSA), achieving higher values of sensitivity, specificity, accuracy and F-score with the MLP classifier.
Moreover, biological analysis is performed on the selected features and heat-maps are presented. To show the statistical significance of the proposed algorithm (DEWSA) over the existing approaches (PSO, GA and GSA), an analysis of variance (ANOVA) is also carried out.
The suggested approach may be useful in the health sector to diagnose diseases. It can also be used in various other fields where feature selection is required. Note that only binary-class data is considered here. In the future, another model for multi-class data sets can be designed. Moreover, a multi-objective optimization technique with various evolutionary algorithms may also be developed.
Declarations
Conflict of Interests The authors declare that they have no conflict of interest. The research work of this
article is not funded by any organizations/agencies.