Multimedia Tools and Applications
https://doi.org/10.1007/s11042-022-13964-z
Dynamic scaling factor based differential evolution
with multi-layer perceptron for gene selection
from pathway information of microarray data
Pintu Kumar Ram¹ · Pratyay Kuila¹
Received: 6 September 2021 / Revised: 7 April 2022 / Accepted: 13 September 2022
©The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Microarray data contain a high volume of genes, each with multiple expression values, and a small number of samples. Therefore, the selection of genes from microarray data is an extremely challenging and important issue for analyzing the biological behavior of features. In this context, a dynamic scaling factor based differential evolution (DE) with a multi-layer perceptron (MLP) is designed for the selection of genes from the pathway information of microarray data. First, DE is employed to select a relevant and smaller set of genes. Then an MLP is used to build a classifier model over the selected genes. A suitable and efficient vector representation is designed for the DE. The fitness function is derived separately as the T-score, the classification accuracy, and a weighted sum of both. Simulation and further analysis are performed in terms of sensitivity, specificity, accuracy and F-score. Moreover, statistical and biological analyses are also conducted.
Keywords Differential evolution · Microarray data · Pathway · T-score · Biological significance
1 Introduction
1.1 Background and motivation
Human beings and other species may generally be affected by various diseases. Sometimes a disease spreads rapidly throughout the body. If it is not detected and diagnosed at an early stage, it can seriously affect the human body and may claim lives.
Pintu Kumar Ram
rampintu570@gmail.com
Pratyay Kuila
pratyay kuila@yahoo.com
1Department of Computer Science & Engineering, National Institute of Technology Sikkim,
Ravangla, 737139, Sikkim, India
With the proliferation of artificial intelligence (AI) and machine learning techniques, the genomic data of such identified diseases may be utilized to diagnose and detect unknown instances of these diseases. Microarray technology allows the expression of thousands of genes to be studied simultaneously to detect diseases such as cancer. Normally, it operates on the gene expression patterns that are involved in the formation of diseased and non-diseased cells. Thus, the analysis of microarray-based gene expression data for disease diagnosis has become a hot topic among researchers [19, 34].
In microarray technology, a large number of gene expression values are fabricated on a single glass slide or thin silicon chip. The data are arranged in the form of a matrix, where the rows stand for samples and the columns represent the features/genes. To form a microarray chip, different patient samples are first collected, labeled with dye and fabricated on the chip. The data are then available as a matrix with a large number of gene expression values. The basic concept of microarray chip formation and extraction of microarray data is shown in Fig. 1. This attracts many researchers to select and analyze genes from microarray data for disease diagnosis [7, 32, 37]. Owing to the structure of the data, microarray data tend to contain a high volume of genes and a smaller number of samples. The high volume of genes with fewer samples makes the data difficult to use for diagnosing diseases. Moreover, the presence of noisy and redundant genes/features makes it challenging for researchers to classify diseased and non-diseased cells. Therefore, reducing the number of genes and then building an efficient model from small samples to accurately diagnose diseases is challenging and important.
Note that microarray data generally do not express the biological behavior of the genes. In order to understand the biological behavior of microarray data, pathways are identified. A pathway is a set of genes with similar biological behavior. Pathway-based information plays a crucial role in disease classification. It is important to incorporate biological pathway information and classify the samples using the differentially expressed genes/features that are associated with the diseases. To identify the pathways, many standard databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) are utilized [1, 9, 35].
Fig. 1 Basic concept of microarray chip formation and extraction of microarray data
Researchers are also increasingly attracted to the selection of features for analyzing pathway markers from microarray data [22, 38].
Evolutionary algorithms (EAs) [28] are drawing enormous attention from the research community for their capability to generate feasible and near-optimal solutions for many complex problems [12, 14, 16]. However, an inherent challenge of employing EAs is the proper tuning of the parameters to balance exploration and exploitation in the search space. In this paper, a dynamic scaling factor based differential evolution (DE) technique is employed for the gene selection problem from microarray data. Our contributions in this paper are as follows.
1.2 Author’s contribution
In this article, a differential evolution (DE)-based approach for pathway-based gene analysis is proposed. DE is employed to find the pertinent features among a large number of redundant features. The selected set of features has some biological behavior; by observing this behavior, the disease of a particular species can be predicted. Thus, it helps to correctly diagnose the disease and take the necessary steps as per the requirement. The major contributions of this article are as follows:
– A dynamic scaling factor based DE is used to find the relevant features from the pathway genes.
– The scaling factor (F) of the DE is dynamically updated to balance the exploration and exploitation of the DE.
– The vectors are efficiently encoded with real values, ensuring that each vector provides a complete solution to the problem.
– A fitness function is derived to measure the quality of each vector. Here, three different cases are considered to evaluate the vectors. In the first case the T-score and in the second case the classification accuracy (CA) is used as the objective function. In the third case, the fitness function is derived by a weighted sum approach (WSA) using both the T-score and CA.
– A multi-layer perceptron (MLP) is used to obtain the classification accuracy (CA). Moreover, the MLP is also employed to build the classifier model on the genes selected by the DE.
– The proposed algorithm is simulated using standard data sets and its performance is compared with various existing approaches such as particle swarm optimization (PSO), the genetic algorithm (GA) and the gravitational search algorithm (GSA).
– Further, statistical analysis is performed to show the significance of the algorithm over the existing algorithms (PSO, GA and GSA).
1.3 Structure of the article
The remaining parts of the article are arranged as follows. The works associated with the proposed work are discussed in Section 2. In Section 3, the problem formulation, system model, and preprocessing are given. In Section 4, an overview of the DE algorithm is presented. The proposed method is described in Section 5. The data analysis and simulation results, including the analysis of variance (ANOVA) and the biological analysis, are narrated in Section 6. The work is concluded in Section 7.
2 Associated works
The analysis of features/genes from microarray data has always been of interest to researchers in the field of medical science. A large number of feature selection problems have been studied in the literature. Several state-of-the-art approaches are discussed as follows.
In [6], the authors used information gain to remove the noisy genes, and an SVM classifier was employed over the filtered gene subset to classify the cancerous samples without incorporating any evolutionary algorithm. In the same year, Salem et al. [31] proposed a new technique to differentiate human cancerous diseases based on gene expression profiles. They used a filter and wrapper approach. In the filter stage, information gain is used to select a non-redundant feature set from the large volume of data, and in the wrapper stage, a traditional genetic algorithm is employed to select the best chromosome. The chromosomes are initialized as binary strings (0, 1), where 1 represents a selected feature and 0 a non-selected feature. Only accuracy is used as the fitness function to measure the goodness of a chromosome. At the end, genetic programming is used to classify the feature subset.
Rani et al. [26] proposed an approach based on mutual information (MI) and a genetic algorithm (GA) to classify cancer diseases from microarray-based gene expression data. They deployed the technique in two stages. First, they used mutual information to select the best features. Second, those features are used as the input of a genetic algorithm to obtain the best and optimal feature sets for better classification. In addition, only an SVM classifier is employed for classification. However, this is inefficient for exploring the behavior of the features; various classifiers, rather than a single classifier, are required to find the behavior of the features.
Mabarti [20] designed an approach based on the concept of minimum redundancy and maximum relevance with a genetic algorithm. Here, chromosomes are randomly initialized and each chromosome is evaluated using the minimum redundancy maximum relevance approach. Further, a C4.5 classifier is used to obtain the classification outcomes. Ghosh et al. [8] introduced a recursive memetic algorithm for feature selection from microarray data with the dual task of maximizing the accuracy and minimizing the number of features. Here, chromosomes are randomly initialized and each chromosome is evaluated using accuracy as the primary objective. If it does not meet the criterion, then weight values are assigned to the accuracy and the number of selected features simultaneously. They also compared the performance of their approach with the traditional memetic algorithm and the genetic algorithm.
Han et al. [10] modeled the extraction of the best features from gene expression data using gene-to-class sensitivity information. For this, they used k-means clustering to search for hidden patterns in the data sets at the initial level. Afterwards, binary particle swarm optimization (BPSO) was used to select the best feature set. In the same year, in [41], the authors presented a model that jointly uses a filter and a wrapper approach to select the features. Here, the F-statistic is used to filter the meaningful features, followed by maximum relevance binary particle swarm optimization (MRBPSO) as a wrapper method to obtain the best feature subset. Also, Mandal et al. [21] proposed a particle swarm optimization (PSO) technique based on inferred pathway activity for the analysis of features from gene expression data. The particles are initialized as binary bits, i.e., 0 and 1. To measure the quality of a particle, a single objective (T-score) is used as
the fitness. A support vector machine (SVM) classifier is used to obtain the outcome of the observed solutions. Further, the biological behavior of the selected features is analyzed. However, multiple objectives as well as multiple classifiers are required to optimize the approach. Moreover, Prasad et al. [24] proposed a recursive PSO technique to select a minimum number of features from a large data set. Initially, they used a filter approach to extract a subset of features based on ranking. Afterwards, PSO is deployed over the extracted features. The particles are randomly initialized as 0 and 1. To measure the performance of the particles, the accuracy obtained with a support vector machine classifier is used as the fitness function.
Zhang et al. [40] designed a model for feature selection from microarray data using information gain with an improved binary krill herd algorithm. The data set contains irrelevant features that impact the performance of the system. Therefore, information gain is applied over the data set to obtain the relevant features. A higher information gain score indicates a highly relevant feature, whereas a lower score indicates a less relevant one. Afterwards, the binary krill herd approach is employed over the features with high information gain scores. Each individual/krill is initialized as a random binary bit. Here, accuracy using the k-nearest neighbor approach is used as the fitness function. Ram et al. [25] suggested feature selection from microarray data based on the gravitational search algorithm. Here, the agents are randomly initialized as 0 and 1. The quality of each agent is determined by the accuracy using 5-fold cross validation with SVM. In addition, biological analysis is performed on the selected features.
Xu et al. [39] introduced a model for cancer classification based on the behavioral analysis of pathways, using the gene expression values and the interactions between genes from microarray data. In [29], the authors developed an approach using a multiobjective graph-theoretic method to select features from microarray data. It works in two stages: the maximum community or cluster of features is selected, and the Fisher score or node centrality of the features existing in the community is measured. In addition, spectral clustering over a protein-protein network via an affinity matrix using attributed graph embedding is proposed in [4]. Bakhshandeh et al. [3] aimed to detect the irrelevant features in a subset of features. Therefore, they proposed the symmetric uncertainty class-feature association map (SU-CFAM) method. Initially, they generated a similarity matrix using symmetric uncertainty, which is based on either the correlation between features or between a feature and the class. Later, they created clusters of features using a community detection algorithm. Further, the adjacency matrix of all the clusters is constructed and then the final subset of features is extracted. If two features are highly correlated, then they become redundant; also, if a feature and the class label are weakly correlated, then the feature becomes irrelevant.
In the literature [21, 25, 31], feature selection is performed for high-dimensional problems. In comparison, the suggested method uses a novel approach to enhance the performance for high-dimensional feature problems, as follows.
– In contrast to [8, 20, 26, 31], the proposed approach uses a dynamic scaling factor to maintain the exploration and exploitation of the search space in the DE while searching for better solutions.
– A novel fitness function is derived. This is in contrast to the fitness functions used in [20, 24, 40].
– The vector is designed in an efficient manner. This is in contrast to [21, 25, 26, 31].
– In contrast to [20, 21, 26], the proposed approach uses multiple classifiers due to the variance in the size of the features.
3 Problem representation and system model
3.1 System model and problem formulation
Let us assume a microarray dataset D = {c_ij | 1 ≤ i ≤ S, 1 ≤ j ≤ N} with S samples (rows) and N features or genes (columns). Generally, microarray data has a high number of dimensions or features/genes. The S samples are divided into two classes C_a and C_b, with S = |C_a| + |C_b|.
A pathway can be described as a set of samples with a selected number of genes having similar biological significance. A pathway P_i^T can be represented as P_i^T = {a_ij | 1 ≤ i ≤ S, 1 ≤ j ≤ Z_i}, Z_i < N. Given high-dimensional microarray data, the problem is to find a relevant and smaller set of features/genes to be used for further classification. The terminology used in the following sections is shown in Table 1.
3.2 Preprocessing
3.2.1 Information index classification
For data preprocessing, i.e., extracting genes from the large data matrix, the Information Index Classification (IIC) method is used. The dataset exists in matrix form, where the samples (with multiple classes) are the rows and the genes are the columns. For each column, the IIC value is calculated as in (1), where μ_gm and μ_gn represent the mean of the g-th gene in the m-th and n-th class respectively, σ_gm and σ_gn represent the standard deviation of the g-th gene for the m-th and n-th class respectively, and c denotes the number of classes.
IIC(g) = \sum_{m=1}^{c} \sum_{n=1,\, n \neq m}^{c} \left[ \frac{1}{2} \, \frac{|\mu_{gm} - \mu_{gn}|}{\sigma_{gm} + \sigma_{gn}} + \frac{1}{2} \ln\!\left( \frac{\sigma_{gm}^{2} + \sigma_{gn}^{2}}{2\,\sigma_{gm}\,\sigma_{gn}} \right) \right]    (1)
Table 1 Terminologies

Notations   Descriptions
μ_gn        The mean of the g-th gene for the n-th class.
σ_gn        The standard deviation of the g-th gene for the n-th class.
φ           Weight parameter.
g_pq        The expression level of the p-th sample in the q-th gene.
g_q         The expression level of the q-th gene over all samples.
C_a, C_b    Class a and Class b.
D           Microarray dataset.
N           Number of features.
P_i^T       The i-th pathway.
P_size      Population size in the DE.
S           Number of samples.
W           Total number of pathways.
Z_i         Number of features in P_i^T.
V_i         The i-th solution vector.
After calculating the IIC over the data matrix, the genes are sorted in decreasing order based on their IIC value. Now, x% of the genes (x < 10%) with the highest IIC values are selected, since high IIC values are highly representative of the feature set. Thus, N features (e.g., N = 1000 out of 12000) are selected for the further classification procedure.
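As an illustration, the IIC filter of (1) and the top-x% selection could be sketched as follows, assuming a NumPy matrix X of shape (samples × genes) and a label vector y; the small epsilon added to the denominators and the function names are implementation assumptions, not part of the paper.

```python
import numpy as np

def iic_scores(X, y, eps=1e-12):
    """IIC value of every gene (column of X) as in (1), summed over ordered class pairs."""
    scores = np.zeros(X.shape[1])
    classes = np.unique(y)
    for m in classes:
        for n in classes:
            if m == n:
                continue
            mu_m, mu_n = X[y == m].mean(axis=0), X[y == n].mean(axis=0)
            sd_m, sd_n = X[y == m].std(axis=0), X[y == n].std(axis=0)
            scores += 0.5 * np.abs(mu_m - mu_n) / (sd_m + sd_n + eps)
            scores += 0.5 * np.log((sd_m ** 2 + sd_n ** 2 + eps) / (2 * sd_m * sd_n + eps))
    return scores

def select_top_genes(X, y, fraction=0.08):
    """Column indices of the top x% of genes (x < 10%) ranked by decreasing IIC value."""
    k = max(1, int(fraction * X.shape[1]))
    return np.argsort(iic_scores(X, y))[::-1][:k]
```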
3.2.2 Normalization
After selecting the genes with high IIC values, these genes are imported into the pathway database (http://david.abcc.ncifcrf.gov/tools.jsp) to collect the pathway information. Each pathway contains a number of genes. The min-max normalization approach is used for each pathway and its corresponding genes. The min-max normalization is computed by (2), where g_pq is the expression level of the p-th sample in the q-th gene and g_q represents the expression level of the q-th gene over all samples.
\text{Normalize}(g_{pq}) = \frac{g_{pq} - \min(g_q)}{\max(g_q) - \min(g_q)}    (2)
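A minimal sketch of this gene-wise min-max normalization, assuming each pathway is a NumPy matrix P of shape (samples × genes); the guard for constant genes is an added assumption.

```python
import numpy as np

def min_max_normalize(P):
    """Normalize each gene (column) of a pathway matrix to [0, 1] as in (2)."""
    g_min, g_max = P.min(axis=0), P.max(axis=0)
    span = np.where(g_max > g_min, g_max - g_min, 1.0)  # avoid dividing by zero for constant genes
    return (P - g_min) / span
```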
4 An overview of differential evolution
Differential evolution (DE) [2, 18] is a population-based evolutionary technique which is broadly used to solve complex optimization problems. It consists of four steps, as in conventional evolutionary algorithms: initialization of the population of vectors, mutation based on vector differences, crossover, and selection. It starts with a randomly generated population of solution vectors with a predefined population size. In the population, each vector represents an individual solution. After initialization of the population, the fitness value of each individual vector is computed. Then the iterative process starts with mutation, crossover and selection to find better solutions. In each iteration (or generation), the vectors are updated until the termination condition. At the end, the final solution is identified on the basis of the fitness value. A pictorial representation of the proposed DE is given in Fig. 2.
Fig. 2 Flowchart of the differential evolution algorithm

To perform the DE operations, various schemes are available. A scheme is represented as "DE/x/y/z". Here, DE stands for differential evolution, and x represents the vector selected for the mutation operation; it can be a random vector or the best vector of the population. y stands for the number of difference vectors involved in the mutation, and z stands for the crossover method (it may be binomial, exponential, etc.). Some well-known DE schemes are DE/RAND/1/BIN, DE/RAND/2/BIN, DE/BEST/1/EXP, DE/BEST/2/EXP, etc. Here, BIN and EXP represent binomial and exponential crossover, respectively.
In the mutation operation (assume DE/BEST/1/BIN), a target vector (TV) that needs to be mutated is fixed. Then two vectors are randomly selected from the population to create a difference vector. The generated difference vector and the best vector are used in the mutation to generate a mutant or donor vector (DV). Next, crossover is performed between the DV and the TV to produce an offspring known as the trial vector (TLV). Then selection is performed between the TLV and the TV based on their fitness values; fitness is evaluated for both vectors. The TLV replaces the TV if the fitness of the TLV is better than that of the TV; otherwise, the TV remains in the population. The process repeats until a reasonable solution is found or a stopping criterion is met.
5 Proposed model
In this work, the microarray data is first preprocessed. The KEGG tool is used to extract the pathways. The proposed DE-based approach is then utilized over the data to select a relevant and smaller set of genes. The proposed DE uses three different cases of fitness. The features selected by the DE are then handed over to a classifier (e.g., MLP) to build the model. The overall framework of the proposed model is shown in Fig. 3 and discussed below.

Phase 1: Initially, the large data is preprocessed to identify the non-redundant features. Here, information index classification (IIC) (discussed in Section 3.2.1) is employed to select a smaller subset of features from the data set with a large volume of features. Given the microarray data set of N genes, a smaller number (say N_a, N_a < N) of genes or features are extracted by using the IIC.
Phase 2: After preprocessing, the extracted N_a genes are imported into the public domain database, the DAVID tool (http://david.abcc.ncifcrf.gov/tools.jsp), to extract the pathways. Here, the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database is used, as in [1, 9, 35]. Let us assume W pathways Φ = {P_1^T, P_2^T, ..., P_W^T} are extracted.
Phase 3: Next, normalization (discussed in Section 3.2.2) is performed over the extracted pathway information.
Phase 4: Now, the proposed dynamic scaling factor based DE is employed on the extracted pathways (Φ) to select a smaller number of genes with a high accuracy value. The phases of the proposed DE are discussed in detail in Sections 5.1 to 5.6.
Phase 5: The features selected by the proposed DE are then handed over to the classifier to build the classifier model, as discussed in Section 5.7.
5.1 Vector initialization
A vector should always produce a valid solution. Here, the vectors have the same length as the number of pathways identified in the preprocessing step through KEGG, i.e., W. Let the i-th vector V_i be represented as V_i = {v_i1, v_i2, ..., v_iW}.
Fig. 3 Proposed model based on the pathway scheme
Each element v_ik, 1 ≤ k ≤ W, is initialized by a randomly generated number rand(0,1), 0 ≤ rand(0,1) ≤ 1.0, i.e., v_ik = rand(0,1). The element value v_ik of the vector V_i indicates whether the pathway P_k^T is selected or not: P_k^T is selected only if v_ik > 0.5. A population of P_size vectors is randomly generated as given in Algorithm 1.
Illustration 1. Assume a set of 10 pathways, Φ = {P_1^T, P_2^T, ..., P_10^T}, as shown in Fig. 4. Therefore, the size of the vector is 10. Now, a random number rand(0,1) is generated for each element of the vector as mentioned above. Let us assume the generated numbers are as shown in Fig. 4. It can be observed that the pathway P_2^T is selected as v_i2 > 0.5. Similarly, the pathways P_4^T, P_6^T, P_7^T, and P_10^T are also selected.
Algorithm 1 Generate Population(POP).
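The vector encoding and pathway decoding described above can be sketched as follows; the function names, the random-seed handling and the use of NumPy are illustrative assumptions, not the authors' Algorithm 1.

```python
import numpy as np

def generate_population(p_size, W, rng=None):
    """P_size real-valued vectors, one element in [0, 1] per pathway."""
    rng = np.random.default_rng(rng)
    return rng.random((p_size, W))

def selected_pathways(vector, threshold=0.5):
    """Indices k with v_ik > 0.5, i.e. the pathways encoded as selected."""
    return np.flatnonzero(vector > threshold)

# Example matching Illustration 1: a population over W = 10 pathways.
pop = generate_population(p_size=50, W=10, rng=42)
print(selected_pathways(pop[0]))
```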
5.2 Fitness computation
The quality of the vectors is computed by the derived fitness function. Here, three cases are considered to evaluate the vectors. In the first case the T-score and in the second case the classification accuracy (CA) is used as the objective function. In the third case, the fitness function is derived by a weighted sum approach (WSA) using both the T-score and CA.
1. Case 1 (T-score): The T-score is applicable for observing the variation of the data points of an observation. Basically, it focuses on the mean of a distribution and measures how much a data point deviates from that mean. Therefore, the T-score is computed over the expressions of the genes contained in the selected pathways. The objective is represented by (3):

Minimize  T(P_{scheme}) = \frac{\mu_a - \mu_b}{\sqrt{\frac{\sigma_a^2}{S_a} + \frac{\sigma_b^2}{S_b}}}    (3)

where P_scheme is the pathway scheme, i.e., the final set of pathways selected by the DE, μ_x represents the mean and σ_x the standard deviation of the samples of class x ∈ {a, b}, and S_x indicates the number of samples of the corresponding class. The pseudo-code to calculate the fitness using the T-score is given in Algorithm 2.
2. Case 2 (Classification Accuracy (CA)): After selecting the pathways, the k-fold cross validation technique is used with the MLP classifier to evaluate the fitness value. Normally, the value of k is 10 or less. A higher value of k is less biased but has higher variability; a smaller value tends towards a simple validation-set approach, while a very high value leads towards LOOCV (leave-one-out cross validation). Here, 5-fold cross validation is used. In this approach, the pathway matrix is randomly partitioned with respect to the samples into 5 subsets, comprising training and testing subsets. Four of the five subsets are used for training and one subset is used for testing, as shown in Fig. 5.
Fig. 4 A random vector initialization for ten pathways
Algorithm 2 Fitness Case1.
Here, the MLP is employed on the four training subsets. This is repeated 5 times and the mean accuracy is taken. The objective is represented by (4):

Maximize  CA = \frac{\sum_{i=1}^{5} Ac_i}{5}    (4)

The CA can be calculated using an algorithm similar to Algorithm 2.
3. Case 3 (Weighted Sum Approach (WSA)): In this case, the fitness function is derived using both parameters, the T-score and CA, given in (3) and (4) respectively. Here, a weighted sum approach is used to combine the objectives as follows (a sketch of this fitness evaluation is given after Fig. 5):

Maximize  WSA = \varphi_1 \times (1 - T(P_{scheme})) + \varphi_2 \times CA    (5)

where φ_1 and φ_2 are the weight parameters, φ_1 + φ_2 = 1, 0 ≤ φ_1, φ_2 ≤ 1. The parameters φ_1 and φ_2 are tested with different combinations of values to fix the final vector.
Fig. 5 Fitness for CA using 5-fold cross validation
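As referenced in Case 3, the three fitness cases can be sketched as follows, assuming X_sel holds the expression values of the genes in the selected pathways and y holds binary 0/1 class labels. The pooled two-sample form of the T-score, the absolute value, the MLP hidden-layer size and the φ values are illustrative assumptions, and scikit-learn stands in for the authors' R implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def t_score(X_sel, y, a, b):
    """Case 1: two-sample T-statistic over the expressions of the selected pathway genes."""
    Xa, Xb = X_sel[y == a], X_sel[y == b]
    return abs(Xa.mean() - Xb.mean()) / np.sqrt(Xa.std() ** 2 / Xa.shape[0]
                                                + Xb.std() ** 2 / Xb.shape[0])

def classification_accuracy(X_sel, y):
    """Case 2: mean accuracy of a 5-fold cross-validated MLP, as in (4)."""
    clf = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic", max_iter=500)
    return cross_val_score(clf, X_sel, y, cv=5).mean()

def wsa_fitness(X_sel, y, a, b, phi1=0.4, phi2=0.6):
    """Case 3: weighted sum of (1 - T) and CA as in (5), with phi1 + phi2 = 1."""
    return phi1 * (1.0 - t_score(X_sel, y, a, b)) + phi2 * classification_accuracy(X_sel, y)
```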
5.3 Mutation
Mutation is conducted for each target vector (TV) of the population. To mutate a target vector, a donor vector (DV) is generated for it. There are various mutation and crossover schemes in the literature, as mentioned in Section 4. Here, the DE/RAND/1/BIN scheme is used to illustrate the phases. The donor vector DV_i(g) at the g-th generation is generated as follows:

\vec{DV}_i(g) = \vec{X}_r(g) + F \cdot \{\vec{X}_s(g) - \vec{X}_t(g)\}    (6)

Here, first a random vector X_r(g) is selected from the population, and then two other random vectors X_s(g) and X_t(g) are selected such that r ≠ s ≠ t (as per the DE/RAND/1/BIN scheme). In (6), (X_s(g) − X_t(g)) is known as the difference vector. In the case of the DE/BEST/1/BIN scheme, instead of a random vector X_r(g), the best vector X_BEST(g) is selected for the mutation. In the simulation, the DE/RAND/1/BIN scheme is used.
F is the scaling factor. Generally, F belongs to the range [0.4, 1.0]. In this work, F is dynamically changed, as discussed in Section 5.6. Now DV_i(g) is employed to create the child vector by the crossover operation described in the following section.
Illustration 2. Let us assume a population of eight vectors as shown in Fig. 6. Let X_1 be the target vector and let the three randomly selected vectors be X_3, X_5 and X_7. Now the donor vector is generated by (6).
Fig. 6 An example of the mutation operation. The 3rd, 5th and 7th vectors are selected for mutation. Only the computation of the first component of the donor vector is shown
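A minimal sketch of the DE/RAND/1 donor-vector construction in (6); clipping the donor back into the [0, 1] encoding range is an added assumption not stated in the paper.

```python
import numpy as np

def mutate(pop, i, F, rng):
    """Donor vector for the i-th target: X_r + F * (X_s - X_t) with distinct r, s, t != i."""
    r, s, t = rng.choice([k for k in range(len(pop)) if k != i], size=3, replace=False)
    donor = pop[r] + F * (pop[s] - pop[t])
    return np.clip(donor, 0.0, 1.0)  # keep every element inside the [0, 1] encoding range
```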
5.4 Crossover
The crossover is accomplished between a target vector TV_i(g) = {v_i1(g), v_i2(g), ..., v_iW(g)} and the corresponding donor vector DV_i(g) = {d_i1(g), d_i2(g), ..., d_iW(g)} to produce an offspring vector TLV_i(g) = {t_i1(g), t_i2(g), ..., t_iW(g)}. Here, binomial (BIN) crossover is applied with a predefined crossover rate C_r. The j-th element of the TLV vector is generated as follows:

t_{ij}(g) = \begin{cases} d_{ij}(g), & \text{if } rand_j \le C_r \\ v_{ij}(g), & \text{otherwise} \end{cases}    (7)
Illustration 3. It can be observed from Fig. 7 that each element of the TLV is selected from either the DV or the TV based on the corresponding random number. For example, the first element of the TLV is the same as the first element of the DV because the random number (rand_1) is less than C_r.
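A minimal sketch of the binomial crossover in (7); forcing at least one donor component is a common DE convention added here as an assumption, not something stated in the paper.

```python
import numpy as np

def binomial_crossover(target, donor, Cr, rng):
    """Trial vector: take the donor component wherever rand_j <= Cr, otherwise keep the target's."""
    take_donor = rng.random(len(target)) <= Cr
    take_donor[rng.integers(len(target))] = True  # guarantee the trial differs from the target
    return np.where(take_donor, donor, target)
```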
5.5 Selection
In the selection phase, it is decided which one among the target vector and the newly generated child vector will survive in the next generation. The decision is taken based on the fitness values as follows:

\vec{TV}_i(g+1) = \begin{cases} \vec{TLV}_i(g), & \text{if } Fitness(\vec{TLV}_i(g)) \ge Fitness(\vec{TV}_i(g)) \\ \vec{TV}_i(g), & \text{otherwise} \end{cases}    (8)

The mutation, crossover and selection operations are iterated until the termination criterion is met. Here, the termination criterion is a predefined number of iterations.
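The greedy selection of (8) then reduces to a single comparison; in the sketch below, fitness may be any of the objective functions sketched earlier (the function names are assumed for illustration).

```python
def select(target, trial, fitness):
    """Keep the trial vector only if it is at least as fit as the target, as in (8)."""
    f_target, f_trial = fitness(target), fitness(trial)
    return (trial, f_trial) if f_trial >= f_target else (target, f_target)
```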
5.6 Updating the dynamic scaling factor (F)
Fig. 7 Crossover operation

The parameter F plays a crucial role in the DE algorithm. In conventional DE, the value of F is fixed for every iteration. An inherent drawback of any population-based stochastic evolutionary algorithm is premature convergence [33]. Moreover, an evolutionary algorithm is considered efficient if it can balance the exploration and exploitation of the search space. In this regard, many researchers have suggested a dynamic scaling factor [30, 36]. Here, a dynamic scaling factor is also employed to overcome these issues. The scaling factor F is dynamically updated with a new value for each solution to control the trade-off between exploration and exploitation in the search space. Initially, F is randomly generated in the range [-0.8, 0.8], and this range is gradually reduced to [-0.4, 0.4]. It can be observed that the tendency towards exploration is higher in the initial iterations and exploitation increases afterwards. Hence, the mechanism balances exploration and exploitation.
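The paper gives the initial range [-0.8, 0.8] and the final range [-0.4, 0.4] but not the exact schedule, so the sketch below assumes a linear shrinkage of the half-range over the iterations and ties together the helper functions sketched in Sections 5.1 to 5.5; pathways is assumed to be a list of gene-column index arrays, one per extracted KEGG pathway.

```python
import numpy as np

def dynamic_F(iteration, max_iter, rng, start=0.8, end=0.4):
    """Draw F from a symmetric range whose half-width shrinks (here: linearly) from 0.8 to 0.4."""
    half = start - (start - end) * iteration / max_iter
    return rng.uniform(-half, half)

def run_de(X, y, a, b, pathways, p_size=50, max_iter=100, Cr=0.8, seed=0):
    """Skeleton of the proposed DE loop (dynamic F, DE/RAND/1/BIN, greedy selection)."""
    rng = np.random.default_rng(seed)
    pop = generate_population(p_size, len(pathways), rng)

    def fit(v):
        sel = selected_pathways(v)
        if sel.size == 0:
            return -np.inf                      # an empty pathway selection is never kept
        genes = np.unique(np.concatenate([pathways[k] for k in sel]))
        return wsa_fitness(X[:, genes], y, a, b)

    fitness = np.array([fit(v) for v in pop])
    for g in range(max_iter):
        F = dynamic_F(g, max_iter, rng)
        for i in range(p_size):
            donor = mutate(pop, i, F, rng)
            trial = binomial_crossover(pop[i], donor, Cr, rng)
            f_trial = fit(trial)
            if f_trial >= fitness[i]:           # selection, as in (8)
                pop[i], fitness[i] = trial, f_trial
    return pop[np.argmax(fitness)]              # fittest vector = final pathway selection
```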
The pseudo code of the proposed DE is given in Algorithm 3 and the corresponding flowchart is given in Fig. 8. After termination, the fittest vector of the population is returned as the final solution vector.
Remark 5.1 A vector can be generated in O(W) time. Therefore, the initial population can be generated in O(P_size × W) time. Then, in the iterative process, a donor vector can be generated for each target vector in O(W) time (line 9 of Algorithm 3), and the crossover also takes O(W) time (lines 11 to 17). The selection operation requires the computation of the fitness value of the new child vector; the fitness of the child vector can be computed in O(W × S) time, where S is the sample size. The subsequent selection takes O(1) time to identify the better of the target and child vectors (lines 19 to 23). Therefore, the overall time complexity of the DE can be computed as O(P_size × W) + O(I × P_size × (W + W·S)), i.e., O(I × P_size × W × S).
5.7 Machine learning classifier
In recent times, machine learning has had a deep impact on the field of data science. It has the ability to deal with experiences, observations and instructions, in the form of data, for correct prediction. There are many machine learning algorithms, which are used in fields such as data mining, face recognition, handwriting recognition and bioinformatics [13, 15]. Here, the proposed work employs the multilayer perceptron (MLP) neural network classifier.
Initially, the MLP with 5-fold cross validation is used to compute the objective function in the fitness evaluation, as shown in Section 5.2.
Fig. 8 Flowchart of the dynamic scaling factor based DE
Algorithm 3 Proposed differential evolution.
Then, the MLP is further applied over the genes selected by the DE to build a classifier model. The performance of the classifier model is evaluated in the simulation analysis phase in terms of sensitivity (SN), specificity (SP), accuracy (AC), and F-score (FS). The MLP consists of several layers, namely the input layer (IL), hidden layers (HLs) and output layer (OL). It has the potential to handle complex datasets and obtain the desired output with maximum accuracy. The layers are connected with each other. In general, after feeding the data to the input layer, it goes towards the hidden layer through a combination of weights and biases, and is then passed through an activation function. In this case, the sigmoid activation function is used, which converges towards the desired output. The overall flow of the MLP is depicted in Fig. 9. Here, I_n represents the inputs of the input-layer neurons, and w_n and β_n represent the weights. A single hidden layer with two neurons is used. Each hidden neuron computes the weighted sum of its inputs plus a bias value, after which the activation function is applied to the summed value. Hence, the output is either output 1 (O_1) or output 2 (O_2).
Fig. 9 Multilayer perceptron neural network
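A small sketch of building and evaluating such an MLP (one hidden layer of two sigmoid neurons) on the genes selected by the DE, assuming binary 0/1 labels; scikit-learn is used purely for illustration, since the paper reports an R implementation, and specificity would need a custom scorer (it is computed from the confusion matrix in Section 6.2).

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate

def evaluate_classifier(X_sel, y, folds=10):
    """Cross-validated MLP with one hidden layer of two sigmoid neurons on the selected genes."""
    clf = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic", max_iter=1000)
    scores = cross_validate(clf, X_sel, y, cv=folds, scoring=("accuracy", "recall", "f1"))
    return {name: vals.mean() for name, vals in scores.items() if name.startswith("test_")}
```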
6 Data analysis and simulation results
6.1 Overview of Datasets
In this article, three real-life gene datasets are used for the simulation analysis. The datasets can be fetched from the website www.biolab.si/supp/bi-cancer/. An overview of the considered datasets follows.
– Prostate: This dataset is for prostate tumors. It consists of 102 samples with a total of 12533 genes or features per sample. The samples are split into two classes: a normal class with 50 samples and a tumor class with 52 samples.
– DLBCL: This dataset is for B-cell lineage malignancies. It consists of two different B-cell classes: diffuse large B-cell lymphoma (DLBCL) and follicular lymphoma (FL). It contains a total of 77 samples with 7070 genes. The DLBCL class consists of 58 samples and the remaining 19 samples belong to the FL class.
– Child ALL: This is an acute lymphoblastic leukemia gene set. It contains 8280 genes with 110 samples, which are divided into two classes based on before and after treatment, regardless of the type of treatment. The first 50 samples are before therapy and the other 60 samples are after therapy.
6.2 Simulation environment
The simulation has been done on a system with an Intel i7 8th-generation processor, 8 GB of RAM and Windows 10 as the operating system. The proposed DE is implemented using the R language and the simulation results are plotted using MATLAB. The proposed DE algorithm is evaluated with three cases of fitness functions: T-score, classification accuracy (CA) and WSA. In the rest of the paper, the DE with the T-score fitness function is denoted as DETS and, similarly, DECA and DEWSA.
CA is computed using the multi-layer perceptron. Also, the scaling factor (F) of the DE is dynamically updated in each iteration. For the sake of comparison, similar existing works using PSO [21], GA [31] and GSA [25] are also executed. Here, microarray-based gene expression data is taken for the experimental analysis. After executing the proposed DE, the Wilcoxon rank sum test [5] is applied to obtain a P-value for each pathway. Then, the top 50% of pathways are extracted based on the ascending order of P-values and evaluated by 10-fold cross validation with various machine learning classifiers (k-nearest neighbor (k-NN), Naïve Bayes (NB), support vector machine (SVM) and multi-layer perceptron (MLP)) to obtain the sensitivity (SN), specificity (SP), accuracy (AC), and F-score (FS), respectively, by (9)-(13), which are derived from the confusion matrix [23].
\text{Accuracy} = \frac{\tau_p + \tau_n}{\tau_p + \tau_n + f_p + f_n}    (9)

F\text{-score} = \frac{2 \times P \times R}{P + R}    (10)

\text{Sensitivity or Recall } (R) = \frac{\tau_p}{\tau_p + f_n}    (11)

\text{Specificity} = \frac{\tau_n}{\tau_n + f_p}    (12)

\text{Precision } (P) = \frac{\tau_p}{\tau_p + f_p}    (13)
Here, τ_p stands for true positive, τ_n for true negative, f_p for false positive, and f_n for false negative.
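For illustration, the metrics in (9)-(13) can be computed directly from the binary confusion matrix as follows; scikit-learn's confusion_matrix and 0/1 labels are assumed.

```python
from sklearn.metrics import confusion_matrix

def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and F-score of a binary prediction, as in (9)-(13)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)                                       # recall, (11)
    specificity = tn / (tn + fp)                                       # (12)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                         # (9)
    precision = tp / (tp + fp)                                         # (13)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)  # (10)
    return sensitivity, specificity, accuracy, f_score
```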
The parameters used in the proposed work are listed in Table 2. Note that the considered parameters are the same as those taken in [17], [11] and [27] for PSO, GA and GSA, respectively.
6.3 Simulation results
The outcomes of the simulation in terms of sensitivity (SN), specificity (SP), accuracy (AC), and F-score (FS) for the different classifiers are given in Tables 3, 4, 5 and 6, which describe the comparative analysis of 10-fold cross validation with the MLP, k-NN, SVM and NB classifiers respectively. It can be observed that the proposed approach (DEWSA) performs better than DECA, DETS and the existing techniques such as PSO [21], GA [31] and GSA [25] in terms of SN, SP, AC and FS for all the data sets. DECA performs similarly to PSO, and DETS behaves similarly to GA and GSA. The rationale behind these outcomes is that the vector operates on subsets of pathways, each containing a different number of features/genes with similar behavior. Hence, it reduces the computation effort and the performance of the system is enhanced. Also, the weight parameters φ_1, φ_2 and the scaling factor (F) are tuned efficiently in each generation. The dynamic update of the scaling factor (F) helps the proposed DE-based work to reach a better solution by balancing the exploration and exploitation of the search space. On the other hand, the MLP classifier gives better outcomes than the other classifiers (SVM, k-NN and NB) because the MLP has a strong capability to handle complex solutions.
Moreover, the comparative analysis of iteration versus average fitness is plotted for all datasets, as shown in Fig. 10. It can be observed that the proposed approach dominates all the other approaches due to the efficient design of the solution vector and the dynamic update of the scaling factor (F) in each generation.
Table 2 Parameters setup

Parameter                 Proposed       PSO [17]                 GA [11]    GSA [27]
Iteration                 100            100                      100        100
Population size           50             50                       50         50
Crossover rate (C_r), F   0.8, Dynamic   NA                       0.8, NA    NA
Mutation rate (M_rate)    NA             NA                       0.1        NA
c1, c2, w, α, G0          NA             1.4, 1.4, 0.79, NA, NA   NA         20, 100
Table 3 Simulation result of 50% pathway for SN, SP, AC and FS by MLP
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.89 0.90 0.91 0.88 0.88 0.88 0.90 0.89 0.89 0.88 0.89 0.88
DECA 0.88 0.88 0.89 0.88 0.86 0.84 0.88 0.86 0.87 0.85 0.86 0.85
DETS 0.85 0.86 0.85 0.84 0.80 0.81 0.81 0.82 0.73 0.76 0.76 0.76
PSO[21] 0.88 0.89 0.89 0.88 0.84 0.82 0.85 0.86 0.85 0.83 0.82 0.83
GA[31] 0.84 0.83 0.80 0.82 0.78 0.76 0.79 0.74 0.72 0.72 0.70 0.72
GSA[25] 0.83 0.80 0.82 0.81 0.74 0.74 0.76 0.75 0.70 0.73 0.71 0.73
It should be noted that the number of features plays a vital role in the performance of the classifiers (MLP, SVM, k-NN, and NB). After the selection of fewer, relevant pathways by the DE, the classifiers are used to generate the classification model. Here, the classifiers are applied while varying the percentage of selected pathways. The resulting accuracies are plotted against the percentage of pathways for the different classifiers in Fig. 11. It can be observed that the accuracy of the SVM, k-NN and NB classifiers varies as the percentage of pathways is varied. The behavior of the MLP for different data sizes can be seen in Fig. 11(a): the accuracy of the MLP varies comparably less than that of SVM, k-NN and NB for all datasets. SVM can deal with large datasets and its risk of overfitting is low, but an important factor for SVM is the choice of kernel function; here, a linear kernel is used. The NB classifier can deal with both small and large amounts of data. Thus, from Fig. 11(b) and (d), it can be seen that the accuracy increases with the feature size, i.e., a larger number of features produces higher accuracy. k-NN is not efficient for large datasets; moreover, it needs feature scaling to accurately predict the instances, and it is quite sensitive to noisy data and missing values. For this reason, the noisy data has already been pruned using the filter approach. From Fig. 11(c), it can be noticed that a smaller number of features provides higher accuracy. Thus, after applying the different classifiers to different feature set sizes, it can be noticed that the MLP performs better than the other classifiers.
Table 4 Simulation result of 50% pathway for SN, SP, AC and FS by k-NN
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.88 0.88 0.86 0.84 0.84 0.86 0.86 0.86 0.86 0.84 0.85 0.84
DECA 0.84 0.83 0.84 0.84 0.82 0.82 0.80 0.82 0.82 0.80 0.81 0.80
DETS 0.79 0.74 0.78 0.74 0.66 0.67 0.73 0.75 0.72 0.65 0.72 0.72
PSO[21] 0.81 0.80 0.84 0.82 0.78 0.78 0.79 0.80 0.78 0.77 0.77 0.76
GA[31] 0.78 0.73 0.74 0.72 0.69 0.70 0.72 0.72 0.71 0.70 0.72 0.71
GSA[25] 0.76 0.75 0.72 0.70 0.64 0.68 0.69 0.69 0.70 0.68 0.70 0.71
Table 5 Simulation result of 50% pathway for SN, SP, AC and FS by SVM
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.85 0.84 0.84 0.82 0.82 0.80 0.81 0.84 0.74 0.72 0.72 0.72
DECA 0.81 0.80 0.80 0.78 0.78 0.76 0.78 0.80 0.72 0.70 0.69 0.70
DETS 0.78 0.76 0.74 0.74 0.72 0.70 0.74 0.75 0.70 0.68 0.65 0.65
PSO [21] 0.80 0.80 0.78 0.78 0.74 0.75 0.76 0.76 0.70 0.70 0.65 0.70
GA [31] 0.76 0.74 0.73 0.74 0.70 0.70 0.72 0.72 0.68 0.69 0.65 0.65
GSA [25] 0.75 0.75 0.72 0.70 0.70 0.68 0.65 0.64 0.70 0.68 0.68 0.70
6.4 Analysis of variance (ANOVA)
Analysis of variance (ANOVA) is a technique to compare the means of samples and to determine whether they are equivalent or not. It involves two hypotheses: the null hypothesis (H_null) and the alternate hypothesis (H_alt). The respective hypotheses are defined as

H_{null}: \mu_{DEWSA} = \mu_{PSO} = \mu_{GA} = \mu_{GSA}    (14)

H_{alt}: \mu_{DEWSA} \neq \mu_{PSO} \neq \mu_{GA} \neq \mu_{GSA}    (15)

The null hypothesis is accepted if the means of all the samples are equal; otherwise, the alternate hypothesis is accepted. Normally, the output of ANOVA depends on the F-statistic, the F-critical value and the P-value. If the value of the F-statistic is greater than the F-critical value and the value of α (chosen by the user) is greater than the P-value, then the null hypothesis is rejected; otherwise, it is accepted. In this paper, the ANOVA statistical test is performed between DEWSA, PSO, GA, and GSA.
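A minimal sketch of this one-way ANOVA over the per-algorithm accuracy samples, using SciPy for illustration; comparing the p-value with α is equivalent to comparing the F-statistic with the F-critical value.

```python
from scipy import stats

def anova_on_accuracies(acc_dewsa, acc_pso, acc_ga, acc_gsa, alpha=0.05):
    """One-way ANOVA over the accuracy samples of the four algorithms (H_null: equal means)."""
    f_stat, p_value = stats.f_oneway(acc_dewsa, acc_pso, acc_ga, acc_gsa)
    return f_stat, p_value, p_value < alpha  # True -> reject H_null: the means differ
```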
Here, ten accuracy samples of each algorithm are taken. The value of alpha (α) is chosen as 0.05, a standard significance level. The input for the ANOVA test is given in Table 7 and the output of the ANOVA is shown in Table 8. It is found that the F-statistic is greater than the F-critical value and the chosen value of alpha is larger than the P-value. Therefore, the null hypothesis is rejected, and it can be concluded that the means of the accuracy samples of the selected algorithms differ.
Table 6 Simulation result of 50% pathway for SN, SP, AC and FS by NB
Algorithm Prostate DLBCL Child ALL
SN SP AC FS SN SP AC FS SN SP AC FS
DEWSA 0.75 0.78 0.78 0.76 0.80 0.78 0.78 0.80 0.72 0.70 0.70 0.68
DECA 0.72 0.76 0.76 0.74 0.77 0.76 0.76 0.78 0.70 0.68 0.68 0.68
DETS 0.68 0.72 0.72 0.70 0.71 0.72 0.72 0.72 0.67 0.64 0.62 0.64
PSO[21] 0.70 0.74 0.74 0.73 0.75 0.74 0.74 0.76 0.69 0.66 0.65 0.65
GA[31] 0.65 0.70 0.70 0.70 0.70 0.68 0.70 0.68 0.65 0.65 0.64 0.64
GSA[25] 0.68 0.70 0.68 0.68 0.74 0.72 0.72 0.70 0.64 0.65 0.65 0.65
Fig. 10 Iteration vs. average fitness for (a) Prostate, (b) DLBCL and (c) Child ALL data. DEWSA beats the other approaches in all data sets due to its novelty
However, ANOVA can only indicate that differences are statistically significant; it cannot show which samples or groups are distinct from the others. Therefore, the least significant difference (LSD) post-hoc test is performed to identify the groups that differ significantly from the other groups. The LSD test results are shown in Table 9. The LSD post-hoc test states that two groups do not differ significantly from each other if the confidence interval contains zero. From Table 9, it is found that the intervals, i.e., the lower and upper bounds of the mean differences, do not contain zero for DEWSA versus PSO, GA and GSA. Thus, the condition is satisfied in our case. Therefore, the ANOVA test followed by the LSD post-hoc test clearly shows statistically significant differences between the accuracy samples of the different algorithms.
6.5 Biological importance
In this section, the biological significance of the selected relevant pathway genes is analyzed. The best proposed technique (DEWSA) is executed ten times and a set of ten genes is obtained; a gene that is repeated at least five times is selected as a better gene. Afterwards, the heat-map of the selected features for each data set is plotted. The related genes as well as the related diseases of the selected genes are explored using the gene database www.disgenet.org.
Fig. 11 Accuracy by varying the % of the pathways for the (a) MLP, (b) SVM, (c) k-NN and (d) NB classifiers
The gene symbols and their short descriptions are presented in Tables 10, 11 and 12. Due to the large number of pathway genes, only some allied genes, with their specific PubMed citation (SPCIN) [1], are studied.
Table 10 shows the results for the Prostate data. It can be seen that only five genes are found after executing the proposed (DEWSA) technique. The related diseases as well as the allied genes, with their short descriptions and symbols, are also given in Table 10.
Table 7 Input for ANOVA test
Factors Count Sum Mean SD 95% interval(Lower and Upper Bound)
DEWSA 10 8.74 .8740 .05502 (.8346,.9134)
PSO 10 7.79 .7790 .08212 (.7203,.8377)
GA 10 6.96 .6960 .06637 (.6485,.7435)
GSA 10 7.21 .7210 .07216 (.7104,.8224)
Table 8 Output for ANOVA test
Groups Sum of Squares df Mean Square F-statistic P-value F-critical
Between Groups 0.124 3 0.039334 4.32461 0.031471 3.46724
Within Groups 0.271 34 0.005325
Total 0.326 37
Table 9 LSD Post-hoc test
Difference of levels Difference of Means Standard Error 95% interval(Lower and Upper Bound)
DEWSA - PSO .09500 .03235 .0342,.1462
DEWSA - GA .11600 .03235 .0258,.1334
DEWSA - GSA .09800 .03235 .0217,.1423
Table 10 Biological analysis for Prostate data
Selected gene Symbol Description Related Disease Allied Genes(SPCIN)
37639at , HPN Hepsin Prostate carcinoma RAFI(7),IGFI(69),KLK3(776)
41288at CALM1 Calmodulin Carcinogenesis RAFI(24),KLK3(5),IGFI(43)
31527at RPS2 Ribosomal protein Prostatic disease KLK3(11),IGFI(1)
39939at COL4A6 Collagen Tumor Progression KLK3(14),PXN(3)
38634at RBP1 Retinol protein Malignant prostate INS(31),FLNA(2),KLK3(781)
Table 11 Biological analysis for DLBCL data
Selected gene Symbol Description Related Disease Allied Genes(SPCIN)
X56494at PKML Pyruvate Kinase Anemia TP53(32),AKTI(2)
X16983at ITGA4 Integrin alpha 4 B-lymphoma MAPK3(2),FASLG(1)
D87119at TRIB2 Tribbles homolog 2 T-lymphoma HRAS(1),AKTI(9),
X62078at GM2A Ganglioside activator Lymphoma FASLG(17),MAPK3(5),
Table 12 Biological analysis for Child ALL data
Selected gene Symbol Description Related Disease Allied Genes(SPCIN)
38464at GCS1 Glucosidase 1 Acute leukemia CDKI(4),PLKI(4)
39994at CCR1 Chemokine Cardiovascular disease CHUK(4),CXCR2(11)
32264at GZMM Granzyme M Carcinogenesis PLKI(15),CXCR2(5)
36651at ACP2 Acid phosphatase 2 Tumor progression ITGA6(1),PLKI(9)
Fig. 12 Heat-maps for (a) Prostate data, (b) DLBCL data and (c) Child ALL data. The x-axes represent the features and the y-axes represent the different samples

The heat-maps of the selected genes are plotted as shown in Fig. 12(a). A heat-map shows the genes on the x-axis against the samples of the classes on the y-axis. The expression level of the genes is indicated with different colors: green, red and black. The green color describes a low expression value, the red color describes a high expression value, and
the black color describes the absence of expression values. Lower expression implies normal samples and higher expression implies tumor samples. The selected genes (41288at, 31527at, 39939at) show high expression and the genes (37639at, 38634at) show low expression.
It can be observed from Table 11 that only four genes are selected after executing the process on the DLBCL dataset. It can also be noticed from Fig. 12(b) that the genes (X56494at, X16983at) show high expression while D87119at and X62078at show low expression. Similarly, four genes are selected for the Child ALL data as shown in Table 12. It can be seen from Fig. 12(c) that the genes (38464at, 39994at, 32264at and 36651at) show low expression.
7 Conclusion
In this work, a model based on a dynamic scaling factor differential evolution (DE) algorithm with a multi-layer perceptron is designed to select the relevant pathway genes from a high volume of gene expression data. The vectors are efficiently represented. Here, two objectives, the T-score and the classification accuracy (CA), are considered to compute the fitness function. These objectives are used in three separate cases. In case one, the T-score is taken as the objective function and the approach is denoted as DETS. In case two, the classification accuracy (CA) is considered as the objective function and it is denoted as DECA. In case three, a weighted sum approach (WSA) combining both objectives, i.e., the T-score and the classification accuracy (CA), is considered and it is denoted as DEWSA. After execution, it is found that the proposed approach (DEWSA) performs better than DETS, DECA and the other existing approaches (PSO, GA and GSA) in terms of sensitivity, specificity, accuracy and F-score. It can be observed that DEWSA, DECA and PSO behave similarly, while DETS, GA and GSA behave similarly for all datasets (Prostate, DLBCL and Child ALL) when applying the different classifiers (MLP, k-NN, SVM and NB). DEWSA, DECA and PSO achieve higher values of sensitivity, specificity, accuracy and F-score when using the MLP classifier, with a variance of 1%-2% for the Prostate data and 2%-2.5% for the DLBCL and Child ALL datasets; with k-NN the margin is 3%-4%, with NB 1.5%-2% for all data sets, and with SVM 3%-4% for Prostate and DLBCL and 1.5%-2% for the Child ALL dataset. On the other hand, DETS, GA and GSA perform similarly, with lower values of sensitivity, specificity and accuracy for all datasets with all classifiers. Note that the outcomes vary due to the dimension of the genes/features. Hence, it is concluded that DEWSA performs considerably better than the other approaches (DETS, DECA, PSO, GA and GSA), achieving higher values of sensitivity, specificity, accuracy and F-score with the MLP classifier.
Moreover, biological analysis is performed on the selected features and heat-maps are presented. To show the statistical significance of the proposed algorithm (DEWSA) over the existing approaches (PSO, GA and GSA), an analysis of variance (ANOVA) is also carried out.
The suggested approach may be useful in the health sector to diagnose diseases. It can also be used in various other fields where feature selection is required. Note that only binary-class data is considered here. In the future, another model for multi-class data sets can be designed. Moreover, a multi-objective optimization technique with various evolutionary algorithms may also be developed.
Declarations
Conflict of Interests The authors declare that they have no conflict of interest. The research work of this
article is not funded by any organizations/agencies.