Entropy — Article

IMMIGRATE: A Margin-based Feature Selection Method with Interaction Terms

Ruzhang Zhao 1, Pengyu Hong 2,* and Jun S. Liu 3,*

1 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA; rzhao@jhu.edu
2 Department of Computer Science, Brandeis University, Waltham, MA 02453, USA
3 Department of Statistics, Harvard University, Cambridge, MA 02138, USA
* Correspondence: hongpeng@brandeis.edu (P.H.); jliu@stat.harvard.edu (J.S.L.); Tel.: +1-617-495-1600 (J.S.L.); +1-781-736-2729 (P.H.)

Received: 31 January 2020; Accepted: 25 February 2020; Published: 2 March 2020
Abstract:
Traditional hypothesis-margin research focuses on obtaining large margins and on feature selection.
In this work, we show that the robustness of margins is also critical and can be measured using entropy.
In addition, our approach provides clear mathematical formulations and explanations to uncover feature
interactions, which are often lacking in large hypothesis-margin based approaches. We design an algorithm,
termed IMMIGRATE (Iterative max-min entropy margin-maximization with interaction terms), for
training the weights associated with the interaction terms. IMMIGRATE simultaneously utilizes both
local and global information and can be used as a base learner in Boosting. We evaluate IMMIGRATE on
a wide range of tasks, in which it demonstrates exceptional robustness and achieves state-of-the-art
results with high interpretability.
Keywords: hypothesis-margin; feature selection; entropy; IMMIGRATE
1. Introduction
Feature selection is one of the most fundamental problems in machine learning and pattern recognition [1]. The Relief algorithm by Kira and Rendell [2] is one of the most successful feature selection algorithms. It can be interpreted as an online learning algorithm that solves a convex optimization problem with a hypothesis-margin-based cost function. Instead of deploying exhaustive or heuristic combinatorial searches, Relief decomposes a complex, global and nonlinear classification task into a simple and local one. Following the large hypothesis-margin principle for classification, Relief calculates the weights of features, which can be used for feature selection. Considering binary classification in a set of samples P with two kinds of labels, the hypothesis-margin of an instance x was later formally defined by Gilad-Bachrach et al. [3] as (1/2)(||x − NM(x)|| − ||x − NH(x)||), where NH(x) denotes the "nearest hit," i.e., the nearest sample to x with the same label, while NM(x) denotes the "nearest miss," the nearest sample to x with a different label. The large hypothesis-margin principle has motivated several successful extensions of the Relief algorithm. For example, ReliefF [4] uses multiple nearest neighbors. Simba [3] recalculates the nearest neighbors every time the feature weights are updated. Yang et al. [5] consider global information to improve Simba. I-RELIEF [6] identifies the nearest hits and misses in a probabilistic manner, which forms a variation of the hypothesis-margin. LFE [7] extends Relief from feature selection to feature extraction using local information. IM4E is proposed by Bei and Hong [8] to balance margin-quantity maximization
and margin-quality maximization. Both approaches in Sun and Wu [7] and Bei and Hong [8] use a variation of the hypothesis-margin proposed in Sun and Li [6].
The Relief-based algorithms indirectly consider feature interactions by normalizing the feature weights [9], which, however, cannot directly reflect the natural effects of associations and hence results in a poor understanding of how features interact. For example, Relief and many of its extensions cannot tell whether a high weight of a certain feature is caused by its linear effect or by its interaction with other features [9]. Furthermore, these methods cannot directly reveal and measure the impact of the interaction terms on classification results.
To this end, we propose the Iterative Max-MIn entropy marGin-maximization with inteRAction TErms algorithm (IMMIGRATE, henceforth). IMMIGRATE directly measures the influence of feature interactions and has the following characteristics. First, when defining the hypothesis-margin, we introduce a new trainable quadratic-Manhattan measurement to capture interaction terms, which measures interaction importance directly. Second, we take advantage of margin stability by measuring the underlying entropy based on the distribution of instances. Third, we derive an iterative optimization algorithm to efficiently minimize the cost function. Fourth, we design a novel classification method that utilizes the learned quadratic-Manhattan measurement to predict the class of a new instance. Fifth, we design a more powerful approach (i.e., Boosted IMMIGRATE) by using IMMIGRATE as the base learner of Boosting [10]. Sixth, to make IMMIGRATE efficient for analyzing high-dimensional datasets, we take advantage of IM4E [8] to obtain an effective initialization.
The rest of the paper is organized as follows. Section 2 explains the foundation of the Relief algorithm, and Section 3 introduces the IMMIGRATE algorithm. Section 4 summarizes and discusses our experiments on different datasets, showing that IMMIGRATE achieves state-of-the-art results and that Boosted IMMIGRATE outperforms other boosting classifiers significantly; the computation time of IMMIGRATE is comparable to that of other popular feature selection methods that consider interaction terms. Section 5 compares our approach with related works, and Section 6 concludes the article with a short discussion.
2. Review: the Relief Algorithm
We first introduce a few notations used throughout the paper: x_i ∈ R^A is the i-th instance in the training set P; y_i is the class label of x_i; N is the size of P; A is the number of features (i.e., attributes); w is the feature weight vector; and |x_i| denotes the vector obtained by applying the absolute value operation element-wise. Relief [2] iteratively calculates the feature weights in w (Algorithm 1). The higher a feature weight is, the more relevant the corresponding feature is. After the calculation of feature weights, a threshold is chosen to select relevant features. Relief can be viewed as a convex optimization problem that minimizes the cost function in Equation (1):
C = Σ_{n=1}^{M} ( w^T |x_n − NH(x_n)| − w^T |x_n − NM(x_n)| ),
subject to: w ≥ 0, ||w||_2^2 = 1,  (1)
where M (≤ N) is a user-defined number of randomly chosen training samples, NH(x) is the nearest "hit" (from the same class) of x, NM(x) is the nearest "miss" (from a different class) of x, and w^T |x_n − NH(x_n)| is the weighted Manhattan distance. Denote u = Σ_{n=1}^{M} ( |x_n − NH(x_n)| − |x_n − NM(x_n)| ).
Minimizing the cost function of Relief (1) can be solved using the Lagrange multiplier method and the Karush–Kuhn–Tucker conditions [11] to obtain a closed-form solution: w = (−u)_+ / ||(−u)_+||_2, where (a)_+ truncates the negative elements of a to 0. This solution to the original Relief algorithm is important for understanding the Relief-based algorithms.
Algorithm 1 The Original Relief Algorithm

N: the number of training instances.
A: the number of features (i.e., attributes).
M: the number of randomly chosen training samples used to update the feature weights w.
Input: a training dataset {z_n = (x_n, y_n)}, n = 1, ..., N.
Initialization: initialize all feature weights to 0: w = 0.
for i = 1 to M do
    Randomly select an instance x_i and find its NH(x_i) and NM(x_i).
    Update the feature weights by w = w − (x_i − NH(x_i))^2 / M + (x_i − NM(x_i))^2 / M,
    where the square operation is element-wise.
Return: w.
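To make the update rule above concrete, here is a minimal R sketch of Algorithm 1; the function name, the sampling scheme, and the Manhattan-distance nearest-neighbor search are our own illustrative choices, not the authors' released implementation.

```r
# Minimal sketch of the original Relief weight update (Algorithm 1).
relief_weights <- function(X, y, M = nrow(X)) {
  N <- nrow(X); A <- ncol(X)
  w <- rep(0, A)
  for (i in seq_len(M)) {
    n <- sample(N, 1)
    d <- rowSums(abs(sweep(X, 2, X[n, ])))       # Manhattan distances to x_n
    d[n] <- Inf                                  # exclude the instance itself
    hit  <- which.min(ifelse(y == y[n], d, Inf)) # nearest hit
    miss <- which.min(ifelse(y != y[n], d, Inf)) # nearest miss
    w <- w - (X[n, ] - X[hit, ])^2 / M + (X[n, ] - X[miss, ])^2 / M
  }
  w
}
```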
3. IMMIGRATE Algorithm
Without loss of generality, we establish the IMMIGRATE algorithm in a general binary classification setting. This formulation can be easily extended to handle multi-class classification problems. Let the whole data set be P = {z_n | z_n = (x_n, y_n), x_n ∈ R^A, y_n = ±1}, n = 1, ..., N; the hit index set of x_n be H_n = {j | z_j ∈ P, y_j = y_n and j ≠ n}, and the miss index set of x_n be M_n = {j | z_j ∈ P, y_j ≠ y_n}.
3.1. Hypothesis-Margin
Given a distance d(x_i, x_j) between two instances x_i and x_j, a hypothesis-margin [3] is defined as ρ_{n,h,m} = d(x_n, x_m) − d(x_n, x_h), where x_h and x_m represent the nearest hit and nearest miss for instance x_n, respectively. We adopt the probabilistic hypothesis-margin defined by Sun and Li [6] as

ρ_n = Σ_{m∈M_n} β_{n,m} d(x_n, x_m) − Σ_{h∈H_n} α_{n,h} d(x_n, x_h),  (2)
where α_{n,h} ≥ 0, β_{n,m} ≥ 0, Σ_{h∈H_n} α_{n,h} = 1, and Σ_{m∈M_n} β_{n,m} = 1, for n ∈ {1, ..., N}. In the above design, the hidden random variable α_{n,h} represents the probability that x_h is the nearest hit of instance x_n, while β_{n,m} indicates the probability that x_m is the nearest miss of instance x_n. In the rest of the paper, for conciseness, we will use margin to indicate hypothesis-margin.
3.2. Entropy to Measure Margin Stability
The distributions of hits and misses can be used to evaluate the stability of margins (i.e., margin quality). A more stable margin can be obtained by considering the distributions of instances with the same or different labels with respect to the target instance. A margin is deemed stable if it cannot be greatly reduced by changes to only a few neighbors of the target instance. Considering an instance x_n, its probabilities {α_{n,h}} and {β_{n,m}} represent the distributions of its hits and misses, respectively. We can use the hit entropy E_hit(x_n) = −Σ_{h∈H_n} α_{n,h} log α_{n,h} and the miss entropy E_miss(x_n) = −Σ_{m∈M_n} β_{n,m} log β_{n,m} to evaluate the stability of x_n's margin. The following two scenarios help explain the intuition behind these entropies. Scenario A: all neighbors are distributed evenly around the target instance; Scenario B: the neighbor distribution is highly uneven. An extreme example of Scenario B is that one instance is quite close to the target and the rest are quite far away from it. An easy experiment to test stability is to discard one instance from the system and check how this influences the margin. In Scenario A, if the closest neighbor (no matter whether it is a hit or a miss) is discarded, the margin changes only slightly because there are many other hits/misses evenly distributed around the target. In Scenario B, if the closest neighbor is
a miss, its removal can increase the margin significantly. On the contrary, if the closest neighbor is a hit, removing it can decrease the margin significantly. Intuitively speaking, hits prefer Scenario A and misses favor Scenario B.
Since Scenarios A and B correspond to high and low entropy, respectively, the margin benefits from a large hit entropy E_hit (e.g., Scenario A) and a low miss entropy E_miss (e.g., Scenario B). We can set up a framework to maximize the hit entropy and minimize the miss entropy, which is equivalent to making the margin in Equation (2) the most stable. Bei and Hong [8] use the term max-min entropy principle to describe the process that maximizes the hit entropy and minimizes the miss entropy to maximize the margin quality. The process of stabilizing the margin is an extension of the large margin principle.
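As a toy illustration of the two scenarios (our own example, not from the paper), the following R snippet computes the entropy of a neighbor distribution; an even hit distribution gives a high E_hit, while a single dominant miss gives a low E_miss.

```r
# Entropy of a discrete neighbor distribution: -sum(p * log(p)).
neighbor_entropy <- function(p) -sum(p * log(p))

alpha_even  <- rep(1/5, 5)                        # Scenario A: evenly distributed hits
beta_uneven <- c(0.92, 0.02, 0.02, 0.02, 0.02)    # Scenario B: one very close miss

neighbor_entropy(alpha_even)   # high hit entropy -> stable margin
neighbor_entropy(beta_uneven)  # low miss entropy -> favored for misses
```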
3.3. Quadratic-Manhattan Measurement
We extend the margin in Equation (2) by using a new quadratic-Manhattan measurement defined as:

q(x_i, x_j) = |x_i − x_j|^T W |x_i − x_j|,  (3)

where W is a non-negative symmetric matrix (element-wise non-negative) with Frobenius norm ||W||_F = 1. The quadratic-Manhattan measurement is a natural extension of the weight vector, and the distance defined in Equation (3) is a natural extension of the weighted Manhattan distance in Equation (1). Off-diagonal elements of W capture feature interactions and diagonal elements of W capture main effects. To understand why the quadratic-Manhattan measurement can capture the influence of interactions, observe that an element w_{a,b} (a ≠ b) of W enters into (3) as the coefficient for the combination of the a-th and b-th elements of the vector |x_i − x_j|. In Relief-based algorithms, the weighted Manhattan distance in Equation (1) can be equivalently captured by the feature weight update in Algorithm 1. Similarly, w_{a,b} can be updated using the combination of the a-th and b-th features based on a randomly chosen instance. We thus define our new margin using the quadratic-Manhattan measurement as

Σ_{m∈M_n} β_{n,m} q(x_n, x_m) − Σ_{h∈H_n} α_{n,h} q(x_n, x_h).  (4)
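A minimal R sketch of the quadratic-Manhattan measurement in Equation (3) may help make the role of the off-diagonal entries concrete; the toy matrix W below is our own example.

```r
# Quadratic-Manhattan measurement q(x_i, x_j) = |x_i - x_j|^T W |x_i - x_j|
# for an element-wise non-negative, symmetric W with Frobenius norm 1.
quad_manhattan <- function(xi, xj, W) {
  d <- abs(xi - xj)             # element-wise absolute difference
  as.numeric(t(d) %*% W %*% d)  # off-diagonal entries weight feature pairs
}

# Toy example with two features and one interaction term.
W <- matrix(c(0.6, 0.3,
              0.3, 0.6), nrow = 2, byrow = TRUE)
W <- W / norm(W, type = "F")    # enforce ||W||_F = 1
quad_manhattan(c(1, 2), c(0, 0), W)
```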
3.4. IMMIGRATE
We design the following cost function to maximize our new margin while simultaneously optimizing the hit entropy and miss entropy:

C = Σ_{n=1}^{N} [ Σ_{h∈H_n} α_{n,h} |x_n − x_h|^T W |x_n − x_h| − Σ_{m∈M_n} β_{n,m} |x_n − x_m|^T W |x_n − x_m| ]
    + σ Σ_{n=1}^{N} [ E_miss(z_n) − E_hit(z_n) ],
subject to: W ≥ 0, W^T = W, ||W||_F^2 = 1,
            ∀n, Σ_{h∈H_n} α_{n,h} = 1, Σ_{m∈M_n} β_{n,m} = 1, and α_{n,h} ≥ 0, β_{n,m} ≥ 0,  (5)

where E_miss(z_n) = −Σ_{m∈M_n} β_{n,m} log β_{n,m}, E_hit(z_n) = −Σ_{h∈H_n} α_{n,h} log α_{n,h}, and σ is a hyperparameter that can be tuned via internal cross-validation.
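For concreteness, the following R sketch evaluates the cost in Equation (5) for given W, {α_{n,h}}, {β_{n,m}}, and σ; the data structures (per-instance probability vectors stored in lists ordered by the hit/miss index sets) are our own illustrative choices.

```r
# Evaluate the IMMIGRATE cost (Equation (5)) for fixed W, alpha, beta.
# X: N x A matrix; y: labels; alpha, beta: lists of per-instance probability
# vectors over the hit / miss index sets; sigma: entropy weight.
immigrate_cost <- function(X, y, W, alpha, beta, sigma) {
  q <- function(a, b) { d <- abs(a - b); as.numeric(t(d) %*% W %*% d) }
  total <- 0
  for (n in seq_len(nrow(X))) {
    H <- setdiff(which(y == y[n]), n)       # hit index set H_n
    M <- which(y != y[n])                   # miss index set M_n
    hit_term  <- sum(alpha[[n]] * sapply(H, function(j) q(X[n, ], X[j, ])))
    miss_term <- sum(beta[[n]]  * sapply(M, function(j) q(X[n, ], X[j, ])))
    E_hit  <- -sum(alpha[[n]] * log(alpha[[n]]))
    E_miss <- -sum(beta[[n]]  * log(beta[[n]]))
    total <- total + (hit_term - miss_term) + sigma * (E_miss - E_hit)
  }
  total
}
```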
We also design the following optimization procedure, containing two iterative steps, to find the W that minimizes the cost function. The framework starts from a randomly initialized W and stops when the change of the cost function is less than a preset limit or the iteration number reaches a preset threshold. In practice, we find that it typically takes fewer than 10 iterations to stop and obtain good results. Based on our experiments, different initializations of W do not influence the results of the iterative optimization. The computation time of IMMIGRATE is comparable to that of other interaction-related methods such as SODA [12] and hierNet [13].
As depicted by the flow chart in Figure 1, the IMMIGRATE algorithm iteratively optimizes the cost function in Equation (5). It starts with a random initialization satisfying the boundary conditions and proceeds to iterate the two steps detailed below in Algorithm 2.
Algorithm 2 The IMMIGRATE Algorithm

Input: a training dataset {z_n = (x_n, y_n)}, n = 1, ..., N.
Initialization: let t = 0; randomly initialize W^(0) satisfying W^(0) ≥ 0, (W^(0))^T = W^(0), ||W^(0)||_F^2 = 1.
repeat
    Calculate {α^(t+1)_{n,h}}, {β^(t+1)_{n,m}} with Equation (6).
    Calculate W^(t+1) with Theorem 1, Equation (8).
    t = t + 1.
until the change of C in Equation (5) is small enough or the iteration indicator t reaches a preset limit.
Output: W^(t).
Figure 1. Flow chart of IMMIGRATE. Step 0: initialize W randomly, under the constraints W ≥ 0, W^T = W and ||W||_F^2 = 1. Step 1: fix W, update {α_{n,h}} and {β_{n,m}}. Step 2: fix {α_{n,h}} and {β_{n,m}}, update W. Steps 1 and 2 are iterated to optimize the cost function, where ∆C is the change of the cost function in (5) and e is a pre-set limit.
3.4.1. Step 1: Fix W, Update {α_{n,h}} and {β_{n,m}}

Fixing W and setting ∂C/∂α_{n,h} = 0 and ∂C/∂β_{n,m} = 0, we can obtain closed-form updates of α_{n,h} and β_{n,m} as

α_{n,h} = exp(−q(x_n, x_h)/σ) / Σ_{h'∈H_n} exp(−q(x_n, x_{h'})/σ),
β_{n,m} = exp(−q(x_n, x_m)/σ) / Σ_{k∈M_n} exp(−q(x_n, x_k)/σ).  (6)

The Hessian matrix of C w.r.t. the probability pair (α_{n,h}, β_{n,m}) is:

∂²C/∂(α_{n,h}, β_{n,m})² = [ σ/α_{n,h}, ∂²C/∂α_{n,h}∂β_{n,m} ; ∂²C/∂β_{n,m}∂α_{n,h}, −σ/β_{n,m} ].  (7)

Since α_{n,h}, β_{n,m} > 0, the determinant of the Hessian matrix is negative, so the stationary point is a saddle point in the (α_{n,h}, β_{n,m}) space. Therefore, the cost function C achieves its local minimum and local maximum w.r.t. α_{n,h} and β_{n,m}, respectively.
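A short R sketch of this update, using the softmax form of Equation (6) as reconstructed above (the helper names are illustrative):

```r
# Step 1 sketch: given the current W, recompute the hit / miss probabilities
# of Equation (6) for one instance x_n.
update_alpha_beta <- function(n, X, y, W, sigma) {
  q <- function(a, b) { d <- abs(a - b); as.numeric(t(d) %*% W %*% d) }
  H <- setdiff(which(y == y[n]), n)
  M <- which(y != y[n])
  qa <- sapply(H, function(j) q(X[n, ], X[j, ]))
  qb <- sapply(M, function(j) q(X[n, ], X[j, ]))
  alpha <- exp(-qa / sigma); alpha <- alpha / sum(alpha)   # hit probabilities
  beta  <- exp(-qb / sigma); beta  <- beta / sum(beta)     # miss probabilities
  list(alpha = alpha, beta = beta, hits = H, misses = M)
}
```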
3.4.2. Step 2: Fix {α_{n,h}} and {β_{n,m}}, Update W

Fixing α_{n,h} and β_{n,m}, the minimization w.r.t. W is convex. In Equation (5), W satisfies W ≥ 0, W^T = W, ||W||_F^2 = 1. In our iterative optimization strategy, we impose W to be a distance metric for computation. Then, a closed-form solution for W can be derived (see Equation (8)).
Theorem 1. With {α_{n,h}} and {β_{n,m}} fixed, Equation (5) gives rise to a closed-form solution for updating W. Let

Σ = Σ_{n=1}^{N} (Σ_{n,H} − Σ_{n,M}),

where Σ_{n,H} = Σ_{h∈H_n} α_{n,h} |x_n − x_h| |x_n − x_h|^T and Σ_{n,M} = Σ_{m∈M_n} β_{n,m} |x_n − x_m| |x_n − x_m|^T. Let the ψ_i's and μ_i's be the eigenvectors and eigenvalues of Σ, respectively, so that Σψ_i = μ_i ψ_i with ||ψ_i||_2^2 = 1. Then,

W = Φ Φ^T,  (8)

where Φ = (η_1 ψ_1, η_2 ψ_2, ..., η_A ψ_A) and η_i = sqrt( (−μ_i)_+ / sqrt( Σ_{j=1}^{A} ((−μ_j)_+)^2 ) ).
Proof. Since W is a distance metric matrix, it is symmetric and positive-semidefinite. Let λ_1 ≥ λ_2 ≥ ... ≥ λ_A ≥ 0 be the eigenvalues of W; then the eigen-decomposition of W is

W = P Λ P^T = P Λ^{1/2} Λ^{1/2} P^T = [√λ_1 p_1, ..., √λ_A p_A][√λ_1 p_1, ..., √λ_A p_A]^T ≡ Φ Φ^T,  (9)

where P is an orthogonal matrix and Φ = [φ_1, ..., φ_A] ≡ [√λ_1 p_1, ..., √λ_A p_A]. Thus, ⟨φ_i, φ_j⟩ = 0 for i ≠ j. The constraint ||W||_F^2 = 1 can be simplified as:

||W||_F^2 = Σ_{i,j} w_{i,j}^2 = Σ_i (φ_i^T φ_i)^2 = 1.  (10)
Let us rearrange Equation (5). Note that

Σ_{h∈H_n} α_{n,h} |x_n − x_h|^T W |x_n − x_h| = tr( W Σ_{h∈H_n} α_{n,h} |x_n − x_h| |x_n − x_h|^T )
  = tr(W Σ_{n,H}) = tr( Σ_{n,H} Σ_{i=1}^{A} φ_i φ_i^T ) = Σ_{i=1}^{A} φ_i^T Σ_{n,H} φ_i.  (11)

Then, Equation (5) can be further simplified as:

C = Σ_{i=1}^{A} φ_i^T Σ φ_i,
subject to: ||W||_F^2 = Σ_i (φ_i^T φ_i)^2 = 1, ⟨φ_i, φ_j⟩ = 0,  (12)
where Σ = Σ_{n=1}^{N} (Σ_{n,H} − Σ_{n,M}), Σ_{n,H} = Σ_{h∈H_n} α_{n,h} |x_n − x_h| |x_n − x_h|^T, and Σ_{n,M} = Σ_{m∈M_n} β_{n,m} |x_n − x_m| |x_n − x_m|^T. The orthogonality condition can be ignored because it is automatically satisfied by the solution obtained below. The Lagrangian for the optimization problem in Equation (12) is easy to obtain:

L = Σ_{i=1}^{A} φ_i^T Σ φ_i + λ ( Σ_{i=1}^{A} (φ_i^T φ_i)^2 − 1 ).  (13)

Differentiating L with respect to φ_i yields:

∂L/∂φ_i = 2 Σ φ_i + 4 λ (φ_i^T φ_i) φ_i = 0.  (14)
Denote ψ_i := φ_i / ||φ_i||_2. From Equation (14), we have

Σ ψ_i = μ_i ψ_i,  (15)

where μ_i = −2λ ||φ_i||_2^2. Thus, ψ_i and μ_i are an eigenvector and an eigenvalue of Σ, respectively.
Let φ_i = η_i ψ_i with η_i ≥ 0. Thus, C = Σ_{i=1}^{A} (η_i ψ_i)^T Σ (η_i ψ_i) = Σ_{i=1}^{A} η_i^2 μ_i ψ_i^T ψ_i = Σ_{i=1}^{A} η_i^2 μ_i, and ||W||_F^2 = Σ_i ((η_i ψ_i)^T (η_i ψ_i))^2 = Σ_i (η_i^2)^2 = 1. Then, Equation (12) can be simplified to

C = Σ_{i=1}^{A} η_i^2 μ_i, subject to: Σ_{i=1}^{A} (η_i^2)^2 = 1, η_i ≥ 0.  (16)

Note that, written in terms of the vector (η_1^2, ..., η_A^2)^T, Equation (16) is exactly the same problem as in the original Relief algorithm (Algorithm 1), whose solution is

(η_1^2, ..., η_A^2)^T = (−μ)_+ / ||(−μ)_+||_2,  (17)

where (a)_+ = [max(a_1, 0), max(a_2, 0), ..., max(a_A, 0)] and φ_i = η_i ψ_i. It is also easy to see that the updated W is a distance metric.
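The following R sketch implements the W update of Theorem 1 / Equation (8) as reconstructed above; the sign of the eigenvalue truncation follows that reconstruction and the function name is our own.

```r
# Step 2 sketch: update W from fixed {alpha}, {beta} via Theorem 1 / Equation (8).
# Sigma accumulates sum_n (Sigma_{n,H} - Sigma_{n,M}); the truncation sign follows
# the reconstruction above and should be checked against the published derivation.
update_W <- function(X, y, alpha, beta) {
  A <- ncol(X)
  Sigma <- matrix(0, A, A)
  for (n in seq_len(nrow(X))) {
    H <- setdiff(which(y == y[n]), n)
    M <- which(y != y[n])
    for (k in seq_along(H)) {
      d <- abs(X[n, ] - X[H[k], ]); Sigma <- Sigma + alpha[[n]][k] * (d %o% d)
    }
    for (k in seq_along(M)) {
      d <- abs(X[n, ] - X[M[k], ]); Sigma <- Sigma - beta[[n]][k] * (d %o% d)
    }
  }
  eig  <- eigen(Sigma, symmetric = TRUE)
  tval <- pmax(-eig$values, 0)                  # (-mu)_+ truncation
  if (sum(tval) == 0) return(diag(A) / sqrt(A)) # degenerate case: identity fallback
  tval <- tval / sqrt(sum(tval^2))              # so that ||W||_F = 1
  Phi  <- eig$vectors %*% diag(sqrt(tval), nrow = A)  # columns eta_i * psi_i
  Phi %*% t(Phi)
}
```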
3.4.3. Weight Pruning
Some previous Relief-based algorithms offer options to remove weights lower than a preset threshold. IMMIGRATE offers a similar option to prune small weights: elements of W below a threshold are set to 0, after which W is re-normalized w.r.t. the Frobenius norm. This pruning is applied by default.
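A small R sketch of this pruning step (the default threshold 1/A follows the experimental setting in Section 4.2):

```r
# Prune small entries of W and re-normalize to unit Frobenius norm.
prune_W <- function(W, threshold = 1 / ncol(W)) {
  W[abs(W) < threshold] <- 0
  W / norm(W, type = "F")
}
```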
3.4.4. Predict New Samples
A prediction rule based on the learned weight matrix W can be formulated as:

ŷ_0 = argmin_c Σ_{y_n = c} α_n^c(x_0) q(x_0, x_n),
α_n^c(x_0) = exp(−q(x_0, x_n)/σ) / Σ_{y_k = c} exp(−q(x_0, x_k)/σ),  (18)

where z_0 = (x_0, y_0) is a new instance, c denotes the class, and ŷ_0 is the predicted label. This prediction method assigns a new instance to the class that maximizes its hypothesis-margin using the learned weight matrix W, which makes it more stable than the k-NN method used in the traditional Relief-based algorithms.
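A minimal R sketch of the prediction rule in Equation (18), with the helper name chosen for illustration:

```r
# Prediction sketch (Equation (18)): assign x0 to the class with the smallest
# softmax-weighted quadratic-Manhattan distance to its members.
predict_immigrate <- function(x0, X, y, W, sigma) {
  q <- function(a, b) { d <- abs(a - b); as.numeric(t(d) %*% W %*% d) }
  scores <- sapply(unique(y), function(cl) {
    idx <- which(y == cl)
    qs  <- sapply(idx, function(j) q(x0, X[j, ]))
    a   <- exp(-qs / sigma); a <- a / sum(a)   # class-conditional weights
    sum(a * qs)                                # soft within-class distance
  })
  unique(y)[which.min(scores)]
}
```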
3.5. IMMIGRATE in Ensemble Learning
Boosting [10,14,15] has been widely used to create ensemble learners that produce state-of-the-art results in many tasks. Boosting combines a set of relatively weak base learners to create a much stronger learner. To use IMMIGRATE as the base classifier in the AdaBoost algorithm [14], we modify the cost function in Equation (5) to include sample weights and use the modified version in the boosting iterations. We name the algorithm BIM, standing for Boosted IMMIGRATE (refer to Equation (19) and Algorithm 3 for more details about BIM). BIM schedules the adjustment of the hyperparameter σ in its boosting iterations. It starts with σ being a predefined σ_max and gradually reduces σ by multiplying it by (σ_min/σ_max)^{1/T} at each iteration until reaching σ_min, where T is a predefined maximum number of boosting iterations.
C = Σ_{n=1}^{N} D(x_n) [ Σ_{h∈H_n} α_{n,h} |x_n − x_h|^T W |x_n − x_h| − Σ_{m∈M_n} β_{n,m} |x_n − x_m|^T W |x_n − x_m| ]
    + σ Σ_{n=1}^{N} D(x_n) [ E_miss(z_n) − E_hit(z_n) ],
subject to: W ≥ 0, W^T = W, ||W||_F^2 = 1,
            ∀n, Σ_{h∈H_n} α_{n,h} = 1, Σ_{m∈M_n} β_{n,m} = 1, and α_{n,h} ≥ 0, β_{n,m} ≥ 0,  (19)

where E_miss(z_n) = −Σ_{m∈M_n} β_{n,m} log β_{n,m}, E_hit(z_n) = −Σ_{h∈H_n} α_{n,h} log α_{n,h}, Σ_{n=1}^{N} D(x_n) = 1, and D(x_n) ≥ 0 for all n.
Algorithm 3 The BIM Algorithm

T: the number of classifiers for BIM.
Input: a training dataset {z_n = (x_n, y_n)}, n = 1, ..., N.
Initialization: for each x_n, set D_1(x_n) = 1/N.
for t = 1 to T do
    Limit the maximum number of IMMIGRATE iterations to a preset value.
    Train a weak IMMIGRATE classifier h_t(x) using the chosen σ_t and weights D_t(x) by Equation (19).
    Compute the error rate e_t as e_t = Σ_{i=1}^{N} D_t(x_i) I[y_i ≠ h_t(x_i)].
    if e_t ≥ 1/2 or e_t = 0 then
        Discard h_t, set T = T − 1 and continue.
    Set α_t = 0.5 × log[(1 − e_t)/e_t].
    Update D(x_i): for each x_i,
        D_{t+1}(x_i) = D_t(x_i) exp(α_t I[y_i ≠ h_t(x_i)]).
    Normalize D_{t+1}(x_i), so that Σ_{i=1}^{N} D_{t+1}(x_i) = 1.
Output: h_final(x) = argmax_{y∈{0,1}} Σ_{t: h_t(x) = y} α_t.
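The following R sketch outlines BIM as an AdaBoost-style wrapper; `train_immigrate` and `predict_immigrate` are placeholders for a weighted IMMIGRATE learner and its prediction rule, not the API of the released package.

```r
# BIM sketch: AdaBoost-style boosting of a weighted IMMIGRATE base learner,
# with the geometric sigma schedule described in Section 3.5.
bim <- function(X, y, T = 100, sigma_max = 4, sigma_min = 0.2,
                train_immigrate, predict_immigrate) {
  N <- nrow(X); D <- rep(1 / N, N)
  models <- list(); alphas <- numeric(0)
  sigma <- sigma_max
  for (t in seq_len(T)) {
    h    <- train_immigrate(X, y, weights = D, sigma = sigma)
    pred <- predict_immigrate(h, X)
    e    <- sum(D * (pred != y))
    sigma <- sigma * (sigma_min / sigma_max)^(1 / T)   # sigma schedule
    if (e >= 0.5 || e == 0) next                       # discard this weak learner
    a <- 0.5 * log((1 - e) / e)
    D <- D * exp(a * (pred != y)); D <- D / sum(D)     # reweight and normalize
    models <- c(models, list(h)); alphas <- c(alphas, a)
  }
  list(models = models, alphas = alphas)
}
```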
3.6. IMMIGRATE for High-Dimensional Data Space
When applied to high-dimensional data, IMMIGRATE can incur a high computational cost because it considers the interactions between every feature pair. To reduce the computational cost, we first use IM4E [8] to learn a feature weight vector, which is used to initialize the diagonal elements of W in the proposed quadratic-Manhattan measurement. We also use the learned feature weight vector to help pre-screen the features, keeping only those with weights above a preset limit. In the remaining computation, we only model interactions between the chosen features. The features discarded in pre-screening can be added back empirically based on the needs of a specific application. We term this procedure IM4E-IMMIGRATE,
which is effective and computationally efficient. It can also be boosted (Boosted IM4E-IMMIGRATE) to be
stronger.
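A minimal R sketch of this pre-screening and initialization step; `im4e_weights` is a placeholder for any learner that returns one non-negative weight per feature, and the default limit 2/A follows the setting in Section 4.2.

```r
# Pre-screening sketch for high-dimensional data: keep features whose IM4E-style
# weights exceed a preset limit and initialize diag(W) from those weights.
prescreen_init <- function(X, y, im4e_weights, limit = 2 / ncol(X)) {
  w    <- im4e_weights(X, y)              # one non-negative weight per feature
  keep <- which(w > limit)                # features entering the interaction model
  W0   <- diag(w[keep], nrow = length(keep))
  W0   <- W0 / norm(W0, type = "F")       # start from a valid ||W||_F = 1 matrix
  list(keep = keep, W0 = W0)
}
```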
4. Experiments
In our experiments, all continuous features are normalized to have mean zero and unit variance, and cross-validation is used to compare the performance of the various approaches. We have implemented IMMIGRATE in R and MATLAB. The R package is available at https://CRAN.R-project.org/package=Immigrate, and the MATLAB version is available at https://github.com/RuzhangZhao/Immigrate-MATLAB-. Both IMMIGRATE and BIM can be accelerated by parallel computing, as their computations are matrix-based.
4.1. Synthetic Dataset
We first test the robustness of the IMMIGRATE algorithm using a synthesized dataset with two interacting features following Gaussian distributions in a binary classification setting. The simulated dataset contains 100 samples from one class governed by a Gaussian distribution with mean (4, 2)^T and covariance matrix [1, 0.5; 0.5, 1], and another 100 samples from the other class governed by a Gaussian distribution with mean (6, 0)^T and the same covariance matrix. In addition, we add noise following a Gaussian distribution with mean (8, −2)^T and covariance matrix [8, 4; 4, 8] to the first class, and noise following a Gaussian distribution with mean (2, 4)^T and the same covariance matrix to the second class. Figure 2 shows a scatter plot of the synthesized dataset containing 10% samples from the noise distributions. The orange dotted line in Figure 2 has slope 1 and separates the data with different labels.
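A sketch of how such a dataset can be generated in R with MASS::mvrnorm is given below; it assumes the 10% noise case of Figure 2, reading "10% noise" as 10 of the 100 samples in each class coming from the noise component.

```r
# Sketch of the synthetic data in Section 4.1 (10% noise case); the seed is arbitrary.
library(MASS)
set.seed(1)
Sig_signal <- matrix(c(1, 0.5, 0.5, 1), 2)
Sig_noise  <- matrix(c(8, 4, 4, 8), 2)
n_sig <- 90; n_noise <- 10                       # assumed split per class at 10% noise
class1 <- rbind(mvrnorm(n_sig,   c(4, 2),  Sig_signal),
                mvrnorm(n_noise, c(8, -2), Sig_noise))
class2 <- rbind(mvrnorm(n_sig,   c(6, 0),  Sig_signal),
                mvrnorm(n_noise, c(2, 4),  Sig_noise))
dat <- data.frame(rbind(class1, class2),
                  label = factor(rep(c(0, 1), each = n_sig + n_noise)))
names(dat)[1:2] <- c("feature1", "feature2")
```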
The noise is included to disturb the detection of the interaction term. The noise level starts at 5% and gradually increases, in steps of 5%, to 50%. As a baseline, we apply logistic regression and observe that the t-test p-value of the interaction coefficient increases from 3 × 10^{-11} to 7 × 10^{-5} and to 0.7 as the noise level increases from 0% to 10% and to 50%. Local Feature Extraction (LFE, Sun and Wu [7]) is a Relief-based algorithm which considers interaction terms indirectly, though the interaction information is only used for feature extraction. We run IMMIGRATE and LFE on the synthesized datasets and compare the weights of the interaction term between features 1 and 2 in Figure 3, which shows that IMMIGRATE is more robust than LFE.
Figure 2. The synthesized dataset with 10% noise (scatter of feature1 vs. feature2, colored by label 0/1).

Figure 3. Weight of the interaction term between features 1 and 2 as a function of the noise level: IMMIGRATE (IGT) is more robust than LFE.
4.2. Real Datasets
We compare IMMIGRATE with several existing popular methods using real datasets from the UCI database. The following algorithms are considered in the comparison: Support Vector Machine [16] with a sigmoid kernel (SV1), Support Vector Machine with a radial basis function kernel (SV2), LASSO (LAS) [17], Decision Tree (DT) [15], Naive Bayes Classifier (NBC) [18], Radial Basis Function Network (RBF) [19], 1-Nearest Neighbor (1NN) [20], 3-Nearest Neighbor (3NN), Large Margin Nearest Neighbor (LMN) [21], Relief (REL) [2], ReliefF (RFF) [4,22], Simba (SIM) [3], and Linear Discriminant Analysis (LDA) [23]. In addition, several methods designed for detecting interaction terms are included: LFE [7], Stepwise conditional likelihood variable selection for Discriminant Analysis (SOD) [12], and hierNet (HIN) [13]. We also include three of the most widely used and competitive ensemble learners: Adaptive Boosting (ADB) [14,15], Random Forest (RF) [24], and XGBoost (XGB) [25]. We use the following abbreviations when presenting the results: IM4 for IM4E, IGT for IMMIGRATE, EGT for IM4E-IMMIGRATE, and B4G for the Boosted IM4E-IMMIGRATE.
Whenever possible, we use the settings of the aforementioned methods reported in their original papers: LMNN uses a 3-NN classifier; Relief and Simba use the Euclidean distance and a 1-NN classifier; ReliefF uses the Manhattan distance and a k-NN classifier (k = 1, 3, 5, decided by internal cross-validation); in SODA, gam (= 0, 0.5, 1) is determined by internal cross-validation and logistic regression is used for prediction. The IM4E algorithm has two hyperparameters λ and σ. We fix λ = 1, as it has no actual contribution, and tune σ as suggested by Bei and Hong [8]. Hence, the IMMIGRATE algorithm only has one hyperparameter σ. When tuning σ, we gradually decrease σ from σ_0 = 4 by half each time until it is not larger than 0.2. The preset limit for weight pruning is 1/A, where A is the number of features. Furthermore, the preset iteration number is chosen to be 10. For each dataset, σ and whether weight pruning is applied are determined by the best internal cross-validation results. For BIM, we use σ_max = 4, σ_min = 0.2, and the maximal number of boosting iterations T is 100. The preset threshold in IM4E-IMMIGRATE is 2/A.
We repeat ten-fold cross-validation ten times for each algorithm on each dataset, i.e., 100 trials are carried out. When comparing two algorithms (i.e., A vs. B), we calculate a paired Student's t-test using the results of the 100 trials. The first null hypothesis is that there is no difference between the performance of A and that of B. When the p-value is larger than the significance level cutoff 0.05, we say A "ties" B, which means there is no significant difference between their performances. When the p-value is smaller than the significance level cutoff 0.05, the second null hypothesis is that the performance of B is no worse than that of A. When this new p-value is smaller than the significance level cutoff 0.05, we say A "wins", which means A on average performs significantly better than B on this dataset, and vice versa.
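The decision rule can be summarized by the following R sketch, which applies the two-step paired t-test to the 100 paired accuracy values of two algorithms:

```r
# Win / tie / loss decision between two algorithms from 100 paired accuracies,
# following the two-step test described above (alpha = 0.05).
compare_algorithms <- function(acc_A, acc_B, alpha = 0.05) {
  if (t.test(acc_A, acc_B, paired = TRUE)$p.value > alpha) return("tie")
  # one-sided test of the second null hypothesis: B is no worse than A
  if (t.test(acc_A, acc_B, paired = TRUE, alternative = "greater")$p.value < alpha)
    return("A wins")
  "A loses"
}
```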
4.2.1. Gene Expression Datasets
Gene expression datasets typically have thousands of features. We use the following five gene expression datasets for feature selection: GLI [26], Colon (COL) [27], Myeloma (ELO) [28], Breast (BRE) [29], and Prostate (PRO) [30]. All datasets have more than 10,000 features, except COL, which has 2000. Refer to Table A1 in Appendix A for details of all datasets.
We perform ten-fold cross-validation ten times, i.e., 100 trials in total. The results are summarized in Table 1. The last row "(W,T,L)" indicates the number of times that the Boosted IM4E-IMMIGRATE (B4G) wins/ties/loses compared with each algorithm according to the paired Student's t-test with a significance level of α = 0.05. The comparison results are also summarized in Figure 4 (top plot) for easy comparison. Although our B4G is not always the best, it outperforms other methods in most cases. In particular, when IM4E-IMMIGRATE (EGT) is compared with other methods, it also outperforms them in most cases.
Figure 4. Results of the paired t-test on gene expression datasets (top subplot) and UCI datasets (bottom subplot). The top plot shows how well (i.e., "Win" (red bars), "Tie" (green bars), and "Lose" (blue bars)) our Boosted IM4E-IMMIGRATE performs compared with other approaches. In the bottom plot, the results of methods labeled in black are the comparisons with our IMMIGRATE, and the results of methods (ADB, RF, and XGB) labeled in blue are the comparisons with our BIM.
Table 1. Accuracies on five high-dimensional gene expression datasets.
Data SV1 SV2 LAS DT NBC 1NN 3NN SOD RF XGB IM4 EGT B4G
GLI 85.1 86.0 85.2 83.8 83.0 88.7 87.7 88.7 87.6 86.3 87.5 89.1 89.9
COL 73.7 82.0 80.6 69.2 71.1 72.1 77.9 78.1 82.6 79.5 84.3 78.6 82.5
ELO 72.9 90.2 74.6 77.3 76.3 85.6 91.3 86.9 79.2 77.9 88.9 88.6 88.4
BRE 76.0 88.7 91.4 76.4 69.4 83.0 73.6 82.6 86.3 87.3 88.1 90.2 91.5
PRO 71.3 69.9 87.9 86.4 68.0 83.2 82.7 83.2 91.8 90.5 88.0 89.5 89.7
W,T,L ¹ 5,0,0 4,0,1 4,1,0 5,0,0 5,0,0 5,0,0 4,0,1 5,0,0 3,1,1 4,0,1 3,1,1 -,-,- -,-,-
¹ The last row shows the number of times that the Boosted IM4E-IMMIGRATE (B4G) wins/ties/loses (W,T,L) compared with each algorithm according to the paired t-test. Ten-fold cross-validation is performed ten times, i.e., 100 trials are carried out for each dataset; the average accuracies are reported in Tables 1–3. With 100 trials and two algorithms A and B, a paired Student's t-test is carried out between their results. Under the significance level of α = 0.05, algorithm A is significantly better than algorithm B (i.e., A wins) on a dataset if the p-value of the paired Student's t-test with the corresponding null hypothesis is less than α = 0.05. (The same rule applies to the experiments on the UCI datasets.)
4.2.2. UCI Datasets
We also carry out an extensive comparison using many UCI datasets [31]: BCW, CRY, CUS, ECO, GLA, HMS, IMM, ION, LYM, MON, PAR, PID, SMR, STA, URB, USE, and WIN. Refer to Table A1 in Appendix A for the full names of and links to these datasets. If a dataset has more than two classes, we use the two classes with the largest sample sizes. In addition, we use three large-scale datasets: CRO, ELE, and WAV.
We perform ten-fold cross-validation ten times. Table 2 (for IMMIGRATE) and Table 3 (for BIM) show the average accuracies on the corresponding datasets. In Table 2, the last row "(W,T,L)" indicates the number of times IMMIGRATE (IGT) and BIM win/tie/lose when compared with each algorithm separately using the paired Student's t-test with a significance level of α = 0.05. The comparison results are also summarized in Figure 4 (bottom subplot), where the first 17 items (black) indicate the results for IMMIGRATE and the last three items (blue) indicate the results for BIM.
Although IMMIGRATE or BIM is not always the best, they outperform the other methods significantly in one-to-one comparisons in terms of cross-validation results. Figure 4 (bottom subplot, black part) and Table 2 show that IMMIGRATE achieves state-of-the-art performance as a base classifier, while Figure 4 (bottom subplot, blue part) and Table 3 show that BIM achieves state-of-the-art performance as a boosted classifier. To visualize the feature selection results of our approaches, we plot the feature weight heat maps of four datasets (GLA, LYM, SMR and STA) in Appendix B, Figure A1.
Table 2. Accuracies on the UCI datasets.
Data SV1 SV2 LAS DT NBC RBF 1NN 3NN LMN REL RFF SIM LFE LDA SOD hIN IM4 IGT
BCW 61.4 66.6 71.4 70.5 62.4 56.9 68.2 72.2 69.5 66.4 67.1 67.7 67.1 73.9 65.2 71.8 66.4 74.5
CRY 72.9 90.6 87.4 85.3 84.4 89.7 89.1 85.4 87.8 73.8 77.2 79.7 86.0 88.6 86.0 87.9 86.2 89.8
CUS 86.5 88.9 89.6 89.6 89.5 86.8 86.5 88.7 88.8 82.1 84.7 84.3 86.4 90.3 90.8 90.3 87.5 90.1
ECO 92.9 96.9 98.6 98.6 97.8 94.6 96.0 97.8 97.8 89.0 90.7 91.2 93.1 99.0 97.9 98.7 97.5 98.2
GLA 64.2 76.7 72.3 79.4 69.5 73.0 81.1 78.1 79.4 64.1 63.5 67.1 81.2 72.0 75.3 75.0 78.0 87.5
HMS 63.8 64.5 67.7 72.5 67.2 66.8 66.0 69.3 71.2 65.3 66.0 65.7 64.9 69.0 67.4 69.4 66.6 69.2
IMM 74.3 70.6 74.4 84.1 77.9 67.3 69.4 77.9 76.7 69.9 71.8 69.0 75.0 75.2 72.3 70.2 80.7 83.8
ION 80.5 93.5 83.6 87.4 89.4 79.9 86.7 84.1 84.5 85.8 86.2 84.2 91.0 83.3 90.3 92.6 88.3 92.9
LYM 83.6 81.5 85.2 75.2 83.6 71.1 77.2 82.8 86.6 64.9 71.0 70.4 79.6 85.2 79.3 84.8 83.3 87.2
MON 74.4 91.7 75.0 86.4 74.0 68.2 75.1 84.4 84.9 61.4 61.8 65.0 64.8 74.4 91.9 97.2 75.6 99.5
PAR 72.7 72.5 77.1 84.8 74.1 71.5 94.6 91.4 91.8 87.3 90.3 84.6 94.0 85.6 88.2 89.5 83.2 93.8
PID 65.6 73.1 74.7 74.3 71.2 70.3 70.3 73.5 74.0 64.8 68.0 67.0 67.8 74.5 75.7 74.1 72.1 74.7
SMR 73.5 83.9 73.6 72.3 70.3 67.1 86.9 84.7 86.1 69.5 78.3 81.0 84.3 73.1 70.5 83.0 76.4 86.5
STA 69.8 71.6 70.8 68.9 71.0 69.5 67.8 70.8 71.3 59.7 64.0 63.0 66.7 71.3 71.8 69.2 70.8 75.9
URB 85.2 87.9 88.1 82.6 85.8 75.3 87.2 87.5 87.9 81.9 83.2 73.0 87.9 73.0 87.9 88.3 87.4 89.9
USE 95.7 95.2 97.2 93.2 90.6 84.9 90.5 91.5 92.0 54.5 63.7 69.5 85.8 96.9 96.2 96.5 94.1 96.4
WIN 98.3 99.3 98.6 93.1 97.3 97.2 96.4 96.6 96.5 87.2 95.0 95.0 93.8 99.7 92.9 98.9 98.2 99.0
CRO 75.4 97.5 89.9 91.0 88.8 75.4 98.4 98.5 98.6 98.5 98.7 95.1 98.6 89.1 95.2 95.5 81.9 98.2
ELE 72.3 95.7 79.9 80.0 82.5 70.8 81.1 83.9 89.7 64.6 75.4 76.2 79.8 79.9 93.7 93.6 83.2 93.7
WAV 90.0 91.9 92.2 86.2 91.4 84.0 86.5 88.3 88.8 77.6 80.0 83.6 84.7 91.8 92.0 92.1 91.1 92.4
W,T,L ¹ 20,0,0 16,2,2 15,4,1 16,3,1 19,1,0 20,0,0 17,2,1 18,2,0 16,3,1 19,1,0 19,1,0 19,1,0 18,2,0 15,4,1 13,4,3 12,7,1 19,0,1 -,-,-
¹ The last row (W,T,L) shows the number of times that IMMIGRATE (IGT) wins/ties/loses against an existing algorithm according to the paired t-test on the cross-validation results.
Table 3. Accuracies of the ensemble methods on the UCI datasets.
Data ADB RF XGB BIM
BCW 78.2 78.6 78.6 78.3
CRY 90.4 92.9 89.9 91.5
CUS 90.8 91.1 91.4 91.0
ECO 98.0 98.9 98.2 98.6
GLA 85.0 87.0 87.9 86.8
HMS 65.8 72.1 70.0 72.0
IMM 77.2 84.2 81.7 86.1
ION 92.1 93.5 92.5 93.1
LYM 84.8 87.0 87.4 88.1
MON 98.4 95.8 99.1 99.7
PAR 90.5 91.0 91.9 93.2
PID 73.5 76.0 75.1 76.2
SMR 81.4 82.8 83.3 86.6
STA 69.0 71.3 69.5 74.1
URB 87.9 88.6 88.8 91.4
USE 96.0 95.3 94.9 96.1
WIN 97.5 99.1 98.2 99.1
CRO 97.3 97.4 98.5 98.6
ELE 91.1 92.3 95.2 94.1
WAV 89.5 91.2 90.8 93.3
W,T,L ¹ 17,3,0 11,8,1 14,4,2 -,-,-
¹ The last row (W,T,L) shows the number of times that the Boosted IMMIGRATE (BIM) wins/ties/loses against an existing algorithm according to the paired t-test on the cross-validation results.
5. Related Works
In many recent publications, Relief-based algorithms and feature selection with interaction terms have been well explored. Some methods are reviewed here to show the connections and differences with our approach. The hypothesis-margin definition in Equation (2) adopted in this work is also used in some previous studies, such as Bei and Hong [8]. However, Bei and Hong [8] do not consider the interactions between features. Our work provides a measurable way to show the influence of each feature interaction.
Sun and Wu [7] propose the local feature extraction (LFE) method, which learns linear combinations of features for feature extraction. LFE explores the information in feature interaction terms indirectly, which is partly our aim. However, LFE does not consider global information or margin stability, which results in significant differences in the cost function and the optimization procedures.
Our quadratic-Manhattan measurement defined in Equation (3) is related to the Mahalanobis metric used in previous works on metric learning, such as Large Margin Nearest Neighbor (LMNN) [21]. Weinberger and Saul [21] use semi-definite programming to learn the distance metric in LMNN. LMNN and our approach are both based on k-nearest neighbors. A major difference is that our quadratic-Manhattan measurement requires the matrix W to be non-negative (element-wise) and symmetric with Frobenius norm ||W||_F = 1, whereas metric learning only requires its matrix to be symmetric positive-semidefinite. In fact, the non-negative element requirement on W gives IMMIGRATE high interpretability, as the entries of the matrix indicate interaction importance. The quadratic-Manhattan measurement serves well in the classification task and offers a direct explanation of how features, and in particular feature interaction terms, contribute to the classification results.
6. Conclusions and Discussion
In this paper, we propose a new quadratic-Manhattan measurement to extend the hypothesis-margin framework, based on which a feature selection algorithm, IMMIGRATE, is developed for detecting and weighting interaction terms. We also develop its extended versions, Boosted IMMIGRATE (BIM) and IM4E-IMMIGRATE. IMMIGRATE and its variants follow the principle of maximizing a stable hypothesis-margin and are implemented via a computationally efficient iterative optimization procedure. Extensive experiments show that IMMIGRATE outperforms state-of-the-art methods significantly, and its boosted version BIM outperforms other boosting-based approaches. In conclusion, compared with other Relief-based algorithms, IMMIGRATE mainly has the following advantages: (1) both local and global information are considered; (2) interaction terms are used; (3) it is robust and less prone to noise; (4) it is easily boosted. The computation time of the IMMIGRATE variants is comparable to that of other methods able to detect interaction terms.
There are some limitations of IMMIGRATE, and we discuss some directions for improving the algorithm accordingly. First, in Section 3.4.3, small weights are removed using cutoffs directly to obtain sparse solutions, which makes it hard to perform inference on the obtained weights. Penalty terms such as the l1- or l2-penalty are usually applied to shrink and select important weights. We suggest that our cost function in Equation (5) can be modified to include such a penalty term to replace the weight pruning process in Section 3.4.3. Second, although IMMIGRATE is efficient, it is still time-consuming on datasets of large size. To further improve the computational efficiency of IMMIGRATE for large-scale datasets, we can improve training by using well-selected prototypes [32], which, as a subset of the original data, are representative but with noisy and redundant samples removed. Third, IMMIGRATE only considers pair-wise interactions between features. Interactions among multiple features can play important roles in real applications [33,34]. Our work provides a basis for developing new algorithms to detect multi-feature interactions; for example, one can use a tensor form to model weights for multi-feature interactions. Fourth, although our iterative optimization procedure is efficient, it achieves ad hoc solutions with no guarantee of reaching the global optimum. It remains an open challenge to develop better optimization algorithms. Finally, the selection of an appropriate σ currently relies on internal cross-validation, which cannot uncover the underlying properties of σ. A better strategy may be developed by rigorously investigating the theoretical contributions of σ.
Author Contributions: methodology, R.Z. and P.H.; software, R.Z.; validation, R.Z., P.H. and J.S.L.; investigation, R.Z., P.H. and J.S.L.; resources, R.Z., P.H. and J.S.L.; data curation, R.Z. and P.H.; writing–original draft preparation, R.Z.; writing–review and editing, R.Z., P.H. and J.S.L.; supervision, P.H. and J.S.L.; funding acquisition, P.H. and J.S.L.
Funding: This research was supported partially by the National Science Foundation grants DMS-1613035, DMS-1712714, and OAC-1920147.
Acknowledgments: The authors thank Xin Xing for valuable suggestions to improve the work, and Yang Li for helpful suggestions about the R code.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
NH Nearest Hit
NM Nearest Miss
IM4E Iterative Margin-Maximization under Max-Min Entropy algorithm
IMMIGRATE Iterative Max-MIn entropy marGin-maximization with inteRAction TErms algorithm
Appendix A. Information of the Real Datasets
Table A1. Summary of the UCI datasets and the gene expression datasets.
Data No.F ¹ No.I ² Full Name
BCW 9 116 Breast Cancer Wisconsin (Prognostic)
CRY 6 90 Cryotherapy
CUS 7 440 Wholesale customers
ECO 5 220 Ecoli
GLA 9 146 Glass Identification
HMS 3 306 Haberman’s Survival
IMM 7 90 Immunotherapy
ION 32 351 Ionosphere
LYM 16 142 Lymphography
MON 6 432 MONK’s Problems
PAR 22 194 Parkinsons
PID 8 768 Pima-Indians-Diabetes
SMR 60 208 Connectionist Bench (Sonar, Mines vs. Rocks)
STA 12 256 Statlog (Heart)
URB 147 238 Urban Land Cover
USE 5 251 User Knowledge Modeling
WIN 13 130 Wine
CRO 28 9003 Crowdsourced Mapping
ELE 12 10000 Electrical Grid Stability Simulated
WAV 21 3304 Waveform Database Generator
GLI 22283 85 Gliomas Strongly Predicts Survival [26]
COL 2000 62 Tumor and Normal Colon Tissues [27]
ELO 12625 173 Myeloma [28]
BRE 24481 78 Breast Cancer [29]
PRO 12600 136 Clinical Prostate Cancer Behavior [30]
¹ No.F: number of features. ² No.I: number of instances.
Appendix B. Heat Maps
Figure A1. Heat maps of feature weights learned by IMMIGRATE on the GLA, LYM, SMR, and STA datasets (axes: feature indices; the color bars show the weight values).
References
1. Fukunaga, K. Introduction to Statistical Pattern Recognition; Elsevier: Amsterdam, The Netherlands, 2013.
2. Kira, K.; Rendell, L.A. A practical approach to feature selection. In Machine Learning Proceedings 1992; Morgan Kaufmann: Burlington, MA, USA, 1992; pp. 249–256.
3. Gilad-Bachrach, R.; Navot, A.; Tishby, N. Margin based feature selection-theory and algorithms. In Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, 4–8 July 2004; p. 43.
4. Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning; Springer: Berlin, Germany, 1994; pp. 171–182.
5. Yang, M.; Wang, F.; Yang, P. A Novel Feature Selection Algorithm Based on Hypothesis-Margin. JCP 2008, 3, 27–34.
6. Sun, Y.; Li, J. Iterative RELIEF for feature weighting. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 913–920.
7. Sun, Y.; Wu, D. A relief based feature extraction algorithm. In Proceedings of the 2008 SIAM International Conference on Data Mining, Atlanta, GA, USA, 24–26 April 2008; pp. 188–195.
8. Bei, Y.; Hong, P. Maximizing margin quality and quantity. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6.
9. Urbanowicz, R.J.; Meeker, M.; La Cava, W.; Olson, R.S.; Moore, J.H. Relief-based feature selection: Introduction and review. J. Biomed. Inform. 2018, 85, 189–203.
10. Schapire, R.E. The strength of weak learnability. Mach. Learn. 1990, 5, 197–227.
11. Kuhn, H.W.; Tucker, A.W. Nonlinear programming. In Traces and Emergence of Nonlinear Programming; Springer: Berlin, Germany, 2014; pp. 247–258.
12. Li, Y.; Liu, J.S. Robust variable and interaction selection for logistic regression and general index models. J. Am. Stat. Assoc. 2018, 114, 1–16.
13. Bien, J.; Taylor, J.; Tibshirani, R. A lasso for hierarchical interactions. Ann. Stat. 2013, 41, 1111.
14. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. ICML 1996, 96, 148–156.
15. Freund, Y.; Mason, L. The alternating decision tree learning algorithm. ICML 1999, 99, 124–133.
16. Soentpiet, R. Advances in Kernel Methods: Support Vector Learning; MIT Press: Cambridge, MA, USA, 1999.
17. Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. 1996, 58, 267–288.
18. John, G.H.; Langley, P. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 1995; pp. 338–345.
19. Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice Hall PTR: Upper Saddle River, NJ, USA, 1994.
20. Aha, D.W.; Kibler, D.; Albert, M.K. Instance-based learning algorithms. Mach. Learn. 1991, 6, 37–66.
21. Weinberger, K.Q.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 2009, 10, 207–244.
22. Robnik-Šikonja, M.; Kononenko, I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69.
23. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
24. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
25. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
26. Freije, W.A.; Castro-Vargas, F.E.; Fang, Z.; Horvath, S.; Cloughesy, T.; Liau, L.M.; Mischel, P.S.; Nelson, S.F. Gene expression profiling of gliomas strongly predicts survival. Cancer Res. 2004, 64, 6503–6510.
27. Alon, U.; Barkai, N.; Notterman, D.A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A.J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 1999, 96, 6745–6750.
28. Tian, E.; Zhan, F.; Walker, R.; Rasmussen, E.; Ma, Y.; Barlogie, B.; Shaughnessy, J.D., Jr. The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. N. Engl. J. Med. 2003, 349, 2483–2494.
29. Van't Veer, L.J.; Dai, H.; Van De Vijver, M.J.; He, Y.D.; Hart, A.A.; Mao, M.; Peterse, H.L.; Van Der Kooy, K.; Marton, M.J.; Witteveen, A.T.; et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415, 530.
30. Singh, D.; Febbo, P.G.; Ross, K.; Jackson, D.G.; Manola, J.; Ladd, C.; Tamayo, P.; Renshaw, A.A.; D'Amico, A.V.; Richie, J.P.; et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1, 203–209.
31. Frank, A.; Asuncion, A. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 1 August 2019).
32. Garcia, S.; Derrac, J.; Cano, J.; Herrera, F. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 417–435.
33. Yu, S.; Giraldo, L.G.S.; Jenssen, R.; Principe, J.C. Multivariate Extension of Matrix-based Renyi's α-order Entropy Functional. IEEE Trans. Pattern Anal. Mach. Intell. 2019. doi:10.1109/TPAMI.2019.2932976.
34. Vinh, N.X.; Zhou, S.; Chan, J.; Bailey, J. Can high-order dependencies improve mutual information based feature selection? Pattern Recognit. 2016, 53, 46–58.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).