Effects of Distance Measure Choice on K-Nearest
Neighbor Classifier Performance: A Review
Haneen Arafat Abu Alfeilat,1 Ahmad B.A. Hassanat,1,* Omar Lasassmeh,1 Ahmad S. Tarawneh,2 Mahmoud Bashir Alhasanat,3,4 Hamzeh S. Eyal Salman,1 and V.B. Surya Prasath5-8
Abstract
The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question: which of the many available distance and similarity measures should be used with the KNN classifier? This review attempts to answer this question by evaluating the performance (measured by accuracy, precision, and recall) of the KNN using a large number of distance measures, tested on a number of real-world data sets, with and without adding different levels of noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed nonconvex distance performed the best on most data sets compared with the other tested distances. In addition, the performance of the KNN with this top-performing distance degraded by only ~20% when the noise level reached 90%, and the same holds for most of the distances used. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.
Keywords: big data; K-nearest neighbor; machine learning; noise; supervised learning
Introduction
Classification is an important problem in big data, data science, and machine learning. The K-nearest neighbor (KNN) algorithm is one of the oldest, simplest, and most accurate algorithms for pattern classification and regression models. KNN was proposed in 1951 by Fix and Hodges,1 and then modified by Cover and Hart.2 KNN has been identified as one of the top 10 methods in data mining.3 Consequently, KNN has been studied over the past few decades and widely applied in many fields.4 Thus, KNN serves as the baseline classifier in many pattern classification problems such as pattern recognition,5 text categorization,6 ranking models,7 object recognition,8 and event recognition9 applications. KNN is a nonparametric algorithm.10 Nonparametric means that there are either no parameters or no fixed number of parameters, irrespective of the size of the data; instead, the effective parameters are determined by the size of the training data set, and no assumptions need to be made about the underlying data distribution. Thus, KNN can be the best choice for any classification study that involves little or no prior knowledge about the distribution of the data. In addition, KNN is one of the laziest learning methods: it stores all the training data and waits until the test data are presented, without creating a learning model.11
Despite its slow characteristic, surprisingly, KNN is used extensively for big data classification; this includes
1Department of Computer Science, Faculty of Information Technology, Mutah University, Karak, Jordan.
2Department of Algorithms and Their Applications, Eötvös Loránd University, Budapest, Hungary.
3Department of Geomatics, Faculty of Environmental Design, King Abdulaziz University, Jeddah, Saudi Arabia.
4Department of Civil Engineering, Faculty of Engineering, Al-Hussein Bin Talal University, Maan, Jordan.
5Department of Pediatrics, Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio.
Departments of 6Pediatrics and 7Biomedical Informatics, College of Medicine, University of Cincinnati, Cincinnati, Ohio.
8Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio.
*Address correspondence to: Ahmad B.A. Hassanat, Department of Computer Science, Faculty of Information Technology, Mutah University, Karak 61710, Jordan, E-mail: hasanat@mutah.edu.jo
Big Data, Volume 00, Number 00, 2019
© Mary Ann Liebert, Inc.
DOI: 10.1089/big.2018.0175
the work of Refs.12-16; this is due to the other good characteristics of KNN, such as simplicity and reasonable accuracy, since the speed issue is normally solved using some kind of divide-and-conquer approach. Slowness is not the only problem associated with the KNN classifier: in addition to the problem of choosing the best K neighbors,17 choosing the best distance/similarity measure is an important problem, because the performance of the KNN classifier depends on the distance/similarity measure used. This article focuses on finding the best distance/similarity measure to be used with the KNN classifier to guarantee the highest possible accuracy.
Related work
Several studies have been conducted to analyze the performance of the KNN classifier using different distance measures. Each study was applied to various kinds of data sets with different distributions and types of data, and used a different number of distance and similarity measures.
Chomboon et al.18 analyzed the performance of the KNN classifier using 11 distance measures. These include the Euclidean distance (ED), Mahalanobis distance, Manhattan distance (MD), Minkowski distance, Chebychev distance, Cosine distance (CosD), Correlation distance (CorD), Hamming distance (HamD), Jaccard distance (JacD), Standardized Euclidean distance, and Spearman distance. Their experiments were applied to eight binary synthetic data sets with various kinds of distributions that were generated using MATLAB. They divided each data set into 70% for the training set and 30% for the testing set. The results showed that the MD, Minkowski, Chebychev, ED, Mahalanobis, and Standardized Euclidean distance measures achieved similar accuracy results and outperformed the other tested distances.
Mulak and Talhar19 evaluated the performance of the KNN classifier using the Chebychev, ED, and MD distance measures on the KDD data set.20 The KDD data set contains 41 features and 2 classes, and its data are numeric. The data set was normalized before conducting the experiment. To evaluate the performance of KNN, accuracy, sensitivity, and specificity measures were calculated for each distance. The reported results indicate that MD outperformed the other tested distances, with a 97.8% accuracy rate, 96.76% sensitivity rate, and 98.35% specificity rate.
Hu et al.21 analyzed the effect of distance measures on the KNN classifier for medical domain data sets. Their experiments were based on three different types of medical data sets containing categorical, numerical, and mixed types of data, which were chosen from the University of California, Irvine (UCI) machine learning repository, and four distance metrics: ED, CosD, Chi-square, and Minkowsky distances. They divided each data set into 90% of the data for training and 10% for testing, with K values ranging from 1 to 15. The experimental results showed that the Chi-square distance function was the best choice for the three different types of data sets. However, the CosD, ED, and Minkowsky distance metrics performed the "worst" over the mixed type of data sets. The "worst" performance means the method with the lowest accuracy.
Todeschini et al.22,23 analyzed the effect of 18 different distance measures on the performance of the KNN classifier using eight benchmark data sets. The investigated distance measures included MD, ED, Soergel distance (SoD), Lance–Williams distance, contracted Jaccard–Tanimoto distance, Jaccard–Tanimoto distance, Bhattacharyya distance (BD), Lagrange distance, Mahalanobis distance, Canberra distance (CanD), Wave-Edge distance, Clark distance (ClaD), CosD, CorD, and four locally centered Mahalanobis distances. To evaluate the performance of these distances, the nonerror rate and average rank were calculated for each distance. The results indicated that the "best" performers were the MD, ED, SoD, contracted Jaccard–Tanimoto, and Lance–Williams distance measures. The "best" performance means the method with the highest accuracy.
Lopes and Ribeiro24 analyzed the impact of five distance metrics, namely ED, MD, CanD, Chebychev distance, and Minkowsky distance, in instance-based learning algorithms, particularly the 1-nearest neighbor (1-NN) classifier and the Incremental Hypersphere Classifier. They reported the results of their empirical evaluation on 15 data sets of different sizes, showing that the Euclidean and Manhattan metrics yield significantly good results compared with the other tested distances.
Alkasassbeh et al.25 investigated the effect of the Euclidean, Manhattan, and a nonconvex distance due to Hassanat26 on the performance of the KNN classifier, with K ranging from 1 to the square root of the size of the training set, considering only the odd K's. In addition, they experimented with other classifiers such as the Ensemble Nearest Neighbor classifier17 and the Inverted Indexes of Neighbors Classifier.27 Their experiments were conducted on 28 data sets taken from the UCI machine learning repository. The reported results show that the Hassanat distance (HasD)26 outperformed both MD and ED on most of the tested data sets using the three investigated classifiers.
Lindi28 investigated three distance metrics to select the best performer among them for the KNN classifier, which was employed as a matcher for the face recognition system proposed for the NAO robot. The tested distances were the Chi-square distance, ED, and HasD. Their experiments showed that HasD outperformed the other two distances in terms of precision, but was slower than both of the other distances.
Table 1 provides a summary of these previous studies on evaluating various distances within the KNN classifier, along with the best distance assessed by each of them. As can be seen from this review of the most related studies, all of the previous studies investigated either a small number of distance and similarity measures (ranging from 3 to 18 distances), a small number of data sets, or both.
Contributions
In the KNN classifier, the distances between the test sample and the training data samples are identified by different measures. Therefore, distance measures play a vital role in determining the final classification output.21 ED is the most widely used distance metric in KNN classification; however, only a few studies have examined the effect of different distance metrics on the performance of KNN, and these used a small number of distances, a small number of data sets, or both. Such limited experiments cannot establish which distance is the best to be used with the KNN classifier. Therefore, this review attempts to bridge this gap by testing a large number of distance metrics on a large number of different data sets, in addition to investigating the distance metrics that are least affected by added noise.
The KNN classifier can deal with noisy data; therefore, we need to investigate the impact of choosing different distance measures on the KNN performance when classifying a large number of real-world data sets, in addition to investigating which distance has the lowest noise implications. There are two main research questions addressed in this review:
1. Which is the "best" distance metric to be implemented with the KNN classifier?
2. Which is the "best" distance metric to be implemented with the KNN classifier in the case of noise?
By the "best distance metric" we mean (in this review) the one that allows the KNN to classify test examples with the highest precision, recall, and accuracy, that is, the one that gives the best performance of the KNN in terms of accuracy.
The previous questions were partially answered by the aforementioned studies; however, most of the reviewed research in this regard used a small number of distance/similarity measures and/or a small number of data sets. This study investigates the use of a relatively large number of distances and data sets to draw more significant conclusions, in addition to reviewing a larger number of distances in one place.
Organization
We organized our review as follows. First, in the KNN and Distance Measures section, we provide an introductory overview of the KNN classification method, present its history, characteristics, advantages, and disadvantages, and review the definitions of the various distance measures used in conjunction with KNN. The Experimental Framework section explains the data sets that were used in the classification experiments, the structure of the experimental model, and the performance evaluation measures, and then presents and discusses the results produced by the experimental framework. Finally, in the Conclusions and Future Perspectives section, we provide the conclusions and possible future directions.
KNN and Distance Measures
Brief overview of KNN classifier
The KNN algorithm classifies an unlabeled test sample based on the majority of similar samples among the K nearest neighbors that are closest to the test sample. The distances between the test sample and each of the training data samples are determined by a specific distance measure.

Table 1. Comparison between previous studies of distance measures in the K-nearest neighbor classifier, along with the "best" performing distance

Reference   No. of distances   No. of data sets   Best distance
18          11                 8                  Manhattan, Minkowski, Chebychev, Euclidean, Mahalanobis, Standardized Euclidean
19          3                  1                  Manhattan
21          4                  37                 Chi-square
22          18                 8                  Manhattan, Euclidean, Soergel, Contracted Jaccard–Tanimoto, Lance–Williams
24          5                  15                 Euclidean and Manhattan
25          3                  28                 Hassanat
28          3                  2                  Hassanat
Ours        54                 28                 Hassanat

Comparatively, our current study compares the highest number of distance measures on a variety of data sets.
Figure 1 shows a KNN example containing training samples from two classes; the first class is "blue square" and the second class is "red triangle." The test sample is represented by the green circle. These samples are placed into a two-dimensional feature space with one dimension for each feature. To decide whether the test sample belongs to the class "blue square" or to the class "red triangle," KNN adopts a distance function to find the K nearest neighbors of the test sample; the majority class among these K neighbors predicts the class of the test sample. In this case, when K = 3, the test sample is classified to the class "red triangle" because there are two red triangles and only one blue square inside the inner circle, but when K = 5, it is classified to the "blue square" class because there are three blue squares and only two red triangles inside the outer circle.
KNN is simple, but has proved to be a highly efficient and effective algorithm for solving various real-life classification problems. However, KNN has some disadvantages, which include the following:
1. How to find the optimum K value in the KNN algorithm?
2. High computational time cost: we need to compute the distance between each test sample and all training samples, so each test example requires O(nm) time (number of operations), where n is the number of examples in the training data and m is the number of features for each example.
3. High memory requirement: we need to store all training samples, giving O(nm) space complexity.
4. Finally, we need to determine the distance function, which is the core of this study.
The first problem has been solved either by using all the examples and taking the inverted indexes,29 or by using ensemble learning.30 For the second and third problems, many studies have proposed different solutions based on reducing the size of the training data set; these include, but are not limited to, Refs.31-35, or the use of approximate KNN classification such as that of Arya and Mount36 and Zheng et al.37 Although some previous studies in the literature investigated the fourth problem (see the Related Work section), in this study we attempt to investigate the fourth problem on a much larger scale, that is, investigating a large number of distance metrics tested on a large set of problems. In addition, we investigate the effect of noise on choosing the most suitable distance metric to be used by the KNN classifier.
The basic KNN classifier steps can be described as follows:

Algorithm 1: Basic KNN algorithm
Input: Training samples D, test sample d, K
Output: Class label of test sample
1: Compute the distance between d and every sample in D
2: Choose the K samples in D that are nearest to d; denote the set by P (⊆ D)
3: Assign d the class that is the most frequent class (the majority class) in P
1. Training phase: The training samples and the class labels of these samples are stored; no missing data are allowed, and no non-numeric data are allowed.
2. Classification phase: Each test sample is classified using the majority vote of its neighbors by the following steps:
(a) Distances from the test sample to all stored training samples are calculated using a specific distance function or similarity measure.
(b) The K nearest neighbors of the test sample are selected, where K is a predefined small integer.
(c) The most repeated class of these K nearest neighbors is assigned to the test sample. In other words, a test sample is assigned to the class c if it is the most frequent class label among the K nearest training samples. If K = 1, then the class of the nearest neighbor is assigned to the test sample.
The KNN algorithm is described by Algorithm 1.

FIG. 1. An example of KNN classification with K neighbors, K = 3 (solid line circle) and K = 5 (dashed line circle); the distance measure is ED. Each triangle represents a training example with two features (x, y), which belongs to class 1. Each square represents a training example with two features (x, y), which belongs to class 2. The gray circle represents a test example with two features (x, y), which belongs to an unknown (?) class, and the KNN needs to predict its class based on the ED distance. ED, Euclidean distance; KNN, K-nearest neighbor.
We provide a toy example to illustrate how to compute the KNN classifier. Assume that we have three training examples with three attributes each, and one test example, as given in Table 2.
Step 1: Determine the parameter K = the number of nearest neighbors to be considered. For this example we assume K = 1.
Step 2: Calculate the distance between the test sample and all training samples using a specific similarity measure; in this example, ED is used, see Table 3.
Step 3: Sort all examples based on their similarity or distance to the tested example, and then keep only the K most similar (nearest) examples, as given in Table 4.
Step 4: Based on the minimum distance, the class of the test sample is assigned to be 1. However, if K = 3, for instance, the class would be 2.
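To make this procedure concrete, the following minimal sketch (in Python with NumPy, which this review does not prescribe) implements the basic KNN classifier with a pluggable distance function and reproduces the toy example of Tables 2-4; the default distance is the ED, and any of the measures reviewed below could be passed in instead.

import numpy as np

def knn_predict(X_train, y_train, x_test, k=1, distance=None):
    """Classify x_test by majority vote among its k nearest training samples."""
    if distance is None:  # default to the Euclidean distance (ED)
        distance = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
    dists = np.array([distance(x, x_test) for x in X_train])
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # most frequent class among the k

X_train = np.array([[5, 4, 3], [1, 2, 2], [1, 2, 3]])   # training samples of Table 2
y_train = np.array([1, 2, 2])
x_test = np.array([4, 4, 2])

print(knn_predict(X_train, y_train, x_test, k=1))   # -> 1 (nearest sample at distance 1.4)
print(knn_predict(X_train, y_train, x_test, k=3))   # -> 2 (two of the three neighbors are class 2)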
Noisy data
The existence of noise in data is mainly related to the way the data were acquired and preprocessed from the environment.38 Noisy data are data corrupted in some way, which leads to partial alteration of the data values. Two main sources of noise can be identified: first, the implicit errors caused by measurement tools, such as different types of sensors; second, the random errors caused by batch processes or experts while collecting data, for example, errors during document digitization. Based on these two sources of errors, two types of noise can be distinguished in a given data set39:
1. Class noise occurs when a sample is incorrectly labeled due to several causes, such as data entry errors during the labeling process, or the inadequacy of the information used to label each sample.
2. Attribute noise refers to corruptions in the values of one or more attributes due to several causes, such as failures in sensor devices, irregularities in sampling, or transcription errors.40
The generation of noise can be characterized by three main factors41:
1. The place where the noise is introduced: noise may affect the attributes, the class, the training data, and the test data, separately or in combination.
2. The noise distribution: the way in which the noise is introduced, for example, uniform or Gaussian.
3. The magnitude of the generated noise values: the extent to which the noise affects the data can be relative to each data value of each attribute, or relative to the standard deviation, minimum, or maximum of each attribute.
In this study, we add different noise levels to the tested data sets to find the distance metric that is least affected by this added noise with respect to the KNN classifier performance.
Distance measures review
The first appearance of the word distance can be found in the writings of Aristotle (384 BC–322 BC), who argued that the word distance means "It is between extremities that distance is greatest" or "things which have something between them, that is, a certain distance," and that "distance has the sense of dimension [as in space has three dimensions, length, breadth and depth]." Euclid, one of the most important mathematicians of ancient history, used the word distance only in his third postulate of the Principia42: "Every circle can be described by a centre and a distance."
Table 2. Training and testing data examples

                      X1   X2   X3   Class
Training sample (1)   5    4    3    1
Training sample (2)   1    2    2    2
Training sample (3)   1    2    3    2
Test sample           4    4    2    ?

Table 3. Training and testing data examples with distances

                      X1   X2   X3   Class   Distance
Training sample (1)   5    4    3    1       $D = \sqrt{(4-5)^2 + (4-4)^2 + (2-3)^2} = 1.4$
Training sample (2)   1    2    2    2       $D = \sqrt{(4-1)^2 + (4-2)^2 + (2-2)^2} = 3.6$
Training sample (3)   1    2    3    2       $D = \sqrt{(4-1)^2 + (4-2)^2 + (2-3)^2} = 3.7$
Test sample           4    4    2    ?

Table 4. Training and testing data examples with distances and ranks

                      X1   X2   X3   Class   Distance   Rank (minimum distance)
Training sample (1)   5    4    3    1       1.4        1
Training sample (2)   1    2    2    2       3.6        2
Training sample (3)   1    2    3    2       3.7        3
Test sample           4    4    2    ?
The distance is a numerical description of how far apart entities are. In data mining, the distance means a concrete way of describing what it means for elements of some space to be close to or far away from each other. Synonyms for distance include farness, dissimilarity, and diversity, and synonyms for similarity include proximity43 and nearness.22
The distance function between two vectors x and y is a function d(x, y) that defines the distance between both vectors as a non-negative real number. This function is considered a metric if it satisfies a certain number of properties44 that include the following:
1. Non-negativity: The distance between x and y is always a value greater than or equal to zero,
$d(x, y) \geq 0$.
2. Identity of indiscernibles: The distance between x and y is equal to zero if and only if x is equal to y,
$d(x, y) = 0 \iff x = y$.
3. Symmetry: The distance between x and y is equal to the distance between y and x,
$d(x, y) = d(y, x)$.
4. Triangle inequality: Considering the presence of a third point z, the distance between x and y is always less than or equal to the sum of the distance between x and z and the distance between z and y,
$d(x, y) \leq d(x, z) + d(z, y)$.
When the distance is in the range [0, 1], a corresponding similarity measure s(x, y) can be calculated as
$s(x, y) = 1 - d(x, y)$.
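As a brief illustration (not part of the original review), the following sketch numerically spot-checks the four metric properties above for a candidate distance function; passing random trials does not prove that a measure is a metric, but a single failure disproves it. Python with NumPy is assumed, and the Manhattan distance is used only as an example.

import numpy as np

def check_metric_properties(d, dim=4, trials=1000, tol=1e-9, rng=np.random.default_rng(0)):
    for _ in range(trials):
        x, y, z = rng.random(dim), rng.random(dim), rng.random(dim)
        assert d(x, y) >= -tol                       # non-negativity
        assert abs(d(x, x)) <= tol                   # identity of indiscernibles (d(x, x) = 0)
        assert abs(d(x, y) - d(y, x)) <= tol         # symmetry
        assert d(x, y) <= d(x, z) + d(z, y) + tol    # triangle inequality
    return True

manhattan = lambda x, y: np.sum(np.abs(x - y))
print(check_metric_properties(manhattan))            # -> True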
We consider 8 major distance families that consist of 54 total distance measures. We categorized these distance measures following a categorization similar to that of Cha.43 In what follows, we give the mathematical definitions of distances that measure the closeness between two vectors x and y, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ have numeric attributes. As an example, in each of the categories of distances reviewed here we show the computed distance value between the example vectors v1 = {5.1, 3.5, 1.4, 0.3} and v2 = {5.4, 3.4, 1.7, 0.2}. Theoretical analysis of these various distance metrics is beyond the scope of this study.
1. Lp Minkowski distance measures: This family includes three distance metrics that are special cases of the Minkowski distance, corresponding to different values of p for this power distance. The Minkowski distance, also known as the Lp norm, is a generalized metric. It is defined as
$D_{Mink}(x, y) = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$,
where p is a positive value. When p = 2, the distance becomes the ED; when p = 1, it becomes the MD. The Chebyshev distance (CD) is a variant of the Minkowski distance with $p = \infty$. Here $x_i$ is the ith value in vector x and $y_i$ is the ith value in vector y.
1.1. MD: The MD, also known as the L1 norm, Taxicab norm, rectilinear distance, or City block distance, was considered by Hermann Minkowski in 19th-century Germany. This distance is the sum of the absolute differences between the corresponding values in the two vectors:
$MD(x, y) = \sum_{i=1}^{n} |x_i - y_i|$.
1.2. CD: The CD is also known as the maximum value distance,45 Lagrange distance,22 and chessboard distance.46 This distance is appropriate when two objects are to be defined as different if they differ in any one dimension.47 It is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension:
$CD(x, y) = \max_{i} |x_i - y_i|$.
1.3. ED: Also known as the L2 norm or Ruler distance, which is an extension of the Pythagorean theorem. This distance is the square root of the sum of the squared differences between the corresponding values in the two vectors:
$ED(x, y) = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$.

Lp Minkowski distance measures family

Abbreviation   Name        Definition                                Result
MD             Manhattan   $\sum_{i=1}^{n} |x_i - y_i|$              0.8
CD             Chebyshev   $\max_{i} |x_i - y_i|$                    0.3
ED             Euclidean   $\sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$     0.4472
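A short sketch of this family (Python with NumPy assumed; not prescribed by the review) evaluates the generalized Minkowski distance on the example vectors v1 and v2, recovering the MD, ED, and CD values in the table above.

import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    return np.max(np.abs(x - y))     # the p -> infinity limit of the Minkowski distance

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

print(minkowski(v1, v2, p=1))        # MD  = 0.8
print(minkowski(v1, v2, p=2))        # ED  ~ 0.4472
print(chebyshev(v1, v2))             # CD  = 0.3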
2. L1 distance measures: This family depends mainly on the absolute differences; it includes the Lorentzian distance (LD), CanD, Sorensen distance (SD), SoD, Kulczynski distance (KD), Mean Character distance (MCD), and Non Intersection distance (NID).
2.1. LD: The LD is the sum of the natural logarithms of the absolute differences between the two vectors. This distance is sensitive to small changes, since the log scale expands the lower range and compresses the higher range:
$LD(x, y) = \sum_{i=1}^{n} \ln(1 + |x_i - y_i|)$,
where ln is the natural logarithm; 1 is added to ensure the non-negativity property and to avoid the log of 0.
2.2. CanD: The CanD was introduced in Ref.48 and modified by Lance and Williams.49 It is a weighted version of the MD, wherein the absolute difference between the attribute values of the vectors x and y is divided by the sum of the absolute attribute values before summing.50 This distance is mainly used for positive values. It is very sensitive to small changes near 0, where it is more sensitive to proportional than to absolute differences; this characteristic becomes more apparent in higher dimensional spaces, with an increasing number of variables. The CanD is often used for data scattered around an origin:
$CanD(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$.
2.3. SD: The SD,51 also known as Bray–Curtis, is one of the most commonly applied measures to express relationships in ecology, environmental sciences, and related fields. It is a modified Manhattan metric, wherein the summed differences between the attribute values of the vectors x and y are standardized by their summed attribute values.52 When all the vector values are positive, this measure takes a value between 0 and 1:
$SD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$.
2.4. SoD: The SoD is one of the distance measures widely used to calculate evolutionary distance.53 It is also known as the Ruzicka distance. For binary variables only, this distance is identical to the complement of the Tanimoto (or Jaccard) similarity coefficient.54 This distance obeys all four metric properties provided all attributes have non-negative values55:
$SoD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \max(x_i, y_i)}$.
2.5. KD: Similar to the SoD, but instead of using the maximum, it uses the minimum function:
$KD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}$.
2.6. MCD: Also known as the average Manhattan or Gower distance:
$MCD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{n}$.
2.7. NID: The NID is the complement of the intersection similarity and is obtained by subtracting the intersection similarity from 1:
$NID(x, y) = \frac{1}{2} \sum_{i=1}^{n} |x_i - y_i|$.

L1 distance measures family

Abbreviation   Name               Definition                                                           Result
LD             Lorentzian         $\sum_{i=1}^{n} \ln(1 + |x_i - y_i|)$                                0.7153
CanD           Canberra           $\sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$                   0.0381
SD             Sorensen           $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$      0.0381
SoD            Soergel            $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \max(x_i, y_i)}$   0.0734
KD             Kulczynski         $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}$   0.0792
MCD            Mean Character     $\frac{\sum_{i=1}^{n} |x_i - y_i|}{n}$                               0.2
NID            Non Intersection   $\frac{1}{2} \sum_{i=1}^{n} |x_i - y_i|$                             0.4
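The L1 family can be written compactly as functions of the absolute differences; the following sketch (Python with NumPy assumed, not prescribed by the review) evaluates each member on the same example vectors v1 and v2.

import numpy as np

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

l1_family = {
    "Lorentzian (LD)":        lambda x, y: np.sum(np.log(1 + np.abs(x - y))),
    "Canberra (CanD)":        lambda x, y: np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y))),
    "Sorensen (SD)":          lambda x, y: np.sum(np.abs(x - y)) / np.sum(x + y),
    "Soergel (SoD)":          lambda x, y: np.sum(np.abs(x - y)) / np.sum(np.maximum(x, y)),
    "Kulczynski (KD)":        lambda x, y: np.sum(np.abs(x - y)) / np.sum(np.minimum(x, y)),
    "Mean Character (MCD)":   lambda x, y: np.sum(np.abs(x - y)) / len(x),
    "Non Intersection (NID)": lambda x, y: 0.5 * np.sum(np.abs(x - y)),
}

for name, d in l1_family.items():
    print(f"{name}: {d(v1, v2):.4f}")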
3. Inner product distance measures: Distance measures belonging to this family are calculated from products of pairwise values from both vectors; this family includes JacD, CosD, the Dice distance (DicD), and the Chord distance (ChoD).
3.1. JacD: The JacD measures dissimilarity between sample sets. It is complementary to the Jaccard similarity coefficient56 and is obtained by subtracting the Jaccard coefficient from 1. This distance is a metric57:
$JacD(x, y) = \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}$.
3.2. CosD: The CosD, also called the angular distance, is derived from the cosine similarity, which measures the angle between two vectors; the CosD is obtained by subtracting the cosine similarity from 1:
$CosD(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$.
3.3. DicD: The DicD is derived from the dice similarity,58 being its complement, and is obtained by subtracting the dice similarity from 1. It can be sensitive to values near 0. This distance is not a metric; in particular, the triangle inequality property does not hold. This distance is widely used in information retrieval in documents and biological taxonomy:
$DicD(x, y) = 1 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}$.
3.4. ChoD: The ChoD is a modification of the ED,59 introduced by Orloci60 to be used in analyzing community composition data.61 It was defined as the length of the chord joining two normalized points within a hypersphere of radius 1. This distance is one of the distance measures commonly used for clustering continuous data62:
$ChoD(x, y) = \sqrt{2 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}}$.

Inner product distance measures family

Abbreviation   Name      Definition                                                                                                    Result
JacD           Jaccard   $\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}$   0.0048
CosD           Cosine    $1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$                  0.0016
DicD           Dice      $1 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}$                            0.9524
ChoD           Chord     $\sqrt{2 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}}$                0.0564
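Since every member of this family is built from the dot products $\sum x_i y_i$, $\sum x_i^2$, and $\sum y_i^2$, they can be computed together, as in the following sketch (Python with NumPy assumed), evaluated on v1 and v2.

import numpy as np

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

def inner_product_distances(x, y):
    xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return {
        "Jaccard (JacD)": np.sum((x - y) ** 2) / (xx + yy - xy),
        "Cosine (CosD)":  1 - xy / (np.sqrt(xx) * np.sqrt(yy)),
        "Dice (DicD)":    1 - 2 * xy / (xx + yy),
        "Chord (ChoD)":   np.sqrt(2 - 2 * xy / np.sqrt(xx * yy)),
    }

for name, value in inner_product_distances(v1, v2).items():
    print(f"{name}: {value:.4f}")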
4. Squared chord distance (SCD) measures: Distances that belong to this family are obtained from sums of geometric means; the geometric mean of two values is the square root of their product. The distances in this family cannot be used with feature vectors containing negative values. The family includes the Bhattacharyya distance, SCD, Matusita distance (MatD), and Hellinger distance (HeD).
4.1. BD: The BD measures the similarity of two probability distributions63:
$BD(x, y) = -\ln \sum_{i=1}^{n} \sqrt{x_i y_i}$.
4.2. SCD: The SCD is mostly used by paleontologists and in studies on pollen. In this distance, the sum of the squared differences of the square roots is taken along both vectors, which increases the difference for more dissimilar features:
$SCD(x, y) = \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2$.
4.3. MatD: The MatD is the square root of the SCD:
$MatD(x, y) = \sqrt{\sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$.
4.4. HeD: The HeD, also called the Jeffries–Matusita distance,64 was introduced in 1909 by Hellinger.65 It is a metric used to measure the similarity between two probability distributions and is closely related to the BD:
$HeD(x, y) = \sqrt{2 \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$.

SCD measures family

Abbreviation   Name            Definition                                            Result
BD             Bhattacharyya   $-\ln \sum_{i=1}^{n} \sqrt{x_i y_i}$                  2.34996
SCD            Squared chord   $\sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2$          0.0297
MatD           Matusita        $\sqrt{\sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$   0.1722
HeD            Hellinger       $\sqrt{2 \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$ 0.2436
5. Squared L2 distance measures: In the squared L2 distance family, the square of the difference at each point along both vectors contributes to the total distance. This family includes the Squared Euclidean distance (SED), ClaD, Neyman χ² distance (NCSD), Pearson χ² distance (PCSD), Squared χ² distance (SquD), Probabilistic Symmetric χ² distance (PSCSD), Divergence distance (DivD), Additive Symmetric χ² distance (ASCSD), Average distance (AD), Mean Censored Euclidean distance (MCED), and Squared Chi-Squared distance (SCSD).
5.1. SED: The SED is the sum of the squared differences without taking the square root:
$SED(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2$.
5.2. ClaD: The ClaD, also called the coefficient of divergence, was introduced by Clark.66 It is the square root of half of the DivD:
$ClaD(x, y) = \sqrt{\sum_{i=1}^{n} \left( \frac{x_i - y_i}{|x_i| + |y_i|} \right)^2}$.
5.3. NCSD: The NCSD67 is called a quasi-distance:
$NCSD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}$.
5.4. PCSD: The PCSD,68 also called the χ² distance:
$PCSD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i}$.
5.5. SquD: The SquD is also called the triangular discrimination distance. This distance is a quasi-distance:
$SquD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$.
5.6. PSCSD: This distance is equivalent to the Sangvi χ² distance:
$PSCSD(x, y) = 2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$.
5.7. DivD:
$DivD(x, y) = 2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{(x_i + y_i)^2}$.
5.8. ASCSD: Also known as the symmetric χ² divergence:
$ASCSD(x, y) = 2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2 (x_i + y_i)}{x_i y_i}$.
5.9. AD: The AD, also known as the average Euclidean distance, is a modified version of the ED.62 The ED has the following drawback: "if two data vectors have no attribute values in common, they may have a smaller distance than the other pair of data vectors containing the same attribute values,"59 so this distance was adopted:
$AD(x, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}$.
5.10. MCED: In this distance, the sum of squared differences between values is calculated and, to get the mean value, the summed value is divided by the number of pairs whose values are not both 0; then the square root of the mean is computed to get the final distance:
$MCED(x, y) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} \mathbf{1}_{x_i^2 + y_i^2 \neq 0}}}$.
5.11. SCSD:
$SCSD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{|x_i + y_i|}$.

Squared L2 distance measures family

Abbreviation   Name                         Definition                                                                                       Result
SED            Squared Euclidean            $\sum_{i=1}^{n} (x_i - y_i)^2$                                                                   0.2
ClaD           Clark                        $\sqrt{\sum_{i=1}^{n} \left( \frac{|x_i - y_i|}{x_i + y_i} \right)^2}$                           0.2245
NCSD           Neyman χ²                    $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}$                                                       0.1181
PCSD           Pearson χ²                   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i}$                                                       0.1225
SquD           Squared χ²                   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$                                                 0.0591
PSCSD          Probabilistic Symmetric χ²   $2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$                                               0.1182
DivD           Divergence                   $2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{(x_i + y_i)^2}$                                           0.1008
ASCSD          Additive Symmetric χ²        $2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2 (x_i + y_i)}{x_i y_i}$                                     0.8054
AD             Average                      $\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}$                                                0.2236
MCED           Mean Censored Euclidean      $\sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} \mathbf{1}_{x_i^2 + y_i^2 \neq 0}}}$   0.2236
SCSD           Squared Chi-Squared          $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{|x_i + y_i|}$                                               0.0591
6. Shannon entropy distance measures: The distance measures belonging to this family are related to the Shannon entropy.69 These distances include the Kullback–Leibler distance (KLD), Jeffreys distance (JefD), KDD, Topsoe distance (TopD), Jensen–Shannon distance (JSD), and Jensen difference distance (JDD).
6.1. KLD: The KLD was introduced by Kullback and Leibler.70 It is also known as the Kullback–Leibler divergence, relative entropy, or information deviation, and measures the difference between two probability distributions. This distance is not a metric measure, because it is not symmetric; furthermore, it does not satisfy the triangle inequality property, and therefore it is called a quasi-distance. The Kullback–Leibler divergence has been used in several natural language applications such as query expansion, language models, and categorization71:
$KLD(x, y) = \sum_{i=1}^{n} x_i \ln \frac{x_i}{y_i}$,
where ln is the natural logarithm.
6.2. JefD: The JefD,72 also called the J-divergence or KL2-distance, is a symmetric version of the KLD:
$JefD(x, y) = \sum_{i=1}^{n} (x_i - y_i) \ln \frac{x_i}{y_i}$.
6.3. KDD:
$KDD(x, y) = \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i}$.
6.4. TopD: The TopD,73 also called information statistics, is a symmetric version of the KLD. The TopD is twice the Jensen–Shannon divergence. This distance is not a metric, but its square root is a metric:
$TopD(x, y) = \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i}$.
6.5. JSD: The JSD is based on the Jensen–Shannon divergence, which uses the average method to make the K divergence symmetric; it is half of the TopD:
$JSD(x, y) = \frac{1}{2} \left[ \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i} \right]$.
6.6. JDD: The JDD was introduced by Sibson74:
$JDD(x, y) = \frac{1}{2} \sum_{i=1}^{n} \left[ \frac{x_i \ln x_i + y_i \ln y_i}{2} - \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2} \right) \right]$.

Shannon entropy distance measures family

Abbreviation   Name                Definition                                                                                                                                                       Result
KLD            Kullback–Leibler    $\sum_{i=1}^{n} x_i \ln \frac{x_i}{y_i}$                                                                                                                         0.3402
JefD           Jeffreys            $\sum_{i=1}^{n} (x_i - y_i) \ln \frac{x_i}{y_i}$                                                                                                                 0.1184
KDD            K divergence        $\sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i}$                                                                                                                 0.1853
TopD           Topsoe              $\sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i}$                                                               0.0323
JSD            Jensen–Shannon      $\frac{1}{2} \left[ \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i} \right]$                                    0.014809
JDD            Jensen difference   $\frac{1}{2} \sum_{i=1}^{n} \left[ \frac{x_i \ln x_i + y_i \ln y_i}{2} - \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2} \right) \right]$     0.0074
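The following sketch (Python with NumPy assumed, not prescribed by the review) implements four of these entropy-based measures; because they take logarithms of ratios, the inputs are assumed to have strictly positive components, as with the example vectors v1 and v2.

import numpy as np

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

def kullback_leibler(x, y):
    return np.sum(x * np.log(x / y))

def jeffreys(x, y):
    return np.sum((x - y) * np.log(x / y))            # symmetric version of the KLD

def topsoe(x, y):
    m = x + y
    return np.sum(x * np.log(2 * x / m)) + np.sum(y * np.log(2 * y / m))

def jensen_shannon(x, y):
    return 0.5 * topsoe(x, y)                         # JSD is half of the Topsoe distance

for d in (kullback_leibler, jeffreys, topsoe, jensen_shannon):
    print(d.__name__, round(d(v1, v2), 4))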
7. Vicissitude distance measures: The vicissitude distance family consists of four distances: the Vicis-Wave Hedges distance (VWHD), Vicis Symmetric distance (VSD), Max Symmetric χ² distance (MSCSD), and Min Symmetric χ² distance (MiSCSD). These distances were generated from syntactic relationships among the aforementioned distance measures.
7.1. VWHD: The so-called Wave-Hedges distance has been applied to compressed image retrieval,75 content-based video retrieval,76 time series classification,77 image fidelity,78 finger print recognition,79 etc. Interestingly, the source of the "Wave-Hedges" metric has not been correctly cited, and some of the previously mentioned resources allude to it incorrectly as given by Hedges.80 The source of this metric eludes the authors, despite best efforts otherwise. Even the name of the distance "Wave-Hedges" is questioned.26
$VWHD(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{\min(x_i, y_i)}$.
7.2. VSD: The VSD is defined by three formulas, VSDF1, VSDF2, and VSDF3, as follows:
$VSDF1(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)^2}$,
$VSDF2(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)}$,
$VSDF3(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\max(x_i, y_i)}$.
7.3. MSCSD:
$MSCSD(x, y) = \max\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$.
7.4. MiSCSD:
$MiSCSD(x, y) = \min\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$.

Vicissitude distance measures family

Abbreviation   Name                Definition                                                                                                Result
VWHD           Vicis-Wave Hedges   $\sum_{i=1}^{n} \frac{|x_i - y_i|}{\min(x_i, y_i)}$                                                       0.8025
VSDF1          Vicis Symmetric 1   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)^2}$                                                   0.3002
VSDF2          Vicis Symmetric 2   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)}$                                                     0.1349
VSDF3          Vicis Symmetric 3   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\max(x_i, y_i)}$                                                     0.1058
MSCSD          Max Symmetric χ²    $\max\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$   0.1225
MiSCSD         Min Symmetric χ²    $\min\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$   0.1181
8. Other distance measures: These metrics utilize multiple ideas or measures from the previous distance measures; they include, but are not limited to, the Average (L1, L∞) distance (AvgD), Kumar–Johnson distance (KJD), Taneja distance (TanD), Pearson distance (PeaD), CorD, Squared Pearson distance (SPeaD), HamD, Hausdorff distance (HauD), χ² statistic distance (CSSD), Whittaker's index of association distance (WIAD), Meehl distance (MeeD), Motyka distance (MotD), and HasD.
8.1. AvgD: The AvgD is the average of the MD and CD:
$AvgD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i| + \max_{i} |x_i - y_i|}{2}$.
8.2. KJD:
$KJD(x, y) = \sum_{i=1}^{n} \left( \frac{(x_i^2 + y_i^2)^2}{2 (x_i y_i)^{3/2}} \right)$.
8.3. TanD81:
$TanD(x, y) = \sum_{i=1}^{n} \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2 \sqrt{x_i y_i}} \right)$.
8.4. PeaD: The PeaD is derived from the Pearson correlation coefficient, which measures the linear relationship between two vectors.82 This distance is obtained by subtracting the Pearson correlation coefficient from 1:
$PeaD(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}$,
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
8.5. CorD: The CorD is a version of the PeaD, where the PeaD is scaled to obtain a distance measure in the range between 0 and 1:
$CorD(x, y) = \frac{1}{2} \left( 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)$.
8.6. SPeaD:
$SPeaD(x, y) = 1 - \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^2$.
8.7. HamD: The HamD83 is a distance metric that measures the number of mismatches between two vectors. It is mostly used for nominal data, strings, and bit-wise analyses, and can also be useful for numerical data:
$HamD(x, y) = \sum_{i=1}^{n} \mathbf{1}_{x_i \neq y_i}$.
8.8. HauD:
$HauD(x, y) = \max(h(x, y), h(y, x))$,
where $h(x, y) = \max_{x_i \in x} \min_{y_j \in y} \|x_i - y_j\|$, and $\|\cdot\|$ is the vector norm (e.g., the L2 norm). The function h(x, y) is called the directed HauD from x to y. The HauD(x, y) measures the degree of mismatch between the sets x and y by measuring the remoteness between each point $x_i$ and $y_j$ and vice versa.
8.9. CSSD: The CSSD has been used for image retrieval,84 histograms,85 etc.:
$CSSD(x, y) = \sum_{i=1}^{n} \frac{x_i - m_i}{m_i}$,
where $m_i = \frac{x_i + y_i}{2}$.
8.10. WIAD: The WIAD was designed for species abundance data86:
$WIAD(x, y) = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{x_i}{\sum_{i=1}^{n} x_i} - \frac{y_i}{\sum_{i=1}^{n} y_i} \right|$.
8.11. MeeD: The MeeD depends on one consecutive point in each vector:
$MeeD(x, y) = \sum_{i=1}^{n-1} (x_i - y_i - x_{i+1} + y_{i+1})^2$.
8.12. MotD:
$MotD(x, y) = \frac{\sum_{i=1}^{n} \max(x_i, y_i)}{\sum_{i=1}^{n} (x_i + y_i)}$.
8.13. HasD: This is a nonconvex distance introduced by Hassanat26:
$HasD(x, y) = \sum_{i=1}^{n} D(x_i, y_i)$,
where
$D(x_i, y_i) = \begin{cases} 1 - \frac{1 + \min(x_i, y_i)}{1 + \max(x_i, y_i)}, & \min(x_i, y_i) \geq 0 \\ 1 - \frac{1 + \min(x_i, y_i) + |\min(x_i, y_i)|}{1 + \max(x_i, y_i) + |\min(x_i, y_i)|}, & \min(x_i, y_i) < 0 \end{cases}$
As can be seen, each term $D(x_i, y_i)$ of HasD is bounded by [0, 1]. It reaches 1 when the maximum value approaches infinity while the minimum is finite, or when the minimum value approaches minus infinity while the maximum is finite. This is shown in Figure 2 and the following equation:
$\lim_{\max(A_i, B_i) \to \infty} D(A_i, B_i) = \lim_{\min(A_i, B_i) \to -\infty} D(A_i, B_i) = 1$.
By satisfying all the metric properties, this distance was proved to be a metric by Hassanat.26 In this metric, no matter what the difference between two values is, the per-dimension distance lies in the range 0 to 1, so the maximum distance approaches the dimension of the tested vectors; therefore, an increase in dimensionality increases the distance linearly in the worst case.
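Because HasD is the distance this review ultimately identifies as the top performer, a compact sketch is given below (Python with NumPy assumed, not prescribed by the review); it applies the piecewise per-dimension definition above and sums the contributions.

import numpy as np

def hassanat(x, y):
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    d = np.where(lo >= 0,
                 1 - (1 + lo) / (1 + hi),                               # case min(xi, yi) >= 0
                 1 - (1 + lo + np.abs(lo)) / (1 + hi + np.abs(lo)))     # case min(xi, yi) < 0
    return np.sum(d)

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])
print(round(hassanat(v1, v2), 4))   # -> 0.2571, matching the value reported in the table below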
FIG. 2. Representation of HasD between the points 0 and n, where n belongs to [-10, 10]. HasD, Hassanat distance.

Other distance measures family

Abbreviation   Name                               Definition                                                                                                                                                                          Result
AvgD           Average (L1, L∞)                   $\frac{\sum_{i=1}^{n} |x_i - y_i| + \max_{i} |x_i - y_i|}{2}$                                                                                                                       0.55
KJD            Kumar–Johnson                      $\sum_{i=1}^{n} \frac{(x_i^2 + y_i^2)^2}{2 (x_i y_i)^{3/2}}$                                                                                                                        21.2138
TanD           Taneja                             $\sum_{i=1}^{n} \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2 \sqrt{x_i y_i}} \right)$                                                                           0.0149
PeaD           Pearson                            $1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}$, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$   0.9684
CorD           Correlation                        $\frac{1}{2} \left( 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)$              0.4842
SPeaD          Squared Pearson                    $1 - \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^2$                        0.999
HamD           Hamming                            $\sum_{i=1}^{n} \mathbf{1}_{x_i \neq y_i}$                                                                                                                                          4
HauD           Hausdorff                          $\max(h(x, y), h(y, x))$, $h(x, y) = \max_{x_i \in x} \min_{y_j \in y} \|x_i - y_j\|$                                                                                               0.3
CSSD           χ² statistic                       $\sum_{i=1}^{n} \frac{x_i - m_i}{m_i}$, $m_i = \frac{x_i + y_i}{2}$                                                                                                                 0.0894
WIAD           Whittaker's index of association   $\frac{1}{2} \sum_{i=1}^{n} \left| \frac{x_i}{\sum_{i=1}^{n} x_i} - \frac{y_i}{\sum_{i=1}^{n} y_i} \right|$                                                                         1.9377
MeeD           Meehl                              $\sum_{i=1}^{n-1} (x_i - y_i - x_{i+1} + y_{i+1})^2$                                                                                                                                0.48
MotD           Motyka                             $\frac{\sum_{i=1}^{n} \max(x_i, y_i)}{\sum_{i=1}^{n} (x_i + y_i)}$                                                                                                                  0.5190
HasD           Hassanat                           $\sum_{i=1}^{n} D(x_i, y_i)$, where $D(x_i, y_i) = 1 - \frac{1 + \min(x_i, y_i)}{1 + \max(x_i, y_i)}$ if $\min(x_i, y_i) \geq 0$, and $1 - \frac{1 + \min(x_i, y_i) + |\min(x_i, y_i)|}{1 + \max(x_i, y_i) + |\min(x_i, y_i)|}$ if $\min(x_i, y_i) < 0$   0.2571

Experimental Framework
Data sets used for experiments
The experiments were done on 28 data sets that represent real-life classification problems, obtained from the UCI Machine Learning Repository.87 The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The database was created in 1987 by David Aha and fellow graduate students at UCI. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets.
Each data set consists of a set of examples. Each example is defined by a number of attributes, and all the examples inside a data set are represented by the same number of attributes. One of these attributes is called the class attribute, which contains the class value (label) of the data, whose values are to be predicted for the test examples. A short description of all the data sets used is provided in Table 5.
Experimental setup
Each data set is divided into two parts, one for training and the other for testing. For this purpose, 34% of the data set is used for testing and 66% of the data set is dedicated to training. The value of K is set to 1 for simplicity. The 34% of the data used as the test sample were chosen randomly, and each experiment on each data set was repeated 10 times to obtain different random examples for testing and training. The overall experimental framework is shown in Figure 3. Our experiments are divided into two major parts:
1. The first part of the experiments aims to find the best distance measure to be used by the KNN classifier without any noise in the data sets. We used all 54 distances reviewed in the Distance Measures Review section.
2. The second part of the experiments aims to find the best distance measure to be used by the KNN classifier in the case of noisy data. In this study, we define the "best" method as the method that performs with the highest accuracy. We added noise into each data set at various levels. The experiments in the second part were conducted using the top 10 distances, those that achieved the best results in the first part of the experiments. To create a noisy data set from the original data set, a noise level x% is selected in the range 10%-90%; the noise level determines the number of examples that are made noisy, the noisy examples are selected randomly, and every attribute of each selected example is corrupted by a random value chosen between the minimum and maximum values of that attribute. Algorithm 2 describes the process of corrupting data with random noise, to be used in further experiments for the purposes of this study (a code sketch follows the algorithm).

Algorithm 2: Create noisy data set
Input: Original data set D, level of noise x% in [10%-90%]
Output: Noisy data set
1: Number of noisy examples: N = x% * number of examples in D
2: Array NoisyExample[N]
3: for k = 1 to N do
4:   Randomly choose an example number E from D
5:   if E was chosen previously then
6:     Go to Step 4
7:   else
8:     NoisyExample[k] = E
9: for each attribute A_i do
10:   for each noisy example NE_j do
11:     RV = Random value between Min(A_i) and Max(A_i)
12:     NE_j.A_i = RV
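A sketch of Algorithm 2 is given below (Python with NumPy assumed; the layout of one example per row is an assumption, not part of the original algorithm): a randomly chosen fraction of the examples has every attribute replaced by a random value drawn between that attribute's minimum and maximum.

import numpy as np

def add_noise(data, noise_level, rng=np.random.default_rng(0)):
    """Return a copy of data with noise_level (0-1) of its examples corrupted."""
    noisy = data.copy()
    n_examples, n_features = data.shape
    n_noisy = int(noise_level * n_examples)                      # number of examples to corrupt
    rows = rng.choice(n_examples, size=n_noisy, replace=False)   # distinct examples, chosen randomly
    mins, maxs = data.min(axis=0), data.max(axis=0)              # per-attribute value ranges
    noisy[rows] = rng.uniform(mins, maxs, size=(n_noisy, n_features))
    return noisy

X = np.random.default_rng(1).random((10, 4))   # toy data set: 10 examples, 4 attributes
print(add_noise(X, noise_level=0.3))           # 3 of the 10 examples are corrupted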
Performance evaluation measures
Different measures are available for evaluating the performance of classifiers. In this study, three measures were used: accuracy, precision, and recall. Accuracy is calculated to evaluate the overall classifier performance. It is defined as the ratio of the test samples that are correctly classified to the total number of tested examples:
$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test samples}}. \quad (1)$
To assess the performance with respect to every class in a data set, we compute precision and recall measures. Precision (or positive predictive value) is the fraction of retrieved instances that are relevant, whereas recall (or sensitivity) is the fraction of relevant instances that are retrieved. These measures can be constructed by computing the following:
1. True positive (TP): The number of correctly classified examples of a specific class (as we calculate these measures for each class).
2. True negative (TN): The number of correctly classified examples that do not belong to the specific class.
3. False positive (FP): The number of examples that are incorrectly assigned to the specific class.
4. False negative (FN): The number of examples of the specific class that are incorrectly assigned to another class.

Table 5. Description of the real-world data sets used (from the UCI machine learning repository)

Name                 #E       #F   #C   Data type          Min      Max
Heart                270      25   2    Real and integer   0        564
Balance              625      4    3    Positive integer   1        5
Cancer               683      9    2    Positive integer   0        9
German               1000     24   2    Positive integer   0        184
Liver                345      6    2    Real and integer   0        297
Vehicle              846      18   4    Positive integer   0        1018
Vote                 399      10   2    Positive integer   0        2
BCW                  699      10   2    Positive integer   1        13,454,352
Haberman             306      3    2    Positive integer   0        83
Letter recognition   20,000   16   26   Positive integer   0        15
Wholesale            440      7    2    Positive integer   1        112,151
Australian           690      42   2    Positive real      0        100,001
Glass                214      9    6    Positive real      0        75.41
Sonar                208      60   2    Positive real      0        1
Wine                 178      13   3    Positive real      0.13     1680
EEG                  14,980   14   2    Positive real      86.67    715,897
Parkinson            1040     27   2    Positive real      0        1490
Iris                 150      4    3    Positive real      0.1      7.9
Diabetes             768      8    2    Real and integer   0        846
Monkey1              556      17   2    Binary             0        1
Ionosphere           351      34   2    Real               -1       1
Phoneme              5404     5    2    Real               -1.82    4.38
Segmen               2310     19   7    Real               -50      1386.33
Vowel                528      10   11   Real               -5.21    5.07
Wave21               5000     21   3    Real               -4.2     9.06
Wave40               5000     40   3    Real               -3.97    8.82
Banknote             1372     4    2    Real               -13.77   17.93
QSAR                 1055     41   2    Real               -5.256   147

BCW, Breast Cancer Wisconsin; #C, number of classes; #E, number of examples; #F, number of features; Max, maximum; Min, minimum; QSAR, quantitative structure activity relationships.
The precision and recall of a multiclass classification system are defined by
$\text{AveragePrecision} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}, \quad (2)$
$\text{AverageRecall} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}, \quad (3)$
where N is the number of classes, $TP_i$ is the number of TPs for class i, $FN_i$ is the number of FNs for class i, and $FP_i$ is the number of FPs for class i.

FIG. 3. The framework of our experiments for discerning the effect of various distances on the performance of the KNN classifier. 1-NN, 1-nearest neighbor.

These performance measures can be derived from the confusion matrix. The confusion matrix is a matrix that shows the predicted and actual classifications. The matrix is n × n, where n is the number of classes. The structure of the confusion matrix for multiclass classification is given by
$\begin{pmatrix} & \text{Classified as } c_1 & \text{Classified as } c_2 & \cdots & \text{Classified as } c_n \\ \text{Actual class } c_1 & c_{11} & c_{12} & \cdots & c_{1n} \\ \text{Actual class } c_2 & c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \text{Actual class } c_n & c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix} \quad (4)$
This matrix reports the numbers of FPs, FNs, TPs, and TNs, which are defined through the elements of the confusion matrix as follows:
$TP_i = c_{ii}, \quad (5)$
$FP_i = \sum_{k=1}^{N} c_{ki} - TP_i, \quad (6)$
$FN_i = \sum_{k=1}^{N} c_{ik} - TP_i, \quad (7)$
$TN_i = \sum_{k=1}^{N} \sum_{f=1}^{N} c_{kf} - TP_i - FP_i - FN_i. \quad (8)$
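The following sketch (Python with NumPy assumed; the example confusion matrix is hypothetical) computes the accuracy of Equation (1) and the macro-averaged precision and recall of Equations (2) and (3) from a confusion matrix, using the per-class counts defined in Equations (5)-(7).

import numpy as np

def evaluate(confusion):
    """Accuracy, average precision, and average recall from a confusion matrix
    C, where C[i, j] counts samples of actual class i classified as class j."""
    C = np.asarray(confusion, dtype=float)
    tp = np.diag(C)                       # TP_i = c_ii, Equation (5)
    fp = C.sum(axis=0) - tp               # FP_i = column sum minus TP_i, Equation (6)
    fn = C.sum(axis=1) - tp               # FN_i = row sum minus TP_i, Equation (7)
    accuracy = tp.sum() / C.sum()         # Equation (1)
    precision = np.mean(tp / (tp + fp))   # Equation (2)
    recall = np.mean(tp / (tp + fn))      # Equation (3)
    return accuracy, precision, recall

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
C = [[50, 2, 3],
     [4, 45, 1],
     [2, 3, 40]]
print(evaluate(C))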
Accuracy, precision, and recall are calculated for the KNN classifier using all the similarity measures and distance metrics discussed in the Distance Measures Review section, on all the data sets described in Table 5; this allows us to compare and assess the performance of the KNN classifier under the different distance metrics and similarity measures.
Experimental results and discussion
For the purposes of this review, two sets of experiments were conducted. The aim of the first set is to compare the performance of the KNN classifier when used with each of the 54 distance and similarity measures reviewed in the Distance Measures Review section, without any added noise. The second set of experiments is designed to find the most robust distance, that is, the one least affected by different levels of noise.
Without noise
A number of different predefined distance families were used in this set of experiments. The accuracy of each distance on each data set was averaged over 10 runs. The same procedure was followed for all other distance families to
Table 6. Average accuracies, recalls, precisions over all data sets for each distance
Distance Accuracy Recall Precision Distance Accuracy Recall Precision
ED 0.8001 0.6749 0.6724 PSCSD 0.6821 0.5528 0.5504
MD 0.8113 0.6831 0.681 DivD 0.812 0.678 0.6768
CD 0.7708 0.656 0.6467 ClaD 0.8227 0.6892 0.6871
LD 0.8316 0.6964 0.6934 ASCSD 0.6259 0.4814 0.4861
CanD 0.8282 0.6932 0.6916 SED 0.8001 0.6749 0.6724
SD 0.7407 0.6141 0.6152 AD 0.8001 0.6749 0.6724
SoD 0.7881 0.6651 0.6587 SCSD 0.8275 0.693 0.6909
KD 0.6657 0.5369 0.5325 MCED 0.7973 0.6735 0.6704
MCD 0.8113 0.6831 0.681 TopD 0.6793 0.461 0.4879
NID 0.8113 0.6831 0.681 JSD 0.6793 0.461 0.4879
CosD 0.803 0.6735 0.6693 JDD 0.7482 0.5543 0.5676
ChoD 0.7984 0.662 0.6647 JefD 0.7951 0.6404 0.6251
JacD 0.8024 0.6756 0.6739 KLD 0.513 0.3456 0.3808
DicD 0.8024 0.6756 0.6739 KDD 0.5375 0.3863 0.4182
SCD 0.8164 0.65 0.4813 KJD 0.6501 0.4984 0.5222
HeD 0.8164 0.65 0.6143 TanD 0.7496 0.5553 0.5718
BD 0.4875 0.3283 0.4855 AvgD 0.8084 0.6811 0.6794
MatD 0.8164 0.65 0.5799 HamD 0.6413 0.5407 0.5348
VWHD 0.6174 0.4772 0.5871 MeeD 0.415 0.1605 0.3333
VSDF1 0.7514 0.6043 0.5125 WIAD 0.812 0.6815 0.6804
VSDF2 0.6226 0.4828 0.5621 HauD 0.5967 0.4793 0.4871
VSDF3 0.7084 0.5791 0.5621 CSSD 0.4397 0.2538 0.332
MSCSD 0.7224 0.5876 0.3769 SPeaD 0.8023 0.6711 0.6685
MiSCSD 0.6475 0.5137 0.5621 CorD 0.803 0.6717 0.6692
PCSD 0.6946 0.5709 0.5696 PeaD 0.759 0.6546 0.6395
NCSD 0.6536 0.5144 0.5148 MotD 0.7407 0.6141 0.6152
SquD 0.6821 0.5528 0.5504 HasD 0.8394 0.7018 0.701
Bold values signify the best performance, which is the highest accuracy, precision or recall.
HasD obtained the highest overall average.
AD, Average distance; ASCSD, Additive Symmetric χ² distance; AvgD, Average (L1, L∞) distance; BD, Bhattacharyya distance; CanD, Canberra distance; CD, Chebyshev distance; ChoD, Chord distance; ClaD, Clark distance; CorD, Correlation distance; CosD, Cosine distance; CSSD, χ² statistic distance; DicD, Dice distance; DivD, Divergence distance; ED, Euclidean distance; HamD, Hamming distance; HasD, Hassanat distance; HauD, Hausdorff distance; HeD, Hellinger distance; JacD, Jaccard distance; JDD, Jensen difference distance; JefD, Jeffreys distance; JSD, Jensen–Shannon distance; KD, Kulczynski distance; KDD, K divergence distance; KJD, Kumar–Johnson distance; KLD, Kullback–Leibler distance; LD, Lorentzian distance; MatD, Matusita distance; MCD, Mean Character distance; MCED, Mean Censored Euclidean distance; MD, Manhattan distance; MeeD, Meehl distance; MiSCSD, Min symmetric χ² distance; MotD, Motyka distance; MSCSD, Max symmetric χ² distance; NCSD, Neyman χ² distance; NID, Non Intersection distance; PCSD, Pearson χ² distance; PeaD, Pearson distance; PSCSD, Probabilistic Symmetric χ² distance; SCD, Squared chord distance; SCSD, Squared Chi-Squared distance; SD, Sorensen distance; SED, Squared Euclidean distance; SoD, Soergel distance; SPeaD, Squared Pearson distance; SquD, Squared χ² distance; TanD, Taneja distance; TopD, Topsoe distance; VSD, Vicis symmetric distance; VWHD, Vicis-Wave Hedges distance; WIAD, Whittaker's index of association distance.
report accuracy, recall, and precision of the KNN classifier for each distance on each data set. The average values for each of the 54 distances considered in this article are summarized in Table 6, where HasD obtained the highest overall average.
Table 7 shows the distances that obtained the highest
accuracy on each data set. Based on these results we
summarize the following observations.
The distance measures in the L1 family outperformed the other distance families in five data sets. LD achieved the highest accuracy in two data sets, namely Vehicle and Vowel, with average accuracies of 69.13% and 97.71%, respectively. CanD also achieved the highest accuracy in two data sets, Australian and Wine, with average accuracies of 82.09% and 98.5%, respectively. SD and SoD achieved the highest accuracy on the Segmen data set with an average accuracy of 96.76%. Among the Lp Minkowski and L1 distance families, MD, NID, and MCD achieved the same overall accuracies on all data sets; this is due to the similarity between these distances.
In the inner product family, JacD and DicD outperform all other tested distances on the Letter recognition data set with an average accuracy of 95.16%. Among the Lp Minkowski and inner product distance families, CD, JacD, and DicD outperform the other tested distances on the Banknote data set with an average accuracy of 100%.
In the Squared chord family, MatD, SCD, and HeD achieved the same overall accuracies on all data sets; this is expected because these distances are very similar.
In the Squared L2 distance measures family, SquD and PSCSD achieved the same overall accuracy on all data sets; this is due to the similarity between these two distances. The distance measures in this family outperform the other distance families on two data sets: ASCSD achieved the highest accuracy on the German data set with an average accuracy of 71%, and ClaD and DivD achieved the highest accuracy on the Vote data set with an average accuracy of 91.87%. Among the Lp Minkowski and Squared L2 distance measures families, ED, SED, and AD achieved the same performance on all data sets; this is due to the similarity between these three distances. Also, these distances and MCED outperform the other tested distances in two data sets, Wave21 and Wave40, with average accuracies of 77.74% and 75.87%, respectively. Among the L1 and Squared L2 families, MCED and LD achieved the highest accuracy on the Glass data set with an average accuracy of 71.11%.
In the Shannon entropy distance measures family, JSD and TopD achieved the same overall accuracies on all data sets; this is due to the similarity between the two distances, as TopD is twice the JSD. KLD outperforms all the tested distances on the Haberman data set with an average accuracy of 73.27%.
The Vicissitude distance measures family outperforms the other distance families on five data sets: VSDF1 achieved the highest accuracy in three data sets, Liver, Parkinson, and Phoneme, with accuracies of 65.81%, 99.97%, and 89.8%, respectively; MSCSD achieved the highest accuracy on the Diabetes data set with an average accuracy of 68.79%; and MiSCSD achieved the highest accuracy on the Sonar data set with an average accuracy of 87.71%.
The other distance measures family outperforms all
other distance families in seven data sets. The
WIAD achieved the highest accuracy on Monkey1
data set with an average accuracy of 94.97%. The
AvgD also achieved the highest accuracy on the
Wholesale data set with an average accuracy of
Table 7. The highest accuracy in each data set
Data set Distance Accuracy
Australian CanD 0.8209
Balance ChoD, SPeaD, CorD, CosD 0.933
Banknote CD, DicD, JacD 1
BCW HasD 0.9624
Cancer HasD 0.9616
Diabetes MSCSD 0.6897
Glass LD, MCED 0.7111
Haberman KLD 0.7327
Heart HamD 0.7714
Ionosphere HasD 0.9025
Liver VSDF1 0.6581
Monkey1 WIAD 0.9497
Parkinson VSDF1 0.9997
Phoneme VSDF1 0.898
QSAR HasD 0.8257
Segmen SoD, SD 0.9676
Sonar MiSCSD 0.8771
Vehicle LD 0.6913
Vote ClaD, DivD 0.9178
Vowel LD 0.9771
Wholesale AvgD 0.8866
Wine CanD 0.985
German ASCSD 0.71
Iris ChoD, SPeaD, CorD, CosD 0.9588
Wave21 ED, AD, MCED, SED 0.7774
EEG ChoD, SPeaD, CorD, CosD 0.9725
Wave40 ED, AD, MCED, SED 0.7587
Letter recognition JacD, DicD 0.9516
88.66%. HasD also achieved the highest accuracy in
four data sets, namely, Cancer, Breast Cancer Wis-
consin (BCW), Ionosphere, and quantitative struc-
ture activity relationships (QSAR) with average
accuracies of 96.16%, 96.24%, 90.25%, and 82.57%,
respectively. Finally, HamD achieved the highest ac-
curacy on the Heart data set with an average accu-
racy of 77.14%. Among the inner product and
other distance measures families, SPeaD, CorD,
ChoD, and CosD outperform other tested distances
in three data sets, namely, Balance, Iris, and EEG, with average accuracies of 94.3%, 95.88%, and 97.25%, respectively.
Table 8 shows the distances that obtained the highest
recall on each data set. Based on these results, we sum-
marize the following observations.
The L1 distance measures family outperforms the other distance families in seven data sets. For example, CanD achieved the highest recalls in two data sets, Australian and Wine, with 81.83% and 73.94% average recalls, respectively. LD achieved the highest recalls on four data sets, Glass, Ionosphere, Vehicle, and Vowel, with 51.15%, 61.52%, 54.85%, and 97.68% average recalls, respectively. SD and SoD achieved the highest recall on the Segmen data set with 84.67% average recall. Among the Lp Minkowski and L1 distance families, MD, NID, and MCD achieved the same performance, as expected, due to their similarity.
In the Inner Product distance measures family, JacD and DicD outperform all other tested distances on the Haberman data set with 38.53% average recall. Among the Lp Minkowski and Inner Product distance measures families, CD, JacD, and DicD also outperform the other tested distances on the Banknote data set with 100% average recall.
In the Squared chord family, MatD, SCD, and HeD achieved the same performance; this is due to the similarity of their equations, as clarified previously.
The Squared L2 distance measures family outperforms the other distance families on three data sets: ClaD and DivD outperform the other tested distances on the Vote data set with 91.03% average recall, PCSD outperforms the other tested distances on the Wholesale data set with 58.16% average recall, and ASCSD outperforms the other tested distances on the German data set with 43.92% average recall. Among the Lp Minkowski and Squared L2 distance measures families, ED, SED, and AD achieved the same performance on all data sets; this is due to the similarity of their equations, as clarified previously. These distances and MCED outperform the other tested distances in two data sets, Wave21 and Wave40, with 77.71% and 75.88% average recalls, respectively.
In Shannon entropy distance measures family,
JSD and TopD achieved similar performance as
expected, due to their similarity.
The Vicissitude distance measures family outper-
forms the other distance families in six data sets.
VSDF1 achieved the highest recall on three data
sets: Liver, Parkinson, and Phoneme data sets
with 43.65%, 99.97%, 88.13% average recalls, re-
spectively. MSCSD achieved the highest recall on
Diabetes data set with 43.71% average recall.
MiSCSD also achieved the highest recall on
Sonar data set with 58.88% average recall.
VSDF2 achieved the highest recall on Letter rec-
ognition data set with 95.14% average recall.
The other distance measures family outperforms
all other distance families in five data sets. Partic-
ularly, HamD achieved the highest recall on the
Heart data set with 51.22% average recall.
WIAD also achieved the highest average recall
on the Monkey1 data set with 94.98% average
Table 8. The highest recall in each data set
Data set Distance Recall
Australian CanD 0.8183
Balance ChoD, SPeaD, CorD, CosD 0.6437
Banknote CD, DicD, JacD 1
BCW HasD 0.3833
Cancer HasD 0.9608
Diabetes MSCSD 0.4371
Glass LD 0.5115
Haberman DicD, JacD 0.3853
Heart HamD 0.5122
Ionosphere LD 0.6152
Liver VSDF1 0.4365
Monkey1 WIAD 0.9498
Parkinson VSDF1 0.9997
Phoneme VSDF1 0.8813
QSAR HasD 0.8041
Segmen SoD, SD 0.8467
Sonar MiSCSD 0.5888
Vehicle LD 0.5485
Vote ClaD, DivD 0.9103
Vowel LD 0.9768
Wholesale PCSD 0.5816
Wine CanD 0.7394
German ASCSD 0.4392
Iris ChoD, SPeaD, CorD, CosD 0.9592
Wave21 ED, AD, MCED, SED 0.7771
EEG ChoD, SPeaD, CorD, CosD 0.9772
Wave40 ED, AD, MCED, SED 0.7588
Letter recognition VSDF2 0.9514
recall. HasD also achieved the highest average recall on three data sets, namely, Cancer, BCW, and QSAR, with 96.08%, 38.33%, and 80.41% average recalls, respectively. Among the inner product and other distance measures families, SPeaD, CorD, ChoD, and CosD outperform the other tested distances in three data sets, namely, Balance, Iris, and EEG, with 64.37%, 95.92%, and 97.72% average recalls, respectively.
Table 9 shows the distances that obtained the highest
precision on each data set. Based on these results, we
summarize the following observations.
The distance measures in the L1 family outperformed the other distance families in five data sets. CanD achieved the highest precision on two data sets, namely, Australian and Wine, with 81.88% and 74.08% average precisions, respectively. SD and SoD achieved the highest precision on the Segmen data set with 84.66% average precision. In addition, LD achieved the highest precision on three data sets, namely, Glass, Vehicle, and Vowel, with 51.15%, 55.37%, and 97.87% average precisions, respectively. Among the Lp Minkowski and L1 distance families, MD, NID, and MCD achieved the same performance on all data sets; this is due to the similarity of their equations, as clarified previously.
The Inner Product family outperforms the other distance families in two data sets. JacD and DicD outperform the other tested measures on the Wholesale data set with 58.53% average precision. Among the Lp Minkowski and Inner Product distance families, CD, JacD, and DicD outperformed the other tested distances on the Banknote data set with 100% average precision.
In the Squared chord family, MatD, SCD, and HeD achieved the same overall precision results on all data sets; this is due to the similarity of their equations, as clarified previously.
In the Squared L2 distance measures family, SquD and PSCSD achieved the same performance; this is due to the similarity of their equations, as clarified previously. The distance measures in this family outperform the other distance families on three data sets: ASCSD achieved the highest average precisions on two data sets, Diabetes and German, with 44.01% and 43.43% average precisions, respectively, and ClaD and DivD achieved the highest precision on the Vote data set with 92.11% average precision. Among the Lp Minkowski and Squared L2 distance measures families, ED, SED, and AD achieved the same performance, as expected, due to their similarity. These distances and MCED outperform the other tested measures in two data sets, Wave21 and Wave40, with 77.75% and 75.9% average precisions, respectively. Also, ED, SED, and AD outperform the other tested measures on the Letter recognition data set with 95.57% average precision.
In the Shannon entropy distance measures family, JSD and TopD achieved the same overall precision on all data sets, due to the similarity of their equations, as clarified earlier.
The Vicissitude distance measures family outper-
forms other distance families on four data sets.
VSDF1 achieved the highest average precisions
on three data sets: Liver, Parkinson, and Phoneme
with 43.24%, 99.97%, and 87.23% average preci-
sions, respectively. MiSCSD also achieved the
highest precision on the Sonar data set with
57.99% average precision.
The other distance measures family outperforms
all the other distance families in six data sets. In
particular, HamD achieved the highest precision
on the Heart data set with 51.12% average
Table 9. The highest precision in each data set
Data set Distance Precision
Australian CanD 0.8188
Balance ChoD, SPeaD, CorD, CosD 0.6895
Banknote CD, DicD, JacD 1
BCW HasD 0.3835
Cancer HasD 0.9562
Diabetes ASCSD 0.4401
Glass LD 0.5115
Haberman SPeaD, CorD, CosD 0.3887
Heart HamD 0.5112
Ionosphere HasD 0.5812
Liver VSDF1 0.4324
Monkey1 WIAD 0.95
Parkinson VSDF1 0.9997
Phoneme VSDF1 0.8723
QSAR HasD 0.8154
Segmen SoD, SD 0.8466
Sonar MiSCSD 0.5799
Vehicle LD 0.5537
Vote ClaD, DivD 0.9211
Vowel LD 0.9787
Wholesale DicD, JacD 0.5853
Wine CanD 0.7408
German ASCSD 0.4343
Iris ChoD, SPeaD, CorD, CosD 0.9585
Wave21 ED, AD, MCED, SED 0.7775
EEG ChoD, SPeaD, CorD, CosD 0.9722
Wave40 ED, AD, MCED, SED 0.759
Letter recognition ED, AD, SED 0.9557
precision. Also, WIAD achieved the highest precision on the Monkey1 data set with 95% average precision. Moreover, HasD yields the highest precision in four data sets, namely, Cancer, BCW, Ionosphere, and QSAR, with 95.62%, 38.35%, 58.12%, and 81.54% average precisions, respectively. Among the inner product and other distance measures families, SPeaD, CorD, ChoD, and CosD outperform the other tested distances in three data sets, namely, Balance, Iris, and EEG, with 68.95%, 95.85%, and 97.22% average precisions, respectively. Also, CosD, SPeaD, and CorD achieved the highest precision on the Haberman data set with 38.87% average precision.
Table 10 gives the top 10 distances with respect to the overall average accuracy, recall, and precision over all data sets. HasD outperforms all other tested distances on all performance measures, followed by LD, CanD, and SCSD. Moreover, taking a closer look at the average as well as the highest accuracies, precisions, and recalls, we find that HasD outperforms all other distance measures on four data sets, namely, Cancer, BCW, Ionosphere, and QSAR; this holds for accuracy, precision, and recall, and HasD is the only distance that won at least four data sets in this noise-free experiment set. Note that the performance within each of the following five groups, (1) MCD, MD, and NID, (2) AD, ED, and SED, (3) TopD and JSD, (4) SquD and PSCSD, and (5) MatD, SCD, and HeD, is identical, owing to the close similarity in how the corresponding distances are defined.
We attribute the success of HasD in this experimental part to its characteristics discussed in the Distance Measures Review section (see the distance equation in 8.13, Fig. 2): each dimension of the tested vectors contributes at most 1 to the final distance, which lowers and neutralizes the effects of outliers in different data sets. To further analyze the performance of HasD compared with the other top distances, we used the Wilcoxon rank-sum test [88]. This is a nonparametric pairwise test that aims to detect significant differences between two sample means, in order to judge whether the null hypothesis holds. The null hypothesis is a hypothesis used in statistics that assumes there is no significant difference between different results or observations. This test was conducted between HasD and each of the other top distances (Table 10) over the tested data sets. Therefore, our null hypothesis is ''there is no significant difference between the performance of HasD and the compared distance over all the data sets used.'' According to the Wilcoxon test, if the p-value is less than the significance level (0.05), we reject the null hypothesis and conclude that there is a significant difference between the tested samples; otherwise we cannot conclude anything about the significance of the difference [89].
The accuracies, recalls, and precisions of HasD over all the data sets used in this experiment set were compared with those of each of the top 10 distance measures, with the corresponding p-values given in Table 11. The p-values that were less than the significance level (0.05) are highlighted in bold. As given in Table 11, the p-values of the accuracy results are less than the significance level (0.05) eight times; in these cases we can reject the null hypothesis and conclude that there is a significant difference between the performance of HasD and that of ED, CanD, CosD, ClaD, SCSD, WIAD, CorD, and DivD. Since the average performance of HasD was better than all of these distance measures in the previous tables, we can conclude that the accuracy yielded by HasD is better than that of most of the distance measures tested. Similar
Table 10. The top 10 distances in terms of average accuracy, recall, and precision-based performance on noise-free data sets
Rank Accuracy distance Average Rank Recall distance Average Rank Precision distance Average
1 HasD 0.8394 1 HasD 0.7018 1 HasD 0.701
2 LD 0.8316 2 LD 0.6964 2 LD 0.6934
3 CanD 0.8282 3 CanD 0.6932 3 CanD 0.6916
4 SCSD 0.8275 4 SCSD 0.693 4 SCSD 0.6909
5 ClaD 0.8227 5 ClaD 0.6892 5 ClaD 0.6871
6 DivD 0.812 6 MD 0.6831 6 MD 0.681
6 WIAD 0.812 7 WIAD 0.6815 7 WIAD 0.6804
7 MD 0.8113 8 AvgD 0.6811 8 DivD 0.6768
8 AvgD 0.8084 9 DivD 0.678 9 DicD 0.6739
9 CosD 0.803 10 DicD 0.6756 10 ED 0.6724
9 CorD 0.803
10 DicD 0.8024
analysis applies to the recall and precision columns when comparing the HasD results with those of the other distances.
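For readers who wish to reproduce this comparison, the sketch below shows how such a pairwise test could be run with SciPy's rank-sum implementation; the per-data-set accuracy arrays are synthetic placeholders, not the values reported in this review.

```python
import numpy as np
from scipy.stats import ranksums

def compare_distances(scores_a, scores_b, alpha=0.05):
    """Wilcoxon rank-sum test between the per-data-set scores of two
    distance measures, following the decision rule described above."""
    stat, p_value = ranksums(scores_a, scores_b)
    if p_value < alpha:
        return p_value, "reject the null hypothesis: significant difference"
    return p_value, "cannot conclude a significant difference"

# Synthetic illustration only (random numbers, not the article's results).
rng = np.random.default_rng(0)
scores_hasd = rng.uniform(0.70, 0.95, size=28)
scores_ed = rng.uniform(0.60, 0.90, size=28)
print(compare_distances(scores_hasd, scores_ed))
```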
With noise
These next experiments aim to identify the impact of noisy data on the performance of the KNN classifier regarding accuracy, recall, and precision using different distance measures. Accordingly, nine different levels of noise were added to each data set using Algorithm 2. For simplicity, this set of experiments was conducted using only the top 10 distances given in Table 10, which were obtained on the noise-free data sets.
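Algorithm 2 is defined earlier in the article; as a rough sketch of what such noise injection can look like, the function below replaces a given fraction of the examples with random attribute values, drawing them uniformly within each feature's observed range (the uniform-in-range choice is our assumption, based on the description of the noise model in the limitations at the end of this review).

```python
import numpy as np

def add_noise(X, noise_level, seed=None):
    """Return a copy of X in which a fraction `noise_level` (e.g., 0.1-0.9)
    of the examples have all their attributes replaced by random values
    drawn uniformly within each feature's observed range (an assumption)."""
    rng = np.random.default_rng(seed)
    X_noisy = np.asarray(X, dtype=float).copy()
    n_examples, n_features = X_noisy.shape
    n_noisy = int(round(noise_level * n_examples))
    idx = rng.choice(n_examples, size=n_noisy, replace=False)
    lo, hi = X_noisy.min(axis=0), X_noisy.max(axis=0)
    X_noisy[idx] = rng.uniform(lo, hi, size=(n_noisy, n_features))
    return X_noisy
```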
Figure 4 shows the experimental results of the KNN classifier, clarifying the impact of noise on the accuracy measure using the top 10 distances. The x-axis represents the noise level and the y-axis the classification accuracy. Each column at each noise level represents the overall average accuracy for each distance on all data sets used. Error bars represent the average of the standard deviation values for each distance over all data sets. Figures 5 and 6 show the corresponding recall and precision results of the KNN classifier under the same noise levels and distance measures. As can be seen from Figures 4–6, the performance (measured by
Table 11. The p-values of the Wilcoxon test for the results
of Hassanat distance with each of other top distances
over the data sets used
Distance Accuracy Recall Precision
ED 0.0418 0.0582 0.0446
MD 0.1469 0.1492 0.1446
CanD 0.0017 0.008 0.0034
LD 0.0594 0.121 0.0427
CosD 0.0048 0.0066 0.0064
DicD 0.0901 0.0934 0.0778
ClaD 0.0089 0.0197 0.0129
SCSD 0.0329 0.0735 0.0708
WIAD 0.0183 0.0281 0.0207
CorD 0.0048 0.0066 0.0064
AvgD 0.1357 0.1314 0.102
DivD 0.0084 0.0188 0.017
The p-values that were less than the significance level (0.05) are high-
lighted in boldface.
FIG. 4. The overall average accuracies and standard deviations of the KNN classifier using the top 10 distance measures with different levels of noise. AvgD, Average (L1, L∞) distance; CanD, Canberra distance; ClaD, Clark distance; CorD, Correlation distance; CosD, Cosine distance; DicD, Dice distance; DivD, Divergence distance; LD, Lorentzian distance; MD, Manhattan distance; SCSD, Squared Chi-Squared distance; WIAD, Whittaker's index of association distance. Color images are available online.
FIG. 5. The overall average recalls and standard deviations of KNN classifier using top 10 distance measures
with different levels of noise. Color images are available online.
FIG. 6. The overall average precisions and standard deviations of KNN classifier using top 10 distance
measures with different levels of noise. Color images are available online.
accuracy, recall, and precision, respectively) of the KNN degraded by only ~20% while the noise level reached 90%, and this holds for all the distances used. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, some distances are less affected by the added noise than others. Therefore, we ordered the distances according to their overall average accuracy, recall, and precision results for each level of noise. The distance with the highest performance is ranked in the first position, whereas the distance with the lowest performance is ranked in the last position. Tables 12–14 give this ranking in terms of accuracy, precision, and recall under each noise level, from low (10%) to high (90%). The empty cells occur when more than one distance shares the same rank. The following points summarize the observations in terms of accuracy, precision, and recall values.
According to the average precision results, the highest precision was obtained by HasD, which achieved the first rank at the majority of noise levels: it holds the first rank from the 10% up to the 70% noise level. However, at the 80% level LD outperformed HasD, and at the 90% level MD outperformed HasD.
LD achieved the second rank at noise levels 10%, 30%, 40%, 50%, and 70%, and CanD achieved the second rank at noise levels 20% and 60%. Moreover,
Table 12. Ranking of distances in descending order based on the accuracy results at each noise level
Rank 10% 20% 30% 40% 50% 60% 70% 80% 90%
1 HasD HasD HasD HasD HasD HasD HasD HasD HasD
2 LD CanD LD LD LD CanD CanD LD LD
3 CanD SCSD CanD CanD CanD ClaD LD CanD MD
SCSD
4 ClaD ClaD SCSD SCSD SCSD LD SCSD ClaD CanD
5 DivD LD ClaD ClaD ClaD SCSD ClaD MD AvgD
6 WIAD DivD DivD MD MD DivD CosD SCSD SCSD
CorD
7 MD WIAD MD DivD DivD MD MD AvgD CorD
CosD
CorD
8 CD MD AvgD AvgD AvgD AvgD AvgD CorD CosD
9 CorD AvgD DicD CosD WIAD ED DivD CosD DicD
CorD
10 AvgD DicD ED WIAD CoD DicD DicD DivD WIAD
WIAD
ED
11 DicD ED WIAD DicD CorD CorD WIAD DicD ED
CosD
12 ED CosD — ED DicD WIAD ED — ClaD
13 — CorD — ED — DivD
Table 13. Ranking of distances in descending order based on the recall results at each noise level
Rank 10% 20% 30% 40% 50% 60% 70% 80% 90%
1 HasD HasD HasD HasD HasD HasD HasD LD, HasD MD
2 LD CanD LD LD LD SCSD LD CanD ED
3 SCSD ClaD CanD CanD CanD CanD CanD MD HasD
4 CanD, SCSD SCSD SCSD SCSD ClaD SCSD AvgD LD
5 ClaD LD ClaD ClaD ClaD MD ClaD SCSD CanD
6 MD DivD MD MD MD LD MD ClaD AvgD
7 DivD MD AvgD AvgD AvgD ED AvgD ED WIAD
8 WIOD WIOD DivD DivD DivD DivD DicD DicD SCSD
9 AvgD AvgD ED WIAD DicD AvgD ED WIAD ClaD
10 CosD DicD DicD ED ED WIAD CosD DivD DicD
CorD
DivD
11 DicD ED CosD DicD WIAD DicD WIAD CosD DivD
12 ED CosD, CorD CosD CoD CosD CorD CorD
13 CorD CorD WIOD CorD CorD CorD CosD
this distance achieved the third rank at the remaining noise levels except 50% and 90%. SCSD achieved the fourth rank at noise levels 10%, 30%, 40%, and 70%, and the third rank at the 50% noise level; it tied with LD at the 20% noise level. ClaD achieved the third rank at noise levels 20% and 60%.
The rest of the distances occupied the middle and last ranks in different orders at each level of noise. CosD tied with WIAD at the 80% level and with CorD at the 30% and 70% levels; CosD and CorD performed the worst (lowest precision) at most noise levels.
Based on the results given in Tables 12–14, we observe that the ranking of distances in terms of accuracy, recall, and precision without noise differs from their ranking once the first level of noise (10%) is added, and it varies further as the level of noise increases. This means that the distances are affected by noise. The crucial question, however, is which distance is least affected by noise. From the given results, we conclude that HasD is the least affected one, followed by LD, CanD, and SCSD. The good performance of the KNN achieved by these distances might be attributed to their good characteristics. Table 15 gives these characteristics for the top 10 distances in our analysis. All of these top 10 distances are symmetric, and we further provide input and output ranges and the number of operations.
Precise evaluation of the effects of noise
To justify why some distances are affected more or less by noise, the following toy Examples 1 and 2 are designed. They illustrate the effect of noise on the final decision of the KNN classifier using HasD and the standard ED. In both examples, we assume that we have two training vectors (V1 and V2), each with four attributes, in addition to one test vector (V3). As usual, we calculate the distances between V3 and both V1 and V2 using both ED and HasD.
Example 1. This example shows the KNN classifica-
tion using two different distances on clean data (without
noise). We find the distance to test vector (v3) according
to ED and HasD.
      X1   X2   X3   X4   Class   Dist(·, V3): ED   Dist(·, V3): HasD
V1     3    4    5    3     2            2                0.87
V2     1    3    4    2     1            1                0.33
V3     2    3    4    2     ?
Table 15. Some characteristics of the top 10 distances (n is the number of features)

Distance   Input range   Output range   Symmetric   Metric   No. of operations
HasD       (−∞, +∞)      [0, n]         Yes         Yes      6n (positives), 9n (negatives)
ED         (−∞, +∞)      [0, +∞)        Yes         Yes      1 + 3n
MD         (−∞, +∞)      [0, +∞)        Yes         Yes      3n
CanD       (−∞, +∞)      [0, n]         Yes         Yes      7n
LD         (−∞, +∞)      [0, +∞)        Yes         Yes      5n
CosD       [0, +∞)       [0, 1]         Yes         No       5 + 6n
DicD       [0, +∞)       [0, 1]         Yes         No       4 + 6n
ClaD       [0, +∞)       [0, n]         Yes         Yes      1 + 6n
SCSD       (−∞, +∞)      [0, +∞)        Yes         Yes      6n
WIAD       (−∞, +∞)      [0, 1]         Yes         Yes      1 + 7n
CorD       [0, +∞)       [0, 1]         Yes         No       8 + 10n
AvgD       (−∞, +∞)      [0, +∞)        Yes         Yes      2 + 6n
DivD       (−∞, +∞)      [0, +∞)        Yes         Yes      1 + 6n
Table 14. Ranking of distances in descending order based on the precision results at each noise level
Rank 10% 20% 30% 40% 50% 60% 70% 80% 90%
1 HasD HasD HasD HasD HasD HasD HasD LD MD
2 LD CanD LD LD LD CanD LD HasD HasD
3 CanD ClaD CanD CanD SCSD ClaD CanD CanD ED
4 SCSD LD SCSD SCSD CanD LD SCSD MD LD
SCSD
5 ClaD DivD ClaD ClaD ClaD SCSD ClaD SCSD CanD
6 MD MD MD MD MD MD AvgD ClaD AvgD
WIAD
7 AvgD AvgD AvgD AvgD AvgD DivD MD AvgD WIAD
8 WIAD DicD DivD WIAD ED AvgD CosD ED SCSD
CorD
9 DivD ED ED DivD DivD ED DicD CosD DicD
WIAD
10 DicD CosD DicD ED DicD DicD ED CorD ClaD
11 ED CorD CosD DicD WIAD WIAD DivD DicD DivD
CorD
12 CosD WIAD CosD CorD CorD WIAD DivD CorD
13 CorD CorD CosD CosD CosD
As shown, assuming that we use k=1 (the 1-NN approach), the test vector is assigned to class 1 by both distances. Both results are reasonable, because V3 is almost identical to V2 (class 1) except for the first feature, which differs only by 1.
Example 2. This example uses the same feature vectors as in Example 1, but after corrupting one of the features with added noise. That is, we repeat the previous calculations using noisy data instead of clean data; the first attribute of V2 is corrupted by an added noise of +4 (i.e., X1 = 5).
      X1   X2   X3   X4   Class   Dist(·, V3): ED   Dist(·, V3): HasD
V1     3    4    5    3     2            2                0.87
V2     5    3    4    2     1            3                0.5
V3     2    3    4    2     ?
Based on the minimum distance approach, using the Euclidean distance the test vector is now assigned to class 2 instead of 1, whereas it is still assigned to class 1 using HasD; HasD thus remains accurate in the presence of noise. Although simple, these examples show that ED was affected by noise, which in turn affected the KNN classification ability. Although the performance of the KNN classifier decreases as the noise increases (as shown by the extensive experiments with various data sets), we find that some distances are less affected by noise than others. For example, when using ED, any change in any attribute contributes heavily to the final distance; even if two vectors are similar except for one noisy feature, the distance becomes unpredictable. In contrast, with HasD the per-feature distance between corresponding attribute values is bounded in the range [0, 1]; thus, regardless of the value of the added noise, each feature contributes at most 1 to the final distance, not an amount proportional to the added noise. Therefore, the impact of noise on the final classification is mitigated.
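The two toy examples can be checked numerically with the short sketch below. The Hassanat distance is implemented here from its published definition (each per-feature term is 1 − (1 + min)/(1 + max) when both values are nonnegative, with both values shifted by |min| otherwise), which should be read against the exact equation given in the Distance Measures Review section; the printed values match the tables of Examples 1 and 2.

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def hassanat(x, y):
    """Hassanat distance: each feature contributes at most 1 to the total,
    which bounds the influence of any single (possibly noisy) attribute."""
    d = 0.0
    for a, b in zip(x, y):
        lo, hi = min(a, b), max(a, b)
        if lo >= 0:
            d += 1 - (1 + lo) / (1 + hi)
        else:  # shift both values so the smaller one becomes zero
            d += 1 - (1 + lo + abs(lo)) / (1 + hi + abs(lo))
    return d

v1, v2_clean, v2_noisy, v3 = [3, 4, 5, 3], [1, 3, 4, 2], [5, 3, 4, 2], [2, 3, 4, 2]

# Example 1 (clean data): both distances pick V2 (class 1) as nearest to V3.
print(euclidean(v1, v3), euclidean(v2_clean, v3))   # 2.0, 1.0
print(hassanat(v1, v3), hassanat(v2_clean, v3))     # ~0.87, ~0.33

# Example 2 (X1 of V2 corrupted to 5): ED now prefers V1 (class 2),
# whereas HasD still assigns V3 to class 1 via V2.
print(euclidean(v1, v3), euclidean(v2_noisy, v3))   # 2.0, 3.0
print(hassanat(v1, v3), hassanat(v2_noisy, v3))     # ~0.87, 0.5
```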
It is worth mentioning that the aforementioned experiments used the KNN classifier with K equal to 1, also known as the nearest neighbor classifier. In fact, the choice of distance metric might affect the optimal K as well. A K=1 choice is more sensitive to noise than larger K values, because an unrelated noisy example might be the nearest to a test example. Therefore, a valid action with noisy data would be to choose a larger K, and it would be of interest to see which distance measure handles this aspect best. We remark that choosing the optimal K is out of the scope of this review, and we refer to Hassanat [17] and Alkasassbeh et al. [25]. However, we have repeated the experiments on all data sets using K=3 and K=√n, as done by Hassanat [17], where n is the number of examples in the training data set, with 50% noise. This was done using the top 10 distances. As given in Table 16, the average accuracy of most of the top 10 distances slightly improved compared with the results shown in Figure 4 with noisy data and in Table 10 without noise; this is due to the larger number of neighbors (K) used. There are, however, some exceptions, including WIAD, which seems to be negatively affected by increasing the number of neighbors (K).
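A minimal sketch of the K = √n rule referenced above is given below, assuming n is the number of training examples; rounding K up to an odd value is our own convention to avoid ties in binary problems and is not prescribed by the article.

```python
import math
from sklearn.neighbors import KNeighborsClassifier

def knn_sqrt_n(X_train, y_train, metric="manhattan"):
    """Fit a KNN classifier with K set to the square root of the number
    of training examples (rounded to an odd integer, an assumption here)."""
    k = int(round(math.sqrt(len(X_train))))
    if k % 2 == 0:
        k += 1  # avoid ties in two-class problems
    model = KNeighborsClassifier(n_neighbors=k, metric=metric)
    return model.fit(X_train, y_train)
```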
In the previous experiments, the noisy data were used with the top 10 distances only. However, it would be interesting to see whether any of the other measures handles noise better than these particular top 10 measures. Table 17 gives the average accuracy of all distances over the first 14 data sets, using K=1 with and without noise. As given in Table 17, some of the top 10 distances still rank the highest even in the presence of noise when compared with all other distances; these include HasD, LD, DivD, CanD, ClaD, and SCSD. Interestingly, however, some of the other distances (which ranked low on data without noise) have shown less vulnerability to noise; these include SquD, PSCSD, MSCSD, SCD, MatD, and HeD. According to the extensive experiments conducted for the purposes of this review, and regardless of the type of experiment, the nonconvex distance HasD is in general the best distance to be used with the KNN classifier, with other distances such as LD, DivD, CanD, ClaD, and SCSD performing close to the best.
Table 16. The average accuracy of the top 10 distances over all data sets, using K=3 and K=√n (n is the number of training examples), with and without noise

Distance   With 50% noise: K=3   With 50% noise: K=√n   Without noise: K=3   Without noise: K=√n
HasD 0.6302 0.6314 0.8497 0.833721
LD 0.6283 0.6237 0.8427 0.827561
CanD 0.6247 0.6231 0.8384 0.825171
ClaD 0.6171 0.6136 0.8283 0.8098
SCSD 0.6146 0.6087 0.8332 0.8045
MD 0.6124 0.6089 0.8152 0.79705
DivD 0.6114 0.6049 0.8183 0.792768
CosD 0.6086 0.6021 0.8085 0.791625
CorD 0.6085 0.6020 0.8085 0.786696
AvgD 0.6079 0.6081 0.8119 0.792768
ED 0.6016 0.6011 0.8031 0.781639
DicD 0.5998 0.6016 0.8021 0.778054
WIAD 0.5680 0.5989 0.7935 0.788361
Conclusions and Future Perspectives
In this review, the performance (accuracy, precision,
and recall) of the KNN classifier has been evaluated
using a large number of distance measures, on clean
and noisy data sets, attempting to find the most appro-
priate distance measure that can be used with the
KNN in general. In addition, we tried finding the
most appropriate and robust distance that can be
used in the case of noisy data. A large number of experiments were conducted for the purposes of this review, and the results and analysis of these experiments show the following:
1. The performance of the KNN classifier depends significantly on the distance used, and the results showed large gaps between the performances of different distances. For example, we found that HasD performed the best when applied to most data sets compared with the other tested distances.
2. We obtain similar classification results when we use distances from the same family that have almost the same equation; some distances are very similar, for example, one is twice the other, or one is the square of another. In these cases, and since the KNN compares examples using the same distance, the nearest neighbors will be the same if all distances are multiplied or divided by the same constant.
3. There was no optimal distance metric that can be used for all types of data sets, as the results show that each data set favors a specific distance metric; this result complies with the no-free-lunch theorem.
4. The performance (measured by accuracy, precision, and recall) of the KNN degraded by only ~20% while the noise level reached 90%, and this holds for all the distances used. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree.
5. Some distances are less affected by the added noise than others; for example, we found that HasD performed the best when applied to most data sets under different levels of heavy noise.
Our study has the following limitations, and future work will focus on investigating and addressing them.
1. Although we have tested a large number of dis-
tance measures, there are still other distances
and similarity measures that are available in the
machine learning area that need to be tested
and evaluated for optimal performance with
and without noise.
2. The 28 data sets, although more than in previous studies, still might not be enough to draw significant conclusions in terms of the effectiveness of
Table 17. The average accuracy of all distances over the first
14 data sets, using K=1 with and without noise
Distance With 50% noise Without noise
HasD 0.6331 0.8108
LD 0.6302 0.7975
DivD 0.6284 0.8068
CanD 0.6271 0.8053
ClaD 0.6245 0.7999
SquD 0.6227 0.7971
PSCSD 0.6227 0.7971
SCSD 0.6227 0.7971
MSCSD 0.6223 0.8004
SCD 0.6208 0.7989
MatD 0.6208 0.7989
HeD 0.6208 0.7989
VSDF3 0.6179 0.7891
WIAD 0.6144 0.7877
CorD 0.6119 0.7635
SPeaD 0.6119 0.7635
CosD 0.6118 0.7636
NCSD 0.6108 0.7674
ChoD 0.6102 0.755
JefD 0.6071 0.7772
PCSD 0.6043 0.7813
KJD 0.6038 0.7465
PeaD 0.6034 0.7066
VSDF2 0.6032 0.7473
MD 0.6029 0.7565
NID 0.6029 0.7565
MCD 0.6029 0.7565
SD 0.6024 0.7557
SoD 0.6024 0.7557
MotD 0.6024 0.7557
ASCSD 0.6016 0.7458
VSDF1 0.6012 0.7427
AvgD 0.5995 0.7523
TanD 0.5986 0.7421
JDD 0.5955 0.741
MiSCSD 0.595 0.7596
VWHD 0.5945 0.7428
JacD 0.5936 0.746
DicD 0.5936 0.746
ED 0.5927 0.7429
SED 0.5927 0.7429
AD 0.5927 0.7429
KD 0.5911 0.7154
MCED 0.5889 0.7416
HamD 0.5843 0.7048
CD 0.5742 0.7154
HauD 0.5474 0.621
KDD 0.5295 0.5357
TopD 0.5277 0.6768
JSD 0.5276 0.6768
CSSD 0.5273 0.4895
KLD 0.5089 0.5399
MeeD 0.4958 0.4324
BD 0.4747 0.4908
certain distance measures and, therefore, there is
a need to use a larger number of data sets with
varied data types.
3. The creation of noisy data was done by replacing a certain percentage (in the range 10%–90%) of the examples with completely random values in the attributes. We used this type of noise for its simplicity and the straightforwardness of interpreting the effects of distance measure choice with the KNN classifier. However, this type of noise might not simulate other types of noise that occur in real-world data. It would be an interesting task to try other, more realistic noise types to evaluate the robustness of the distance measures in a similar manner.
4. Only the KNN classifier was reviewed in this study; other variants of KNN, such as those of Hassanat [90–93], need to be investigated.
5. Distance measures are used not only with the KNN but also with other machine learning algorithms, such as different types of clustering; these need to be evaluated under different distance measures as well.
Author Disclosure Statement
No competing financial interests exist.
References
1. Fix E, Hodges Jr JL. Discriminatory analysis-nonparametric discrimination:
Consistency properties. Technical report, University of California, Ber-
keley, 1951.
2. Cover TM, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf
Theory. 1967;13:21–27.
3. Wu X, Kumar V, Quinlan RJ, et al. Top 10 algorithms in data mining. Knowl
Inf Syst. 2008;14:1–37.
4. Bhatia N, Vandana A. Survey of nearest neighbor techniques. Int J Com-
put Sci Inf Secur. 2010;8:302–305.
5. Xu S, Wu Y. An algorithm for remote sensing image classification based
on artificial immune b-cell network. ISPRS Arch. 2008;37:107–112.
6. Manne S, Kotha SK, Fatima SS. Text categorization with K-nearest
neighbor approach. In: Proceedings of the International Conference on
Information Systems Design and Intelligent Applications 2012 (INDIA
2012) held in Visakhapatnam, India, Springer, January 2012. pp. 413–
420.
7. Geng X, Liu T-Y, Qin T, et al. Query dependent ranking using K-nearest
neighbor. In: Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval,
ACM, 2008. pp. 115–122.
8. Bajramovic F, Mattern F, Butko N, et al. A comparison of nearest neighbor
search algorithms for generic object recognition. In: Proceedings of the
International Conference on Advanced Concepts for Intelligent Vision
Systems, Springer, 2006. pp. 1186–1197.
9. Yang Y, Ault T, Pierce T, et al. Improving text categorization methods for
event tracking. In: Proceedings of the 23rd Annual International ACM
SIGIR Conference on Research and Development in Information
Retrieval, ACM, 2000. pp. 65–72.
10. Kataria A, Singh MD. A review of data classification using k-nearest
neighbour algorithm. Int J Emerg Technol Adv Eng. 2013;3:354–360.
11. Wettschereck D, Aha DW, Mohri T. A review and empirical evaluation of
feature weighting methods for a class of lazy learning algorithms. Artif
Intell Rev. 1997;11:273–314.
12. Maillo J, Triguero I, Herrera F. A mapreduce-based K-nearest neighbor
approach for big data classification. In: 2015 IEEE Trustcom/BigDataSE/
ISPA, Volume 2, IEEE, 2015. pp. 167–172.
13. Maillo J, Ramírez S, Triguero I, et al. KNN-IS: An iterative spark-based design of the K-nearest neighbors classifier for big data. Knowl Based Syst. 2017;117:3–15.
14. Deng Z, Zhu X, Cheng D, et al. Efficient KNN classification algorithm for
big data. Neurocomputing. 2016;195:143–148.
15. Gallego A-J, Calvo-Zaragoza J, Valero-Mas JJ, et al. Clustering-based
K-nearest neighbor classification for large-scale data with neural codes
representation. Pattern Recogn. 2018;74:531–543.
16. Wang F, Wang Q, Nie F, et al. Efficient tree classifiers for large scale
datasets. Neurocomputing. 2018;284:70–79.
17. Hassanat AB. Solving the problem of the k parameter in the KNN classifier using an ensemble learning approach. Int J Comput Sci Inf Secur. 2014;12:33–39.
18. Chomboon K, Chujai P, Teerarassamee P, et al. An empirical study of distance
metrics for K-nearest neighbor algorithm. In: Proceedings of the 3rd Inter-
national Conference on Industrial Application Engineering. Kitakyushu,
Japan: The Institute of Industrial Applications Engineers, 2005. pp. 1–6.
19. Mulak P, Talhar N. Analysis of distance measures using K-nearest neigh-
bor algorithm on KDD dataset. Int J Sci Res. 2015;7:2101–2104.
20. Tavallaee M, Bagheri E, Lu W, et al. A detailed analysis of the KDD cup 99
data set. In: 2009 IEEE Symposium on Computational Intelligence for
Security and Defense Applications, IEEE, 2009. pp. 1–6.
21. Hu L-Y, Huang M-W, Ke S-W, et al. The distance function effect on K-nearest neighbor classification for medical datasets. SpringerPlus. 2016;5:1304.
22. Todeschini R, Ballabio D, Consonni V. Distances and other dissimilarity
measures in chemometrics. In: Meyers RA (ed). Encyclopedia of Analytical
Chemistry. Hoboken, NJ: Wiley Online Library, 2015.
23. Todeschini R, Ballabio D, Consonni V, et al. A new concept of higher-order
similarity and the role of distance/similarity measures in local classifi-
cation methods. Chemom Intell Lab Syst. 2016;157:50–57.
24. Lopes N, Ribeiro B. On the impact of distance metrics in instance-based
learning algorithms. In: Iberian Conference on Pattern Recognition and
Image Analysis, Springer, 2015. pp. 48–56.
25. Alkasassbeh M, Altarawneh GA, Hassanat A. On enhancing the perfor-
mance of nearest neighbour classifiers using Hassanat distance metric.
Can J Pure Appl Sci. 2015;9:3291–3298.
26. Hassanat AB. Dimensionality invariant similarity measure. J Am Sci. 2014;
10:221–226.
27. Jirina M. Using singularity exponent in distance based classifier. In: 2010
10th International Conference on Intelligent Systems Design and
Applications, IEEE, 2010. pp. 220–224.
28. Lindi GA. Development of face recognition system for use on the NAO
robot. Master’s thesis, Faculty of Science and Technology, 2016.
29. Jiřina M, Jiřina Jr M. Classifier based on inverted indexes of neighbors. Technical Report No. V-1034, 2008.
30. Abbadi MA, Altarawneh GA, Hassanat AB, et al. Solving the problem of
the k parameter in the KNN classifier using an ensemble learning ap-
proach. Int J Comput Sci Inf Secur. 2014;12:33–39.
31. Hart P. The condensed nearest neighbor rule (CORRESP.). IEEE Trans Inf
Theory. 1968;14:515–516.
32. Gates G. The reduced nearest neighbor rule (CORRESP.). IEEE Trans Inf
Theory. 1972;18:431–433.
33. Alpaydin E. Voting over multiple condensed nearest neighbors. In: Lazy
Learning, Springer, 1997. pp. 115–132.
34. Kubat M, Cooperson, Jr M. Voting nearest-neighbour subclassifiers.
In: The 17th International Conference on Machine Learning (ICML),
San Francisco, CA: Morgan Kaufmann, 2000. pp. 503–510.
35. Wilson DR, Martinez TR. Reduction techniques for instance-based learn-
ing algorithms. Mach Learn. 2000;38:257–286.
36. Arya S, Mount DM. Approximate nearest neighbor queries in fixed di-
mensions. SODA. 1993;93:271–280.
37. Zheng Y, Guo Q, Tung AKH, et al. Lazylsh: Approximate nearest neighbor
search for multiple distance functions with a single index. In: Pro-
ceedings of the 2016 International Conference on Management of
Data, ACM, 2016. pp. 2023–2037.
38. Nettleton DF, Orriols-Puig A, Fornells A. A study of the effect of different
types of noise on the precision of supervised learning techniques. Artif
Intell Rev. 2010;33:275–306.
39. Zhu X, Wu X. Class noise vs. attribute noise: A quantitative study. Artif
Intell Rev. 2004;22:177–210.
26 ABU ALFEILAT ET AL.
Downloaded by 45.52.5.3 from www.liebertpub.com at 08/15/19. For personal use only.
40. García S, Luengo J, Herrera F. Data preprocessing in data mining. Cham, Switzerland: Springer International Publishing, 2015.
41. Sáez JA, Galar M, Luengo J, et al. Tackling the problem of classification with noisy data using multiple classifier systems: Analysis of the performance and robustness. Inf Sci. 2013;247:1–20.
42. Heath TL. The thirteen books of Euclid’s Elements. North Chelmsford,
MA: Courier Corporation, 1956.
43. Cha S-H. Comprehensive survey on distance/similarity measures between
probability density functions. City 2007;1:1.
44. Deza MM, Deza E. Encyclopedia of distances. In: Encyclopedia of dis-
tances, Springer, 2009. pp. 1–583.
45. Grabusts P. The choice of metrics for clustering algorithms. In:
Proceedings of the 8th International Scientific and Practical Conference,
Volume 2. Tomsk, Russia: Tomsk Polytechnic University, 2011.
pp. 70–76.
46. Premaratne P. Human computer interaction using hand gestures.
Singapore: Springer Science and Business Media, 2014.
47. Verma JP. Data analysis in management with SPSS software. New Delhi,
India: Springer Science and Business Media, 2012.
48. Lance GN, Williams WT. Computer programs for hierarchical polythetic
classification (‘‘similarity analyses’’). Comput J. 1966;9:60–64.
49. Lance GN, Williams WT. Mixed-data classificatory programs I—agglom-
erative systems. Aust Comput J. 1967;1:15–20.
50. Akila A, Chandra E. Slope finder—A distance measure for DTW based iso-
lated word speech recognition. Int J Eng Comput Sci. 2013;2:3411–3417.
51. Sorensen TA. A method of establishing groups of equal amplitude in plant
sociology based on similarity of species content and its application to
analyses of the vegetation on danish commons. Biol Skar. 1948;5:1–34.
52. Szmidt E. Distances and similarities in intuitionistic fuzzy sets (Studies in
Fuzziness and Soft Computing book series). Cham, Switzerland: Springer,
2014.
53. Ngom A, Chetty M, Ahmad S. Pattern recognition in bioinformatics. Berlin, Heidelberg: Springer, 2008.
54. Zhou T, Chan KCC, Wang Z. Topevm: Using co-occurrence and topology
patterns of enzymes in metabolic networks to construct phylogenetic
trees. In: IAPR International Conference on Pattern Recognition in Bio-
informatics, Springer, 2008. pp. 225–236.
55. Willett P, Barnard JM, Downs GM. Chemical similarity searching. J Chem
Inf Comput Sci. 1998;38:983–996.
56. Jaccard P. Comparative study of the floral distribution in a portion of the
Alps and Jura [in French] Bull Soc Vaudoise Sci Nat. 1901;37:547–579.
57. Cesare S, Xiang Y. Software similarity and classification. London, England:
Springer Science and Business Media, 2012.
58. Dice LR. Measures of the amount of ecologic association between spe-
cies. Ecology. 1945;26:297–302.
59. Gan G, Ma C, Wu J. Data clustering: Theory, algorithms, and applications
(ASA-SIAM Series on Statistics and Applied Probability), volume 20.
Philadelphia, PA: Society for Industrial and Applied Mathematics, 2007.
60. Orloci L. An agglomerative method for classification of plant communi-
ties. J Ecol. 1967;55:193–206.
61. Legendre P, Legendre LFJ. Numerical ecology, volume 24. Amsterdam,
Netherlands: Elsevier, 2012.
62. Shirkhorshidi AS, Aghabozorgi S, Wah TY. A comparison study on simi-
larity and dissimilarity measures in clustering continuous data. PLoS
One. 2015;10:e0144059.
63. Bhattacharyya A. On a measure of divergence between two statistical
populations defined by their probability distributions. Bull Calcutta
Math Soc. 1943;35:99–109.
64. Abbad A, Tairi H. Combining Jaccard and Mahalanobis cosine distance
to enhance the face recognition rate. Int J Eng Comput Sci. 2016;16:
171–178.
65. Hellinger E. New justification of the theory of quadratic forms of infinite
variables [in German]. J Reine Angew Math. 1909;136:210–271.
66. Clark PJ. An extension of the coefficient of divergence for use with mul-
tiple characters. Copeia. 1952;1952:61–64.
67. Neyman J. Contributions to the theory of the χ² test. In: Proceedings of the First Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1949.
68. Pearson K. X. on the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it
can be reasonably supposed to have arisen from random sampling.
London Edin Dubl Phil Mag J Sci. 1900;50:157–175.
69. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE
Mobile Comput Commun Rev. 2001;5:3–55.
70. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat.
1951;22:79–86.
71. Pinto D, Benedí J-M, Rosso P. Clustering narrow-domain short texts by using the Kullback-Leibler distance. In: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2007. pp. 611–622.
72. Jeffreys H. An invariant form for the prior probability in estimation
problems. Proc R Soc London Ser A Math Phys Sci. 1946;186:
453–461.
73. Topsoe F. Some inequalities for information divergence and related
measures of discrimination. IEEE Trans Inf Theory. 2000;46:1602–
1609.
74. Sibson R. Information radius [in German]. Prob Theory Rel. 1969;14:149–160.
75. Hatzigiorgaki M, Skodras AN. Compressed domain image retrieval: A
comparative study of similarity metrics. In: Visual Communications and
Image Processing 2003, volume 5150, International Society for Optics
and Photonics, 2003. pp. 439–448.
76. Patel BV, Meshram BB. Content based video retrieval systems. arXiv
preprint arXiv:1205.1641, 2012.
77. Giusti R, Batista GEAPA. An empirical comparison of dissimilarity mea-
sures for time series classification. In: 2013 Brazilian Conference on
Intelligent Systems, IEEE, 2013. pp. 82–88.
78. Macklem M. Multidimensional modelling of image fidelity measures.
Centre County, PA: Citeseer, 2002.
79. Bharkad SD, Kokare M. Performance evaluation of distance metrics:
Application to fingerprint recognition. Int J Pattern Recogn. 2011;25:
777–806.
80. Hedges TS. An empirical modification to linear wave theory. Proc Inst Civ
Eng. 1976;61:575–579.
81. Taneja IJ. New developments in generalized information measures. Adv
Imag Elect Phys. 1995;91:37–135.
82. Fulekar MH. Bioinformatics: Applications in life and environmental sci-
ences. Dordrecht, Netherlands: Springer Science and Business Media,
2009.
83. Hamming RW. Error detecting and error correcting codes. Bell Syst Tech J.
1950;29:147–160.
84. Kadir A, Nugroho LE, Susanto A, et al. Experiments of distance measure-
ments in a foliage plant retrieval system. Int J Signal Process Image
Process Pattern Recogn. 2012;5:1–14.
85. Rubner Y, Tomasi C. Perceptual metrics for image database navigation,
volume 594. New York, NY: Springer Science and Business Media, 2013.
86. Whittaker RH. A study of summer foliage insect communities in the great
smoky mountains. Ecol Monogr. 1952;22:1–44.
87. UC Irvine. 2016. UC Irvine machine learning repository. Available online at
http://archive.ics.uci.edu/ml (last accessed July 10, 2019).
88. Wilcoxon F. Individual comparisons by ranking methods. In: Kotz S,
Johnson NL (eds): Breakthroughs in statistics. New York, NY: Springer,
1992. pp. 196–202.
89. Derrac J, García S, Molina D, et al. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol Comput. 2011;1:3–18.
90. Hassanat ABA. Two-point-based binary search trees for accelerating big
data classification using KNN. PLoS One. 2018;13:e0207772.
91. Hassanat A. Furthest-pair-based decision trees: Experimental results on
big data classification. Information. 2018;9:284.
92. Hassanat ABA. Furthest-pair-based binary search tree for speeding
big data classification using K-nearest neighbors. Big Data. 2018;6:
225–235.
93. Hassanat A. Norm-based binary search trees for speeding up KNN big
data classification. Computers. 2018;7:54.
Cite this article as: Abu Alfeilat HA, Hassanat ABA, Lasassmeh O,
Tarawneh AS, Alhasanat MB, Eyal Salman HS, Prasath VBS (2019)
Effects of distance measure choice on K-nearest neighbor classifier
performance: A review. Big Data 3:X, 1–28, DOI: 10.1089/
big.2018.0175.
Abbreviations Used
1-NN = 1-nearest neighbor
AD = Average distance
ASCSD = Additive Symmetric χ² distance
AvgD = Average (L1, L∞) distance
BCW = Breast Cancer Wisconsin
BD = Bhattacharyya distance
CanD = Canberra distance
CD = Chebyshev distance
ChoD = Chord distance
ClaD = Clark distance
CorD = Correlation distance
CosD = Cosine distance
CSSD = χ² statistic distance
DicD = Dice distance
DivD = Divergence distance
ED = Euclidean distance
FN = false negative
FP = false positive
HamD = Hamming distance
HasD = Hassanat distance
HauD = Hausdorff distance
HeD = Hellinger distance
JacD = Jaccard distance
JDD = Jensen difference distance
JefD = Jeffreys distance
JSD = Jensen–Shannon distance
KD = Kulczynski distance
KDD = K divergence distance
KJD = Kumar–Johnson distance
KLD = Kullback–Leibler distance
KNN = K-nearest neighbor
LD = Lorentzian distance
MatD = Matusita distance
MCD = Mean Character distance
MCED = Mean Censored Euclidean distance
MD = Manhattan distance
MeeD = Meehl distance
MiSCSD = Min Symmetric χ² distance
MotD = Motyka distance
MSCSD = Max Symmetric χ² distance
NCSD = Neyman χ² distance
NID = Non Intersection distance
PCSD = Pearson χ² distance
PeaD = Pearson distance
PSCSD = Probabilistic Symmetric χ² distance
QSAR = quantitative structure activity relationships
SCD = Squared chord distance
SCSD = Squared Chi-Squared distance
SD = Sorensen distance
SED = Squared Euclidean distance
SoD = Soergel distance
SPeaD = Squared Pearson distance
SquD = Squared χ² distance
TanD = Taneja distance
TN = true negative
TopD = Topsoe distance
TP = true positive
UCI = University of California, Irvine
VSD = Vicis Symmetric distance
VWHD = Vicis-Wave Hedges distance
WIAD = Whittaker's index of association distance
... It is observed from the above discussions that the iris recognition system at a distance still suffers from various real-time challenges. The work is motivated by the performances of various distance metrics in [33,34] and the study of LBP [7,8,10], and it is strengthened against the varying levels of illumination that appear in iris images. ...
... There are several measurement techniques to evaluate the performance of a classifier. The most widely used statistical measures are average precision, recall, and F1-measure computation [34]. Specifically, accuracy indicates the overall performance of a classifier and is computed as the ratio of the number of accurately classified images to the total number of test images. ...
Article
Full-text available
Nowadays, iris recognition has become a promising biometric for human identification and authentication. In this case, feature extraction from near-infrared (NIR) iris images under less-constrained environments makes it rather challenging to identify an individual accurately. This paper extends a texture descriptor to represent the local spatial patterns. The iris texture is first divided into several blocks, from which the shape and appearance of intrinsic iris patterns are extracted with the help of block-based Local Binary Patterns (LBPb). The concepts of uniform and rotation-invariant patterns are employed to reduce the length of the feature space. Additionally, the simplicity of the image descriptor allows for very fast feature extraction. The recognition is performed using a supervised machine learning classifier with various distance metrics in the extracted feature space as a dissimilarity measure. The proposed approach effectively deals with lighting variations, blurred focus, misaligned images, and elastic deformation of iris textures. Extensive experiments are conducted on the largest and most publicly accessible CASIA-v4 distance image database. Some statistical measures are computed as performance indicators for the validation of classification outcomes. The area under the Receiver Operating Characteristic (ROC) curve is illustrated to compare the diagnostic ability of the classifier for the LBP and its extensions. The experimental results suggest that LBPb is more effective than the other rotation-invariant and uniform rotation-invariant local binary patterns for distant iris recognition. The Bray–Curtis distance metric provides the highest accuracy compared with other distance metrics and competitive methods.
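The block-based LBP pipeline described above can be sketched as follows; the grid size, LBP radius and points, the neighbor count, and the use of scikit-image and scikit-learn are illustrative assumptions rather than the paper's exact settings. The Bray–Curtis metric is passed to the KNN classifier as the dissimilarity measure, echoing the abstract's best-performing choice.

```python
# Sketch: block-based uniform-LBP histograms fed to a KNN classifier
# with the Bray-Curtis dissimilarity (illustrative parameters).
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.neighbors import KNeighborsClassifier

def lbp_block_histogram(image, blocks=4, P=8, R=1.0):
    """Concatenate uniform-LBP histograms computed over a blocks x blocks grid."""
    codes = local_binary_pattern(image, P, R, method="uniform")
    n_bins = P + 2                      # uniform patterns plus the "non-uniform" bin
    h, w = codes.shape
    feats = []
    for i in range(blocks):
        for j in range(blocks):
            patch = codes[i * h // blocks:(i + 1) * h // blocks,
                          j * w // blocks:(j + 1) * w // blocks]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)

# X_img: list of 2-D grayscale iris images, y: subject labels (placeholders).
# X = np.array([lbp_block_histogram(img) for img in X_img])
# knn = KNeighborsClassifier(n_neighbors=1, metric="braycurtis").fit(X, y)
```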
... KNN uses a set of distance functions such as the Euclidean distance (ED), Mahalanobis distance, Manhattan distance (MD) [118], Earth Mover's distance, Chebyshev distance, and Canberra distance. We set K to 3, and we used the MD measure in this work. ...
Article
Full-text available
High-dimensional datasets often harbor redundant, irrelevant, and noisy features that detrimentally impact classification algorithm performance. Feature selection (FS) aims to mitigate this issue by identifying and retaining only the most pertinent features, thus reducing dataset dimensions. In this study, we propose an FS approach based on black hole algorithms (BHOs) augmented with a mutation technique, termed MBHO. BHO typically comprises two primary phases. During the exploration phase, a set of stars is iteratively modified based on existing solutions, with the best star selected as the "black hole". In the second phase, stars nearing the event horizon are replaced, preventing the algorithm from being trapped in local optima. To address the potential randomness-induced challenges, we introduce inversion mutation. Moreover, we enhance a widely used objective function for wrapper feature selection by integrating two new terms based on the correlation among selected features and between features and classification labels. Additionally, we employ a transfer function, the V2 transfer function, to convert continuous values into discrete ones, thereby enhancing the search process. Our approach undergoes rigorous evaluation experiments using fourteen benchmark datasets, and it is compared against Binary Cuckoo Search (BCS), Mutual Information Maximization (MIM), Joint Mutual Information (JMI), and minimum Redundancy Maximum Relevance (mRMR) approaches. The results demonstrate the efficacy of our proposed model in selecting superior features that enhance classifier performance metrics. Thus, MBHO is presented as a viable alternative to the existing state-of-the-art approaches. We make our implementation source code available for community use and further development.
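A minimal sketch of the wrapper evaluation implied above, assuming the V2 transfer function is the V-shaped |tanh(x)| (a common convention, not confirmed by the abstract) and scoring candidate feature subsets with the 3-NN/Manhattan configuration mentioned in the excerpt; the cross-validation setup is a placeholder.

```python
# Sketch: a continuous star position is binarized with a V-shaped transfer
# function and the selected features are scored with a 3-NN classifier
# using the Manhattan distance.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def v2_transfer(position, rng):
    """Map a continuous position vector to a binary feature mask."""
    prob = np.abs(np.tanh(position))           # assumed form of the V2 transfer function
    return rng.random(position.shape) < prob

def fitness(position, X, y, rng):
    mask = v2_transfer(position, rng)
    if not mask.any():                          # guard against empty feature subsets
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
    return cross_val_score(knn, X[:, mask], y, cv=5).mean()

# rng = np.random.default_rng(0)
# score = fitness(rng.standard_normal(X.shape[1]), X, y, rng)
```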
... The KNN classifier is a widely used, straightforward, and efficient classification algorithm, often serving as a benchmark for performance evaluation [32][33][34][35]. This algorithm operates based on a distance metric d and a positive integer k. ...
Article
Full-text available
In fault diagnosis, it is crucial to address the combined challenges of imbalanced sample sizes and unlabeled data. Traditional methods often generate pseudo-samples or pseudo-labels, which can lead to inaccurate diagnostic outcomes if they are not representative of the original data. To address these challenges, this paper proposes an innovative fault diagnosis method based on Bayesian graph balanced learning (BGBL). First, a balancing strategy was developed to tackle sample imbalance by assigning and optimizing weights for samples in imbalanced categories. Graph theory techniques were then used on unlabeled data to establish and update category beliefs. Following this, posterior estimates of the samples were derived within the Bayesian neural network framework, leading to the training of a fault diagnosis model. Finally, fault diagnosis was conducted using this trained model. Three sets of experiments were conducted on the planetary gearbox fault dataset. The results showed that the proposed BGBL method significantly improved the accuracy of fault diagnosis. Specifically, under conditions of imbalanced data and missing labels, the BGBL method increased accuracy by over 26% compared with existing methods, demonstrating its effectiveness in these challenging scenarios.
... In Table 3, we show the impact of different methods, concluding with no outliers removed. The KNN algorithm is sensitive to outliers, as outliers can influence the nearest neighbour search, but this sensitivity is somewhat reduced by MinMax scaling and by using the Manhattan distance metric [53], [54]. This setup, combined with the balanced accuracy metric, which is sensitive to imbalances in the dataset, ensures that the model is particularly attuned to detecting the critical minority class of weak rocks without being misled by their relative rarity or outlier-like status. ...
Preprint
Full-text available
Current rock engineering design in drill and blast tunnelling primarily relies on engineers' observational assessments. Measure While Drilling (MWD) data, a high-resolution sensor dataset collected during tunnel excavation, is underutilised, mainly serving for geological visualisation. This study aims to automate the translation of MWD data into actionable metrics for rock engineering. It seeks to link data to specific engineering actions, thus providing critical decision support for geological challenges ahead of the tunnel face. Leveraging a large and geologically diverse dataset of ~500,000 drillholes from 15 tunnels, the research introduces models for accurate rock mass quality classification in a real-world tunnelling context. Both conventional machine learning and image-based deep learning are explored to classify MWD data into Q-classes and Q-values (examples of metrics describing the stability of the rock mass) using both tabular and image data. The results indicate that the K-nearest neighbours algorithm, in an ensemble with tree-based models using tabular data, effectively classifies rock mass quality. It achieves a cross-validated balanced accuracy of 0.86 in classifying rock mass into the Q-classes A, B, C, D, E1, E2, and 0.95 for a binary classification of E versus the rest. Classification using a CNN with MWD images for each blasting round resulted in a balanced accuracy of 0.82 for binary classification. Regressing the Q-value from tabular MWD data achieved cross-validated R2 and MSE scores of 0.80 and 0.18 for an ensemble model similar to the one used for classification. High performance in regression and classification boosts confidence in automated rock mass assessment. Applying advanced modelling on a unique dataset demonstrates MWD data's value in improving rock mass classification accuracy and advancing data-driven rock engineering design, reducing manual intervention.
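A minimal sketch of the scaling/metric combination mentioned in the excerpt, assuming a scikit-learn pipeline; the dataset names, the neighbor count, and the cross-validation settings are placeholders rather than the study's configuration.

```python
# Sketch: MinMax scaling with a Manhattan-distance KNN, scored by balanced accuracy.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(metric="manhattan"))
# X_mwd, y_qclass are placeholders for the tabular features and rock mass labels.
# scores = cross_val_score(model, X_mwd, y_qclass, cv=5, scoring="balanced_accuracy")
# print(scores.mean())
```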
... The first two of these results align with the findings of [1], who examined the impact of distance measures on KNN classifier performance, suggesting that the choice of distance measure affects a variety of machine learning algorithms. A natural extension of our study is integrating our generalized change detection algorithm into continual/lifelong learning anomaly detection frameworks. ...
Chapter
Detecting relevant change points in time-series data is a necessary task in various applications. Change point detection methods are effective techniques for discovering abrupt changes in data streams. Although prior work has explored the effectiveness of different algorithms on real-world data, little has been done to explore the impact of different distance measures on change detection performance. In this paper, we modify the architecture of a change point detection workflow to assess the impact of distance measure choices on change detection accuracy and efficiency in continual learning scenarios, where the goal is detecting transitions between tasks or concepts. An experimental evaluation of 41 distance measures across several benchmark datasets demonstrated that change detection accuracy depends on the distance measure selected. Furthermore, our analysis showed performance patterns for distance measures in the same family.
... Although it is not common practice to consider the best regressor to be the one that best fits the line with slope = 1 in the actual vs. predicted plot, we used this approach for visualization purposes only, as demonstrated in Figures 4, 5, 6, 7, 8, and 9. ...
Preprint
Full-text available
Regression, a supervised machine learning approach, establishes relationships between independent variables and a continuous dependent variable. It is widely applied in areas like price prediction and time series forecasting. The performance of regression models is typically assessed using error metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). However, these metrics present challenges including sensitivity to outliers (notably MSE and RMSE) and scale dependency, which complicates comparisons across different models. Additionally, traditional metrics sometimes yield values that are difficult to interpret across various problems. Consequently, there is a need for a metric that consistently reflects regression model performance, independent of the problem domain, data scale, and outlier presence. To overcome these shortcomings, this paper introduces a new regression model accuracy measure based on the Hassanat distance metric. This measure is not only invariant to outliers but also easy to interpret, as it provides an accuracy-like value that ranges from 0 to 1 (or 0-100%). We validate the proposed metric against traditional measures across multiple benchmarks, demonstrating its robustness under various model scenarios and data types. Hence, we suggest it as a new standard for assessing regression models' accuracy.
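For reference, the distance underlying this preprint is the Hassanat distance evaluated throughout this review (HasD in the abbreviations list); each coordinate contributes a value in [0, 1), which is what bounds the influence of outliers. The sketch below implements only that distance, not the preprint's accuracy measure.

```python
# Per-dimension Hassanat distance: each term lies in [0, 1), so no single
# feature (or outlying value) can dominate the total distance.
import numpy as np

def hassanat_distance(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    d = np.empty_like(lo)
    pos = lo >= 0
    d[pos] = 1.0 - (1.0 + lo[pos]) / (1.0 + hi[pos])
    # When a coordinate pair contains a negative value, both terms are shifted
    # by |min| so the ratio stays within (0, 1].
    shift = np.abs(lo[~pos])
    d[~pos] = 1.0 - (1.0 + lo[~pos] + shift) / (1.0 + hi[~pos] + shift)
    return d.sum()

# Example: the large second coordinate contributes at most 1 to the distance.
# hassanat_distance([1, 2, 3], [1, 200, -3])
```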
Article
Full-text available
Before building machine learning models, the dataset should be prepared so that it is of high quality; we should give the model the best possible representation of the data. Different attributes may have different scales, which can increase the difficulty of the problem being modeled. A model built from features with widely varying scales may suffer from poor performance during learning. Our study explores the use of numerical data scaling as a data pre-processing step, with the purpose of determining how effectively these methods can improve the accuracy of learning algorithms. In particular, three numerical data scaling methods were compared, each with four machine learning classifiers, to predict disease severity. The experiments were built on Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) datasets, which included 1206 patients admitted between June 2020 and April 2021. The diagnosis of all cases was confirmed with RT-PCR. Basic demographic data and medical characteristics of all participants were collected. The reported results indicate that all techniques perform well with numerical data scaling and that there is significant improvement in the models for unseen data. Lastly, we conclude that classifier performance increases when scaling techniques are used; these methods help the algorithms better learn the patterns in the dataset, which helps in building accurate models.
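A minimal sketch of the kind of comparison the abstract describes, assuming three common scikit-learn scalers and a single placeholder classifier; the study's actual scalers, classifiers, and dataset are not reproduced here.

```python
# Sketch: the same classifier evaluated under different numerical scaling methods.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

scalers = {"minmax": MinMaxScaler(), "zscore": StandardScaler(), "robust": RobustScaler()}
# for name, scaler in scalers.items():
#     pipe = make_pipeline(scaler, KNeighborsClassifier())
#     print(name, cross_val_score(pipe, X, y, cv=5).mean())
```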
Article
Introduction: According to the World Health Organization, lung diseases are the third leading cause of death in the world. These diseases are chronic, so their early diagnosis is very important. Pulmonary function tests are important tools for examining and monitoring patients with respiratory injuries. This research aimed to optimize the K-Nearest Neighbor algorithm to facilitate and accelerate self-assessment and interpretation of spirometry test results with higher accuracy. Method: In this study, a method is proposed that addresses the limitations of the basic algorithm through optimization, feature weighting, and weighted voting. Using this method, obstructive pulmonary diseases are detected from a dataset of spirometry tests and general parameters, and cases are classified into three categories, namely asthma, chronic bronchitis, and emphysema. Results: The Minkowski method was chosen as the appropriate way to calculate the distance between data points, and applying coefficients to the feature values increased the classification accuracy. Weighted voting based on a Gaussian kernel was applied in the final part of the algorithm, which yielded consistent performance as the number-of-neighbors parameter was varied. The evaluations were carried out using cross-validation; 95.4% accuracy and 93.2% precision were obtained in 3.12 seconds. Conclusion: The use of machine learning algorithms can be effective in the analysis of medical data. In this study, these approaches were used to provide a new classification method; the proposed algorithm improves on the basic method and achieves better accuracy and performance than previous methods.
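A minimal sketch of Gaussian-kernel weighted voting on top of a Minkowski-distance KNN, as outlined above; sigma, the neighbor count, and p are illustrative values, and the kernel form exp(-d²/(2σ²)) is an assumption about the paper's weighting scheme.

```python
# Sketch: distance-weighted KNN voting where closer neighbors get
# exponentially larger votes via a Gaussian kernel.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

sigma = 1.0  # illustrative kernel bandwidth

def gaussian_weights(distances):
    """Turn neighbor distances into vote weights (same shape in, same shape out)."""
    return np.exp(-(distances ** 2) / (2.0 * sigma ** 2))

knn = KNeighborsClassifier(n_neighbors=7, metric="minkowski", p=2,
                           weights=gaussian_weights)
# knn.fit(X_spiro, y_disease); knn.predict(X_new)   # placeholder data names
```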
Article
Background Machine learning (ML) is increasingly used to predict the prognosis of numerous diseases. This retrospective analysis aimed to develop a prediction model using ML algorithms and to identify predictors associated with the recurrence of hallux valgus (HV) following surgery. Methods A total of 198 symptomatic feet that underwent chevron osteotomy combined with a distal soft tissue procedure were enrolled and analyzed from 2 independent medical centers. The feet were grouped according to nonrecurrence or recurrence based on 1-year follow-up outcomes. Preoperative weightbearing radiographs and immediate postoperative nonweightbearing radiographs were obtained for each HV foot. Radiographic measurements (eg, HV angle and intermetatarsal angle) were acquired and used for ML model training. A total of 9 commonly used ML models were trained on the data obtained from one institute (108 feet), and tested on the other data set from another independent institute (90 feet) for external validation. Optimal feature sets for each model were identified based on a 2000-resample bootstrap-based internal validation via an exhaustive search. The performance of each model was then tested on the external validation set. The area under the curve (AUC), classification accuracy, sensitivity, and specificity of each model were calculated to evaluate the performance of each model. Results The support vector machine (SVM) model showed the highest predictive accuracy compared to other methods, with an AUC of 0.88 and an accuracy of 75.6%. Preoperative hallux valgus angle, tibial sesamoid position, postoperative intermetatarsal angle, and postoperative tibial sesamoid position were identified as the most selected features by several ML models. Conclusion ML classifiers such as SVM could predict the recurrence of HV (an HVA >20 degrees) at a 1-year follow-up while identifying associated predictors in a multivariate manner. This study holds the potential for foot and ankle surgeons to effectively identify individuals at higher risk of HV recurrence postsurgery.
Article
Full-text available
To classify data, whether in the field of neural networks or in any application of biometrics such as handwriting classification or iris detection, possibly the most straightforward classifier in the stockpile of machine learning techniques is the Nearest Neighbor Classifier, in which classification is achieved by identifying the nearest neighbors to a query example and using those neighbors to determine the class of the query. K-NN classification classifies instances based on their similarity to instances in the training data. This paper presents the outputs obtained with various distances used in the algorithm, which may help in understanding the response of the classifier for the desired application; it also discusses computational issues in identifying nearest neighbors and mechanisms for reducing the dimension of the data.
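A toy illustration of this point, and of the review's central question: the classifier's response can change with the distance used. For the query below, the Minkowski-family measures pick training index 0 as the nearest neighbor, while the Canberra distance picks index 1.

```python
# Sketch: the nearest neighbor (and hence the 1-NN prediction) depends on the metric.
import numpy as np
from scipy.spatial import distance

query = np.array([0.1, 100.0])
train = np.array([[0.2, 100.0],    # index 0: close in the large-scale feature
                  [0.1, 140.0]])   # index 1: close in the small-scale feature
for metric in (distance.euclidean, distance.cityblock,
               distance.chebyshev, distance.canberra):
    d = [metric(query, x) for x in train]
    print(f"{metric.__name__:10s} nearest index: {int(np.argmin(d))}")
```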
Article
Full-text available
Big data classification is very slow when using traditional machine learning classifiers, particularly when using a lazy and slow-by-nature classifier such as the k-nearest neighbors algorithm (KNN). This paper proposes a new approach which is based on sorting the feature vectors of training data in a binary search tree to accelerate big data classification using the KNN approach. This is done using two methods, both of which utilize two local points to sort the examples based on their similarity to these local points. The first method chooses the local points based on their similarity to the global extreme points, while the second method chooses the local points randomly. The results of various experiments conducted on different big datasets show reasonable accuracy rates compared to state-of-the-art methods and the KNN classifier itself. More importantly, they show the high classification speed of both methods. This strong trait can be used to further improve the accuracy of the proposed methods.
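A minimal sketch of the two-local-points idea described above, using the random pivot strategy (the second of the two methods mentioned); the leaf size and the use of Euclidean distance to the pivots are assumptions made for illustration.

```python
# Sketch: training examples are split recursively by which of two pivot
# examples they are closer to; a query descends the tree and exact KNN
# runs only on the examples stored in the reached leaf.
import numpy as np

def build(X, y, rng, leaf_size=32):
    if len(X) <= leaf_size:
        return ("leaf", X, y)
    i, j = rng.choice(len(X), size=2, replace=False)   # two random local points
    p1, p2 = X[i], X[j]
    left = np.linalg.norm(X - p1, axis=1) <= np.linalg.norm(X - p2, axis=1)
    if left.all() or (~left).all():                    # degenerate split -> stop
        return ("leaf", X, y)
    return ("node", p1, p2,
            build(X[left], y[left], rng, leaf_size),
            build(X[~left], y[~left], rng, leaf_size))

def classify(tree, q, k=3):
    while tree[0] == "node":
        _, p1, p2, lt, rt = tree
        tree = lt if np.linalg.norm(q - p1) <= np.linalg.norm(q - p2) else rt
    _, Xl, yl = tree
    nn = np.argsort(np.linalg.norm(Xl - q, axis=1))[:k]  # exact KNN inside the leaf
    vals, counts = np.unique(yl[nn], return_counts=True)
    return vals[np.argmax(counts)]

# tree = build(X_train, y_train, np.random.default_rng(0))
# classify(tree, X_test[0])
```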
Article
Full-text available
Big Data classification has recently received a great deal of attention due to the main properties of Big Data, which are volume, variety, and velocity. The furthest-pair-based binary search tree (FPBST) shows great potential for Big Data classification. This work attempts to improve the performance of the FPBST in terms of computation time, space consumed, and accuracy. The major enhancement of the FPBST involves converting the resultant BST to a decision tree, in order to remove the need for the slow K-nearest neighbors (KNN) classifier and to obtain a smaller tree, which is useful for memory usage, for speeding up both the training and testing phases, and for increasing the classification accuracy. The proposed decision trees are based on calculating the probabilities of each class at each node using various methods; these probabilities are then used in the testing phase to classify an unseen example. The experimental results on some (small, intermediate, and big) machine learning datasets show the efficiency of the proposed methods in terms of space, speed, and accuracy compared with the FPBST, and show great potential for further enhancements of the proposed methods to be used in practice.
Article
Full-text available
Due to their large sizes and/or dimensions, the classification of Big Data is a challenging task for traditional machine learning, particularly if it is carried out using the well-known K-nearest neighbors (KNN) classifier, which is a slow and lazy classifier by its nature. In this paper, we propose a new approach to Big Data classification using the KNN classifier, which is based on inserting the training examples into a binary search tree to be used later for speeding up the search for test examples. For this purpose, we used two methods to sort the training examples. The first calculates the minimum/maximum scaled norm and rounds it to 0 or 1 for each example. Examples with 0-norms are sorted into the left child of a node, and those with 1-norms are sorted into the right child of the same node; this process continues recursively until we obtain one example, or a small number of examples with the same norm, in a leaf node. The second proposed method inserts each example into the binary search tree based on its similarity to the examples with the minimum and maximum Euclidean norms. The experimental results of classifying several machine learning big datasets show that both methods are much faster than most of the state-of-the-art methods compared, with competing accuracy rates obtained by the second method, which shows great potential for further enhancements of both methods to be used in practice.
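A minimal sketch of the first (norm-rounding) method described above; the leaf size and stopping rules are assumptions, and query routing is only indicated in a comment.

```python
# Sketch: each example's Euclidean norm is min/max scaled within the current
# node and rounded to 0 or 1, which routes it to the left or right child.
import numpy as np

def build_norm_tree(X, y, leaf_size=32):
    norms = np.linalg.norm(X, axis=1)
    lo, hi = norms.min(), norms.max()
    if len(X) <= leaf_size or hi == lo:
        return ("leaf", X, y)
    right = np.round((norms - lo) / (hi - lo)).astype(bool)   # scaled norm -> 0/1
    if right.all() or (~right).all():                          # unsplittable node
        return ("leaf", X, y)
    return ("node", lo, hi,
            build_norm_tree(X[~right], y[~right], leaf_size),
            build_norm_tree(X[right], y[right], leaf_size))

# A query is routed the same way (scale its norm with the node's lo/hi, round to
# 0 or 1), and KNN is applied to the examples stored in the leaf it reaches.
```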
Article
Full-text available
The advances in information technology, in both hardware and software, have allowed big data to emerge recently; classification of such data is extremely slow, particularly when using the K-nearest neighbors (KNN) classifier. In this article, we propose a new approach that creates a binary search tree (BST) to be used later by the KNN to speed up big data classification. This approach is based on finding the furthest pair of points (the diameter) in a data set and then using this pair of points to sort the examples of the training data set into a BST. At each node of the BST, the furthest pair is found and the examples located at that particular node are further sorted based on their distances to these local furthest points. The created BST is then searched for a test example down to a leaf; the examples found in that particular leaf are used to classify the test example using the KNN classifier. The experimental results on some well-known machine learning data sets show the efficiency of the proposed method in terms of speed and accuracy compared with the state-of-the-art methods reviewed. With some optimization, the proposed method has great potential to be used for big data classification and can be generalized to other applications, particularly when classification speed is the main concern.
Article
Full-text available
While standing as one of the most widely considered and successful supervised classification algorithms, the k-Nearest Neighbor (kNN) classifier generally exhibits poor efficiency due to being an instance-based method. In this sense, Approximated Similarity Search (ASS) stands as a possible alternative for addressing these efficiency issues, at the expense of typically lowering the performance of the classifier. In this paper we take as our starting point an ASS strategy based on clustering. We then improve its performance by addressing issues related to instances located close to the cluster boundaries, by enlarging the cluster size and by using Deep Neural Networks to learn a suitable representation for the classification task at hand. Results using a collection of eight different datasets show that the combined use of these two strategies entails a significant improvement in accuracy, with a considerable reduction in the number of distances needed to classify a sample in comparison to the basic kNN rule.
Article
Classification plays a significant role in production activities and daily life. In this era of big data, it is especially important to design efficient classifiers with high classification accuracy for large-scale datasets. In this paper, we propose randomly partitioned and Principal Component Analysis (PCA)-partitioned multivariate decision tree classifiers, whose training times are quite short and whose classification accuracies are quite high. Approximately balanced trees are created in the form of a full binary tree, based on two simple ways of generating multivariate combination weights and a median-based method for selecting the split value, which ensures the efficiency and effectiveness of the proposed algorithms. Extensive experiments conducted on a series of large datasets demonstrate that the proposed methods are superior to other classifiers in most cases.
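A minimal sketch of the PCA-partitioned variant described above, assuming a one-component PCA projection at each node and a median split; the original weight-generation schemes and other details are not reproduced.

```python
# Sketch: each node projects its examples onto the leading principal component
# and splits at the median projection, giving an approximately balanced tree.
import numpy as np
from sklearn.decomposition import PCA

def build_pca_tree(X, y, min_leaf=16):
    if len(X) <= min_leaf or len(np.unique(y)) == 1:
        classes, counts = np.unique(y, return_counts=True)
        return ("leaf", classes[np.argmax(counts)])       # majority class at the leaf
    pca = PCA(n_components=1).fit(X)
    z = pca.transform(X).ravel()
    thr = np.median(z)                                     # median-based split value
    left = z <= thr
    if left.all() or (~left).all():
        classes, counts = np.unique(y, return_counts=True)
        return ("leaf", classes[np.argmax(counts)])
    return ("node", pca, thr,
            build_pca_tree(X[left], y[left], min_leaf),
            build_pca_tree(X[~left], y[~left], min_leaf))

def predict_one(tree, x):
    while tree[0] == "node":
        _, pca, thr, lt, rt = tree
        tree = lt if pca.transform(x.reshape(1, -1)).ravel()[0] <= thr else rt
    return tree[1]
```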
Book
Bioinformatics, or computational biology, is a relatively new field that applies computer science and information technology to biology. In recent years, the discipline of bioinformatics has allowed biologists to make full use of advances in computer science and computational statistics for analyzing biological data. Researchers in the life sciences generate, collect, and need to analyze an increasing number of different types of scientific data: DNA, RNA, and protein sequences; in-situ and microarray gene expression; 3D protein structures; and biological pathways. This book aims to provide information on bioinformatics at various levels. The chapters included in this book cover introductory to advanced aspects, including applications of various documented research work and specific case studies related to bioinformatics. This book will be of immense value to readers of different backgrounds, such as engineers, scientists, consultants, and policy makers in industry, government, academia, and social and private organisations.