Effects of Distance Measure Choice on K-Nearest
Neighbor Classifier Performance: A Review
Haneen Arafat Abu Alfeilat,1 Ahmad B.A. Hassanat,1,* Omar Lasassmeh,1 Ahmad S. Tarawneh,2 Mahmoud Bashir Alhasanat,3,4 Hamzeh S. Eyal Salman,1 and V.B. Surya Prasath5-8
Abstract
The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question: which of the many available distance and similarity measures should be used with the KNN classifier? This review attempts to answer this question by evaluating the performance (measured by accuracy, precision, and recall) of the KNN using a large number of distance measures, tested on a number of real-world data sets, with and without adding different levels of noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed nonconvex distance performed the best on most data sets compared with the other tested distances. In addition, the performance of the KNN with this top-performing distance degraded by only ~20% when the noise level reached 90%, and the same holds for most of the distances used. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.
Keywords: big data; K-nearest neighbor; machine learning; noise; supervised learning
Introduction
Classification is an important problem in big data, data science, and machine learning. The K-nearest neighbor (KNN) algorithm is one of the oldest, simplest, and most accurate algorithms for pattern classification and regression models. KNN was proposed in 1951 by Fix and Hodges,1 and then modified by Cover and Hart.2 KNN has been identified as one of the top 10 methods in data mining.3 Consequently, KNN has been studied over the past few decades and widely applied in many fields.4 Thus, KNN serves as the baseline classifier in many pattern classification problems such as pattern recognition,5 text categorization,6 ranking models,7 object recognition,8 and event recognition9 applications. KNN is a nonparametric algorithm.10 Nonparametric means that there are either no parameters or no fixed number of parameters, irrespective of the size of the data; instead, the effective parameters are determined by the size of the training data set, and no assumptions need to be made about the underlying data distribution. Thus, KNN can be the best choice for any classification study that involves little or no prior knowledge about the distribution of the data. In addition, KNN is one of the laziest learning methods: it stores all the training data and waits until the test data are presented, without creating a learning model.11
Despite its slow characteristic, surprisingly, KNN is used extensively for big data classification; this includes
1Department of Computer Science, Faculty of Information Technology, Mutah University, Karak, Jordan.
2Department of Algorithms and Their Applications, Eötvös Loránd University, Budapest, Hungary.
3Department of Geomatics, Faculty of Environmental Design, King Abdulaziz University, Jeddah, Saudi Arabia.
4Department of Civil Engineering, Faculty of Engineering, Al-Hussein Bin Talal University, Maan, Jordan.
5Department of Pediatrics, Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio.
Departments of 6Pediatrics and 7Biomedical Informatics, College of Medicine, University of Cincinnati, Cincinnati, Ohio.
8Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio.
*Address correspondence to: Ahmad B.A. Hassanat, Department of Computer Science, Faculty of Information Technology, Mutah University, Karak 61710, Jordan, E-mail: hasanat@mutah.edu.jo
Big Data, Volume 00, Number 00, 2019
© Mary Ann Liebert, Inc.
DOI: 10.1089/big.2018.0175
the work of Refs.12-16; this is due to the other good characteristics of KNN, such as simplicity and reasonable accuracy, since the speed issue is normally solved using some kind of divide-and-conquer approach. Slowness is not the only problem associated with the KNN classifier: in addition to the problem of choosing the best K neighbors,17 choosing the best distance/similarity measure is an important problem, because the performance of the KNN classifier depends on the distance/similarity measure used. This article focuses on finding the best distance/similarity measure to be used with the KNN classifier to guarantee the highest possible accuracy.
Related work
Several studies have been conducted to analyze the performance of the KNN classifier using different distance measures. Each study was applied to various kinds of data sets with different distributions and types of data, and used a different number of distance and similarity measures.
Chomboon et al.18 analyzed the performance of the KNN classifier using 11 distance measures. These include the Euclidean distance (ED), Mahalanobis distance, Manhattan distance (MD), Minkowski distance, Chebychev distance, Cosine distance (CosD), Correlation distance (CorD), Hamming distance (HamD), Jaccard distance (JacD), Standardized Euclidean distance, and Spearman distance. Their experiments were applied to eight binary synthetic data sets with various kinds of distributions that were generated using MATLAB. They divided each data set into 70% for the training set and 30% for the testing set. The results showed that the MD, Minkowski, Chebychev, ED, Mahalanobis, and Standardized Euclidean distance measures achieved similar accuracy results and outperformed the other tested distances.
Mulak and Talhar19 evaluated the performance of the KNN classifier using the Chebychev, ED, and MD distance measures on the KDD data set.20 The KDD data set contains 41 features and 2 classes, and its data are numeric. The data set was normalized before conducting the experiment. To evaluate the performance of KNN, accuracy, sensitivity, and specificity measures were calculated for each distance. The reported results indicate that MD outperformed the other tested distances, with a 97.8% accuracy rate, 96.76% sensitivity rate, and 98.35% specificity rate.
Hu et al.21 analyzed the effect of distance measures on the KNN classifier for medical domain data sets. Their experiments were based on three different types of medical data sets containing categorical, numerical, and mixed types of data, which were chosen from the University of California, Irvine (UCI) machine learning repository, and four distance metrics: ED, CosD, Chi-square, and Minkowsky distances. They divided each data set into 90% of the data for training and 10% for testing, with K values ranging from 1 to 15. The experimental results showed that the Chi-square distance function was the best choice for the three different types of data sets. However, the CosD, ED, and Minkowsky distance metrics performed the "worst" over the mixed type of data sets. The "worst" performance means the method with the lowest accuracy.
Todeschini et al.22,23 analyzed the effect of 18 different distance measures on the performance of the KNN classifier using eight benchmark data sets. The investigated distance measures included MD, ED, Soergel distance (SoD), Lance–Williams distance, contracted Jaccard–Tanimoto distance, Jaccard–Tanimoto distance, Bhattacharyya distance (BD), Lagrange distance, Mahalanobis distance, Canberra distance (CanD), Wave-Edge distance, Clark distance (ClaD), CosD, CorD, and four locally centered Mahalanobis distances. To evaluate the performance of these distances, the nonerror rate and average rank were calculated for each distance. The results indicated that the "best" performers were the MD, ED, SoD, contracted Jaccard–Tanimoto, and Lance–Williams distance measures. The "best" performance means the method with the highest accuracy.
Lopes and Ribeiro24 analyzed the impact of five distance metrics, namely ED, MD, CanD, Chebychev distance, and Minkowsky distance, in instance-based learning algorithms, particularly the 1-nearest neighbor (1-NN) classifier and the Incremental Hypersphere Classifier. They reported the results of their empirical evaluation on 15 data sets of different sizes, showing that the Euclidean and Manhattan metrics yield significantly good results compared with the other tested distances.
Alkasassbeh et al.25 investigated the effect of the Euclidean, Manhattan, and a nonconvex distance due to Hassanat26 on the performance of the KNN classifier, with K ranging from 1 to the square root of the size of the training set, considering only the odd K's. In addition, they experimented with other classifiers such as the Ensemble Nearest Neighbor classifier17 and the Inverted Indexes of Neighbors Classifier.27 Their experiments were conducted on 28 data sets taken from the UCI machine learning repository. The reported results show that the Hassanat distance (HasD)26 outperformed both MD and ED on most of the tested data sets using the three investigated classifiers.
Lindi28 investigated three distance metrics to select the best performer among them for the KNN classifier, which was employed as a matcher for the face recognition system proposed for the NAO robot. The tested distances were the Chi-square distance, ED, and HasD. Their experiments showed that HasD outperformed the other two distances in terms of precision, but was slower than both of the other distances.
Table 1 provides a summary of these previous studies on evaluating various distances within the KNN classifier, along with the best distance assessed by each of them. As can be seen from this review of the most related studies, all of the previous studies investigated either a small number of distance and similarity measures (ranging from 3 to 18 distances), a small number of data sets, or both.
Contributions
In the KNN classifier, the distances between the test sample and the training data samples are identified by different measures. Therefore, distance measures play a vital role in determining the final classification output.21 ED is the most widely used distance metric in KNN classification; however, only a few studies have examined the effect of different distance metrics on the performance of KNN, and these used a small number of distances, a small number of data sets, or both. Such limited experiments cannot establish which distance is the best to be used with the KNN classifier. Therefore, this review attempts to bridge this gap by testing a large number of distance metrics on a large number of different data sets, in addition to investigating the distance metrics that are least affected by added noise.
The KNN classifier can deal with noisy data; therefore, we need to investigate the impact of choosing different distance measures on the KNN performance when classifying a large number of real-world data sets, in addition to investigating which distance has the lowest noise implications. There are two main research questions addressed in this review:
1. Which is the "best" distance metric to be implemented with the KNN classifier?
2. Which is the "best" distance metric to be implemented with the KNN classifier in the case of noise?
By the "best distance metric" we mean (in this review) the one that allows the KNN to classify test examples with the highest precision, recall, and accuracy, that is, the one that gives the best performance of the KNN in terms of accuracy.
The previous questions were partially answered by the aforementioned studies; however, most of the reviewed research in this regard used a small number of distance/similarity measures and/or a small number of data sets. This study investigates the use of a relatively large number of distances and data sets to draw more significant conclusions, in addition to reviewing a larger number of distances in one place.
Organization
We organized our review as follows. First, in the KNN and Distance Measures section, we provide an introductory overview of the KNN classification method, present its history, characteristics, advantages, and disadvantages, and review the definitions of the various distance measures used in conjunction with KNN. The Experimental Framework section explains the data sets that were used in the classification experiments, the structure of the experimental model, and the performance evaluation measures, and then presents and discusses the results produced by the experimental framework. Finally, in the Conclusions and Future Perspectives section, we provide the conclusions and possible future directions.
KNN and Distance Measures
Brief overview of KNN classifier
The KNN algorithm classifies an unlabeled test sample based on the majority of similar samples among the K nearest neighbors that are closest to the test sample. The distances between the test sample and each of the training data samples are determined by a specific distance measure.

Table 1. Comparison between previous studies of distance measures in the K-nearest neighbor classifier, along with the "best" performing distance

Reference   No. of distances   No. of data sets   Best distance
18          11                 8                  Manhattan, Minkowski, Chebychev, Euclidean, Mahalanobis, Standardized Euclidean
19          3                  1                  Manhattan
21          4                  37                 Chi-square
22          18                 8                  Manhattan, Euclidean, Soergel, Contracted Jaccard–Tanimoto, Lance–Williams
24          5                  15                 Euclidean and Manhattan
25          3                  28                 Hassanat
28          3                  2                  Hassanat
Ours        54                 28                 Hassanat

Comparatively, our current study compares the highest number of distance measures on a variety of data sets.
Figure 1 shows a KNN example containing training samples from two classes; the first class is "blue square" and the second class is "red triangle." The test sample is represented by the green circle. These samples are placed into a two-dimensional feature space with one dimension for each feature. To decide whether the test sample belongs to the class "blue square" or to the class "red triangle," KNN adopts a distance function to find the K nearest neighbors of the test sample; the majority class among these K neighbors predicts the class of the test sample. In this case, when K = 3, the test sample is classified to the class "red triangle" because there are two red triangles and only one blue square inside the inner circle, but when K = 5, it is classified to the "blue square" class because there are three blue squares and only two red triangles inside the outer circle.
KNN is simple, but has proved to be a highly efficient and effective algorithm for solving various real-life classification problems. However, KNN has some disadvantages, which include the following:
1. How to find the optimum K value in the KNN algorithm?
2. High computational time cost: we need to compute the distance between each test sample and all training samples, so each test example requires O(nm) time (number of operations), where n is the number of examples in the training data and m is the number of features for each example.
3. High memory requirement: we need to store all training samples, giving O(nm) space complexity.
4. Finally, we need to determine the distance function, which is the core of this study.
The first problem has been solved either by using all the examples and taking the inverted indexes,29 or by using ensemble learning.30 For the second and third problems, many studies have proposed different solutions based on reducing the size of the training data set; these include, but are not limited to, Refs.31-35, or the use of approximate KNN classification such as that of Arya and Mount36 and Zheng et al.37 Although some previous studies in the literature investigated the fourth problem (see the Related Work section), in this study we attempt to investigate the fourth problem on a much larger scale, that is, investigating a large number of distance metrics tested on a large set of problems. In addition, we investigate the effect of noise on choosing the most suitable distance metric to be used by the KNN classifier.
The basic KNN classifier steps can be described as follows:

Algorithm 1: Basic KNN algorithm
Input: Training samples D, test sample d, K
Output: Class label of test sample
1: Compute the distance between d and every sample in D
2: Choose the K samples in D that are nearest to d; denote the set by P (⊆ D)
3: Assign d the class that is the most frequent class (the majority class) in P
1. Training phase: The training samples and the class labels of these samples are stored; no missing data are allowed, and no non-numeric data are allowed.
2. Classification phase: Each test sample is classified using the majority vote of its neighbors by the following steps:
(a) Distances from the test sample to all stored training samples are calculated using a specific distance function or similarity measure.
(b) The K nearest neighbors of the test sample are selected, where K is a predefined small integer.
(c) The most repeated class of these K nearest neighbors is assigned to the test sample. In other words, a test sample is assigned to the class c if it is the most frequent class label among the K nearest training samples. If K = 1, then the class of the nearest neighbor is assigned to the test sample.
The KNN algorithm is described by Algorithm 1.

FIG. 1. An example of KNN classification with K neighbors, K = 3 (solid line circle) and K = 5 (dashed line circle); the distance measure is ED. Each triangle represents a training example with two features (x, y), which belongs to class 1. Each square represents a training example with two features (x, y), which belongs to class 2. The gray circle represents a test example with two features (x, y), which belongs to an unknown (?) class, and the KNN needs to predict its class based on the ED distance. ED, Euclidean distance; KNN, K-nearest neighbor.
We provide a toy example to illustrate how to compute the KNN classifier. Assume that we have three training examples with three attributes each, and one test example, as given in Table 2.
Step 1: Determine the parameter K = the number of nearest neighbors to be considered. For this example we assume K = 1.
Step 2: Calculate the distance between the test sample and all training samples using a specific similarity measure; in this example, ED is used, see Table 3.
Step 3: Sort all examples based on their similarity or distance to the tested example, and then keep only the K most similar (nearest) examples, as given in Table 4.
Step 4: Based on the minimum distance, the class of the test sample is assigned to be 1. However, if K = 3, for instance, the class would be 2.
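To make this procedure concrete, the following minimal sketch (in Python with NumPy, which this review does not prescribe) implements the basic KNN classifier with a pluggable distance function and reproduces the toy example of Tables 2-4; the default distance is the ED, and any of the measures reviewed below could be passed in instead.

import numpy as np

def knn_predict(X_train, y_train, x_test, k=1, distance=None):
    """Classify x_test by majority vote among its k nearest training samples."""
    if distance is None:  # default to the Euclidean distance (ED)
        distance = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
    dists = np.array([distance(x, x_test) for x in X_train])
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # most frequent class among the k

X_train = np.array([[5, 4, 3], [1, 2, 2], [1, 2, 3]])   # training samples of Table 2
y_train = np.array([1, 2, 2])
x_test = np.array([4, 4, 2])

print(knn_predict(X_train, y_train, x_test, k=1))   # -> 1 (nearest sample at distance 1.4)
print(knn_predict(X_train, y_train, x_test, k=3))   # -> 2 (two of the three neighbors are class 2)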
Noisy data
The existence of noise in data is mainly related to the way the data were acquired and preprocessed from the environment.38 Noisy data are data corrupted in some way, which leads to partial alteration of the data values. Two main sources of noise can be identified: first, the implicit errors caused by measurement tools, such as different types of sensors; second, the random errors caused by batch processes or experts while collecting data, for example, errors during document digitization. Based on these two sources of errors, two types of noise can be distinguished in a given data set39:
1. Class noise occurs when a sample is incorrectly labeled due to several causes, such as data entry errors during the labeling process, or the inadequacy of the information used to label each sample.
2. Attribute noise refers to corruptions in the values of one or more attributes due to several causes, such as failures in sensor devices, irregularities in sampling, or transcription errors.40
The generation of noise can be characterized by three main factors41:
1. The place where the noise is introduced: noise may affect the attributes, the class, the training data, and the test data, separately or in combination.
2. The noise distribution: the way in which the noise is introduced, for example, uniform or Gaussian.
3. The magnitude of the generated noise values: the extent to which the noise affects the data can be relative to each data value of each attribute, or relative to the standard deviation, minimum, or maximum of each attribute.
In this study, we add different noise levels to the tested data sets to find the distance metric that is least affected by this added noise with respect to the KNN classifier performance.
Distance measures review
The first appearance of the word distance can be found in the writings of Aristotle (384 BC–322 BC), who argued that the word distance means "It is between extremities that distance is greatest" or "things which have something between them, that is, a certain distance," and that "distance has the sense of dimension [as in space has three dimensions, length, breadth and depth]." Euclid, one of the most important mathematicians of ancient history, used the word distance only in his third postulate of the Principia42: "Every circle can be described by a centre and a distance."
Table 2. Training and testing data examples

                      X1   X2   X3   Class
Training sample (1)   5    4    3    1
Training sample (2)   1    2    2    2
Training sample (3)   1    2    3    2
Test sample           4    4    2    ?

Table 3. Training and testing data examples with distances

                      X1   X2   X3   Class   Distance
Training sample (1)   5    4    3    1       $D = \sqrt{(4-5)^2 + (4-4)^2 + (2-3)^2} = 1.4$
Training sample (2)   1    2    2    2       $D = \sqrt{(4-1)^2 + (4-2)^2 + (2-2)^2} = 3.6$
Training sample (3)   1    2    3    2       $D = \sqrt{(4-1)^2 + (4-2)^2 + (2-3)^2} = 3.7$
Test sample           4    4    2    ?

Table 4. Training and testing data examples with distances and ranks

                      X1   X2   X3   Class   Distance   Rank (minimum distance)
Training sample (1)   5    4    3    1       1.4        1
Training sample (2)   1    2    2    2       3.6        2
Training sample (3)   1    2    3    2       3.7        3
Test sample           4    4    2    ?
The distance is a numerical description of how far apart entities are. In data mining, the distance means a concrete way of describing what it means for elements of some space to be close to or far away from each other. Synonyms for distance include farness, dissimilarity, and diversity, and synonyms for similarity include proximity43 and nearness.22
The distance function between two vectors x and y is a function d(x, y) that defines the distance between both vectors as a non-negative real number. This function is considered a metric if it satisfies a certain number of properties44 that include the following:
1. Non-negativity: The distance between x and y is always a value greater than or equal to zero,
$d(x, y) \geq 0$.
2. Identity of indiscernibles: The distance between x and y is equal to zero if and only if x is equal to y,
$d(x, y) = 0 \iff x = y$.
3. Symmetry: The distance between x and y is equal to the distance between y and x,
$d(x, y) = d(y, x)$.
4. Triangle inequality: Considering the presence of a third point z, the distance between x and y is always less than or equal to the sum of the distance between x and z and the distance between z and y,
$d(x, y) \leq d(x, z) + d(z, y)$.
When the distance is in the range [0, 1], a corresponding similarity measure s(x, y) can be calculated as
$s(x, y) = 1 - d(x, y)$.
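As a brief illustration (not part of the original review), the following sketch numerically spot-checks the four metric properties above for a candidate distance function; passing random trials does not prove that a measure is a metric, but a single failure disproves it. Python with NumPy is assumed, and the Manhattan distance is used only as an example.

import numpy as np

def check_metric_properties(d, dim=4, trials=1000, tol=1e-9, rng=np.random.default_rng(0)):
    for _ in range(trials):
        x, y, z = rng.random(dim), rng.random(dim), rng.random(dim)
        assert d(x, y) >= -tol                       # non-negativity
        assert abs(d(x, x)) <= tol                   # identity of indiscernibles (d(x, x) = 0)
        assert abs(d(x, y) - d(y, x)) <= tol         # symmetry
        assert d(x, y) <= d(x, z) + d(z, y) + tol    # triangle inequality
    return True

manhattan = lambda x, y: np.sum(np.abs(x - y))
print(check_metric_properties(manhattan))            # -> True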
We consider 8 major distance families that consist of 54 total distance measures. We categorized these distance measures following a categorization similar to that of Cha.43 In what follows, we give the mathematical definitions of distances that measure the closeness between two vectors x and y, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ have numeric attributes. As an example, in each of the categories of distances reviewed here we show the computed distance value between the example vectors v1 = {5.1, 3.5, 1.4, 0.3} and v2 = {5.4, 3.4, 1.7, 0.2}. Theoretical analysis of these various distance metrics is beyond the scope of this study.
1. Lp Minkowski distance measures: This family includes three distance metrics that are special cases of the Minkowski distance, corresponding to different values of p for this power distance. The Minkowski distance, also known as the Lp norm, is a generalized metric. It is defined as
$D_{Mink}(x, y) = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$,
where p is a positive value. When p = 2, the distance becomes the ED; when p = 1, it becomes the MD. The Chebyshev distance (CD) is a variant of the Minkowski distance with $p = \infty$. Here $x_i$ is the ith value in vector x and $y_i$ is the ith value in vector y.
1.1. MD: The MD, also known as the L1 norm, Taxicab norm, rectilinear distance, or City block distance, was considered by Hermann Minkowski in 19th-century Germany. This distance is the sum of the absolute differences between the corresponding values in the two vectors:
$MD(x, y) = \sum_{i=1}^{n} |x_i - y_i|$.
1.2. CD: The CD is also known as the maximum value distance,45 Lagrange distance,22 and chessboard distance.46 This distance is appropriate when two objects are to be defined as different if they differ in any one dimension.47 It is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension:
$CD(x, y) = \max_{i} |x_i - y_i|$.
1.3. ED: Also known as the L2 norm or Ruler distance, which is an extension of the Pythagorean theorem. This distance is the square root of the sum of the squared differences between the corresponding values in the two vectors:
$ED(x, y) = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$.

Lp Minkowski distance measures family

Abbreviation   Name        Definition                                Result
MD             Manhattan   $\sum_{i=1}^{n} |x_i - y_i|$              0.8
CD             Chebyshev   $\max_{i} |x_i - y_i|$                    0.3
ED             Euclidean   $\sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$     0.4472
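A short sketch of this family (Python with NumPy assumed; not prescribed by the review) evaluates the generalized Minkowski distance on the example vectors v1 and v2, recovering the MD, ED, and CD values in the table above.

import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    return np.max(np.abs(x - y))     # the p -> infinity limit of the Minkowski distance

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

print(minkowski(v1, v2, p=1))        # MD  = 0.8
print(minkowski(v1, v2, p=2))        # ED  ~ 0.4472
print(chebyshev(v1, v2))             # CD  = 0.3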
2. L1 distance measures: This family depends mainly on the absolute differences; it includes the Lorentzian distance (LD), CanD, Sorensen distance (SD), SoD, Kulczynski distance (KD), Mean Character distance (MCD), and Non Intersection distance (NID).
2.1. LD: The LD is the sum of the natural logarithms of the absolute differences between the two vectors. This distance is sensitive to small changes, since the log scale expands the lower range and compresses the higher range:
$LD(x, y) = \sum_{i=1}^{n} \ln(1 + |x_i - y_i|)$,
where ln is the natural logarithm; 1 is added to ensure the non-negativity property and to avoid the log of 0.
2.2. CanD: The CanD was introduced in Ref.48 and modified by Lance and Williams.49 It is a weighted version of the MD, wherein the absolute difference between the attribute values of the vectors x and y is divided by the sum of the absolute attribute values before summing.50 This distance is mainly used for positive values. It is very sensitive to small changes near 0, where it is more sensitive to proportional than to absolute differences; this characteristic becomes more apparent in higher dimensional spaces, with an increasing number of variables. The CanD is often used for data scattered around an origin:
$CanD(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$.
2.3. SD: The SD,51 also known as Bray–Curtis, is one of the most commonly applied measures to express relationships in ecology, environmental sciences, and related fields. It is a modified Manhattan metric, wherein the summed differences between the attribute values of the vectors x and y are standardized by their summed attribute values.52 When all the vector values are positive, this measure takes a value between 0 and 1:
$SD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$.
2.4. SoD: The SoD is one of the distance measures widely used to calculate evolutionary distance.53 It is also known as the Ruzicka distance. For binary variables only, this distance is identical to the complement of the Tanimoto (or Jaccard) similarity coefficient.54 This distance obeys all four metric properties provided all attributes have non-negative values55:
$SoD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \max(x_i, y_i)}$.
2.5. KD: Similar to the SoD, but instead of using the maximum, it uses the minimum function:
$KD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}$.
2.6. MCD: Also known as the average Manhattan or Gower distance:
$MCD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{n}$.
2.7. NID: The NID is the complement of the intersection similarity and is obtained by subtracting the intersection similarity from 1:
$NID(x, y) = \frac{1}{2} \sum_{i=1}^{n} |x_i - y_i|$.

L1 distance measures family

Abbreviation   Name               Definition                                                           Result
LD             Lorentzian         $\sum_{i=1}^{n} \ln(1 + |x_i - y_i|)$                                0.7153
CanD           Canberra           $\sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$                   0.0381
SD             Sorensen           $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$      0.0381
SoD            Soergel            $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \max(x_i, y_i)}$   0.0734
KD             Kulczynski         $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}$   0.0792
MCD            Mean Character     $\frac{\sum_{i=1}^{n} |x_i - y_i|}{n}$                               0.2
NID            Non Intersection   $\frac{1}{2} \sum_{i=1}^{n} |x_i - y_i|$                             0.4
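The L1 family can be written compactly as functions of the absolute differences; the following sketch (Python with NumPy assumed, not prescribed by the review) evaluates each member on the same example vectors v1 and v2.

import numpy as np

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

l1_family = {
    "Lorentzian (LD)":        lambda x, y: np.sum(np.log(1 + np.abs(x - y))),
    "Canberra (CanD)":        lambda x, y: np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y))),
    "Sorensen (SD)":          lambda x, y: np.sum(np.abs(x - y)) / np.sum(x + y),
    "Soergel (SoD)":          lambda x, y: np.sum(np.abs(x - y)) / np.sum(np.maximum(x, y)),
    "Kulczynski (KD)":        lambda x, y: np.sum(np.abs(x - y)) / np.sum(np.minimum(x, y)),
    "Mean Character (MCD)":   lambda x, y: np.sum(np.abs(x - y)) / len(x),
    "Non Intersection (NID)": lambda x, y: 0.5 * np.sum(np.abs(x - y)),
}

for name, d in l1_family.items():
    print(f"{name}: {d(v1, v2):.4f}")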
3. Inner product distance measures: Distance measures belonging to this family are calculated from products of pairwise values from both vectors; this family includes JacD, CosD, the Dice distance (DicD), and the Chord distance (ChoD).
3.1. JacD: The JacD measures dissimilarity between sample sets. It is complementary to the Jaccard similarity coefficient56 and is obtained by subtracting the Jaccard coefficient from 1. This distance is a metric57:
$JacD(x, y) = \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}$.
3.2. CosD: The CosD, also called the angular distance, is derived from the cosine similarity, which measures the angle between two vectors; the CosD is obtained by subtracting the cosine similarity from 1:
$CosD(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$.
3.3. DicD: The DicD is derived from the dice similarity,58 being its complement, and is obtained by subtracting the dice similarity from 1. It can be sensitive to values near 0. This distance is not a metric; in particular, the triangle inequality property does not hold. This distance is widely used in information retrieval in documents and biological taxonomy:
$DicD(x, y) = 1 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}$.
3.4. ChoD: The ChoD is a modification of the ED,59 introduced by Orloci60 to be used in analyzing community composition data.61 It was defined as the length of the chord joining two normalized points within a hypersphere of radius 1. This distance is one of the distance measures commonly used for clustering continuous data62:
$ChoD(x, y) = \sqrt{2 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}}$.

Inner product distance measures family

Abbreviation   Name      Definition                                                                                                    Result
JacD           Jaccard   $\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}$   0.0048
CosD           Cosine    $1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$                  0.0016
DicD           Dice      $1 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}$                            0.9524
ChoD           Chord     $\sqrt{2 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}}$                0.0564
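Since every member of this family is built from the dot products $\sum x_i y_i$, $\sum x_i^2$, and $\sum y_i^2$, they can be computed together, as in the following sketch (Python with NumPy assumed), evaluated on v1 and v2.

import numpy as np

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

def inner_product_distances(x, y):
    xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return {
        "Jaccard (JacD)": np.sum((x - y) ** 2) / (xx + yy - xy),
        "Cosine (CosD)":  1 - xy / (np.sqrt(xx) * np.sqrt(yy)),
        "Dice (DicD)":    1 - 2 * xy / (xx + yy),
        "Chord (ChoD)":   np.sqrt(2 - 2 * xy / np.sqrt(xx * yy)),
    }

for name, value in inner_product_distances(v1, v2).items():
    print(f"{name}: {value:.4f}")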
4. Squared chord distance (SCD) measures: Distances that belong to this family are obtained from sums of geometric means; the geometric mean of two values is the square root of their product. The distances in this family cannot be used with feature vectors containing negative values. The family includes the Bhattacharyya distance, SCD, Matusita distance (MatD), and Hellinger distance (HeD).
4.1. BD: The BD measures the similarity of two probability distributions63:
$BD(x, y) = -\ln \sum_{i=1}^{n} \sqrt{x_i y_i}$.
4.2. SCD: The SCD is mostly used by paleontologists and in studies on pollen. In this distance, the sum of the squared differences of the square roots is taken along both vectors, which increases the difference for more dissimilar features:
$SCD(x, y) = \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2$.
4.3. MatD: The MatD is the square root of the SCD:
$MatD(x, y) = \sqrt{\sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$.
4.4. HeD: The HeD, also called the Jeffries–Matusita distance,64 was introduced in 1909 by Hellinger.65 It is a metric used to measure the similarity between two probability distributions and is closely related to the BD:
$HeD(x, y) = \sqrt{2 \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$.

SCD measures family

Abbreviation   Name            Definition                                            Result
BD             Bhattacharyya   $-\ln \sum_{i=1}^{n} \sqrt{x_i y_i}$                  2.34996
SCD            Squared chord   $\sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2$          0.0297
MatD           Matusita        $\sqrt{\sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$   0.1722
HeD            Hellinger       $\sqrt{2 \sum_{i=1}^{n} (\sqrt{x_i} - \sqrt{y_i})^2}$ 0.2436
5. Squared L2 distance measures: In the squared L2 distance family, the square of the difference at each point along both vectors contributes to the total distance. This family includes the Squared Euclidean distance (SED), ClaD, Neyman χ² distance (NCSD), Pearson χ² distance (PCSD), Squared χ² distance (SquD), Probabilistic Symmetric χ² distance (PSCSD), Divergence distance (DivD), Additive Symmetric χ² distance (ASCSD), Average distance (AD), Mean Censored Euclidean distance (MCED), and Squared Chi-Squared distance (SCSD).
5.1. SED: The SED is the sum of the squared differences without taking the square root:
$SED(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2$.
5.2. ClaD: The ClaD, also called the coefficient of divergence, was introduced by Clark.66 It is the square root of half of the DivD:
$ClaD(x, y) = \sqrt{\sum_{i=1}^{n} \left( \frac{x_i - y_i}{|x_i| + |y_i|} \right)^2}$.
5.3. NCSD: The NCSD67 is called a quasi-distance:
$NCSD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}$.
5.4. PCSD: The PCSD,68 also called the χ² distance:
$PCSD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i}$.
5.5. SquD: The SquD is also called the triangular discrimination distance. This distance is a quasi-distance:
$SquD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$.
5.6. PSCSD: This distance is equivalent to the Sangvi χ² distance:
$PSCSD(x, y) = 2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$.
5.7. DivD:
$DivD(x, y) = 2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{(x_i + y_i)^2}$.
5.8. ASCSD: Also known as the symmetric χ² divergence:
$ASCSD(x, y) = 2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2 (x_i + y_i)}{x_i y_i}$.
5.9. AD: The AD, also known as the average Euclidean distance, is a modified version of the ED.62 The ED has the following drawback: "if two data vectors have no attribute values in common, they may have a smaller distance than the other pair of data vectors containing the same attribute values,"59 so this distance was adopted:
$AD(x, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}$.
5.10. MCED: In this distance, the sum of squared differences between values is calculated and, to get the mean value, the summed value is divided by the number of pairs whose values are not both 0; then the square root of the mean is computed to get the final distance:
$MCED(x, y) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} \mathbf{1}_{x_i^2 + y_i^2 \neq 0}}}$.
5.11. SCSD:
$SCSD(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{|x_i + y_i|}$.

Squared L2 distance measures family

Abbreviation   Name                         Definition                                                                                       Result
SED            Squared Euclidean            $\sum_{i=1}^{n} (x_i - y_i)^2$                                                                   0.2
ClaD           Clark                        $\sqrt{\sum_{i=1}^{n} \left( \frac{|x_i - y_i|}{x_i + y_i} \right)^2}$                           0.2245
NCSD           Neyman χ²                    $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}$                                                       0.1181
PCSD           Pearson χ²                   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i}$                                                       0.1225
SquD           Squared χ²                   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$                                                 0.0591
PSCSD          Probabilistic Symmetric χ²   $2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$                                               0.1182
DivD           Divergence                   $2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{(x_i + y_i)^2}$                                           0.1008
ASCSD          Additive Symmetric χ²        $2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2 (x_i + y_i)}{x_i y_i}$                                     0.8054
AD             Average                      $\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}$                                                0.2236
MCED           Mean Censored Euclidean      $\sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} \mathbf{1}_{x_i^2 + y_i^2 \neq 0}}}$   0.2236
SCSD           Squared Chi-Squared          $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{|x_i + y_i|}$                                               0.0591
6. Shannon entropy distance measures: The distance measures belonging to this family are related to the Shannon entropy.69 These distances include the Kullback–Leibler distance (KLD), Jeffreys distance (JefD), KDD, Topsoe distance (TopD), Jensen–Shannon distance (JSD), and Jensen difference distance (JDD).
6.1. KLD: The KLD was introduced by Kullback and Leibler.70 It is also known as the Kullback–Leibler divergence, relative entropy, or information deviation, and measures the difference between two probability distributions. This distance is not a metric measure, because it is not symmetric; furthermore, it does not satisfy the triangle inequality property, and therefore it is called a quasi-distance. The Kullback–Leibler divergence has been used in several natural language applications such as query expansion, language models, and categorization71:
$KLD(x, y) = \sum_{i=1}^{n} x_i \ln \frac{x_i}{y_i}$,
where ln is the natural logarithm.
6.2. JefD: The JefD,72 also called the J-divergence or KL2-distance, is a symmetric version of the KLD:
$JefD(x, y) = \sum_{i=1}^{n} (x_i - y_i) \ln \frac{x_i}{y_i}$.
6.3. KDD:
$KDD(x, y) = \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i}$.
6.4. TopD: The TopD,73 also called information statistics, is a symmetric version of the KLD. The TopD is twice the Jensen–Shannon divergence. This distance is not a metric, but its square root is a metric:
$TopD(x, y) = \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i}$.
6.5. JSD: The JSD is based on the Jensen–Shannon divergence, which uses the average method to make the K divergence symmetric; it is half of the TopD:
$JSD(x, y) = \frac{1}{2} \left[ \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i} \right]$.
6.6. JDD: The JDD was introduced by Sibson74:
$JDD(x, y) = \frac{1}{2} \sum_{i=1}^{n} \left[ \frac{x_i \ln x_i + y_i \ln y_i}{2} - \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2} \right) \right]$.

Shannon entropy distance measures family

Abbreviation   Name                Definition                                                                                                                                                       Result
KLD            Kullback–Leibler    $\sum_{i=1}^{n} x_i \ln \frac{x_i}{y_i}$                                                                                                                         0.3402
JefD           Jeffreys            $\sum_{i=1}^{n} (x_i - y_i) \ln \frac{x_i}{y_i}$                                                                                                                 0.1184
KDD            K divergence        $\sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i}$                                                                                                                 0.1853
TopD           Topsoe              $\sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i}$                                                               0.0323
JSD            Jensen–Shannon      $\frac{1}{2} \left[ \sum_{i=1}^{n} x_i \ln \frac{2 x_i}{x_i + y_i} + \sum_{i=1}^{n} y_i \ln \frac{2 y_i}{x_i + y_i} \right]$                                    0.014809
JDD            Jensen difference   $\frac{1}{2} \sum_{i=1}^{n} \left[ \frac{x_i \ln x_i + y_i \ln y_i}{2} - \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2} \right) \right]$     0.0074
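The following sketch (Python with NumPy assumed, not prescribed by the review) implements four of these entropy-based measures; because they take logarithms of ratios, the inputs are assumed to have strictly positive components, as with the example vectors v1 and v2.

import numpy as np

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])

def kullback_leibler(x, y):
    return np.sum(x * np.log(x / y))

def jeffreys(x, y):
    return np.sum((x - y) * np.log(x / y))            # symmetric version of the KLD

def topsoe(x, y):
    m = x + y
    return np.sum(x * np.log(2 * x / m)) + np.sum(y * np.log(2 * y / m))

def jensen_shannon(x, y):
    return 0.5 * topsoe(x, y)                         # JSD is half of the Topsoe distance

for d in (kullback_leibler, jeffreys, topsoe, jensen_shannon):
    print(d.__name__, round(d(v1, v2), 4))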
7. Vicissitude distance measures: The vicissitude distance family consists of four distances: the Vicis-Wave Hedges distance (VWHD), Vicis Symmetric distance (VSD), Max Symmetric χ² distance (MSCSD), and Min Symmetric χ² distance (MiSCSD). These distances were generated from syntactic relationships among the aforementioned distance measures.
7.1. VWHD: The so-called Wave-Hedges distance has been applied to compressed image retrieval,75 content-based video retrieval,76 time series classification,77 image fidelity,78 finger print recognition,79 etc. Interestingly, the source of the "Wave-Hedges" metric has not been correctly cited, and some of the previously mentioned resources allude to it incorrectly as given by Hedges.80 The source of this metric eludes the authors, despite best efforts otherwise. Even the name of the distance "Wave-Hedges" is questioned.26
$VWHD(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{\min(x_i, y_i)}$.
7.2. VSD: The VSD is defined by three formulas, VSDF1, VSDF2, and VSDF3, as follows:
$VSDF1(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)^2}$,
$VSDF2(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)}$,
$VSDF3(x, y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\max(x_i, y_i)}$.
7.3. MSCSD:
$MSCSD(x, y) = \max\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$.
7.4. MiSCSD:
$MiSCSD(x, y) = \min\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$.

Vicissitude distance measures family

Abbreviation   Name                Definition                                                                                                Result
VWHD           Vicis-Wave Hedges   $\sum_{i=1}^{n} \frac{|x_i - y_i|}{\min(x_i, y_i)}$                                                       0.8025
VSDF1          Vicis Symmetric 1   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)^2}$                                                   0.3002
VSDF2          Vicis Symmetric 2   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)}$                                                     0.1349
VSDF3          Vicis Symmetric 3   $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\max(x_i, y_i)}$                                                     0.1058
MSCSD          Max Symmetric χ²    $\max\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$   0.1225
MiSCSD         Min Symmetric χ²    $\min\left( \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}, \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{y_i} \right)$   0.1181
8. Other distance measures: These metrics utilize multiple ideas or measures from the previous distance measures; they include, but are not limited to, the Average (L1, L∞) distance (AvgD), Kumar–Johnson distance (KJD), Taneja distance (TanD), Pearson distance (PeaD), CorD, Squared Pearson distance (SPeaD), HamD, Hausdorff distance (HauD), χ² statistic distance (CSSD), Whittaker's index of association distance (WIAD), Meehl distance (MeeD), Motyka distance (MotD), and HasD.
8.1. AvgD: The AvgD is the average of the MD and CD:
$AvgD(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i| + \max_{i} |x_i - y_i|}{2}$.
8.2. KJD:
$KJD(x, y) = \sum_{i=1}^{n} \left( \frac{(x_i^2 + y_i^2)^2}{2 (x_i y_i)^{3/2}} \right)$.
8.3. TanD81:
$TanD(x, y) = \sum_{i=1}^{n} \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2 \sqrt{x_i y_i}} \right)$.
8.4. PeaD: The PeaD is derived from the Pearson correlation coefficient, which measures the linear relationship between two vectors.82 This distance is obtained by subtracting the Pearson correlation coefficient from 1:
$PeaD(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}$,
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
8.5. CorD: The CorD is a version of the PeaD, where the PeaD is scaled to obtain a distance measure in the range between 0 and 1:
$CorD(x, y) = \frac{1}{2} \left( 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)$.
8.6. SPeaD:
$SPeaD(x, y) = 1 - \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^2$.
8.7. HamD: The HamD83 is a distance metric that measures the number of mismatches between two vectors. It is mostly used for nominal data, strings, and bit-wise analyses, and can also be useful for numerical data:
$HamD(x, y) = \sum_{i=1}^{n} \mathbf{1}_{x_i \neq y_i}$.
8.8. HauD:
$HauD(x, y) = \max(h(x, y), h(y, x))$,
where $h(x, y) = \max_{x_i \in x} \min_{y_j \in y} \|x_i - y_j\|$, and $\|\cdot\|$ is the vector norm (e.g., the L2 norm). The function h(x, y) is called the directed HauD from x to y. The HauD(x, y) measures the degree of mismatch between the sets x and y by measuring the remoteness between each point $x_i$ and $y_j$ and vice versa.
8.9. CSSD: The CSSD has been used for image retrieval,84 histograms,85 etc.:
$CSSD(x, y) = \sum_{i=1}^{n} \frac{x_i - m_i}{m_i}$,
where $m_i = \frac{x_i + y_i}{2}$.
8.10. WIAD: The WIAD was designed for species abundance data86:
$WIAD(x, y) = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{x_i}{\sum_{i=1}^{n} x_i} - \frac{y_i}{\sum_{i=1}^{n} y_i} \right|$.
8.11. MeeD: The MeeD depends on one consecutive point in each vector:
$MeeD(x, y) = \sum_{i=1}^{n-1} (x_i - y_i - x_{i+1} + y_{i+1})^2$.
8.12. MotD:
$MotD(x, y) = \frac{\sum_{i=1}^{n} \max(x_i, y_i)}{\sum_{i=1}^{n} (x_i + y_i)}$.
8.13. HasD: This is a nonconvex distance introduced by Hassanat26:
$HasD(x, y) = \sum_{i=1}^{n} D(x_i, y_i)$,
where
$D(x_i, y_i) = \begin{cases} 1 - \frac{1 + \min(x_i, y_i)}{1 + \max(x_i, y_i)}, & \min(x_i, y_i) \geq 0 \\ 1 - \frac{1 + \min(x_i, y_i) + |\min(x_i, y_i)|}{1 + \max(x_i, y_i) + |\min(x_i, y_i)|}, & \min(x_i, y_i) < 0 \end{cases}$
As can be seen, each term $D(x_i, y_i)$ of HasD is bounded by [0, 1]. It reaches 1 when the maximum value approaches infinity while the minimum is finite, or when the minimum value approaches minus infinity while the maximum is finite. This is shown in Figure 2 and the following equation:
$\lim_{\max(A_i, B_i) \to \infty} D(A_i, B_i) = \lim_{\min(A_i, B_i) \to -\infty} D(A_i, B_i) = 1$.
By satisfying all the metric properties, this distance was proved to be a metric by Hassanat.26 In this metric, no matter what the difference between two values is, the per-dimension distance lies in the range 0 to 1, so the maximum distance approaches the dimension of the tested vectors; therefore, an increase in dimensionality increases the distance linearly in the worst case.
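Because HasD is the distance this review ultimately identifies as the top performer, a compact sketch is given below (Python with NumPy assumed, not prescribed by the review); it applies the piecewise per-dimension definition above and sums the contributions.

import numpy as np

def hassanat(x, y):
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    d = np.where(lo >= 0,
                 1 - (1 + lo) / (1 + hi),                               # case min(xi, yi) >= 0
                 1 - (1 + lo + np.abs(lo)) / (1 + hi + np.abs(lo)))     # case min(xi, yi) < 0
    return np.sum(d)

v1 = np.array([5.1, 3.5, 1.4, 0.3])
v2 = np.array([5.4, 3.4, 1.7, 0.2])
print(round(hassanat(v1, v2), 4))   # -> 0.2571, matching the value reported in the table below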
FIG. 2. Representation of HasD between the points 0 and n, where n belongs to [-10, 10]. HasD, Hassanat distance.

Other distance measures family

Abbreviation   Name                               Definition                                                                                                                                                                          Result
AvgD           Average (L1, L∞)                   $\frac{\sum_{i=1}^{n} |x_i - y_i| + \max_{i} |x_i - y_i|}{2}$                                                                                                                       0.55
KJD            Kumar–Johnson                      $\sum_{i=1}^{n} \frac{(x_i^2 + y_i^2)^2}{2 (x_i y_i)^{3/2}}$                                                                                                                        21.2138
TanD           Taneja                             $\sum_{i=1}^{n} \left( \frac{x_i + y_i}{2} \right) \ln \left( \frac{x_i + y_i}{2 \sqrt{x_i y_i}} \right)$                                                                           0.0149
PeaD           Pearson                            $1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}$, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$   0.9684
CorD           Correlation                        $\frac{1}{2} \left( 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)$              0.4842
SPeaD          Squared Pearson                    $1 - \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^2$                        0.999
HamD           Hamming                            $\sum_{i=1}^{n} \mathbf{1}_{x_i \neq y_i}$                                                                                                                                          4
HauD           Hausdorff                          $\max(h(x, y), h(y, x))$, $h(x, y) = \max_{x_i \in x} \min_{y_j \in y} \|x_i - y_j\|$                                                                                               0.3
CSSD           χ² statistic                       $\sum_{i=1}^{n} \frac{x_i - m_i}{m_i}$, $m_i = \frac{x_i + y_i}{2}$                                                                                                                 0.0894
WIAD           Whittaker's index of association   $\frac{1}{2} \sum_{i=1}^{n} \left| \frac{x_i}{\sum_{i=1}^{n} x_i} - \frac{y_i}{\sum_{i=1}^{n} y_i} \right|$                                                                         1.9377
MeeD           Meehl                              $\sum_{i=1}^{n-1} (x_i - y_i - x_{i+1} + y_{i+1})^2$                                                                                                                                0.48
MotD           Motyka                             $\frac{\sum_{i=1}^{n} \max(x_i, y_i)}{\sum_{i=1}^{n} (x_i + y_i)}$                                                                                                                  0.5190
HasD           Hassanat                           $\sum_{i=1}^{n} D(x_i, y_i)$, where $D(x_i, y_i) = 1 - \frac{1 + \min(x_i, y_i)}{1 + \max(x_i, y_i)}$ if $\min(x_i, y_i) \geq 0$, and $1 - \frac{1 + \min(x_i, y_i) + |\min(x_i, y_i)|}{1 + \max(x_i, y_i) + |\min(x_i, y_i)|}$ if $\min(x_i, y_i) < 0$   0.2571

Experimental Framework
Data sets used for experiments
The experiments were done on 28 data sets that represent real-life classification problems, obtained from the UCI Machine Learning Repository.87 The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The database was created in 1987 by David Aha and fellow graduate students at UCI. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets.
Each data set consists of a set of examples. Each example is defined by a number of attributes, and all the examples inside a data set are represented by the same number of attributes. One of these attributes is called the class attribute, which contains the class value (label) of the data, whose values are to be predicted for the test examples. A short description of all the data sets used is provided in Table 5.
Experimental setup
Each data set is divided into two parts, one for training and the other for testing. For this purpose, 34% of the data set is used for testing and 66% of the data set is dedicated to training. The value of K is set to 1 for simplicity. The 34% of the data used as the test sample were chosen randomly, and each experiment on each data set was repeated 10 times to obtain different random examples for testing and training. The overall experimental framework is shown in Figure 3. Our experiments are divided into two major parts:
1. The first part of the experiments aims to find the best distance measure to be used by the KNN classifier without any noise in the data sets. We used all 54 distances reviewed in the Distance Measures Review section.
2. The second part of the experiments aims to find the best distance measure to be used by the KNN classifier in the case of noisy data. In this study, we define the "best" method as the method that performs with the highest accuracy. We added noise into each data set at various levels. The experiments in the second part were conducted using the top 10 distances, those that achieved the best results in the first part of the experiments. To create a noisy data set from the original data set, a noise level x% is selected in the range 10%-90%; the noise level determines the number of examples that are made noisy, the noisy examples are selected randomly, and every attribute of each selected example is corrupted by a random value chosen between the minimum and maximum values of that attribute. Algorithm 2 describes the process of corrupting data with random noise, to be used in further experiments for the purposes of this study (a code sketch follows the algorithm).

Algorithm 2: Create noisy data set
Input: Original data set D, level of noise x% in [10%-90%]
Output: Noisy data set
1: Number of noisy examples: N = x% * number of examples in D
2: Array NoisyExample[N]
3: for k = 1 to N do
4:   Randomly choose an example number E from D
5:   if E was chosen previously then
6:     Go to Step 4
7:   else
8:     NoisyExample[k] = E
9: for each attribute A_i do
10:   for each noisy example NE_j do
11:     RV = Random value between Min(A_i) and Max(A_i)
12:     NE_j.A_i = RV
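A sketch of Algorithm 2 is given below (Python with NumPy assumed; the layout of one example per row is an assumption, not part of the original algorithm): a randomly chosen fraction of the examples has every attribute replaced by a random value drawn between that attribute's minimum and maximum.

import numpy as np

def add_noise(data, noise_level, rng=np.random.default_rng(0)):
    """Return a copy of data with noise_level (0-1) of its examples corrupted."""
    noisy = data.copy()
    n_examples, n_features = data.shape
    n_noisy = int(noise_level * n_examples)                      # number of examples to corrupt
    rows = rng.choice(n_examples, size=n_noisy, replace=False)   # distinct examples, chosen randomly
    mins, maxs = data.min(axis=0), data.max(axis=0)              # per-attribute value ranges
    noisy[rows] = rng.uniform(mins, maxs, size=(n_noisy, n_features))
    return noisy

X = np.random.default_rng(1).random((10, 4))   # toy data set: 10 examples, 4 attributes
print(add_noise(X, noise_level=0.3))           # 3 of the 10 examples are corrupted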
Performance evaluation measures
Different measures are available for evaluating the performance of classifiers. In this study, three measures were used: accuracy, precision, and recall. Accuracy is calculated to evaluate the overall classifier performance. It is defined as the ratio of the test samples that are correctly classified to the total number of tested examples:
$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test samples}}. \quad (1)$
To assess the performance with respect to every class in a data set, we compute precision and recall measures. Precision (or positive predictive value) is the fraction of retrieved instances that are relevant, whereas recall (or sensitivity) is the fraction of relevant instances that are retrieved. These measures can be constructed by computing the following:
1. True positive (TP): The number of correctly classified examples of a specific class (as we calculate these measures for each class).
2. True negative (TN): The number of correctly classified examples that do not belong to the specific class.
3. False positive (FP): The number of examples that are incorrectly assigned to the specific class.
4. False negative (FN): The number of examples of the specific class that are incorrectly assigned to another class.

Table 5. Description of the real-world data sets used (from the UCI machine learning repository)

Name                 #E       #F   #C   Data type          Min      Max
Heart                270      25   2    Real and integer   0        564
Balance              625      4    3    Positive integer   1        5
Cancer               683      9    2    Positive integer   0        9
German               1000     24   2    Positive integer   0        184
Liver                345      6    2    Real and integer   0        297
Vehicle              846      18   4    Positive integer   0        1018
Vote                 399      10   2    Positive integer   0        2
BCW                  699      10   2    Positive integer   1        13,454,352
Haberman             306      3    2    Positive integer   0        83
Letter recognition   20,000   16   26   Positive integer   0        15
Wholesale            440      7    2    Positive integer   1        112,151
Australian           690      42   2    Positive real      0        100,001
Glass                214      9    6    Positive real      0        75.41
Sonar                208      60   2    Positive real      0        1
Wine                 178      13   3    Positive real      0.13     1680
EEG                  14,980   14   2    Positive real      86.67    715,897
Parkinson            1040     27   2    Positive real      0        1490
Iris                 150      4    3    Positive real      0.1      7.9
Diabetes             768      8    2    Real and integer   0        846
Monkey1              556      17   2    Binary             0        1
Ionosphere           351      34   2    Real               -1       1
Phoneme              5404     5    2    Real               -1.82    4.38
Segmen               2310     19   7    Real               -50      1386.33
Vowel                528      10   11   Real               -5.21    5.07
Wave21               5000     21   3    Real               -4.2     9.06
Wave40               5000     40   3    Real               -3.97    8.82
Banknote             1372     4    2    Real               -13.77   17.93
QSAR                 1055     41   2    Real               -5.256   147

BCW, Breast Cancer Wisconsin; #C, number of classes; #E, number of examples; #F, number of features; Max, maximum; Min, minimum; QSAR, quantitative structure activity relationships.
The precision and recall of a multiclass classification system are defined by
$\text{AveragePrecision} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}, \quad (2)$
$\text{AverageRecall} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}, \quad (3)$
where N is the number of classes, $TP_i$ is the number of TPs for class i, $FN_i$ is the number of FNs for class i, and $FP_i$ is the number of FPs for class i.

FIG. 3. The framework of our experiments for discerning the effect of various distances on the performance of the KNN classifier. 1-NN, 1-nearest neighbor.

These performance measures can be derived from the confusion matrix. The confusion matrix is a matrix that shows the predicted and actual classifications. The matrix is n × n, where n is the number of classes. The structure of the confusion matrix for multiclass classification is given by
$\begin{pmatrix} & \text{Classified as } c_1 & \text{Classified as } c_2 & \cdots & \text{Classified as } c_n \\ \text{Actual class } c_1 & c_{11} & c_{12} & \cdots & c_{1n} \\ \text{Actual class } c_2 & c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \text{Actual class } c_n & c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix} \quad (4)$
This matrix reports the numbers of FPs, FNs, TPs, and TNs, which are defined through the elements of the confusion matrix as follows:
$TP_i = c_{ii}, \quad (5)$
$FP_i = \sum_{k=1}^{N} c_{ki} - TP_i, \quad (6)$
$FN_i = \sum_{k=1}^{N} c_{ik} - TP_i, \quad (7)$
$TN_i = \sum_{k=1}^{N} \sum_{f=1}^{N} c_{kf} - TP_i - FP_i - FN_i. \quad (8)$
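The following sketch (Python with NumPy assumed; the example confusion matrix is hypothetical) computes the accuracy of Equation (1) and the macro-averaged precision and recall of Equations (2) and (3) from a confusion matrix, using the per-class counts defined in Equations (5)-(7).

import numpy as np

def evaluate(confusion):
    """Accuracy, average precision, and average recall from a confusion matrix
    C, where C[i, j] counts samples of actual class i classified as class j."""
    C = np.asarray(confusion, dtype=float)
    tp = np.diag(C)                       # TP_i = c_ii, Equation (5)
    fp = C.sum(axis=0) - tp               # FP_i = column sum minus TP_i, Equation (6)
    fn = C.sum(axis=1) - tp               # FN_i = row sum minus TP_i, Equation (7)
    accuracy = tp.sum() / C.sum()         # Equation (1)
    precision = np.mean(tp / (tp + fp))   # Equation (2)
    recall = np.mean(tp / (tp + fn))      # Equation (3)
    return accuracy, precision, recall

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
C = [[50, 2, 3],
     [4, 45, 1],
     [2, 3, 40]]
print(evaluate(C))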
Accuracy, precision, and recall are calculated for the KNN classifier using all the similarity measures and distance metrics discussed in the Distance Measures Review section, on all the data sets described in Table 5; this allows us to compare and assess the performance of the KNN classifier under the different distance metrics and similarity measures.
Experimental results and discussion
For the purposes of this review, two sets of experiments were conducted. The aim of the first set is to compare the performance of the KNN classifier when used with each of the 54 distance and similarity measures reviewed in the Distance Measures Review section, without any added noise. The second set of experiments is designed to find the most robust distance, that is, the one least affected by different levels of noise.
Without noise
A number of different predefined distance families were used in this set of experiments. The accuracy of each distance on each data set was averaged over 10 runs. The same procedure was followed for all other distance families to
Table 6. Average accuracies, recalls, precisions over all data sets for each distance
Distance Accuracy Recall Precision Distance Accuracy Recall Precision
ED 0.8001 0.6749 0.6724 PSCSD 0.6821 0.5528 0.5504
MD 0.8113 0.6831 0.681 DivD 0.812 0.678 0.6768
CD 0.7708 0.656 0.6467 ClaD 0.8227 0.6892 0.6871
LD 0.8316 0.6964 0.6934 ASCSD 0.6259 0.4814 0.4861
CanD 0.8282 0.6932 0.6916 SED 0.8001 0.6749 0.6724
SD 0.7407 0.6141 0.6152 AD 0.8001 0.6749 0.6724
SoD 0.7881 0.6651 0.6587 SCSD 0.8275 0.693 0.6909
KD 0.6657 0.5369 0.5325 MCED 0.7973 0.6735 0.6704
MCD 0.8113 0.6831 0.681 TopD 0.6793 0.461 0.4879
NID 0.8113 0.6831 0.681 JSD 0.6793 0.461 0.4879
CosD 0.803 0.6735 0.6693 JDD 0.7482 0.5543 0.5676
ChoD 0.7984 0.662 0.6647 JefD 0.7951 0.6404 0.6251
JacD 0.8024 0.6756 0.6739 KLD 0.513 0.3456 0.3808
DicD 0.8024 0.6756 0.6739 KDD 0.5375 0.3863 0.4182
SCD 0.8164 0.65 0.4813 KJD 0.6501 0.4984 0.5222
HeD 0.8164 0.65 0.6143 TanD 0.7496 0.5553 0.5718
BD 0.4875 0.3283 0.4855 AvgD 0.8084 0.6811 0.6794
MatD 0.8164 0.65 0.5799 HamD 0.6413 0.5407 0.5348
VWHD 0.6174 0.4772 0.5871 MeeD 0.415 0.1605 0.3333
VSDF1 0.7514 0.6043 0.5125 WIAD 0.812 0.6815 0.6804
VSDF2 0.6226 0.4828 0.5621 HauD 0.5967 0.4793 0.4871
VSDF3 0.7084 0.5791 0.5621 CSSD 0.4397 0.2538 0.332
MSCSD 0.7224 0.5876 0.3769 SPeaD 0.8023 0.6711 0.6685
MiSCSD 0.6475 0.5137 0.5621 CorD 0.803 0.6717 0.6692
PCSD 0.6946 0.5709 0.5696 PeaD 0.759 0.6546 0.6395
NCSD 0.6536 0.5144 0.5148 MotD 0.7407 0.6141 0.6152
SquD 0.6821 0.5528 0.5504 HasD 0.8394 0.7018 0.701
Bold values signify the best performance, which is the highest accuracy, precision or recall.
HasD obtained the highest overall average.
AD, Average distance; ASCSD, Additive Symmetric χ² distance; AvgD, Average (L1, L∞) distance; BD, Bhattacharyya distance; CanD, Canberra distance; CD, Chebyshev distance; ChoD, Chord distance; ClaD, Clark distance; CorD, Correlation distance; CosD, Cosine distance; CSSD, χ² statistic distance; DicD, Dice distance; DivD, Divergence distance; ED, Euclidean distance; HamD, Hamming distance; HasD, Hassanat distance; HauD, Hausdorff distance; HeD, Hellinger distance; JacD, Jaccard distance; JDD, Jensen difference distance; JefD, Jeffreys distance; JSD, Jensen–Shannon distance; KD, Kulczynski distance; KDD, K divergence distance; KJD, Kumar–Johnson distance; KLD, Kullback–Leibler distance; LD, Lorentzian distance; MatD, Matusita distance; MCD, Mean Character distance; MCED, Mean Censored Euclidean distance; MD, Manhattan distance; MeeD, Meehl distance; MiSCSD, Min symmetric χ² distance; MotD, Motyka distance; MSCSD, Max symmetric χ² distance; NCSD, Neyman χ² distance; NID, Non Intersection distance; PCSD, Pearson χ² distance; PeaD, Pearson distance; PSCSD, Probabilistic Symmetric χ² distance; SCD, Squared chord distance; SCSD, Squared Chi-Squared distance; SD, Sorensen distance; SED, Squared Euclidean distance; SoD, Soergel distance; SPeaD, Squared Pearson distance; SquD, Squared χ² distance; TanD, Taneja distance; TopD, Topsoe distance; VSD, Vicis symmetric distance; VWHD, Vicis-Wave Hedges distance; WIAD, Whittaker's index of association distance.
report accuracy, recall, and precision of the KNN classifier for each distance on each data set. The average values for each of the 54 distances considered in this article are summarized in Table 6, where HasD obtained the highest overall average.
Table 7 shows the distances that obtained the highest
accuracy on each data set. Based on these results we
summarize the following observations.
The distance measures in the L1 family outperformed the other distance families in five data sets. LD achieved the highest accuracy in two data sets, namely Vehicle and Vowel, with average accuracies of 69.13% and 97.71%, respectively. CanD also achieved the highest accuracy in two data sets, Australian and Wine, with average accuracies of 82.09% and 98.5%, respectively. SD and SoD achieved the highest accuracy on the Segmen data set with an average accuracy of 96.76%. Among the Lp Minkowski and L1 distance families, MD, NID, and MCD achieved the same overall accuracies on all data sets; this is due to the similarity between these distances.
In the inner product family, JacD and DicD outperform all other tested distances on the Letter recognition data set with an average accuracy of 95.16%. Among the Lp Minkowski and inner product distance families, CD, JacD, and DicD outperform the other tested distances on the Banknote data set with an average accuracy of 100%.
In the Squared chord family, MatD, SCD, and HeD achieved the same overall accuracies on all data sets; this is expected because these distances are very similar.
In the Squared L2 distance measures family, SquD and PSCSD achieved the same overall accuracy on all data sets; this is due to the similarity between these two distances. The distance measures in this family outperform the other distance families on two data sets: ASCSD achieved the highest accuracy on the German data set with an average accuracy of 71%, and ClaD and DivD achieved the highest accuracy on the Vote data set with an average accuracy of 91.87%. Among the Lp Minkowski and Squared L2 distance measures families, ED, SED, and AD achieved the same performance on all data sets; this is due to the similarity between these three distances. Also, these distances and MCED outperform the other tested distances in two data sets, Wave21 and Wave40, with average accuracies of 77.74% and 75.87%, respectively. Among the L1 and Squared L2 families, MCED and LD achieved the highest accuracy on the Glass data set with an average accuracy of 71.11%.
In the Shannon entropy distance measures family, JSD and TopD achieved the same overall accuracies on all data sets; this is due to the similarity between the two distances, as TopD is twice the JSD. KLD outperforms all the tested distances on the Haberman data set with an average accuracy of 73.27%.
The Vicissitude distance measures family outperforms the other distance families on five data sets: VSDF1 achieved the highest accuracy in three data sets, Liver, Parkinson, and Phoneme, with accuracies of 65.81%, 99.97%, and 89.8%, respectively; MSCSD achieved the highest accuracy on the Diabetes data set with an average accuracy of 68.79%; and MiSCSD achieved the highest accuracy on the Sonar data set with an average accuracy of 87.71%.
The other distance measures family outperforms all
other distance families in seven data sets. The
WIAD achieved the highest accuracy on Monkey1
data set with an average accuracy of 94.97%. The
AvgD also achieved the highest accuracy on the
Wholesale data set with an average accuracy of
Table 7. The highest accuracy in each data set
Data set Distance Accuracy
Australian CanD 0.8209
Balance ChoD, SPeaD, CorD, CosD 0.933
Banknote CD, DicD, JacD 1
BCW HasD 0.9624
Cancer HasD 0.9616
Diabetes MSCSD 0.6897
Glass LD, MCED 0.7111
Haberman KLD 0.7327
Heart HamD 0.7714
Ionosphere HasD 0.9025
Liver VSDF1 0.6581
Monkey1 WIAD 0.9497
Parkinson VSDF1 0.9997
Phoneme VSDF1 0.898
QSAR HasD 0.8257
Segmen SoD, SD 0.9676
Sonar MiSCSD 0.8771
Vehicle LD 0.6913
Vote ClaD, DivD 0.9178
Vowel LD 0.9771
Wholesale AvgD 0.8866
Wine CanD 0.985
German ASCSD 0.71
Iris ChoD, SPeaD, CorD, CosD 0.9588
Wave21 ED, AD, MCED, SED 0.7774
EEG ChoD, SPeaD, CorD, CosD 0.9725
Wave40 ED, AD, MCED, SED 0.7587
Letter recognition JacD, DicD 0.9516
88.66%. HasD also achieved the highest accuracy in
four data sets, namely, Cancer, Breast Cancer Wis-
consin (BCW), Ionosphere, and quantitative struc-
ture activity relationships (QSAR) with average
accuracies of 96.16%, 96.24%, 90.25%, and 82.57%,
respectively. Finally, HamD achieved the highest ac-
curacy on the Heart data set with an average accu-
racy of 77.14%. Among the inner product and
other distance measures families, SPeaD, CorD,
ChoD, and CosD outperform other tested distances
in three data sets, namely, Balance, Iris, and EEG, with average accuracies of 94.3%, 95.88%, and 97.25%, respectively.
Table 8 shows the distances that obtained the highest
recall on each data set. Based on these results, we sum-
marize the following observations.
The L1 distance measures family outperforms the other distance families in seven data sets. For example, CanD achieved the highest recalls in two data sets, Australian and Wine, with 81.83% and 73.94% average recalls, respectively. LD achieved the highest recalls on four data sets, Glass, Ionosphere, Vehicle, and Vowel, with 51.15%, 61.52%, 54.85%, and 97.68% average recalls, respectively. SD and SoD achieved the highest recall on the Segmen data set with 84.67% average recall. Among the Lp Minkowski and L1 distance families, MD, NID, and MCD achieved the same performance, as expected, due to their similarity.
In the Inner Product distance measures family, JacD and DicD outperform all other tested distances on the Haberman data set with 38.53% average recall. Among the Lp Minkowski and Inner Product distance measures families, CD, JacD, and DicD also outperform the other tested distances on the Banknote data set with 100% average recall.
In the Squared chord family, MatD, SCD, and HeD achieved the same performance; this is due to the similarity of their equations, as clarified previously.
The Squared L2 distance measures family outperforms the other distance families on three data sets: ClaD and DivD outperform the other tested distances on the Vote data set with 91.03% average recall, PCSD outperforms the other tested distances on the Wholesale data set with 58.16% average recall, and ASCSD outperforms the other tested distances on the German data set with 43.92% average recall. Among the Lp Minkowski and Squared L2 distance measures families, ED, SED, and AD achieved the same performance on all data sets; this is due to the similarity of their equations, as clarified previously. These distances and MCED outperform the other tested distances in two data sets, Wave21 and Wave40, with 77.71% and 75.88% average recalls, respectively.
In Shannon entropy distance measures family,
JSD and TopD achieved similar performance as
expected, due to their similarity.
The Vicissitude distance measures family outper-
forms the other distance families in six data sets.
VSDF1 achieved the highest recall on three data
sets: Liver, Parkinson, and Phoneme data sets
with 43.65%, 99.97%, 88.13% average recalls, re-
spectively. MSCSD achieved the highest recall on
Diabetes data set with 43.71% average recall.
MiSCSD also achieved the highest recall on
Sonar data set with 58.88% average recall.
VSDF2 achieved the highest recall on Letter rec-
ognition data set with 95.14% average recall.
The other distance measures family outperforms
all other distance families in five data sets. Partic-
ularly, HamD achieved the highest recall on the
Heart data set with 51.22% average recall.
WIAD also achieved the highest average recall
on the Monkey1 data set with 94.98% average
Table 8. The highest recall in each data set
Data set Distance Recall
Australian CanD 0.8183
Balance ChoD, SPeaD, CorD, CosD 0.6437
Banknote CD, DicD, JacD 1
BCW HasD 0.3833
Cancer HasD 0.9608
Diabetes MSCSD 0.4371
Glass LD 0.5115
Haberman DicD, JacD 0.3853
Heart HamD 0.5122
Ionosphere LD 0.6152
Liver VSDF1 0.4365
Monkey1 WIAD 0.9498
Parkinson VSDF1 0.9997
Phoneme VSDF1 0.8813
QSAR HasD 0.8041
Segmen SoD, SD 0.8467
Sonar MiSCSD 0.5888
Vehicle LD 0.5485
Vote ClaD, DivD 0.9103
Vowel LD 0.9768
Wholesale PCSD 0.5816
Wine CanD 0.7394
German ASCSD 0.4392
Iris ChoD, SPeaD, CorD, CosD 0.9592
Wave21 ED, AD, MCED, SED 0.7771
EEG ChoD, SPeaD, CorD, CosD 0.9772
Wave40 ED, AD, MCED, SED 0.7588
Letter recognition VSDF2 0.9514
recall. HasD also achieved the highest average recall on three data sets, namely, Cancer, BCW, and QSAR, with 96.08%, 38.33%, and 80.41% average recalls, respectively. Among the inner product and other distance measures families, SPeaD, CorD, ChoD, and CosD outperform the other tested distances in three data sets, namely, Balance, Iris, and EEG, with 64.37%, 95.92%, and 97.72% average recalls, respectively.
Table 9 shows the distances that obtained the highest
precision on each data set. Based on these results, we
summarize the following observations.
The distance measures in the L1 family outperformed the other distance families in five data sets. CanD achieved the highest precision on two data sets, namely, Australian and Wine, with 81.88% and 74.08% average precisions, respectively. SD and SoD achieved the highest precision on the Segmen data set with 84.66% average precision. In addition, LD achieved the highest precision on three data sets, namely, Glass, Vehicle, and Vowel, with 51.15%, 55.37%, and 97.87% average precisions, respectively. Among the Lp Minkowski and L1 distance families, MD, NID, and MCD achieved the same performance on all data sets; this is due to the similarity of their equations, as clarified previously.
The Inner Product family outperforms the other distance families in two data sets. JacD and DicD outperform the other tested measures on the Wholesale data set with 58.53% average precision. Among the Lp Minkowski and Inner Product distance families, CD, JacD, and DicD outperformed the other tested distances on the Banknote data set with 100% average precision.
In the Squared chord family, MatD, SCD, and HeD achieved the same overall precision results on all data sets; this is due to the similarity of their equations, as clarified previously.
In the Squared L2 distance measures family, SquD and PSCSD achieved the same performance; this is due to the similarity of their equations, as clarified previously. The distance measures in this family outperform the other distance families on three data sets: ASCSD achieved the highest average precisions on two data sets, Diabetes and German, with 44.01% and 43.43% average precisions, respectively, and ClaD and DivD achieved the highest precision on the Vote data set with 92.11% average precision. Among the Lp Minkowski and Squared L2 distance measures families, ED, SED, and AD achieved the same performance, as expected, due to their similarity. These distances and MCED outperform the other tested measures in two data sets, Wave21 and Wave40, with 77.75% and 75.9% average precisions, respectively. Also, ED, SED, and AD outperform the other tested measures on the Letter recognition data set with 95.57% average precision.
In the Shannon entropy distance measures family, JSD and TopD achieved the same overall precision on all data sets, due to the similarity of their equations, as clarified earlier.
The Vicissitude distance measures family outper-
forms other distance families on four data sets.
VSDF1 achieved the highest average precisions
on three data sets: Liver, Parkinson, and Phoneme
with 43.24%, 99.97%, and 87.23% average preci-
sions, respectively. MiSCSD also achieved the
highest precision on the Sonar data set with
57.99% average precision.
The other distance measures family outperforms
all the other distance families in six data sets. In
particular, HamD achieved the highest precision
on the Heart data set with 51.12% average
Table 9. The highest precision in each data set
Data set Distance Precision
Australian CanD 0.8188
Balance ChoD, SPeaD, CorD, CosD 0.6895
Banknote CD, DicD, JacD 1
BCW HasD 0.3835
Cancer HasD 0.9562
Diabetes ASCSD 0.4401
Glass LD 0.5115
Haberman SPeaD, CorD, CosD 0.3887
Heart HamD 0.5112
Ionosphere HasD 0.5812
Liver VSDF1 0.4324
Monkey1 WIAD 0.95
Parkinson VSDF1 0.9997
Phoneme VSDF1 0.8723
QSAR HasD 0.8154
Segmen SoD, SD 0.8466
Sonar MiSCSD 0.5799
Vehicle LD 0.5537
Vote ClaD, DivD 0.9211
Vowel LD 0.9787
Wholesale DicD, JacD 0.5853
Wine CanD 0.7408
German ASCSD 0.4343
Iris ChoD, SPeaD, CorD, CosD 0.9585
Wave21 ED, AD, MCED, SED 0.7775
EEG ChoD, SPeaD, CorD, CosD 0.9722
Wave40 ED, AD, MCED, SED 0.759
Letter recognition ED, AD, SED 0.9557
precision. Also, WIAD achieved the highest precision on the Monkey1 data set with 95% average precision. Moreover, HasD yields the highest precision in four data sets, namely, Cancer, BCW, Ionosphere, and QSAR, with 95.62%, 38.35%, 58.12%, and 81.54% average precisions, respectively. Among the inner product and other distance measures families, SPeaD, CorD, ChoD, and CosD outperform the other tested distances in three data sets, namely, Balance, Iris, and EEG, with 68.95%, 95.85%, and 97.22% average precisions, respectively. Also, CosD, SPeaD, and CorD achieved the highest precision on the Haberman data set with 38.87% average precision.
Table 10 gives the top 10 distances with respect to the overall average accuracy, recall, and precision over all data sets. HasD outperforms all other tested distances on all performance measures, followed by LD, CanD, and SCSD. Moreover, taking a closer look at the average as well as the highest accuracies, precisions, and recalls, we find that HasD outperforms all other distance measures on four data sets, namely, Cancer, BCW, Ionosphere, and QSAR; this holds for accuracy, precision, and recall, and HasD is the only distance that won at least four data sets in this noise-free experiment set. Note that the performance within each of the following five groups, (1) MCD, MD, and NID, (2) AD, ED, and SED, (3) TopD and JSD, (4) SquD and PSCSD, and (5) MatD, SCD, and HeD, is identical, owing to the close similarity in how the corresponding distances are defined.
We attribute the success of HasD in this experimental part to its characteristics discussed in the Distance Measures Review section (see the distance equation in 8.13, Fig. 2): each dimension of the tested vectors contributes at most 1 to the final distance, which lowers and neutralizes the effects of outliers in different data sets. To further analyze the performance of HasD compared with the other top distances, we used the Wilcoxon rank-sum test [88]. This is a nonparametric pairwise test that aims to detect significant differences between two sample means, in order to judge whether the null hypothesis holds. The null hypothesis is a hypothesis used in statistics that assumes there is no significant difference between different results or observations. This test was conducted between HasD and each of the other top distances (Table 10) over the tested data sets. Therefore, our null hypothesis is ''there is no significant difference between the performance of HasD and the compared distance over all the data sets used.'' According to the Wilcoxon test, if the p-value is less than the significance level (0.05), we reject the null hypothesis and conclude that there is a significant difference between the tested samples; otherwise we cannot conclude anything about the significance of the difference [89].
The accuracies, recalls, and precisions of HasD over all the data sets used in this experiment set were compared with those of each of the top 10 distance measures, with the corresponding p-values given in Table 11. The p-values that were less than the significance level (0.05) are highlighted in bold. As given in Table 11, the p-values of the accuracy results are less than the significance level (0.05) eight times; in these cases we can reject the null hypothesis and conclude that there is a significant difference between the performance of HasD and that of ED, CanD, CosD, ClaD, SCSD, WIAD, CorD, and DivD. Since the average performance of HasD was better than all of these distance measures in the previous tables, we can conclude that the accuracy yielded by HasD is better than that of most of the distance measures tested. Similar
Table 10. The top 10 distances in terms of average accuracy, recall, and precision-based performance on noise-free data sets
Rank Accuracy distance Average Rank Recall distance Average Rank Precision distance Average
1 HasD 0.8394 1 HasD 0.7018 1 HasD 0.701
2 LD 0.8316 2 LD 0.6964 2 LD 0.6934
3 CanD 0.8282 3 CanD 0.6932 3 CanD 0.6916
4 SCSD 0.8275 4 SCSD 0.693 4 SCSD 0.6909
5 ClaD 0.8227 5 ClaD 0.6892 5 ClaD 0.6871
6 DivD 0.812 6 MD 0.6831 6 MD 0.681
6 WIAD 0.812 7 WIAD 0.6815 7 WIAD 0.6804
7 MD 0.8113 8 AvgD 0.6811 8 DivD 0.6768
8 AvgD 0.8084 9 DivD 0.678 9 DicD 0.6739
9 CosD 0.803 10 DicD 0.6756 10 ED 0.6724
9 CorD 0.803
10 DicD 0.8024
analysis applies to the recall and precision columns when comparing the HasD results with those of the other distances.
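For readers who wish to reproduce this comparison, the sketch below shows how such a pairwise test could be run with SciPy's rank-sum implementation; the per-data-set accuracy arrays are synthetic placeholders, not the values reported in this review.

```python
import numpy as np
from scipy.stats import ranksums

def compare_distances(scores_a, scores_b, alpha=0.05):
    """Wilcoxon rank-sum test between the per-data-set scores of two
    distance measures, following the decision rule described above."""
    stat, p_value = ranksums(scores_a, scores_b)
    if p_value < alpha:
        return p_value, "reject the null hypothesis: significant difference"
    return p_value, "cannot conclude a significant difference"

# Synthetic illustration only (random numbers, not the article's results).
rng = np.random.default_rng(0)
scores_hasd = rng.uniform(0.70, 0.95, size=28)
scores_ed = rng.uniform(0.60, 0.90, size=28)
print(compare_distances(scores_hasd, scores_ed))
```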
With noise
These next experiments aim to identify the impact of noisy data on the performance of the KNN classifier regarding accuracy, recall, and precision using different distance measures. Accordingly, nine different levels of noise were added to each data set using Algorithm 2. For simplicity, this set of experiments was conducted using only the top 10 distances given in Table 10, which were obtained on the noise-free data sets.
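Algorithm 2 is defined earlier in the article; as a rough sketch of what such noise injection can look like, the function below replaces a given fraction of the examples with random attribute values, drawing them uniformly within each feature's observed range (the uniform-in-range choice is our assumption, based on the description of the noise model in the limitations at the end of this review).

```python
import numpy as np

def add_noise(X, noise_level, seed=None):
    """Return a copy of X in which a fraction `noise_level` (e.g., 0.1-0.9)
    of the examples have all their attributes replaced by random values
    drawn uniformly within each feature's observed range (an assumption)."""
    rng = np.random.default_rng(seed)
    X_noisy = np.asarray(X, dtype=float).copy()
    n_examples, n_features = X_noisy.shape
    n_noisy = int(round(noise_level * n_examples))
    idx = rng.choice(n_examples, size=n_noisy, replace=False)
    lo, hi = X_noisy.min(axis=0), X_noisy.max(axis=0)
    X_noisy[idx] = rng.uniform(lo, hi, size=(n_noisy, n_features))
    return X_noisy
```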
Figure 4 shows the experimental results of the KNN classifier, clarifying the impact of noise on the accuracy measure using the top 10 distances. The x-axis represents the noise level and the y-axis the classification accuracy. Each column at each noise level represents the overall average accuracy for each distance on all data sets used. Error bars represent the average of the standard deviation values for each distance over all data sets. Figures 5 and 6 show the corresponding recall and precision results of the KNN classifier under the same noise levels and distance measures. As can be seen from Figures 4–6, the performance (measured by
Table 11. The p-values of the Wilcoxon test for the results
of Hassanat distance with each of other top distances
over the data sets used
Distance Accuracy Recall Precision
ED 0.0418 0.0582 0.0446
MD 0.1469 0.1492 0.1446
CanD 0.0017 0.008 0.0034
LD 0.0594 0.121 0.0427
CosD 0.0048 0.0066 0.0064
DicD 0.0901 0.0934 0.0778
ClaD 0.0089 0.0197 0.0129
SCSD 0.0329 0.0735 0.0708
WIAD 0.0183 0.0281 0.0207
CorD 0.0048 0.0066 0.0064
AvgD 0.1357 0.1314 0.102
DivD 0.0084 0.0188 0.017
The p-values that were less than the significance level (0.05) are high-
lighted in boldface.
FIG. 4. The overall average accuracies and standard deviations of the KNN classifier using the top 10 distance measures with different levels of noise. AvgD, Average (L1, L∞) distance; CanD, Canberra distance; ClaD, Clark distance; CorD, Correlation distance; CosD, Cosine distance; DicD, Dice distance; DivD, Divergence distance; LD, Lorentzian distance; MD, Manhattan distance; SCSD, Squared Chi-Squared distance; WIAD, Whittaker's index of association distance. Color images are available online.
FIG. 5. The overall average recalls and standard deviations of KNN classifier using top 10 distance measures
with different levels of noise. Color images are available online.
FIG. 6. The overall average precisions and standard deviations of KNN classifier using top 10 distance
measures with different levels of noise. Color images are available online.
accuracy, recall, and precision, respectively) of the KNN degraded by only ~20% while the noise level reached 90%, and this holds for all the distances used. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, some distances are less affected by the added noise than others. Therefore, we ordered the distances according to their overall average accuracy, recall, and precision results for each level of noise. The distance with the highest performance is ranked in the first position, whereas the distance with the lowest performance is ranked in the last position. Tables 12–14 give this ranking in terms of accuracy, precision, and recall under each noise level, from low (10%) to high (90%). The empty cells occur when more than one distance shares the same rank. The following points summarize the observations in terms of accuracy, precision, and recall values.
According to the average precision results, the highest precision was obtained by HasD, which achieved the first rank at the majority of noise levels: it holds the first rank from the 10% up to the 70% noise level. However, at the 80% level LD outperformed HasD, and at the 90% level MD outperformed HasD.
LD achieved the second rank at noise levels 10%, 30%, 40%, 50%, and 70%, and CanD achieved the second rank at noise levels 20% and 60%. Moreover,
Table 12. Ranking of distances in descending order based on the accuracy results at each noise level
Rank 10% 20% 30% 40% 50% 60% 70% 80% 90%
1 HasD HasD HasD HasD HasD HasD HasD HasD HasD
2 LD CanD LD LD LD CanD CanD LD LD
3 CanD SCSD CanD CanD CanD ClaD LD CanD MD
SCSD
4 ClaD ClaD SCSD SCSD SCSD LD SCSD ClaD CanD
5 DivD LD ClaD ClaD ClaD SCSD ClaD MD AvgD
6 WIAD DivD DivD MD MD DivD CosD SCSD SCSD
CorD
7 MD WIAD MD DivD DivD MD MD AvgD CorD
CosD
CorD
8 CD MD AvgD AvgD AvgD AvgD AvgD CorD CosD
9 CorD AvgD DicD CosD WIAD ED DivD CosD DicD
CorD
10 AvgD DicD ED WIAD CoD DicD DicD DivD WIAD
WIAD
ED
11 DicD ED WIAD DicD CorD CorD WIAD DicD ED
CosD
12 ED CosD — ED DicD WIAD ED — ClaD
13 — CorD — ED — DivD
Table 13. Ranking of distances in descending order based on the recall results at each noise level
Rank 10% 20% 30% 40% 50% 60% 70% 80% 90%
1 HasD HasD HasD HasD HasD HasD HasD LD, HasD MD
2 LD CanD LD LD LD SCSD LD CanD ED
3 SCSD ClaD CanD CanD CanD CanD CanD MD HasD
4 CanD, SCSD SCSD SCSD SCSD ClaD SCSD AvgD LD
5 ClaD LD ClaD ClaD ClaD MD ClaD SCSD CanD
6 MD DivD MD MD MD LD MD ClaD AvgD
7 DivD MD AvgD AvgD AvgD ED AvgD ED WIAD
8 WIOD WIOD DivD DivD DivD DivD DicD DicD SCSD
9 AvgD AvgD ED WIAD DicD AvgD ED WIAD ClaD
10 CosD DicD DicD ED ED WIAD CosD DivD DicD
CorD
DivD
11 DicD ED CosD DicD WIAD DicD WIAD CosD DivD
12 ED CosD, CorD CosD CoD CosD CorD CorD
13 CorD CorD WIOD CorD CorD CorD CosD
this distance achieved the third rank at the remaining noise levels except 50% and 90%. SCSD achieved the fourth rank at noise levels 10%, 30%, 40%, and 70%, and the third rank at the 50% noise level; it tied with LD at the 20% noise level. ClaD achieved the third rank at noise levels 20% and 60%.
The rest of the distances occupied the middle and last ranks in different orders at each level of noise. CosD tied with WIAD at the 80% level and with CorD at the 30% and 70% levels; CosD and CorD performed the worst (lowest precision) at most noise levels.
Based on the results given in Tables 12–14, we observe that the ranking of distances in terms of accuracy, recall, and precision without noise differs from their ranking once the first level of noise (10%) is added, and it varies further as the level of noise increases. This means that the distances are affected by noise. The crucial question, however, is which distance is least affected by noise. From the given results, we conclude that HasD is the least affected one, followed by LD, CanD, and SCSD. The good performance of the KNN achieved by these distances might be attributed to their good characteristics. Table 15 gives these characteristics for the top 10 distances in our analysis. All of these top 10 distances are symmetric, and we further provide input and output ranges and the number of operations.
Precise evaluation of the effects of noise
To justify why some distances are affected more or less by noise, the following toy Examples 1 and 2 are designed. They illustrate the effect of noise on the final decision of the KNN classifier using HasD and the standard ED. In both examples, we assume that we have two training vectors (V1 and V2), each with four attributes, in addition to one test vector (V3). As usual, we calculate the distances between V3 and both V1 and V2 using both ED and HasD.
Example 1. This example shows the KNN classifica-
tion using two different distances on clean data (without
noise). We find the distance to test vector (v3) according
to ED and HasD.
      X1   X2   X3   X4   Class   Dist(·, V3): ED   Dist(·, V3): HasD
V1     3    4    5    3     2            2                0.87
V2     1    3    4    2     1            1                0.33
V3     2    3    4    2     ?
Table 15. Some characteristics of the top 10 distances (n is the number of features)

Distance   Input range   Output range   Symmetric   Metric   No. of operations
HasD       (−∞, +∞)      [0, n]         Yes         Yes      6n (positives), 9n (negatives)
ED         (−∞, +∞)      [0, +∞)        Yes         Yes      1 + 3n
MD         (−∞, +∞)      [0, +∞)        Yes         Yes      3n
CanD       (−∞, +∞)      [0, n]         Yes         Yes      7n
LD         (−∞, +∞)      [0, +∞)        Yes         Yes      5n
CosD       [0, +∞)       [0, 1]         Yes         No       5 + 6n
DicD       [0, +∞)       [0, 1]         Yes         No       4 + 6n
ClaD       [0, +∞)       [0, n]         Yes         Yes      1 + 6n
SCSD       (−∞, +∞)      [0, +∞)        Yes         Yes      6n
WIAD       (−∞, +∞)      [0, 1]         Yes         Yes      1 + 7n
CorD       [0, +∞)       [0, 1]         Yes         No       8 + 10n
AvgD       (−∞, +∞)      [0, +∞)        Yes         Yes      2 + 6n
DivD       (−∞, +∞)      [0, +∞)        Yes         Yes      1 + 6n
Table 14. Ranking of distances in descending order based on the precision results at each noise level
Rank 10% 20% 30% 40% 50% 60% 70% 80% 90%
1 HasD HasD HasD HasD HasD HasD HasD LD MD
2 LD CanD LD LD LD CanD LD HasD HasD
3 CanD ClaD CanD CanD SCSD ClaD CanD CanD ED
4 SCSD LD SCSD SCSD CanD LD SCSD MD LD
SCSD
5 ClaD DivD ClaD ClaD ClaD SCSD ClaD SCSD CanD
6 MD MD MD MD MD MD AvgD ClaD AvgD
WIAD
7 AvgD AvgD AvgD AvgD AvgD DivD MD AvgD WIAD
8 WIAD DicD DivD WIAD ED AvgD CosD ED SCSD
CorD
9 DivD ED ED DivD DivD ED DicD CosD DicD
WIAD
10 DicD CosD DicD ED DicD DicD ED CorD ClaD
11 ED CorD CosD DicD WIAD WIAD DivD DicD DivD
CorD
12 CosD WIAD CosD CorD CorD WIAD DivD CorD
13 CorD CorD CosD CosD CosD
As shown, assuming that we use k=1 (the 1-NN approach), the test vector is assigned to class 1 by both distances. Both results are reasonable, because V3 is almost identical to V2 (class 1) except for the first feature, which differs only by 1.
Example 2. This example uses the same feature vectors as in Example 1, but after corrupting one of the features with added noise. That is, we repeat the previous calculations using noisy data instead of clean data; the first attribute of V2 is corrupted by an added noise of +4 (i.e., X1 = 5).
      X1   X2   X3   X4   Class   Dist(·, V3): ED   Dist(·, V3): HasD
V1     3    4    5    3     2            2                0.87
V2     5    3    4    2     1            3                0.5
V3     2    3    4    2     ?
Based on the minimum distance approach, using the Euclidean distance the test vector is now assigned to class 2 instead of 1, whereas it is still assigned to class 1 using HasD; HasD thus remains accurate in the presence of noise. Although simple, these examples show that ED was affected by noise, which in turn affected the KNN classification ability. Although the performance of the KNN classifier decreases as the noise increases (as shown by the extensive experiments with various data sets), we find that some distances are less affected by noise than others. For example, when using ED, any change in any attribute contributes heavily to the final distance; even if two vectors are similar except for one noisy feature, the distance becomes unpredictable. In contrast, with HasD the per-feature distance between corresponding attribute values is bounded in the range [0, 1]; thus, regardless of the value of the added noise, each feature contributes at most 1 to the final distance, not an amount proportional to the added noise. Therefore, the impact of noise on the final classification is mitigated.
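The two toy examples can be checked numerically with the short sketch below. The Hassanat distance is implemented here from its published definition (each per-feature term is 1 − (1 + min)/(1 + max) when both values are nonnegative, with both values shifted by |min| otherwise), which should be read against the exact equation given in the Distance Measures Review section; the printed values match the tables of Examples 1 and 2.

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def hassanat(x, y):
    """Hassanat distance: each feature contributes at most 1 to the total,
    which bounds the influence of any single (possibly noisy) attribute."""
    d = 0.0
    for a, b in zip(x, y):
        lo, hi = min(a, b), max(a, b)
        if lo >= 0:
            d += 1 - (1 + lo) / (1 + hi)
        else:  # shift both values so the smaller one becomes zero
            d += 1 - (1 + lo + abs(lo)) / (1 + hi + abs(lo))
    return d

v1, v2_clean, v2_noisy, v3 = [3, 4, 5, 3], [1, 3, 4, 2], [5, 3, 4, 2], [2, 3, 4, 2]

# Example 1 (clean data): both distances pick V2 (class 1) as nearest to V3.
print(euclidean(v1, v3), euclidean(v2_clean, v3))   # 2.0, 1.0
print(hassanat(v1, v3), hassanat(v2_clean, v3))     # ~0.87, ~0.33

# Example 2 (X1 of V2 corrupted to 5): ED now prefers V1 (class 2),
# whereas HasD still assigns V3 to class 1 via V2.
print(euclidean(v1, v3), euclidean(v2_noisy, v3))   # 2.0, 3.0
print(hassanat(v1, v3), hassanat(v2_noisy, v3))     # ~0.87, 0.5
```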
It is worth mentioning that the aforementioned experiments used the KNN classifier with K equal to 1, also known as the nearest neighbor classifier. In fact, the choice of distance metric might affect the optimal K as well. A K=1 choice is more sensitive to noise than larger K values, because an unrelated noisy example might be the nearest to a test example. Therefore, a valid action with noisy data would be to choose a larger K, and it would be of interest to see which distance measure handles this aspect best. We remark that choosing the optimal K is out of the scope of this review, and we refer to Hassanat [17] and Alkasassbeh et al. [25]. However, we have repeated the experiments on all data sets using K=3 and K=√n, as done by Hassanat [17], where n is the number of examples in the training data set, with 50% noise. This was done using the top 10 distances. As given in Table 16, the average accuracy of most of the top 10 distances slightly improved compared with the results shown in Figure 4 with noisy data and in Table 10 without noise; this is due to the larger number of neighbors (K) used. There are, however, some exceptions, including WIAD, which seems to be negatively affected by increasing the number of neighbors (K).
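A minimal sketch of the K = √n rule referenced above is given below, assuming n is the number of training examples; rounding K up to an odd value is our own convention to avoid ties in binary problems and is not prescribed by the article.

```python
import math
from sklearn.neighbors import KNeighborsClassifier

def knn_sqrt_n(X_train, y_train, metric="manhattan"):
    """Fit a KNN classifier with K set to the square root of the number
    of training examples (rounded to an odd integer, an assumption here)."""
    k = int(round(math.sqrt(len(X_train))))
    if k % 2 == 0:
        k += 1  # avoid ties in two-class problems
    model = KNeighborsClassifier(n_neighbors=k, metric=metric)
    return model.fit(X_train, y_train)
```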
In the previous experiments, the noisy data were used with the top 10 distances only. However, it would be interesting to see whether any of the other measures handles noise better than these particular top 10 measures. Table 17 gives the average accuracy of all distances over the first 14 data sets, using K=1 with and without noise. As given in Table 17, some of the top 10 distances still rank the highest even in the presence of noise when compared with all other distances; these include HasD, LD, DivD, CanD, ClaD, and SCSD. Interestingly, however, some of the other distances (which ranked low on data without noise) have shown less vulnerability to noise; these include SquD, PSCSD, MSCSD, SCD, MatD, and HeD. According to the extensive experiments conducted for the purposes of this review, and regardless of the type of experiment, the nonconvex distance HasD is in general the best distance to be used with the KNN classifier, with other distances such as LD, DivD, CanD, ClaD, and SCSD performing close to the best.
Table 16. The average accuracy of the top 10 distances over all data sets, using K=3 and K=√n (n is the number of training examples), with and without noise

Distance   With 50% noise: K=3   With 50% noise: K=√n   Without noise: K=3   Without noise: K=√n
HasD 0.6302 0.6314 0.8497 0.833721
LD 0.6283 0.6237 0.8427 0.827561
CanD 0.6247 0.6231 0.8384 0.825171
ClaD 0.6171 0.6136 0.8283 0.8098
SCSD 0.6146 0.6087 0.8332 0.8045
MD 0.6124 0.6089 0.8152 0.79705
DivD 0.6114 0.6049 0.8183 0.792768
CosD 0.6086 0.6021 0.8085 0.791625
CorD 0.6085 0.6020 0.8085 0.786696
AvgD 0.6079 0.6081 0.8119 0.792768
ED 0.6016 0.6011 0.8031 0.781639
DicD 0.5998 0.6016 0.8021 0.778054
WIAD 0.5680 0.5989 0.7935 0.788361
Conclusions and Future Perspectives
In this review, the performance (accuracy, precision,
and recall) of the KNN classifier has been evaluated
using a large number of distance measures, on clean
and noisy data sets, attempting to find the most appro-
priate distance measure that can be used with the
KNN in general. In addition, we tried finding the
most appropriate and robust distance that can be
used in the case of noisy data. A large number of experiments were conducted for the purposes of this review, and the results and analysis of these experiments show the following:
1. The performance of the KNN classifier depends significantly on the distance used, and the results showed large gaps between the performances of different distances. For example, we found that HasD performed the best when applied to most data sets compared with the other tested distances.
2. We obtain similar classification results when we use distances from the same family that have almost the same equation; some distances are very similar, for example, one is twice the other, or one is the square of another. In these cases, and since the KNN compares examples using the same distance, the nearest neighbors will be the same if all distances are multiplied or divided by the same constant.
3. There was no optimal distance metric that can be used for all types of data sets, as the results show that each data set favors a specific distance metric; this result complies with the no-free-lunch theorem.
4. The performance (measured by accuracy, precision, and recall) of the KNN degraded by only ~20% while the noise level reached 90%, and this holds for all the distances used. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree.
5. Some distances are less affected by the added noise than others; for example, we found that HasD performed the best when applied to most data sets under different levels of heavy noise.
Our study has the following limitations, and future work will focus on investigating and addressing them.
1. Although we have tested a large number of dis-
tance measures, there are still other distances
and similarity measures that are available in the
machine learning area that need to be tested
and evaluated for optimal performance with
and without noise.
2. The 28 data sets, although more than in previous studies, still might not be enough to draw significant conclusions in terms of the effectiveness of
Table 17. The average accuracy of all distances over the first
14 data sets, using K=1 with and without noise
Distance With 50% noise Without noise
HasD 0.6331 0.8108
LD 0.6302 0.7975
DivD 0.6284 0.8068
CanD 0.6271 0.8053
ClaD 0.6245 0.7999
SquD 0.6227 0.7971
PSCSD 0.6227 0.7971
SCSD 0.6227 0.7971
MSCSD 0.6223 0.8004
SCD 0.6208 0.7989
MatD 0.6208 0.7989
HeD 0.6208 0.7989
VSDF3 0.6179 0.7891
WIAD 0.6144 0.7877
CorD 0.6119 0.7635
SPeaD 0.6119 0.7635
CosD 0.6118 0.7636
NCSD 0.6108 0.7674
ChoD 0.6102 0.755
JefD 0.6071 0.7772
PCSD 0.6043 0.7813
KJD 0.6038 0.7465
PeaD 0.6034 0.7066
VSDF2 0.6032 0.7473
MD 0.6029 0.7565
NID 0.6029 0.7565
MCD 0.6029 0.7565
SD 0.6024 0.7557
SoD 0.6024 0.7557
MotD 0.6024 0.7557
ASCSD 0.6016 0.7458
VSDF1 0.6012 0.7427
AvgD 0.5995 0.7523
TanD 0.5986 0.7421
JDD 0.5955 0.741
MiSCSD 0.595 0.7596
VWHD 0.5945 0.7428
JacD 0.5936 0.746
DicD 0.5936 0.746
ED 0.5927 0.7429
SED 0.5927 0.7429
AD 0.5927 0.7429
KD 0.5911 0.7154
MCED 0.5889 0.7416
HamD 0.5843 0.7048
CD 0.5742 0.7154
HauD 0.5474 0.621
KDD 0.5295 0.5357
TopD 0.5277 0.6768
JSD 0.5276 0.6768
CSSD 0.5273 0.4895
KLD 0.5089 0.5399
MeeD 0.4958 0.4324
BD 0.4747 0.4908
certain distance measures and, therefore, there is
a need to use a larger number of data sets with
varied data types.
3. The creation of noisy data was done by replacing a certain percentage (in the range 10%–90%) of the examples with completely random values in the attributes. We used this type of noise for its simplicity and the straightforwardness of interpreting the effects of distance measure choice with the KNN classifier. However, this type of noise might not simulate other types of noise that occur in real-world data. It would be an interesting task to try other, more realistic noise types to evaluate the robustness of the distance measures in a similar manner.
4. Only the KNN classifier was reviewed in this study; other variants of KNN, such as those of Hassanat [90–93], need to be investigated.
5. Distance measures are used not only with the KNN but also with other machine learning algorithms, such as different types of clustering; these need to be evaluated under different distance measures as well.
Author Disclosure Statement
No competing financial interests exist.
References
1. Fix E, Hodges Jr JL. Discriminatory analysis-nonparametric discrimination:
Consistency properties. Technical report, University of California, Ber-
keley, 1951.
2. Cover TM, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf
Theory. 1967;13:21–27.
3. Wu X, Kumar V, Quinlan RJ, et al. Top 10 algorithms in data mining. Knowl
Inf Syst. 2008;14:1–37.
4. Bhatia N, Vandana A. Survey of nearest neighbor techniques. Int J Com-
put Sci Inf Secur. 2010;8:302–305.
5. Xu S, Wu Y. An algorithm for remote sensing image classification based
on artificial immune b-cell network. ISPRS Arch. 2008;37:107–112.
6. Manne S, Kotha SK, Fatima SS. Text categorization with K-nearest
neighbor approach. In: Proceedings of the International Conference on
Information Systems Design and Intelligent Applications 2012 (INDIA
2012) held in Visakhapatnam, India, Springer, January 2012. pp. 413–
420.
7. Geng X, Liu T-Y, Qin T, et al. Query dependent ranking using K-nearest
neighbor. In: Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval,
ACM, 2008. pp. 115–122.
8. Bajramovic F, Mattern F, Butko N, et al. A comparison of nearest neighbor
search algorithms for generic object recognition. In: Proceedings of the
International Conference on Advanced Concepts for Intelligent Vision
Systems, Springer, 2006. pp. 1186–1197.
9. Yang Y, Ault T, Pierce T, et al. Improving text categorization methods for
event tracking. In: Proceedings of the 23rd Annual International ACM
SIGIR Conference on Research and Development in Information
Retrieval, ACM, 2000. pp. 65–72.
10. Kataria A, Singh MD. A review of data classification using k-nearest
neighbour algorithm. Int J Emerg Technol Adv Eng. 2013;3:354–360.
11. Wettschereck D, Aha DW, Mohri T. A review and empirical evaluation of
feature weighting methods for a class of lazy learning algorithms. Artif
Intell Rev. 1997;11:273–314.
12. Maillo J, Triguero I, Herrera F. A mapreduce-based K-nearest neighbor
approach for big data classification. In: 2015 IEEE Trustcom/BigDataSE/
ISPA, Volume 2, IEEE, 2015. pp. 167–172.
13. Maillo J, Ramírez S, Triguero I, et al. KNN-IS: An iterative spark-based design of the K-nearest neighbors classifier for big data. Knowl Based Syst. 2017;117:3–15.
14. Deng Z, Zhu X, Cheng D, et al. Efficient KNN classification algorithm for
big data. Neurocomputing. 2016;195:143–148.
15. Gallego A-J, Calvo-Zaragoza J, Valero-Mas JJ, et al. Clustering-based
K-nearest neighbor classification for large-scale data with neural codes
representation. Pattern Recogn. 2018;74:531–543.
16. Wang F, Wang Q, Nie F, et al. Efficient tree classifiers for large scale
datasets. Neurocomputing. 2018;284:70–79.
17. Hassanat AB. Solving the problem of the k parameter in the KNN classifier using an ensemble learning approach. Int J Comput Sci Inf Secur. 2014;12:33–39.
18. Chomboon K, Chujai P, Teerarassamee P, et al. An empirical study of distance
metrics for K-nearest neighbor algorithm. In: Proceedings of the 3rd Inter-
national Conference on Industrial Application Engineering. Kitakyushu,
Japan: The Institute of Industrial Applications Engineers, 2005. pp. 1–6.
19. Mulak P, Talhar N. Analysis of distance measures using K-nearest neigh-
bor algorithm on KDD dataset. Int J Sci Res. 2015;7:2101–2104.
20. Tavallaee M, Bagheri E, Lu W, et al. A detailed analysis of the KDD cup 99
data set. In: 2009 IEEE Symposium on Computational Intelligence for
Security and Defense Applications, IEEE, 2009. pp. 1–6.
21. Hu L-Y, Huang M-W, Ke S-W, et al. The distance function effect on K-nearest neighbor classification for medical datasets. SpringerPlus. 2016;5:1304.
22. Todeschini R, Ballabio D, Consonni V. Distances and other dissimilarity
measures in chemometrics. In: Meyers RA (ed). Encyclopedia of Analytical
Chemistry. Hoboken, NJ: Wiley Online Library, 2015.
23. Todeschini R, Ballabio D, Consonni V, et al. A new concept of higher-order
similarity and the role of distance/similarity measures in local classifi-
cation methods. Chemom Intell Lab Syst. 2016;157:50–57.
24. Lopes N, Ribeiro B. On the impact of distance metrics in instance-based
learning algorithms. In: Iberian Conference on Pattern Recognition and
Image Analysis, Springer, 2015. pp. 48–56.
25. Alkasassbeh M, Altarawneh GA, Hassanat A. On enhancing the perfor-
mance of nearest neighbour classifiers using Hassanat distance metric.
Can J Pure Appl Sci. 2015;9:3291–3298.
26. Hassanat AB. Dimensionality invariant similarity measure. J Am Sci. 2014;
10:221–226.
27. Jirina M. Using singularity exponent in distance based classifier. In: 2010
10th International Conference on Intelligent Systems Design and
Applications, IEEE, 2010. pp. 220–224.
28. Lindi GA. Development of face recognition system for use on the NAO
robot. Master’s thesis, Faculty of Science and Technology, 2016.
29. Jiřina M, Jiřina Jr M. Classifier based on inverted indexes of neighbors. Technical Report No. V-1034, 2008.
30. Abbadi MA, Altarawneh GA, Hassanat AB, et al. Solving the problem of
the k parameter in the KNN classifier using an ensemble learning ap-
proach. Int J Comput Sci Inf Secur. 2014;12:33–39.
31. Hart P. The condensed nearest neighbor rule (CORRESP.). IEEE Trans Inf
Theory. 1968;14:515–516.
32. Gates G. The reduced nearest neighbor rule (CORRESP.). IEEE Trans Inf
Theory. 1972;18:431–433.
33. Alpaydin E. Voting over multiple condensed nearest neighbors. In: Lazy
Learning, Springer, 1997. pp. 115–132.
34. Kubat M, Cooperson, Jr M. Voting nearest-neighbour subclassifiers.
In: The 17th International Conference on Machine Learning (ICML),
San Francisco, CA: Morgan Kaufmann, 2000. pp. 503–510.
35. Wilson DR, Martinez TR. Reduction techniques for instance-based learn-
ing algorithms. Mach Learn. 2000;38:257–286.
36. Arya S, Mount DM. Approximate nearest neighbor queries in fixed di-
mensions. SODA. 1993;93:271–280.
37. Zheng Y, Guo Q, Tung AKH, et al. Lazylsh: Approximate nearest neighbor
search for multiple distance functions with a single index. In: Pro-
ceedings of the 2016 International Conference on Management of
Data, ACM, 2016. pp. 2023–2037.
38. Nettleton DF, Orriols-Puig A, Fornells A. A study of the effect of different
types of noise on the precision of supervised learning techniques. Artif
Intell Rev. 2010;33:275–306.
39. Zhu X, Wu X. Class noise vs. attribute noise: A quantitative study. Artif
Intell Rev. 2004;22:177–210.
26 ABU ALFEILAT ET AL.
Downloaded by 45.52.5.3 from www.liebertpub.com at 08/15/19. For personal use only.
40. García S, Luengo J, Herrera F. Data preprocessing in data mining. Cham, Switzerland: Springer International Publishing, 2015.
41. Sáez JA, Galar M, Luengo J, et al. Tackling the problem of classification with noisy data using multiple classifier systems: Analysis of the performance and robustness. Inf Sci. 2013;247:1–20.
42. Heath TL. The thirteen books of Euclid’s Elements. North Chelmsford,
MA: Courier Corporation, 1956.
43. Cha S-H. Comprehensive survey on distance/similarity measures between
probability density functions. City 2007;1:1.
44. Deza MM, Deza E. Encyclopedia of distances. In: Encyclopedia of dis-
tances, Springer, 2009. pp. 1–583.
45. Grabusts P. The choice of metrics for clustering algorithms. In:
Proceedings of the 8th International Scientific and Practical Conference,
Volume 2. Tomsk, Russia: Tomsk Polytechnic University, 2011.
pp. 70–76.
46. Premaratne P. Human computer interaction using hand gestures.
Singapore: Springer Science and Business Media, 2014.
47. Verma JP. Data analysis in management with SPSS software. New Delhi,
India: Springer Science and Business Media, 2012.
48. Lance GN, Williams WT. Computer programs for hierarchical polythetic
classification (‘‘similarity analyses’’). Comput J. 1966;9:60–64.
49. Lance GN, Williams WT. Mixed-data classificatory programs I—agglom-
erative systems. Aust Comput J. 1967;1:15–20.
50. Akila A, Chandra E. Slope finder—A distance measure for DTW based iso-
lated word speech recognition. Int J Eng Comput Sci. 2013;2:3411–3417.
51. Sorensen TA. A method of establishing groups of equal amplitude in plant
sociology based on similarity of species content and its application to
analyses of the vegetation on danish commons. Biol Skar. 1948;5:1–34.
52. Szmidt E. Distances and similarities in intuitionistic fuzzy sets (Studies in
Fuzziness and Soft Computing book series). Cham, Switzerland: Springer,
2014.
53. Ngom A, Chetty M, Ahmad S. Pattern recognition in bioinformatics. Berlin, Heidelberg: Springer, 2008.
54. Zhou T, Chan KCC, Wang Z. Topevm: Using co-occurrence and topology
patterns of enzymes in metabolic networks to construct phylogenetic
trees. In: IAPR International Conference on Pattern Recognition in Bio-
informatics, Springer, 2008. pp. 225–236.
55. Willett P, Barnard JM, Downs GM. Chemical similarity searching. J Chem
Inf Comput Sci. 1998;38:983–996.
56. Jaccard P. Comparative study of the floral distribution in a portion of the
Alps and Jura [in French] Bull Soc Vaudoise Sci Nat. 1901;37:547–579.
57. Cesare S, Xiang Y. Software similarity and classification. London, England:
Springer Science and Business Media, 2012.
58. Dice LR. Measures of the amount of ecologic association between spe-
cies. Ecology. 1945;26:297–302.
59. Gan G, Ma C, Wu J. Data clustering: Theory, algorithms, and applications
(ASA-SIAM Series on Statistics and Applied Probability), volume 20.
Philadelphia, PA: Society for Industrial and Applied Mathematics, 2007.
60. Orloci L. An agglomerative method for classification of plant communi-
ties. J Ecol. 1967;55:193–206.
61. Legendre P, Legendre LFJ. Numerical ecology, volume 24. Amsterdam,
Netherlands: Elsevier, 2012.
62. Shirkhorshidi AS, Aghabozorgi S, Wah TY. A comparison study on simi-
larity and dissimilarity measures in clustering continuous data. PLoS
One. 2015;10:e0144059.
63. Bhattacharyya A. On a measure of divergence between two statistical
populations defined by their probability distributions. Bull Calcutta
Math Soc. 1943;35:99–109.
64. Abbad A, Tairi H. Combining Jaccard and Mahalanobis cosine distance
to enhance the face recognition rate. Int J Eng Comput Sci. 2016;16:
171–178.
65. Hellinger E. New justification of the theory of quadratic forms of infinite
variables [in German]. J Reine Angew Math. 1909;136:210–271.
66. Clark PJ. An extension of the coefficient of divergence for use with mul-
tiple characters. Copeia. 1952;1952:61–64.
67. Neyman J. Contributions to the theory of the χ² test. In: Proceedings of the First Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1949.
68. Pearson K. X. on the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it
can be reasonably supposed to have arisen from random sampling.
London Edin Dubl Phil Mag J Sci. 1900;50:157–175.
69. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE
Mobile Comput Commun Rev. 2001;5:3–55.
70. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat.
1951;22:79–86.
71. Pinto D, Benedí J-M, Rosso P. Clustering narrow-domain short texts by using the Kullback-Leibler distance. In: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2007. pp. 611–622.
72. Jeffreys H. An invariant form for the prior probability in estimation
problems. Proc R Soc London Ser A Math Phys Sci. 1946;186:
453–461.
73. Topsoe F. Some inequalities for information divergence and related
measures of discrimination. IEEE Trans Inf Theory. 2000;46:1602–
1609.
74. Sibson R. Information radius [in German]. Prob Theory Rel. 1969;14:149–160.
75. Hatzigiorgaki M, Skodras AN. Compressed domain image retrieval: A
comparative study of similarity metrics. In: Visual Communications and
Image Processing 2003, volume 5150, International Society for Optics
and Photonics, 2003. pp. 439–448.
76. Patel BV, Meshram BB. Content based video retrieval systems. arXiv
preprint arXiv:1205.1641, 2012.
77. Giusti R, Batista GEAPA. An empirical comparison of dissimilarity mea-
sures for time series classification. In: 2013 Brazilian Conference on
Intelligent Systems, IEEE, 2013. pp. 82–88.
78. Macklem M. Multidimensional modelling of image fidelity measures.
Centre County, PA: Citeseer, 2002.
79. Bharkad SD, Kokare M. Performance evaluation of distance metrics:
Application to fingerprint recognition. Int J Pattern Recogn. 2011;25:
777–806.
80. Hedges TS. An empirical modification to linear wave theory. Proc Inst Civ
Eng. 1976;61:575–579.
81. Taneja IJ. New developments in generalized information measures. Adv
Imag Elect Phys. 1995;91:37–135.
82. Fulekar MH. Bioinformatics: Applications in life and environmental sci-
ences. Dordrecht, Netherlands: Springer Science and Business Media,
2009.
83. Hamming RW. Error detecting and error correcting codes. Bell Syst Tech J.
1950;29:147–160.
84. Kadir A, Nugroho LE, Susanto A, et al. Experiments of distance measure-
ments in a foliage plant retrieval system. Int J Signal Process Image
Process Pattern Recogn. 2012;5:1–14.
85. Rubner Y, Tomasi C. Perceptual metrics for image database navigation,
volume 594. New York, NY: Springer Science and Business Media, 2013.
86. Whittaker RH. A study of summer foliage insect communities in the great
smoky mountains. Ecol Monogr. 1952;22:1–44.
87. UC Irvine. 2016. UC Irvine machine learning repository. Available online at
http://archive.ics.uci.edu/ml (last accessed July 10, 2019).
88. Wilcoxon F. Individual comparisons by ranking methods. In: Kotz S,
Johnson NL (eds): Breakthroughs in statistics. New York, NY: Springer,
1992. pp. 196–202.
89. Derrac J, García S, Molina D, et al. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol Comput. 2011;1:3–18.
90. Hassanat ABA. Two-point-based binary search trees for accelerating big
data classification using KNN. PLoS One. 2018;13:e0207772.
91. Hassanat A. Furthest-pair-based decision trees: Experimental results on
big data classification. Information. 2018;9:284.
92. Hassanat ABA. Furthest-pair-based binary search tree for speeding
big data classification using K-nearest neighbors. Big Data. 2018;6:
225–235.
93. Hassanat A. Norm-based binary search trees for speeding up KNN big
data classification. Computers. 2018;7:54.
Cite this article as: Abu Alfeilat HA, Hassanat ABA, Lasassmeh O,
Tarawneh AS, Alhasanat MB, Eyal Salman HS, Prasath VBS (2019)
Effects of distance measure choice on K-nearest neighbor classifier
performance: A review. Big Data 3:X, 1–28, DOI: 10.1089/
big.2018.0175.
Abbreviations Used
1-NN = 1-nearest neighbor
AD = Average distance
ASCSD = Additive Symmetric χ² distance
AvgD = Average (L1, L∞) distance
BCW = Breast Cancer Wisconsin
BD = Bhattacharyya distance
CanD = Canberra distance
CD = Chebyshev distance
ChoD = Chord distance
ClaD = Clark distance
CorD = Correlation distance
CosD = Cosine distance
CSSD = χ² statistic distance
DicD = Dice distance
DivD = Divergence distance
ED = Euclidean distance
FN = false negative
FP = false positive
HamD = Hamming distance
HasD = Hassanat distance
HauD = Hausdorff distance
HeD = Hellinger distance
JacD = Jaccard distance
JDD = Jensen difference distance
JefD = Jeffreys distance
JSD = Jensen–Shannon distance
KD = Kulczynski distance
KDD = K divergence distance
KJD = Kumar–Johnson distance
KLD = Kullback–Leibler distance
KNN = K-nearest neighbor
LD = Lorentzian distance
MatD = Matusita distance
MCD = Mean Character distance
MCED = Mean Censored Euclidean distance
MD = Manhattan distance
MeeD = Meehl distance
MiSCSD = Min Symmetric χ² distance
MotD = Motyka distance
MSCSD = Max Symmetric χ² distance
NCSD = Neyman χ² distance
NID = Non Intersection distance
PCSD = Pearson χ² distance
PeaD = Pearson distance
PSCSD = Probabilistic Symmetric χ² distance
QSAR = quantitative structure activity relationships
SCD = Squared chord distance
SCSD = Squared Chi-Squared distance
SD = Sorensen distance
SED = Squared Euclidean distance
SoD = Soergel distance
SPeaD = Squared Pearson distance
SquD = Squared χ² distance
TanD = Taneja distance
TN = true negative
TopD = Topsoe distance
TP = true positive
UCI = University of California, Irvine
VSD = Vicis Symmetric distance
VWHD = Vicis-Wave Hedges distance
WIAD = Whittaker's index of association distance
... It is observed from the above discussions that the iris recognition system at a distance still suffers from various real-time challenges. The work is motivated by the performances of various distance metrics in [33,34] and the study of LBP [7,8,10], and it is strengthened against the varying levels of illumination that appear in iris images. ...
... There are several measurement techniques to evaluate the performance of a classifier. The most widely used statistical measures are average precision, recall, and F1-measure computation [34]. Specifically, accuracy indicates the overall performance of a classifier and is computed as the ratio of the number of accurately classified images to the total number of test images. ...
Article
Full-text available
Nowadays, iris recognition has become a promising biometric for human identification and authentication. In this case, feature extraction from near-infrared (NIR) iris images under less-constrained environments makes it rather challenging to identify an individual accurately. This paper extends a texture descriptor to represent the local spatial patterns. The iris texture is first divided into several blocks, from which the shape and appearance of intrinsic iris patterns are extracted with the help of block-based Local Binary Patterns (LBPb). The concepts of uniform and rotation-invariant patterns are employed to reduce the length of the feature space. Additionally, the simplicity of the image descriptor allows for very fast feature extraction. The recognition is performed using a supervised machine learning classifier with various distance metrics in the extracted feature space as a dissimilarity measure. The proposed approach effectively deals with lighting variations, blurred focus, misaligned images, and elastic deformation of iris textures. Extensive experiments are conducted on the largest and most publicly accessible CASIA-v4 distance image database. Some statistical measures are computed as performance indicators for the validation of classification outcomes. The area under the Receiver Operating Characteristic (ROC) curve is illustrated to compare the diagnostic ability of the classifier for the LBP and its extensions. The experimental results suggest that LBPb is more effective than the other rotation-invariant and uniform rotation-invariant local binary patterns for distant iris recognition. The Bray–Curtis distance metric provides the highest accuracy compared with other distance metrics and competitive methods.
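The block-based LBP pipeline described above can be sketched as follows; the grid size, LBP radius and points, the neighbor count, and the use of scikit-image and scikit-learn are illustrative assumptions rather than the paper's exact settings. The Bray–Curtis metric is passed to the KNN classifier as the dissimilarity measure, echoing the abstract's best-performing choice.

```python
# Sketch: block-based uniform-LBP histograms fed to a KNN classifier
# with the Bray-Curtis dissimilarity (illustrative parameters).
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.neighbors import KNeighborsClassifier

def lbp_block_histogram(image, blocks=4, P=8, R=1.0):
    """Concatenate uniform-LBP histograms computed over a blocks x blocks grid."""
    codes = local_binary_pattern(image, P, R, method="uniform")
    n_bins = P + 2                      # uniform patterns plus the "non-uniform" bin
    h, w = codes.shape
    feats = []
    for i in range(blocks):
        for j in range(blocks):
            patch = codes[i * h // blocks:(i + 1) * h // blocks,
                          j * w // blocks:(j + 1) * w // blocks]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)

# X_img: list of 2-D grayscale iris images, y: subject labels (placeholders).
# X = np.array([lbp_block_histogram(img) for img in X_img])
# knn = KNeighborsClassifier(n_neighbors=1, metric="braycurtis").fit(X, y)
```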
... KNN uses a set of distance functions such as the Euclidean distance (ED), Mahalanobis distance, Manhattan distance (MD) [118], Earth Mover's distance, Chebyshev distance, and Canberra distance. We set K to 3, and we used the MD measure in this work. ...
Article
Full-text available
High-dimensional datasets often harbor redundant, irrelevant, and noisy features that detrimentally impact classification algorithm performance. Feature selection (FS) aims to mitigate this issue by identifying and retaining only the most pertinent features, thus reducing dataset dimensions. In this study, we propose an FS approach based on black hole algorithms (BHOs) augmented with a mutation technique, termed MBHO. BHO typically comprises two primary phases. During the exploration phase, a set of stars is iteratively modified based on existing solutions, with the best star selected as the "black hole". In the second phase, stars nearing the event horizon are replaced, preventing the algorithm from being trapped in local optima. To address the potential randomness-induced challenges, we introduce inversion mutation. Moreover, we enhance a widely used objective function for wrapper feature selection by integrating two new terms based on the correlation among selected features and between features and classification labels. Additionally, we employ a transfer function, the V2 transfer function, to convert continuous values into discrete ones, thereby enhancing the search process. Our approach undergoes rigorous evaluation experiments using fourteen benchmark datasets, and it is compared against Binary Cuckoo Search (BCS), Mutual Information Maximization (MIM), Joint Mutual Information (JMI), and minimum Redundancy Maximum Relevance (mRMR) approaches. The results demonstrate the efficacy of our proposed model in selecting superior features that enhance classifier performance metrics. Thus, MBHO is presented as a viable alternative to the existing state-of-the-art approaches. We make our implementation source code available for community use and further development.
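A minimal sketch of the wrapper evaluation implied above, assuming the V2 transfer function is the V-shaped |tanh(x)| (a common convention, not confirmed by the abstract) and scoring candidate feature subsets with the 3-NN/Manhattan configuration mentioned in the excerpt; the cross-validation setup is a placeholder.

```python
# Sketch: a continuous star position is binarized with a V-shaped transfer
# function and the selected features are scored with a 3-NN classifier
# using the Manhattan distance.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def v2_transfer(position, rng):
    """Map a continuous position vector to a binary feature mask."""
    prob = np.abs(np.tanh(position))           # assumed form of the V2 transfer function
    return rng.random(position.shape) < prob

def fitness(position, X, y, rng):
    mask = v2_transfer(position, rng)
    if not mask.any():                          # guard against empty feature subsets
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
    return cross_val_score(knn, X[:, mask], y, cv=5).mean()

# rng = np.random.default_rng(0)
# score = fitness(rng.standard_normal(X.shape[1]), X, y, rng)
```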
... The KNN classifier is a widely used, straightforward, and efficient classification algorithm, often serving as a benchmark for performance evaluation [32][33][34][35]. This algorithm operates based on a distance metric d and a positive integer k. ...
Article
Full-text available
In fault diagnosis, it is crucial to address the combined challenges of imbalanced sample sizes and unlabeled data. Traditional methods often generate pseudo-samples or pseudo-labels, which can lead to inaccurate diagnostic outcomes if they are not representative of the original data. To address these challenges, this paper proposes an innovative fault diagnosis method based on Bayesian graph balanced learning (BGBL). First, a balancing strategy was developed to tackle sample imbalance by assigning and optimizing weights for samples in imbalanced categories. Graph theory techniques were then used on unlabeled data to establish and update category beliefs. Following this, posterior estimates of the samples were derived within the Bayesian neural network framework, leading to the training of a fault diagnosis model. Finally, fault diagnosis was conducted using this trained model. Three sets of experiments were conducted on the planetary gearbox fault dataset. The results showed that the proposed BGBL method significantly improved the accuracy of fault diagnosis. Specifically, under conditions of imbalanced data and missing labels, the BGBL method increased accuracy by over 26% compared with existing methods, demonstrating its effectiveness in these challenging scenarios.
... In Table 3, we show the impact of different methods, concluding with no outliers removed. The KNN algorithm is sensitive to outliers, as outliers can influence the nearest neighbour search, but this sensitivity is somewhat reduced by MinMax scaling and by using the Manhattan distance metric [53], [54]. This setup, combined with the balanced accuracy metric, which is sensitive to imbalances in the dataset, ensures that the model is particularly attuned to detecting the critical minority class of weak rocks without being misled by their relative rarity or outlier-like status. ...
Preprint
Full-text available
Current rock engineering design in drill and blast tunnelling primarily relies on engineers' observational assessments. Measure While Drilling (MWD) data, a high-resolution sensor dataset collected during tunnel excavation, is underutilised, mainly serving for geological visualisation. This study aims to automate the translation of MWD data into actionable metrics for rock engineering. It seeks to link data to specific engineering actions, thus providing critical decision support for geological challenges ahead of the tunnel face. Leveraging a large and geologically diverse dataset of ~500,000 drillholes from 15 tunnels, the research introduces models for accurate rock mass quality classification in a real-world tunnelling context. Both conventional machine learning and image-based deep learning are explored to classify MWD data into Q-classes and Q-values (examples of metrics describing the stability of the rock mass) using both tabular and image data. The results indicate that the K-nearest neighbours algorithm, in an ensemble with tree-based models using tabular data, effectively classifies rock mass quality. It achieves a cross-validated balanced accuracy of 0.86 in classifying rock mass into the Q-classes A, B, C, D, E1, E2, and 0.95 for a binary classification of E versus the rest. Classification using a CNN with MWD images for each blasting round resulted in a balanced accuracy of 0.82 for binary classification. Regressing the Q-value from tabular MWD data achieved cross-validated R2 and MSE scores of 0.80 and 0.18 for an ensemble model similar to the one used for classification. High performance in regression and classification boosts confidence in automated rock mass assessment. Applying advanced modelling on a unique dataset demonstrates MWD data's value in improving rock mass classification accuracy and advancing data-driven rock engineering design, reducing manual intervention.
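A minimal sketch of the scaling/metric combination mentioned in the excerpt, assuming a scikit-learn pipeline; the dataset names, the neighbor count, and the cross-validation settings are placeholders rather than the study's configuration.

```python
# Sketch: MinMax scaling with a Manhattan-distance KNN, scored by balanced accuracy.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(metric="manhattan"))
# X_mwd, y_qclass are placeholders for the tabular features and rock mass labels.
# scores = cross_val_score(model, X_mwd, y_qclass, cv=5, scoring="balanced_accuracy")
# print(scores.mean())
```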
... The first two of these results align with the findings of [1], who examined the impact of distance measures on KNN classifier performance, suggesting that the choice of distance measure affects a variety of machine learning algorithms. A natural extension of our study is integrating our generalized change detection algorithm into continual/lifelong learning anomaly detection frameworks. ...
Chapter
Detecting relevant change points in time-series data is a necessary task in various applications. Change point detection methods are effective techniques for discovering abrupt changes in data streams. Although prior work has explored the effectiveness of different algorithms on real-world data, little has been done to explore the impact of different distance measures on change detection performance. In this paper, we modify the architecture of a change point detection workflow to assess the impact of distance measure choices on change detection accuracy and efficiency in continual learning scenarios, where the goal is detecting transitions between tasks or concepts. An experimental evaluation of 41 distance measures across several benchmark datasets demonstrated that change detection accuracy depends on the distance measure selected. Furthermore, our analysis showed performance patterns for distance measures in the same family.
... Although it is not common practice to consider the best regressor to be the one that best fits the line with slope = 1 in the actual vs. predicted plot, we used this approach for visualization purposes only, as demonstrated in Figures 4, 5, 6, 7, 8, and 9. ...
Preprint
Full-text available
Regression, a supervised machine learning approach, establishes relationships between independent variables and a continuous dependent variable. It is widely applied in areas like price prediction and time series forecasting. The performance of regression models is typically assessed using error metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). However, these metrics present challenges including sensitivity to outliers (notably MSE and RMSE) and scale dependency, which complicates comparisons across different models. Additionally, traditional metrics sometimes yield values that are difficult to interpret across various problems. Consequently, there is a need for a metric that consistently reflects regression model performance, independent of the problem domain, data scale, and outlier presence. To overcome these shortcomings, this paper introduces a new regression model accuracy measure based on the Hassanat distance metric. This measure is not only invariant to outliers but also easy to interpret, as it provides an accuracy-like value that ranges from 0 to 1 (or 0-100%). We validate the proposed metric against traditional measures across multiple benchmarks, demonstrating its robustness under various model scenarios and data types. Hence, we suggest it as a new standard for assessing regression models' accuracy.
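For reference, the distance underlying this preprint is the Hassanat distance evaluated throughout this review (HasD in the abbreviations list); each coordinate contributes a value in [0, 1), which is what bounds the influence of outliers. The sketch below implements only that distance, not the preprint's accuracy measure.

```python
# Per-dimension Hassanat distance: each term lies in [0, 1), so no single
# feature (or outlying value) can dominate the total distance.
import numpy as np

def hassanat_distance(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    d = np.empty_like(lo)
    pos = lo >= 0
    d[pos] = 1.0 - (1.0 + lo[pos]) / (1.0 + hi[pos])
    # When a coordinate pair contains a negative value, both terms are shifted
    # by |min| so the ratio stays within (0, 1].
    shift = np.abs(lo[~pos])
    d[~pos] = 1.0 - (1.0 + lo[~pos] + shift) / (1.0 + hi[~pos] + shift)
    return d.sum()

# Example: the large second coordinate contributes at most 1 to the distance.
# hassanat_distance([1, 2, 3], [1, 200, -3])
```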
Article
Full-text available
Before building machine learning models, the dataset should be prepared so that it is of high quality; we should give the model the best possible representation of the data. Different attributes may have different scales, which can increase the difficulty of the problem being modeled. A model built from features with widely varying scales may suffer from poor performance during learning. Our study explores the use of numerical data scaling as a data pre-processing step, with the purpose of determining how effectively these methods can improve the accuracy of learning algorithms. In particular, three numerical data scaling methods were compared, each with four machine learning classifiers, to predict disease severity. The experiments were built on Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) datasets, which included 1206 patients admitted between June 2020 and April 2021. The diagnosis of all cases was confirmed with RT-PCR. Basic demographic data and medical characteristics of all participants were collected. The reported results indicate that all techniques perform well with numerical data scaling and that there is significant improvement in the models for unseen data. Lastly, we conclude that classifier performance increases when scaling techniques are used; these methods help the algorithms better learn the patterns in the dataset, which helps in building accurate models.
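A minimal sketch of the kind of comparison the abstract describes, assuming three common scikit-learn scalers and a single placeholder classifier; the study's actual scalers, classifiers, and dataset are not reproduced here.

```python
# Sketch: the same classifier evaluated under different numerical scaling methods.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

scalers = {"minmax": MinMaxScaler(), "zscore": StandardScaler(), "robust": RobustScaler()}
# for name, scaler in scalers.items():
#     pipe = make_pipeline(scaler, KNeighborsClassifier())
#     print(name, cross_val_score(pipe, X, y, cv=5).mean())
```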
Article
Introduction: According to the World Health Organization, lung diseases are the third leading cause of death in the world. These diseases are chronic, so their early diagnosis is very important. Pulmonary function tests are important tools for examining and monitoring patients with respiratory injuries. This research aimed to optimize the K-Nearest Neighbor algorithm to facilitate and accelerate self-assessment and interpretation of spirometry test results with higher accuracy. Method: In this study, a method is proposed that addresses the limitations of the basic algorithm through optimization, feature weighting, and weighted voting. Using this method, obstructive pulmonary diseases are detected from a dataset of spirometry tests and general parameters, and cases are classified into three categories, namely asthma, chronic bronchitis, and emphysema. Results: The Minkowski method was chosen as the appropriate way to calculate the distance between data points, and applying coefficients to the feature values increased the classification accuracy. Weighted voting based on a Gaussian kernel was applied in the final part of the algorithm, which yielded consistent performance as the number-of-neighbors parameter was varied. The evaluations were carried out using cross-validation; 95.4% accuracy and 93.2% precision were obtained in 3.12 seconds. Conclusion: The use of machine learning algorithms can be effective in the analysis of medical data. In this study, these approaches were used to provide a new classification method; the proposed algorithm improves on the basic method and achieves better accuracy and performance than previous methods.
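A minimal sketch of Gaussian-kernel weighted voting on top of a Minkowski-distance KNN, as outlined above; sigma, the neighbor count, and p are illustrative values, and the kernel form exp(-d²/(2σ²)) is an assumption about the paper's weighting scheme.

```python
# Sketch: distance-weighted KNN voting where closer neighbors get
# exponentially larger votes via a Gaussian kernel.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

sigma = 1.0  # illustrative kernel bandwidth

def gaussian_weights(distances):
    """Turn neighbor distances into vote weights (same shape in, same shape out)."""
    return np.exp(-(distances ** 2) / (2.0 * sigma ** 2))

knn = KNeighborsClassifier(n_neighbors=7, metric="minkowski", p=2,
                           weights=gaussian_weights)
# knn.fit(X_spiro, y_disease); knn.predict(X_new)   # placeholder data names
```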
Article
Background Machine learning (ML) is increasingly used to predict the prognosis of numerous diseases. This retrospective analysis aimed to develop a prediction model using ML algorithms and to identify predictors associated with the recurrence of hallux valgus (HV) following surgery. Methods A total of 198 symptomatic feet that underwent chevron osteotomy combined with a distal soft tissue procedure were enrolled and analyzed from 2 independent medical centers. The feet were grouped according to nonrecurrence or recurrence based on 1-year follow-up outcomes. Preoperative weightbearing radiographs and immediate postoperative nonweightbearing radiographs were obtained for each HV foot. Radiographic measurements (eg, HV angle and intermetatarsal angle) were acquired and used for ML model training. A total of 9 commonly used ML models were trained on the data obtained from one institute (108 feet), and tested on the other data set from another independent institute (90 feet) for external validation. Optimal feature sets for each model were identified based on a 2000-resample bootstrap-based internal validation via an exhaustive search. The performance of each model was then tested on the external validation set. The area under the curve (AUC), classification accuracy, sensitivity, and specificity of each model were calculated to evaluate the performance of each model. Results The support vector machine (SVM) model showed the highest predictive accuracy compared to other methods, with an AUC of 0.88 and an accuracy of 75.6%. Preoperative hallux valgus angle, tibial sesamoid position, postoperative intermetatarsal angle, and postoperative tibial sesamoid position were identified as the most selected features by several ML models. Conclusion ML classifiers such as SVM could predict the recurrence of HV (an HVA >20 degrees) at a 1-year follow-up while identifying associated predictors in a multivariate manner. This study holds the potential for foot and ankle surgeons to effectively identify individuals at higher risk of HV recurrence postsurgery.
Article
Full-text available
To classify data, whether in the field of neural networks or in any application of biometrics such as handwriting classification or iris detection, possibly the most straightforward classifier in the stockpile of machine learning techniques is the Nearest Neighbor Classifier, in which classification is achieved by identifying the nearest neighbors to a query example and using those neighbors to determine the class of the query. K-NN classification classifies instances based on their similarity to instances in the training data. This paper presents the outputs obtained with various distances used in the algorithm, which may help in understanding the response of the classifier for the desired application; it also discusses computational issues in identifying nearest neighbors and mechanisms for reducing the dimension of the data.
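A toy illustration of this point, and of the review's central question: the classifier's response can change with the distance used. For the query below, the Minkowski-family measures pick training index 0 as the nearest neighbor, while the Canberra distance picks index 1.

```python
# Sketch: the nearest neighbor (and hence the 1-NN prediction) depends on the metric.
import numpy as np
from scipy.spatial import distance

query = np.array([0.1, 100.0])
train = np.array([[0.2, 100.0],    # index 0: close in the large-scale feature
                  [0.1, 140.0]])   # index 1: close in the small-scale feature
for metric in (distance.euclidean, distance.cityblock,
               distance.chebyshev, distance.canberra):
    d = [metric(query, x) for x in train]
    print(f"{metric.__name__:10s} nearest index: {int(np.argmin(d))}")
```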
Article
Full-text available
Big data classification is very slow when using traditional machine learning classifiers, particularly when using a lazy and slow-by-nature classifier such as the k-nearest neighbors algorithm (KNN). This paper proposes a new approach which is based on sorting the feature vectors of training data in a binary search tree to accelerate big data classification using the KNN approach. This is done using two methods, both of which utilize two local points to sort the examples based on their similarity to these local points. The first method chooses the local points based on their similarity to the global extreme points, while the second method chooses the local points randomly. The results of various experiments conducted on different big datasets show reasonable accuracy rates compared to state-of-the-art methods and the KNN classifier itself. More importantly, they show the high classification speed of both methods. This strong trait can be used to further improve the accuracy of the proposed methods.
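A minimal sketch of the two-local-points idea described above, using the random pivot strategy (the second of the two methods mentioned); the leaf size and the use of Euclidean distance to the pivots are assumptions made for illustration.

```python
# Sketch: training examples are split recursively by which of two pivot
# examples they are closer to; a query descends the tree and exact KNN
# runs only on the examples stored in the reached leaf.
import numpy as np

def build(X, y, rng, leaf_size=32):
    if len(X) <= leaf_size:
        return ("leaf", X, y)
    i, j = rng.choice(len(X), size=2, replace=False)   # two random local points
    p1, p2 = X[i], X[j]
    left = np.linalg.norm(X - p1, axis=1) <= np.linalg.norm(X - p2, axis=1)
    if left.all() or (~left).all():                    # degenerate split -> stop
        return ("leaf", X, y)
    return ("node", p1, p2,
            build(X[left], y[left], rng, leaf_size),
            build(X[~left], y[~left], rng, leaf_size))

def classify(tree, q, k=3):
    while tree[0] == "node":
        _, p1, p2, lt, rt = tree
        tree = lt if np.linalg.norm(q - p1) <= np.linalg.norm(q - p2) else rt
    _, Xl, yl = tree
    nn = np.argsort(np.linalg.norm(Xl - q, axis=1))[:k]  # exact KNN inside the leaf
    vals, counts = np.unique(yl[nn], return_counts=True)
    return vals[np.argmax(counts)]

# tree = build(X_train, y_train, np.random.default_rng(0))
# classify(tree, X_test[0])
```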
Article
Full-text available
Big Data classification has recently received a great deal of attention due to the main properties of Big Data, which are volume, variety, and velocity. The furthest-pair-based binary search tree (FPBST) shows great potential for Big Data classification. This work attempts to improve the performance of the FPBST in terms of computation time, space consumed, and accuracy. The major enhancement of the FPBST involves converting the resultant BST to a decision tree, in order to remove the need for the slow K-nearest neighbors (KNN) classifier and to obtain a smaller tree, which is useful for memory usage, for speeding up both the training and testing phases, and for increasing the classification accuracy. The proposed decision trees are based on calculating the probabilities of each class at each node using various methods; these probabilities are then used in the testing phase to classify an unseen example. The experimental results on some (small, intermediate, and big) machine learning datasets show the efficiency of the proposed methods in terms of space, speed, and accuracy compared with the FPBST, and show great potential for further enhancements of the proposed methods to be used in practice.
Article
Full-text available
Due to their large sizes and/or dimensions, the classification of Big Data is a challenging task for traditional machine learning, particularly if it is carried out using the well-known K-nearest neighbors (KNN) classifier, which is a slow and lazy classifier by its nature. In this paper, we propose a new approach to Big Data classification using the KNN classifier, which is based on inserting the training examples into a binary search tree to be used later for speeding up the search for test examples. For this purpose, we used two methods to sort the training examples. The first calculates the minimum/maximum scaled norm and rounds it to 0 or 1 for each example. Examples with 0-norms are sorted into the left child of a node, and those with 1-norms are sorted into the right child of the same node; this process continues recursively until we obtain one example, or a small number of examples with the same norm, in a leaf node. The second proposed method inserts each example into the binary search tree based on its similarity to the examples with the minimum and maximum Euclidean norms. The experimental results of classifying several machine learning big datasets show that both methods are much faster than most of the state-of-the-art methods compared, with competing accuracy rates obtained by the second method, which shows great potential for further enhancements of both methods to be used in practice.
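A minimal sketch of the first (norm-rounding) method described above; the leaf size and stopping rules are assumptions, and query routing is only indicated in a comment.

```python
# Sketch: each example's Euclidean norm is min/max scaled within the current
# node and rounded to 0 or 1, which routes it to the left or right child.
import numpy as np

def build_norm_tree(X, y, leaf_size=32):
    norms = np.linalg.norm(X, axis=1)
    lo, hi = norms.min(), norms.max()
    if len(X) <= leaf_size or hi == lo:
        return ("leaf", X, y)
    right = np.round((norms - lo) / (hi - lo)).astype(bool)   # scaled norm -> 0/1
    if right.all() or (~right).all():                          # unsplittable node
        return ("leaf", X, y)
    return ("node", lo, hi,
            build_norm_tree(X[~right], y[~right], leaf_size),
            build_norm_tree(X[right], y[right], leaf_size))

# A query is routed the same way (scale its norm with the node's lo/hi, round to
# 0 or 1), and KNN is applied to the examples stored in the leaf it reaches.
```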
Article
Full-text available
The advances in information technology, in both hardware and software, have allowed big data to emerge recently; classification of such data is extremely slow, particularly when using the K-nearest neighbors (KNN) classifier. In this article, we propose a new approach that creates a binary search tree (BST) to be used later by the KNN to speed up big data classification. This approach is based on finding the furthest pair of points (the diameter) in a data set and then using this pair of points to sort the examples of the training data set into a BST. At each node of the BST, the furthest pair is found and the examples located at that particular node are further sorted based on their distances to these local furthest points. The created BST is then searched for a test example down to a leaf; the examples found in that particular leaf are used to classify the test example using the KNN classifier. The experimental results on some well-known machine learning data sets show the efficiency of the proposed method in terms of speed and accuracy compared with the state-of-the-art methods reviewed. With some optimization, the proposed method has great potential to be used for big data classification and can be generalized to other applications, particularly when classification speed is the main concern.
Article
Full-text available
While standing as one of the most widely considered and successful supervised classification algorithms, the k-Nearest Neighbor (kNN) classifier generally exhibits poor efficiency due to being an instance-based method. In this sense, Approximated Similarity Search (ASS) stands as a possible alternative for addressing these efficiency issues, at the expense of typically lowering the performance of the classifier. In this paper we take as our starting point an ASS strategy based on clustering. We then improve its performance by addressing issues related to instances located close to the cluster boundaries, by enlarging the cluster size and by using Deep Neural Networks to learn a suitable representation for the classification task at hand. Results using a collection of eight different datasets show that the combined use of these two strategies entails a significant improvement in accuracy, with a considerable reduction in the number of distances needed to classify a sample in comparison to the basic kNN rule.
Article
Classification plays a significant role in production activities and daily life. In this era of big data, it is especially important to design efficient classifiers with high classification accuracy for large-scale datasets. In this paper, we propose randomly partitioned and Principal Component Analysis (PCA)-partitioned multivariate decision tree classifiers, whose training times are quite short and whose classification accuracies are quite high. Approximately balanced trees are created in the form of a full binary tree, based on two simple ways of generating multivariate combination weights and a median-based method for selecting the split value, which ensures the efficiency and effectiveness of the proposed algorithms. Extensive experiments conducted on a series of large datasets demonstrate that the proposed methods are superior to other classifiers in most cases.
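A minimal sketch of the PCA-partitioned variant described above, assuming a one-component PCA projection at each node and a median split; the original weight-generation schemes and other details are not reproduced.

```python
# Sketch: each node projects its examples onto the leading principal component
# and splits at the median projection, giving an approximately balanced tree.
import numpy as np
from sklearn.decomposition import PCA

def build_pca_tree(X, y, min_leaf=16):
    if len(X) <= min_leaf or len(np.unique(y)) == 1:
        classes, counts = np.unique(y, return_counts=True)
        return ("leaf", classes[np.argmax(counts)])       # majority class at the leaf
    pca = PCA(n_components=1).fit(X)
    z = pca.transform(X).ravel()
    thr = np.median(z)                                     # median-based split value
    left = z <= thr
    if left.all() or (~left).all():
        classes, counts = np.unique(y, return_counts=True)
        return ("leaf", classes[np.argmax(counts)])
    return ("node", pca, thr,
            build_pca_tree(X[left], y[left], min_leaf),
            build_pca_tree(X[~left], y[~left], min_leaf))

def predict_one(tree, x):
    while tree[0] == "node":
        _, pca, thr, lt, rt = tree
        tree = lt if pca.transform(x.reshape(1, -1)).ravel()[0] <= thr else rt
    return tree[1]
```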
Book
Bioinformatics, or computational biology, is a relatively new field that applies computer science and information technology to biology. In recent years, the discipline of bioinformatics has allowed biologists to make full use of advances in computer science and computational statistics for analyzing biological data. Researchers in the life sciences generate, collect, and need to analyze an increasing number of different types of scientific data: DNA, RNA, and protein sequences; in-situ and microarray gene expression; 3D protein structures; and biological pathways. This book aims to provide information on bioinformatics at various levels. The chapters included in this book cover introductory to advanced aspects, including applications of various documented research work and specific case studies related to bioinformatics. This book will be of immense value to readers of different backgrounds, such as engineers, scientists, consultants, and policy makers in industry, government, academia, and social and private organisations.