Distributed Nearest Neighbor Classification for Large-Scale Multi-label Data on Spark

Authors: Jorge Gonzalez-Lopez (a), Sebastián Ventura (b,c,d), Alberto Cano (a)
(a) Department of Computer Science, Virginia Commonwealth University, USA
(b) Department of Computer Science and Numerical Analysis, University of Cordoba, Spain
(c) Computing and Information Technology, King Abdulaziz University, Saudi Arabia
(d) Maimonides Biomedical Research Institute of Cordoba, Spain
Abstract
Modern data is characterized by its ever-increasing volume and complexity,
particularly when data instances belong to many categories simultaneously.
This learning paradigm is known as multi-label classification and one of its
most renowned methods is the multi-label k nearest neighbor (Ml-knn).
The traditional implementations of this method are not feasible for large-scale
multi-label data due to their computational complexity and memory restrictions. We propose
a distributed Ml-knn implementation based on the MapReduce program-
ming model, implemented on Apache Spark. We compare three strategies for
distributed nearest neighbor search: 1) iteratively broadcasting instances, 2)
using a distributed tree-based index structure, and 3) building hash tables
to group instances. The experimental study evaluates the trade-off between
the quality of the predictions and runtimes on 22 benchmark datasets, and
compares the scalability using different sizes of data. The results indicate
that the tree-based index strategy outperforms the other approaches, having
a speedup of up to 266x for the largest dataset, while achieving an accuracy
equivalent to the exact methods. This strategy enables Ml-knn to scale
efficiently with respect to the size of the problem.
Keywords: Apache Spark, MapReduce, Distributed Computing, Big Data,
Multi-label classification, Nearest Neighbors
Email addresses: gonzalezlopej@vcu.edu (Jorge Gonzalez-Lopez),
sventura@uco.es (Sebastián Ventura), acano@vcu.edu (Alberto Cano)
Preprint submitted to Future Generation Computer Systems April 18, 2018
1. Introduction
The emergence of new data-oriented technologies has led to an exponen-
tial increase of the volume of data gathered by modern information systems.
The exponential growth of data, in size and complexity, has produced a
pressing need to develop scalable machine learning algorithms to harness the
worthwhile information in the vast collections of data.
Besides its large volume, modern data is characterized by its increased
complexity. A large part of the data generated nowadays is made of instances
that belong to multiple categories at once. This type of data is known as
multi-label data. Multi-label classification is the task of learning
the set of categories, also known as labels, that an instance belongs to. Multi-
label classification has attracted growing interest in the last decade, because
of the real-world applications that fit in the paradigm [1–3].
One of the simplest, yet most effective techniques is lazy learning, in which
the generalization of the data is delayed until a query is made, as opposed
to other techniques where a model is learned on the training data. The most
popular multi-label lazy learner is the multi-label k Nearest Neighbor (Ml-knn) [4],
which is a non-parametric method where the generalization is done in feature space.
As a lazy learning method, Ml-knn requires storing all the training instances; it then
computes statistical information about the set of labels and the training instances
by performing a pair-wise computation of a distance metric or similarity
function [5, 6]. This information can be used
to predict the set of labels of unseen instances; however, this comes at a high
computational and memory cost, preventing its application to big data problems.
Ml-knn has been used extensively to evaluate feature and/or label transformations,
since it is the method in which features or labels have
the largest impact on performance. Some of the scenarios in which it has
been applied are feature selection [7–9], learning by using label dependen-
cies and correlations [10, 11], label space transformation [12], dimensionality
reduction [13], and using specific features on a label space [14].
These issues can be handled using parallel computing in a distributed
environment, which can considerably improve the performance by sharing the
computational and memory resources of multiple machines [15]. The MapRe-
duce programming model [16] offers a simple and robust paradigm to handle
large-scale datasets in a cluster of computers. This model is suitable for
processing big data because of its fault-tolerant mechanism, which is highly
desirable for long-running executions. One of the first implementations of
the MapReduce model was Hadoop [17], yet one of its critical disadvantages
is that it processes data from a distributed file system, which introduces a
high latency. In contrast, Spark [18] provides in-memory computation, which
results in a large performance improvement, especially on iterative jobs. Using
in-memory data has proved to be particularly relevant in machine
learning scenarios [19]. Spark stands out as a powerful framework for its ease
of use, and it has been successfully applied in many scenarios [20–24].
1.1. Motivation
Ml-knn has been widely used in evaluating feature and label transformations.
Furthermore, various lazy learning methods for multi-label data
have been proposed in recent years [25–28]. These methods have advantages
and weaknesses, depending on the data distribution, feature dimensionality,
or number of labels. However, all these methods are based on the original
Ml-knn and suffer from its original limitations. The combination of the
memory and computation limitations of the nearest neighbor search, together
with the increased complexity of multi-label data, has produced a
pressing need to develop scalable solutions for Ml-knn.
In this paper, we propose a distributed implementation of the Ml-knn to
classify large-scale multi-label data using Spark. This implementation relies
on the nearest neighbor search method to gather statistical information, both
in the train and test phases. Hence, both its prediction quality and computational
performance will be bounded by the incorporated search method, even though the
search is only a fraction of all the distributed computation performed.
The distributed implementations of nearest neighbor search can be cate-
gorized into the following: brute force, tree indexes, and hash indexes. There
are multiple implementations of knn in distributed environments using Spark
in the literature [29–32], and each of them belongs to one of the groups men-
tioned. We selected the best method out of each strategy and incorporated
it into our distributed Ml-knn. To the best of our knowledge, all previous
work on distributed nearest neighbor search has focused on traditional classi-
fication, and this is the first study implementing and analyzing the scalability
of Ml-knn in a distributed environment.
The main contributions of this paper are as follows:
- Analysis and parallelization of Ml-knn in a distributed environment,
presenting a detailed explanation of the proposed MapReduce algorithm on Spark.
- Comparison of the three main strategies for the nearest neighbor search
in a distributed environment and evaluation of their impact. Specifically,
the Ml-knn performance is analyzed, focusing on the trade-off between
accuracy and runtime applied to multi-label classification.
- Evaluation of the experimental results: reliability of the predictions,
execution times for the train and test phases, and a study of scalability
with respect to the problem size.
Experimental results performed on 22 varied multi-label datasets show
that the proposed Ml-knn method, which uses a tree-based index structure,
outperforms the other methods in every scenario. This method produces pre-
dictions which are equivalent to those of an exact method, while considerably
reducing the execution time and scaling accordingly with respect to the size
of the problem (instances and labels).
The structure of this paper is as follows. Section 2 reviews the related
works. Section 3 introduces the distributed implementation for Ml-knn and
Section 4 presents the strategies to find nearest neighbors. Section 5 describes
the experimental study and Section 6 analyzes the results. Finally, Section 7
presents the conclusions of this work.
2. Background
This section reviews definitions and related works on the use of MapRe-
duce, nearest neighbor search, and multi-label learning. First, Section 2.1
introduces the concept of MapReduce and the Spark framework. Section 2.2
defines the nearest neighbor searches and the different strategies to optimize
the problem. Finally, Section 2.3 formally defines multi-label learning and
the multi-label k nearest neighbor (Ml-knn) algorithm.
2.1. MapReduce programming model and Spark framework
The MapReduce programming model [16] was developed to process data
using a distributed strategy, allowing it to scale to data that does not fit in the
physical memory of a single machine. This framework provides an abstraction
of the underlying hardware and software of the cluster. It partitions,
replicates, and distributes the data, providing fault tolerance. Moreover, it
schedules the jobs and the network communications; as a result, the user only
needs to implement the Map and Reduce primitives.
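As an illustration, the following minimal Spark snippet (Scala) shows how a computation is expressed purely through map and reduce primitives, leaving partitioning, scheduling, and fault tolerance to the framework; the input path and the statistic computed (feature means) are hypothetical examples, not part of the original text.

```scala
import org.apache.spark.sql.SparkSession

object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-reduce-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one comma-separated numeric feature vector per line.
    val lines = sc.textFile("hdfs:///data/instances.csv")

    // Map primitive: parse each line into a feature vector (runs locally on each partition).
    val vectors = lines.map(_.split(",").map(_.toDouble))

    // Reduce primitive: element-wise sum of all vectors, used here to compute feature means.
    val sum = vectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val count = vectors.count()
    println(sum.map(_ / count).mkString(", "))

    spark.stop()
  }
}
```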
Hadoop [17] is an open-source implementation of MapReduce. Despite its
popularity, Hadoop presents some important weaknesses, such as the impossibility
of maintaining data in memory, which leads to poor performance for online,
interactive, and/or iterative methods [33–35].
Spark [18] is a distributed computing platform that has become one of the
most powerful tools for the big data scenario. Spark was designed to overcome
the limitations of Hadoop; one of its main advantages is the possibility of
maintaining data in memory. In fact, Spark has been shown to outperform
Hadoop in many cases (up to 100x in memory) [36].
The main components of the Spark architecture are the driver, workers, and
executors. The driver loads the application and creates the set of tasks to
be executed. The workers are the set of nodes in the distributed environment,
each of them with at least one executor. The executors are distributed agents
that execute the tasks using local partitions of the data in their assigned
memory space. The scheduling of the tasks and the assignment of resources to
each executor is done by the cluster manager.
For large-scale data mining, multiple common learning algorithms and
statistics utilities were created and packaged into MLlib [37], the machine
learning library for Spark. This library provides support for several tasks,
such as classification, regression, filtering, clustering, and data preprocessing.
2.2. Nearest Neighbor search strategies
Nearest neighbor (NN) is a classic non-parametric and instance-based
technique that has attracted the attention of the research community due
to its simplicity and effectiveness. This technique has a wide range of ap-
plications such as density estimation [38], dimensional hashing [39], pattern
recognition [40], data compression [41], and so on.
The nearest neighbor search is an optimization problem, whose goal is to
find an instance that minimizes a certain distance or similarity function [5, 6].
The most popular distance used is the Euclidean distance; however, there are
other scenarios in which other criteria need to be considered, such as energy
efficiency [42–46], the load across machines [47], or privacy [48, 49]. A straightfor-
ward generalization of this problem is the k nearest neighbor search (knn),
which finds the k instances minimizing the distance function.
Whenever the search method performs a pair-wise comparison using all
the instances, it is called exact nearest neighbor search. This approach finds
the exact nearest neighbor at a high computational cost; however, there are
many techniques that reduce this complexity by indexing the feature space.
2.2.1. Tree indexes
One of the first methods developed was the kd-tree [50]. This structure
recursively subdivides the feature space by a hyperplane that is orthogonal
to one of the axes and that partitions the data points as evenly as possible.
In [51], the authors propose to simultaneously use multiple kd-trees to increase
the performance of nearest neighbor searches. They use rotations of the
dataset that force the tree to use features that otherwise would be discarded.
They show that by rotating the dataset to align it with its principal
axis direction using PCA, and then applying random Householder transformations
that preserve the PCA subspace of appropriate dimension,
the kd-tree performance can be significantly improved.
One of the weaknesses of kd-tree is that the indexed space can have a
high aspect ratio, which makes it impossible to use volume bounds. Arya
et al. [52] introduced a Balanced Box-Decomposition tree (bbd-tree) which
guarantees both balanced aspect ratio and a logarithmic depth. This mod-
ification allows performing error-bounded approximate search by considering
(1 + ε)-approximate nearest neighbors. This concept was also adapted by
Duncan et al. [53] using the Balanced Aspect Ratio tree (bar-tree), which
was later extended to higher dimensions [54]. This tree does not exclusively
use axis-orthogonal hyperplane cuts, which leads to good aspect ratio, bal-
anced depth, and convex regions. Other variations of the kd-tree are: the
pca-tree [55], the rp-tree [56], and the trinary projection tree [57].
The other main weakness of kd-tree is related to the curse of dimen-
sionality. kd-tree is very effective at low dimensions: after traveling down
the nodes of the tree, all the instances in one leaf tend to be much closer
to each other than to instances in other leaves. However, this property
disappears in high-dimensional spaces. This has been addressed by a later method
called Metric tree [58] which subdivides the feature space by a hyperplane
defined by the midpoint of two instances (pivots). This partitioning creates
two disjoint sets with no information shared between them. The search pro-
cess goes through the tree choosing the nearest node at each level, allowing
it to “backtrack” in case some branches remain unpruned. The Spill
tree [59] modifies the Metric tree, avoiding the tedious backtracking process
by allowing an overlap area between the nodes. This overlap buffer allows
the same instance to be indexed by both pivots, which leads to
increased accuracy at the cost of redundancy.
In order to combine the advantages of both the Metric tree and the Spill tree,
[59] proposes a combination called Hybrid tree. This structure allows
both types of nodes, where a decision is made at each node whether
to use an overlap node or a non-overlap node. In the search process, it only
backtracks at non-overlap nodes (as in a conventional Metric tree), and performs
defeatist search at overlapping nodes (as in a Spill tree).
Some other approaches, which are based on completely different concepts,
have been proposed. For example, [60] presents the product quantization
approach, in which the space is decomposed into low-dimensional subspaces
and the instances are represented by compact codes computed as the quantization
indices in these subspaces. These compact codes can be compared to
the query points using an asymmetric approximate distance. A modification
of the standard quantization process was introduced by [61], in which
an inverted index with product quantization produces a denser
subdivision of the search space.
2.2.2. Hashing indexes
The best known hashing based nearest neighbor technique is Locality
Sensitive Hashing (LSH) [62]. An LSH function maps the instances in the
feature space to a space of reduced dimensionality in a way that similar
instances map to the same hash entries. Then, a similarity search query can
be answered by first hashing the query instance and then finding the close
instances within the instances that have been mapped to the same entry. To
guarantee both good search quality and good search efficiency, one needs
to use multiple LSH tables and combine their results. Unfortunately, LSH
requires a large number of hash tables [63, 64]. There are some variants
of LSH such as multi-probe LSH [65] in which the number of hash tables is
reduced by searching other entries in the hash tables within a certain distance,
and LSH Forest [66], which removes the data-dependent parameters,
achieving better adaptation to skewed data distributions.
The performance of LSH methods is highly dependent on the hashing
functions they use. There is a large amount of research aimed at improving
hashing methods by using data-dependent hashing functions using various
techniques: parameter sensitive hashing [67], spectral hashing [68], random-
ized LSH from learned metrics [69], kernelized LSH [70], learned binary em-
bedding [71], shift-invariant kernel hashing [72], semi-supervised hashing [73],
optimized kernel hashing [74], and complementary hashing [75].
2.2.3. Graph indexes
Nearest neighbors graph methods build a graph structure in which ver-
tices represent the instances and edges connect nearest neighbors. There
are two critical components in these methods: query strategy and graph
construction.
There are multiple approaches that aim to minimize the impact of con-
sulting a nearest neighbor graph. In [76], the authors propose using a sample
of well-separated instances as seeds and starting the graph exploration using a
best-first strategy. Similarly, [77] incorporates a hill-climbing strategy and
picks the starting points at random.
The graph construction is the target of substantial research; however, these
methods do not scale, or are specific to certain similarity measures. Paredes
et al. [78] proposed two methods for the graph construction using general
metric spaces and low empirical complexity. However, both methods require
a global data structure and are difficult to parallelize across machines. [79]
proposes using divide-and-conquer methods to recursively partition the data. In
[80], the authors presented a graph construction technique using Morton ordering
and based on space-filling curves. [81] presents a graph construction method
based on local search, considering that a neighbor of a neighbor is likely
to be a neighbor too. Therefore, by initializing each vertex with a random
set of neighbors, the method iteratively improves the neighbors of each node.
Although some methods have focused on efficiently building a nearest neighbor
graph, they are not considered efficient enough in a distributed environment.
Consequently, these techniques
are not included in this study.
2.3. Multi-label K Nearest Neighbors ( Ml-knn)
Table 1 summarizes all the symbols and notation used to describe previous
works, as well as to explain our proposed solutions in detail.
Let X denote the domain of instances; a single instance x ∈ X is represented
as a set of features x = {x1, x2, . . . , xd}, where d is the number
of features. Multi-label learning considers a finite set of labels L = {y1, y2, . . . , yq},
where q is the number of labels. In a training set, each of the instances
is associated with a subset of labels, D = {(x1, Y1), (x2, Y2), . . . , (xm, Ym)},
(xi ∈ X, Yi ⊆ L). For convenience, let Yi also denote the binary multi-label vector of xi,
where its jth component Yi(j) is 1 if yj ∈ Yi and 0 otherwise.
Tsoumakas et al. [1] divide the methods for multi-label classification into
two categories: problem transformation and algorithm adaptation. The first
Table 1: Symbols and notation used in this paper

Definition                     Symbol/Notation
Instance Space                 X = R^d (or Z^d)
Number of Input Attributes     d
Instance                       x = (x1, x2, . . . , xd), x ∈ X
Label Space                    L = {y1, y2, . . . , yq}
Number of Labels               q
Label Set                      Y = {y1, y2, . . . , yq} ∈ {0, 1}^q
Predicted Label Set            Z = {z1, z2, . . . , zq} ∈ {0, 1}^q
Multi-label Training Set       D = {(xi, Yi), 1 ≤ i ≤ m}
Number of Training Instances   m
Multi-label Test Set           T = {(xi, Yi), 1 ≤ i ≤ p}
Number of Test Instances       p
Number of Neighbors            k
Neighbor Set                   N(x) = {x1, x2, . . . , xk} ⊆ X
one modifies the data to adapt it to traditional classification algorithms.
However, these methods either do not consider the correlations between labels,
leading to weak expressive power [82], or they do so at an exponential cost with respect to the
number of labels [2]. The latter focuses on adapting algorithms to work di-
rectly with multi-label data, such as decision trees [83], neural networks [84],
support vector machines [85], and multiple lazy learning methods [26].
Multi-label k Nearest Neighbor (Ml-knn) [4] was the first lazy learning
method proposed for multi-label classification, and it is still the most used
approach for this type of learning. This algorithm uses statistical informa-
tion from each label to predict the set of labels without any kind of data
transformation. Ml-knn inherits the advantages from both lazy learning
and Bayesian reasoning [86]. Additionally, class imbalance, which is
a common problem in multi-label data [87], is mitigated due to the prior
probabilities estimated for each label. On the other hand, one of the known
disadvantages of Ml-knn is that it is a first-order approach which reasons about the
relevance of each label separately.
Some alternatives to Ml-knn have been proposed that try to smooth the
first-order approach or the negative impact of the number of labels. DMl-
knn [25], instead of using only the statistical information from positive
instances, also considers the negative instances. BR-knn [26] combines
multiple knn classifiers, one per label, in a binary relevance (BR) manner. They use
the count of labels in the set of neighbors as the confidence score for the
predictions. MLCW-knn [27] improves the previous method by assigning
weights to each of the instances according to their distances to the query
sample. In a similar manner, the labels can be ranked according to the
probabilities of the label association using the neighboring samples around
a query sample [3]. IBLR-Ml-knn [28] combines linear regression and the
knn algorithm, having one classifier per label as in BR methods.
3. Distributed ML-KNN
We propose an Ml-knn implementation in Spark, which focuses on
scaling the algorithm in a distributed environment. The computation can be
distributed in both the train and test phases.
Firstly, the train phase computes the statistical information of the train-
ing instances by finding the prior and posterior probabilities. The prior
probabilities can be found by frequency counting of the labels. The poste-
rior probabilities are more complex and require a nearest neighbor search to gather
statistical information.
Secondly, the test phase uses the previously computed prior and posterior
probabilities. For each of the test instances, the probabilities are combined
with the information gathered by the nearest neighbors on the training in-
stances to produce a probability for each of the labels.
As can be seen, Ml-knn is a complex algorithm whose performance is
limited by two nearest neighbor searches: the first one among all the
training instances, and the second one between the test and training instances.
This introduces an increased complexity with respect to lazy methods in traditional
classification problems. Additionally, some aspects such as broadcasting val-
ues or persisting the right data in-memory can have a large impact on the
final performance. This section provides a detailed explanation of the imple-
mentation that maximizes the resources of a distributed environment.
3.1. Train phase: computing prior and posterior probabilities
The train phase is divided into computing the prior and posterior prob-
abilities. The former is computed by performing a frequency count of the labels, and the
latter by performing a frequency count conditioned on the neighbors.
Prior probabilities are defined as P(H_j) and P(¬H_j), and represent
the probabilities of a label being found in the dataset before the arrival of
new information. Figure 1 presents the process to compute these probabilities.
A user-defined Reduce Operation is applied to the collection of labels
of all the training instances. This operation adds the binary label vectors
Y into a vector µ_j = ∑_{i=1}^{m} [[y_j ∈ Y_i]], which counts the occurrences of each
label. Next, the prior probabilities are computed by averaging and smoothing
the count vector, P(H_j) = (s + µ_j)/(s × 2 + m). Additionally, we can compute
P(¬H_j) = 1 − P(H_j) as the complementary probability.
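A minimal sketch of this step in Spark (Scala) is shown below, assuming the training labels are available as an RDD of binary vectors and using the smoothing factor s; variable names are illustrative and not taken from the original implementation.

```scala
import org.apache.spark.rdd.RDD

// Sketch: prior probabilities via a reduce over the binary label vectors.
// `labels` holds one label vector Y (0/1 values) per training instance.
def priorProbabilities(labels: RDD[Array[Double]], s: Double = 1.0)
    : (Array[Double], Array[Double]) = {
  val m = labels.count()                                       // number of training instances
  // Reduce: element-wise sum of the label vectors -> count of each label (mu_j).
  val mu = labels.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  // Averaging and smoothing: P(H_j) = (s + mu_j) / (s * 2 + m).
  val prior = mu.map(c => (s + c) / (s * 2 + m))
  val priorNeg = prior.map(1.0 - _)                            // P(not H_j) = 1 - P(H_j)
  (prior, priorNeg)
}
```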
Posterior probabilities are defined as P(C_j | H_j) and P(C_j | ¬H_j), and represent
the probabilities of having a given number of neighbors with the jth label among the
k nearest neighbors, conditioned on the event of the label being present (or absent)
in the instance itself. Figure 2 presents the steps to compute these probabilities.
First, it finds the k nearest neighbors of every training instance, followed
by a user-defined Map Operation applied to each instance. This operation
computes the count C_j of each label among the instance's neighbors. Next, it creates
two frequency matrices, K_{rj} = κ_j[r] and K̃_{rj} = κ̃_j[r], for each instance x.
Each position (r, j) is initialized to 0, and it is updated with K(C_j, j) = 1
whenever y_j ∈ Y_i, or K̃(C_j, j) = 1 whenever y_j ∉ Y_i.
Figure 1: Ml-knn train phase: Computation of the prior probabilities P(H) and P(¬H).
Frequency count of the labels followed by averaging and smoothing the values.
Figure 2: Ml-knn train phase: Computation of the posterior probabilities P(C|H) and
P(C|¬H). A frequency count of the labels among the neighbors is performed
per instance, followed by averaging and smoothing the values.
Second, a user-defined Reduce Operation is applied to the collection of
K and K̃ matrices, adding the matrices of all the training instances. The final result
stores the number of training instances that have exactly r neighbors with the
jth label, both for the case y_j ∈ Y_i and y_j ∉ Y_i.
Finally, the posterior probabilities are computed by averaging and smoothing
the frequency matrices: P(C_j | H_j) = (s + K_{C_j,j}) / (s × (k + 1) + ∑_{r=0}^{k} K_{r,j}) and
P(C_j | ¬H_j) = (s + K̃_{C_j,j}) / (s × (k + 1) + ∑_{r=0}^{k} K̃_{r,j}).
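A sketch of this computation in Spark (Scala) follows; it assumes a hypothetical RDD pairing each training instance's binary label vector with its per-label neighbor counts C_j (produced by whichever nearest neighbor search is plugged in), so names and shapes are illustrative only.

```scala
import org.apache.spark.rdd.RDD

// Sketch: posterior probability estimation for Ml-knn.
// `data`: for each training instance, (Y, c) where Y(j) ∈ {0,1} is the label vector
// and c(j) is the number of its k neighbors carrying label j.
def posteriorProbabilities(data: RDD[(Array[Int], Array[Int])], q: Int, k: Int, s: Double = 1.0)
    : (Array[Array[Double]], Array[Array[Double]]) = {

  // Map: emit one ((r, j), (count_K, count_K~)) entry per label of each instance.
  val perInstance = data.flatMap { case (y, c) =>
    (0 until q).map { j =>
      if (y(j) == 1) ((c(j), j), (1L, 0L)) else ((c(j), j), (0L, 1L))
    }
  }
  // Reduce: add the frequency matrices of all training instances, keyed by (r, j).
  val freq = perInstance.reduceByKey((u, v) => (u._1 + v._1, u._2 + v._2)).collectAsMap()

  val post    = Array.ofDim[Double](k + 1, q)   // P(C_j = r | H_j)
  val postNeg = Array.ofDim[Double](k + 1, q)   // P(C_j = r | not H_j)
  for (j <- 0 until q) {
    val colPos = (0 to k).map(r => freq.getOrElse((r, j), (0L, 0L))._1)
    val colNeg = (0 to k).map(r => freq.getOrElse((r, j), (0L, 0L))._2)
    for (r <- 0 to k) {
      post(r)(j)    = (s + colPos(r)) / (s * (k + 1) + colPos.sum)
      postNeg(r)(j) = (s + colNeg(r)) / (s * (k + 1) + colNeg.sum)
    }
  }
  (post, postNeg)
}
```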
3.2. Test phase: Prediction of label set
The test phase induces the predicted label set for new unlabeled instances.
This set relies on the prior and posterior probabilities previously computed
in the train phase. Figure 3 shows the prediction process of the test phase.
First, it finds the k nearest neighbors of every test instance among the
training instances. Next, a user-defined Map Operation is performed on
each test instance to compute the count C_j of each label among its neighbors.
Then, each test instance looks up the corresponding values in the posterior
probabilities, and each label y_j is assigned by determining whether
P(H_j) × P(C_j | H_j) is greater than P(¬H_j) × P(C_j | ¬H_j).
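A compact sketch of this per-label decision rule (Scala), assuming the prior/posterior arrays produced in the train phase have been broadcast; the names are illustrative.

```scala
// Sketch: Ml-knn prediction rule for one test instance.
// `prior(j)` = P(H_j), `priorNeg(j)` = P(¬H_j),
// `post(r)(j)` = P(C_j = r | H_j), `postNeg(r)(j)` = P(C_j = r | ¬H_j),
// `c(j)` = number of the k nearest training neighbors carrying label j.
def predict(prior: Array[Double], priorNeg: Array[Double],
            post: Array[Array[Double]], postNeg: Array[Array[Double]],
            c: Array[Int]): Array[Int] =
  c.indices.map { j =>
    val yes = prior(j)    * post(c(j))(j)     // evidence for the label being relevant
    val no  = priorNeg(j) * postNeg(c(j))(j)  // evidence for the label being irrelevant
    if (yes > no) 1 else 0
  }.toArray
```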
Figure 3: Ml-knn test phase. First, the prior and posterior probabilities are broadcasted.
Then, each label set is predicted by combining the probabilities with the information
collected by the nearest neighbors.
4. Distributed Nearest Neighbors methods
The Ml-knn performance is going to be bounded by the k nearest neighbor
search, both in the train and test phases. The main issue of distributing
the search process over a cluster of nodes is that every time a node needs to
access the information of another node it triggers a shuffle operation, which
sends information over the network and is considered a slow process.
Although the most naive strategy would be to use a cross product and exchange
the information of all the nodes with each other, in practice this is
infeasible because of memory and network limitations.
There are multiple strategies that aim to minimize this impact, from
distributed index structures to hash matching. We propose three
versions of Ml-knn, and study their impact on the overall performance of
the algorithm. Each implementation represents one of the strategies studied
in Section 2.2, namely Ml-knn-it, Ml-knn-ht, and Ml-knn-lsh.
4.1. Iterative Multi-label k Nearest Neighbor (Ml-knn-it)
Our first approach was to incorporate an iterative version of Ml-knn
in Spark based on the principles presented by Maillo [29, 88], where they
adapted the brute force algorithm to a distributed environment. Despite be-
ing a naive approach, in which no index structure has been used, it showed
promising results. However, they only compared against another algorithm that
they had previously developed themselves, hence it is difficult to appreciate
the real performance of the algorithm. Additionally, their original implementation¹
suffers from some limitations: an excessive number of
parameters; it is limited to an outdated version of Spark (not taking advantage
of new functionality introduced in Spark 2.0+); it inefficiently iterates
over the test instances on the driver by assigning a partition id and sorting
the instances (triggering a shuffle operation); and it does not maintain the Row
structure, which discards any extra information of the instances, among others.
We modified the original method, solving the previously mentioned issues
and incorporating support to keep the label information. Our implementa-
tion performs an exact nearest neighbor search by iteratively broadcasting a
buffer of test instances, instead of broadcasting full partitions of test data.
However, both methods are comparable and if the buffer size is set to the
partition size, they would be equivalent. The combination of this search
method and our proposed Ml-knn is named Ml-knn-it. Figure 4 shows
the functionality of the test phase, since this method does not require any
training. The diagram shows one iteration of the method, thus finding the
nearest neighbors of the test instances stored in the buffer.
First, it uses a local iterator of the test instances, which brings one partition
at a time to the driver, avoiding overloading its memory. This iterator
allows iterating locally over the test instances, while avoiding filtering the
test instances by a partition id and collecting the results. Next, a buffer
of a fixed size is filled with the local instances and broadcasted to all the
nodes. After that, a map operation over the train instances finds the
k nearest neighbors of the broadcasted instances within each local partition.
Finally, a reduce by key operation combines all the partitions, keeping
the top k nearest neighbors of each of the instances in the buffer. It is
important to keep the original instances in-memory for two reasons: they
will be accessed multiple times to find the neighbors, and this avoids undoing
¹ knn-is: https://github.com/JMailloH/kNN_IS
Figure 4: Test phase for Ml-knn-it. In each iteration a buffer of test instances is broad-
casted, which will be used to find the nearest neighbors among the training instances.
the transformations applied to this data to recover their original form.
The main advantages of this method are that it performs an exact search
and does not require any training. Furthermore, the reduce by key oper-
ation is more efficient than other operations which require a shuffle of data
such as join.
On the other hand, this method suffers from the same problem as the
traditional implementation: pair-wise distance computation. Therefore, each
test instance will be broadcasted to all the nodes once, regardless of the size of the
buffer or the number of iterations. Additionally, the result is created by
combining the train partitions and the broadcasted test instances; hence, it
does not modify the original test data, but instead creates new test data with
the neighbors attached. Consequently, it iteratively duplicates the test data
until the test phase is over, at which point the original test data can be discarded.
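A simplified sketch of one broadcast iteration of this search (Scala) is shown below; the types are illustrative, and the real Ml-knn-it implementation additionally preserves the Row structure and the label columns.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch of one Ml-knn-it iteration: broadcast a buffer of test instances and
// find their k nearest training neighbors with a map + reduceByKey.
def searchBuffer(sc: SparkContext,
                 train: RDD[Array[Double]],            // cached training instances
                 buffer: Array[(Long, Array[Double])], // (test id, features) in this iteration
                 k: Int): RDD[(Long, Array[(Double, Array[Double])])] = {

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val bcast = sc.broadcast(buffer)

  // Map: each partition computes its local top-k candidates for every buffered test instance.
  val local = train.mapPartitions { part =>
    val trainLocal = part.toArray
    bcast.value.iterator.map { case (id, query) =>
      (id, trainLocal.map(t => (dist(query, t), t)).sortBy(_._1).take(k))
    }
  }

  // Reduce by key: merge the per-partition candidates, keeping the global top-k per test instance.
  local.reduceByKey((a, b) => (a ++ b).sortBy(_._1).take(k))
}
```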
4.2. Hybrid Tree Multi-label k Nearest Neighbor ( Ml-knn-ht)
This method was presented in [31] and aims to use a tree-based index
structure to achieve high accuracy and search efficiency in a distributed en-
vironment, also there is a public implementation available 2. The available
implementation works seamlessly with the new versions of Spark, as well as
supporting specifying the columns to be preserved in the neighbors. There-
fore, it can be easily incorporated to our Ml-knn algorithm, and it is named
Ml-knn-ht.
This algorithm uses two different structures: a top tree (metric tree) and
multiple subtrees (spill trees) on the nodes, hence the combination of trees is
named hybrid tree. This method requires a train phase, where all the trees
are built, and a test phase, in which we can query the trees to find nearest
neighbors. Figure 5 presents the process to build the trees.
First, it uses a randomized sample of the training instances to build the
top tree (metric tree), whose leaf nodes correspond to specific partitions of
the data. A copy of the top tree is broadcasted to all the nodes, so all the train
instances can compute a value that identifies the index of the partition where
² knn-ht: https://github.com/saurfang/spark-knn
Figure 5: Train phase for Ml-knn-ht. First, a top tree (metric tree) is built locally on the
driver using a sample of train instances. Next the instances are indexed and partitioned
using the top tree. Then, each partition builds a local subtree (spill tree).
they belong. Next, the training data is repartitioned by the index, hence it is
sent to the partition indicated by the top tree. Then, each partition builds a
subtree (spill tree) which will index the local training data. At the end both
types of trees, top tree and subtrees, are persisted in-memory since they can
be consulted multiple times.
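The following simplified sketch (Scala) outlines this data flow; it is not the spark-knn implementation itself. The "top tree" is reduced to a set of sampled pivots with nearest-pivot assignment, and the per-partition "subtree" to the raw local array, so that the sample/broadcast/repartition/persist steps remain visible.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Simplified sketch of the Ml-knn-ht train phase.
def buildHybridIndexSketch(sc: SparkContext,
                           train: RDD[Array[Double]],
                           numPivots: Int,
                           seed: Long = 42L): RDD[(Int, Array[Array[Double]])] = {

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // 1) Sample the training data and build the "top tree" locally on the driver,
  //    then broadcast it to all the nodes.
  val pivots = train.takeSample(withReplacement = false, numPivots, seed)
  val bcPivots = sc.broadcast(pivots)

  // 2) Each training instance computes the index of the partition it belongs to
  //    (its nearest pivot), and the data is repartitioned by that index (one shuffle).
  val indexed = train.map { x =>
    val idx = bcPivots.value.zipWithIndex.minBy { case (p, _) => dist(x, p) }._2
    (idx, x)
  }.partitionBy(new HashPartitioner(numPivots))

  // 3) Each partition builds its local index (a spill tree in the real method) and
  //    persists it in memory, since it will be queried many times in the test phase.
  indexed.mapPartitionsWithIndex { (idx, it) =>
    Iterator((idx, it.map(_._2).toArray))
  }.cache()
}
```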
Once all the trees are computed and broadcasted, we can query the struc-
ture to find nearest neighbors among training instances. Figure 6 shows the
process to find the nearest neighbors of the test instances. First, each test instance
is indexed using the top tree and the test instances are repartitioned by index.
Next, each partition uses its local subtree to find the nearest neighbors
among the training instances. Then, the distance to the farthest neighbor is
used to evaluate if it is necessary to search other partitions.
One of the parameters in this method is the overlap buffer width for the
spill nodes. This buffer needs to be large enough to always include the k
nearest neighbors, but not so large that it negatively impacts the overall
performance. The details of this parameter estimation can be found in [31].
Figure 6: Test phase for Ml-knn-ht. First, the test instances are indexed and reparti-
tioned using the top tree (metric tree). Then, each partition finds its nearest neighbors
using their local subtree (spill tree).
The advantage of this algorithm is the speedup of the nearest neighbor
search using multiple index structures. First, by using the top tree to find the
corresponding partition for each instance, and second by using the subtrees to
find the nearest neighbors within each partition. Moreover, it only executes
one shuffle operation to find the partition where each instance belongs.
However, these advantages come at the cost of building the indexes of the
training data. This cost is reflected both in computational time (finding the
splits) and memory (storing the pivots). Moreover, to maximize the use of these
structures it avoids backtracking in both trees: in the top tree it might
instead send duplicate test instances to several nodes, and in the subtrees it
uses an overlap buffer to consider instances near the decision boundary.
Additionally, the algorithm uses as many partitions as leaf nodes on the
top tree, which can lead to not using all the available nodes in a big cluster.
Also, the size of the partitions is defined by the splits on the tree, thus pro-
ducing unbalanced workloads on the nodes by having partitions of different
sizes. The only way to minimize this impact is to have more partitions than
nodes, since this ensures that all the nodes have tasks assigned and the larger
tasks are broken down into smaller pieces that can be distributed more evenly
4.3. Locality Sensitive Hashing Multi-label k Nearest Neighbor (Ml-knn-lsh)
This method focuses on the application of locality sensitive hashing (LSH)
functions that preserve the similarity of the original feature space on a dis-
tributed environment. The main property of these functions is that they
map, with high probability, similar instances to the same hash entries. The
indexing is done using LSH functions and by building several hash tables to
increase the probability of collision for close points.
There are multiple hashing functions that fulfill those properties; the most
popular are: MinHash, which estimates the similarity between two sets as
the ratio of the size of their intersection to the size of their union;
Bucketed Random Projection, which projects the feature vectors onto a random
unit vector and partitions the projected result into buckets; and Sign Random
Projection, based on creating a bit vector with the signs of the projections
of the feature vector onto multiple random unit vectors.
In the nearest neighbor search problem, the data should be normalized
since we want to give equal weight to all the features without considering
their scale. We consider our data to be real numbers in the range [0,1], hence
Figure 7: Train phase for Ml-knn-lsh. A set of random vectors is created and broad-
casted. Then, the training instances will compute their sign projection using those vectors
to find their keys for each of the hash tables.
the hashing functions MinHash and Bucketed Random Projection would not
work as expected. For that reason, we decided to use Sign Random Projection
with Euclidean distance.
Implementations of both MinHash and Bucketed Random Projection have been
available in the official machine learning library for Spark [37] since
version 2.1.0. We decided to implement the Sign Random Projection
function by substituting the hash function of the Bucketed Random Projection.
This ensures that our implementation follows the original structure
and functionality. Furthermore, the original code required a combination
of join and group by operations to find the nearest neighbors of each
instance; however, we found that this produced performance issues, which
could be minimized by using the co-group operation instead.
This algorithm has a short train phase, since it only needs to prepare the
hashing functions and compute the values for the training instances. Figure 7
shows the train phase. First, for every hash table it creates as many unit
random vectors as the predefined signature length. Then, the random vectors
are broadcasted to all the nodes where the training instances will compute
their sign projection signature. Additionally, the hashed training instances
are persisted in-memory since they can be consulted multiple times.
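A minimal sketch of the sign random projection signature (Scala) is shown below; the parameter names (number of tables, signature length) match the parameters discussed later, while the helper names are illustrative. Instances whose signatures coincide in a given table are mapped to the same hash entry of that table.

```scala
import scala.util.Random

// Sketch: sign random projection signatures for LSH.
// For each of the `numTables` hash tables, draw `signatureLength` random unit vectors;
// the signature of an instance is the bit vector of signs of its projections.
def randomUnitVectors(numTables: Int, signatureLength: Int, dim: Int, seed: Long)
    : Array[Array[Array[Double]]] = {
  val rnd = new Random(seed)
  Array.fill(numTables, signatureLength) {
    val v = Array.fill(dim)(rnd.nextGaussian())
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }
}

// Signature of one instance for one table: one bit per random vector (1 if projection >= 0).
def signSignature(x: Array[Double], vectors: Array[Array[Double]]): Vector[Int] =
  vectors.toVector.map(v => if (v.zip(x).map { case (a, b) => a * b }.sum >= 0) 1 else 0)
```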
Figure 8: Test phase for Ml-knn-lsh. First, the test instances find their keys for each hash table by
computing the sign projection using the random vectors. Then, the data is “flatten” by the number of hash
tables and co-grouped so that instances with the same keys will be together. Finally, for each instance we
find the nearest neighbors among all the instances that belong to the same keys.
Figure 8 presents the test phase, where each test instance finds the
approximate nearest neighbors. First, the test instances repeat the same
steps of the train phase to find their hash values. Next, the training and
test instances use an explode operation, where the instances are “flattened”
by the number of hash tables, and then they are co-grouped by the tuple
(table position, signature). This operation combines the test instances with
the training instances that belong to the same entries of the hash tables.
Finally, a reduce by key operation finds the top knearest neighbors among
all the instances that were grouped together.
The main advantage of this method is the dimensionality reduction, since
it is less expensive to compare hashes (which are expected to have reduced
length) than comparing instances in the full feature space. Additionally, it
does not need to bring information to the driver since all the computation
happens between nodes.
On the other hand, as explained before, this method has a high
memory consumption since it needs to compare all the partitions to match
the hash entry and signature, thus triggering a Cartesian product, and for each
match it computes the distance. The number of matches can, however, be
controlled through the number of hash tables and the signature length.
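A schematic RDD-based version of this matching step (Scala) is sketched below, keying both sets by (table index, signature) and co-grouping them; the types are illustrative and `signSignature` refers to the sketch given earlier, not to the actual implementation.

```scala
import org.apache.spark.rdd.RDD

// Sketch of the Ml-knn-lsh test phase: explode by hash table, co-group by
// (table index, signature), compute distances within each group, and keep the top k.
def lshSearch(train: RDD[(Long, Array[Double])],
              test: RDD[(Long, Array[Double])],
              tables: Array[Array[Array[Double]]],   // random unit vectors per table
              k: Int): RDD[(Long, Array[(Double, Long)])] = {

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // "Flatten" every instance once per hash table, keyed by (table index, signature).
  def explode(rdd: RDD[(Long, Array[Double])]) = rdd.flatMap { case (id, x) =>
    tables.zipWithIndex.map { case (vs, t) => ((t, signSignature(x, vs)), (id, x)) }
  }

  explode(train).cogroup(explode(test)).flatMap { case (_, (trainGroup, testGroup)) =>
    // Within the same hash entry, compute the local top-k training neighbors of each test instance.
    testGroup.map { case (testId, q) =>
      (testId, trainGroup.map { case (trainId, x) => (dist(q, x), trainId) }
                         .toArray.sortBy(_._1).take(k))
    }
  }.reduceByKey((a, b) => (a ++ b).sortBy(_._1).take(k))   // merge candidates across entries
}
```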
4.4. Review of methods
Table 2 summarizes the key aspects of the three methods. It presents the
following characteristics: type of search, duplication of instances, shuffling
of data, indexing of instances, and estimated runtime of the method.
Table 2: Summary of the three Ml-knn strategies

Characteristic           Ml-knn-it            Ml-knn-ht                  Ml-knn-lsh
Type of search           Exact                Approximate                Approximate
Duplicates               Test inst. per node  Few test instances         (Train inst. + Test inst.) per num. of partitions
Shuffle                  Once per iteration   Once                       Once (cross product)
Index                    None                 Metric tree + Spill trees  Hash entries
Estimated train runtime  None                 Medium                     Low
Estimated test runtime   High                 Low                        High
5. Experimental setup
This section describes the experimental setup. First, Section 5.1 intro-
duces the metrics used to measure the quality of the predictions. Section
5.2 summarizes the characteristics of the benchmark datasets. Section 5.3
discusses the parameters selected for each of the methods. Finally, Section
5.4 specifies the hardware and software resources used in the experiments.
5.1. Performance metrics
The evaluation metrics for multi-label classification differ from the metrics
in traditional classification. The most common metrics used are presented
in [1], and can be categorized into example-based metrics and label-based
metrics. Example-based metrics evaluate all the labels for each instance and
average the result over all the instances. Label-based metrics evaluate each label and
then average the values over all labels. Within label-based metrics there are two
types: micro-averaged and macro-averaged, where the former is
more affected by the labels with more instances and the latter by the labels with fewer
instances. In this paper we address the performance with the most represen-
tative metrics of both groups. Let Z be the set of predicted labels and Y
the set of true labels; each set is represented by a binary vector where each
position corresponds to a label and takes value 1 if the label is present and 0 otherwise.
The following metrics can be defined:
Hamming Loss computes the symmetric difference, by applying the
Hamming distance between the predicted label set and the true label set.

    HammingLoss = (1/p) ∑_{i=1}^{p} |Z_i Δ Y_i|    (1)
Subset Accuracy requires that the predicted label set is exactly the
same as the true label set. Therefore, [[·]] = 1 if the two sets are equal
and 0 otherwise.

    SubsetAccuracy = (1/p) ∑_{i=1}^{p} [[Z_i = Y_i]]    (2)
Micro-averaged F-measure (Micro F1) is the harmonic mean between
the micro-precision and micro-recall, hence it takes both false positives
and false negatives into account. The numerator is twice the
number of labels common to both sets, and the denominator adds the number
of labels present in the predicted label set and in the true label set.
Micro-averaging is based on computing the metric over each label of each
instance.

    F1_micro = ( 2 ∑_{j=1}^{q} ∑_{i=1}^{p} Z_j^i Y_j^i ) / ( ∑_{j=1}^{q} ∑_{i=1}^{p} Z_j^i + ∑_{j=1}^{q} ∑_{i=1}^{p} Y_j^i )    (3)
Macro-averaged F-measure (Macro F1) is the macro-averaged version
of the previous metric. Macro-averaging is based on computing the metric for
each label over all the instances, and then averaging the values.

    F1_macro = (1/q) ∑_{j=1}^{q} ( 2 ∑_{i=1}^{p} Z_j^i Y_j^i ) / ( ∑_{i=1}^{p} Z_j^i + ∑_{i=1}^{p} Y_j^i )    (4)
First, the Hamming Loss and Subset Accuracy will be used as the most
restrictive metrics to indicate whether there exists any difference between
the exact method (Ml-knn-it) and the approximate methods (Ml-knn-ht
and Ml-knn-lsh). These are followed by Micro F1 and Macro F1, which are
complementary and will indicate whether a given difference has a significant
impact on the prediction performance.
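The four metrics can be computed directly from the binary prediction and ground-truth vectors; a small self-contained sketch (Scala, illustrative names, following Equations 1 to 4) is shown below.

```scala
// Sketch: multi-label evaluation metrics from binary vectors.
// `pred(i)(j)` and `truth(i)(j)` are 1 if instance i has (predicted/true) label j, else 0.
case class MultiLabelMetrics(hammingLoss: Double, subsetAccuracy: Double,
                             microF1: Double, macroF1: Double)

def evaluate(pred: Array[Array[Int]], truth: Array[Array[Int]]): MultiLabelMetrics = {
  val p = pred.length                  // number of test instances
  val q = pred.head.length             // number of labels

  // Hamming loss: size of the symmetric difference, averaged over instances (Eq. 1).
  val hl = pred.zip(truth).map { case (z, y) =>
    z.zip(y).count { case (a, b) => a != b }.toDouble
  }.sum / p
  // Subset accuracy: fraction of instances whose predicted set matches exactly (Eq. 2).
  val sa = pred.zip(truth).count { case (z, y) => z.sameElements(y) }.toDouble / p

  // Micro F1: pooled over all instance-label pairs (Eq. 3).
  val inter = (for (i <- 0 until p; j <- 0 until q) yield pred(i)(j) * truth(i)(j)).sum.toDouble
  val predSum = pred.map(_.sum).sum.toDouble
  val truthSum = truth.map(_.sum).sum.toDouble
  val microF1 = 2 * inter / (predSum + truthSum)

  // Macro F1: per-label F1, then averaged over the labels (Eq. 4).
  val macroF1 = (0 until q).map { j =>
    val interJ = (0 until p).map(i => pred(i)(j) * truth(i)(j)).sum.toDouble
    val denom  = (0 until p).map(i => pred(i)(j)).sum + (0 until p).map(i => truth(i)(j)).sum
    if (denom == 0) 0.0 else 2 * interJ / denom
  }.sum / q

  MultiLabelMetrics(hl, sa, microF1, macroF1)
}
```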
5.2. Datasets
Table 3 summarizes the characteristics of the 22 multi-label classification
datasets used throughout the experiments, where the number of instances,
number of features, number of labels, cardinality [1], and density [1] are
shown. These datasets have been collected from the Knowledge Discovery
and Intelligent Systems (KDIS) repository³, although originally they could
be found on the MULAN⁴ and MEKA⁵ repository websites.
³ KDIS: http://www.uco.es/kdis/mllresources
⁴ MULAN: http://mulan.sf.net
⁵ MEKA: http://meka.sf.net
Table 3: Summary description of the datasets
Dataset Instances Features Labels Cardinality Density
Flags 194 19 7 3.3918 0.4845
CAL500 502 68 174 26.0438 0.1497
CHD 555 49 6 2.5802 0.4300
Emotions 593 72 6 1.8685 0.3114
Birds 645 260 19 1.0140 0.0534
Medical 978 1,449 45 1.2454 0.0277
Plant 978 440 12 1.0787 0.0899
Water quality 1,060 16 14 5.0726 0.3623
Langlog 1,460 1,004 75 1.1801 0.0157
Enron 1,702 1,001 53 3.3784 0.0637
Scene 2,407 294 6 1.0740 0.1790
Yeast 2,417 103 14 4.2371 0.3026
Human 3,106 440 14 1.1851 0.0847
Slashdot 3,782 1,079 22 1.1809 0.0537
Corel5k 5,000 499 374 3.5220 0.0094
Bibtex 7,395 1,836 159 2.4019 0.0151
Yelp 10,806 671 5 1.6383 0.3277
20NG 19,300 1,006 20 1.0289 0.0514
TMC2007 28,596 500 22 2.2196 0.1009
Mediamill 43,907 120 101 4.3756 0.0433
Bookmarks 87,856 2,150 208 2.0281 0.0097
IMDB 120,919 1,001 28 1.9996 0.0714
Experiments were performed using 10-fold cross validation to objectively
evaluate the models’ performances. The data is divided fairly into 10 equally
sized folds. Each iteration of the cross-validation evaluation holds out a fold
as the test instances while the remainder of the data is used as the training
instances. These sets are stored and used by each algorithm, ensuring that
the instances held in each of the folds are the same for all of them. This
procedure ensures the model is not optimistically biased towards the full
dataset and the algorithms are evaluated fairly over the same data in each
fold. The folds are built using a stratified division [89], where each unique
subset of labels present in the data is considered as a fictitious label, and then
the desired percentage of instances is extracted from each of those labels.
This ensures that each of the folds has the same data distribution as the
original file.
All our experiments are carried out using classifiers that rely on a distance
metric, thus we decided to normalize (re-scale from 0 to 1) the datasets.
Normalizing the data ensures that all the features have the same weight when
computing the distances. Additionally, the attributes in all the datasets are
either numeric or binary, thus by normalizing the values we can ensure that
the distances are computed correctly for both types of data. The numeric
attributes will produce a numeric distance, and the binary attributes will
have distance 0 whenever the values are the same and 1 otherwise.
5.3. Methods and parameters
The methods to cover are the ones presented in Section 3, however some of
those methods depend on a series of parameters. The most relevant parame-
ter for all the methods that attempt to use nearest neighbors for classification
is the number of neighbors to consider; in this case we use k = 3. It is
important to note that although the final predictions would vary with the
number of neighbors, we do not aim to find the optimal number of neighbors,
but to set the parameters in a way that all the methods are compared
under equal conditions. The following parameters were used to facilitate the
reproducibility of the experiments, and to provide further insight into the
obtained results.
Ml-knn-it depends on the number of iterations used to broadcast all
the test instances and compute the pair-wise distances. However, we
consider that setting the number of iterations could eventually lead to
performance problems, since the larger the number of test instances,
the more information would need to be sent at once and could overload
the memory of the nodes. For that reason, we decided to set instead the
size of the buffer used to broadcast the instances. In our experiments
this method will always broadcast 1000 instances at a time from the
driver to the rest of nodes.
Ml-knn-ht requires a sample of train instances to build the top
tree in the driver, since it would not be feasible to use all the training
instances in terms of execution time and memory. We decided to use
a sample of 1000 train instances, hence ensuring that the
overhead would be the same as in Ml-knn-it.
Another critical parameter is the overlap buffer width for the spill trees.
The details of this parameter estimation can be found in [31]. However,
we briefly outline how the estimation works.
Assuming the points are uniformly distributed in the feature space, the overlap
buffer width should be approximately the average distance between instances.
Specifically, the number of instances within a certain radius of a given point is
proportional to the density of instances raised to the effective number
of features (dimensions) on which the data manifold exists:

    R_s = c / N_s^{1/d}    (5)

where R_s is the radius, N_s is the number of instances, d is the effective
number of dimensions, and c is a constant. To estimate R_s for the entire
data, we can take samples of different sizes N_s and compute R_s for each. We can
then estimate c and d using linear regression and, finally, calculate R_s
using the total number of observations.
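This estimation amounts to a linear regression in log space, log R_s = log c − (1/d) · log N_s; a small sketch (Scala, illustrative names) is shown below, assuming the sampled (N_s, R_s) pairs have already been computed.

```scala
// Sketch: estimate the overlap buffer width from sampled (N_s, R_s) pairs by
// fitting log(R_s) = log(c) - (1/d) * log(N_s) with ordinary least squares.
def estimateBufferWidth(samples: Seq[(Long, Double)], totalInstances: Long): Double = {
  val xs = samples.map { case (n, _) => math.log(n.toDouble) }
  val ys = samples.map { case (_, r) => math.log(r) }
  val xMean = xs.sum / xs.size
  val yMean = ys.sum / ys.size

  // Least-squares slope and intercept of y = a + b * x.
  val b = xs.zip(ys).map { case (x, y) => (x - xMean) * (y - yMean) }.sum /
          xs.map(x => (x - xMean) * (x - xMean)).sum
  val a = yMean - b * xMean                     // a = log(c), b = -1/d

  // Extrapolate R_s to the full dataset size.
  math.exp(a + b * math.log(totalInstances.toDouble))
}
```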
Ml-knn-lsh needs to set the number of hash tables and the signature
length in each entry. The number of hash tables will have a consid-
erable impact on memory, since the instances will be duplicated by
this value. On the other hand, the signature length will affect the
number of training and test instances that are grouped together. We
decided to study a wide range of values to evaluate their impact on the
quality of the metrics and the execution times; the selected values are
{1, 2, 4, 8, 16, 32} for the number of tables and {1, 2, 4, 8, 16, 32, 64} for the
signature length.
5.4. Hardware and software environment
All the experiments were executed on the same environment, a local cluster
composed of 2 Intel Xeon E5-2690 v4 CPUs with 28 cores (56 threads) in
total and 128 GB of memory. Out of all the resources, 6 cores and 25 GB
were reserved for the driver and the rest were used by the workers. The
experiments were executed using Spark 2.2.0 and Scala 2.11 on CentOS 7.4.
6. Experimental study
This section presents and discusses the experimental results. Section
6.1 compares the quality of the predictions, and includes the study of any
parameter that would have a significant impact on the predictions. Section
6.2 studies the execution times, and considers whether the performance
gain introduced by the index methods surpasses the initial overhead. Section
6.3 evaluates whether the methods scale out correctly with respect to the
increasing number of instances, attributes, and labels.
6.1. Prediction comparison: approximate versus exact
This experiment compares the quality of the predictions produced by the
three approaches. The predictions of Ml-knn-it are not affected by any
of its parameters, and it is considered an exact method. Ml-knn-ht can
be affected by the overlap buffer width, however the estimation of this value
was explained in Section 5.3. On the other hand, Ml-knn-lsh is expected
to be deeply affected by both the number of tables and the signature length
of its hash entries. First, we study the parameter configurations
of Ml-knn-lsh, and then the overall best setting is used in the final
comparison so that the three methods are compared under equal conditions.
Figure 9 illustrates the evolution of the predictions, and the impact on
execution time, produced by Ml-knn-lsh using up to 64 hash tables and
signatures up to size 32. Since it would not be possible to show the results
over all the datasets, the most representative datasets have been selected.
These datasets cover a wide range of number of instances, attributes and
labels. These datasets obtained a considerably high subset accuracy with the
exact search, thus this is the metric that would be affected the most by the loss
of performance. The plots on the left side present the subset accuracy, and
those on the right side show the execution times in minutes on a logarithmic
scale.
Figure 9: Subset Accuracy (left) and execution time in minutes on a logarithmic scale (right) obtained by Ml-knn-lsh, as a function of the signature length and the number of hash tables, on the (a) Medical, (b) Scene, (c) Emotions, and (d) Birds datasets.
The most accurate predictions are obtained when more tables are used
together with smaller signatures, since the number of tables reflects the number
of groups that will be created in the matching process and the signature
length inversely defines the size of those groups. Therefore, the more tables
and the smaller the signatures, the more instances are compared to find neighbors.
However, the trade-off is that the more instances are compared, the more
distances need to be computed, and the method becomes slower. The results
indicate that the execution times grow exponentially with the number of tables
and scale logarithmically with the signature size. Therefore, the best
compromise between prediction and execution performance is produced by
using two tables and a signature of size eight.
Table 4 evaluates the performance of the methods using the selected pa-
rameters and the two most restrictive metrics, subset accuracy and hamming
loss. In most of the datasets the best results are obtained by Ml-knn-it,
since it is an exact method. The few exceptions where Ml-knn-ht outper-
forms Ml-knn-it are produced by the fact that the overlap buffer might not
consider all the real neighbors, and this approximation conveniently consid-
ers more appropriate neighbors. This occurs only in a few cases, and
the overall prediction performance of Ml-knn-it and Ml-knn-ht can be
considered equivalent. On the other hand, the approximation made by
Ml-knn-lsh tends to produce lower values of subset accuracy; however, there
are some exceptions where the difference is small due to the data distribution.
Despite these exceptions, Ml-knn-lsh is considered the most inaccurate of the
three methods.
Table 5 compares the predictions using two metrics which are more repre-
sentative of the real quality of the predictions, micro-average F1 and macro-
average F1. Ml-knn-it and Ml-knn-ht obtained almost the same results,
since their performance was very similar for the most restrictive metrics. Ml-
knn-lsh differs more considerably from the other two methods. However,
the magnitude of the dissimilarity indicated by the subset accuracy is not
reflected by these metrics, since even if the predicted set is not exactly the
same as the true label set, they are considerably similar. Nevertheless, a larger
difference can be appreciated for some datasets, such as 20NG or Bibtex, because the data
is not uniformly distributed and the projections can lead to a poor approximation.
We conclude that Ml-knn-lsh falls behind in terms of prediction
results; however, the difference depends on the data.
Table 4: Hamming Loss and Subset Accuracy results obtained by the three methods
Hamming Loss Subset Accuracy
Dataset Ml-knn-it Ml-knn-ht Ml-knn-lsh Ml-knn-it Ml-knn-ht Ml-knn-lsh
Flags 0.2411 0.2440 0.2827 0.1667 0.1458 0.1042
CAL500 0.1360 0.1357 0.1370 0.0000 0.0000 0.0000
CHD 0.3031 0.3031 0.3007 0.1159 0.1159 0.1522
Emotions 0.2072 0.2050 0.2140 0.2703 0.2770 0.2568
Birds 0.0487 0.0497 0.0494 0.5031 0.5155 0.5031
Medical 0.0159 0.0156 0.0165 0.4751 0.5339 0.4932
Plant 0.0879 0.0890 0.0893 0.0422 0.0844 0.0042
Water quality 0.3080 0.3096 0.3248 0.0152 0.0152 0.0190
Langlog 0.0155 0.0158 0.0156 0.1399 0.1433 0.1365
Enron 0.0498 0.0498 0.0542 0.0669 0.1297 0.0209
Scene 0.0984 0.0929 0.1082 0.6456 0.6007 0.5973
Yeast 0.2046 0.2058 0.2060 0.1493 0.1493 0.1493
Human 0.0831 0.0831 0.0829 0.0026 0.0026 0.0000
Slashdot 0.0520 0.0517 0.0535 0.0538 0.0802 0.0033
Corel5k 0.0094 0.0094 0.0094 0.0024 0.0016 0.0000
Bibtex 0.0088 0.0088 0.0098 0.1075 0.1110 0.0043
Yelp 0.1916 0.1890 0.2096 0.4056 0.4100 0.3733
20NG 0.0394 0.0399 0.0507 0.2832 0.2830 0.0220
TMC2007 0.0636 0.0639 0.0714 0.2683 0.2555 0.2001
Mediamill 0.0278 0.0278 0.0313 0.1656 0.1655 0.0919
Bookmarks 0.0055 0.0055 — 0.2313 0.2337 —
IMDB 0.0714 0.0714 — 0.0009 0.0009 —
— Experiment could not execute due to computational/memory limitations.
6.2. Performance comparison: execution times for train and test phases
This experiment compares the execution times of the three methods. The Ml-knn algorithm is affected by the number of instances, attributes, and labels; consequently, we consider multiple datasets that represent a wide range of characteristics and analyze the impact of those factors.
Table 6 shows the execution time, in minutes, for each dataset sorted by the number of instances. The left side of the table presents the results for the train phase, and the right side presents the results for the test phase. Ml-knn-it has the best execution times for the small datasets, followed very closely by Ml-knn-ht. However, its performance declines as the number of instances grows, since the complexity of the algorithm is dominated by the pair-wise computation over the instances.
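This trend matches a back-of-the-envelope cost estimate for the two exact strategies (an approximation only, ignoring Spark's partitioning and communication costs), with n_tr training instances, n_te test instances, and d attributes:

\[
T_{\text{brute}} \approx O\!\left(n_{tr}\, n_{te}\, d\right), \qquad
T_{\text{tree}} \approx O\!\left(n_{tr} \log n_{tr}\right) + O\!\left(n_{te}\, d \log n_{tr}\right),
\]

where the tree estimate holds in the expected, low-dimensional case. The quadratic dependence on the number of instances in the brute-force search is what makes Ml-knn-it degrade first as datasets grow.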
Table 5: Micro-average F1 and Macro-average F1 results obtained by the three methods
Micro-average F1 Macro-average F1
Dataset Ml-knn-it Ml-knn-ht Ml-knn-lsh Ml-knn-it Ml-knn-ht Ml-knn-lsh
Flags 0.7362 0.7338 0.6844 0.5670 0.5649 0.4870
CAL500 0.3171 0.3315 0.3305 0.0805 0.0822 0.0832
CHD 0.6144 0.6121 0.6289 0.3154 0.3095 0.3410
Emotions 0.6406 0.6459 0.6154 0.6205 0.6286 0.5876
Birds 0.3318 0.2475 0.1065 0.1985 0.1624 0.0546
Medical 0.6520 0.6652 0.6435 0.6374 0.6553 0.6326
Plant 0.0741 0.1365 0.0078 0.0272 0.0413 0.0038
Water quality 0.5271 0.5238 0.4468 0.4307 0.4281 0.3357
Langlog 0.0058 0.0114 0.0000 0.2963 0.2982 0.2933
Enron 0.4130 0.4499 0.3875 0.2385 0.2432 0.2070
Scene 0.7126 0.7089 0.6840 0.7236 0.7125 0.6932
Yeast 0.6246 0.6228 0.6246 0.3459 0.3448 0.3472
Human 0.0045 0.0045 0.0000 0.0029 0.0029 0.0000
Slashdot 0.1017 0.1592 0.0056 0.1907 0.2089 0.1401
Corel5k 0.0210 0.0095 0.0023 0.1690 0.1677 0.1642
Bibtex 0.2949 0.3008 0.0807 0.0911 0.1010 0.0289
Yelp 0.6556 0.6607 0.6423 0.5604 0.5605 0.5357
20NG 0.4315 0.4287 0.0464 0.4218 0.4162 0.0451
TMC2007 0.6362 0.6573 0.5782 0.4073 0.4105 0.3002
Mediamill 0.6039 0.6036 0.5282 0.2125 0.2123 0.0653
Bookmarks 0.3551 0.3591 — 0.1134 0.1153 —
IMDB 0.0017 0.0016 — 0.0118 0.0117 —
— Experiment could not execute due to computational/memory limitations.
Ml-knn-ht quickly surpasses Ml-knn-it once the dataset is large enough that the construction of the index represents only a small fraction of the total execution time. The difference in performance is even greater for the test phase, since the index has already been constructed and only needs to be queried.
Ml-knn-lsh has the longest execution times for most of the datasets. The gap in performance is especially large for the test phase, since the co-group operation is less efficient between two different sets of instances. Despite computing pair-wise distances only within each group, instead of across all instances as Ml-knn-it does, the performance gain is dragged down by the duplication of instances and by the computation of the hash entries for all the hash tables. This can also lead to memory issues, as can be seen for
Table 6: Execution times for the train and test phases in minutes for the three methods
Train time Test time
Dataset Ml-knn-it Ml-knn-ht Ml-knn-lsh Ml-knn-it Ml-knn-ht Ml-knn-lsh
Flags 0.0470 0.0475 0.0413 0.0225 0.0143 0.0176
CAL500 0.0630 0.1524 1.5737 0.0323 0.0361 1.0140
CHD 0.0501 0.0597 0.0847 0.0243 0.0214 0.0651
Emotions 0.0517 0.0640 0.2412 0.0250 0.0214 0.3415
Birds 0.0640 0.0834 0.1530 0.0306 0.0245 0.2276
Medical 0.1910 0.2365 1.0820 0.0787 0.0715 2.3176
Plant 0.0938 0.1204 0.2549 0.0457 0.0433 0.4977
Water quality 0.0575 0.0880 0.4765 0.0244 0.0316 0.3426
Langlog 0.1919 0.3333 0.6817 0.0891 0.1058 0.5617
Enron 0.2560 0.3892 0.5233 0.0718 0.1027 0.2781
Scene 0.1397 0.1861 0.5484 0.0552 0.0689 1.9123
Yeast 0.1256 0.1776 1.3601 0.0438 0.0753 0.7848
Human 0.2609 0.2730 0.6156 0.0978 0.1231 1.3664
Slashdot 0.3831 0.3632 0.7124 0.1286 0.1335 1.0321
Corel5k 0.8753 3.6945 18.4798 0.2816 1.5695 4.7568
Bibtex 3.5429 2.8774 8.2819 0.6209 0.7865 6.8489
Yelp 3.1812 0.3867 0.4578 0.8685 0.1460 4.1907
20NG 14.4004 1.0986 3.7498 4.4056 0.3758 23.4893
TMC2007 44.9470 1.2671 9.1497 14.6589 0.6059 41.5393
Mediamill 168.2233 3.8833 691.1088 60.3439 2.0957 241.1892
Bookmarks 2170.5851 27.9763 — 543.7354 8.6002 —
IMDB 4844.1880 17.6109 — 1559.7201 6.3888 —
— Experiment could not execute due to computational/memory limitations.
the largest datasets, where it was not possible to finish the executions due to hardware limitations.
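A minimal RDD-level sketch of the explode-and-cogroup pattern described above is shown below; the Instance type, the signed random-projection signature, and every name are assumptions made for illustration and do not reproduce the actual Ml-knn-lsh code.

```scala
import org.apache.spark.rdd.RDD

// Illustrative instance representation (not the paper's data model).
case class Instance(id: Long, features: Array[Double])

object LshCogroupSketch {
  // Hypothetical signed random-projection signature: one bit per projection.
  def signature(x: Array[Double], projections: Array[Array[Double]]): String =
    projections.map { w =>
      if (w.zip(x).map { case (a, b) => a * b }.sum >= 0) '1' else '0'
    }.mkString

  // explode: each instance is emitted once per hash table, which is the
  // duplication that inflates memory and shuffle traffic.
  def bucketize(data: RDD[Instance],
                tables: Array[Array[Array[Double]]]): RDD[((Int, String), Instance)] =
    data.flatMap { inst =>
      tables.zipWithIndex.map { case (proj, t) =>
        ((t, signature(inst.features, proj)), inst)
      }
    }

  // co-group train and test instances that share a bucket in any table;
  // distances are then computed only inside each group (duplicate candidate
  // pairs across tables would be removed with .distinct() in practice).
  def candidatePairs(train: RDD[Instance], test: RDD[Instance],
                     tables: Array[Array[Array[Double]]]): RDD[(Long, Long)] =
    bucketize(train, tables).cogroup(bucketize(test, tables)).flatMap {
      case (_, (trainGroup, testGroup)) =>
        for (q <- testGroup.iterator; x <- trainGroup.iterator) yield (q.id, x.id)
    }
}
```

Because every instance is emitted once per table, both the shuffle volume and the memory footprint grow with the number of tables, which is the overhead discussed above.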
6.3. Scalability analysis on the number of instances, features, and labels
Another important aspect to consider is the evolution of the execution times with regard to the size of the data. The data can grow along multiple dimensions: the number of instances, the number of attributes, and the number of labels. This experiment studies the scalability of the three methods by observing the total execution times (train and test phases together) over different samplings of the 20NG dataset. Since this experiment only studies the computational performance, it is irrelevant which particular instances, attributes, or labels are selected.
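One possible way to draw such sub-samples with Spark is sketched below; the file path, the dense text format, and the sample sizes are hypothetical and only illustrate the idea of scaling one dimension while keeping the others fixed.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ScalabilitySampling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scalability-sampling").getOrCreate()
    val sc = spark.sparkContext

    // Each line: space-separated feature values (hypothetical path and format).
    val data: RDD[Array[Double]] =
      sc.textFile("hdfs:///datasets/20NG.txt").map(_.split(" ").map(_.toDouble))

    val total = data.count().toDouble

    // Instance axis: draw increasing fractions of the dataset.
    val byInstances = Seq(1000, 2000, 4000, 8000, 16000).map { n =>
      n -> data.sample(withReplacement = false, fraction = n / total, seed = 42L)
    }

    // Attribute axis: keep only the first m features of every instance.
    val byAttributes = Seq(1000, 2000, 4000, 8000, 10000).map { m =>
      m -> data.map(_.take(m))
    }

    byInstances.foreach { case (n, rdd) => println(s"~$n instances -> ${rdd.count()}") }
    spark.stop()
  }
}
```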
Figure 10: Execution times according to the number of instances (x-axis: number of instances, in thousands; y-axis: total execution time in minutes; series: Ml-knn-it, Ml-knn-ht, Ml-knn-lsh)
Figure 11: Execution times according to the number of attributes (x-axis: number of attributes, in thousands; y-axis: total execution time in minutes; series: Ml-knn-it, Ml-knn-ht, Ml-knn-lsh)
Figure 12: Execution times according to the number of labels (x-axis: number of labels, in hundreds; y-axis: total execution time in minutes; series: Ml-knn-it, Ml-knn-ht, Ml-knn-lsh)
Figure 10 presents the execution times for a range of 1,000 up to 16,000 instances, with a fixed number of attributes and labels.
The execution times of both Ml-knn-it and Ml-knn-lsh grow steeply with the number of instances. Ml-knn-it performs a pair-wise comparison between instances, hence this roughly quadratic growth is expected. Ml-knn-lsh, on the other hand, performs a reduced pair-wise comparison, only within grouped instances; however, the explode operation duplicates the instances by the number of tables. This introduces additional memory and computation overhead that eventually drags down the performance of the method. Finally, Ml-knn-ht presents the best scalability, with considerably reduced, close-to-linear execution times.
Figure 11 shows the performance of the methods on a range of attributes from 1,000 up to 10,000, with a fixed number of instances and labels.
Ml-knn-it and Ml-knn-ht scale linearly with the number of attributes, with Ml-knn-ht obtaining execution times orders of magnitude smaller since only a reduced number of instances have their distances computed. On the other hand, the execution time of Ml-knn-lsh grows much more steeply with the number of attributes. This is caused by the computation of the entries in the hash tables, whose cost depends directly on the number of attributes. Moreover, this method relies on exchanging a large amount of information between nodes, so high dimensionality also increases the network traffic.
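Assuming the usual random-projection hash family (an assumption about the exact variant used), each hash entry is

\[
h_{w,b}(x) = \left\lfloor \frac{w \cdot x + b}{r} \right\rfloor, \qquad w \in \mathbb{R}^{d},
\]

so building the signatures for L tables of length s costs on the order of L·s·d operations per instance. Although this per-instance cost is linear in the number of attributes d, it is paid for every duplicated copy of the instance, and the shuffled records themselves grow with d, which together is consistent with the sharp growth observed in Figure 11.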
Figure 12 illustrates the execution times varying the number of labels from 100 up to 1,000, with a fixed number of instances and attributes.
The three methods present roughly constant execution times regardless of the number of labels. This indicates that the execution times of Ml-knn are not strongly affected by the number of labels, which makes it a good alternative for datasets with high-dimensional label spaces. Although the number of labels does not have a significant impact on runtime, it could eventually lead to memory problems, especially for the methods that duplicate instances.
7. Conclusion
In this paper we have presented and evaluated three strategies to distribute Ml-knn over Spark. Each of the three approaches incorporates a different strategy for the distributed nearest neighbor search: brute force, tree-based index, and locality-sensitive hashing. The impact of these strategies on Ml-knn has been studied in detail, considering multiple metrics regarding prediction accuracy, execution times, and scalability.
The experimental study has shown that Ml-knn can handle large datasets over the Spark framework, obtaining competitive results both in prediction and in computational performance. Considering each of the three methods, Ml-knn-it obtained the baseline accuracy, since it is an exact method, and the lowest execution times for the smaller datasets. However, the experiments show that it scales poorly, especially with the number of instances. Ml-knn-ht produced an accuracy equivalent to that of an exact method, while also having the fastest execution times for most of the datasets. Additionally, it scaled out more efficiently than the other methods, being able to handle even the largest datasets. Ml-knn-lsh produced the most inconsistent results; although it showed larger differences on the strictest metrics, its final accuracy is acceptable for an approximate method. However, it scaled out poorly and had problems executing the largest datasets.
These results indicate that, by incorporating the right strategy for nearest neighbor searches, Spark enables Ml-knn to execute over large datasets that would not be feasible on a single machine. As future work, Ml-knn-it could be improved by exchanging the information using cross-joins with a specific test partition, thereby avoiding the use of the driver as an intermediary to exchange information. On the other hand, the algorithm with the largest room for improvement is Ml-knn-lsh. At the moment, the available implementation relies on flattening rows and co-grouping them, which has a high computational cost. A hash table on the driver could instead be used to indicate which partitions store specific buckets of the hash tables. Eventually, a proper implementation of Ml-knn-lsh could even surpass the performance of Ml-knn-ht for high-dimensional data.
Acknowledgments
This research was supported by the VCU Research Support Funds, the Span-
ish Ministry of Economy and Competitiveness and the European Regional
Development Fund, project TIN2017-83445-P.