Distributed Nearest Neighbor Classification for Large-Scale Multi-label Data on Spark

Authors: Jorge Gonzalez-Lopez (a), Sebastián Ventura (b,c,d), Alberto Cano (a)
(a) Department of Computer Science, Virginia Commonwealth University, USA
(b) Department of Computer Science and Numerical Analysis, University of Cordoba, Spain
(c) Computing and Information Technology, King Abdulaziz University, Saudi Arabia
(d) Maimonides Biomedical Research Institute of Cordoba, Spain
Abstract
Modern data is characterized by its ever-increasing volume and complexity,
particularly when data instances belong to many categories simultaneously.
This learning paradigm is known as multi-label classification and one of its
most renowned methods is the multi-label k nearest neighbor (Ml-knn).
The traditional implementations of this method are not feasible for large-scale
multi-label data due to their computational complexity and memory restrictions. We propose
a distributed Ml-knn implementation based on the MapReduce program-
ming model, implemented on Apache Spark. We compare three strategies for
distributed nearest neighbor search: 1) iteratively broadcasting instances, 2)
using a distributed tree-based index structure, and 3) building hash tables
to group instances. The experimental study evaluates the trade-off between
the quality of the predictions and runtimes on 22 benchmark datasets, and
compares the scalability using different sizes of data. The results indicate
that the tree-based index strategy outperforms the other approaches, having
a speedup of up to 266x for the largest dataset, while achieving an accuracy
equivalent to the exact methods. This strategy enables Ml-knn to scale
efficiently with respect to the size of the problem.
Keywords: Apache Spark, MapReduce, Distributed Computing, Big Data,
Multi-label classification, Nearest Neighbors
Email addresses: gonzalezlopej@vcu.edu (Jorge Gonzalez-Lopez),
sventura@uco.es (Sebastián Ventura), acano@vcu.edu (Alberto Cano)
Preprint submitted to Future Generation Computer Systems April 18, 2018
1. Introduction
The emergence of new data-oriented technologies has led to an exponen-
tial increase of the volume of data gathered by modern information systems.
The exponential growth of data, in size and complexity, has produced a
pressing need to develop scalable machine learning algorithms to harness the
worthwhile information in the vast collections of data.
Besides its large volume, modern data is characterized by its increased
complexity. A large part of the data generated nowadays is made of instances
that belong to multiple categories at once. This type of data is known as
multi-label data. Multi-label classification is the task of learning
the set of categories, also known as labels, that an instance belongs to. Multi-
label classification has attracted growing interest in the last decade, because
of the real-world applications that fit in the paradigm [1–3].
One of the simplest, yet most effective techniques is lazy learning, in which
the generalization of the data is delayed until a query is made, as opposed
to other techniques where a model is learned on the training data. The most
popular multi-label lazy learner is the multi-label k Nearest Neighbor (Ml-knn) [4],
which is a non-parametric method where the generalization is done in feature space.
As a lazy learning method, Ml-knn requires storing all the training instances; it then
computes statistical information about the set of labels and the training instances
by performing a pair-wise computation of a distance metric or similarity
function [5, 6]. This information can be used
to predict the set of labels of unseen instances; however, this comes at a high
computational and memory cost, preventing its application to big data problems.
Ml-knn has been used extensively to evaluate feature and/or label transformations,
since it is the method in which features or labels have
the largest impact on performance. Some of the scenarios in which it has
been applied are feature selection [7–9], learning by using label dependen-
cies and correlations [10, 11], label space transformation [12], dimensionality
reduction [13], and using specific features on a label space [14].
These issues can be handled using parallel computing in a distributed
environment, which can considerably improve the performance by sharing the
computational and memory resources of multiple machines [15]. The MapRe-
duce programming model [16] offers a simple and robust paradigm to handle
large-scale datasets in a cluster of computers. This model is suitable for
processing big data because of its fault-tolerant mechanism, which is highly
desirable for long-running executions. One of the first implementations of
the MapReduce model was Hadoop [17], yet one of its critical disadvantages
is that it processes data from a distributed file system, which introduces a
high latency. In contrast, Spark [18] provides in-memory computation, which
results in a large performance improvement, especially on iterative jobs. Using
in-memory data has proved to be particularly relevant in machine
learning scenarios [19]. Spark stands out as a powerful framework for its ease
of use, and it has been successfully applied in many scenarios [20–24].
1.1. Motivation
Ml-knn has been widely used in evaluating feature and label transformations.
Furthermore, various lazy learning methods for multi-label data
have been proposed in recent years [25–28]. These methods have advantages
and weaknesses, depending on the data distribution, feature dimensionality,
or number of labels. However, all these methods are based on the original
Ml-knn and suffer from its original limitations. The combination of the
memory and computation limitations of the nearest neighbor search, together
with the increased complexity of multi-label data, has produced a
pressing need to develop scalable solutions for Ml-knn.
In this paper, we propose a distributed implementation of the Ml-knn to
classify large-scale multi-label data using Spark. This implementation relies
on the nearest neighbor search method to gather statistical information, both
in the train and test phases. Hence, both its prediction quality and computational
performance will be bounded by the incorporated search method, even though the
search is only a fraction of all the distributed computation performed.
The distributed implementations of nearest neighbor search can be cate-
gorized into the following: brute force, tree indexes, and hash indexes. There
are multiple implementations of knn in distributed environments using Spark
in the literature [29–32], and each of them belongs to one of the groups men-
tioned. We selected the best method out of each strategy and incorporated
it into our distributed Ml-knn. To the best of our knowledge, all previous
work on distributed nearest neighbor search has focused on traditional classi-
fication, and this is the first study implementing and analyzing the scalability
of Ml-knn in a distributed environment.
The main contributions of this paper are as follows:
- Analysis and parallelization of Ml-knn in a distributed environment,
presenting a detailed explanation of the proposed MapReduce algorithm on Spark.
- Comparison of the three main strategies for the nearest neighbor search
in a distributed environment and evaluation of their impact. Specifically,
the Ml-knn performance is analyzed, focusing on the trade-off between
accuracy and runtime applied to multi-label classification.
- Evaluation of the experimental results: reliability of the predictions,
execution times for the train and test phases, and a study of scalability
with respect to the problem size.
Experimental results performed on 22 varied multi-label datasets show
that the proposed Ml-knn method, which uses a tree-based index structure,
outperforms the other methods in every scenario. This method produces pre-
dictions which are equivalent to those of an exact method, while considerably
reducing the execution time and scaling accordingly with respect to the size
of the problem (instances and labels).
The structure of this paper is as follows. Section 2 reviews the related
works. Section 3 introduces the distributed implementation for Ml-knn and
Section 4 presents the strategies to find nearest neighbors. Section 5 describes
the experimental study and Section 6 analyzes the results. Finally, Section 7
presents the conclusions of this work.
2. Background
This section reviews definitions and related works on the use of MapRe-
duce, nearest neighbor search, and multi-label learning. First, Section 2.1
introduces the concept of MapReduce and the Spark framework. Section 2.2
defines the nearest neighbor searches and the different strategies to optimize
the problem. Finally, Section 2.3 formally defines multi-label learning and
the multi-label k nearest neighbor (Ml-knn) algorithm.
2.1. MapReduce programming model and Spark framework
The MapReduce programming model [16] was developed to process data
using a distributed strategy, allowing it to scale to data that does not fit in the
physical memory of a single machine. This framework provides an abstraction
of the underlying hardware and software of the cluster. It partitions,
replicates, and distributes the data, providing fault tolerance. Moreover, it
schedules the jobs and the network communications; as a result, the user only
needs to implement the Map and Reduce primitives.
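As an illustration, the following minimal Spark snippet (Scala) shows how a computation is expressed purely through map and reduce primitives, leaving partitioning, scheduling, and fault tolerance to the framework; the input path and the statistic computed (feature means) are hypothetical examples, not part of the original text.

```scala
import org.apache.spark.sql.SparkSession

object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-reduce-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one comma-separated numeric feature vector per line.
    val lines = sc.textFile("hdfs:///data/instances.csv")

    // Map primitive: parse each line into a feature vector (runs locally on each partition).
    val vectors = lines.map(_.split(",").map(_.toDouble))

    // Reduce primitive: element-wise sum of all vectors, used here to compute feature means.
    val sum = vectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val count = vectors.count()
    println(sum.map(_ / count).mkString(", "))

    spark.stop()
  }
}
```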
Hadoop [17] is an open-source implementation of MapReduce. Despite its
popularity, Hadoop presents some important weaknesses, such as the impossibility
of maintaining data in memory, which leads to poor performance for online,
interactive, and/or iterative methods [33–35].
Spark [18] is a distributed computing platform that has become one of the
most powerful tools for the big data scenario. Spark was designed to overcome
the limitations of Hadoop; one of its main advantages is the possibility of
maintaining data in memory. In fact, Spark has been shown to outperform
Hadoop in many cases (up to 100x in memory) [36].
The main components of the Spark architecture are the driver, workers, and
executors. The driver loads the application and creates the set of tasks to
be executed. The workers are the set of nodes in the distributed environment,
each of them with at least one executor. The executors are distributed agents
that execute the tasks using local partitions of the data in their assigned
memory space. The scheduling of the tasks and the assignment of resources to
each executor is done by the cluster manager.
For large-scale data mining, multiple common learning algorithms and
statistics utilities were created and packaged into MLlib [37], the machine
learning library for Spark. This library provides support for several tasks,
such as classification, regression, filtering, clustering, and data preprocessing.
2.2. Nearest Neighbor search strategies
Nearest neighbor (NN) is a classic non-parametric and instance-based
technique that has attracted the attention of the research community due
to its simplicity and effectiveness. This technique has a wide range of ap-
plications such as density estimation [38], dimensional hashing [39], pattern
recognition [40], data compression [41], and so on.
The nearest neighbor search is an optimization problem, whose goal is to
find an instance that minimizes a certain distance or similarity function [5, 6].
The most popular distance used is the Euclidean distance; however, there are
other scenarios in which other criteria need to be considered, such as energy
efficiency [42–46], the load across machines [47], or privacy [48, 49]. A straightfor-
ward generalization of this problem is the k nearest neighbor search (knn),
which finds the k instances minimizing the distance function.
Whenever the search method performs a pair-wise comparison using all
the instances, it is called exact nearest neighbor search. This approach finds
the exact nearest neighbor at a high computational cost; however, there are
many techniques that reduce this complexity by indexing the feature space.
2.2.1. Tree indexes
One of the first methods developed was the kd-tree [50]. This structure
recursively subdivides the feature space by a hyperplane that is orthogonal
to one of the axes and that partitions the data points as evenly as possible.
In [51], the authors propose to simultaneously use multiple kd-trees to increase
the performance of nearest neighbor searches. They use rotations of the
dataset that force the tree to use features that otherwise would be discarded.
They show that by rotating the dataset to align it with its principal
axis direction using PCA, and then applying random Householder transformations
that preserve the PCA subspace of appropriate dimension,
the kd-tree performance can be significantly improved.
One of the weaknesses of kd-tree is that the indexed space can have a
high aspect ratio, which makes it impossible to use volume bounds. Arya
et al. [52] introduced a Balanced Box-Decomposition tree (bbd-tree) which
guarantees both balanced aspect ratio and a logarithmic depth. This mod-
ification allows performing error-bounded approximate search by considering
(1 + ε)-approximate nearest neighbors. This concept was also adapted by
Duncan et al. [53] using the Balanced Aspect Ratio tree (bar-tree), which
was later extended to higher dimensions [54]. This tree does not exclusively
use axis-orthogonal hyperplane cuts, which leads to good aspect ratio, bal-
anced depth, and convex regions. Other variations of the kd-tree are: the
pca-tree [55], the rp-tree [56], and the trinary projection tree [57].
The other main weakness of kd-tree is related to the curse of dimen-
sionality. kd-tree is very effective at low dimensions: after traveling down
the nodes of the tree, all the instances in one leaf tend to be much closer
to each other than to instances in other leaves. However, this property
disappears in high-dimensional spaces. This has been addressed by a later method
called Metric tree [58] which subdivides the feature space by a hyperplane
defined by the midpoint of two instances (pivots). This partitioning creates
two disjoint sets with no information shared between them. The search pro-
cess goes through the tree choosing the nearest node at each level, allowing
it to “backtrack” in case some branches remain unpruned. The Spill
tree [59] modifies the Metric tree, avoiding the tedious backtracking process
by allowing an overlap area between the nodes. This overlap buffer allows
the same instance to be indexed by both pivots, which leads to
increased accuracy at the cost of redundancy.
In order to combine the advantages of both the Metric tree and the Spill tree,
[59] proposes a combination called Hybrid tree. This structure allows
both types of nodes, where a decision is made at each node whether
to use an overlap node or a non-overlap node. In the search process, it only
backtracks at non-overlap nodes (as in a conventional Metric tree), and performs
defeatist search at overlapping nodes (as in a Spill tree).
Some other approaches, which are based on completely different concepts,
have been proposed. For example, [60] presents the product quantization
approach, in which the space is decomposed into low-dimensional subspaces
and the instances are represented by compact codes computed as the quantization
indices in these subspaces. These compact codes can be compared to
the query points using an asymmetric approximate distance. A modification
of the standard quantization process was introduced by [61], in which
an inverted index with product quantization produces a denser
subdivision of the search space.
2.2.2. Hashing indexes
The best known hashing based nearest neighbor technique is Locality
Sensitive Hashing (LSH) [62]. An LSH function maps the instances in the
feature space to a space of reduced dimensionality in a way that similar
instances map to the same hash entries. Then, a similarity search query can
be answered by first hashing the query instance and then finding the close
instances within the instances that have been mapped to the same entry. To
guarantee both good search quality and good search efficiency, one needs
to use multiple LSH tables and combine their results. Unfortunately, LSH
requires a large number of hash tables [63, 64]. There are some variants
of LSH such as multi-probe LSH [65] in which the number of hash tables is
reduced by searching other entries in the hash tables within a certain distance,
and LSH Forest [66], which removes the data-dependent parameters,
achieving better adaptation to skewed data distributions.
The performance of LSH methods is highly dependent on the hashing
functions they use. There is a large amount of research aimed at improving
hashing methods by using data-dependent hashing functions using various
techniques: parameter sensitive hashing [67], spectral hashing [68], random-
ized LSH from learned metrics [69], kernelized LSH [70], learned binary em-
bedding [71], shift-invariant kernel hashing [72], semi-supervised hashing [73],
optimized kernel hashing [74], and complementary hashing [75].
2.2.3. Graph indexes
Nearest neighbors graph methods build a graph structure in which ver-
tices represent the instances and edges connect nearest neighbors. There
are two critical components in these methods: query strategy and graph
construction.
There are multiple approaches that aim to minimize the impact of con-
sulting a nearest neighbor graph. In [76], the authors propose using a sample
of well-separated instances as seeds and starting the graph exploration using a
best-first strategy. Similarly, [77] incorporates a hill-climbing strategy and
picks the starting points at random.
The graph construction is the target of substantial research; however, these
methods do not scale, or are specific to certain similarity measures. Paredes
et al. [78] proposed two methods for the graph construction using general
metric spaces and low empirical complexity. However, both methods require
a global data structure and are difficult to parallelize across machines. [79]
proposes using divide-and-conquer methods to recursively partition the data. In
[80], the authors presented a graph construction technique using Morton ordering
and based on space-filling curves. [81] presents a graph construction method
based on local search, considering that a neighbor of a neighbor is likely
to be a neighbor too. Therefore, by initializing each vertex with a random
set of neighbors, the method iteratively improves the neighbors of each node.
Although some methods have focused on efficiently building a nearest neighbor
graph, they are not considered efficient enough in a distributed environment.
Consequently, these techniques
are not included in this study.
2.3. Multi-label K Nearest Neighbors ( Ml-knn)
Table 1 summarizes all the symbols and notation used to describe previous
works, as well as to explain our proposed solutions in detail.
Let X denote the domain of instances; a single instance x ∈ X is represented
as a set of features x = {x1, x2, . . . , xd}, where d is the number
of features. Multi-label learning considers a finite set of labels L = {y1, y2, . . . , yq},
where q is the number of labels. In a training set, each of the instances
is associated with a subset of labels, D = {(x1, Y1), (x2, Y2), . . . , (xm, Ym)},
(xi ∈ X, Yi ⊆ L). For convenience, let Yi also denote the binary multi-label vector of xi,
where its jth component Yi(j) is 1 if yj ∈ Yi and 0 otherwise.
Tsoumakas et al. [1] divide the methods for multi-label classification into
two categories: problem transformation and algorithm adaptation. The first
Table 1: Symbols and notation used in this paper

Definition                     Symbol/Notation
Instance Space                 X = R^d (or Z^d)
Number of Input Attributes     d
Instance                       x = (x1, x2, . . . , xd), x ∈ X
Label Space                    L = {y1, y2, . . . , yq}
Number of Labels               q
Label Set                      Y = {y1, y2, . . . , yq} ∈ {0, 1}^q
Predicted Label Set            Z = {z1, z2, . . . , zq} ∈ {0, 1}^q
Multi-label Training Set       D = {(xi, Yi), 1 ≤ i ≤ m}
Number of Training Instances   m
Multi-label Test Set           T = {(xi, Yi), 1 ≤ i ≤ p}
Number of Test Instances       p
Number of Neighbors            k
Neighbor Set                   N(x) = {x1, x2, . . . , xk} ⊆ X
one modifies the data to adapt it to traditional classification algorithms.
However, these methods either do not consider the correlations between labels,
leading to weak expressive power [82], or they do so at an exponential cost with respect to the
number of labels [2]. The latter focuses on adapting algorithms to work di-
rectly with multi-label data, such as decision trees [83], neural networks [84],
support vector machines [85], and multiple lazy learning methods [26].
Multi-label k Nearest Neighbor (Ml-knn) [4] was the first lazy learning
method proposed for multi-label classification, and it is still the most used
approach for this type of learning. This algorithm uses statistical informa-
tion from each label to predict the set of labels without any kind of data
transformation. Ml-knn inherits the advantages from both lazy learning
and Bayesian reasoning [86]. Additionally, class imbalance, which is
a common problem in multi-label data [87], is mitigated due to the prior
probabilities estimated for each label. On the other hand, one of the known
disadvantages of Ml-knn is that it is a first-order approach which reasons about the
relevance of each label separately.
Some alternatives to Ml-knn have been proposed that try to smooth the
first-order approach or the negative impact of the number of labels. DMl-
knn [25], instead of using only the statistical information from positive
instances, also considers the negative instances. BR-knn [26] combines
multiple knn classifiers, one per label, in a binary relevance (BR) manner. They use
the count of labels in the set of neighbors as the confidence score for the
predictions. MLCW-knn [27] improves the previous method by assigning
weights to each of the instances according to their distances to the query
sample. In a similar manner, the labels can be ranked according to the
probabilities of the label association using the neighboring samples around
a query sample [3]. IBLR-Ml-knn [28] combines linear regression and the
knn algorithm, having one classifier per label as in BR methods.
3. Distributed ML-KNN
We propose an Ml-knn implementation in Spark, which focuses on
scaling the algorithm in a distributed environment. The computation can be
distributed in both the train and test phases.
Firstly, the train phase computes the statistical information of the train-
ing instances by finding the prior and posterior probabilities. The prior
probabilities can be found by frequency counting of the labels. The poste-
rior probabilities are more complex and require a nearest neighbor search to gather
statistical information.
Secondly, the test phase uses the previously computed prior and posterior
probabilities. For each of the test instances, the probabilities are combined
with the information gathered by the nearest neighbors on the training in-
stances to produce a probability for each of the labels.
As can be seen, Ml-knn is a complex algorithm whose performance is
limited by two nearest neighbor searches: the first one among all the
training instances, and the second one between the test and training instances.
This introduces an increased complexity with respect to lazy methods in traditional
classification problems. Additionally, some aspects such as broadcasting val-
ues or persisting the right data in-memory can have a large impact on the
final performance. This section provides a detailed explanation of the imple-
mentation that maximizes the resources of a distributed environment.
3.1. Train phase: computing prior and posterior probabilities
The train phase is divided into computing the prior and posterior prob-
abilities. The former is computed by performing a frequency count of the labels, and the
latter by performing a frequency count conditioned on the neighbors.
Prior probabilities are defined as P(H_j) and P(¬H_j), and represent
the probabilities of a label being found in the dataset before the arrival of
new information. Figure 1 presents the process to compute these probabilities.
A user-defined Reduce Operation is applied to the collection of labels
of all the training instances. This operation adds the binary label vectors
Y into a vector µ_j = ∑_{i=1}^{m} [[y_j ∈ Y_i]], which counts the occurrences of each
label. Next, the prior probabilities are computed by averaging and smoothing
the count vector, P(H_j) = (s + µ_j)/(s × 2 + m). Additionally, we can compute
P(¬H_j) = 1 − P(H_j) as the complementary probability.
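A minimal sketch of this step in Spark (Scala) is shown below, assuming the training labels are available as an RDD of binary vectors and using the smoothing factor s; variable names are illustrative and not taken from the original implementation.

```scala
import org.apache.spark.rdd.RDD

// Sketch: prior probabilities via a reduce over the binary label vectors.
// `labels` holds one label vector Y (0/1 values) per training instance.
def priorProbabilities(labels: RDD[Array[Double]], s: Double = 1.0)
    : (Array[Double], Array[Double]) = {
  val m = labels.count()                                       // number of training instances
  // Reduce: element-wise sum of the label vectors -> count of each label (mu_j).
  val mu = labels.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  // Averaging and smoothing: P(H_j) = (s + mu_j) / (s * 2 + m).
  val prior = mu.map(c => (s + c) / (s * 2 + m))
  val priorNeg = prior.map(1.0 - _)                            // P(not H_j) = 1 - P(H_j)
  (prior, priorNeg)
}
```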
Posterior probabilities are defined as P(C_j | H_j) and P(C_j | ¬H_j), and represent
the probabilities of having a given number of neighbors with the jth label among the
k nearest neighbors, conditioned on the event of the label being present (or absent)
in the instance itself. Figure 2 presents the steps to compute these probabilities.
First, it finds the k nearest neighbors of every training instance, followed
by a user-defined Map Operation applied to each instance. This operation
computes the count C_j of each label among the instance's neighbors. Next, it creates
two frequency matrices, K_{rj} = κ_j[r] and K̃_{rj} = κ̃_j[r], for each instance x.
Each position (r, j) is initialized to 0, and it is updated with K(C_j, j) = 1
whenever y_j ∈ Y_i, or K̃(C_j, j) = 1 whenever y_j ∉ Y_i.
Figure 1: Ml-knn train phase: Computation of the prior probabilities P(H) and P(¬H).
Frequency count of the labels followed by averaging and smoothing the values.
Figure 2: Ml-knn train phase: Computation of the posterior probabilities P(C|H) and
P(C|¬H). A frequency count of the labels among the neighbors is performed
per instance, followed by averaging and smoothing the values.
Second, a user-defined Reduce Operation is applied to the collection of
K and K̃ matrices, adding the matrices of all the training instances. The final result
stores the number of training instances that have exactly r neighbors with the
jth label, both for the case y_j ∈ Y_i and y_j ∉ Y_i.
Finally, the posterior probabilities are computed by averaging and smoothing
the frequency matrices: P(C_j | H_j) = (s + K_{C_j,j}) / (s × (k + 1) + ∑_{r=0}^{k} K_{r,j}) and
P(C_j | ¬H_j) = (s + K̃_{C_j,j}) / (s × (k + 1) + ∑_{r=0}^{k} K̃_{r,j}).
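A sketch of this computation in Spark (Scala) follows; it assumes a hypothetical RDD pairing each training instance's binary label vector with its per-label neighbor counts C_j (produced by whichever nearest neighbor search is plugged in), so names and shapes are illustrative only.

```scala
import org.apache.spark.rdd.RDD

// Sketch: posterior probability estimation for Ml-knn.
// `data`: for each training instance, (Y, c) where Y(j) ∈ {0,1} is the label vector
// and c(j) is the number of its k neighbors carrying label j.
def posteriorProbabilities(data: RDD[(Array[Int], Array[Int])], q: Int, k: Int, s: Double = 1.0)
    : (Array[Array[Double]], Array[Array[Double]]) = {

  // Map: emit one ((r, j), (count_K, count_K~)) entry per label of each instance.
  val perInstance = data.flatMap { case (y, c) =>
    (0 until q).map { j =>
      if (y(j) == 1) ((c(j), j), (1L, 0L)) else ((c(j), j), (0L, 1L))
    }
  }
  // Reduce: add the frequency matrices of all training instances, keyed by (r, j).
  val freq = perInstance.reduceByKey((u, v) => (u._1 + v._1, u._2 + v._2)).collectAsMap()

  val post    = Array.ofDim[Double](k + 1, q)   // P(C_j = r | H_j)
  val postNeg = Array.ofDim[Double](k + 1, q)   // P(C_j = r | not H_j)
  for (j <- 0 until q) {
    val colPos = (0 to k).map(r => freq.getOrElse((r, j), (0L, 0L))._1)
    val colNeg = (0 to k).map(r => freq.getOrElse((r, j), (0L, 0L))._2)
    for (r <- 0 to k) {
      post(r)(j)    = (s + colPos(r)) / (s * (k + 1) + colPos.sum)
      postNeg(r)(j) = (s + colNeg(r)) / (s * (k + 1) + colNeg.sum)
    }
  }
  (post, postNeg)
}
```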
3.2. Test phase: Prediction of label set
The test phase induces the predicted label set for new unlabeled instances.
This set relies on the prior and posterior probabilities previously computed
in the train phase. Figure 3 shows the prediction process of the test phase.
First, it finds the k nearest neighbors of every test instance among the
training instances. Next, a user-defined Map Operation is performed on
each test instance to compute the count C_j of each label among its neighbors.
Then, each test instance looks up the corresponding values in the posterior
probabilities, and each label y_j is assigned by determining whether
P(H_j) × P(C_j | H_j) is greater than P(¬H_j) × P(C_j | ¬H_j).
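A compact sketch of this per-label decision rule (Scala), assuming the prior/posterior arrays produced in the train phase have been broadcast; the names are illustrative.

```scala
// Sketch: Ml-knn prediction rule for one test instance.
// `prior(j)` = P(H_j), `priorNeg(j)` = P(¬H_j),
// `post(r)(j)` = P(C_j = r | H_j), `postNeg(r)(j)` = P(C_j = r | ¬H_j),
// `c(j)` = number of the k nearest training neighbors carrying label j.
def predict(prior: Array[Double], priorNeg: Array[Double],
            post: Array[Array[Double]], postNeg: Array[Array[Double]],
            c: Array[Int]): Array[Int] =
  c.indices.map { j =>
    val yes = prior(j)    * post(c(j))(j)     // evidence for the label being relevant
    val no  = priorNeg(j) * postNeg(c(j))(j)  // evidence for the label being irrelevant
    if (yes > no) 1 else 0
  }.toArray
```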
Figure 3: Ml-knn test phase. First, the prior and posterior probabilities are broadcasted.
Then, each label set is predicted by combining the probabilities with the information
collected by the nearest neighbors.
4. Distributed Nearest Neighbors methods
The Ml-knn performance is going to be bounded by the k nearest neighbor
search, both in the train and test phases. The main issue of distributing
the search process over a cluster of nodes is that every time a node needs to
access the information of another node it triggers a shuffle operation, which
sends information over the network and is considered a slow process.
Although the most naive strategy would be to use a cross product and exchange
the information of all the nodes with each other, in practice this is
infeasible because of memory and network limitations.
There are multiple strategies that aim to minimize this impact, from
distributed index structures to hash matching. We propose three
versions of Ml-knn, and study their impact on the overall performance of
the algorithm. Each implementation represents one of the strategies studied
in Section 2.2, namely Ml-knn-it, Ml-knn-ht, and Ml-knn-lsh.
4.1. Iterative Multi-label k Nearest Neighbor (Ml-knn-it)
Our first approach was to incorporate an iterative version of Ml-knn
in Spark based on the principles presented by Maillo [29, 88], where they
adapted the brute force algorithm to a distributed environment. Despite be-
ing a naive approach, in which no index structure has been used, it showed
promising results. However, they only compared against another algorithm that
they had previously developed themselves, hence it is difficult to appreciate
the real performance of the algorithm. Additionally, their original implementation¹
suffers from some limitations: an excessive number of
parameters; it is limited to an outdated version of Spark (not taking advantage
of new functionality introduced in Spark 2.0+); it inefficiently iterates
over the test instances on the driver by assigning a partition id and sorting
the instances (triggering a shuffle operation); and it does not maintain the Row
structure, which discards any extra information of the instances, among others.
We modified the original method, solving the previously mentioned issues
and incorporating support to keep the label information. Our implementa-
tion performs an exact nearest neighbor search by iteratively broadcasting a
buffer of test instances, instead of broadcasting full partitions of test data.
However, both methods are comparable and if the buffer size is set to the
partition size, they would be equivalent. The combination of this search
method and our proposed Ml-knn is named Ml-knn-it. Figure 4 shows
the functionality of the test phase, since this method does not require any
training. The diagram shows one iteration of the method, thus finding the
nearest neighbors of the test instances stored in the buffer.
First, it uses a local iterator of the test instances, which brings one partition
at a time to the driver, avoiding overloading its memory. This iterator
allows iterating locally over the test instances, while avoiding filtering the
test instances by a partition id and collecting the results. Next, a buffer
of a fixed size is filled with the local instances and broadcasted to all the
nodes. After that, a map operation over the train instances finds the
k nearest neighbors of the broadcasted instances within each local partition.
Finally, a reduce by key operation combines all the partitions, keeping
the top k nearest neighbors of each of the instances in the buffer. It is
important to keep the original instances in-memory for two reasons: they
will be accessed multiple times to find the neighbors, and this avoids undoing
¹ knn-is: https://github.com/JMailloH/kNN_IS
Figure 4: Test phase for Ml-knn-it. In each iteration a buffer of test instances is broad-
casted, which will be used to find the nearest neighbors among the training instances.
the transformations applied to this data to recover their original form.
The main advantages of this method are that it performs an exact search
and does not require any training. Furthermore, the reduce by key oper-
ation is more efficient than other operations which require a shuffle of data
such as join.
On the other hand, this method suffers from the same problem as the
traditional implementation: pair-wise distance computation. Therefore, each
test instance will be broadcasted to all the nodes once, regardless of the size of the
buffer or the number of iterations. Additionally, the result is created by
combining the train partitions and the broadcasted test instances; hence, it
does not modify the original test data, but instead creates new test data with
the neighbors attached. Consequently, it iteratively duplicates the test data
until the test phase is over, at which point the original test data can be discarded.
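A simplified sketch of one broadcast iteration of this search (Scala) is shown below; the types are illustrative, and the real Ml-knn-it implementation additionally preserves the Row structure and the label columns.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch of one Ml-knn-it iteration: broadcast a buffer of test instances and
// find their k nearest training neighbors with a map + reduceByKey.
def searchBuffer(sc: SparkContext,
                 train: RDD[Array[Double]],            // cached training instances
                 buffer: Array[(Long, Array[Double])], // (test id, features) in this iteration
                 k: Int): RDD[(Long, Array[(Double, Array[Double])])] = {

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val bcast = sc.broadcast(buffer)

  // Map: each partition computes its local top-k candidates for every buffered test instance.
  val local = train.mapPartitions { part =>
    val trainLocal = part.toArray
    bcast.value.iterator.map { case (id, query) =>
      (id, trainLocal.map(t => (dist(query, t), t)).sortBy(_._1).take(k))
    }
  }

  // Reduce by key: merge the per-partition candidates, keeping the global top-k per test instance.
  local.reduceByKey((a, b) => (a ++ b).sortBy(_._1).take(k))
}
```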
4.2. Hybrid Tree Multi-label k Nearest Neighbor ( Ml-knn-ht)
This method was presented in [31] and aims to use a tree-based index
structure to achieve high accuracy and search efficiency in a distributed en-
vironment, also there is a public implementation available 2. The available
implementation works seamlessly with the new versions of Spark, as well as
supporting specifying the columns to be preserved in the neighbors. There-
fore, it can be easily incorporated to our Ml-knn algorithm, and it is named
Ml-knn-ht.
This algorithm uses two different structures: a top tree (metric tree) and
multiple subtrees (spill trees) on the nodes, hence the combination of trees is
named hybrid tree. This method requires a train phase, where all the trees
are built, and a test phase, in which we can query the trees to find nearest
neighbors. Figure 5 presents the process to build the trees.
First, it uses a randomized sample of the training instances to build the
top tree (metric tree), whose leaf nodes correspond to specific partitions of
the data. A copy of the top tree is broadcasted to all the nodes, so all the train
instances can compute a value that identifies the index of the partition where
² knn-ht: https://github.com/saurfang/spark-knn
Figure 5: Train phase for Ml-knn-ht. First, a top tree (metric tree) is built locally on the
driver using a sample of train instances. Next the instances are indexed and partitioned
using the top tree. Then, each partition builds a local subtree (spill tree).
they belong. Next, the training data is repartitioned by the index, hence it is
sent to the partition indicated by the top tree. Then, each partition builds a
subtree (spill tree) which will index the local training data. At the end both
types of trees, top tree and subtrees, are persisted in-memory since they can
be consulted multiple times.
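The following simplified sketch (Scala) outlines this data flow; it is not the spark-knn implementation itself. The "top tree" is reduced to a set of sampled pivots with nearest-pivot assignment, and the per-partition "subtree" to the raw local array, so that the sample/broadcast/repartition/persist steps remain visible.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Simplified sketch of the Ml-knn-ht train phase.
def buildHybridIndexSketch(sc: SparkContext,
                           train: RDD[Array[Double]],
                           numPivots: Int,
                           seed: Long = 42L): RDD[(Int, Array[Array[Double]])] = {

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // 1) Sample the training data and build the "top tree" locally on the driver,
  //    then broadcast it to all the nodes.
  val pivots = train.takeSample(withReplacement = false, numPivots, seed)
  val bcPivots = sc.broadcast(pivots)

  // 2) Each training instance computes the index of the partition it belongs to
  //    (its nearest pivot), and the data is repartitioned by that index (one shuffle).
  val indexed = train.map { x =>
    val idx = bcPivots.value.zipWithIndex.minBy { case (p, _) => dist(x, p) }._2
    (idx, x)
  }.partitionBy(new HashPartitioner(numPivots))

  // 3) Each partition builds its local index (a spill tree in the real method) and
  //    persists it in memory, since it will be queried many times in the test phase.
  indexed.mapPartitionsWithIndex { (idx, it) =>
    Iterator((idx, it.map(_._2).toArray))
  }.cache()
}
```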
Once all the trees are computed and broadcasted, we can query the struc-
ture to find nearest neighbors among training instances. Figure 6 shows the
process to find the nearest neighbors of the test instances. First, each test instance
is indexed using the top tree and the test instances are repartitioned by index.
Next, each partition uses its local subtree to find the nearest neighbors
among the training instances. Then, the distance to the farthest neighbor is
used to evaluate if it is necessary to search other partitions.
One of the parameters in this method is the overlap buffer width for the
spill nodes. This buffer needs to be large enough to always include the k
nearest neighbors, but not so large that it negatively impacts the overall
performance. The details of this parameter estimation can be found in [31].
Figure 6: Test phase for Ml-knn-ht. First, the test instances are indexed and reparti-
tioned using the top tree (metric tree). Then, each partition finds its nearest neighbors
using their local subtree (spill tree).
The advantage of this algorithm is the speedup of the nearest neighbor
search using multiple index structures. First, by using the top tree to find the
corresponding partition for each instance, and second by using the subtrees to
find the nearest neighbors within each partition. Moreover, it only executes
one shuffle operation to find the partition where each instance belongs.
However, these advantages come at the cost of building the indexes of the
training data. This cost is reflected both in computational time (finding the
splits) and memory (storing the pivots). Moreover, to maximize the use of these
structures it avoids backtracking in both trees: in the top tree it might
instead send duplicate test instances to several nodes, and in the subtrees it
uses an overlap buffer to consider instances near the decision boundary.
Additionally, the algorithm uses as many partitions as leaf nodes on the
top tree, which can lead to not using all the available nodes in a big cluster.
Also, the size of the partitions is defined by the splits on the tree, thus pro-
ducing unbalanced workloads on the nodes by having partitions of different
sizes. The only way to minimize this impact is to have more partitions than
nodes, since this ensures that all the nodes have tasks assigned and the larger
tasks are broken down into smaller pieces that can be distributed more evenly
4.3. Locality Sensitive Hashing Multi-label k Nearest Neighbor (Ml-knn-lsh)
This method focuses on the application of locality sensitive hashing (LSH)
functions that preserve the similarity of the original feature space on a dis-
tributed environment. The main property of these functions is that they
map, with high probability, similar instances to the same hash entries. The
indexing is done using LSH functions and by building several hash tables to
increase the probability of collision for close points.
There are multiple hashing functions that fulfill those properties; the most
popular are: MinHash, which estimates the similarity between two sets as
the ratio of the size of their intersection to the size of their union;
Bucketed Random Projection, which projects the feature vectors onto a random
unit vector and partitions the projected result into buckets; and Sign Random
Projection, based on creating a bit vector with the signs of the projections
of the feature vector onto multiple random unit vectors.
In the nearest neighbor search problem, the data should be normalized
since we want to give equal weight to all the features without considering
their scale. We consider our data to be real numbers in the range [0,1], hence
Figure 7: Train phase for Ml-knn-lsh. A set of random vectors is created and broad-
casted. Then, the training instances will compute their sign projection using those vectors
to find their keys for each of the hash tables.
the hashing functions MinHash and Bucketed Random Projection would not
work as expected. For that reason, we decided to use Sign Random Projection
with Euclidean distance.
Implementations of both MinHash and Bucketed Random Projection have been
available in the official machine learning library for Spark [37] since
version 2.1.0. We decided to implement the Sign Random Projection
function by substituting the hash function of the Bucketed Random Projection.
This ensures that our implementation follows the original structure
and functionality. Furthermore, the original code required a combination
of join and group by operations to find the nearest neighbors of each
instance; however, we found that this produced performance issues, which
could be minimized by using the co-group operation instead.
This algorithm has a short train phase, since it only needs to prepare the
hashing functions and compute the values for the training instances. Figure 7
shows the train phase. First, for every hash table it creates as many unit
random vectors as the predefined signature length. Then, the random vectors
are broadcasted to all the nodes where the training instances will compute
their sign projection signature. Additionally, the hashed training instances
are persisted in-memory since they can be consulted multiple times.
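A minimal sketch of the sign random projection signature (Scala) is shown below; the parameter names (number of tables, signature length) match the parameters discussed later, while the helper names are illustrative. Instances whose signatures coincide in a given table are mapped to the same hash entry of that table.

```scala
import scala.util.Random

// Sketch: sign random projection signatures for LSH.
// For each of the `numTables` hash tables, draw `signatureLength` random unit vectors;
// the signature of an instance is the bit vector of signs of its projections.
def randomUnitVectors(numTables: Int, signatureLength: Int, dim: Int, seed: Long)
    : Array[Array[Array[Double]]] = {
  val rnd = new Random(seed)
  Array.fill(numTables, signatureLength) {
    val v = Array.fill(dim)(rnd.nextGaussian())
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }
}

// Signature of one instance for one table: one bit per random vector (1 if projection >= 0).
def signSignature(x: Array[Double], vectors: Array[Array[Double]]): Vector[Int] =
  vectors.toVector.map(v => if (v.zip(x).map { case (a, b) => a * b }.sum >= 0) 1 else 0)
```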
Figure 8: Test phase for Ml-knn-lsh. First, the test instances find their keys for each hash table by
computing the sign projection using the random vectors. Then, the data is “flatten” by the number of hash
tables and co-grouped so that instances with the same keys will be together. Finally, for each instance we
find the nearest neighbors among all the instances that belong to the same keys.
Figure 8 presents the test phase, where each test instance finds the
approximate nearest neighbors. First, the test instances repeat the same
steps of the train phase to find their hash values. Next, the training and
test instances use an explode operation, where the instances are “flattened”
by the number of hash tables, and then they are co-grouped by the tuple
(table position, signature). This operation combines the test instances with
the training instances that belong to the same entries of the hash tables.
Finally, a reduce by key operation finds the top knearest neighbors among
all the instances that were grouped together.
The main advantage of this method is the dimensionality reduction, since
it is less expensive to compare hashes (which are expected to have reduced
length) than comparing instances in the full feature space. Additionally, it
does not need to bring information to the driver since all the computation
happens between nodes.
On the other hand, as explained before, this method has a high
memory consumption since it needs to compare all the partitions to match
the hash entry and signature, thus triggering a Cartesian product, and for each
match it computes the distance. The number of matches can, however, be
controlled through the number of hash tables and the signature length.
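A schematic RDD-based version of this matching step (Scala) is sketched below, keying both sets by (table index, signature) and co-grouping them; the types are illustrative and `signSignature` refers to the sketch given earlier, not to the actual implementation.

```scala
import org.apache.spark.rdd.RDD

// Sketch of the Ml-knn-lsh test phase: explode by hash table, co-group by
// (table index, signature), compute distances within each group, and keep the top k.
def lshSearch(train: RDD[(Long, Array[Double])],
              test: RDD[(Long, Array[Double])],
              tables: Array[Array[Array[Double]]],   // random unit vectors per table
              k: Int): RDD[(Long, Array[(Double, Long)])] = {

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // "Flatten" every instance once per hash table, keyed by (table index, signature).
  def explode(rdd: RDD[(Long, Array[Double])]) = rdd.flatMap { case (id, x) =>
    tables.zipWithIndex.map { case (vs, t) => ((t, signSignature(x, vs)), (id, x)) }
  }

  explode(train).cogroup(explode(test)).flatMap { case (_, (trainGroup, testGroup)) =>
    // Within the same hash entry, compute the local top-k training neighbors of each test instance.
    testGroup.map { case (testId, q) =>
      (testId, trainGroup.map { case (trainId, x) => (dist(q, x), trainId) }
                         .toArray.sortBy(_._1).take(k))
    }
  }.reduceByKey((a, b) => (a ++ b).sortBy(_._1).take(k))   // merge candidates across entries
}
```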
4.4. Review of methods
Table 2 summarizes the key aspects of the three methods. It presents the
following characteristics: type of search, duplication of instances, shuffling
of data, indexing of instances, and estimated runtime of the method.
Table 2: Summary of the three Ml-knn strategies

Characteristic           Ml-knn-it            Ml-knn-ht                  Ml-knn-lsh
Type of search           Exact                Approximate                Approximate
Duplicates               Test inst. per node  Few test instances         (Train inst. + Test inst.) per num. of partitions
Shuffle                  Once per iteration   Once                       Once (cross product)
Index                    None                 Metric tree + Spill trees  Hash entries
Estimated train runtime  None                 Medium                     Low
Estimated test runtime   High                 Low                        High
5. Experimental setup
This section describes the experimental setup. First, Section 5.1 intro-
duces the metrics used to measure the quality of the predictions. Section
5.2 summarizes the characteristics of the benchmark datasets. Section 5.3
discusses the parameters selected for each of the methods. Finally, Section
5.4 specifies the hardware and software resources used in the experiments.
5.1. Performance metrics
The evaluation metrics for multi-label classification differ from the metrics
in traditional classification. The most common metrics used are presented
in [1], and can be categorized into example-based metrics and label-based
metrics. Example-based metrics evaluate all the labels for each instance and
average the result over all the instances. Label-based metrics evaluate each label and
then average the values over all labels. Within label-based metrics there are two
types: micro-averaged and macro-averaged, where the former is
more affected by the labels with more instances and the latter by the labels with fewer
instances. In this paper we address the performance with the most represen-
tative metrics of both groups. Let Z be the set of predicted labels and Y
the set of true labels; each set is represented by a binary vector where each
position corresponds to a label and takes value 1 if the label is present and 0 otherwise.
The following metrics can be defined:
Hamming Loss computes the symmetric difference, by applying the
Hamming distance between the predicted label set and the true label set.

    HammingLoss = (1/p) ∑_{i=1}^{p} |Z_i Δ Y_i|    (1)
Subset Accuracy requires that the predicted label set is exactly the
same as the true label set. Therefore, [[·]] = 1 if the two sets are equal
and 0 otherwise.

    SubsetAccuracy = (1/p) ∑_{i=1}^{p} [[Z_i = Y_i]]    (2)
Micro-averaged F-measure (Micro F1) is the harmonic mean between
the micro-precision and micro-recall, hence it takes both false positives
and false negatives into account. The numerator is twice the
number of labels common to both sets, and the denominator adds the number
of labels present in the predicted label set and in the true label set.
Micro-averaging is based on computing the metric over each label of each
instance.

    F1_micro = ( 2 ∑_{j=1}^{q} ∑_{i=1}^{p} Z_j^i Y_j^i ) / ( ∑_{j=1}^{q} ∑_{i=1}^{p} Z_j^i + ∑_{j=1}^{q} ∑_{i=1}^{p} Y_j^i )    (3)
Macro-averaged F-measure (Macro F1) is the macro-averaged version
of the previous metric. Macro-averaging is based on computing the metric for
each label over all the instances, and then averaging the values.

    F1_macro = (1/q) ∑_{j=1}^{q} ( 2 ∑_{i=1}^{p} Z_j^i Y_j^i ) / ( ∑_{i=1}^{p} Z_j^i + ∑_{i=1}^{p} Y_j^i )    (4)
First, the Hamming Loss and Subset Accuracy will be used as the most
restrictive metrics to indicate whether there exists any difference between
the exact method (Ml-knn-it) and the approximate methods (Ml-knn-ht
and Ml-knn-lsh). These are followed by Micro F1 and Macro F1, which are
complementary and will indicate whether a given difference has a significant
impact on the prediction performance.
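The four metrics can be computed directly from the binary prediction and ground-truth vectors; a small self-contained sketch (Scala, illustrative names, following Equations 1 to 4) is shown below.

```scala
// Sketch: multi-label evaluation metrics from binary vectors.
// `pred(i)(j)` and `truth(i)(j)` are 1 if instance i has (predicted/true) label j, else 0.
case class MultiLabelMetrics(hammingLoss: Double, subsetAccuracy: Double,
                             microF1: Double, macroF1: Double)

def evaluate(pred: Array[Array[Int]], truth: Array[Array[Int]]): MultiLabelMetrics = {
  val p = pred.length                  // number of test instances
  val q = pred.head.length             // number of labels

  // Hamming loss: size of the symmetric difference, averaged over instances (Eq. 1).
  val hl = pred.zip(truth).map { case (z, y) =>
    z.zip(y).count { case (a, b) => a != b }.toDouble
  }.sum / p
  // Subset accuracy: fraction of instances whose predicted set matches exactly (Eq. 2).
  val sa = pred.zip(truth).count { case (z, y) => z.sameElements(y) }.toDouble / p

  // Micro F1: pooled over all instance-label pairs (Eq. 3).
  val inter = (for (i <- 0 until p; j <- 0 until q) yield pred(i)(j) * truth(i)(j)).sum.toDouble
  val predSum = pred.map(_.sum).sum.toDouble
  val truthSum = truth.map(_.sum).sum.toDouble
  val microF1 = 2 * inter / (predSum + truthSum)

  // Macro F1: per-label F1, then averaged over the labels (Eq. 4).
  val macroF1 = (0 until q).map { j =>
    val interJ = (0 until p).map(i => pred(i)(j) * truth(i)(j)).sum.toDouble
    val denom  = (0 until p).map(i => pred(i)(j)).sum + (0 until p).map(i => truth(i)(j)).sum
    if (denom == 0) 0.0 else 2 * interJ / denom
  }.sum / q

  MultiLabelMetrics(hl, sa, microF1, macroF1)
}
```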
5.2. Datasets
Table 3 summarizes the characteristics of the 22 multi-label classification
datasets used throughout the experiments, where the number of instances,
number of features, number of labels, cardinality [1], and density [1] are
shown. These datasets have been collected from the Knowledge Discovery
and Intelligent Systems (KDIS) repository³, although originally they could
be found on the MULAN⁴ and MEKA⁵ repository websites.
³ KDIS: http://www.uco.es/kdis/mllresources
⁴ MULAN: http://mulan.sf.net
⁵ MEKA: http://meka.sf.net
Table 3: Summary description of the datasets
Dataset Instances Features Labels Cardinality Density
Flags 194 19 7 3.3918 0.4845
CAL500 502 68 174 26.0438 0.1497
CHD 555 49 6 2.5802 0.4300
Emotions 593 72 6 1.8685 0.3114
Birds 645 260 19 1.0140 0.0534
Medical 978 1,449 45 1.2454 0.0277
Plant 978 440 12 1.0787 0.0899
Water quality 1,060 16 14 5.0726 0.3623
Langlog 1,460 1,004 75 1.1801 0.0157
Enron 1,702 1,001 53 3.3784 0.0637
Scene 2,407 294 6 1.0740 0.1790
Yeast 2,417 103 14 4.2371 0.3026
Human 3,106 440 14 1.1851 0.0847
Slashdot 3,782 1,079 22 1.1809 0.0537
Corel5k 5,000 499 374 3.5220 0.0094
Bibtex 7,395 1,836 159 2.4019 0.0151
Yelp 10,806 671 5 1.6383 0.3277
20NG 19,300 1,006 20 1.0289 0.0514
TMC2007 28,596 500 22 2.2196 0.1009
Mediamill 43,907 120 101 4.3756 0.0433
Bookmarks 87,856 2,150 208 2.0281 0.0097
IMDB 120,919 1,001 28 1.9996 0.0714
Experiments were performed using 10-fold cross validation to objectively
evaluate the models’ performances. The data is divided fairly into 10 equally
sized folds. Each iteration of the cross-validation evaluation holds out a fold
as the test instances while the remainder of the data is used as the training
instances. These sets are stored and used by each algorithm, ensuring that
the instances held in each of the folds are the same for all of them. This
procedure ensures the model is not optimistically biased towards the full
dataset and the algorithms are evaluated fairly over the same data in each
fold. The folds are built using a stratified division [89], where each unique
subset of labels present in the data is considered as a fictitious label, and then
the desired percentage of instances is extracted from each of those labels.
This ensures that each of the folds has the same data distribution as the
original file.
All our experiments are carried out using classifiers that rely on a distance
metric, thus we decided to normalize (re-scale from 0 to 1) the datasets.
Normalizing the data ensures that all the features have the same weight when
computing the distances. Additionally, the attributes in all the datasets are
either numeric or binary, thus by normalizing the values we can ensure that
the distances are computed correctly for both types of data. The numeric
attributes will produce a numeric distance, and the binary attributes will
have distance 0 whenever the values are the same and 1 otherwise.
5.3. Methods and parameters
The methods to cover are the ones presented in Section 3, however some of
those methods depend on a series of parameters. The most relevant parame-
ter for all the methods that attempt to use nearest neighbors for classification
is the number of neighbors to consider; in this case we use k = 3. It is
important to note that although the final predictions would vary with the
number of neighbors, we do not aim to find the optimal number of neighbors,
but to set the parameters in a way that all the methods are compared
under equal conditions. The following parameters were used to facilitate the
reproducibility of the experiments, and to provide further insight into the
obtained results.
Ml-knn-it depends on the number of iterations used to broadcast all
the test instances and compute the pair-wise distances. However, we
consider that setting the number of iterations could eventually lead to
performance problems, since the larger the number of test instances,
the more information would need to be sent at once and could overload
the memory of the nodes. For that reason, we decided to set instead the
size of the buffer used to broadcast the instances. In our experiments
this method will always broadcast 1000 instances at a time from the
driver to the rest of nodes.
Ml-knn-ht requires a sample of train instances to build the top
tree in the driver, since it would not be feasible to use all the training
instances in terms of execution time and memory. We decided to use
a sample of 1000 train instances, hence ensuring that the
overhead would be the same as in Ml-knn-it.
Another critical parameter is the overlap buffer width for the spill trees.
The details of this parameter estimation can be found in [31]. However,
we briefly outline how the estimation works.
Assuming the points are uniformly distributed in the feature space, the overlap
buffer width should be approximately the average distance between instances.
Specifically, the number of instances within a certain radius of a given point is
proportional to the density of instances raised to the effective number
of features (dimensions) on which the data manifold exists:

    R_s = c / N_s^{1/d}    (5)

where R_s is the radius, N_s is the number of instances, d is the effective
number of dimensions, and c is a constant. To estimate R_s for the entire
data, we can take samples of different sizes N_s and compute R_s for each. We can
then estimate c and d using linear regression and, finally, calculate R_s
using the total number of observations.
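This estimation amounts to a linear regression in log space, log R_s = log c − (1/d) · log N_s; a small sketch (Scala, illustrative names) is shown below, assuming the sampled (N_s, R_s) pairs have already been computed.

```scala
// Sketch: estimate the overlap buffer width from sampled (N_s, R_s) pairs by
// fitting log(R_s) = log(c) - (1/d) * log(N_s) with ordinary least squares.
def estimateBufferWidth(samples: Seq[(Long, Double)], totalInstances: Long): Double = {
  val xs = samples.map { case (n, _) => math.log(n.toDouble) }
  val ys = samples.map { case (_, r) => math.log(r) }
  val xMean = xs.sum / xs.size
  val yMean = ys.sum / ys.size

  // Least-squares slope and intercept of y = a + b * x.
  val b = xs.zip(ys).map { case (x, y) => (x - xMean) * (y - yMean) }.sum /
          xs.map(x => (x - xMean) * (x - xMean)).sum
  val a = yMean - b * xMean                     // a = log(c), b = -1/d

  // Extrapolate R_s to the full dataset size.
  math.exp(a + b * math.log(totalInstances.toDouble))
}
```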
Ml-knn-lsh needs to set the number of hash tables and the signature
length in each entry. The number of hash tables will have a consid-
erable impact on memory, since the instances will be duplicated by
this value. On the other hand, the signature length will affect the
number of training and test instances that are grouped together. We
decided to study a wide range of values to evaluate their impact on the
quality of the metrics and the execution times; the selected values are
{1, 2, 4, 8, 16, 32} for the number of tables and {1, 2, 4, 8, 16, 32, 64} for the
signature length.
5.4. Hardware and software environment
All the experiments were executed on the same environment, a local cluster
composed of 2 Intel Xeon E5-2690 v4 CPUs with 28 cores (56 threads) in
total and 128 GB of memory. Out of all the resources, 6 cores and 25 GB
were reserved for the driver and the rest were used by the workers. The
experiments were executed using Spark 2.2.0 and Scala 2.11 on CentOS 7.4.
6. Experimental study
This section presents and discusses the experimental results. Section
6.1 compares the quality of the predictions, and includes the study of any
parameter that would have a significant impact on the predictions. Section
6.2 studies the execution times, and considers whether the performance
gain introduced by the index methods surpasses the initial overhead. Section
6.3 evaluates whether the methods scale out correctly with respect to the
increasing number of instances, attributes, and labels.
6.1. Prediction comparison: approximate versus exact
This experiment compares the quality of the predictions produced by the
three approaches. The predictions of Ml-knn-it are not affected by any
of its parameters, and it is considered an exact method. Ml-knn-ht can
be affected by the overlap buffer width, however the estimation of this value
was explained in Section 5.3. On the other hand, Ml-knn-lsh is expected
to be deeply affected by both the number of tables and the signature length
of its hash entries. First, we study the parameter configurations
of Ml-knn-lsh, and then the overall best setting is used in the final
comparison so that the three methods are compared under equal conditions.
Figure 9 illustrates the evolution of the predictions, and the impact on
execution time, produced by Ml-knn-lsh using up to 64 hash tables and
signatures up to size 32. Since it would not be possible to show the results
over all the datasets, the most representative datasets have been selected.
These datasets cover a wide range of number of instances, attributes and
labels. These datasets obtained a considerably high subset accuracy with the
exact search, thus this is the metric that would be affected the most by the loss
of performance. The plots on the left side present the subset accuracy, and
those on the right side show the execution times in minutes on a logarithmic
scale.
Figure 9: Subset Accuracy (left) and execution time in minutes on a logarithmic scale (right) obtained by Ml-knn-lsh, as a function of the signature length and the number of hash tables, on the (a) Medical, (b) Scene, (c) Emotions, and (d) Birds datasets.
The most accurate predictions are obtained when more tables are used
together with smaller signatures, since the number of tables reflects the number
of groups that will be created in the matching process and the signature
length inversely defines the size of those groups. Therefore, the more tables
and the smaller the signatures, the more instances are compared to find neighbors.
However, the trade-off is that the more instances are compared, the more
distances need to be computed, and the method becomes slower. The results
indicate that the execution times grow exponentially with the number of tables
and scale logarithmically with the signature size. Therefore, the best
compromise between prediction and execution performance is produced by
using two tables and a signature of size eight.
Table 4 evaluates the performance of the methods using the selected pa-
rameters and the two most restrictive metrics, subset accuracy and hamming
loss. In most of the datasets the best results are obtained by Ml-knn-it,
since it is an exact method. The few exceptions where Ml-knn-ht outper-
forms Ml-knn-it are produced by the fact that the overlap buffer might not
consider all the real neighbors, and this approximation conveniently consid-
ers more appropriate neighbors. This occurs only in a few cases, and
the overall prediction performance of Ml-knn-it and Ml-knn-ht can be
considered equivalent. On the other hand, the approximation made by
Ml-knn-lsh tends to produce lower values of subset accuracy; however, there
are some exceptions where the difference is small due to the data distribution.
Despite these exceptions, Ml-knn-lsh is considered the most inaccurate of the
three methods.
Table 5 compares the predictions using two metrics which are more repre-
sentative of the real quality of the predictions, micro-average F1 and macro-
average F1. Ml-knn-it and Ml-knn-ht obtained almost the same results,
since their performance was very similar for the most restrictive metrics. Ml-
knn-lsh differs more considerably from the other two methods. However,
the magnitude of the dissimilarity indicated by the subset accuracy is not
reflected by these metrics, since even if the predicted set is not exactly the
same as the true label set, they are considerably similar. Nevertheless, a larger
difference can be appreciated for some datasets, such as 20NG or Bibtex, because the data
is not uniformly distributed and the projections can lead to a poor approximation.
We conclude that Ml-knn-lsh falls behind in terms of prediction
results; however, the difference depends on the data.
Table 4: Hamming Loss and Subset Accuracy results obtained by the three methods
Hamming Loss Subset Accuracy
Dataset Ml-knn-it Ml-knn-ht Ml-knn-lsh Ml-knn-it Ml-knn-ht Ml-knn-lsh
Flags 0.2411 0.2440 0.2827 0.1667 0.1458 0.1042
CAL500 0.1360 0.1357 0.1370 0.0000 0.0000 0.0000
CHD 0.3031 0.3031 0.3007 0.1159 0.1159 0.1522
Emotions 0.2072 0.2050 0.2140 0.2703 0.2770 0.2568
Birds 0.0487 0.0497 0.0494 0.5031 0.5155 0.5031
Medical 0.0159 0.0156 0.0165 0.4751 0.5339 0.4932
Plant 0.0879 0.0890 0.0893 0.0422 0.0844 0.0042
Water quality 0.3080 0.3096 0.3248 0.0152 0.0152 0.0190
Langlog 0.0155 0.0158 0.0156 0.1399 0.1433 0.1365
Enron 0.0498 0.0498 0.0542 0.0669 0.1297 0.0209
Scene 0.0984 0.0929 0.1082 0.6456 0.6007 0.5973
Yeast 0.2046 0.2058 0.2060 0.1493 0.1493 0.1493
Human 0.0831 0.0831 0.0829 0.0026 0.0026 0.0000
Slashdot 0.0520 0.0517 0.0535 0.0538 0.0802 0.0033
Corel5k 0.0094 0.0094 0.0094 0.0024 0.0016 0.0000
Bibtex 0.0088 0.0088 0.0098 0.1075 0.1110 0.0043
Yelp 0.1916 0.1890 0.2096 0.4056 0.4100 0.3733
20NG 0.0394 0.0399 0.0507 0.2832 0.2830 0.0220
TMC2007 0.0636 0.0639 0.0714 0.2683 0.2555 0.2001
Mediamill 0.0278 0.0278 0.0313 0.1656 0.1655 0.0919
Bookmarks 0.0055 0.0055 — 0.2313 0.2337 —
IMDB 0.0714 0.0714 — 0.0009 0.0009 —
— Experiment could not execute due to computational/memory limitations.
6.2. Performance comparison: execution times for train and test phases
This experiment compares the execution times of the three methods. The Ml-knn algorithm is affected by the number of instances, attributes, and labels; consequently, we consider multiple datasets that represent a wide range of characteristics and analyze the impact of those factors.
Table 6 shows the execution time, in minutes, for each dataset sorted by the number of instances. The left side of the table presents the results for the train phase, and the right side presents the results for the test phase. Ml-knn-it has the best execution times for the small datasets, followed very closely by Ml-knn-ht. However, its performance declines as the number of instances grows, since the complexity of the algorithm is dominated by the pair-wise computation over the instances.
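This trend matches a back-of-the-envelope cost estimate for the two exact strategies (an approximation only, ignoring Spark's partitioning and communication costs), with n_tr training instances, n_te test instances, and d attributes:

\[
T_{\text{brute}} \approx O\!\left(n_{tr}\, n_{te}\, d\right), \qquad
T_{\text{tree}} \approx O\!\left(n_{tr} \log n_{tr}\right) + O\!\left(n_{te}\, d \log n_{tr}\right),
\]

where the tree estimate holds in the expected, low-dimensional case. The quadratic dependence on the number of instances in the brute-force search is what makes Ml-knn-it degrade first as datasets grow.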
Table 5: Micro-average F1 and Macro-average F1 results obtained by the three methods
Micro-average F1 Macro-average F1
Dataset Ml-knn-it Ml-knn-ht Ml-knn-lsh Ml-knn-it Ml-knn-ht Ml-knn-lsh
Flags 0.7362 0.7338 0.6844 0.5670 0.5649 0.4870
CAL500 0.3171 0.3315 0.3305 0.0805 0.0822 0.0832
CHD 0.6144 0.6121 0.6289 0.3154 0.3095 0.3410
Emotions 0.6406 0.6459 0.6154 0.6205 0.6286 0.5876
Birds 0.3318 0.2475 0.1065 0.1985 0.1624 0.0546
Medical 0.6520 0.6652 0.6435 0.6374 0.6553 0.6326
Plant 0.0741 0.1365 0.0078 0.0272 0.0413 0.0038
Water quality 0.5271 0.5238 0.4468 0.4307 0.4281 0.3357
Langlog 0.0058 0.0114 0.0000 0.2963 0.2982 0.2933
Enron 0.4130 0.4499 0.3875 0.2385 0.2432 0.2070
Scene 0.7126 0.7089 0.6840 0.7236 0.7125 0.6932
Yeast 0.6246 0.6228 0.6246 0.3459 0.3448 0.3472
Human 0.0045 0.0045 0.0000 0.0029 0.0029 0.0000
Slashdot 0.1017 0.1592 0.0056 0.1907 0.2089 0.1401
Corel5k 0.0210 0.0095 0.0023 0.1690 0.1677 0.1642
Bibtex 0.2949 0.3008 0.0807 0.0911 0.1010 0.0289
Yelp 0.6556 0.6607 0.6423 0.5604 0.5605 0.5357
20NG 0.4315 0.4287 0.0464 0.4218 0.4162 0.0451
TMC2007 0.6362 0.6573 0.5782 0.4073 0.4105 0.3002
Mediamill 0.6039 0.6036 0.5282 0.2125 0.2123 0.0653
Bookmarks 0.3551 0.3591 — 0.1134 0.1153 —
IMDB 0.0017 0.0016 — 0.0118 0.0117 —
— Experiment could not execute due to computational/memory limitations.
Ml-knn-ht quickly surpasses Ml-knn-it once the dataset is large enough that the construction of the index represents only a small fraction of the total execution time. The difference in performance is even greater for the test phase, since the index has already been constructed and only needs to be queried.
Ml-knn-lsh has the longest execution times for most of the datasets. The gap in performance is especially large for the test phase, since the co-group operation is less efficient between two different sets of instances. Despite computing pair-wise distances only within each group, instead of across all instances as Ml-knn-it does, the performance gain is dragged down by the duplication of instances and by the computation of the hash entries for all the hash tables. This can also lead to memory issues, as can be seen for
Table 6: Execution times for the train and test phases in minutes for the three methods
Train time Test time
Dataset Ml-knn-it Ml-knn-ht Ml-knn-lsh Ml-knn-it Ml-knn-ht Ml-knn-lsh
Flags 0.0470 0.0475 0.0413 0.0225 0.0143 0.0176
CAL500 0.0630 0.1524 1.5737 0.0323 0.0361 1.0140
CHD 0.0501 0.0597 0.0847 0.0243 0.0214 0.0651
Emotions 0.0517 0.0640 0.2412 0.0250 0.0214 0.3415
Birds 0.0640 0.0834 0.1530 0.0306 0.0245 0.2276
Medical 0.1910 0.2365 1.0820 0.0787 0.0715 2.3176
Plant 0.0938 0.1204 0.2549 0.0457 0.0433 0.4977
Water quality 0.0575 0.0880 0.4765 0.0244 0.0316 0.3426
Langlog 0.1919 0.3333 0.6817 0.0891 0.1058 0.5617
Enron 0.2560 0.3892 0.5233 0.0718 0.1027 0.2781
Scene 0.1397 0.1861 0.5484 0.0552 0.0689 1.9123
Yeast 0.1256 0.1776 1.3601 0.0438 0.0753 0.7848
Human 0.2609 0.2730 0.6156 0.0978 0.1231 1.3664
Slashdot 0.3831 0.3632 0.7124 0.1286 0.1335 1.0321
Corel5k 0.8753 3.6945 18.4798 0.2816 1.5695 4.7568
Bibtex 3.5429 2.8774 8.2819 0.6209 0.7865 6.8489
Yelp 3.1812 0.3867 0.4578 0.8685 0.1460 4.1907
20NG 14.4004 1.0986 3.7498 4.4056 0.3758 23.4893
TMC2007 44.9470 1.2671 9.1497 14.6589 0.6059 41.5393
Mediamill 168.2233 3.8833 691.1088 60.3439 2.0957 241.1892
Bookmarks 2170.5851 27.9763 — 543.7354 8.6002 —
IMDB 4844.1880 17.6109 — 1559.7201 6.3888 —
— Experiment could not execute due to computational/memory limitations.
the largest datasets, where it was not possible to finish the executions due to hardware limitations.
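A minimal RDD-level sketch of the explode-and-cogroup pattern described above is shown below; the Instance type, the signed random-projection signature, and every name are assumptions made for illustration and do not reproduce the actual Ml-knn-lsh code.

```scala
import org.apache.spark.rdd.RDD

// Illustrative instance representation (not the paper's data model).
case class Instance(id: Long, features: Array[Double])

object LshCogroupSketch {
  // Hypothetical signed random-projection signature: one bit per projection.
  def signature(x: Array[Double], projections: Array[Array[Double]]): String =
    projections.map { w =>
      if (w.zip(x).map { case (a, b) => a * b }.sum >= 0) '1' else '0'
    }.mkString

  // explode: each instance is emitted once per hash table, which is the
  // duplication that inflates memory and shuffle traffic.
  def bucketize(data: RDD[Instance],
                tables: Array[Array[Array[Double]]]): RDD[((Int, String), Instance)] =
    data.flatMap { inst =>
      tables.zipWithIndex.map { case (proj, t) =>
        ((t, signature(inst.features, proj)), inst)
      }
    }

  // co-group train and test instances that share a bucket in any table;
  // distances are then computed only inside each group (duplicate candidate
  // pairs across tables would be removed with .distinct() in practice).
  def candidatePairs(train: RDD[Instance], test: RDD[Instance],
                     tables: Array[Array[Array[Double]]]): RDD[(Long, Long)] =
    bucketize(train, tables).cogroup(bucketize(test, tables)).flatMap {
      case (_, (trainGroup, testGroup)) =>
        for (q <- testGroup.iterator; x <- trainGroup.iterator) yield (q.id, x.id)
    }
}
```

Because every instance is emitted once per table, both the shuffle volume and the memory footprint grow with the number of tables, which is the overhead discussed above.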
6.3. Scalability analysis on the number of instances, features, and labels
Another important aspect to consider is the evolution of the execution times with regard to the size of the data. The data can grow along multiple dimensions: the number of instances, the number of attributes, and the number of labels. This experiment studies the scalability of the three methods by observing the total execution times (train and test phases together) over different samplings of the 20NG dataset. Since this experiment only studies the computational performance, it is irrelevant which particular instances, attributes, or labels are selected.
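One possible way to draw such sub-samples with Spark is sketched below; the file path, the dense text format, and the sample sizes are hypothetical and only illustrate the idea of scaling one dimension while keeping the others fixed.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ScalabilitySampling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scalability-sampling").getOrCreate()
    val sc = spark.sparkContext

    // Each line: space-separated feature values (hypothetical path and format).
    val data: RDD[Array[Double]] =
      sc.textFile("hdfs:///datasets/20NG.txt").map(_.split(" ").map(_.toDouble))

    val total = data.count().toDouble

    // Instance axis: draw increasing fractions of the dataset.
    val byInstances = Seq(1000, 2000, 4000, 8000, 16000).map { n =>
      n -> data.sample(withReplacement = false, fraction = n / total, seed = 42L)
    }

    // Attribute axis: keep only the first m features of every instance.
    val byAttributes = Seq(1000, 2000, 4000, 8000, 10000).map { m =>
      m -> data.map(_.take(m))
    }

    byInstances.foreach { case (n, rdd) => println(s"~$n instances -> ${rdd.count()}") }
    spark.stop()
  }
}
```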
Figure 10: Execution times according to the number of instances (x-axis: number of instances, in thousands; y-axis: total execution time in minutes; series: Ml-knn-it, Ml-knn-ht, Ml-knn-lsh)
Figure 11: Execution times according to the number of attributes (x-axis: number of attributes, in thousands; y-axis: total execution time in minutes; series: Ml-knn-it, Ml-knn-ht, Ml-knn-lsh)
Figure 12: Execution times according to the number of labels (x-axis: number of labels, in hundreds; y-axis: total execution time in minutes; series: Ml-knn-it, Ml-knn-ht, Ml-knn-lsh)
Figure 10 presents the execution times for a range of 1,000 up to 16,000 instances, with a fixed number of attributes and labels.
The execution times of both Ml-knn-it and Ml-knn-lsh grow steeply with the number of instances. Ml-knn-it performs a pair-wise comparison between instances, hence this roughly quadratic growth is expected. Ml-knn-lsh, on the other hand, performs a reduced pair-wise comparison, only within grouped instances; however, the explode operation duplicates the instances by the number of tables. This introduces additional memory and computation overhead that eventually drags down the performance of the method. Finally, Ml-knn-ht presents the best scalability, with considerably reduced, close-to-linear execution times.
Figure 11 shows the performance of the methods on a range of attributes from 1,000 up to 10,000, with a fixed number of instances and labels.
Ml-knn-it and Ml-knn-ht scale linearly with the number of attributes, with Ml-knn-ht obtaining execution times orders of magnitude smaller since only a reduced number of instances have their distances computed. On the other hand, the execution time of Ml-knn-lsh grows much more steeply with the number of attributes. This is caused by the computation of the entries in the hash tables, whose cost depends directly on the number of attributes. Moreover, this method relies on exchanging a large amount of information between nodes, so high dimensionality also increases the network traffic.
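Assuming the usual random-projection hash family (an assumption about the exact variant used), each hash entry is

\[
h_{w,b}(x) = \left\lfloor \frac{w \cdot x + b}{r} \right\rfloor, \qquad w \in \mathbb{R}^{d},
\]

so building the signatures for L tables of length s costs on the order of L·s·d operations per instance. Although this per-instance cost is linear in the number of attributes d, it is paid for every duplicated copy of the instance, and the shuffled records themselves grow with d, which together is consistent with the sharp growth observed in Figure 11.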
Figure 12 illustrates the execution times varying the number of labels from 100 up to 1,000, with a fixed number of instances and attributes.
The three methods present roughly constant execution times regardless of the number of labels. This indicates that the execution times of Ml-knn are not strongly affected by the number of labels, which makes it a good alternative for datasets with high-dimensional label spaces. Although the number of labels does not have a significant impact on runtime, it could eventually lead to memory problems, especially for the methods that duplicate instances.
7. Conclusion
In this paper we have presented and evaluated three strategies to distribute Ml-knn over Spark. Each of the three approaches incorporates a different strategy for the distributed nearest neighbor search: brute force, tree-based index, and locality-sensitive hashing. The impact of these strategies on Ml-knn has been studied in detail, considering multiple metrics regarding prediction accuracy, execution times, and scalability.
The experimental study has shown that Ml-knn can handle large datasets over the Spark framework, obtaining competitive results both in prediction and in computational performance. Considering each of the three methods, Ml-knn-it obtained the baseline accuracy, since it is an exact method, and the lowest execution times for the smaller datasets. However, the experiments show that it scales poorly, especially with the number of instances. Ml-knn-ht produced an accuracy equivalent to that of an exact method, while also having the fastest execution times for most of the datasets. Additionally, it scaled out more efficiently than the other methods, being able to handle even the largest datasets. Ml-knn-lsh produced the most inconsistent results; although it showed larger differences on the strictest metrics, its final accuracy is acceptable for an approximate method. However, it scaled out poorly and had problems executing the largest datasets.
These results indicate that, by incorporating the right strategy for nearest neighbor searches, Spark enables Ml-knn to execute over large datasets that would not be feasible on a single machine. As future work, Ml-knn-it could be improved by exchanging the information using cross-joins with a specific test partition, thereby avoiding the use of the driver as an intermediary to exchange information. On the other hand, the algorithm with the largest room for improvement is Ml-knn-lsh. At the moment, the available implementation relies on flattening rows and co-grouping them, which has a high computational cost. A hash table on the driver could instead be used to indicate which partitions store specific buckets of the hash tables. Eventually, a proper implementation of Ml-knn-lsh could even surpass the performance of Ml-knn-ht for high-dimensional data.
Acknowledgments
This research was supported by the VCU Research Support Funds, the Span-
ish Ministry of Economy and Competitiveness and the European Regional
Development Fund, project TIN2017-83445-P.