Overview
Data mining in distributed environment: a survey
Wensheng Gan,1 Jerry Chun-Wei Lin,1* Han-Chieh Chao2 and Justin Zhan3
Due to the rapid growth of resource sharing, distributed systems have been developed and used to perform large-scale computations. Data mining (DM) provides powerful techniques for finding meaningful and useful information in a very large amount of data, and has a wide range of real-world applications. However, traditional DM algorithms assume that the data is centrally collected, memory-resident, and static. It is challenging to manage large-scale data and to process it with very limited resources. For example, large amounts of data are quickly produced and stored at multiple locations, and it becomes increasingly expensive to centralize them in a single place. Moreover, traditional DM algorithms face several problems and challenges, such as memory limits, low processing ability, and inadequate hard disk space. To solve these problems, DM in a distributed computing environment [also called distributed data mining (DDM)] has emerged as a valuable alternative in many applications. In this study, a survey of state-of-the-art DDM techniques is provided, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preserving of distributed data mining. We finally summarize the opportunities of data mining tasks in distributed environments. © 2017 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2017, 7:e1216. doi: 10.1002/widm.1216
INTRODUCTION
With the rapid development of information technology and data collection, Knowledge Discovery in Databases (KDD) provides a powerful capability to discover meaningful and useful information from a collection of data [1-4]. KDD has numerous real-life applications and has resulted in several DM tasks, such as association rule mining (ARM) [2,3], sequential pattern mining (SPM) [4,5], clustering [6,7], classification [8,9], and outlier detection [10],
among others. Depending on different requirements
in various domains and applications, the discovered
knowledge can be generally classified as frequent itemsets and association rules [2,11-13], sequential patterns [1,4,5], sequential rules [14,15], graphs [16,17], high-utility patterns [18-20], weight-based patterns [21,22], and other interesting patterns [23,24]. As an important task for a wide range of real-world applications, frequent itemset mining (FIM) or ARM has been extensively studied. Two well-known algorithms, Apriori [25] and FP-growth [3], were proposed to mine frequent itemsets and association rules based on the generation-and-test and pattern-growth approaches, respectively [3]. Many algorithms have been developed to efficiently mine the desired patterns and information from various types of databases [2,3,6,12,17,23,25].
In general, the distribution of data and computation allows researchers and engineers to solve many problems, and it can be applied in various applications that are distributed in nature. Distributed
*Correspondence to: jerrylin@ieee.org
1 School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen University Town Xili, Shenzhen, China
2 Department of Computer Science and Information Engineering, National Dong Hwa University, Shoufeng, Taiwan
3 Department of Computer Science, University of Nevada, Las Vegas, NV, USA
Conflict of interest: The authors have declared no conflicts of interest for this article.
Volume 7, November/December 2017 © 2017 Wiley Periodicals, Inc. 1 of 19
systems, in which distributed computational units are connected and organized by networks to meet the demands of both large-scale and high-performance computing, have received considerable attention over the past decades [26-31]. Many types of distributed systems, such as grids [32,33], peer-to-peer (P2P) systems [34], ad hoc networks [35], cloud computing systems [36], and online social network systems [37], have been widely studied. Currently, the applications of distributed systems are varied, such as web services, scientific computation, and file storage. At the same time, DM has also been extensively studied [2,3,6,12,17,23,25].
By using DM techniques, organizations, businesses, companies, and scientific centers can discover different kinds of hidden but useful and meaningful patterns and information. As mentioned before, distributed collections of data can be analyzed by DM techniques [28]. An important scenario of DM is that the databases are distributed among two or more parties, and each party owns a portion of the data. In the past, traditional methods typically assumed that the data is centralized and memory-resident [2,3,6,12,17,23,25]. This assumption is no longer tenable in distributed systems. Unfortunately, a direct application of traditional mining algorithms to distributed databases is not effective because it incurs a large amount of communication overhead. Implementing high-performance DM in distributed computing environments has thus become critical for exploiting the scalability of a system.
In traditional DM technologies, a centralized approach is fundamentally inappropriate for many reasons, such as the huge amount of data, the infeasibility of centralizing data stored at multiple sites, bandwidth limitations, energy limitations, and privacy concerns. Therefore, it is important to develop a more adaptable and flexible mining framework to discover hidden but useful and meaningful patterns and information from distributed and complex databases instead of centralized ones. To solve these problems, DM in distributed environments [also called distributed data mining (DDM)] has emerged as an important research area [38-41]. In the DDM literature, one of two assumptions is commonly adopted as to how data is distributed across sites: homogeneously (horizontally partitioned) or heterogeneously (vertically partitioned) [42].
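To make the two partitioning assumptions concrete, the following minimal sketch (the table, the sites, and the column split are hypothetical illustrations of ours, not taken from the surveyed systems) shows how one record table could be split horizontally or vertically across two sites:

```python
# Illustrative sketch: horizontal vs. vertical partitioning of one table
# across two sites (toy data; the split itself is a hypothetical example).

rows = [
    {"id": 1, "age": 25, "city": "Shenzhen"},
    {"id": 2, "age": 31, "city": "Taipei"},
    {"id": 3, "age": 47, "city": "Las Vegas"},
]

# Homogeneous (horizontal) partitioning: every site stores the same
# attributes, but a different subset of the records.
site_a = rows[:2]
site_b = rows[2:]
assert all(set(r) == {"id", "age", "city"} for r in site_a + site_b)

# Heterogeneous (vertical) partitioning: every site stores all records,
# but a different subset of the attributes (sharing the key "id").
site_c = [{"id": r["id"], "age": r["age"]} for r in rows]
site_d = [{"id": r["id"], "city": r["city"]} for r in rows]
assert len(site_c) == len(rows) and len(site_d) == len(rows)
```

In the horizontal case a mining algorithm can run the same local computation at each site; in the vertical case the sites must join on the shared key before a global pattern can be assembled.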
In general, DDM deals with the challenges of analyzing distributed data and offers many algorithmic solutions to perform different data analysis and mining operations in a fundamentally distributed manner, paying careful attention to resource constraints. To improve the performance and scalability of DM, many researchers have provided techniques that work in distributed environments such as grid computing [32,33], the cloud [36], and Hadoop (the popular open-source implementation of MapReduce [26], http://hadoop.apache.org), distributing the mining computation over more than a single node. Previous studies [38-41] have shown that DDM is a powerful tool for end-users, enterprises, or governments to analyze data and discover different kinds of useful knowledge. It provides new opportunities but also poses some challenges for DM.
Although some related surveys have been published previously, most of them provide a very preliminary review of a single type of distributed system, such as surveys of load balancing in grids [32,33], in cloud computing [27], and in peer-to-peer (P2P) systems [34,43]. How can the related studies on various types of DM in distributed systems be summarized under a general taxonomy? The methods summarized in this study cover not only distributed systems [44,45], but also the related literature on DM [12], parallel computing [46], big data technologies [47,48], and database management [49]. This study thus aims to review current research on DDM. The main contributions of this study are described as follows:
1. We first point out the differences between traditional DM algorithms and those based on distributed environments. There are more challenges to be encountered when accomplishing DM tasks in a distributed system.
2. We review contemporary works on DM in distributed environments in recent years. This is a high-level survey of distributed system techniques for DM in several aspects, including distributed frequent itemset mining (DFIM), distributed frequent sequence mining (DFSM), distributed frequent graph mining (DFGM), distributed clustering (DC), and privacy preserving of distributed data mining (PPDDM).
3. Finally, some opportunities for future research on DM tasks in distributed environments are briefly summarized.
The study is organized as follows: The Distributed Systems and Its Technical Challenges section introduces the definitions and some important features of distributed systems, and summarizes some challenges in distributed systems and in DDM, respectively. The Data Mining Techniques in Distributed Environment section highlights and discusses state-of-the-art research on DM with distributed computing resources. The Opportunity for Distributed Data Mining section briefly summarizes some opportunities for DM tasks in distributed environments. Finally, conclusions are given in the Conclusion section.
DISTRIBUTED SYSTEMS AND ITS
TECHNICAL CHALLENGES
In this section, the related definitions and some important features of distributed systems are stated. Some technical challenges of distributed systems and DDM are then briefly reviewed and summarized.
Distributed Systems
Unlike traditional centralized systems, the term distributed system refers to a large collection of resources shared among computers connected by a network; examples include hardware sharing, software sharing, data sharing, service sharing, and media stream sharing. The development of collaborative computing, parallel computing, and distributed computing motivated the development of distributed systems. A distributed system is defined as one in which components at networked computers communicate and coordinate their actions only by passing messages [44,45]. In other words, a distributed system is a collection of autonomous computing elements (subsystems) that appears to its users as a single coherent system. A distributed system has a complex nature that requires powerful technologies and advanced algorithms, as shown in Figure 1.
From Figure 1, it can be observed that there are two aspects to a distributed system: independent computing elements and a single system w.r.t. middleware. There are some important features of distributed systems, including: (1) concurrency, i.e., multiprocess and multithread concurrent execution, and resource sharing (sharing of information and services); (2) no global clock, with program coordination depending on message passing; (3) independent failure, i.e., a failure such as a process failure cannot be known by other processes [44,45]. According to Refs 44,45, some properties of a distributed system, such as transparency, scalability, availability, reliability, serviceability (manageability), and safety, should be discussed and studied.
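Coordination purely by message passing can be sketched in miniature. In the toy example below (our own illustration; threads and in-process queues stand in for networked computers), two components share no variables and interact only through messages:

```python
# Toy illustration of "coordination only by passing messages": two
# concurrent components share no state and interact solely through
# message queues (threads stand in for networked hosts here).
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    # The worker learns about work only from incoming messages.
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        # Reply by sending a message back, never by touching shared state.
        outbox.put(("result", msg * 2))

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()
inbox.put(21)
reply = outbox.get()  # ('result', 42)
inbox.put("stop")
t.join()
```

In a real distributed system the queues would be replaced by network channels, and the absence of a global clock means such replies can arrive in any order relative to other events.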
Challenges in Distributed System
Distributed systems, in which the distributed computational units are connected and organized by networks to meet the demands of large-scale and high-performance computing, have received considerable attention over the past decades [26-31]. Many types of distributed systems, such as grids [32,33], P2P systems [34], ad hoc networks [35], cloud computing systems [36], and online social network systems [37], have been widely studied. Currently, there are various applications of distributed systems, such as DM, web services, scientific computation, and file storage. Although great developments in distributed systems have been made, there are still some technical challenges [44,45]. As shown in Figure 2, the main challenges in distributed systems can be grouped into eight aspects: heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service. Details of each challenge can be found in Refs 44,45.
Challenges in Distributed Data Mining
In recent decades, many models and algorithms have been developed in DM to efficiently discover desired knowledge in various types of databases [2,3,12,23,25], but some challenges in DM have yet to be solved. In 2006, Yang and Wu [50] introduced 10 challenging problems in DM research, such as developing a unifying theory of DM, scaling up for high-dimensional data, DDM and mining multiagent data, security, privacy, and so on. Traditional DM algorithms assume that the data is centralized, memory-resident, and static. Because of the growth of large-scale data in recent decades, two challenges have to be met. First, the amounts of data are rapidly produced. Second, the data are stored at multiple locations, and it becomes increasingly expensive to centralize them in one place. Therefore, the problem of DDM is quite important in various complex network databases. In a distributed environment (such as a sensor or IP network), one has distributed probes placed at strategic locations within the network, especially in areas with limited energy and limited memory (e.g., limited CPU computation and I/O calls across a distributed architecture). Therefore, the techniques of DDM are more challenging and complex than those of traditional DM [38-41].
With the data collected from the distributed sites, DDM explores techniques for applying DM in a noncentralized way. The goal here obviously is to minimize the amount of data shipped between the various sites. Some important challenges for DDM, such as how to essentially reduce the communication overhead, how to mine across multiple heterogeneous data sources (i.e., multisource databases), and how to perform multirelational mining in distributed environments, have been studied. As shown in Figure 2, the eight technical challenges in a distributed system, including heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service, are the same challenges encountered when performing DDM, especially heterogeneity, security, and scalability. DDM deals with these challenges in analyzing the distributed data and offers many algorithmic solutions to perform different data analysis and mining operations in a fundamentally distributed manner that pays careful attention to the resource constraints.
DATA MINING TECHNIQUES IN
DISTRIBUTED ENVIRONMENT
In this section, the state-of-the-art algorithms related to DM in distributed environments, including DFIM, DFSM, DFGM, DC, and privacy preserving of DDM (PPDDM), are given below. The preliminaries and the problem statement are first stated briefly, and then we describe the novel ideas of the related works in detail and highlight their specific contributions.
Distributed Frequent Itemset Mining
Let I = {i1, i2, ..., in} be a set of items. An itemset X = {i1, i2, ..., ik} with k items is a subset of I. The length or size of X is denoted as |X|, i.e., the number of items in X w.r.t. k. Given a transactional database D, each transaction Tq ∈ D is generally identified by a transaction id (TID), and |D| denotes the total number of transactions. The support of X in database D is denoted as sup(X) and is the proportion of transactions containing X, i.e., sup(X) = |{Tq | Tq ∈ D, X ⊆ Tq}| / |D|. The support count or frequency of itemset X is the number of transactions in D containing X. An itemset is said to be a frequent itemset (FI) if its support is greater than the
[Figure 1 shows n hosts, each running application components on top of a common middleware layer ("same interface everywhere"), a local OS, and hardware, all connected by a distributed network.]
FIGURE 1 | Architecture of distributed system.
[Figure 2 arranges the eight technical challenges around a central node: heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service.]
FIGURE 2 | Technical challenges in distributed system.
user-defined minimum support threshold, minsup. Therefore, the problem of frequent itemset mining is to discover all itemsets in which the support of each itemset is not less than the user-defined minimum support threshold, i.e., sup(X) ≥ minsup [25].
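As a minimal, single-machine illustration of these definitions (the toy database and threshold below are our own, not from the surveyed works), support and frequent itemsets can be computed by brute force:

```python
# Brute-force frequent itemset mining over a toy transactional database,
# directly following the definitions of sup(X) and minsup above.
from itertools import combinations

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "e"}]
minsup = 0.5  # user-defined minimum support threshold

def sup(X, D):
    """Proportion of transactions in D that contain itemset X."""
    return sum(1 for T in D if X <= T) / len(D)

items = sorted(set().union(*D))
frequent = [
    set(X)
    for k in range(1, len(items) + 1)
    for X in combinations(items, k)
    if sup(set(X), D) >= minsup
]
print(frequent)  # {'a'}, {'b'}, {'c'}, {'a','c'}, {'b','c'}
```

Enumerating all 2^|I| candidate itemsets like this is exponential; the algorithms surveyed below exist precisely to prune this search space and to distribute it over many nodes.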
As the most important task for a wide range of real-world applications, FIM and ARM have been extensively studied [2,11-13]. ARM consists of two phases: it first discovers the frequent itemsets, and then generates the association rules from the derived frequent itemsets. Because the first phase is more challenging and interesting than the second, most efforts on ARM address the problem of FIM. Two well-known algorithms, Apriori [25] and FP-growth [3], were respectively proposed to mine frequent itemsets and association rules. Many algorithms have been developed to efficiently mine the desired frequent itemsets or association rules from various types of databases [2,3,12,23,25]. Previously, the problem of FIM in a distributed/parallel environment (DFIM) has been extensively studied, and a number of approaches have been explored to address this problem. Table 1 shows an overview of frequent itemset mining in distributed/parallel environments.
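The second ARM phase, deriving rules from already-mined frequent itemsets, can be sketched as follows (the support counts and the minimum confidence of 0.6 are hypothetical values of ours, chosen only for illustration):

```python
# Sketch of the second ARM phase: generate rules X -> Y from frequent
# itemsets, keeping those whose confidence sup(X u Y)/sup(X) is high enough.
from itertools import combinations

# Support counts for a toy collection of frequent itemsets (hypothetical).
support = {
    frozenset("a"): 3, frozenset("c"): 3,
    frozenset("b"): 2, frozenset("ac"): 2, frozenset("bc"): 2,
}
minconf = 0.6  # user-defined minimum confidence threshold

rules = []
for itemset, count in support.items():
    if len(itemset) < 2:
        continue  # rules need a non-empty antecedent and consequent
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            conf = count / support[lhs]  # confidence of rule lhs -> rhs
            if conf >= minconf:
                rules.append((set(lhs), set(rhs), conf))

for lhs, rhs, conf in rules:
    print(f"{sorted(lhs)} -> {sorted(rhs)} (conf = {conf:.2f})")
```

Because this phase only reads the support table, it is cheap relative to mining the frequent itemsets themselves, which is why the distributed algorithms below concentrate on the first phase.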
In 1995, Mueller first proposed two parallel algorithms, called parallel efficient association rules (PEAR) [51] and parallel partition association rules (PPAR) [51]. Park et al. also proposed an algorithm named parallel data mining (PDM) for parallel mining of association rules [52], and the fast distributed mining (FDM) algorithm for distributed databases [53] was developed later. Cheung et al. proposed a mining algorithm named DMA to mine association rules in distributed databases [56]. An algorithm named Hash Partitioned Apriori (HPA) was first introduced in Ref 54, and the modified HPA-ELD approach [54], i.e., HPA with extremely large itemset duplication, was then proposed. Based on the partition technology, Zaki et al. then developed the Partitioned Candidate Common Database (PCCD) and Common Candidate Partitioned Database (CCPD) algorithms [55]. At the same time, some data distribution (DD)-based technologies have been extensively studied, such as CD [46], CD tree projection [59], DD [46], HD [58], IDD [58], IDD tree projection [59], DDDM [60], and so forth.
By extending the vertical mining approach Eclat [57], the parallel-based Eclat (ParEclat) [57] and the distributed Eclat (Dist-Eclat) [68] were, respectively, developed. With the consideration of dynamic mining, the ZIGZAG-based incremental approach [62] was proposed for distributed and parallel incremental mining of frequent rules. Lin et al. developed three versions of the Apriori algorithm, namely single pass counting (SPC), fixed passes combined counting (FPC), and dynamic passes combined counting (DPC), on the MapReduce framework [66]. SPC is a straightforward algorithm, while FPC aims at reducing the number of scheduling invocations, and DPC features dynamically combining candidates of various lengths.
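The common idea behind count distribution and SPC-style candidate counting can be imitated with plain Python map and reduce steps. The sketch below is a simplified, single-machine analogy of the paradigm (toy data; it is not the published algorithms): each partition emits local candidate counts, and a reduce step sums them.

```python
# Simplified sketch of MapReduce-style candidate counting: each data
# partition emits local (candidate, count) pairs, and a reduce step sums
# them, mirroring one counting pass of count distribution / SPC.
from collections import Counter
from functools import reduce

partitions = [  # the transaction database split across "nodes" (toy data)
    [{"a", "b"}, {"a", "c"}],
    [{"a", "b", "c"}, {"b", "c"}],
]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]

def map_count(part):
    """Local counting: one pass over a single partition."""
    return Counter(c for T in part for c in candidates if c <= T)

local_counts = [map_count(p) for p in partitions]          # map phase
global_counts = reduce(lambda x, y: x + y, local_counts)   # reduce phase
print(dict(global_counts))
```

Only the small (candidate, count) pairs cross partition boundaries, not the transactions themselves, which is the communication saving these schemes aim for.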
Recently, many DDM algorithms have been developed based on the Spark or Hadoop platforms. Hadoop is one of the well-known platforms using the MapReduce framework [26], and it is open-source software available for any implementation. The Hadoop distributed file system (HDFS) is used to store datasets in Hadoop (http://hadoop.apache.org). Spark [78] is a newer in-memory, distributed data-flow platform, which uses the Resilient Distributed Dataset (RDD) architecture to store the results at the end of an iteration and provide them to the next iteration. In general, Spark is 1-2 orders of magnitude faster than MapReduce [78]. Research efforts have already been made to improve the Apriori-based and traditional FIM/ARM algorithms by converting them into distributed versions under the MapReduce [26] or Spark [78] environments. Examples include a parallel FP-growth [64], a parallel randomized algorithm (PARMA) [67] for approximate association rule mining in MapReduce, the MapReduce-based H-mine algorithm [73], a parallel FIM algorithm with Spark (R-Apriori) [75], PaMPa-HD [74], and so on. Details of the above algorithms are described below.
An adaptation of FP-growth to MapReduce [26], called PFP, is presented in Ref 64. PFP is a parallel form of the classical FP-growth; it splits a large-scale mining task into independent and parallel tasks. First, a parallel/distributed counting approach is used to compute the frequent items, which randomly partitions the datasets into several groups. In a single MapReduce round, the transactions in the dataset are used to generate group-dependent transactions. The PFP approach shows good performance with a near-linear speedup. Although PARMA [67] is not the first algorithm using MapReduce to solve the task of DFIM, it is the first randomized MapReduce algorithm for discovering approximate collections of frequent itemsets or association rules with near-linear speedup. PARMA is also the first algorithm combining random sampling and parallelization to mine frequent itemsets or association rules. As shown in the study [68], Dist-Eclat is a MapReduce implementation of the well-known Eclat algorithm [57]. BigFIM is a hybrid approach exploiting both the Apriori and Eclat paradigms based on MapReduce [68]. Dist-Eclat focuses on speeding up the mining performance, while BigFIM is optimized to run on really large datasets. Baralis et al. [69] presented a parallel disk-based approach, named P-Mine, to solve
TABLE 1 | Algorithms for Distributed Frequent Itemset Mining
Name | Description | Year
PEAR [51] | Parallel efficient association rules | 1995
PPAR [51] | Parallel partition association rules | 1995
PDM [52] | Parallel mining of association rules | 1995
FDM [53] | Fast distributed mining for distributed databases | 1995
HPA [54] | Hash-partitioned Apriori | 1996
PCCD [55] | Partitioned candidate common database | 1996
DMA [56] | Mine association rules in distributed databases | 1996
CCPD [55] | Common candidate partitioned database | 1996
CD [46] | Count distribution | 1996
HPA-ELD [54] | HPA with extremely large itemset duplication | 1996
ParEclat [57] | Parallel Eclat | 1997
HD [58] | Hybrid distribution | 2000
CD tree projection [59] | Count distributed tree projection | 2001
DD [46] | Data distribution | 1996
IDD [58] | Intelligent data distribution | 2000
IDD tree projection [59] | Intelligent data distribution tree projection | 2001
DDDM [60] | Distributed dual decision miner, communication-efficient distributed mining of association rules | 2001
Fast distributed data mining [61] | Distributed mining of classification rules | 2002
ZIGZAG-based incremental approach [62] | Distributed and parallel incremental mining of frequent rules | 2004
Par-FP [63] | Parallel FP-growth with sampling | 2005
PFP [64] | An adaptation of FP-growth to MapReduce | 2008
DPA [65] | Distributed parallel Apriori | 2010
DPC [66] | Dynamic passes combined-counting | 2012
FPC [66] | Fixed passes combined-counting | 2012
PARMA [67] | A parallel randomized algorithm for approximate association rule mining in MapReduce | 2012
BigFIM [68] | Frequent itemset mining for big data | 2013
Dist-Eclat [68] | Distributed Eclat based on MapReduce | 2013
P-Mine [69] | Parallel itemset mining on large datasets | 2013
RuleMR [70] | Classification rule discovery with MapReduce | 2014
YAFIM [71] | A parallel frequent itemset mining algorithm on Spark | 2014
DFIMA [72] | Apriori-like distributed frequent itemset mining algorithm | 2015
MRH-mine [73] | MapReduce-based H-mine algorithm | 2015
(Continued)
the task of DFIM on a multicore processor by improving the I/O performance with a prefetching strategy. Recently, Qiu et al. [71] have reported a speedup of nearly 18 times on average over various benchmarks for the yet another frequent itemset mining (YAFIM) algorithm based on Spark. The results obtained on real-world medical data show that YAFIM is much faster than all Hadoop-based algorithms. Kaul and Kashyap [75] then proposed the Reduced-Apriori (R-Apriori) algorithm, which is a parallel Apriori algorithm based on the Spark Resilient Distributed Dataset (RDD) framework. It adds an additional phase to YAFIM and speeds up the second round of generating the promising candidate set in order to achieve higher performance than YAFIM.
According to these studies, implementations based on Spark are generally more efficient than those on the Hadoop model. In general, the performance of the above approaches might not be satisfactory due to the bottleneck of iterative computation when handling large-scale datasets. Therefore, a distributed algorithm for frequent itemset mining (DFIMA) was proposed to improve and speed up the process of FIM [72]. Some distributed and highly scalable parallel mining approaches were also developed in recent years, such as FDMCN (Fast and Distributed Mining algorithm for discovering frequent patterns in Congested Networks) [76] and PHIKS (parallel highly informative K-itemset) [77]. Different from the general itemset mining problem, Salah et al. [77] studied the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy, and proposed PHIKS, a highly scalable and parallel miki mining algorithm. For the classification application, Cho and Wüthrich [61] introduced a model for FDM of classification rules, and the MapReduce-based RuleMR [70] was further developed.
Distributed Sequential Pattern Mining
Different from FIM, SPM discovers frequent subsequences as the interesting patterns in a sequence database, which contains the embedded timestamps of events. The itemset mining model was then extended to handle sequences by Srikant and Agrawal [4].
A sequential database SDB = {S1, S2, ..., Sn} is a set of tuples (sid, S), where sid is a sequence identifier and Sk is an input sequence. A sequence Sα = (α1, α2, ..., αn) is called a subsequence of another sequence Sβ = (β1, β2, ..., βm) (n ≤ m), and Sβ is called a super-sequence of Sα, if there exist integers 1 ≤ i1 < ... < in ≤ m such that α1 ⊆ βi1, ..., αn ⊆ βin, denoted as Sα ⊆ Sβ. A tuple (sid, S) is said to contain a sequence Sα if S is a super-sequence of Sα. The support of a sequence Sα in a sequence database SDB, denoted sup(Sα), is the number of tuples in SDB that contain Sα. The sequential pattern mining problem was first introduced by Srikant and Agrawal [4] and can be formulated as follows: given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minsup threshold, sequential pattern mining is to find all of the frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is not less than minsup [4].
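These definitions can be checked directly in code. The sketch below (the toy sequence database is our own) tests subsequence containment for sequences whose elements are itemsets, and counts support:

```python
# Sub-sequence containment and support counting for sequences whose
# elements are itemsets, following the definitions of SDB and sup above.

def contains(seq_beta, seq_alpha):
    """True if seq_alpha is a subsequence of seq_beta: order-preserving,
    with each element of alpha contained in a distinct element of beta."""
    i = 0
    for element in seq_beta:
        if i < len(seq_alpha) and seq_alpha[i] <= element:
            i += 1
    return i == len(seq_alpha)

SDB = {  # toy sequence database: sid -> list of itemsets
    1: [{"a"}, {"b", "c"}, {"d"}],
    2: [{"a", "b"}, {"c"}],
    3: [{"b"}, {"d"}],
}
S_alpha = [{"a"}, {"c"}]
support = sum(1 for S in SDB.values() if contains(S, S_alpha))
print(support)  # sequences 1 and 2 contain <{a}, {c}>, so support is 2
```

Note that, unlike itemset containment, the order of elements matters here, which is what makes the SPM search space so much larger than that of FIM.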
Some well-known algorithms for sequential pattern mining have been extensively proposed, such as AprioriAll [1], generalized sequential patterns (GSP) [4], BI-Directional Extension (BIDE) [79], CloSpan [80], frequent pattern-projected sequential pattern mining (FreeSpan) [81], prefix-projected sequential pattern mining (PrefixSpan) [82], Sequential PAttern Discovery using Equivalence classes (SPADE) [83], and so forth. It has been shown that SPM has broad applications in real-world situations. Among them, AprioriAll [1] and GSP [4] are the fundamental Apriori-based algorithms, which are required to mine
TABLE 1 | Continued
Name | Description | Year
PaMPa-HD [74] | Parallel MapReduce-based frequent pattern miner for high-dimensional data | 2015
R-Apriori [75] | An efficient Apriori-based algorithm on Spark | 2015
FDMCN [76] | A fast and distributed mining algorithm for discovering frequent patterns in congested networks | 2016
PHIKS [77] | A highly scalable parallel algorithm, named parallel highly informative K-itemset, for maximally informative k-itemset mining | 2016
the sequential patterns in a levelwise manner. Up to now, many researchers have provided different techniques to work in distributed environments like grid computing [32,33], the cloud [36], and Hadoop (http://hadoop.apache.org), or to distribute the mining computation over more than one node for mining the sequential patterns. As shown in Table 2, some distributed and parallel methods for SPM are described below.
In 1998, Shintani and Kitsuregawa partitioned the input sequences in nonpartitioned sequential pattern mining (NPSPM), yet they assumed that the entire candidate set can be replicated and fit into the overall memory (random access memory and hard drive) of a process [84]. Similar assumptions were made in EVEnt distribution (EVE) [85], EVEnt and CANdidate distribution (EVECAN) [85], and in the data parallel formulation (DPF) [88]. As well, a hash function was used in the hash partitioned sequential pattern mining (HPSPM) algorithm to assign input and candidate sequences to specific processes [84]. Input partitioning is, however, not inherently necessary for shared-memory or MapReduce distributed systems. In the case of shared-memory systems, the input data, i.e., the sequences, should fit in the aggregated system memory and be available to be read by all processes. Thus, Zaki extended the efficient SPADE algorithm to the shared-memory parallel architecture, called pSPADE [86]. In the pSPADE framework, the input data is assumed to reside on shared hard drive space and to be stored in the vertical database format.
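The vertical format that pSPADE assumes can be illustrated by converting a toy horizontal sequence database into per-item occurrence lists. This is a simplified sketch of the general idea only; SPADE's actual id-lists record (sequence id, event timestamp) pairs and support richer joins.

```python
# Sketch of the vertical database format: for each item, record the
# (sequence id, position) pairs where it occurs, instead of storing whole
# sequences. Toy data; real SPADE id-lists use event timestamps.
from collections import defaultdict

horizontal = {  # sid -> sequence of itemsets
    1: [{"a"}, {"b"}],
    2: [{"a", "b"}, {"a"}],
}

vertical = defaultdict(list)
for sid, sequence in horizontal.items():
    for pos, element in enumerate(sequence):
        for item in element:
            vertical[item].append((sid, pos))

# The support of a single item is the number of distinct sids in its list.
support_a = len({sid for sid, _ in vertical["a"]})
print(dict(vertical))
print(support_a)  # 2
```

The appeal of this layout for parallel mining is that longer patterns can be counted by intersecting the lists of their parts, without rescanning the raw sequences.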
In order to balance the mining tasks, Cong et al. designed several models, Par-FP [63], Par-ASP [63], and Par-CSP [89], to accomplish the task. They use a sampling technique that requires the entire input set to be available at each process. In addition, the 2PDF-Index [87], 2PDF-Compression [87], and DFSP [94] algorithms were proposed and applied to scalable mining of sequential patterns from biological sequences. After that, some distributed and parallel mining methods, such as MapReduce-based distributed GSP (DGSP) and large-scale frequent sequence mining (MG-FSM), were proposed by extending the traditional SPM
TABLE 2 | Algorithms for Distributed Sequential Pattern Mining
Name | Description | Year
HPSPM [84] | Hash partitioned sequential pattern mining | 1998
NPSPM [84] | Nonpartitioned sequential pattern mining | 1998
EVE [85] | EVEnt distribution | 1999
EVECAN [85] | EVEnt and CANdidate distribution | 1999
pSPADE [86] | Parallel SPADE | 2001
2PDF-Index and 2PDF-Compression [87] | Scalable sequential pattern mining for biological sequences | 2004
DPF [88] | Data parallel formulation | 2004
Par-ASP [63] | Parallel PrefixSpan with sampling | 2005
Par-CSP [89] | Parallel CloSpan with sampling | 2005
DGSP [90] | Distributed GSP | 2008
PLUTE [91] | Parallel sequential patterns mining | 2010
MG-FSM [92] | Large-scale frequent sequence mining | 2013
ACME [93] | Advanced parallel motif extractor | 2013
DFSP [94] | A depth-first SPelling algorithm for sequential pattern mining of biological sequences | 2013
An iterative MapReduce framework [95] | Manages data uncertainty in SPM and designs an iterative MapReduce framework to execute the uncertain SPM algorithm in parallel | 2015
LASH [96] | LArge-scale Sequence mining with Hierarchies | 2015
Distributed DP [97] | A memory-efficient distributed dynamic programming approach that uses an extended prefix-tree to save intermediate results | 2016
algorithms. With the consideration of motifs, uncertain sequences, and hierarchies, the advanced parallel motif extractor (ACME),93 an iterative MapReduce framework,95 and the LASH96 algorithm were also proposed for large-scale distributed sequence mining. Other related algorithms for distributed sequential pattern mining are still under development, such as the memory-efficient distributed DP approach.97
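Most of the MapReduce-style SPM methods above share a common core: mappers scan their local shard of sequences and emit partial support counts for candidate patterns, and reducers aggregate them into global supports. The following toy sketch (purely illustrative; it does not reproduce any specific published algorithm, and all names are made up) mimics that map/reduce support counting in plain Python:

```python
from collections import Counter

def is_subsequence(candidate, sequence):
    """Check whether `candidate` occurs in `sequence` in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in candidate)

def map_phase(shard, candidates):
    """Mapper: emit (candidate, 1) for every candidate contained in a local sequence."""
    for seq in shard:
        for cand in candidates:
            if is_subsequence(cand, seq):
                yield cand, 1

def reduce_phase(mapped_pairs):
    """Reducer: sum the partial counts to obtain global supports."""
    support = Counter()
    for cand, count in mapped_pairs:
        support[cand] += count
    return support

# Sequences are horizontally partitioned across two "nodes".
shards = [
    [("a", "b", "c"), ("a", "c")],
    [("b", "c"), ("a", "b", "c", "d")],
]
candidates = [("a", "c"), ("b", "c"), ("c", "a")]
pairs = [p for shard in shards for p in map_phase(shard, candidates)]
support = reduce_phase(pairs)
print(support[("a", "c")])  # 3: ("a","c") is contained in 3 of the 4 sequences
```

The candidate generation and pruning steps of real algorithms such as DGSP are omitted here; the sketch only shows how support counting decomposes naturally over horizontally partitioned sequence data.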
As mentioned before, the problem of sequential pattern mining is more complicated than frequent itemset mining or ARM, so there are fewer DFSM approaches than DFIM approaches. With the rapid development of SPM techniques and of the latest distributed platforms and tools, state-of-the-art research efforts on distributed sequential pattern mining continue to appear. Generally speaking, DFSM is a substantial research topic in the fields of DM and big data analytics.
Distributed Frequent Graph Mining
In this section, we continue with another DDM approach, DFGM. Different from FIM and SPM, graphs have become a ubiquitous and essential data representation for modeling real-world objects and their relationships.16 Today, large amounts of graph data are generated by various applications, including social networks, biological networks, the WWW, and so on. Unlike more general data structures, e.g., itemsets and sequences, the labeled graph structure is much more complicated and can be used to model data for discovering substructure patterns. Therefore, frequent graph mining (FGM) problems take an input graph G whose vertices and edges are labeled; vertices and edges have unique IDs, and their labels are arbitrary, domain-specific attributes that can be null.16
In 2003, Yan and Han developed the first pattern-growth FGM method, named graph-based substructure pattern mining (gSpan).17 It avoids duplicates by only expanding subtrees that lie on the rightmost path of the depth-first traversal. With the overwhelming amount of information encoded in these graphs, there is a crucial need for efficient tools that can quickly explore large graphs and return concise, easily understood patterns. Distributed data processing platforms, such as MapReduce,98 Pregel,99 GraphLab,100 and GraphX,101 have substantially simplified the design and deployment of distributed graph analytics algorithms. In particular, these platforms deliver good performance on distributed graph mining problems. Because a pattern can be an arbitrary graph, finding frequent subgraphs in a labeled graph is an important topic in graph mining. To date, successful algorithms for FGM are related to those designed for FIM. In this section, we provide a brief overview of some key distributed methods for DFGM and then discuss each of them in detail. The current methods for DFGM are summarized in Table 3.
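As a concrete base case, gSpan-style miners begin by counting the frequent 1-edge subgraphs, i.e., labeled triples (vertex label, edge label, vertex label), over the graph database, and only grow larger patterns from the frequent ones. A minimal sketch of that first step (illustrative only; it omits gSpan's DFS-code and rightmost-extension machinery):

```python
from collections import Counter

def frequent_edge_patterns(graphs, min_support):
    """Count labeled edge triples (lv, le, lv') over a database of labeled graphs.

    Each graph is (vertex_labels, edges), where vertex_labels maps vertex id -> label
    and edges is a list of (u, v, edge_label). Support = number of graphs containing
    the triple at least once. Triples are canonicalized so (a, x, b) == (b, x, a).
    """
    support = Counter()
    for vertex_labels, edges in graphs:
        seen = set()
        for u, v, edge_label in edges:
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            seen.add((a, edge_label, b))
        support.update(seen)
    return {t: s for t, s in support.items() if s >= min_support}

# Three small molecule-like labeled graphs.
g1 = ({0: "C", 1: "O", 2: "C"}, [(0, 1, "single"), (1, 2, "single")])
g2 = ({0: "C", 1: "O"}, [(0, 1, "single")])
g3 = ({0: "C", 1: "C"}, [(0, 1, "double")])
print(frequent_edge_patterns([g1, g2, g3], min_support=2))
# {('C', 'single', 'O'): 2}
```

In the distributed setting, this per-graph counting is exactly the kind of embarrassingly parallel work that maps onto a partitioned graph database, with a reduce step merging the per-partition counters.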
A pattern-growth method called Molecular Fragment miner (MoFa) was introduced by Borgelt et al. It can mine both molecular substructures and general frequent subgraphs. With a dynamic load balancing strategy, Fatta and Berthold proposed the distributed MoFa with dynamic load balancing (d-MoFa) algorithm.39 By extending the well-known gSpan algorithm, a parallel gSpan algorithm named p-gSpan was also proposed.102 Wang and Parthasarathy then designed a toolkit to mine motif patterns, named MotifMiner.103 Based on the MapReduce distributed data processing platform, researchers have contributed great efforts to DFGM, such as the MRPF104 algorithm for MapReduce-based subgraph pattern finding and MRFSE106 for MapReduce-based frequent subgraph extraction. In real-world situations, however, natural graphs commonly have highly skewed power-law degree distributions, which challenge the assumptions made by previous approaches. Thus, Gonzalez et al. introduced a new approach, PowerGraph, to distributed graph placement and representation that exploits the structure of power-law graphs.105
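The key idea behind PowerGraph is the vertex-cut: edges, not vertices, are assigned to machines, and a high-degree vertex is replicated on every machine that holds one of its edges. A toy sketch of hash-based edge placement and the resulting replication factor (illustrative; PowerGraph itself uses smarter greedy placement strategies):

```python
def vertex_cut(edges, num_machines):
    """Assign each edge to a machine by hashing; a vertex is replicated
    on every machine that holds at least one of its edges."""
    placement = {m: [] for m in range(num_machines)}
    replicas = {}  # vertex -> set of machines holding a copy of it
    for u, v in edges:
        m = hash((u, v)) % num_machines
        placement[m].append((u, v))
        replicas.setdefault(u, set()).add(m)
        replicas.setdefault(v, set()).add(m)
    # average number of machine copies per vertex
    rep_factor = sum(len(ms) for ms in replicas.values()) / len(replicas)
    return placement, rep_factor

# A star graph: vertex 0 is a power-law "hub" connected to leaves 1..7.
edges = [(0, i) for i in range(1, 8)]
placement, rep_factor = vertex_cut(edges, num_machines=4)
# The hub is replicated on up to 4 machines; each leaf lives on exactly one.
print(rep_factor)
```

The point of the vertex-cut is that the per-machine work for a hub vertex is split across its replicas, whereas an edge-cut would place all edges of the hub on one overloaded machine.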
In addition, a two-step filter-and-refinement MapReduce framework for frequent subgraph mining was presented in Ref 107. In recent years, several distributed graph mining and analytics systems have been proposed, including GraphX,101 GridGraph,108 UNICORN,109 Arabesque,110 and DistGraph,111 among others. GraphX processes graphs in a distributed dataflow framework; it offers an integrated graph and collections Application Programming Interface (API) that is sufficient to express existing graph abstractions and enables a much wider range of computation.101 With the development of grid technology, GridGraph was proposed as a large-scale graph processing system on a single machine using 2-level hierarchical partitioning.108 UNICORN exploits the random-write characteristic of HBase (http://hbase.apache.org/), an open-source implementation of Bigtable,44 to improve the performance of generalized iterative matrix–vector multiplication.109 Arabesque,110 the first distributed data processing platform for implementing graph mining algorithms, automates the process of exploring a very large number of subgraphs and defines a high-level filter-process computational model. Recently, DistGraph111 was proposed as the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node.
Distributed Clustering
Successful clustering algorithms have been adapted to the distributed environment, and Distributed Clustering (DC)112 has thus become an important research topic in clustering. In this section, we provide a brief overview of some key methods for DC. Table 4 lists and summarizes the distributed methods.
Clustering techniques can be classified into two main categories: single-machine and multiple-machine clustering techniques. The latter, DC,112 is related to distributed and parallel systems, and most DC methods have been designed based on MapReduce. In 2004, the hybrid energy-efficient distributed clustering (HEED) algorithm113 was introduced by Younis et al. Zhou et al. then presented an EM-based framework for distributed data stream clustering.114 For distributed data clustering, a comparative analysis system with three approaches, named the Improved Distributed Combining Algorithm (IDCA), Distributed K-Means (DKMA), and the traditional Centralized Clustering Algorithm (CCA), was proposed in Ref 115. Based on MapReduce, an efficient parallel K-means clustering algorithm (PKMeans) was proposed by directly extending the traditional K-means algorithm,116 and optimized K-means clustering algorithms were further proposed using MapReduce.123 Bahmani et al. also proposed k-means||, an efficient parallel version of the inherently sequential K-means++.120 The MapReduce K-means++ method replaces the iterations among multiple machines with a single machine, which significantly reduces the communication and I/O costs. The above K-means-based approaches are designed to return exact results. It is, however, not an easy task to quickly find exact results in big data. Therefore, an efficient approximation approach, K-means++ approximation with MapReduce, was introduced in Ref 124. It drastically reduces the number of MapReduce jobs by using only one MapReduce job to obtain the k centers. At the same time, Han and Luo proposed a fast K-means method using a statistical bootstrap.122
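The PKMeans-style decomposition of K-means onto MapReduce is simple: the map phase assigns each local point to its nearest center, and the reduce phase averages the assigned points per center, so only the k centers travel between machines in each iteration. A minimal single-iteration sketch (illustrative, using 1-D points for brevity):

```python
def kmeans_map(shard, centers):
    """Mapper: emit (nearest_center_index, (point, 1)) for each local point."""
    for x in shard:
        idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        yield idx, (x, 1)

def kmeans_reduce(pairs, k):
    """Reducer: average the points assigned to each center to get new centers."""
    sums, counts = [0.0] * k, [0] * k
    for idx, (x, n) in pairs:
        sums[idx] += x
        counts[idx] += n
    return [sums[i] / counts[i] if counts[i] else None for i in range(k)]

shards = [[1.0, 2.0], [9.0, 10.0, 11.0]]   # data split across two "nodes"
centers = [0.0, 8.0]                        # current centers
pairs = [p for shard in shards for p in kmeans_map(shard, centers)]
print(kmeans_reduce(pairs, k=2))  # [1.5, 10.0]
```

Iterating this map/reduce pair until the centers stabilize reproduces Lloyd's algorithm; in a real deployment a combiner would pre-sum (point, count) pairs per node to further cut communication.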
With the consideration of sensor network applications, some DC methods have been proposed, such as a generic algorithm for distributed data clustering in sensor networks119 and the novel DKM algorithm for clustering observations collected by spatially
TABLE 3 | Algorithms for Distributed Frequent Graph Mining

| Name | Description | Year |
| --- | --- | --- |
| p-MoFa102 | Parallel MoFa | 2006 |
| p-gSpan102 | Parallel gSpan | 2006 |
| d-MoFa39 | Distributed MoFa with dynamic load balancing | 2006 |
| MotifMiner103 | MotifMiner toolkit | 2004 |
| MRPF104 | MapReduce-based pattern finding | 2009 |
| Pregel99 | A system for large-scale graph processing | 2010 |
| PowerGraph105 | Distributed graph-parallel computation on natural graphs | 2012 |
| MRFSE106 | MapReduce-based frequent subgraph extraction | 2013 |
| Filter-and-refinement107 | A two-step filter-and-refinement MapReduce framework for frequent subgraph mining | 2014 |
| GraphX101 | A distributed dataflow framework | 2014 |
| GridGraph108 | Large-scale graph processing using hierarchical partitioning | 2015 |
| UNICORN109 | A graph mining library on top of HBase | 2015 |
| Arabesque110 | A system for distributed graph mining | 2015 |
| DistGraph111 | A distributed approach for graph mining in massive networks | 2016 |
distributed resource-aware sensors.118 Recently, two K-means-based models, distributed PCA and K-means121 and KPCA + K-means clustering,125 were developed based on the PCA127 and kernel PCA128 concepts, respectively. Mashayekhi et al. proposed GDCluster, a general fully decentralized clustering method capable of clustering dynamic and distributed datasets.126 In GDCluster, nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the dataset. Other approaches for DC are still in progress.
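The gossip principle behind GDCluster can be illustrated with the classic push-pull averaging step: in each round, a random pair of nodes exchanges local summaries and averages them, so every node converges to the global statistic without any coordinator. A toy sketch with scalar summaries (illustrative only; GDCluster itself maintains richer summarized views of the data):

```python
import random

def gossip_average(values, rounds, seed=0):
    """Repeatedly pick a random node pair and average their local values.

    The global mean is preserved at every step, and all local values
    converge toward it without any central coordination.
    """
    rng = random.Random(seed)
    values = list(values)
    for _ in range(rounds):
        i, j = rng.sample(range(len(values)), 2)
        avg = (values[i] + values[j]) / 2.0
        values[i] = values[j] = avg
    return values

local = [2.0, 4.0, 6.0, 8.0]          # each node's local summary
result = gossip_average(local, rounds=200)
# every node is now close to the global mean 5.0
print(max(abs(v - 5.0) for v in result))
```

The same pairwise-exchange pattern generalizes from scalar means to the cluster summaries that decentralized clustering methods gossip among nodes.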
Privacy Preserving of Distributed Data Mining
Before reviewing current works on privacy-preserving DM in distributed environments (PPDDM), we first stress the significance of and motivation for this research topic. With the rapid development of networks, communications, and computer technology, privacy-preserving data mining (PPDM) has become an increasingly important topic in DM.129 Especially in distributed environments, how to protect 'data privacy' while performing DM tasks on a large amount of distributed data is even more challenging and interesting. PPDM has emerged as an important topic in DM, and many related works have been extensively studied, such as PPDM of association rules and frequent itemsets, PPDM of sequential patterns, PPDM of graphs, and so on.129 In particular, some papers have addressed the privacy issues in mining association rules and frequent itemsets from distributed data. In the literature, Clifton et al. first raised the issue of PPDDM of association rules and frequent itemsets.129 A brief overview of PPDDM is shown in Table 5.
In 2004, Kantarcioglu and Clifton proposed a PPDM approach for association rules in horizontally distributed databases that uses Yao's generic secure-computation protocol as a subprotocol. They also designed several methods that incorporate cryptographic techniques to minimize the information
TABLE 4 | Algorithms for Distributed Clustering

| Name | Description | Year |
| --- | --- | --- |
| HEED113 | Hybrid energy-efficient distributed clustering | 2004 |
| EM-based framework114 | Distributed data stream clustering | 2007 |
| IDCA, DKMA, CCA115 | Distributed data clustering: a comparative analysis system | 2009 |
| PKMeans116 | Parallel K-means clustering based on MapReduce | 2009 |
| CloudClustering117 | Toward an iterative data processing pattern on the cloud | 2011 |
| Novel DKM118 | A distributed algorithm for clustering observations collected by spatially distributed resource-aware sensors | 2011 |
| A generic algorithm119 | Distributed data clustering in sensor networks | 2011 |
| K-Means++120 | An efficient parallel version (k-means||) of the inherently sequential K-means++ | 2012 |
| Distributed PCA and K-Means121 | Distributed PCA and K-means clustering | 2013 |
| Bootstrapping K-means122 | A fast K-means method using a statistical bootstrap | 2014 |
| Optimize K-means123 | Optimized K-means clustering using MapReduce | 2014 |
| MapReduce K-means++124 | Efficient K-means++ approximation with MapReduce | 2014 |
| KPCA + K-Means clustering125 | A communication-efficient algorithm to perform kernel PCA in the distributed setting | 2015 |
| GDCluster126 | A general distributed clustering algorithm | 2015 |
shared while adding little overhead to the mining task.130 Luo et al. then proposed the GridDMM algorithm131 for distributed mining of maximal frequent itemsets on a data grid system. In Ref 132, two algorithms for both vertically and horizontally partitioned data with cryptographically strong privacy were introduced. In addition, hybrid CF-based referrals with decent accuracy on cross-distributed data (CDD) were presented in Ref 134. Privacy preservation in distributed systems has also been studied in several other areas, such as multiparty privacy-preserving DDM133 and privacy-preserving SOM-based recommendations on horizontally distributed data,135 among others.
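A common building block in these horizontally partitioned protocols is the secure sum: each site masks its local count (e.g., the local support of an itemset) with random additive shares so that the coordinator learns only the global total, never any individual site's contribution. A toy sketch (illustrative only; the published protocols add cryptographic protection, e.g., against colluding sites):

```python
import random

def secure_sum(local_counts, modulus=1 << 32, seed=42):
    """Each site splits its count into random additive shares mod `modulus`;
    only the sum of all aggregated shares reveals the true total."""
    rng = random.Random(seed)
    n = len(local_counts)
    shares_received = [0] * n
    for count in local_counts:
        # split `count` into n random shares that sum to it (mod modulus)
        shares = [rng.randrange(modulus) for _ in range(n - 1)]
        shares.append((count - sum(shares)) % modulus)
        for j, s in enumerate(shares):
            shares_received[j] = (shares_received[j] + s) % modulus
    # the coordinator sums the aggregated shares; individual counts stay hidden
    return sum(shares_received) % modulus

# three sites hold local supports of the same itemset
print(secure_sum([120, 45, 300]))  # 465
```

Because each share in isolation is uniformly random, no single recipient learns a site's local support; only the final aggregation equals the global support, which is exactly what a distributed ARM protocol needs to test an itemset against the minimum support threshold.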
In Ref 136, the researchers proposed the Multilevel Trust (MLT)-PPDM model to expand the scope of perturbation-based PPDM to multilevel trust. In order to reduce the disjunctive operations, Chun et al. developed the PPDNF approach for privacy-preserving disjunctive normal form operations on distributed sets.137 Tassa then proposed a protocol for secure mining of association rules in horizontally distributed databases that improves significantly upon the current leading protocol in terms of privacy and efficiency.139 Different from the previous PPDDM approaches, the first algorithm for privacy-preserving sub-feature selection in DDM was introduced by Bhuyan and Kamila.140 It focuses on the issue of sub-feature selection instead of the traditional patterns (itemsets, sequences, graphs, trees, etc.). In order to solve the visualization problem of PPDDM, a novel technique called DPcode141 was recently proposed for privacy-preserving frequent visual pattern publication on the cloud. Furthermore, several reviews of privacy-preserving computing on distributed data have been summarized and discussed.142–145
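Perturbation-based PPDM, which the MLT-PPDM model generalizes to multiple trust levels, releases copies of the data with additive random noise whose variance is calibrated to the recipient's trust level: less-trusted parties receive noisier copies, while aggregate statistics remain usable. A minimal sketch of this idea (illustrative only, not the actual MLT-PPDM construction):

```python
import random
import statistics

def perturbed_copy(data, noise_std, seed=1):
    """Release a copy of `data` with zero-mean Gaussian noise of the given std."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in data]

data = [float(x) for x in range(100)]          # true records (mean 49.5)
low_trust = perturbed_copy(data, noise_std=20.0)   # noisy copy for a less-trusted party
high_trust = perturbed_copy(data, noise_std=2.0)   # accurate copy for a trusted party
# individual records are distorted, but aggregate statistics survive perturbation
print(statistics.mean(low_trust), statistics.mean(high_trust))
```

The multilevel-trust challenge that MLT-PPDM addresses is that a party holding several such copies could combine them to filter out the noise; the model designs the noise across copies so that jointly held copies leak no more than the least noisy one.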
TABLE 5 | Algorithms for Privacy Preserving of Distributed Data Mining (PPDDM)

| Name | Description | Year |
| --- | --- | --- |
| Toolkit129 | Tools for privacy-preserving distributed mining | 2002 |
| Secure mining130 | PPDM for association rules in horizontally distributed databases | 2004 |
| GridDMM131 | Distributed mining of maximal frequent itemsets on a data grid system | 2006 |
| Two algorithms for vertically partitioned data132 | Algorithms for both vertically and horizontally partitioned data, with cryptographically strong privacy | 2007 |
| Multiparty PPDM133 | A game-theoretic approach for PPDDM | 2007 |
| PPCF on CDD134 | Hybrid CF-based referrals with decent accuracy on cross-distributed data (CDD) | 2012 |
| SOM-based recommendation135 | A privacy-preserving scheme to provide recommendations on horizontally partitioned data among multiple parties | 2012 |
| MLT-PPDM136 | Expands the scope of perturbation-based PPDM to multilevel trust | 2012 |
| PPDNF137 | Privacy-preserving disjunctive normal form operations on distributed sets | 2013 |
| Privacy-preserving two-party distributed mining138 | Privacy-preserving two-party distributed association rule mining on horizontally partitioned data | 2013 |
| Secure mining139 | Secure mining of association rules in horizontally distributed databases | 2014 |
| Sub-feature selection140 | Privacy-preserving sub-feature selection in distributed data mining | 2015 |
| DPcode141 | Privacy-preserving frequent visual pattern publication on the cloud | 2016 |
OPPORTUNITIES FOR DISTRIBUTED DATA MINING
Undoubtedly, the world is shrinking into a small village owing to the tangible influence of networks and various types of distributed systems, such as online social network systems,37 P2P systems,34 ad hoc networks,35 and cloud computing systems.36 These systems connect people from different parts of the world by sharing data, services, and media streams. Many researchers have proposed various DDM techniques based on different domain requirements and applications, such as DFIM, DFSM, DFGM, DC, and PPDDM. As mentioned before, the Challenges in Distributed Data Mining section provides an up-to-date view of the challenges for DDM. DDM has to deal with complex distributed systems, but it also reveals many opportunities. We next highlight some important research opportunities.
1. Developing more efficient algorithms. DDM is expensive in terms of computational cost and memory usage for making resources accessible (e.g., limited CPU computation and I/O calls across the distributed architecture). In order to achieve high performance, some distributed/parallel DM platforms and tools have been developed in recent years, such as MapReduce26 and Spark.78 These developments provide the necessary theoretical and technical support for DDM. Although currently developed algorithms are efficient, there is still room for improvement when handling large-scale data.
2. Heterogeneity. Relational or nonrelational database systems often utilize a single schema, or their files share a homogeneous format. In the big data era, a large amount of heterogeneous distributed data must be processed. Traditional DM techniques are designed to discover useful knowledge in structured data, whereas heterogeneity is an inherent characteristic of distributed data. Thus, discovering the useful knowledge embedded in unstructured and/or semistructured data is a major challenge and opportunity for DM, particularly for DDM.
3. Different types of mining patterns. Besides FIM, ARM, sequential pattern mining, and graph mining, several other pattern mining problems have been studied, e.g., sequential rule mining,14,15 high-utility pattern mining,18–20 weight-based pattern mining,21,22 and other interesting pattern mining problems.23,24 Research on these problems inspires distributed pattern mining. Thus, many research opportunities in DDM can be further explored.
4. A wide range of applications in various domains. Based on specific applications, many possibilities for further research on DDM can be extensively studied. How to utilize DFIM, DFSM, DFGM, DC, and PPDDM in new or existing applications is an interesting issue. We expect more research topics on DDM in the near future.
5. Security. Undoubtedly, the information resources that are made available and maintained in distributed systems have a high intrinsic value to their users.44,45 Therefore, security is an important topic in DDM. When analyzing big datasets, security and privacy are emerging issues. Several PPDDM approaches have been mentioned and discussed in this study. However, how to improve the applicability and flexibility of PPDDM is still a major challenge, and many opportunities remain to be explored.
CONCLUSION
Typically, DM algorithms aim to discover desired patterns (i.e., frequent itemsets, sequential patterns, graphs, etc.) or to perform clustering, classification, outlier detection, and so on. In general, the collected data and the executed data analysis applications are distributed in nature. Due to the problems and challenges that traditional DM algorithms face when processing distributed data, DM in distributed computing environments has emerged as an important research topic. However, few studies have summarized the related developments of the various types of DM in distributed systems or provided a general taxonomy of them.
In this study, we thus introduce the definitions, general architectures, and several important features of distributed systems, and then point out the challenges of DM tasks in distributed environments. Our main contribution is an investigation of recent advances in distributed DM with state-of-the-art details, including DFIM, DFSM, DFGM, DC, and PPDDM. For future research, several opportunities for DM tasks in distributed environments can reasonably be considered and further developed: (1) DM on multisource, multimodal, and heterogeneous data; (2) new types of pattern representation or knowledge representation in DDM; (3) visualization techniques for DDM; and (4) security issues and quality of service for DDM in the big data era.
ACKNOWLEDGMENTS
This research was partially supported by the National Natural Science Foundation of China (NSFC) under grant no. 61503092, by the Research on the Technical Platform of Rural Cultural Tourism Planning Based on Digital Media under grant 2017A020220011, and by the Tencent Project under grant CCF-Tencent IAGR20160115.
REFERENCES
1. Agrawal R, Srikant R. Mining sequential patterns. In:
Proceedings of the International Conference on Data
Engineering, Taipei, Taiwan, 1995, 3–14.
2. Agrawal R, Imielinski T, Swami A. Mining associa-
tion rules between sets of items in large database. In:
Proceedings of the ACM SIGMOD International
Conference on Management of Data, Washington,
DC, USA, 1993, 207–216.
3. Han J, Pei J, Yin Y, Mao R. Mining frequent patterns
without candidate generation: a frequent-pattern tree
approach. Data Min Knowl Discov 2004, 8
(1):53–87.
4. Srikant R, Agrawal R. Mining sequential patterns:
generalizations and performance improvements. In:
Proceedings of the International Conference on
Extending Database Technology: Advances in Database Technology, Avignon, France, 1996, 3–17.
5. Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H,
Chen Q, Hsu MC. Mining sequential patterns by
pattern-growth: the prefixspan approach. IEEE Trans
Knowl Data Eng 2004, 16(11):1424–1440.
6. Berkhin P. A survey of clustering data mining techni-
ques. In: Grouping Multidimensional Data. Berlin
Heidelberg: Springer; 2006, 25–71.
7. Jarvis RA, Patrick EA. Clustering using a similarity
measure based on shared near neighbors. IEEE Trans
Comput 1973, 100(11):1025–1034.
8. Kotsiantis SB. Supervised machine learning: a review
of classification techniques. Informatica 2007, 31
(3):249–269.
9. Quinlan JR. C4.5: Programs for Machine Learning.
San Francisco, CA: Morgan Kaufmann Publishers
Inc.; 1993.
10. Lee W, Stolfo S, Mok K. Adaptive intrusion detec-
tion: a data mining approach. Artif Intell Rev 2000,
14(6):533–567.
11. Vo B, Le T, Hong TP, Le B. Fast updated frequent-
itemset lattice for transaction deletion. Data Knowl
Eng 2015, 96:78–89.
12. Chen MS, Han J, Yu PS. Data mining: an overview
from a database perspective. IEEE Trans Knowl Data
Eng 1996, 8(6):866–883.
13. Vo B, Hong TP, Le B. A lattice-based approach for
mining most generalization association rules. Knowl-
Based Syst 2013, 45:20–30.
14. Fournier-Viger P, Nkambou R, Tseng VS. Rule-
Growth: mining sequential rules common to several
sequences by pattern-growth. In: Proceedings of the
ACM Symp Appl Comput, Taichung, Taiwan,
2011:956–961.
15. Fournier-Viger P, Faghihi U, Nkambou R,
Nguifo EM. CMRules: mining sequential rules com-
mon to several sequences. Knowl-Based Syst 2012,
25:63–76.
16. Kuramochi M, Karypis G. Frequent subgraph discov-
ery. In: Proceedings of the IEEE Int Conf Data Min,
San Jose, California, USA, 2001:313–320.
17. Yan X, Han J. Gspan: graph-based substructure pat-
tern mining. In: Proceedings of the IEEE Int Conf
Data Min, Melbourne, Florida, USA, 2003:721–724.
18. Lin JCW, Gan W, Hong TP, Zhang B. An incremen-
tal high-utility mining algorithm with transaction
insertion. Scientific World J 2015, Article ID 161564.
19. Tseng VS, Wu CW, Shie BE, Yu PS. UP-Growth: an
efficient algorithm for high utility itemset mining. In:
Proceedings of the 16th ACM SIGKDD International
Conference on Knowledge Discovery and Data Min-
ing, Washington, DC, USA, 2010, 253–262.
20. Yao H, Hamilton HJ, Butz CJ. A foundational
approach to mining itemset utilities from databases.
In: Proceedings of the SIAM Int Conf Data Min,
Lake Buena Vista, Florida, USA, 2004:211–225.
21. Lin JCW, Gan W, Fournier-Viger P, Hong TP.
RWFIM: recent weighted-frequent itemsets mining.
Eng Appl Artif Intel 2015, 45:18–32.
22. Vo B, Coenen F, Le B. A new method for mining fre-
quent weighted itemsets based on wit-trees. Expert
Syst Appl 2013, 40(4):1256–1264.
23. Geng L, Hamilton HJ. Interestingness measures for
data mining: A survey. ACM Comput Surv 2006, 38
(3):1–32.
24. Hong TP, Wu YY, Wang SL. An effective mining
approach for up-to-date patterns. Expert Syst Appl
2009, 36(6):9747–9752.
25. Agrawal R, Srikant R. Fast algorithms for mining
association rules in large databases. In: Proceedings
of the International Conference on Very Large Data
Bases, Santiago de Chile, Chile, 1994, 487–499.
26. Dean J, Ghemawat S. MapReduce: a flexible data
processing tool. Commun ACM 2010, 53(1):72–77.
27. Jiang Y. A survey of task allocation and load balan-
cing in distributed systems. IEEE Trans Parallel Dis-
trib Syst 2016, 27(2):585–599.
28. Park B, Kargupta H, Johnson E, Sanseverino E,
Hershberger D, Silvestre L. Distributed, collaborative
data analysis from heterogeneous sites using a scala-
ble evolutionary technique. Appl Intell 2002, 16
(1):19–42.
29. Riesen R, Brightwell R, Maccabe AB. Differences between distributed and parallel systems, SAND98-2221, Unlimited Release, 1998. Available at: http://www.cs.sandia.gov/rbbrigh/papers/distpar.pdf
30. Steen M, Pierre G, Voulgaris S. Challenges in very
large distributed systems. J Internet Serv Appl 2012,
3(1):59–66.
31. Xu L, Huang Z, Jiang H, Tian L, Swanson D. VSFS:
a searchable distributed file system. In: Proceedings of
the IEEE Parallel Data Storage Workshop, New
Orleans, Louisiana, 2014, 25–30.
32. Liu J, Jin X, Wang Y. Agent-based load balancing on
homogeneous minigrids: macroscopic modeling and
characterization. IEEE Trans Parallel Distrib Syst
2005, 16(7):586–598.
33. Luo P, Lü K, Shi Z, He Q. Distributed data mining in
grid computing environments. Future Gener Comput
Syst 2007, 23(1):84–91.
34. Rao W, Chen L, Fu AWC, Wang G. Optimal
resource placement in structured peer-to-peer net-
works. IEEE Trans Parallel Distrib Syst 2010, 21
(7):1011–1026.
35. Xue Y, Li B, Nahrstedt K. Optimal resource alloca-
tion in wireless ad hoc networks: a price-based
approach. IEEE Trans Mobile Comput 2006, 5
(4):347–364.
36. Gkatzikis L, Koutsopoulos I. Migrate or not? Exploit-
ing dynamic task migration in mobile cloud comput-
ing systems. IEEE Wireless Commun 2013, 20
(7):24–32.
37. Jiang Y, Jiang JC. Understanding social networks
from a multiagent perspective. IEEE Trans Parallel
Distrib Syst 2014, 25(10):2743–2759.
38. Cieslak DA, Thain D, Chawla NV. Troubleshooting
distributed systems via data mining. In: Proceedings
of the IEEE Int Symp High Perform Distrib Comput,
Paris, France, 2006:309–312.
39. Fatta GD, Berthold MR. Dynamic load balancing for
the distributed mining. IEEE Trans Parallel Distrib
Syst 2006, 17(8):773–785.
40. Silva JCD, Giannella C, Bhargava R, Kargupta H,
Klusch M. Distributed data mining and agents. Eng
Appl Artif Intel 2005, 18(7):791–807.
41. Zeng L, Li L, Duan L, Lu K, Shi Z, Wang M, Wu W,
Luo P. Distributed data mining: a survey. Inf Technol
Manage 2012, 13(4):403–409.
42. Tsoumakas G, Vlahavas I. Distributed data mining.
In: Encyclopedia of Data Warehousing and Mining,
IGI Global, Hershey, PA, USA, 2009, 709–715.
43. Thampi SM. Survey on distributed data mining in P2P networks, 2012. arXiv preprint arXiv:1205.3231.
44. Chang F, Dean J, Ghemawat S, Hsieh WC. Bigtable:
a distributed storage system for structured data.
ACM Trans Comput Syst 2008, 26(2):4.
45. Tanenbaum AS, Steen MV. Distributed Systems: Prin-
ciples and Paradigms. Upper Saddle River, NJ: Pren-
tice-Hall, Inc.; 2006.
46. Agrawal R, Shafer JC. Parallel Mining of Association
Rules. IEEE Trans Knowl Data Eng 1996, 8
(6):962–969.
47. Khan N, Yaqoob I, Hashem IA, Inayat Z, Ali WK,
Alam M, Shiraz M, Gani A. Big data: survey, technol-
ogies, opportunities, and challenges. Scientific World
J 2014, 2014: Article ID 712826.
48. Wu X, Zhu X, Wu GQ, Ding W. Data mining with
big data. IEEE Trans Knowl Data Eng 2014, 26
(1):97–107.
49. Li F, Ooi BC, Özsu MT, Wu S. Distributed data man-
agement using MapReduce. ACM Comput Surv
2014, 46(3): 31.
50. Yang Q, Wu X. 10 challenging problems in data min-
ing research. Int J Inf Technol Decision Making
2006, 5(4):597–604.
51. Mueller A. Fast sequential and parallel algorithms for association rule mining: a comparison. Technical Report, University of Maryland at College Park, 1995.
52. Park JS, Chen MS, Yu PS. Efficient parallel data min-
ing for association rules. In: Proceedings of the
ACM Int Conf Inf Knowl Manage, Baltimore, MD,
USA, 1995:31–36.
53. Cheung DW, Han J, Ng VT, Fu AW, Fu Y. A fast
distributed algorithm for mining association rules. In:
Proceedings of the Int Conf Parallel Distrib Inf Syst,
Miami Beach, Florida, USA, 1996:31–42.
54. Shintani T, Kitsuregawa M. Hash-based parallel algo-
rithms for mining association rules. In: Proceedings of
the Int Conf Parallel Distrib Inf Syst, Miami Beach,
Florida, USA, 1996:19–30.
55. Zaki MJ, Ogihara M, Parthasarathy S, Li W. Parallel
data mining for association rules on shared-memory
multi-processors. In: Proceedings of the ACM/IEEE
Conf Supercomput, Pittsburgh, PA, USA,
1996:43–43.
56. Cheung DW, Ng VT, Fu AW, Fu Y. Efficient mining
of association rules in distributed databases.
IEEE Trans Knowl Data Eng 1996, 8(6):911–922.
57. Zaki MJ, Parthasarathy S, Ogihara M, Li W. Parallel
algorithms for discovery of association rules. Data
Min Knowl Discov 1997, 1(4):343–373.
58. Han EH, Karypis G, Kumar V. Scalable parallel data
mining for association rules. IEEE Trans Knowl Data
Eng 2000, 12(3):337–352.
59. Agarwal RC, Aggarwal CC, Prasad VVV. A tree pro-
jection algorithm for generation of frequent item sets.
J Parallel Distrib Comput 2001, 61(3):350–371.
60. Schuster A, Wolff R. Communication-efficient distrib-
uted mining of association rules. ACM SIGMOD
Record 2001, 30(2):473–484.
61. Cho V, Wüthrich B. Distributed mining of classifica-
tion rules. Knowl Inf Syst 2002, 4(1):1–30.
62. Otey M, Parthasarathy S, Wang C, Veloso A,
Meira W. Parallel and distributed methods for incre-
mental frequent itemset mining. IEEE Trans Syst
Man Cybern B Cybern 2004, 34(6):2439–2450.
63. Cong S, Han J, Hoeflinger J, Padua D. A sampling-
based framework for parallel data mining. In: Pro-
ceedings of the ACM SIGPLAN Symposium on Prin-
ciples and Practice of Parallel Programming,
Chicago, Illinois, USA, 2005, 255–265.
64. Li H, Wang Y, Zhang D, Zhang M, Chang EY. PFP:
parallel FP-growth for query recommendation. In:
Proceedings of the ACM Conf Recommender Syst,
Lousanne, Switzerland, 2008:107–114.
65. Yu KM, Zhou J, Hong TP, Zhou JL. A load-balanced
distributed parallel mining algorithm. Expert Syst
Appl 2010, 37(3):2459–2464.
66. Lin MY, Lee PY, Hsueh SC. Apriori-based frequent
itemset mining algorithms on MapReduce. In: Pro-
ceedings of the 6th ACM International Conference on
Ubiquitous Information Management and Communi-
cation, Kuala Lumpur, Malaysia, 2012, p. 76.
67. Riondato M, DeBrabant J, Fonseca R, Upfal E.
PARMA: a parallel randomized algorithm for
approximate association rules mining. In: Proceedings
of the 21st ACM International Conference on Infor-
mation and Knowledge Management, Maui, HI,
USA, 2012, 85–94.
68. Moens S, Aksehirli E, Goethals B. Frequent itemset
mining for big data. In: Proceedings of the IEEE Int
Conf Big Data, Santa Clara, CA, USA,
2013:111–118.
69. Baralis E, Cerquitelli T, Chiusano S. P-Mine: parallel
itemset mining on large datasets. In: Proceedings of
the IEEE 29th International Conference on Data
Engineering Workshops, Brisbane, Australia, 2013,
266–271.
70. Kolias V, Kolias C, Anagnostopoulos I, Kayafas E.
RuleMR: classification rule discovery with
MapReduce. In: Proceedings of the IEEE Int Conf
Big Data, Washington DC, USA, 2014:20–28.
71. Qiu H, Gu R, Yuan C, Huang Y. YAFIM: a parallel
frequent itemset mining algorithm with spark. In:
Proceedings of the IEEE International Parallel and
Distributed Processing Symposium Workshops
(IPDPSW), Phoenix, AZ, USA, 2014, 1664–1671.
72. Zhang F, Liu M, Gui F, Shen W, Shami A, Ma Y. A
distributed frequent itemset mining algorithm using
spark for big data analytics. Cluster Comput 2015,
18(4):1493–1501.
73. Feng X, Zhao J, Zhang Z. MapReduce-based H-
Mine algorithm. In: Proceedings of the International
Conference on Instrumentation and Measurement,
Computer, Communication and Control, Harbin,
China, 2015, 1755–1760.
74. Apiletti D, Baralis E, Cerquitelli T, Garza P,
Michiardi P, Pulvirenti F. PaMPa-HD: a parallel
MapReduce-based frequent pattern miner for high-
dimensional data. In: Proceedings of the IEEE Inter-
national Conference on Data Mining Workshop,
Atlantic City, New Jersey, 2015, 839–846.
75. Rathee S, Kaul M, Kashyap A. R-Apriori: an efficient
Apriori-based algorithm on Spark. In: Proceedings of
the 8th ACM PhD Workshop in Information
and Knowledge Management, 2015, 27–34.
76. Lin KW, Chung SH, Lin CC. A fast and distributed
algorithm for mining frequent patterns in congested
networks. Computing 2016, 98(3):235–256.
77. Salah S, Akbarinia R, Masseglia F. A highly scalable
parallel algorithm for maximally informative k-
itemset mining. Knowl Inf Syst 2017, 50(1):1–26.
78. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, et al.
Resilient distributed datasets: a fault-tolerant abstrac-
tion for in-memory cluster computing. In: Proceed-
ings of the 9th USENIX Conference on Networked
Systems Design and Implementation, San Jose, CA,
USA, 2012:2–2.
79. Wang J, Han J. BIDE: efficient mining of frequent
closed sequences. In: Proceedings of the Int Conf
Data Eng, Boston, MA, USA, 2004:79–90.
80. Yan X, Han J, Afshar R. CloSpan: mining closed
sequential patterns in large datasets. In: Proceedings
of the SIAM International Conference on Data Min-
ing, San Francisco, CA, USA, 2003:166–177.
81. Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U.
FreeSpan: frequent pattern-projected sequential pat-
tern mining. In: Proceedings of the ACM SIGKDD
Int Conf Knowl Discov Data Min, Boston, MA,
USA, 2000:355–359.
82. Han J, Pei J, Mortazavi-Asl B, Pinto H, Chen Q,
Dayal U, Hsu MC. PrefixSpan: mining sequential pat-
terns efficiently by prefix-projected pattern growth.
In: Proceedings of the 17th International Conference
on Data Engineering, Heidelberg, Germany, 2001,
215–224.
Overview wires.wiley.com/dmkd
16 of 19 © 2017 Wiley Periodicals, Inc. Volume 7, November/December 2017
83. Zaki MJ. SPADE: an efficient algorithm for mining
frequent sequences. Mach Learn 2001, 42(1):31–60.
84. Shintani T, Kitsuregawa M. Mining algorithms for
sequential patterns in parallel: hash-based approach.
In: Proceedings of the Pacific-Asia Conf Knowl Dis-
cov Data Min, Melbourne, Australia, 1998:283–294.
85. Joshi MV, Karypis G, Kumar V. Parallel algo-
rithms for mining sequential associations: issues and
challenges. Technical Report under preparation,
Department of Computer Science, University of Min-
nesota, vol. 119, 1999.
86. Zaki MJ. Parallel sequence mining on shared-memory
machines. J Parallel Distrib Comput 2001, 61
(3):401–426.
87. Wang K, Xu Y, Yu J. Scalable sequential pattern min-
ing for biological sequences. In: Proceedings of the
ACM Int Conf Inf Knowl Manage, Washington, DC,
USA, 2004:178–187.
88. Guralnik V, Karypis G. Parallel tree-projection-based
sequence mining algorithms. Parallel Comput 2004,
30(4):443–472.
89. Cong S, Han J, Padua D. Parallel mining of closed
sequential patterns. In: Proceedings of the
ACM SIGKDD Int Conf Knowl Dis in Data Mining,
Chicago, IL, USA, 2005:562–567.
90. Qiao S, Tang C, Dai S, Zhu M, Peng J, Li H, Ku Y.
PartSpan: parallel sequence mining of trajectory pat-
terns. In: Proceedings of the Int Conf Fuzzy Syst
Knowl Discov, Shandong,
China, 2008:363–367.
91. Qiao S, Li T, Peng J, Qiu J. Parallel sequential pattern
mining of massive trajectory data. Int J Comput Intell
Syst 2010, 3(3):343–356.
92. Miliaraki I, Berberich K, Gemulla R, Zoupanos S.
Mind the gap: large-scale frequent sequence mining.
In: Proceedings of the ACM SIGMOD Int Conf Man-
age Data, New York, USA, 2013:797–808.
93. Sahli M, Mansour E, Kalnis P. Parallel motif extrac-
tion from very long sequences. In: Proceedings of the
22nd ACM International Conference on Information
and Knowledge Management, San Francisco, CA,
USA, 2013, 549–558.
94. Liao VCC, Chen MS. DFSP: a depth-first spelling
algorithm for sequential pattern mining of biological
sequences. Knowl Inf Syst 2014, 38(3):623–639.
95. Ge J, Xia Y, Wang J. Mining uncertain sequential
patterns in iterative MapReduce. In: Proceedings of
the Pacific-Asia Conference on Knowledge Discovery
and Data Mining, Ho Chi Minh City, Vietnam,
2015, 243–254.
96. Beedkar K, Gemulla R. LASH: large-scale sequence
mining with hierarchies. In: Proceedings of the ACM
SIGMOD International Conference on Management
of Data, Melbourne, VIC, Australia, 2015, 491–503.
97. Ge J, Xia Y. Distributed sequential pattern mining in
large scale uncertain databases. In: Proceedings of the
Pacific-Asia Conference on Knowledge Discovery and
Data Mining, Auckland, New Zealand, 2016, 17–29.
98. Dean J, Ghemawat S. MapReduce: simplified data
processing on large clusters. Commun ACM 2008, 51
(1):107–113.
99. Malewicz G, Austern MH, Bik AJC, Dehnert JC,
Horn I, Leiser N, Czajkowski G. Pregel: a system for
large-scale graph processing. In: Proceedings of the
ACM SIGMOD Int Conf Manage Data, Indianapo-
lis, Indiana, USA 2010:135–146.
100. Low Y. GraphLab: a distributed abstraction for large
scale machine learning. Doctoral Dissertation, Uni-
versity of California, Berkeley, CA, 2013.
101. Gonzalez JE, Xin RS, Dave A, Crankshaw D,
Franklin MJ, Stoica I. GraphX: graph processing in a
distributed dataflow framework. In: Proceedings of
the USENIX Symposium on Operating Systems
Design and Implementation (OSDI), Broomfield,
CO, USA, 2014, 599–613.
102. Meinl T, Worlein M, Fischer I, Philippsen M. Mining
molecular datasets on symmetric multiprocessor sys-
tems. In: Proceedings of the IEEE Int Conf Syst Man
Cybern, Taipei, Taiwan, 2006:1269–1274.
103. Wang C, Parthasarathy S. Parallel algorithms for
mining frequent structural motifs in scientific data.
In: Proceedings of the 18th ACM Annual Interna-
tional Conference on Supercomputing, Saint Malo,
France, 2004, 31–40.
104. Liu Y, Jiang X, Chen H, Ma J, Zhang X.
MapReduce-based pattern finding algorithm applied
in motif detection for prescription compatibility net-
work. In: Proceedings of the International Workshop
on Advanced Parallel Processing Technologies,
Shanghai, China, 2009, 341–355.
105. Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C.
PowerGraph: distributed graph-parallel computation
on natural graphs. In: Proceedings of the 10th USE-
NIX Symposium on Operating Systems Design and
Implementation (OSDI), Hollywood, CA, USA,
2012, 17–30.
106. Lu W, Chen G, Tung AKH, Zhao F. Efficiently
extracting frequent subgraphs using MapReduce. In:
Proceedings of the IEEE Int Conf Big Data, Santa
Clara Marriott, California, USA, 2013:639–647.
107. Lin W, Xiao X, Ghinita G. Large-scale frequent sub-
graph mining in MapReduce. In: Proceedings of the
30th IEEE Int Conf on Data Engineering, Chicago,
IL, USA, 2014, 844–855.
108. Zhu X, Han W, Chen W. GridGraph: large-scale
graph processing on a single machine using 2-level
hierarchical partitioning. In: Proceedings of the USE-
NIX Annual Technical Conference, Santa Clara, CA,
USA, 2015, 375–386.
109. Lee H, Shao B, Kang U. Fast graph mining with
HBase. Inform Sci 2015, 315:56–66.
110. Teixeira CHC, Fonseca AJ, Serafini M, Siganos G,
Zaki MJ, Aboulnaga A. Arabesque: a system for dis-
tributed graph mining. In: Proceedings of the 25th
ACM Symposium on Operating Systems Principles,
Monterey, California, USA, 2015, 425–440.
111. Talukder N, Zaki MJ. A distributed approach for
graph mining in massive networks. Data Min Knowl
Discov 2016, 30(5): 1024–1052.
112. Shirkhorshidi AS, Aghabozorgi S, Wah TY,
Herawan T. Big data clustering: a review. In: Pro-
ceedings of the Int Conf Comput Sci Appl, Guimar-
aes, Portugal, 2014:707–720.
113. Younis O, Fahmy S. Distributed clustering in ad-hoc
sensor networks: a hybrid, energy-efficient approach.
In: Proceedings of the Annual Joint Conference of the
IEEE Computer and Communications Societies,
Hong Kong, China, vol. 1, 2004.
114. Zhou A, Cao F, Yan Y, Sha C. Distributed data stream
clustering: a fast em-based approach. In: Proceedings of
the 23rd IEEE International Conference on Data Engi-
neering, Istanbul, Turkey, 2007, 736–745.
115. Visalakshi NK, Thangavel K. Distributed data clus-
tering: a comparative analysis. In: Foundations of
Computational Intelligence, vol. 6. Springer Berlin
Heidelberg, 2009, 371–397.
116. Zhao W, Ma H, He Q. Parallel k-means clustering
based on MapReduce. In: Proceedings of the
IEEE Int Conf Cloud Comput, Bangalore, India,
2009:674–679.
117. Dave A, Lu W, Jackson J, Barga R. CloudClustering:
toward an iterative data processing pattern on the
cloud. In: Proceedings of the IEEE International
Parallel and Distributed Processing Workshops and
PhD Forum, Anchorage, Alaska, USA, 2011:1132–1137.
118. Eyal I, Keidar I, Rom R. Distributed data clustering
in sensor networks. Distrib Comput 2011, 24
(5):207–222.
119. Forero PA, Cano A, Giannakis GB. Distributed clus-
tering using wireless sensor networks. IEEE J Sel Top
Signal Process 2011, 5(4):707–724.
120. Bahmani B, Moseley B, Vattani A, Kumar R. Scalable
K-means++. VLDB Endowment 2012, 5(7):622–633.
121. Liang Y, Balcan MF, Kanchanapally V. Distributed
PCA and K-means clustering. In: Proceedings of the
Big Learning Workshop at NIPS, 2013.
122. Han J, Luo M. Bootstrapping k-means for big
data analysis. In: Proceedings of the IEEE Int
Conf Big Data, Washington DC, USA,
2014:591–596.
123. Cui X, Zhu P, Yang X, Li K, Ji C. Optimized big data
K-means clustering using MapReduce. J Supercomput
2014, 70(3):1249–1259.
124. Xu Y, Qu W, Li Z, Min G, Li K, Liu Z. Efficient k-
means++ approximation with MapReduce. IEEE Trans
Parallel Distrib Syst 2014, 25(12):3135–3144.
125. Balcan MF, Liang Y, Song L, Woodruff D.
Communication efficient distributed kernel principal
component analysis, 2015. arXiv preprint
arXiv:1503.06858.
126. Mashayekhi H, Habibi J, Khalafbeigi T, Voulgaris S,
Steen MV. GDCluster: a general decentralized cluster-
ing algorithm. IEEE Trans Knowl Data Eng 2015,
27(7):1892–1905.
127. Wold S, Esbensen K, Geladi P. Principal components
analysis. Chemom Intel Lab Syst 1987, 2(1–5):37–52.
128. Schölkopf B, Smola A, Müller KR. Kernel principal com-
ponent analysis. In: Proceedings of the Int Conf Artif
Neural Netw, Lausanne, Switzerland, 1997:583–588.
129. Clifton C, Kantarcioglu M, Vaidya J, Lin X. Tools
for privacy preserving distributed data mining.
ACM SIGKDD Explor Newslett 2002, 4(2):28–34.
130. Kantarcioglu M, Clifton C. Privacy-preserving dis-
tributed mining of association rules on horizontally
partitioned data. IEEE Trans Knowl Data Eng 2004,
16(9):1026–1037.
131. Luo C, Pereira AL, Chung SM. Distributed mining of
maximal frequent itemsets on a data grid system.
J Supercomput 2006, 37(1):71–90.
132. Zhong S. Privacy-preserving algorithms for distribu-
ted mining of frequent itemsets. Inform Sci 2007,
177:490–503.
133. Kargupta H, Das K, Liu K. Multiparty, privacy pre-
serving distributed data mining using a game theo-
retic framework. In: Proceedings of the European
Conference on Principles of Data Mining and Knowl-
edge Discovery, Warsaw, Poland, 2007, 523–531.
134. Yakut I, Polat H. Privacy-preserving hybrid collabo-
rative filtering on cross distributed data. Knowl Inf
Syst 2012, 30(2):405–433.
135. Kaleli C, Polat H. Privacy-preserving SOM-based
recommendations on horizontally distributed data.
Knowl-Based Syst 2012, 33:124–135.
136. Li Y, Chen M, Li Q, Zhang W. Enabling multilevel
trust in privacy preserving data mining. IEEE Trans
Knowl Data Eng 2012, 24(9):1598–1612.
137. Chun JY, Hong D, Jeong IR, Lee DH. Privacy-
preserving disjunctive normal form operations on dis-
tributed sets. Inform Sci 2013, 231:113–122.
138. Zhang F, Rong C, Zhao G, Wu J, Wu X. Privacy-
preserving two-party distributed association rules
mining on horizontally partitioned data. In: Proceed-
ings of the Int Conf Cloud Comput Big Data,
FuZhou, China, 2013:633–640.
139. Tassa T. Secure mining of association rules in hori-
zontally distributed databases. IEEE Trans Knowl
Data Eng 2014, 26(4):970–983.
140. Bhuyan HK, Kamila NK. Privacy preserving sub-
feature selection in distributed data mining. Appl Soft
Comput 2015, 36:552–569.
141. Qin Z, Ren K, Yu T, Weng J. DPCode: privacy-
preserving frequent visual patterns publication on
cloud. IEEE Trans Multimedia 2016, 18(5):929–939.
142. Lu R, Zhu H, Liu X, Liu J, Shao J. Toward efficient
and privacy-preserving computing in big data era.
IEEE Netw 2014, 28(4):46–50.
143. Malik MB, Ghazi MA, Ali R. Privacy preserving
data mining techniques: current scenario and
future prospects. In: Proceedings of the Int Conf
Comput Commun Technol, Allahabad, India,
2012:26–32.
144. Parthasarathy S, Ghoting A, Otey M. A survey of dis-
tributed mining of data streams. In: Data Streams:
Models and Algorithms. Springer, 2007, 289–307.
145. Xu L, Jiang C, Wang J, Yuan J, Ren Y. Information
security in big data-privacy and data mining.
IEEE Access 2014, 2:1149–1176.