Overview
Data mining in distributed environment: a survey

Wensheng Gan,¹ Jerry Chun-Wei Lin,¹* Han-Chieh Chao² and Justin Zhan³

¹School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen University Town Xili, Shenzhen, China
²Department of Computer Science and Information Engineering, National Dong Hwa University, Shoufeng, Taiwan
³Department of Computer Science, University of Nevada, Las Vegas, NV, USA
*Correspondence to: jerrylin@ieee.org

Conflict of interest: The authors have declared no conflicts of interest for this article.
Due to the rapid growth of resource sharing, distributed systems have been developed that can be used to harness distributed computation. Data mining (DM) provides powerful techniques for finding meaningful and useful information from a very large amount of data, and has a wide range of real-world applications. However, traditional DM algorithms assume that the data is centrally collected, memory-resident, and static. It is challenging to manage large-scale data and process it with very limited resources. For example, large amounts of data are quickly produced and stored at multiple locations, and it becomes increasingly expensive to centralize them in a single place. Moreover, traditional DM algorithms face problems and challenges such as memory limits, low processing ability, and inadequate hard disk space. To address these problems, DM in distributed computing environments [also called distributed data mining (DDM)] has been emerging as a valuable alternative in many applications. In this study, a survey of state-of-the-art DDM techniques is provided, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preserving of distributed data mining. We finally summarize the opportunities of data mining tasks in distributed environments. © 2017 Wiley Periodicals, Inc.

How to cite this article:
WIREs Data Mining Knowl Discov 2017, 7:e1216. doi: 10.1002/widm.1216
INTRODUCTION
With the rapid development of information technology and data collection, Knowledge Discovery in Databases (KDD) provides a powerful capability to discover meaningful and useful information from a collection of data [1–4]. KDD has numerous real-life applications and has resulted in several DM tasks, such as association rule mining (ARM) [2,3], sequential pattern mining (SPM) [4,5], clustering [6,7], classification [8,9], and outlier detection [10], among others. Depending on the requirements of various domains and applications, the discovered knowledge can be generally classified as frequent itemsets and association rules [2,11–13], sequential patterns [1,4,5], sequential rules [14,15], graphs [16,17], high-utility patterns [18–20], weight-based patterns [21,22], and other interesting patterns [23,24]. As an important task for a wide range of real-world applications, frequent itemset mining (FIM) or ARM has been extensively studied. Two well-known algorithms, Apriori [25] and FP-growth [3], were proposed to mine frequent itemsets and association rules based on the generate-and-test and pattern-growth approaches, respectively [3]. Many algorithms have been developed to efficiently mine the desired patterns and information from various types of databases [2,3,6,12,17,23,25].
In general, the distribution of data and computation allows researchers and engineers to solve many problems, and it suits the numerous applications that are distributed in nature.
Distributed systems, in which computational units are connected and organized by networks to meet the demands of both large-scale and high-performance computing, have received considerable attention over the past decades [26–31]. Many types of distributed systems, such as grids [32,33], peer-to-peer (P2P) systems [34], ad hoc networks [35], cloud computing systems [36], and online social network systems [37], have been widely studied. Currently, the applications of distributed systems are varied, including web services, scientific computation, and file storage. At the same time, DM has also been extensively studied [2,3,6,12,17,23,25].
By using DM techniques, organizations, businesses, companies, and scientific centers can discover different kinds of hidden but useful and meaningful patterns and information. As mentioned before, data collected in a distributed manner can also be analyzed by DM techniques [28]. An important scenario of DM is that the databases are distributed between two or more parties, and each party owns a portion of the data. In the past, traditional methods typically assumed that the data is centralized and memory-resident [2,3,6,12,17,23,25]. This assumption is no longer tenable in distributed systems. Unfortunately, directly applying traditional mining algorithms to distributed databases is not effective because it implies a large amount of communication overhead. Implementing high-performance DM in distributed computing environments has thus become critical to exploiting the scalability of such systems.
For traditional DM technologies, a centralized approach is fundamentally inappropriate for many reasons, such as the huge amount of data, the infeasibility of centralizing data stored at multiple sites, bandwidth limitations, energy limitations, and privacy concerns. Therefore, it is important to develop a more adaptable and flexible mining framework to discover hidden but useful and meaningful patterns and information from distributed and complex databases instead of centralized ones. To solve these problems, DM in distributed environments [also called distributed data mining (DDM)] has emerged as an important research area [38–41]. In the DDM literature, one of two assumptions is commonly adopted as to how data is distributed across sites: homogeneously (horizontally partitioned) or heterogeneously (vertically partitioned) [42]. In general, DDM deals with the challenges of analyzing distributed data and offers many algorithmic solutions to perform different data analysis and mining operations in a fundamentally distributed manner that pays careful attention to resource constraints. To improve the performance and scalability of DM, many researchers have provided techniques that work in distributed environments, such as grid computing [32,33], the cloud [36], and Hadoop (the popular open-source implementation of MapReduce [26], http://hadoop.apache.org), and that distribute the mining computation over more than a single node. Previous studies [38–41] have shown that DDM is a powerful tool for end-users, enterprises, and governments to analyze data and discover different kinds of useful knowledge. It provides new opportunities but also poses challenges for DM.
Although some related surveys have been published, most of them provide only a preliminary review of a single type of distributed system, such as surveys of load balancing in grids [32,33], load balancing in cloud computing [27], and load balancing in peer-to-peer (P2P) systems [34,43]. A natural question is how to summarize the related studies on the various types of DM in distributed systems and organize them into a general taxonomy. The methods summarized in this study cover not only distributed systems [44,45], but also related literature on DM [12], parallel computing [46], big data technologies [47,48], and database management [49]. This study thus aims to review current research on DDM. The main contributions of this study are described as follows:
1. We first point out the differences between traditional DM algorithms and those designed for distributed environments. More challenges are encountered when accomplishing DM tasks in a distributed system.

2. We review contemporary work on DM in distributed environments in recent years. This is a high-level survey of distributed system techniques for DM in several respects, including distributed frequent itemset mining (DFIM), distributed frequent sequence mining (DFSM), distributed frequent graph mining (DFGM), distributed clustering (DC), and privacy preserving of distributed data mining (PPDDM).

3. Finally, some opportunities for future research on DM tasks in distributed environments are briefly summarized.
The study is organized as follows: the Distributed Systems and Their Technical Challenges section introduces the definitions and some important features of distributed systems, and summarizes challenges in distributed systems and DDM. The Data Mining Techniques in Distributed Environment section highlights and discusses state-of-the-art research on DM with distributed computing resources. The Opportunity for Distributed Data Mining section briefly summarizes some opportunities for DM tasks in distributed environments. Finally, conclusions are given in the Conclusion section.
DISTRIBUTED SYSTEMS AND THEIR TECHNICAL CHALLENGES
In this section, the related definitions and some important features of distributed systems are stated. Some technical challenges in distributed systems and DDM are then briefly reviewed and summarized.
Distributed Systems
Unlike traditional centralized systems, the term distributed system refers to a large collection of resources shared among computers connected by a network; examples include hardware sharing, software sharing, data sharing, service sharing, and media stream sharing. The development of collaborative computing, parallel computing, and distributed computing motivated the development of distributed systems. A distributed system is defined as one in which components located at networked computers communicate and coordinate their actions only by passing messages [44,45]. In other words, a distributed system is a collection of autonomous computing elements (subsystems) that appears to its users as a single coherent system. A distributed system has a complex nature that requires powerful technologies and advanced algorithms, as shown in Figure 1.

From Figure 1, it can be observed that a distributed system has two aspects: independent computing elements, and a single system image provided by middleware. Distributed systems have several important features: (1) concurrency, i.e., multiprocess and multithread concurrent execution and resource sharing (sharing of information and services); (2) no global clock, with program coordination depending on message passing; and (3) independent failure, e.g., a process failure cannot be directly observed by other processes [44,45]. According to Refs 44 and 45, properties of a distributed system such as transparency, scalability, availability, reliability, serviceability (manageability), and safety should also be discussed and studied.
Challenges in Distributed Systems
Distributed systems, in which the distributed computational units are connected and organized by networks to meet the demand of large-scale and high-performance computing, have received considerable attention over the past decades [26–31]. Many types of distributed systems, such as grids [32,33], P2P systems [34], ad hoc networks [35], cloud computing systems [36], and online social network systems [37], have been widely studied. Currently, there are various applications of distributed systems, such as DM, web services, scientific computation, and file storage. Although great progress in distributed systems has been made, some technical challenges remain [44,45]. As shown in Figure 2, the main challenges in distributed systems can be grouped into eight aspects: heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service. Details of each challenge can be found in Refs 44 and 45.
Challenges in Distributed Data Mining
In recent decades, many models and algorithms have been developed in DM to efficiently discover desired knowledge in various types of databases [2,3,12,23,25], but some challenges in DM have yet to be solved. In 2006, Yang and Wu [50] introduced 10 challenging problems in DM research, such as developing a unifying theory of DM, scaling up for high-dimensional data, DDM and mining multiagent data, security, and privacy. Traditional DM algorithms assume that the data is centralized, memory-resident, and static. Because of the growth of large-scale data in recent decades, two challenges have to be met. First, the amounts of data are rapidly produced. Second, the data are stored at multiple locations, and it becomes increasingly expensive to centralize them in one place. Therefore, the problem of DDM is quite important in various complex network databases. In a distributed environment (such as a sensor or IP network), distributed probes are placed at strategic locations within the network, often in areas with limited energy and limited memory (e.g., limited CPU computation and I/O calls across a distributed architecture). Therefore, DDM techniques are more challenging and complex than those of traditional DM [38–41].

With the data collected from distributed sites, DDM explores techniques for applying DM in a noncentralized way. The goal here is obviously to minimize the amount of data shipped between the various sites. Some important challenges for DDM, such as how to essentially reduce the communication overhead, how to mine across multiple heterogeneous data sources (i.e., multisource databases), and how to perform multirelational mining in a distributed environment, have been studied. As shown in Figure 2, the eight technical challenges in a distributed system, including heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service, are the same challenges faced when performing DDM, especially heterogeneity, security, and scalability. DDM deals with these challenges in analyzing distributed data and offers many algorithmic solutions to perform different data analysis and mining operations in a fundamentally distributed manner that pays careful attention to the resource constraints.
DATA MINING TECHNIQUES IN DISTRIBUTED ENVIRONMENT
In this section, state-of-the-art algorithms for DM in distributed environments, including DFIM, DFSM, DFGM, DC, and privacy preserving of DDM (PPDDM), are reviewed. For each task, the preliminaries and the problem statement are first given briefly; we then describe the key ideas of the related works in detail and highlight their specific contributions.
Distributed Frequent Itemset Mining
Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items; an itemset $X = \{i_1, i_2, \ldots, i_k\}$ with $k$ items is a subset of $I$. The length or size of $X$ is denoted $|X|$, i.e., the number of items in $X$ (w.r.t. $k$). Given a transactional database $D$, each transaction $T_q \in D$ is generally identified by a transaction id (TID), and $|D|$ denotes the total number of transactions. The support of $X$ in database $D$ is denoted $sup(X)$ and is the proportion of transactions containing $X$, i.e., $sup(X) = |\{T_q \mid T_q \in D, X \subseteq T_q\}| / |D|$. The support count or frequency of itemset $X$ is the number of transactions in $D$ containing $X$. An itemset is said to be a frequent itemset (FI) if its support is no less than the user-defined minimum support threshold, minsup.
FIGURE 1 | Architecture of a distributed system. Each host runs application components on top of a middleware layer, a local OS, and hardware; the middleware presents the same interface everywhere, and the hosts are connected through a network.
FIGURE 2 | Technical challenges in distributed systems: heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service.
Therefore, the problem of frequent itemset mining is to discover all itemsets whose support is no less than the user-defined minimum support threshold, i.e., $sup(X) \geq minsup$ [25].
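To make these definitions concrete, the following minimal Python sketch (a toy example of our own, not one of the surveyed algorithms) enumerates the itemsets of a four-transaction database and keeps those whose support reaches minsup = 0.5.

```python
# Minimal illustration of the support definition above (plain Python, no
# distribution): sup(X) = |{T in D : X is a subset of T}| / |D|.
from itertools import combinations

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]   # toy database
minsup = 0.5

def support(X, D):
    return sum(1 for T in D if X <= T) / len(D)

# Enumerate candidate itemsets naively and keep the frequent ones.
items = sorted(set().union(*D))
frequent = {}
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        s = support(set(X), D)
        if s >= minsup:
            frequent[X] = s
print(frequent)   # e.g. {('a',): 0.75, ('b',): 0.75, ..., ('a', 'b'): 0.5, ...}
```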
As one of the most important tasks for a wide range of real-world applications, FIM and ARM have been extensively studied [2,11–13]. ARM consists of two phases: it first discovers the frequent itemsets, and then generates the association rules from the derived frequent itemsets. Because the first phase is more challenging and interesting than the second, most efforts on ARM address the problem of FIM. Two well-known algorithms, Apriori [25] and FP-growth [3], were respectively proposed to mine frequent itemsets and association rules, and many algorithms have since been developed to efficiently mine the desired frequent itemsets or association rules from various types of databases [2,3,12,23,25]. The problem of FIM in a distributed/parallel environment (DFIM) has also been extensively studied, and a number of approaches have been explored to address it. Table 1 gives an overview of frequent itemset mining algorithms for distributed/parallel environments.
In 1995, Mueller first proposed two parallel algorithms, called parallel efficient association rules (PEAR) [51] and parallel partition association rules (PPAR) [51]. Park et al. proposed an algorithm named parallel data mining (PDM) for parallel mining of association rules [52], and the fast distributed mining (FDM) algorithm for distributed databases [53] was developed later. Cheung et al. proposed a mining algorithm named DMA to mine association rules in distributed databases [56]. An algorithm named Hash Partitioned Apriori (HPA) was first introduced in Ref 54, and the modified HPA-ELD approach [54], i.e., HPA with extremely large itemset duplication, was then proposed. Based on the partition technology, Zaki et al. then developed the Partitioned Candidate Common Database (PCCD) and Common Candidate Partitioned Database (CCPD) algorithms [55]. At the same time, several data distribution (DD)-based technologies have been extensively studied, such as CD [46], CD tree projection [59], DD [46], HD [58], IDD [58], IDD tree projection [59], and DDDM [60].
By extending the vertical mining approach Eclat [57], the parallel Eclat (ParEclat) [57] and the distributed Eclat (Dist-Eclat) [68] were, respectively, developed. With dynamic mining in mind, the ZIGZAG-based incremental approach [62] was proposed for distributed and parallel incremental mining of frequent rules. Lin et al. developed three versions of the Apriori algorithm, namely single pass counting (SPC), fixed passes combined counting (FPC), and dynamic passes combined counting (DPC), on the MapReduce framework [66]. SPC is a straightforward algorithm, while FPC aims at reducing the number of scheduling invocations and DPC dynamically combines candidates of various lengths.
Recently, many DDM algorithms have been developed on the Spark or Hadoop platforms. Hadoop is one of the well-known platforms using the MapReduce framework [26], and it is open-source software available for any implementation. The Hadoop distributed file system (HDFS) is used to store datasets in Hadoop (http://hadoop.apache.org). Spark [78] is a newer in-memory, distributed data-flow platform, which uses the Resilient Distributed Dataset (RDD) architecture to store the results at the end of an iteration and provide them to the next iteration. In general, Spark is one to two orders of magnitude faster than MapReduce [78].
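As a hedged illustration of how the counting step of Apriori-style FIM maps onto this kind of platform (in the spirit of the single-pass counting idea mentioned above, not a reproduction of SPC itself), the sketch below uses Spark's RDD API to count candidate 2-itemsets with a map and a reduceByKey; the application name and toy data are our own.

```python
# Hedged sketch: counting candidate 2-itemsets with Spark RDD operations.
# Map phase emits (candidate, 1) pairs per transaction; reduce phase sums them.
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="spc-style-counting-sketch")

transactions = sc.parallelize([
    ["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"],
])

def candidates(transaction, k=2):
    # Emit (candidate k-itemset, 1) pairs for one transaction.
    return [(itemset, 1) for itemset in combinations(sorted(transaction), k)]

support_counts = (transactions
                  .flatMap(candidates)                 # "map" phase
                  .reduceByKey(lambda a, b: a + b))    # "reduce" phase

min_count = 2
frequent_2_itemsets = support_counts.filter(lambda kv: kv[1] >= min_count)
print(frequent_2_itemsets.collect())  # e.g. [(('a','b'), 2), (('a','c'), 2), (('b','c'), 2)]
sc.stop()
```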
Research efforts have already been made to improve the Apriori-based and other traditional FIM/ARM algorithms by converting them into distributed versions under the MapReduce [26] or Spark [78] environments. Examples include a parallel FP-Growth [64], a parallel randomized algorithm (PARMA) [67] for approximate association rule mining in MapReduce, the MapReduce-based H-mine algorithm [73], a parallel FIM algorithm with Spark (R-Apriori) [75], and PaMPa-HD [74]. Details of these algorithms are described below.
An adaptation of FP-Growth to MapReduce [26], called PFP, is presented in Ref 64. PFP is a parallel form of the classical FP-Growth; it splits a large-scale mining task into independent and parallel tasks. First, a parallel/distributed counting approach is used to compute the frequent items, and the datasets are randomly partitioned into several groups. In a single MapReduce round, the transactions in the dataset are used to generate group-dependent transactions. The PFP approach shows good performance with a near-linear speedup. Although PARMA [67] is not the first algorithm using MapReduce to solve the task of DFIM, it is the first randomized MapReduce algorithm for discovering approximate collections of frequent itemsets or association rules with near-linear speedup. PARMA is also the first algorithm combining random sampling and parallelization to mine frequent itemsets or association rules. As shown in Ref 68, Dist-Eclat is a MapReduce implementation of the well-known Eclat algorithm [57], and BigFIM is a hybrid approach exploiting both the Apriori and Eclat paradigms based on MapReduce [68]. Dist-Eclat focuses on speeding up mining performance, while BigFIM is optimized to run on very large datasets. Baralis et al. [69] presented a parallel disk-based approach, named P-Mine, to solve the task of DFIM on a multicore processor by improving the I/O performance with a prefetching strategy.
TABLE 1 | Algorithms for Distributed Frequent Itemset Mining

Name | Description | Year
PEAR [51] | Parallel efficient association rules | 1995
PPAR [51] | Parallel partition association rules | 1995
PDM [52] | Parallel mining of association rules | 1995
FDM [53] | Fast distributed mining for distributed databases | 1995
HPA [54] | Hash-partitioned Apriori | 1996
PCCD [55] | Partitioned candidate common database | 1996
DMA [56] | Mine association rules in distributed databases | 1996
CCPD [55] | Common candidate partitioned database | 1996
CD [46] | Count distribution | 1996
HPA-ELD [54] | HPA with extremely large itemset duplication | 1996
ParEclat [57] | Parallel Eclat | 1997
HD [58] | Hybrid distribution | 2000
CD tree projection [59] | Count distributed tree projection | 2001
DD [46] | Data distribution | 1996
IDD [58] | Intelligent data distribution | 2000
IDD tree projection [59] | Intelligent data distribution tree projection | 2001
DDDM [60] | Distributed dual decision miner, communication-efficient distributed mining of association rules | 2001
Fast distributed data mining [61] | Distributed mining of classification rules | 2002
ZIGZAG-based incremental approach [62] | Distributed and parallel incremental mining of frequent rules | 2004
Par-FP [63] | Parallel FP-growth with sampling | 2005
PFP [64] | An adaptation of FP-Growth to MapReduce | 2008
DPA [65] | Distributed parallel Apriori | 2010
DPC [66] | Dynamic passes combined-counting | 2012
FPC [66] | Fixed passes combined-counting | 2012
PARMA [67] | A parallel randomized algorithm for approximate association rule mining in MapReduce | 2012
BigFIM [68] | Frequent itemset mining for big data | 2013
Dist-Eclat [68] | Distributed Eclat based on MapReduce | 2013
P-Mine [69] | Parallel itemset mining on large datasets | 2013
RuleMR [70] | Classification rule discovery with MapReduce | 2014
YAFIM [71] | A parallel frequent itemset mining algorithm on Spark | 2014
DFIMA [72] | Apriori-like distributed frequent itemset mining algorithm | 2015
MRH-mine [73] | MapReduce-based H-mine algorithm | 2015
PaMPa-HD [74] | Parallel MapReduce-based frequent pattern miner for high-dimensional data | 2015
R-Apriori [75] | An efficient Apriori-based algorithm on Spark | 2015
FDMCN [76] | A fast and distributed mining algorithm for discovering frequent patterns in congested networks | 2016
PHIKS [77] | A highly scalable parallel algorithm, named parallel highly informative K-itemset, for maximally informative k-itemset mining | 2016
Recently, Qiu et al. [71] reported a speedup of nearly 18 times on average over various benchmarks for the yet another frequent itemset mining (YAFIM) algorithm based on Spark. Results obtained on real-world medical data show that YAFIM is much faster than the Hadoop-based algorithms. Kaul and Kashyap [75] then proposed the Reduced-Apriori (R-Apriori) algorithm, a parallel Apriori algorithm based on the Spark Resilient Distributed Dataset (RDD) framework. It adds an additional phase to YAFIM and speeds up the second round of generating promising candidate sets, in order to achieve higher performance than YAFIM.
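For readers who want to see what driving a distributed FIM job on Spark looks like in practice, the hedged sketch below uses the FP-growth implementation shipped with recent Spark releases (pyspark.ml.fpm.FPGrowth); it only illustrates the programming model, and is not the YAFIM or R-Apriori code.

```python
# Hedged sketch: expressing a distributed FIM task on Spark through the
# built-in FP-Growth estimator (pyspark.ml.fpm.FPGrowth in recent releases).
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("distributed-fim-sketch").getOrCreate()

# Toy transactional database D; each row is one transaction (a list of items).
transactions = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"]), (3, ["b", "c"])],
    ["tid", "items"],
)

# minSupport corresponds to the minsup threshold in the problem definition.
fp = FPGrowth(itemsCol="items", minSupport=0.5)
model = fp.fit(transactions)     # mining runs in parallel across executors
model.freqItemsets.show()        # all itemsets X with sup(X) >= minsup
spark.stop()
```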
According to these studies, implementations based on Spark are generally more efficient than those based on the Hadoop model. Even so, the performance of the above approaches might not be satisfactory, owing to the bottleneck of iterative computation when handling large-scale datasets. Therefore, a distributed algorithm for frequent itemset mining (DFIMA) was proposed to improve and speed up the FIM process [72]. Some other distributed and highly scalable parallel mining approaches have also been developed in recent years, such as FDMCN (a fast and distributed mining algorithm for discovering frequent patterns in congested networks) [76] and PHIKS (parallel highly informative K-itemset) [77]. Different from the general itemset mining problem, Salah et al. [77] studied the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy, and proposed PHIKS, a highly scalable and parallel miki mining algorithm. For the classification application, Cho and Wüthrich [61] introduced a model for fast distributed mining of classification rules, and the MapReduce-based RuleMR [70] was developed further.
Distributed Sequential Pattern Mining
Different from FIM, SPM discovers frequent subsequences as the interesting patterns in a sequence database, which embeds the timestamps of events. The itemset mining model was extended to handle sequences by Srikant and Agrawal [4]. A sequence database $SDB = \{S_1, S_2, \ldots, S_n\}$ is a set of tuples $(sid, S)$, where $sid$ is a sequence identifier and $S_k$ is an input sequence. A sequence $S_\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ is called a subsequence of another sequence $S_\beta = (\beta_1, \beta_2, \ldots, \beta_m)$ ($n \le m$), and $S_\beta$ is called a super-sequence of $S_\alpha$, if there exist integers $1 \le i_1 < \cdots < i_n \le m$ such that $\alpha_1 \subseteq \beta_{i_1}, \ldots, \alpha_n \subseteq \beta_{i_n}$; this is denoted as $S_\alpha \sqsubseteq S_\beta$. A tuple $(sid, S)$ is said to contain a sequence $S_\alpha$ if $S$ is a super-sequence of $S_\alpha$. The support of a sequence $S_\alpha$ in a sequence database $SDB$, denoted $sup(S_\alpha)$, is the number of tuples in $SDB$ that contain $S_\alpha$. The sequential pattern mining problem was first introduced by Srikant and Agrawal [4] and can be formulated as follows: given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minsup threshold, sequential pattern mining is to find all of the frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is not less than minsup [4].
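The following short Python sketch (a toy illustration of the definitions above, not part of any surveyed algorithm) checks sequence containment with an ordered subset match and counts the support of a pattern in a three-sequence database.

```python
# Toy illustration of the subsequence/support definitions above.
# Each sequence is a list of itemsets (frozensets); S_a is a subsequence of S_b
# if its itemsets can be matched, in order, to a subset-chain inside S_b.
def is_subsequence(S_a, S_b):
    j = 0
    for element in S_b:
        if j < len(S_a) and S_a[j] <= element:
            j += 1
    return j == len(S_a)

def support(S_a, SDB):
    return sum(1 for S in SDB if is_subsequence(S_a, S))

SDB = [
    [frozenset("a"), frozenset("bc"), frozenset("d")],
    [frozenset("ab"), frozenset("c")],
    [frozenset("a"), frozenset("c"), frozenset("d")],
]
pattern = [frozenset("a"), frozenset("c")]
print(support(pattern, SDB))   # 3: every sequence contains <(a)(c)>
```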
Many well-known algorithms for sequential pattern mining have been proposed, such as AprioriAll [1], generalized sequential patterns (GSP) [4], BI-Directional Extension (BIDE) [79], CloSpan [80], Frequent pattern-projected Sequential pattern mining (FreeSpan) [81], Prefix-projected Sequential pattern mining (PrefixSpan) [82], and Sequential PAttern Discovery using Equivalence classes (SPADE) [83]. It has been shown that SPM has broad applications in real-world situations. Among these algorithms, AprioriAll [1] and GSP [4] are the fundamental Apriori-based algorithms, which mine the sequential patterns in a levelwise manner.
Up to now, many researchers have provided different techniques to work in distributed environments, such as grid computing [32,33], the cloud [36], and Hadoop (http://hadoop.apache.org), or to distribute the mining computation over more than one node for mining sequential patterns. As shown in Table 2, some distributed and parallel methods for SPM are described below.
In 1998, Shintani and Kitsuregawa partitioned the input sequences in nonpartitioned sequential pattern mining (NPSPM), yet they assumed that the entire candidate set can be replicated and fit into the overall memory (random access memory and hard drive) of a process [84]. Similar assumptions were made in EVEnt distribution (EVE) [85], EVEnt and CANdidate distribution (EVECAN) [85], and the data parallel formulation (DPF) [88]. In addition, a hash function was used in the hash partitioned sequential pattern mining (HPSPM) algorithm to assign input and candidate sequences to specific processes [84]. Input partitioning is, however, not inherently necessary for shared-memory or MapReduce distributed systems. In shared-memory systems, the input data, i.e., the sequences, should fit in the aggregated system memory and is available to be read by all processes. Thus, Zaki extended the efficient SPADE algorithm to the shared-memory parallel architecture, called pSPADE [86]. In the pSPADE framework, the input data is assumed to reside on shared hard drive space and to be stored in the vertical database format.

In order to balance the mining tasks, Cong et al. designed several models, Par-FP [63], Par-ASP [63], and Par-CSP [89], to accomplish the task. They use a sampling technique that requires the entire input set to be available at each process. In addition, the 2PDF-Index [87], 2PDF-Compression [87], and DFSP [94] algorithms were proposed and applied to scalable mining of sequential patterns from biological sequences. After that, some distributed and parallel mining methods, such as the MapReduce-based distributed GSP (DGSP) and large-scale frequent sequence mining (MG-FSM), were proposed by extending traditional SPM algorithms.
TABLE 2 | Algorithms for Distributed Sequential Pattern Mining

Name | Description | Year
HPSPM [84] | Hash partitioned sequential pattern mining | 1998
NPSPM [84] | Nonpartitioned sequential pattern mining | 1998
EVE [85] | EVEnt distribution | 1999
EVECAN [85] | EVEnt and CANdidate distribution | 1999
pSPADE [86] | Parallel SPADE | 2001
2PDF-Index and 2PDF-Compression [87] | Scalable sequential pattern mining for biological sequences | 2004
DPF [88] | Data parallel formulation | 2004
Par-ASP [63] | Parallel PrefixSpan with sampling | 2005
Par-CSP [89] | Parallel CloSpan with sampling | 2005
DGSP [90] | Distributed GSP | 2008
PLUTE [91] | Parallel sequential patterns mining | 2010
MG-FSM [92] | Large-scale frequent sequence mining | 2013
ACME [93] | Advanced parallel motif extractor | 2013
DFSP [94] | A depth-first SPelling algorithm for sequential pattern mining of biological sequences | 2013
An iterative MapReduce framework [95] | Manages data uncertainty in SPM and executes the uncertain SPM algorithm in parallel | 2015
LASH [96] | LArge-scale Sequence mining with Hierarchies | 2015
Distributed DP [97] | A memory-efficient distributed DP approach that uses an extended prefix-tree to save intermediate results | 2016
With motifs, uncertain sequences, and hierarchies in mind, the advanced parallel motif extractor (ACME) [93], an iterative MapReduce framework [95], and the LASH [96] algorithm were also proposed for large-scale distributed sequence mining. Other algorithms for distributed sequential pattern mining, such as the memory-efficient distributed DP approach [97], are still under development. As mentioned before, the problem of sequential pattern mining is more complicated than frequent itemset mining or ARM; thus, fewer DFSM approaches have been proposed than DFIM approaches. With the rapid development of SPM techniques and of the latest distributed platforms and tools, state-of-the-art research on distributed sequential pattern mining continues to evolve. Generally speaking, DFSM is a considerable research topic in the fields of DM and big data analytics.
Distributed Frequent Graph Mining
In this section, we discuss another DDM task, DFGM. Different from FIM and SPM, the graph is a ubiquitous and essential data representation for modeling real-world objects and their relationships [16]. Today, large amounts of graph data are generated by various applications, including social networks, biological networks, the WWW, and so on. Different from other general data structures, e.g., itemsets and sequences, the labeled graph structure is much more complicated and can be used as a model for discovering substructure patterns among data. Frequent graph mining (FGM) problems therefore take an input graph G where vertices and edges are labeled; vertices and edges have unique ids, and their labels are arbitrary, domain-specific attributes that can be null [16].

In 2003, Yan and Han developed the first pattern-growth FGM method, named graph-based substructure pattern mining (gSpan) [17]. It avoids duplicates by only expanding subtrees that lie on the rightmost path in the depth-first traversal. With the overwhelming amount of information encoded in these graphs, there is a crucial need for efficient tools to quickly explore large graphs and return concise patterns that can be easily understood. Distributed data processing platforms, such as MapReduce [98], Pregel [99], GraphLab [100], and GraphX [101], have substantially simplified the design and deployment of distributed graph analytics algorithms. In particular, these platforms achieve good performance on distributed graph mining problems. Because a pattern is an arbitrary graph, finding frequent subgraphs in a labeled graph is an important topic in graph mining. Up to now, successful algorithms for FGM have been related to those designed for FIM. In this section, we provide a brief overview of some key distributed methods for DFGM and then discuss each of them in detail. As shown in Table 3, the current methods for DFGM are summarized below.
A pattern-growth method called Molecular Fragment miner (MoFa) was introduced by Borgelt et al.; it can mine both molecular substructures and general frequent subgraphs. With a dynamic load balancing strategy, Fatta and Berthold proposed the distributed MoFa with dynamic load balancing (d-MoFa) algorithm [39]. By extending the well-known gSpan algorithm, a parallel gSpan algorithm named p-gSpan was also proposed [102]. Wang and Parthasarathy then designed a toolkit to mine motif patterns, named MotifMiner [103]. Based on the MapReduce distributed data processing platform, researchers have contributed great efforts to DFGM, such as the MRPF [104] algorithm for MapReduce-based subgraph pattern finding and MRFSE [106] for MapReduce-based frequent subgraph extraction. In real-world situations, however, natural graphs have commonly been found to have highly skewed power-law degree distributions, which challenge the assumptions made by previous approaches. Thus, Gonzalez et al. introduced a new approach, PowerGraph, to distributed graph placement and representation that exploits the structure of power-law graphs [105]. In addition, a two-step filter-and-refinement MapReduce framework for frequent subgraph mining was presented in Ref 107.
presented in Ref 107. In recent years, several distrib-
uted graph mining and analytics systems have been
proposed, including GraphX,
101
GridGraph,
108
UNICORN,
109
Arabesque,
110
and DistGraph,
111
and
so forth. The GraphX aims at processing graphs in a
distributed dataow framework, an integrated graph
and collections Application Programming Interface
(API) which is sufcient to express existing graph
abstractions and enable a much wider range of com-
putation.
101
With the development of Grid technol-
ogy, GridGraph is a large-scale graph processing
system on a single machine using 2-level hierarchical
partitioning.
108
As an open source version of
Bigtable,
44
UNICORN exploits the random write
characteristic of HBASE (http://hbase.apache.org/) to
improve the performance of generalized iterative
matrixvector multiplication.
109
Arabesque,
110
the
rst distributed data processing platform for imple-
menting graph mining algorithms, automates the
process of exploring a very large number of sub-
graphs, and it denes a high-level lter-process com-
putational model. Recently, the DistGraph
111
was
proposed as the rst distributed method to mine a
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 9of19
massive input graph that is too large to t in the
memory of any individual compute node.
Distributed Clustering
Successful clustering algorithms have also been adapted to the distributed environment, and distributed clustering (DC) [112] has thus become an important research topic within clustering. In this section, we provide a brief overview of some key methods for DC; Table 4 lists and summarizes the distributed methods.
Clustering techniques can be classified into two main categories: single-machine and multiple-machine clustering techniques. The latter, DC [112], is related to distributed and parallel systems, and most DC methods were designed based on MapReduce. In 2004, the hybrid energy-efficient distributed clustering (HEED) algorithm [113] was introduced by Younis et al. Zhou et al. then presented an EM-based framework for distributed data stream clustering [114]. For distributed data clustering, a comparative analysis system with three approaches, respectively named Improved Distributed Combining Algorithm (IDCA), Distributed K-Means (DKMA), and traditional Centralized Clustering Algorithm (CCA), was proposed in Ref 115. Based on MapReduce, an efficient parallel K-means clustering (PKMeans) was proposed by directly extending the traditional K-means algorithm [116], and optimized K-means clustering algorithms using MapReduce were proposed later [123]. Bahmani et al. also proposed k-means||, an efficient parallel version of the inherently sequential K-means++ [120]. The MapReduce K-means++ method replaces the iterations among multiple machines with a single machine, which can significantly reduce the communication and I/O costs. The above K-means-based approaches are designed to return exact results; it is, however, not an easy task to quickly find exact results from big data. Therefore, an efficient approximate approach, K-means++ approximation with MapReduce, was introduced in Ref 124. It drastically reduces the number of MapReduce jobs by using only one MapReduce job to obtain the k centers. At the same time, Han and Luo proposed a fast K-means method using a statistical bootstrap [122].
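To illustrate the MapReduce pattern that PKMeans-style methods repeat once per job, the hedged sketch below runs a single K-means iteration with Python's multiprocessing standing in for a cluster: the map step assigns each local chunk of points to its nearest center and emits partial sums, and the reduce step merges them into new centers. The data, center values, and function names are illustrative only.

```python
# Minimal sketch of one MapReduce-style K-means iteration (plain Python).
from multiprocessing import Pool
import math

centers = [(0.0, 0.0), (10.0, 10.0)]
chunks = [
    [(0.5, 1.0), (1.0, 0.0), (9.0, 9.5)],
    [(10.5, 9.0), (0.2, 0.3), (9.8, 10.2)],
]

def assign(chunk):
    # Map step: center_index -> (sum_x, sum_y, count) for the local chunk.
    partial = {}
    for x, y in chunk:
        idx = min(range(len(centers)),
                  key=lambda i: math.dist((x, y), centers[i]))
        sx, sy, n = partial.get(idx, (0.0, 0.0, 0))
        partial[idx] = (sx + x, sy + y, n + 1)
    return partial

if __name__ == "__main__":
    with Pool(2) as pool:
        partials = pool.map(assign, chunks)
    # Reduce step: merge partial sums and recompute centers.
    merged = {}
    for p in partials:
        for idx, (sx, sy, n) in p.items():
            msx, msy, mn = merged.get(idx, (0.0, 0.0, 0))
            merged[idx] = (msx + sx, msy + sy, mn + n)
    new_centers = [(sx / n, sy / n)
                   for sx, sy, n in (merged[i] for i in sorted(merged))]
    print(new_centers)   # updated centers after one iteration
```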
With sensor network applications in mind, some DC methods have been proposed, such as a generic algorithm for distributed data clustering in sensor networks [119] and the novel DKM algorithm for clustering observations collected by spatially distributed resource-aware sensors [118].
TABLE 3 | Algorithms for Distributed Frequent Graph Mining

Name | Description | Year
p-MoFa [102] | Parallel MoFa | 2006
p-gSpan [102] | Parallel gSpan | 2006
d-MoFa [39] | Distributed MoFa with dynamic load balancing | 2006
MotifMiner [103] | MotifMiner toolkit | 2004
MRPF [104] | MapReduce-based pattern finding | 2009
Pregel [99] | A system for large-scale graph processing | 2010
PowerGraph [105] | Distributed graph-parallel computation on natural graphs | 2012
MRFSE [106] | MapReduce-based frequent subgraph extraction | 2013
Filter-and-refinement [107] | A two-step filter-and-refinement MapReduce framework for frequent subgraph mining | 2014
GraphX [101] | A distributed dataflow framework | 2014
GridGraph [108] | Large-scale graph processing using hierarchical partitioning | 2015
UNICORN [109] | A graph mining library on top of HBase | 2015
Arabesque [110] | A system for distributed graph mining | 2015
DistGraph [111] | A distributed approach for graph mining in massive networks | 2016
Recently, two K-means-based models, distributed PCA and K-means [121] and KPCA + K-means clustering [125], were developed based on the PCA [127] and kernel PCA [128] concepts, respectively. Mashayekhi et al. proposed GDCluster, a general fully decentralized clustering method, which is capable of clustering dynamic and distributed datasets [126]. In GDCluster, nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the dataset. Other approaches for DC are still in progress.
Privacy Preserving of Distributed Data Mining
Before reviewing current work on privacy-preserving DM in distributed environments (PPDDM), we first stress the significance of and motivation for this research topic. With the rapid development of networks, communications, and computer technology, privacy-preserving data mining (PPDM) has become an increasingly important topic in DM [129]. Especially in distributed environments, how to protect data privacy while performing DM tasks over a large amount of distributed data is more challenging and interesting. PPDM has emerged as an important topic in DM, and many related works have been extensively studied, such as PPDM of association rules and frequent itemsets, PPDM of sequential patterns, and PPDM of graphs [129]. In particular, several papers have addressed the privacy issues in mining association rules and frequent itemsets from distributed data. In the literature, Clifton et al. first raised the issue of PPDDM of association rules and frequent itemsets [129]. A brief overview of PPDDM is shown in Table 5.
In 2004, Kantarcioglu and Clifton proposed PPDM for association rules in horizontally distributed databases that uses Yao's generic secure-computation protocol as a subprotocol. They also designed several methods that incorporate cryptographic techniques to minimize the information shared while adding little overhead to the mining task [130].
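As a hedged sketch of the kind of cryptographic building block such protocols rely on, the code below implements a basic secure-sum round in Python: each party adds its private local support count to a randomly masked running total, so the global count for an itemset is recovered without any single count being revealed. This is only the textbook secure-sum idea simulated in one process, not the actual Kantarcioglu-Clifton protocol, and the counts are made up.

```python
# Hedged sketch of the secure-sum building block used in PPDM over
# horizontally partitioned data.
import random

MODULUS = 2**32                    # all arithmetic is done modulo a large value
local_supports = [120, 45, 80]     # each party's private support count for itemset X

def secure_sum(values, modulus=MODULUS):
    r = random.randrange(modulus)  # initiator's random mask
    running = r
    for v in values:               # each party adds its value to the masked total
        running = (running + v) % modulus
    return (running - r) % modulus # initiator removes the mask

global_support = secure_sum(local_supports)
print(global_support)              # 245: the global count, without revealing 120/45/80
```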
TABLE 4 | Algorithms for Distributed Clustering

Name | Description | Year
HEED [113] | Hybrid energy-efficient distributed clustering | 2004
EM-based framework [114] | Distributed data stream clustering | 2007
IDCA, DKMA, CCA [115] | Distributed data clustering: a comparative analysis system | 2009
PKMeans [116] | Parallel K-means clustering based on MapReduce | 2009
CloudClustering [117] | Toward an iterative data processing pattern on the cloud | 2011
Novel DKM [118] | A distributed algorithm for clustering observations collected by spatially distributed resource-aware sensors | 2011
A generic algorithm [119] | Distributed data clustering in sensor networks | 2011
K-Means++ [120] | An efficient parallel version (k-means||) of the inherently sequential K-means++ | 2012
Distributed PCA and K-Means [121] | Distributed PCA and K-means clustering | 2013
Bootstrapping K-means [122] | A fast K-means method using a statistical bootstrap | 2014
Optimize K-means [123] | Optimized K-means clustering algorithm using MapReduce | 2014
MapReduce K-means++ [124] | Efficient K-means++ approximation with MapReduce | 2014
KPCA + K-means clustering [125] | A communication-efficient algorithm to perform kernel PCA in the distributed setting | 2015
GDCluster [126] | A general distributed clustering algorithm | 2015
Luo et al. then proposed the GridDMM algorithm [131] for distributed mining of maximal frequent itemsets on a data grid system. In Ref 132, two algorithms for both vertically and horizontally partitioned data with cryptographically strong privacy were introduced. In addition, hybrid CF-based referrals with decent accuracy on cross distributed data (CDD) were presented in Ref 134. Privacy preservation in distributed systems has also been pursued in several other directions, such as multiparty privacy-preserving DDM [133] and privacy-preserving SOM-based recommendations on horizontally distributed data [135], among others.
In Ref 136, the researchers proposed the Multilevel Trust (MLT)-PPDM model to expand the scope of perturbation-based PPDM to multilevel trust. In order to reduce the disjunctive operations, Chun et al. developed the PPDNF approach for privacy-preserving disjunctive normal form operations on distributed sets [137]. Tassa then proposed a protocol for secure mining of association rules in horizontally distributed databases that improves significantly upon the leading protocol in terms of privacy and efficiency [139]. Different from previous PPDDM approaches, the first algorithm for privacy-preserving sub-feature selection in DDM was introduced by Bhuyan and Kamila [140]; it focuses on the issue of sub-feature selection instead of the traditional patterns (itemsets, sequences, graphs, trees, etc.). To address the visualization problem of PPDDM, a novel technique called DPcode [141] was recently proposed for privacy-preserving frequent visual pattern publication on the cloud. Furthermore, some reviews of privacy-preserving computing on distributed data have been published [142–145].
TABLE 5 | Algorithms for Privacy Preserving of Distributed Data Mining (PPDDM)

Name | Description | Year
Toolkit [129] | Tools for privacy-preserving distributed mining | 2002
Secure mining [130] | PPDM for association rules in horizontally distributed databases | 2004
GridDMM [131] | Distributed mining of maximal frequent itemsets on a data grid system | 2006
Two algorithms for vertically partitioned data [132] | Algorithms for both vertically and horizontally partitioned data, with cryptographically strong privacy | 2007
Multiparty PPDM [133] | A game-theoretic approach for PPDDM | 2007
PPCF on CDD [134] | Hybrid CF-based referrals with decent accuracy on cross distributed data (CDD) | 2012
SOM-based recommendation [135] | A privacy-preserving scheme to provide recommendations on horizontally partitioned data among multiple parties | 2012
MLT-PPDM [136] | Relaxes the single-level-trust assumption and expands the scope of perturbation-based PPDM to multilevel trust | 2012
PPDNF [137] | Privacy-preserving disjunctive normal form operations on distributed sets | 2013
Privacy-preserving two-party distributed mining [138] | Privacy-preserving two-party distributed association rule mining on horizontally partitioned data | 2013
Secure mining [139] | Secure mining of association rules in horizontally distributed databases | 2014
Sub-feature selection [140] | Privacy-preserving sub-feature selection in distributed data mining | 2015
DPcode [141] | Privacy-preserving frequent visual pattern publication on cloud | 2016
OPPORTUNITY FOR DISTRIBUTED DATA MINING
Undoubtedly, the world is shrinking into a small village owing to the tangible influence of networks and various types of distributed systems, such as online social network systems [37], P2P systems [34], ad hoc networks [35], and cloud computing systems [36]. They connect people from different parts of the world by sharing data, services, and media streams. Many researchers have proposed various DDM techniques based on different domain requirements and applications, such as DFIM, DFSM, DFGM, DC, and PPDDM. As mentioned before, the Challenges in Distributed Data Mining section provides an up-to-date view of the challenges for DDM. DDM has to deal with complex distributed systems, but it also reveals many opportunities. We next highlight some important research opportunities.
1. Developing more efficient algorithms. DDM is computationally expensive in terms of computational cost and memory usage for making resources accessible (e.g., limited CPU computation and I/O calls across the distributed architecture). In order to achieve high performance, some distributed/parallel DM platforms and tools have been developed in recent years, such as MapReduce [26] and Spark [78]. These developments provide the necessary theoretical and technical support for DDM. Although currently developed algorithms are efficient, there is still room for improvement when handling large-scale data.

2. Heterogeneity. Relational or nonrelational database systems often utilize a single schema, or their files have a homogeneous format. In the big data era, a large amount of heterogeneous distributed data must be processed. Traditional DM techniques are designed to discover useful knowledge in structured data, while heterogeneity is an inherent property of distributed data. Thus, it is a major challenge and opportunity for DM, particularly for DDM, to discover the useful knowledge embedded in unstructured and/or semistructured data.

3. Different types of mining patterns. Besides FIM, ARM, sequential pattern mining, and graph mining, several other pattern mining problems have been studied, e.g., sequential rule mining [14,15], high-utility pattern mining [18–20], weight-based pattern mining [21,22], and other interesting pattern mining [23,24]. Research on these problems inspires distributed pattern mining; thus, many research opportunities in DDM can be further explored.

4. A wide range of applications in various domains. Based on specific applications, many possibilities for further research on DDM can be extensively studied. How to utilize DFIM, DFSM, DFGM, DC, and PPDDM in new or existing applications is an interesting issue. We expect more research topics on DDM in the near future.

5. Security. Undoubtedly, the information resources that are made available and maintained in distributed systems have a high intrinsic value to their users [44,45]. Therefore, security is an important topic in DDM. When analyzing big datasets, security and privacy issues are emerging topics. Several PPDDM approaches have been mentioned and discussed in this study. However, how to improve the applicability and flexibility of PPDDM is still a major challenge, and many opportunities can be extended and studied.
CONCLUSION
Typically, DM algorithms aim to discover the desired patterns (i.e., frequent itemsets, sequential patterns, graphs, etc.) or to perform clustering, classification, outlier detection, and so on. In general, the collected data and the applications that analyze them are distributed in nature. Due to the problems and challenges faced by traditional DM algorithms when processing distributed data, DM on distributed computing environments has emerged as an important research topic. However, few studies have summarized the related developments across the various types of DM in distributed systems or provided a general taxonomy of them.

In this study, we therefore introduce the definitions, general architecture, and several important features of a distributed system, and then point out the challenges of DM tasks in distributed environments. The main contribution is that we investigate recent advances in distributed DM and provide state-of-the-art details, including DFIM, DFSM, DFGM, DC, and PPDDM. For future research, some opportunities for DM tasks in a distributed environment can be reasonably considered and further developed: (1) DM of multisource, multimodal, and heterogeneous data; (2) new types of pattern representation or knowledge representation in DDM; (3) visualization techniques for DDM; and (4) security issues and quality of service of DDM in the big data era.
ACKNOWLEDGMENTS
This research was partially supported by the National Natural Science Foundation of China (NSFC) under grant no. 61503092, by the Research on the Technical Platform of Rural Cultural Tourism Planning Basing on Digital Media project under grant 2017A020220011, and by the Tencent Project under grant CCF-Tencent IAGR20160115.
REFERENCES
1. Agrawal R, Srikant R. Mining sequential patterns. In: Proceedings of the International Conference on Data Engineering, Taipei, Taiwan, 1995, 3–14.
2. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 1993, 207–216.
3. Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 2004, 8(1):53–87.
4. Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, Avignon, France, 1996, 3–17.
5. Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Hsu MC. Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 2004, 16(11):1424–1440.
6. Berkhin P. A survey of clustering data mining techniques. In: Grouping Multidimensional Data. Berlin Heidelberg: Springer; 2006, 25–71.
7. Jarvis RA, Patrick EA. Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput 1973, 100(11):1025–1034.
8. Kotsiantis SB. Supervised machine learning: a review of classification techniques. Informatica 2007, 31(3):249–269.
9. Quinlan JR. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 1993.
10. Lee W, Stolfo S, Mok K. Adaptive intrusion detection: a data mining approach. Artif Intell Rev 2000, 14(6):533–567.
11. Vo B, Le T, Hong TP, Le B. Fast updated frequent-itemset lattice for transaction deletion. Data Knowl Eng 2015, 96:78–89.
12. Chen MS, Han J, Yu PS. Data mining: an overview from a database perspective. IEEE Trans Knowl Data Eng 1996, 8(6):866–883.
13. Vo B, Hong TP, Le B. A lattice-based approach for mining most generalization association rules. Knowl-Based Syst 2013, 45:20–30.
14. Fournier-Viger P, Nkambou R, Tseng VS. RuleGrowth: mining sequential rules common to several sequences by pattern-growth. In: Proceedings of the ACM Symposium on Applied Computing, Taichung, Taiwan, 2011, 956–961.
15. Fournier-Viger P, Faghihi U, Nkambou R, Nguifo EM. CMRules: mining sequential rules common to several sequences. Knowl-Based Syst 2012, 25:63–76.
16. Kuramochi M, Karypis G. Frequent subgraph discovery. In: Proceedings of the IEEE International Conference on Data Mining, San Jose, California, USA, 2001, 313–320.
17. Yan X, Han J. gSpan: graph-based substructure pattern mining. In: Proceedings of the IEEE International Conference on Data Mining, Melbourne, Florida, USA, 2003, 721–724.
18. Lin JCW, Gan W, Hong TP, Zhang B. An incremental high-utility mining algorithm with transaction insertion. Scientific World J 2015, Article ID 161564.
19. Tseng VS, Wu CW, Shie BE, Yu PS. UP-Growth: an efficient algorithm for high utility itemset mining. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2010, 253–262.
20. Yao H, Hamilton HJ, Butz CJ. A foundational approach to mining itemset utilities from databases. In: Proceedings of the SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, 2004, 211–225.
21. Lin JCW, Gan W, Fournier-Viger P, Hong TP. RWFIM: recent weighted-frequent itemsets mining. Eng Appl Artif Intel 2015, 45:18–32.
22. Vo B, Coenen F, Le B. A new method for mining frequent weighted itemsets based on wit-trees. Expert Syst Appl 2013, 40(4):1256–1264.
23. Geng L, Hamilton HJ. Interestingness measures for data mining: a survey. ACM Comput Surv 2006, 38(3):1–32.
24. Hong TP, Wu YY, Wang SL. An effective mining approach for up-to-date patterns. Expert Syst Appl 2009, 36(6):9747–9752.
Overview wires.wiley.com/dmkd
14 of 19 © 2017 Wi l ey Peri o d i c a ls, In c . Volum e 7 , N o vembe r / D e c ember 2 0 1 7
25. Agrawal R, Srikant R. Fast algorithms for mining
association rules in large databases. In: Proceedings
of the International Conference on Very Large Data
Bases, Santiago de Chile, Chile, 1994, 487499.
26. Dean J, Ghemawat S. MapReduce: a exible data
processing tool. Commun ACM 2010, 53(1):7277.
27. Jiang Y. A survey of task allocation and load balan-
cing in distributed systems. IEEE Trans Parallel Dis-
trib Syst 2016, 27(2):585599.
28. Park B, Kargupta H, Johnson E, Sanseverino E,
Hershberger D, Silvestre L. Distributed, collaborative
data analysis from heterogeneous sites using a scala-
ble evolutionary technique. Appl Intell 2002, 16
(1):1942.
29. R Riesen, R Brightwell, and AB Maccabe, Differences
between distributed and parallel systems, SAND98-
2221, Unlimited Release, 1998. Available at: http://
www.cs.sandia.gov/rbbrigh/papers/distpar.pdf
30. Steen M, Pierre G, Voulgaris S. Challenges in very
large distributed systems. J Internet Serv Appl 2012,
3(1):5966.
31. Xu L, Huang Z, Jiang H, Tian L, Swanson D. VSFS:
a searchable distributed le system. In: Proceedings of
the IEEE Parallel Data Storage Workshop, New
Orleans, Louisiana, 2014, 2530.
32. Liu J, Jin X, Wang Y. Agent-based load balancing on
homogeneous minigrids: macroscopic modeling and
characterization. IEEE Trans Parallel Distrib Syst
2005, 16(7):586598.
33. Luo P, Lü K, Shi Z, He Q. Distributed data mining in
grid computing environments. Future Gener Comput
Syst 2007, 23(1):8491.
34. Rao W, Chen L, Fu AWC, Wang G. Optimal
resource placement in structured peer-to-peer net-
works. IEEE Trans Parallel Distrib Syst 2010, 21
(7):10111026.
35. Xue Y, Li B, Nahrstedt K. Optimal resource alloca-
tion in wireless ad hoc networks: a price-based
approach. IEEE Trans Mobile Comput 2006, 5
(4):347364.
36. Gkatzikis L, Koutsopoulos I. Migrate or not? Exploit-
ing dynamic task migration in mobile cloud comput-
ing systems. IEEE Wireless Commun 2013, 20
(7):2432.
37. Jiang Y, Jiang JC. Understanding social networks
from a multiagent perspective. IEEE Trans Parallel
Distrib Syst 2014, 25(10):27432759.
38. Cieslak DA, Thain D, Chawla NV. Troubleshooting
distributed systems via data mining. In: Proceedings
of the IEEE Int Symp High Perform Distrib Comput,
Paris, France, 2006:309312.
39. Fatta GD, Berthold MR. Dynamic load balancing for
the distributed mining. IEEE Trans Parallel Distrib
Syst 2006, 17(8):773785.
40. Silva JCD, Giannella C, Bhargava R, Kargupta H,
Klusch M. Distributed data mining and agents. Eng
Appl Artif Intel 2005, 18(7):791807.
41. Zeng L, Li L, Duan L, Lu K, Shi Z, Wang M, Wu W,
Luo P. Distributed data mining: a survey. Inf Technol
Manage 2012, 13(4):403409.
42. Tsoumakas G, Vlahavas I. Distributed data mining.
In: Encyclopedia of Data Warehousing and Mining,
IGI Global, Hershey, PA, USA, 2009, 709715.
43. SM Thampi, Survey on distributed data mining in
p2p networks, 2012. arXiv preprint
arXiv:1205.3231.
44. Chang F, Dean J, Ghemawat S, Hsieh WC. Bigtable:
a distributed storage system for structured data.
ACM Trans Comput Syst 2008, 26(2):4.
45. Tanenbaum AS, Steen MV. Distributed Systems: Prin-
ciples and Paradigms. Upper Saddle River, NJ: Pren-
tice-Hall, Inc.; 2006.
46. Agrawal R, Shafer JC. Parallel Mining of Association
Rules. IEEE Trans Knowl Data Eng 1996, 8
(6):962969.
47. Khan N, Yaqoob I, Hashem IA, Inayat Z, Ali WK,
Alam M, Shiraz M, Gani A. Big data: survey, technol-
ogies, opportunities, and challenges. Scientic World
J2014, 2014: Article ID 712826.
48. Wu X, Zhu X, Wu GQ, Ding W. Data mining with
big data. IEEE Trans Knowl Data Eng 2014, 26
(1):97107.
49. Li F, Ooi BC, Özsu MT, Wu S. Distributed data man-
agement using MapReduce. ACM Comput Surv
2014, 46(3): 31.
50. Yang Q, Wu X. 10 challenging problems in data min-
ing research. Int J Inf Technol Decision Making
2006, 5(4):597604.
51. A Mueller, Fast sequential and parallel algorithms for
association rule mining: a comparison. Technical
Report, University of Maryland at College
Park, 1995.
52. Park JS, Chen MS, Yu PS. Efcient parallel data min-
ing for association rules. In: Proceedings of the
ACM Int Conf Inf Knowl Manage, Baltimore, MD,
USA, 1995:3136.
53. Cheung DW, Han J, Ng VT, Fu AW, Fu Y. A fast
distributed algorithm for mining association rules. In:
Proceedings of the Int Conf Parallel Distrib Inf Syst,
Miami Beach, Florida, USA, 1996:3142.
54. Shintani T, Kitsuregawa M. Hash-based parallel algo-
rithms for mining association rules. In: Proceedings of
the Int Conf Parallel Distrib Inf Syst, Miami Beach,
Florida, USA, 1996:1930.
55. Zaki MJ, Ogihara M, Parthasarathy S, Li W. Parallel
data mining for association rules on shared-memory
multi-processors. In: Proceedings of the ACM/IEEE
Conf Supercomput, Pittsburgh, PA, USA,
1996:4343.
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 15 of 19
56. Cheung DW, Ng VT, Fu AW, Fu Y. Efcient mining
of association rules in distributed databases.
IEEE Trans Knowl Data Eng 1996, 8(6):911922.
57. Zaki MJ, Parthasarathy S, Ogihara M, Li W. Parallel
algorithms for discovery of association rules. Data
Min Knowl Discov 1997, 1(4):343373.
58. Han EH, Karypis G, Kumar V. Scalable parallel data
mining for association rules. IEEE Trans Knowl Data
Eng 2000, 12(3):337352.
59. Agarwal RC, Aggarwal CC, Prasad VVV. A tree pro-
jection algorithm for generation of frequent item sets.
J Parallel Distrib Comput 2001, 61(3):350371.
60. Schuster A, Wolff R. Communication-efcient distrib-
uted mining of association rules. ACM SIGMOD
Record 2001, 30(2):473484.
61. Cho V, Wüthrich B. Distributed mining of classica-
tion rules. Knowl Inf Syst 2002, 4(1):130.
62. Otey M, Parthasarathy S, Wang C, Veloso A,
Meira W. Parallel and distributed methods for incre-
mental frequent itemset mining. IEEE Trans Syst
Man Cybern B Cybern 2004, 34(6):24392450.
63. Cong S, Han J, Hoeinger J, Padua D. A sampling-
based framework for parallel data mining. In: Pro-
ceedings of the ACM SIGPLAN Symposium on Prin-
ciples and Practice of Parallel Programming,
Chicago, Illinois, USA, 2005, 255265.
64. Li H, Wang Y, Zhang D, Zhang M, Chang EY. PFP:
parallel FP-growth for query recommendation. In:
Proceedings of the ACM Conf Recommender Syst,
Lousanne, Switzerland, 2008:107114.
65. Yu KM, Zhou J, Hong TP, Zhou JL. A load-balanced
distributed parallel mining algorithm. Expert Syst
Appl 2010, 37(3):24592464.
66. Lin MY, Lee PY, Hsueh SC. Apriori-based frequent
itemset mining algorithms on MapReduce. In: Pro-
ceedings of the 6th ACM International Conference on
Ubiquitous Information Management and Communi-
cation, Kuala Lumpur, Malaysia, 2012, P. 76.
67. Riondato M, DeBrabant J, Fonseca R, Upfal E.
PARMA: a parallel randomized algorithm for
approximate association rules mining. In: Proceedings
of the 21st ACM International Conference on Infor-
mation and Knowledge Management, Maui, HI,
USA, 2012, 8594.
68. Moens S, Aksehirli E, Goethals B. Frequent itemset
mining for big data. In: Proceedings of the IEEE Int
Conf Big Data, Santa Clara, CA, USA,
2013:111118.
69. Baralis E, Cerquitelli T, Chiusano S. P-Mine: parallel
itemset mining on large datasets. In: Proceedings of
the IEEE 29th International Conference on Data
Engineering Workshops, Brisbane, Australia, 2013,
266271.
70. Kolias V, Kolias C, Anagnostopoulos I, Kayafas E.
RuleMR: classication rule discovery with
MapReduce. In: Proceedings of the IEEE Int Conf
Big Data, Washington DC, USA, 2014:2028.
71. Qiu H, Gu R, Yuan C, Huang Y. YAFIM: a parallel
frequent itemset mining algorithm with spark. In:
Proceedings of the IEEE International Parallel and
Distributed Processing Symposium Workshops
(IPDPSW), Phoenix, AZ, USA, 2014, 16641671.
72. Zhang F, Liu M, Gui F, Shen W, Shami A, Ma Y. A
distributed frequent itemset mining algorithm using
spark for big data analytics. Cluster Comput 2015,
18(4):14931501.
73. Feng X, Zhao J, Zhang Z. MapReduce-based H-
Mine algorithm. In: Proceedings of the International
Conference on Instrumentation and Measurement,
Computer,Communication and Control, Harbin,
China, 2015, 17551760.
74. Apiletti D, Baralis E, Cerquitelli T, Garza P,
Michiardi P, Pulvirenti F. PaMPa-HD: a parallel
MapReduce-based frequent pattern miner for high-
dimensional data. In: Proceedings of the IEEE Inter-
national Conference on Data Mining Workshop,
Atlantic City, New Jersey, 2015, 839846.
75. Kaul SRM, Kashyap A. R-Apriori: an efcient
apriori-based algorithm on spark. In: Proceedings of
the 8th ACM Workshop on Ph.D.in Information
and Knowledge Management, 2015, 2734.
76. Lin KW, Chung SH, Lin CC. A fast and distributed
algorithm for mining frequent patterns in congested
networks. Computing 2016, 98(3):235256.
77. Salah S, Akbarinia R, Masseglia F. A highly scalable
parallel algorithm for maximally informative k-
itemset mining. Knowl Inf Syst 2017, 50(1):126.
78. Zaharia M, Chowdhury M, Das T, Dave A, Ma J.
Resilient distributed datasets: a fault-tolerant abstrac-
tion for in-memory cluster computing. In: Proceed-
ings of the 9th USENIX Conference on Networked
Systems Design and Implementation, San Jose, CA,
USA, 2012:22.
79. Wang J, Han J. BIDE efcient mining of frequent
closed sequences. In: Proceedings of the Int Conf
Data Eng, Boston, MA, USA, 2004:7990.
80. Yan X, Han J, Afshar R. CloSpan: mining closed
sequential patterns in large datasets. In: Proceedings
of the SIAM International Conference on Data Min-
ing, San Francisco, CA, USA, 2003:166177.
81. Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U.
FreeSpan: frequent pattern-projected sequential pat-
tern mining. In: Proceedings of the ACM SIGKDD
Int Conf Knowl Discov Data Min, Boston, MA,
USA, 2000:355359.
82. Han J, Pei J, Mortazavi-Asl B, Pinto H, Chen Q,
Dayal U, Hsu MC. PrexSpan: mining sequential pat-
terns efciently by prex-projected pattern growth.
In: Proceedings of the 17th International Conference
on Data Engineering, Heidelberg, Germany, 2001,
215224.
Overview wires.wiley.com/dmkd
16 of 19 © 2017 Wi l ey Peri o d i c a ls, In c . Volum e 7 , N o vembe r / D e c ember 2 0 1 7
83. Zaki MJ. SPADE: an efcient algorithm for mining
frequent sequences. Mach Learn 2001, 42(1):3160.
84. Shintani T, Kitsuregawa M. Mining algorithms for
sequential patterns in parallel-hash-based approach.
In: Proceedings of the Pacic-Asia Conf Knowl Dis-
cov Data Min, Melbourne, Australia, 1998:283294.
85. MV Joshi, G Karypis, and V Kumar. Parallel algo-
rithms for mining sequential associations: issues and
challenges. Technical Report under preparation,
Department of Computer Science, University of Min-
nesota, vol. 119, 1999.
86. Zaki MJ. Parallel sequence mining on shared-memory
machines. J Parallel Distrib Comput 2001, 61
(3):401426.
87. Wang K, Xu Y, Yu J. Scalable sequential pattern min-
ing for biological sequences. In: Proceedings of the
ACM Int Conf Inf Knowl Manage, Washington, DC,
USA, 2004:178187.
88. Guralnik V, Karypis G. Parallel tree-projection-based
sequence mining algorithms. Parallel Comput 2004,
30(4):443472.
89. Cong S, Han J, Padua D. Parallel mining of closed
sequential patterns. In: Proceedings of the
ACM SIGKDD Int Conf Knowl Dis in Data Mining,
Chicago, IL, USA, 2005:562567.
90. Qiao S, Tang C, Dai S, Zhu M, Peng J, Li H, Ku Y.
PartSpan: parallel sequence mining of trajectory pat-
terns. Int Conf Fuzzy Syst Knowl Discov, Shandong,
China, 2008:363367.
91. Qiao S, Li T, Peng J, Qiu J. Parallel sequential pattern
mining of massive trajectory data. Int J Comput Intell
Syst 2010, 3(3):343356.
92. Miliaraki I, Berberich K, Gemulla R, Zoupanos S.
Mind the gap: large-scale frequent sequence mining.
In: Proceedings of the ACM SIGMOD Int Conf Man-
age Data, New York, USA, 2013:797808.
93. Sahli M, Mansour E, Kalnis P. Parallel motif extrac-
tion from very long sequences. In: Proceedings of the
22nd ACM International Conference on Information
and Knowledge Management, San Francisco, CA,
USA, 2013, 549558.
94. Liao VCC, Chen MS. DFSP: a depth-rst spelling
algorithm for sequential pattern mining of biological
sequences. Knowl Inf Syst 2014, 38(3):623639.
95. Ge J, Xia Y, Wang J. Mining uncertain sequential
patterns in iterative MapReduce. In: Proceedings of
the Pacic-Asia Conference on Knowledge Discovery
and Data Mining, Ho Chi Minh City, Vietnam,
2015, 243254.
96. Beedkar K, Gemulla R. Lash: Large-scale sequence
mining with hierarchies. In: Proceedings of the ACM
SIGMOD International Conference on Management
of Data, Melbourne, VIC, Australia, 2015, 491503.
97. Ge J, Xia Y. Distributed sequential pattern mining in
large scale uncertain databases. In: Proceedings of the
Pacic-Asia Conference on Knowledge Discovery and
Data Mining, Auckland, New Zealand, 2016, 1729.
98. Dean J, Ghemawat S. MapReduce: simplied data
processing on large clusters. Commun ACM 2008, 51
(1):107113.
99. Malewicz G, Austern MH, Bik AJC, Dehnert JC,
Horn I, Leiser N, Czajkowski G. Pregel: a system for
large-scale graph processing. In: Proceedings of the
ACM SIGMOD Int Conf Manage Data, Indianapo-
lis, Indiana, USA 2010:135146.
100. Low Y. GraphLab: a distributed abstraction for large
scale machine learning. Doctoral Dissertation, Uni-
versity of California, Berkeley, CA, 2013.
101. Gonzalez JE, Xin RS, Dave A, Crankshaw D,
Franklin MJ, Stoica I. GraphX: graph processing in a
distributed dataow framework. In: Proceedings of
the USENIX Symposium on Operating Systems
Design and Implementation (OSDI), Broomeld,
CO, USA, 2014, 599613.
102. Meinl T, Worlein M, Fischer I, Philippsen M. Mining
molecular datasets on symmetric multiprocessor sys-
tems. In: Proceedings of the IEEE Int Conf Syst Man
Cybern, Taipei, Taiwan, 2006:12691274.
103. Wang C, Parthasarathy S. Parallel algorithms for
mining frequent structural motifs in scientic data.
In: Proceedings of the 18th ACM Annual Interna-
tional Conference on Supercomputing, Saint Malo,
France, 2004, 3140.
104. Liu Y, Jiang X, Chen H, Ma J, Zhang X.
MapReduce-based pattern nding algorithm applied
in motif detection for prescription compatibility net-
work. In: Proceedings of the International Workshop
on Advanced Parallel Processing Technologies,
Shanghai, China, 2009, 341355.
105. Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C.
PowerGraph: distributed graph-parallel computation
on natural graphs. In: Proceedings of the 10th USE-
NIX Symposium on Operating Systems Design and
Implementation (OSDI), Hollywood, CA, USA,
2012, 1730.
106. Lu W, Chen G, Tung AKH, Zhao F. Efciently
extracting frequent subgraphs using MapReduce. In:
Proceedings of the IEEE Int Conf Big Data, Santa
Clara Marriott, California, USA, 2013:639647.
107. Lin W, Xiao X, Ghinita G. Large-scale frequent sub-
graph mining in MapReduce. In: Proceedings of the
30th IEEE Int Conf on Data Engineering, Chicago,
IL, USA, 2014, 844855.
108. Zhu X, Han W, Chen W. Gridgraph: large-scale
graph processing on a single machine using 2-level
hierarchical partitioning. In: Proceedings of the USE-
NIX Annual Technical Conference, Santa Clara, CA,
USA, 2015, 375386.
109. Lee H, Shao B, Kang U. Fast graph mining with
hbase. Inform Sci 2015, 315:5666.
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 17 of 19
110. Teixeira CHC, Fonseca AJ, Serani M, Siganos G,
Zaki MJ, Aboulnaga A. Arabesque: a system for dis-
tributed graph mining. In: Proceedings of the 25th
ACM Symposium on Operating Systems Principles,
Monterey, California, USA, 2015, 425440.
111. Talukder N, Zaki MJ. A distributed approach for
graph mining in massive networks. Data Min Knowl
Discov 2016, 30(5): 10241052.
112. Shirkhorshidi AS, Aghabozorgi S, Wah TY,
Herawan T. Big data clustering: a review. In: Pro-
ceedings of the Int Conf Comput Sci Appl, Guimar-
aes, Portugal, 2014:707720.
113. Younis O, Fahmy S. Distributed clustering in ad-hoc
sensor networks: a hybrid, energy-efcient approach.
In: Proceedings of the Annual Joint Conference of the
IEEE Computer and Communications Societies,
Hong Kong, China, vol. 1, 2004.
114. Zhou A, Cao F, Yan Y, Sha C. Distributed data stream
clustering: a fast em-based approach. In: Proceedings of
the 23rd IEEE International Conference on Data Engi-
neering, Istanbul, Turkey, 2007, 736745.
115. Visalakshi NK, Thangavel K. Distributed data clus-
tering: a comparative analysis. Found Comput Intell
vol 6. Springer Berlin Heidelberg, 2009, 371397.
116. Zhao W, Ma H, He Q. Parallel k-means clustering
based on MapReduce. In: Proceedings of the
IEEE Int Conf Cloud Comput, Bangalore, India,
2009:674679.
117. A Dave, W Lu, J Jackson, and R Barga. Cloudclustering:
toward an iterative data processing pattern on the cloud.
In: Parallel and Distributed Processing Workshops and
PhD Forum, Anchorage, Alaska, USA,
2011:11321137.
118. Eyal I, Keidar I, Rom R. Distributed data clustering
in sensor networks. Distrib Comput 2011, 24
(5):207222.
119. Forero PA, Cano A, Giannakis GB. Distributed clus-
tering using wireless sensor networks. IEEE J Sel Top
Signal Process 2011, 5(4):707724.
120. Bahmani B, Moseley B, Vattani A, Kumar R. Scalable
K-means++. VLDB Endowment 2012, 5(7):622633.
121. Liang Y, Balcan MF, Kanchanapally V. Distributed
PCA and K-means clustering. In: Proceedings of the
Big Learning Workshop at NIPS, 2013.
122. Han J, Luo M. Bootstrapping k-means for big
data analysis. In: Proceedings of the IEEE Int
Conf Big Data, Washington DC, USA,
2014:591596.
123. Cui X, Zhu P, Yang X, Li K, Ji C. Optimized big data
K-means clustering using MapReduce. J Supercomput
2014, 70(3):12491259.
124. Xu Y, Qu W, Li Z, Min G, Li K, Liu Z. Efcient k-
means++ approximation with MapReduce. IEEE Trans
Parallel Distrib Syst 2014, 25(12):31353144.
125. MF Balcan, Y Liang, L Song, and D Woodruff, Com-
munication efcient distributed kernel principal com-
ponent analysis, 2015. arXiv preprint
arXiv:1503.06858.
126. Mashayekhi H, Habibi J, Khalafbeigi T, Voulgaris S,
Steen MV. GDCluster: a general decentralized cluster-
ing algorithm. IEEE Trans Knowl Data Eng 2015,
27(7):18921905.
127. Wold S, Esbensen K, Geladi P. Principal components
analysis. Chemom Intel Lab Syst 1987, 2(15):3752.
128. Schölkopf B, Smola A, Müller KR. Kernel principal com-
ponent analysis. In: ProceedingsoftheIntConfArtif
Neural Netw, Lausanne, Switzerland, 1997:583588.
129. Clifton C, Kantarcioglu M, Vaidya J, Lin X. Tools
for privacy preserving distributed data mining.
ACM SIGKDD Explor Newslett 2002, 4(2):2834.
130. Kantarcioglu M, Clifton C. Privacy-preserving dis-
tributed mining of association rules on horizontally
partitioned data. IEEE Trans Knowl Data Eng 2004,
16(9):10261037.
131. Luo C, Pereira AL, Chung SM. Distributed mining of
maximal frequent itemsets on a data grid system.
J Supercomput 2006, 37(1):7190.
132. Zhong S. Privacy-preserving algorithms for distribu-
ted mining of frequent itemsets. Inform Sci 2007,
177:490503.
133. Kargupta H, Das K, Liu K. Multiparty, privacy pre-
serving distributed data mining using a game theo-
retic framework. In: Proceedings of the European
Conference on Principles of Data Mining and Knowl-
edge Discovery, Warsaw, Poland, 2007, 523531.
134. Yakut I, Polat H. Privacy-preserving hybrid collabo-
rative ltering on cross distributed data. Knowl Inf
Syst 2012, 30(2):405433.
135. Kaleli C, Polat H. Privacy-preserving SOM-based
recommendations on horizontally distributed data.
Knowl-Based Syst 2012, 33:124135.
136. Li Y, Chen M, Li Q, Zhang W. Enabling multilevel
trust in privacy preserving data mining. IEEE Trans
Knowl Data Eng 2012, 24(9):15981612.
137. Chun JY, Hong D, Jeong IR, Lee DH. Privacy-
preserving disjunctive normal form operations on dis-
tributed sets. Inform Sci 2013, 231:113122.
138. Zhang F, Rong C, Zhao G, Wu J, Wu X. Privacy-
preserving two-party distributed association rules
mining on horizontally partitioned data. In: Proceed-
ings of the Int Conf Cloud Comput Big Data,
FuZhou, China, 2013:633640.
139. Tassa T. Secure mining of association rules in hori-
zontally distributed databases. IEEE Trans Knowl
Data Eng 2014, 26(4):970983.
140. Bhuyan HK, Kamila NK. Privacy preserving sub-
feature selection in distributed data mining. Appl Soft
Comput 2015, 36:552569.
Overview wires.wiley.com/dmkd
18 of 19 © 2017 Wi l ey Peri o d i c a ls, In c . Volum e 7 , N o vembe r / D e c ember 2 0 1 7
141. Qin Z, Ren K, Yu T, Weng J. DPCode: privacy-
preserving frequent visual patterns publication on
cloud. IEEE Trans Multimedia 2016, 18(5):929939.
142. Lu R, Zhu H, Liu X, Liu J, Shao J. Toward efcient
and privacy-preserving computing in big data era.
IEEE Netw 2014, 28(4):4650.
143. Malik MB, Ghazi MA, Ali R. Privacy preserving
data mining techniques: current scenario and
future prospects. In: Proceedings of the Int Conf
Comput Commun Technol, Allahabad, India,
2012:2632.
144. Parthasarathy S, Ghoting A, Otey M. A survey of dis-
tributed mining of data streams. Data Streams
2007:289307.
145. Xu L, Jiang C, Wang J, Yuan J, Ren Y. Information
security in big data-privacy and data mining.
IEEE Access 2014, 2:11491176.
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 19 of 19