Overview
Data mining in distributed environment: a survey

Wensheng Gan,¹ Jerry Chun-Wei Lin,¹* Han-Chieh Chao² and Justin Zhan³

¹School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen University Town Xili, Shenzhen, China
²Department of Computer Science and Information Engineering, National Dong Hwa University, Shoufeng, Taiwan
³Department of Computer Science, University of Nevada, Las Vegas, NV, USA
*Correspondence to: jerrylin@ieee.org

Conflict of interest: The authors have declared no conflicts of interest for this article.
Due to the rapid growth of resource sharing, distributed systems have been developed that can be used to harness distributed computation. Data mining (DM) provides powerful techniques for finding meaningful and useful information from a very large amount of data, and has a wide range of real-world applications. However, traditional DM algorithms assume that the data is centrally collected, memory-resident, and static. It is challenging to manage large-scale data and process it with very limited resources. For example, large amounts of data are quickly produced and stored at multiple locations, and it becomes increasingly expensive to centralize them in a single place. Moreover, traditional DM algorithms face problems and challenges such as memory limits, low processing ability, and inadequate hard disk space. To address these problems, DM in distributed computing environments [also called distributed data mining (DDM)] has been emerging as a valuable alternative in many applications. In this study, a survey of state-of-the-art DDM techniques is provided, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preserving of distributed data mining. We finally summarize the opportunities of data mining tasks in distributed environments. © 2017 Wiley Periodicals, Inc.

How to cite this article:
WIREs Data Mining Knowl Discov 2017, 7:e1216. doi: 10.1002/widm.1216
INTRODUCTION
With the rapid development of information technology and data collection, Knowledge Discovery in Databases (KDD) provides a powerful capability to discover meaningful and useful information from a collection of data [1–4]. KDD has numerous real-life applications and has resulted in several DM tasks, such as association rule mining (ARM) [2,3], sequential pattern mining (SPM) [4,5], clustering [6,7], classification [8,9], and outlier detection [10], among others. Depending on the requirements of various domains and applications, the discovered knowledge can be generally classified as frequent itemsets and association rules [2,11–13], sequential patterns [1,4,5], sequential rules [14,15], graphs [16,17], high-utility patterns [18–20], weight-based patterns [21,22], and other interesting patterns [23,24]. As an important task for a wide range of real-world applications, frequent itemset mining (FIM) or ARM has been extensively studied. Two well-known algorithms, Apriori [25] and FP-growth [3], were proposed to mine frequent itemsets and association rules based on the generate-and-test and pattern-growth approaches, respectively [3]. Many algorithms have been developed to efficiently mine the desired patterns and information from various types of databases [2,3,6,12,17,23,25].
In general, the distribution of data and computation allows researchers and engineers to solve many problems, and it suits the numerous applications that are distributed in nature.
Distributed systems, in which computational units are connected and organized by networks to meet the demands of both large-scale and high-performance computing, have received considerable attention over the past decades [26–31]. Many types of distributed systems, such as grids [32,33], peer-to-peer (P2P) systems [34], ad hoc networks [35], cloud computing systems [36], and online social network systems [37], have been widely studied. Currently, the applications of distributed systems are varied, including web services, scientific computation, and file storage. At the same time, DM has also been extensively studied [2,3,6,12,17,23,25].
By using DM techniques, organizations, businesses, companies, and scientific centers can discover different kinds of hidden but useful and meaningful patterns and information. As mentioned before, data collected in a distributed manner can also be analyzed by DM techniques [28]. An important scenario of DM is that the databases are distributed between two or more parties, and each party owns a portion of the data. In the past, traditional methods typically assumed that the data is centralized and memory-resident [2,3,6,12,17,23,25]. This assumption is no longer tenable in distributed systems. Unfortunately, directly applying traditional mining algorithms to distributed databases is not effective because it implies a large amount of communication overhead. Implementing high-performance DM in distributed computing environments has thus become critical to exploiting the scalability of such systems.
For traditional DM technologies, a centralized approach is fundamentally inappropriate for many reasons, such as the huge amount of data, the infeasibility of centralizing data stored at multiple sites, bandwidth limitations, energy limitations, and privacy concerns. Therefore, it is important to develop a more adaptable and flexible mining framework to discover hidden but useful and meaningful patterns and information from distributed and complex databases instead of centralized ones. To solve these problems, DM in distributed environments [also called distributed data mining (DDM)] has emerged as an important research area [38–41]. In the DDM literature, one of two assumptions is commonly adopted as to how data is distributed across sites: homogeneously (horizontally partitioned) or heterogeneously (vertically partitioned) [42]. In general, DDM deals with the challenges of analyzing distributed data and offers many algorithmic solutions to perform different data analysis and mining operations in a fundamentally distributed manner that pays careful attention to resource constraints. To improve the performance and scalability of DM, many researchers have provided techniques that work in distributed environments, such as grid computing [32,33], the cloud [36], and Hadoop (the popular open-source implementation of MapReduce [26], http://hadoop.apache.org), and that distribute the mining computation over more than a single node. Previous studies [38–41] have shown that DDM is a powerful tool for end-users, enterprises, and governments to analyze data and discover different kinds of useful knowledge. It provides new opportunities but also poses challenges for DM.
Although some related surveys have been published, most of them provide only a preliminary review of a single type of distributed system, such as surveys of load balancing in grids [32,33], load balancing in cloud computing [27], and load balancing in peer-to-peer (P2P) systems [34,43]. A natural question is how to summarize the related studies on the various types of DM in distributed systems and organize them into a general taxonomy. The methods summarized in this study cover not only distributed systems [44,45], but also related literature on DM [12], parallel computing [46], big data technologies [47,48], and database management [49]. This study thus aims to review current research on DDM. The main contributions of this study are described as follows:
1. We first point out the differences between traditional DM algorithms and those designed for distributed environments. More challenges are encountered when accomplishing DM tasks in a distributed system.

2. We review contemporary work on DM in distributed environments in recent years. This is a high-level survey of distributed system techniques for DM in several respects, including distributed frequent itemset mining (DFIM), distributed frequent sequence mining (DFSM), distributed frequent graph mining (DFGM), distributed clustering (DC), and privacy preserving of distributed data mining (PPDDM).

3. Finally, some opportunities for future research on DM tasks in distributed environments are briefly summarized.
The study is organized as follows: the Distributed Systems and Their Technical Challenges section introduces the definitions and some important features of distributed systems, and summarizes challenges in distributed systems and DDM. The Data Mining Techniques in Distributed Environment section highlights and discusses state-of-the-art research on DM with distributed computing resources. The Opportunity for Distributed Data Mining section briefly summarizes some opportunities for DM tasks in distributed environments. Finally, conclusions are given in the Conclusion section.
DISTRIBUTED SYSTEMS AND THEIR TECHNICAL CHALLENGES
In this section, the related definitions and some important features of distributed systems are stated. Some technical challenges in distributed systems and DDM are then briefly reviewed and summarized.
Distributed Systems
Unlike traditional centralized systems, the term distributed system refers to a large collection of resources shared among computers connected by a network; examples include hardware sharing, software sharing, data sharing, service sharing, and media stream sharing. The development of collaborative computing, parallel computing, and distributed computing motivated the development of distributed systems. A distributed system is defined as one in which components located at networked computers communicate and coordinate their actions only by passing messages [44,45]. In other words, a distributed system is a collection of autonomous computing elements (subsystems) that appears to its users as a single coherent system. A distributed system has a complex nature that requires powerful technologies and advanced algorithms, as shown in Figure 1.

From Figure 1, it can be observed that a distributed system has two aspects: independent computing elements, and a single system image provided by middleware. Distributed systems have several important features: (1) concurrency, i.e., multiprocess and multithread concurrent execution and resource sharing (sharing of information and services); (2) no global clock, with program coordination depending on message passing; and (3) independent failure, e.g., a process failure cannot be directly observed by other processes [44,45]. According to Refs 44 and 45, properties of a distributed system such as transparency, scalability, availability, reliability, serviceability (manageability), and safety should also be discussed and studied.
Challenges in Distributed Systems
Distributed systems, in which the distributed computational units are connected and organized by networks to meet the demand of large-scale and high-performance computing, have received considerable attention over the past decades [26–31]. Many types of distributed systems, such as grids [32,33], P2P systems [34], ad hoc networks [35], cloud computing systems [36], and online social network systems [37], have been widely studied. Currently, there are various applications of distributed systems, such as DM, web services, scientific computation, and file storage. Although great progress in distributed systems has been made, some technical challenges remain [44,45]. As shown in Figure 2, the main challenges in distributed systems can be grouped into eight aspects: heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service. Details of each challenge can be found in Refs 44 and 45.
Challenges in Distributed Data Mining
In recent decades, many models and algorithms have been developed in DM to efficiently discover desired knowledge in various types of databases [2,3,12,23,25], but some challenges in DM have yet to be solved. In 2006, Yang and Wu [50] introduced 10 challenging problems in DM research, such as developing a unifying theory of DM, scaling up for high-dimensional data, DDM and mining multiagent data, security, and privacy. Traditional DM algorithms assume that the data is centralized, memory-resident, and static. Because of the growth of large-scale data in recent decades, two challenges have to be met. First, the amounts of data are rapidly produced. Second, the data are stored at multiple locations, and it becomes increasingly expensive to centralize them in one place. Therefore, the problem of DDM is quite important in various complex network databases. In a distributed environment (such as a sensor or IP network), distributed probes are placed at strategic locations within the network, often in areas with limited energy and limited memory (e.g., limited CPU computation and I/O calls across a distributed architecture). Therefore, DDM techniques are more challenging and complex than those of traditional DM [38–41].

With the data collected from distributed sites, DDM explores techniques for applying DM in a noncentralized way. The goal here is obviously to minimize the amount of data shipped between the various sites. Some important challenges for DDM, such as how to essentially reduce the communication overhead, how to mine across multiple heterogeneous data sources (i.e., multisource databases), and how to perform multirelational mining in a distributed environment, have been studied. As shown in Figure 2, the eight technical challenges in a distributed system, including heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service, are the same challenges faced when performing DDM, especially heterogeneity, security, and scalability. DDM deals with these challenges in analyzing distributed data and offers many algorithmic solutions to perform different data analysis and mining operations in a fundamentally distributed manner that pays careful attention to the resource constraints.
DATA MINING TECHNIQUES IN DISTRIBUTED ENVIRONMENT
In this section, state-of-the-art algorithms for DM in distributed environments, including DFIM, DFSM, DFGM, DC, and privacy preserving of DDM (PPDDM), are reviewed. For each task, the preliminaries and the problem statement are first given briefly; we then describe the key ideas of the related works in detail and highlight their specific contributions.
Distributed Frequent Itemset Mining
Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items; an itemset $X = \{i_1, i_2, \ldots, i_k\}$ with $k$ items is a subset of $I$. The length or size of $X$ is denoted $|X|$, i.e., the number of items in $X$ (w.r.t. $k$). Given a transactional database $D$, each transaction $T_q \in D$ is generally identified by a transaction id (TID), and $|D|$ denotes the total number of transactions. The support of $X$ in database $D$ is denoted $sup(X)$ and is the proportion of transactions containing $X$, i.e., $sup(X) = |\{T_q \mid T_q \in D, X \subseteq T_q\}| / |D|$. The support count or frequency of itemset $X$ is the number of transactions in $D$ containing $X$. An itemset is said to be a frequent itemset (FI) if its support is no less than the user-defined minimum support threshold, minsup.
FIGURE 1 | Architecture of a distributed system. Each host runs application components on top of a middleware layer, a local OS, and hardware; the middleware presents the same interface everywhere, and the hosts are connected through a network.
FIGURE 2 | Technical challenges in distributed systems: heterogeneity, openness, security, scalability, failure handling, concurrency, transparency, and quality of service.
Therefore, the problem of frequent itemset mining is to discover all itemsets whose support is no less than the user-defined minimum support threshold, i.e., $sup(X) \geq minsup$ [25].
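To make these definitions concrete, the following minimal Python sketch (a toy example of our own, not one of the surveyed algorithms) enumerates the itemsets of a four-transaction database and keeps those whose support reaches minsup = 0.5.

```python
# Minimal illustration of the support definition above (plain Python, no
# distribution): sup(X) = |{T in D : X is a subset of T}| / |D|.
from itertools import combinations

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]   # toy database
minsup = 0.5

def support(X, D):
    return sum(1 for T in D if X <= T) / len(D)

# Enumerate candidate itemsets naively and keep the frequent ones.
items = sorted(set().union(*D))
frequent = {}
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        s = support(set(X), D)
        if s >= minsup:
            frequent[X] = s
print(frequent)   # e.g. {('a',): 0.75, ('b',): 0.75, ..., ('a', 'b'): 0.5, ...}
```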
As one of the most important tasks for a wide range of real-world applications, FIM and ARM have been extensively studied [2,11–13]. ARM consists of two phases: it first discovers the frequent itemsets, and then generates the association rules from the derived frequent itemsets. Because the first phase is more challenging and interesting than the second, most efforts on ARM address the problem of FIM. Two well-known algorithms, Apriori [25] and FP-growth [3], were respectively proposed to mine frequent itemsets and association rules, and many algorithms have since been developed to efficiently mine the desired frequent itemsets or association rules from various types of databases [2,3,12,23,25]. The problem of FIM in a distributed/parallel environment (DFIM) has also been extensively studied, and a number of approaches have been explored to address it. Table 1 gives an overview of frequent itemset mining algorithms for distributed/parallel environments.
In 1995, Mueller first proposed two parallel algorithms, called parallel efficient association rules (PEAR) [51] and parallel partition association rules (PPAR) [51]. Park et al. proposed an algorithm named parallel data mining (PDM) for parallel mining of association rules [52], and the fast distributed mining (FDM) algorithm for distributed databases [53] was developed later. Cheung et al. proposed a mining algorithm named DMA to mine association rules in distributed databases [56]. An algorithm named Hash Partitioned Apriori (HPA) was first introduced in Ref 54, and the modified HPA-ELD approach [54], i.e., HPA with extremely large itemset duplication, was then proposed. Based on the partition technology, Zaki et al. then developed the Partitioned Candidate Common Database (PCCD) and Common Candidate Partitioned Database (CCPD) algorithms [55]. At the same time, several data distribution (DD)-based technologies have been extensively studied, such as CD [46], CD tree projection [59], DD [46], HD [58], IDD [58], IDD tree projection [59], and DDDM [60].
By extending the vertical mining approach Eclat [57], the parallel Eclat (ParEclat) [57] and the distributed Eclat (Dist-Eclat) [68] were, respectively, developed. With dynamic mining in mind, the ZIGZAG-based incremental approach [62] was proposed for distributed and parallel incremental mining of frequent rules. Lin et al. developed three versions of the Apriori algorithm, namely single pass counting (SPC), fixed passes combined counting (FPC), and dynamic passes combined counting (DPC), on the MapReduce framework [66]. SPC is a straightforward algorithm, while FPC aims at reducing the number of scheduling invocations and DPC dynamically combines candidates of various lengths.
Recently, many DDM algorithms have been developed on the Spark or Hadoop platforms. Hadoop is one of the well-known platforms using the MapReduce framework [26], and it is open-source software available for any implementation. The Hadoop distributed file system (HDFS) is used to store datasets in Hadoop (http://hadoop.apache.org). Spark [78] is a newer in-memory, distributed data-flow platform, which uses the Resilient Distributed Dataset (RDD) architecture to store the results at the end of an iteration and provide them to the next iteration. In general, Spark is one to two orders of magnitude faster than MapReduce [78].
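As a hedged illustration of how the counting step of Apriori-style FIM maps onto this kind of platform (in the spirit of the single-pass counting idea mentioned above, not a reproduction of SPC itself), the sketch below uses Spark's RDD API to count candidate 2-itemsets with a map and a reduceByKey; the application name and toy data are our own.

```python
# Hedged sketch: counting candidate 2-itemsets with Spark RDD operations.
# Map phase emits (candidate, 1) pairs per transaction; reduce phase sums them.
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="spc-style-counting-sketch")

transactions = sc.parallelize([
    ["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"],
])

def candidates(transaction, k=2):
    # Emit (candidate k-itemset, 1) pairs for one transaction.
    return [(itemset, 1) for itemset in combinations(sorted(transaction), k)]

support_counts = (transactions
                  .flatMap(candidates)                 # "map" phase
                  .reduceByKey(lambda a, b: a + b))    # "reduce" phase

min_count = 2
frequent_2_itemsets = support_counts.filter(lambda kv: kv[1] >= min_count)
print(frequent_2_itemsets.collect())  # e.g. [(('a','b'), 2), (('a','c'), 2), (('b','c'), 2)]
sc.stop()
```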
Research efforts have already been made to improve the Apriori-based and other traditional FIM/ARM algorithms by converting them into distributed versions under the MapReduce [26] or Spark [78] environments. Examples include a parallel FP-Growth [64], a parallel randomized algorithm (PARMA) [67] for approximate association rule mining in MapReduce, the MapReduce-based H-mine algorithm [73], a parallel FIM algorithm with Spark (R-Apriori) [75], and PaMPa-HD [74]. Details of these algorithms are described below.
An adaptation of FP-Growth to MapReduce [26], called PFP, is presented in Ref 64. PFP is a parallel form of the classical FP-Growth; it splits a large-scale mining task into independent and parallel tasks. First, a parallel/distributed counting approach is used to compute the frequent items, and the datasets are randomly partitioned into several groups. In a single MapReduce round, the transactions in the dataset are used to generate group-dependent transactions. The PFP approach shows good performance with a near-linear speedup. Although PARMA [67] is not the first algorithm using MapReduce to solve the task of DFIM, it is the first randomized MapReduce algorithm for discovering approximate collections of frequent itemsets or association rules with near-linear speedup. PARMA is also the first algorithm combining random sampling and parallelization to mine frequent itemsets or association rules. As shown in Ref 68, Dist-Eclat is a MapReduce implementation of the well-known Eclat algorithm [57], and BigFIM is a hybrid approach exploiting both the Apriori and Eclat paradigms based on MapReduce [68]. Dist-Eclat focuses on speeding up mining performance, while BigFIM is optimized to run on very large datasets. Baralis et al. [69] presented a parallel disk-based approach, named P-Mine, to solve the task of DFIM on a multicore processor by improving the I/O performance with a prefetching strategy.
TABLE 1 | Algorithms for Distributed Frequent Itemset Mining

Name | Description | Year
PEAR [51] | Parallel efficient association rules | 1995
PPAR [51] | Parallel partition association rules | 1995
PDM [52] | Parallel mining of association rules | 1995
FDM [53] | Fast distributed mining for distributed databases | 1995
HPA [54] | Hash-partitioned Apriori | 1996
PCCD [55] | Partitioned candidate common database | 1996
DMA [56] | Mine association rules in distributed databases | 1996
CCPD [55] | Common candidate partitioned database | 1996
CD [46] | Count distribution | 1996
HPA-ELD [54] | HPA with extremely large itemset duplication | 1996
ParEclat [57] | Parallel Eclat | 1997
HD [58] | Hybrid distribution | 2000
CD tree projection [59] | Count distributed tree projection | 2001
DD [46] | Data distribution | 1996
IDD [58] | Intelligent data distribution | 2000
IDD tree projection [59] | Intelligent data distribution tree projection | 2001
DDDM [60] | Distributed dual decision miner, communication-efficient distributed mining of association rules | 2001
Fast distributed data mining [61] | Distributed mining of classification rules | 2002
ZIGZAG-based incremental approach [62] | Distributed and parallel incremental mining of frequent rules | 2004
Par-FP [63] | Parallel FP-growth with sampling | 2005
PFP [64] | An adaptation of FP-Growth to MapReduce | 2008
DPA [65] | Distributed parallel Apriori | 2010
DPC [66] | Dynamic passes combined-counting | 2012
FPC [66] | Fixed passes combined-counting | 2012
PARMA [67] | A parallel randomized algorithm for approximate association rule mining in MapReduce | 2012
BigFIM [68] | Frequent itemset mining for big data | 2013
Dist-Eclat [68] | Distributed Eclat based on MapReduce | 2013
P-Mine [69] | Parallel itemset mining on large datasets | 2013
RuleMR [70] | Classification rule discovery with MapReduce | 2014
YAFIM [71] | A parallel frequent itemset mining algorithm on Spark | 2014
DFIMA [72] | Apriori-like distributed frequent itemset mining algorithm | 2015
MRH-mine [73] | MapReduce-based H-mine algorithm | 2015
PaMPa-HD [74] | Parallel MapReduce-based frequent pattern miner for high-dimensional data | 2015
R-Apriori [75] | An efficient Apriori-based algorithm on Spark | 2015
FDMCN [76] | A fast and distributed mining algorithm for discovering frequent patterns in congested networks | 2016
PHIKS [77] | A highly scalable parallel algorithm, named parallel highly informative K-itemset, for maximally informative k-itemset mining | 2016
Recently, Qiu et al. [71] reported a speedup of nearly 18 times on average over various benchmarks for the yet another frequent itemset mining (YAFIM) algorithm based on Spark. Results obtained on real-world medical data show that YAFIM is much faster than the Hadoop-based algorithms. Kaul and Kashyap [75] then proposed the Reduced-Apriori (R-Apriori) algorithm, a parallel Apriori algorithm based on the Spark Resilient Distributed Dataset (RDD) framework. It adds an additional phase to YAFIM and speeds up the second round of generating promising candidate sets, in order to achieve higher performance than YAFIM.
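For readers who want to see what driving a distributed FIM job on Spark looks like in practice, the hedged sketch below uses the FP-growth implementation shipped with recent Spark releases (pyspark.ml.fpm.FPGrowth); it only illustrates the programming model, and is not the YAFIM or R-Apriori code.

```python
# Hedged sketch: expressing a distributed FIM task on Spark through the
# built-in FP-Growth estimator (pyspark.ml.fpm.FPGrowth in recent releases).
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("distributed-fim-sketch").getOrCreate()

# Toy transactional database D; each row is one transaction (a list of items).
transactions = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"]), (3, ["b", "c"])],
    ["tid", "items"],
)

# minSupport corresponds to the minsup threshold in the problem definition.
fp = FPGrowth(itemsCol="items", minSupport=0.5)
model = fp.fit(transactions)     # mining runs in parallel across executors
model.freqItemsets.show()        # all itemsets X with sup(X) >= minsup
spark.stop()
```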
According to these studies, implementations based on Spark are generally more efficient than those based on the Hadoop model. Even so, the performance of the above approaches might not be satisfactory, owing to the bottleneck of iterative computation when handling large-scale datasets. Therefore, a distributed algorithm for frequent itemset mining (DFIMA) was proposed to improve and speed up the FIM process [72]. Some other distributed and highly scalable parallel mining approaches have also been developed in recent years, such as FDMCN (a fast and distributed mining algorithm for discovering frequent patterns in congested networks) [76] and PHIKS (parallel highly informative K-itemset) [77]. Different from the general itemset mining problem, Salah et al. [77] studied the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy, and proposed PHIKS, a highly scalable and parallel miki mining algorithm. For the classification application, Cho and Wüthrich [61] introduced a model for fast distributed mining of classification rules, and the MapReduce-based RuleMR [70] was developed further.
Distributed Sequential Pattern Mining
Different from FIM, SPM discovers frequent subsequences as the interesting patterns in a sequence database, which embeds the timestamps of events. The itemset mining model was extended to handle sequences by Srikant and Agrawal [4]. A sequence database $SDB = \{S_1, S_2, \ldots, S_n\}$ is a set of tuples $(sid, S)$, where $sid$ is a sequence identifier and $S_k$ is an input sequence. A sequence $S_\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ is called a subsequence of another sequence $S_\beta = (\beta_1, \beta_2, \ldots, \beta_m)$ ($n \le m$), and $S_\beta$ is called a super-sequence of $S_\alpha$, if there exist integers $1 \le i_1 < \cdots < i_n \le m$ such that $\alpha_1 \subseteq \beta_{i_1}, \ldots, \alpha_n \subseteq \beta_{i_n}$; this is denoted as $S_\alpha \sqsubseteq S_\beta$. A tuple $(sid, S)$ is said to contain a sequence $S_\alpha$ if $S$ is a super-sequence of $S_\alpha$. The support of a sequence $S_\alpha$ in a sequence database $SDB$, denoted $sup(S_\alpha)$, is the number of tuples in $SDB$ that contain $S_\alpha$. The sequential pattern mining problem was first introduced by Srikant and Agrawal [4] and can be formulated as follows: given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minsup threshold, sequential pattern mining is to find all of the frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is not less than minsup [4].
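The following short Python sketch (a toy illustration of the definitions above, not part of any surveyed algorithm) checks sequence containment with an ordered subset match and counts the support of a pattern in a three-sequence database.

```python
# Toy illustration of the subsequence/support definitions above.
# Each sequence is a list of itemsets (frozensets); S_a is a subsequence of S_b
# if its itemsets can be matched, in order, to a subset-chain inside S_b.
def is_subsequence(S_a, S_b):
    j = 0
    for element in S_b:
        if j < len(S_a) and S_a[j] <= element:
            j += 1
    return j == len(S_a)

def support(S_a, SDB):
    return sum(1 for S in SDB if is_subsequence(S_a, S))

SDB = [
    [frozenset("a"), frozenset("bc"), frozenset("d")],
    [frozenset("ab"), frozenset("c")],
    [frozenset("a"), frozenset("c"), frozenset("d")],
]
pattern = [frozenset("a"), frozenset("c")]
print(support(pattern, SDB))   # 3: every sequence contains <(a)(c)>
```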
Many well-known algorithms for sequential pattern mining have been proposed, such as AprioriAll [1], generalized sequential patterns (GSP) [4], BI-Directional Extension (BIDE) [79], CloSpan [80], Frequent pattern-projected Sequential pattern mining (FreeSpan) [81], Prefix-projected Sequential pattern mining (PrefixSpan) [82], and Sequential PAttern Discovery using Equivalence classes (SPADE) [83]. It has been shown that SPM has broad applications in real-world situations. Among these algorithms, AprioriAll [1] and GSP [4] are the fundamental Apriori-based algorithms, which mine the sequential patterns in a levelwise manner.
Up to now, many researchers have provided different techniques to work in distributed environments, such as grid computing [32,33], the cloud [36], and Hadoop (http://hadoop.apache.org), or to distribute the mining computation over more than one node for mining sequential patterns. As shown in Table 2, some distributed and parallel methods for SPM are described below.
In 1998, Shintani and Kitsuregawa partitioned the input sequences in nonpartitioned sequential pattern mining (NPSPM), yet they assumed that the entire candidate set can be replicated and fit into the overall memory (random access memory and hard drive) of a process [84]. Similar assumptions were made in EVEnt distribution (EVE) [85], EVEnt and CANdidate distribution (EVECAN) [85], and the data parallel formulation (DPF) [88]. In addition, a hash function was used in the hash partitioned sequential pattern mining (HPSPM) algorithm to assign input and candidate sequences to specific processes [84]. Input partitioning is, however, not inherently necessary for shared-memory or MapReduce distributed systems. In shared-memory systems, the input data, i.e., the sequences, should fit in the aggregated system memory and is available to be read by all processes. Thus, Zaki extended the efficient SPADE algorithm to the shared-memory parallel architecture, called pSPADE [86]. In the pSPADE framework, the input data is assumed to reside on shared hard drive space and to be stored in the vertical database format.

In order to balance the mining tasks, Cong et al. designed several models, Par-FP [63], Par-ASP [63], and Par-CSP [89], to accomplish the task. They use a sampling technique that requires the entire input set to be available at each process. In addition, the 2PDF-Index [87], 2PDF-Compression [87], and DFSP [94] algorithms were proposed and applied to scalable mining of sequential patterns from biological sequences. After that, some distributed and parallel mining methods, such as the MapReduce-based distributed GSP (DGSP) and large-scale frequent sequence mining (MG-FSM), were proposed by extending traditional SPM algorithms.
TABLE 2 | Algorithms for Distributed Sequential Pattern Mining

Name | Description | Year
HPSPM [84] | Hash partitioned sequential pattern mining | 1998
NPSPM [84] | Nonpartitioned sequential pattern mining | 1998
EVE [85] | EVEnt distribution | 1999
EVECAN [85] | EVEnt and CANdidate distribution | 1999
pSPADE [86] | Parallel SPADE | 2001
2PDF-Index and 2PDF-Compression [87] | Scalable sequential pattern mining for biological sequences | 2004
DPF [88] | Data parallel formulation | 2004
Par-ASP [63] | Parallel PrefixSpan with sampling | 2005
Par-CSP [89] | Parallel CloSpan with sampling | 2005
DGSP [90] | Distributed GSP | 2008
PLUTE [91] | Parallel sequential patterns mining | 2010
MG-FSM [92] | Large-scale frequent sequence mining | 2013
ACME [93] | Advanced parallel motif extractor | 2013
DFSP [94] | A depth-first SPelling algorithm for sequential pattern mining of biological sequences | 2013
An iterative MapReduce framework [95] | Manages data uncertainty in SPM and executes the uncertain SPM algorithm in parallel | 2015
LASH [96] | LArge-scale Sequence mining with Hierarchies | 2015
Distributed DP [97] | A memory-efficient distributed DP approach that uses an extended prefix-tree to save intermediate results | 2016
With motifs, uncertain sequences, and hierarchies in mind, the advanced parallel motif extractor (ACME) [93], an iterative MapReduce framework [95], and the LASH [96] algorithm were also proposed for large-scale distributed sequence mining. Other algorithms for distributed sequential pattern mining, such as the memory-efficient distributed DP approach [97], are still under development. As mentioned before, the problem of sequential pattern mining is more complicated than frequent itemset mining or ARM; thus, fewer DFSM approaches have been proposed than DFIM approaches. With the rapid development of SPM techniques and of the latest distributed platforms and tools, state-of-the-art research on distributed sequential pattern mining continues to evolve. Generally speaking, DFSM is a considerable research topic in the fields of DM and big data analytics.
Distributed Frequent Graph Mining
In this section, we discuss another DDM task, DFGM. Different from FIM and SPM, the graph is a ubiquitous and essential data representation for modeling real-world objects and their relationships [16]. Today, large amounts of graph data are generated by various applications, including social networks, biological networks, the WWW, and so on. Different from other general data structures, e.g., itemsets and sequences, the labeled graph structure is much more complicated and can be used as a model for discovering substructure patterns among data. Frequent graph mining (FGM) problems therefore take an input graph G where vertices and edges are labeled; vertices and edges have unique ids, and their labels are arbitrary, domain-specific attributes that can be null [16].

In 2003, Yan and Han developed the first pattern-growth FGM method, named graph-based substructure pattern mining (gSpan) [17]. It avoids duplicates by only expanding subtrees that lie on the rightmost path in the depth-first traversal. With the overwhelming amount of information encoded in these graphs, there is a crucial need for efficient tools to quickly explore large graphs and return concise patterns that can be easily understood. Distributed data processing platforms, such as MapReduce [98], Pregel [99], GraphLab [100], and GraphX [101], have substantially simplified the design and deployment of distributed graph analytics algorithms. In particular, these platforms achieve good performance on distributed graph mining problems. Because a pattern is an arbitrary graph, finding frequent subgraphs in a labeled graph is an important topic in graph mining. Up to now, successful algorithms for FGM have been related to those designed for FIM. In this section, we provide a brief overview of some key distributed methods for DFGM and then discuss each of them in detail. As shown in Table 3, the current methods for DFGM are summarized below.
A pattern-growth method called Molecular Fragment miner (MoFa) was introduced by Borgelt et al.; it can mine both molecular substructures and general frequent subgraphs. With a dynamic load balancing strategy, Fatta and Berthold proposed the distributed MoFa with dynamic load balancing (d-MoFa) algorithm [39]. By extending the well-known gSpan algorithm, a parallel gSpan algorithm named p-gSpan was also proposed [102]. Wang and Parthasarathy then designed a toolkit to mine motif patterns, named MotifMiner [103]. Based on the MapReduce distributed data processing platform, researchers have contributed great efforts to DFGM, such as the MRPF [104] algorithm for MapReduce-based subgraph pattern finding and MRFSE [106] for MapReduce-based frequent subgraph extraction. In real-world situations, however, natural graphs have commonly been found to have highly skewed power-law degree distributions, which challenge the assumptions made by previous approaches. Thus, Gonzalez et al. introduced a new approach, PowerGraph, to distributed graph placement and representation that exploits the structure of power-law graphs [105]. In addition, a two-step filter-and-refinement MapReduce framework for frequent subgraph mining was presented in Ref 107.
presented in Ref 107. In recent years, several distrib-
uted graph mining and analytics systems have been
proposed, including GraphX,
101
GridGraph,
108
UNICORN,
109
Arabesque,
110
and DistGraph,
111
and
so forth. The GraphX aims at processing graphs in a
distributed dataow framework, an integrated graph
and collections Application Programming Interface
(API) which is sufcient to express existing graph
abstractions and enable a much wider range of com-
putation.
101
With the development of Grid technol-
ogy, GridGraph is a large-scale graph processing
system on a single machine using 2-level hierarchical
partitioning.
108
As an open source version of
Bigtable,
44
UNICORN exploits the random write
characteristic of HBASE (http://hbase.apache.org/) to
improve the performance of generalized iterative
matrixvector multiplication.
109
Arabesque,
110
the
rst distributed data processing platform for imple-
menting graph mining algorithms, automates the
process of exploring a very large number of sub-
graphs, and it denes a high-level lter-process com-
putational model. Recently, the DistGraph
111
was
proposed as the rst distributed method to mine a
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 9of19
massive input graph that is too large to t in the
memory of any individual compute node.
Distributed Clustering
Successful clustering algorithms have also been adapted to the distributed environment, and distributed clustering (DC) [112] has thus become an important research topic within clustering. In this section, we provide a brief overview of some key methods for DC; Table 4 lists and summarizes the distributed methods.
Clustering techniques can be classified into two main categories: single-machine and multiple-machine clustering techniques. The latter, DC [112], is related to distributed and parallel systems, and most DC methods were designed based on MapReduce. In 2004, the hybrid energy-efficient distributed clustering (HEED) algorithm [113] was introduced by Younis et al. Zhou et al. then presented an EM-based framework for distributed data stream clustering [114]. For distributed data clustering, a comparative analysis system with three approaches, respectively named Improved Distributed Combining Algorithm (IDCA), Distributed K-Means (DKMA), and traditional Centralized Clustering Algorithm (CCA), was proposed in Ref 115. Based on MapReduce, an efficient parallel K-means clustering (PKMeans) was proposed by directly extending the traditional K-means algorithm [116], and optimized K-means clustering algorithms using MapReduce were proposed later [123]. Bahmani et al. also proposed k-means||, an efficient parallel version of the inherently sequential K-means++ [120]. The MapReduce K-means++ method replaces the iterations among multiple machines with a single machine, which can significantly reduce the communication and I/O costs. The above K-means-based approaches are designed to return exact results; it is, however, not an easy task to quickly find exact results from big data. Therefore, an efficient approximate approach, K-means++ approximation with MapReduce, was introduced in Ref 124. It drastically reduces the number of MapReduce jobs by using only one MapReduce job to obtain the k centers. At the same time, Han and Luo proposed a fast K-means method using a statistical bootstrap [122].
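To illustrate the MapReduce pattern that PKMeans-style methods repeat once per job, the hedged sketch below runs a single K-means iteration with Python's multiprocessing standing in for a cluster: the map step assigns each local chunk of points to its nearest center and emits partial sums, and the reduce step merges them into new centers. The data, center values, and function names are illustrative only.

```python
# Minimal sketch of one MapReduce-style K-means iteration (plain Python).
from multiprocessing import Pool
import math

centers = [(0.0, 0.0), (10.0, 10.0)]
chunks = [
    [(0.5, 1.0), (1.0, 0.0), (9.0, 9.5)],
    [(10.5, 9.0), (0.2, 0.3), (9.8, 10.2)],
]

def assign(chunk):
    # Map step: center_index -> (sum_x, sum_y, count) for the local chunk.
    partial = {}
    for x, y in chunk:
        idx = min(range(len(centers)),
                  key=lambda i: math.dist((x, y), centers[i]))
        sx, sy, n = partial.get(idx, (0.0, 0.0, 0))
        partial[idx] = (sx + x, sy + y, n + 1)
    return partial

if __name__ == "__main__":
    with Pool(2) as pool:
        partials = pool.map(assign, chunks)
    # Reduce step: merge partial sums and recompute centers.
    merged = {}
    for p in partials:
        for idx, (sx, sy, n) in p.items():
            msx, msy, mn = merged.get(idx, (0.0, 0.0, 0))
            merged[idx] = (msx + sx, msy + sy, mn + n)
    new_centers = [(sx / n, sy / n)
                   for sx, sy, n in (merged[i] for i in sorted(merged))]
    print(new_centers)   # updated centers after one iteration
```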
With sensor network applications in mind, some DC methods have been proposed, such as a generic algorithm for distributed data clustering in sensor networks [119] and the novel DKM algorithm for clustering observations collected by spatially distributed resource-aware sensors [118].
TABLE 3 | Algorithms for Distributed Frequent Graph Mining

Name | Description | Year
p-MoFa [102] | Parallel MoFa | 2006
p-gSpan [102] | Parallel gSpan | 2006
d-MoFa [39] | Distributed MoFa with dynamic load balancing | 2006
MotifMiner [103] | MotifMiner toolkit | 2004
MRPF [104] | MapReduce-based pattern finding | 2009
Pregel [99] | A system for large-scale graph processing | 2010
PowerGraph [105] | Distributed graph-parallel computation on natural graphs | 2012
MRFSE [106] | MapReduce-based frequent subgraph extraction | 2013
Filter-and-refinement [107] | A two-step filter-and-refinement MapReduce framework for frequent subgraph mining | 2014
GraphX [101] | A distributed dataflow framework | 2014
GridGraph [108] | Large-scale graph processing using hierarchical partitioning | 2015
UNICORN [109] | A graph mining library on top of HBase | 2015
Arabesque [110] | A system for distributed graph mining | 2015
DistGraph [111] | A distributed approach for graph mining in massive networks | 2016
Recently, two K-means-based models, distributed PCA and K-means [121] and KPCA + K-means clustering [125], were developed based on the PCA [127] and kernel PCA [128] concepts, respectively. Mashayekhi et al. proposed GDCluster, a general fully decentralized clustering method, which is capable of clustering dynamic and distributed datasets [126]. In GDCluster, nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the dataset. Other approaches for DC are still in progress.
Privacy Preserving of Distributed Data Mining
Before reviewing current work on privacy-preserving DM in distributed environments (PPDDM), we first stress the significance of and motivation for this research topic. With the rapid development of networks, communications, and computer technology, privacy-preserving data mining (PPDM) has become an increasingly important topic in DM [129]. Especially in distributed environments, how to protect data privacy while performing DM tasks over a large amount of distributed data is more challenging and interesting. PPDM has emerged as an important topic in DM, and many related works have been extensively studied, such as PPDM of association rules and frequent itemsets, PPDM of sequential patterns, and PPDM of graphs [129]. In particular, several papers have addressed the privacy issues in mining association rules and frequent itemsets from distributed data. In the literature, Clifton et al. first raised the issue of PPDDM of association rules and frequent itemsets [129]. A brief overview of PPDDM is shown in Table 5.
In 2004, Kantarcioglu and Clifton proposed PPDM for association rules in horizontally distributed databases that uses Yao's generic secure-computation protocol as a subprotocol. They also designed several methods that incorporate cryptographic techniques to minimize the information shared while adding little overhead to the mining task [130].
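As a hedged sketch of the kind of cryptographic building block such protocols rely on, the code below implements a basic secure-sum round in Python: each party adds its private local support count to a randomly masked running total, so the global count for an itemset is recovered without any single count being revealed. This is only the textbook secure-sum idea simulated in one process, not the actual Kantarcioglu-Clifton protocol, and the counts are made up.

```python
# Hedged sketch of the secure-sum building block used in PPDM over
# horizontally partitioned data.
import random

MODULUS = 2**32                    # all arithmetic is done modulo a large value
local_supports = [120, 45, 80]     # each party's private support count for itemset X

def secure_sum(values, modulus=MODULUS):
    r = random.randrange(modulus)  # initiator's random mask
    running = r
    for v in values:               # each party adds its value to the masked total
        running = (running + v) % modulus
    return (running - r) % modulus # initiator removes the mask

global_support = secure_sum(local_supports)
print(global_support)              # 245: the global count, without revealing 120/45/80
```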
TABLE 4 | Algorithms for Distributed Clustering

Name | Description | Year
HEED [113] | Hybrid energy-efficient distributed clustering | 2004
EM-based framework [114] | Distributed data stream clustering | 2007
IDCA, DKMA, CCA [115] | Distributed data clustering: a comparative analysis system | 2009
PKMeans [116] | Parallel K-means clustering based on MapReduce | 2009
CloudClustering [117] | Toward an iterative data processing pattern on the cloud | 2011
Novel DKM [118] | A distributed algorithm for clustering observations collected by spatially distributed resource-aware sensors | 2011
A generic algorithm [119] | Distributed data clustering in sensor networks | 2011
K-Means++ [120] | An efficient parallel version (k-means||) of the inherently sequential K-means++ | 2012
Distributed PCA and K-Means [121] | Distributed PCA and K-means clustering | 2013
Bootstrapping K-means [122] | A fast K-means method using a statistical bootstrap | 2014
Optimize K-means [123] | Optimized K-means clustering algorithm using MapReduce | 2014
MapReduce K-means++ [124] | Efficient K-means++ approximation with MapReduce | 2014
KPCA + K-means clustering [125] | A communication-efficient algorithm to perform kernel PCA in the distributed setting | 2015
GDCluster [126] | A general distributed clustering algorithm | 2015
Luo et al. then proposed the GridDMM algorithm [131] for distributed mining of maximal frequent itemsets on a data grid system. In Ref 132, two algorithms for both vertically and horizontally partitioned data with cryptographically strong privacy were introduced. In addition, hybrid CF-based referrals with decent accuracy on cross distributed data (CDD) were presented in Ref 134. Privacy preservation in distributed systems has also been pursued in several other directions, such as multiparty privacy-preserving DDM [133] and privacy-preserving SOM-based recommendations on horizontally distributed data [135], among others.
In Ref 136, the researchers proposed the Multilevel Trust (MLT)-PPDM model to expand the scope of perturbation-based PPDM to multilevel trust. In order to reduce the disjunctive operations, Chun et al. developed the PPDNF approach for privacy-preserving disjunctive normal form operations on distributed sets [137]. Tassa then proposed a protocol for secure mining of association rules in horizontally distributed databases that improves significantly upon the leading protocol in terms of privacy and efficiency [139]. Different from previous PPDDM approaches, the first algorithm for privacy-preserving sub-feature selection in DDM was introduced by Bhuyan and Kamila [140]; it focuses on the issue of sub-feature selection instead of the traditional patterns (itemsets, sequences, graphs, trees, etc.). To address the visualization problem of PPDDM, a novel technique called DPcode [141] was recently proposed for privacy-preserving frequent visual pattern publication on the cloud. Furthermore, some reviews of privacy-preserving computing on distributed data have been published [142–145].
TABLE 5 | Algorithms for Privacy Preserving of Distributed Data Mining (PPDDM)

Name | Description | Year
Toolkit [129] | Tools for privacy-preserving distributed mining | 2002
Secure mining [130] | PPDM for association rules in horizontally distributed databases | 2004
GridDMM [131] | Distributed mining of maximal frequent itemsets on a data grid system | 2006
Two algorithms for vertically partitioned data [132] | Algorithms for both vertically and horizontally partitioned data, with cryptographically strong privacy | 2007
Multiparty PPDM [133] | A game-theoretic approach for PPDDM | 2007
PPCF on CDD [134] | Hybrid CF-based referrals with decent accuracy on cross distributed data (CDD) | 2012
SOM-based recommendation [135] | A privacy-preserving scheme to provide recommendations on horizontally partitioned data among multiple parties | 2012
MLT-PPDM [136] | Relaxes the single-level-trust assumption and expands the scope of perturbation-based PPDM to multilevel trust | 2012
PPDNF [137] | Privacy-preserving disjunctive normal form operations on distributed sets | 2013
Privacy-preserving two-party distributed mining [138] | Privacy-preserving two-party distributed association rule mining on horizontally partitioned data | 2013
Secure mining [139] | Secure mining of association rules in horizontally distributed databases | 2014
Sub-feature selection [140] | Privacy-preserving sub-feature selection in distributed data mining | 2015
DPcode [141] | Privacy-preserving frequent visual pattern publication on cloud | 2016
OPPORTUNITY FOR DISTRIBUTED DATA MINING
Undoubtedly, the world is shrinking into a small village owing to the tangible influence of networks and various types of distributed systems, such as online social network systems [37], P2P systems [34], ad hoc networks [35], and cloud computing systems [36]. They connect people from different parts of the world by sharing data, services, and media streams. Many researchers have proposed various DDM techniques based on different domain requirements and applications, such as DFIM, DFSM, DFGM, DC, and PPDDM. As mentioned before, the Challenges in Distributed Data Mining section provides an up-to-date view of the challenges for DDM. DDM has to deal with complex distributed systems, but it also reveals many opportunities. We next highlight some important research opportunities.
1. Developing more efficient algorithms. DDM is computationally expensive in terms of computational cost and memory usage for making resources accessible (e.g., limited CPU computation and I/O calls across the distributed architecture). In order to achieve high performance, some distributed/parallel DM platforms and tools have been developed in recent years, such as MapReduce [26] and Spark [78]. These developments provide the necessary theoretical and technical support for DDM. Although currently developed algorithms are efficient, there is still room for improvement when handling large-scale data.

2. Heterogeneity. Relational or nonrelational database systems often utilize a single schema, or their files have a homogeneous format. In the big data era, a large amount of heterogeneous distributed data must be processed. Traditional DM techniques are designed to discover useful knowledge in structured data, while heterogeneity is an inherent property of distributed data. Thus, it is a major challenge and opportunity for DM, particularly for DDM, to discover the useful knowledge embedded in unstructured and/or semistructured data.

3. Different types of mining patterns. Besides FIM, ARM, sequential pattern mining, and graph mining, several other pattern mining problems have been studied, e.g., sequential rule mining [14,15], high-utility pattern mining [18–20], weight-based pattern mining [21,22], and other interesting pattern mining [23,24]. Research on these problems inspires distributed pattern mining; thus, many research opportunities in DDM can be further explored.

4. A wide range of applications in various domains. Based on specific applications, many possibilities for further research on DDM can be extensively studied. How to utilize DFIM, DFSM, DFGM, DC, and PPDDM in new or existing applications is an interesting issue. We expect more research topics on DDM in the near future.

5. Security. Undoubtedly, the information resources that are made available and maintained in distributed systems have a high intrinsic value to their users [44,45]. Therefore, security is an important topic in DDM. When analyzing big datasets, security and privacy issues are emerging topics. Several PPDDM approaches have been mentioned and discussed in this study. However, how to improve the applicability and flexibility of PPDDM is still a major challenge, and many opportunities can be extended and studied.
CONCLUSION
Typically, DM algorithms aim to discover the desired patterns (i.e., frequent itemsets, sequential patterns, graphs, etc.) or to perform clustering, classification, outlier detection, and so on. In general, the collected data and the applications that analyze them are distributed in nature. Due to the problems and challenges faced by traditional DM algorithms when processing distributed data, DM on distributed computing environments has emerged as an important research topic. However, few studies have summarized the related developments across the various types of DM in distributed systems or provided a general taxonomy of them.

In this study, we therefore introduce the definitions, general architecture, and several important features of a distributed system, and then point out the challenges of DM tasks in distributed environments. The main contribution is that we investigate recent advances in distributed DM and provide state-of-the-art details, including DFIM, DFSM, DFGM, DC, and PPDDM. For future research, some opportunities for DM tasks in a distributed environment can be reasonably considered and further developed: (1) DM of multisource, multimodal, and heterogeneous data; (2) new types of pattern representation or knowledge representation in DDM; (3) visualization techniques for DDM; and (4) security issues and quality of service of DDM in the big data era.
ACKNOWLEDGMENTS
This research was partially supported by the National Natural Science Foundation of China (NSFC) under grant no. 61503092, by the Research on the Technical Platform of Rural Cultural Tourism Planning Basing on Digital Media project under grant 2017A020220011, and by the Tencent Project under grant CCF-Tencent IAGR20160115.
REFERENCES
1. Agrawal R, Srikant R. Mining sequential patterns. In: Proceedings of the International Conference on Data Engineering, Taipei, Taiwan, 1995, 3–14.
2. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 1993, 207–216.
3. Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 2004, 8(1):53–87.
4. Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, Avignon, France, 1996, 3–17.
5. Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Hsu MC. Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 2004, 16(11):1424–1440.
6. Berkhin P. A survey of clustering data mining techniques. In: Grouping Multidimensional Data. Berlin Heidelberg: Springer; 2006, 25–71.
7. Jarvis RA, Patrick EA. Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput 1973, 100(11):1025–1034.
8. Kotsiantis SB. Supervised machine learning: a review of classification techniques. Informatica 2007, 31(3):249–269.
9. Quinlan JR. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 1993.
10. Lee W, Stolfo S, Mok K. Adaptive intrusion detection: a data mining approach. Artif Intell Rev 2000, 14(6):533–567.
11. Vo B, Le T, Hong TP, Le B. Fast updated frequent-itemset lattice for transaction deletion. Data Knowl Eng 2015, 96:78–89.
12. Chen MS, Han J, Yu PS. Data mining: an overview from a database perspective. IEEE Trans Knowl Data Eng 1996, 8(6):866–883.
13. Vo B, Hong TP, Le B. A lattice-based approach for mining most generalization association rules. Knowl-Based Syst 2013, 45:20–30.
14. Fournier-Viger P, Nkambou R, Tseng VS. RuleGrowth: mining sequential rules common to several sequences by pattern-growth. In: Proceedings of the ACM Symposium on Applied Computing, Taichung, Taiwan, 2011, 956–961.
15. Fournier-Viger P, Faghihi U, Nkambou R, Nguifo EM. CMRules: mining sequential rules common to several sequences. Knowl-Based Syst 2012, 25:63–76.
16. Kuramochi M, Karypis G. Frequent subgraph discovery. In: Proceedings of the IEEE International Conference on Data Mining, San Jose, California, USA, 2001, 313–320.
17. Yan X, Han J. gSpan: graph-based substructure pattern mining. In: Proceedings of the IEEE International Conference on Data Mining, Melbourne, Florida, USA, 2003, 721–724.
18. Lin JCW, Gan W, Hong TP, Zhang B. An incremental high-utility mining algorithm with transaction insertion. Scientific World J 2015, Article ID 161564.
19. Tseng VS, Wu CW, Shie BE, Yu PS. UP-Growth: an efficient algorithm for high utility itemset mining. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2010, 253–262.
20. Yao H, Hamilton HJ, Butz CJ. A foundational approach to mining itemset utilities from databases. In: Proceedings of the SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, 2004, 211–225.
21. Lin JCW, Gan W, Fournier-Viger P, Hong TP. RWFIM: recent weighted-frequent itemsets mining. Eng Appl Artif Intel 2015, 45:18–32.
22. Vo B, Coenen F, Le B. A new method for mining frequent weighted itemsets based on wit-trees. Expert Syst Appl 2013, 40(4):1256–1264.
23. Geng L, Hamilton HJ. Interestingness measures for data mining: a survey. ACM Comput Surv 2006, 38(3):1–32.
24. Hong TP, Wu YY, Wang SL. An effective mining approach for up-to-date patterns. Expert Syst Appl 2009, 36(6):9747–9752.
Overview wires.wiley.com/dmkd
14 of 19 © 2017 Wi l ey Peri o d i c a ls, In c . Volum e 7 , N o vembe r / D e c ember 2 0 1 7
25. Agrawal R, Srikant R. Fast algorithms for mining
association rules in large databases. In: Proceedings
of the International Conference on Very Large Data
Bases, Santiago de Chile, Chile, 1994, 487499.
26. Dean J, Ghemawat S. MapReduce: a exible data
processing tool. Commun ACM 2010, 53(1):7277.
27. Jiang Y. A survey of task allocation and load balan-
cing in distributed systems. IEEE Trans Parallel Dis-
trib Syst 2016, 27(2):585599.
28. Park B, Kargupta H, Johnson E, Sanseverino E,
Hershberger D, Silvestre L. Distributed, collaborative
data analysis from heterogeneous sites using a scala-
ble evolutionary technique. Appl Intell 2002, 16
(1):1942.
29. R Riesen, R Brightwell, and AB Maccabe, Differences
between distributed and parallel systems, SAND98-
2221, Unlimited Release, 1998. Available at: http://
www.cs.sandia.gov/rbbrigh/papers/distpar.pdf
30. Steen M, Pierre G, Voulgaris S. Challenges in very
large distributed systems. J Internet Serv Appl 2012,
3(1):5966.
31. Xu L, Huang Z, Jiang H, Tian L, Swanson D. VSFS:
a searchable distributed le system. In: Proceedings of
the IEEE Parallel Data Storage Workshop, New
Orleans, Louisiana, 2014, 2530.
32. Liu J, Jin X, Wang Y. Agent-based load balancing on
homogeneous minigrids: macroscopic modeling and
characterization. IEEE Trans Parallel Distrib Syst
2005, 16(7):586598.
33. Luo P, Lü K, Shi Z, He Q. Distributed data mining in
grid computing environments. Future Gener Comput
Syst 2007, 23(1):8491.
34. Rao W, Chen L, Fu AWC, Wang G. Optimal
resource placement in structured peer-to-peer net-
works. IEEE Trans Parallel Distrib Syst 2010, 21
(7):10111026.
35. Xue Y, Li B, Nahrstedt K. Optimal resource alloca-
tion in wireless ad hoc networks: a price-based
approach. IEEE Trans Mobile Comput 2006, 5
(4):347364.
36. Gkatzikis L, Koutsopoulos I. Migrate or not? Exploit-
ing dynamic task migration in mobile cloud comput-
ing systems. IEEE Wireless Commun 2013, 20
(7):2432.
37. Jiang Y, Jiang JC. Understanding social networks
from a multiagent perspective. IEEE Trans Parallel
Distrib Syst 2014, 25(10):27432759.
38. Cieslak DA, Thain D, Chawla NV. Troubleshooting
distributed systems via data mining. In: Proceedings
of the IEEE Int Symp High Perform Distrib Comput,
Paris, France, 2006:309312.
39. Fatta GD, Berthold MR. Dynamic load balancing for
the distributed mining. IEEE Trans Parallel Distrib
Syst 2006, 17(8):773785.
40. Silva JCD, Giannella C, Bhargava R, Kargupta H,
Klusch M. Distributed data mining and agents. Eng
Appl Artif Intel 2005, 18(7):791807.
41. Zeng L, Li L, Duan L, Lu K, Shi Z, Wang M, Wu W,
Luo P. Distributed data mining: a survey. Inf Technol
Manage 2012, 13(4):403409.
42. Tsoumakas G, Vlahavas I. Distributed data mining.
In: Encyclopedia of Data Warehousing and Mining,
IGI Global, Hershey, PA, USA, 2009, 709715.
43. SM Thampi, Survey on distributed data mining in
p2p networks, 2012. arXiv preprint
arXiv:1205.3231.
44. Chang F, Dean J, Ghemawat S, Hsieh WC. Bigtable:
a distributed storage system for structured data.
ACM Trans Comput Syst 2008, 26(2):4.
45. Tanenbaum AS, Steen MV. Distributed Systems: Prin-
ciples and Paradigms. Upper Saddle River, NJ: Pren-
tice-Hall, Inc.; 2006.
46. Agrawal R, Shafer JC. Parallel Mining of Association
Rules. IEEE Trans Knowl Data Eng 1996, 8
(6):962969.
47. Khan N, Yaqoob I, Hashem IA, Inayat Z, Ali WK,
Alam M, Shiraz M, Gani A. Big data: survey, technol-
ogies, opportunities, and challenges. Scientic World
J2014, 2014: Article ID 712826.
48. Wu X, Zhu X, Wu GQ, Ding W. Data mining with
big data. IEEE Trans Knowl Data Eng 2014, 26
(1):97107.
49. Li F, Ooi BC, Özsu MT, Wu S. Distributed data man-
agement using MapReduce. ACM Comput Surv
2014, 46(3): 31.
50. Yang Q, Wu X. 10 challenging problems in data min-
ing research. Int J Inf Technol Decision Making
2006, 5(4):597604.
51. A Mueller, Fast sequential and parallel algorithms for
association rule mining: a comparison. Technical
Report, University of Maryland at College
Park, 1995.
52. Park JS, Chen MS, Yu PS. Efcient parallel data min-
ing for association rules. In: Proceedings of the
ACM Int Conf Inf Knowl Manage, Baltimore, MD,
USA, 1995:3136.
53. Cheung DW, Han J, Ng VT, Fu AW, Fu Y. A fast
distributed algorithm for mining association rules. In:
Proceedings of the Int Conf Parallel Distrib Inf Syst,
Miami Beach, Florida, USA, 1996:3142.
54. Shintani T, Kitsuregawa M. Hash-based parallel algo-
rithms for mining association rules. In: Proceedings of
the Int Conf Parallel Distrib Inf Syst, Miami Beach,
Florida, USA, 1996:1930.
55. Zaki MJ, Ogihara M, Parthasarathy S, Li W. Parallel
data mining for association rules on shared-memory
multi-processors. In: Proceedings of the ACM/IEEE
Conf Supercomput, Pittsburgh, PA, USA,
1996:4343.
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 15 of 19
56. Cheung DW, Ng VT, Fu AW, Fu Y. Efcient mining
of association rules in distributed databases.
IEEE Trans Knowl Data Eng 1996, 8(6):911922.
57. Zaki MJ, Parthasarathy S, Ogihara M, Li W. Parallel
algorithms for discovery of association rules. Data
Min Knowl Discov 1997, 1(4):343373.
58. Han EH, Karypis G, Kumar V. Scalable parallel data
mining for association rules. IEEE Trans Knowl Data
Eng 2000, 12(3):337352.
59. Agarwal RC, Aggarwal CC, Prasad VVV. A tree pro-
jection algorithm for generation of frequent item sets.
J Parallel Distrib Comput 2001, 61(3):350371.
60. Schuster A, Wolff R. Communication-efcient distrib-
uted mining of association rules. ACM SIGMOD
Record 2001, 30(2):473484.
61. Cho V, Wüthrich B. Distributed mining of classica-
tion rules. Knowl Inf Syst 2002, 4(1):130.
62. Otey M, Parthasarathy S, Wang C, Veloso A,
Meira W. Parallel and distributed methods for incre-
mental frequent itemset mining. IEEE Trans Syst
Man Cybern B Cybern 2004, 34(6):24392450.
63. Cong S, Han J, Hoeinger J, Padua D. A sampling-
based framework for parallel data mining. In: Pro-
ceedings of the ACM SIGPLAN Symposium on Prin-
ciples and Practice of Parallel Programming,
Chicago, Illinois, USA, 2005, 255265.
64. Li H, Wang Y, Zhang D, Zhang M, Chang EY. PFP:
parallel FP-growth for query recommendation. In:
Proceedings of the ACM Conf Recommender Syst,
Lousanne, Switzerland, 2008:107114.
65. Yu KM, Zhou J, Hong TP, Zhou JL. A load-balanced
distributed parallel mining algorithm. Expert Syst
Appl 2010, 37(3):24592464.
66. Lin MY, Lee PY, Hsueh SC. Apriori-based frequent
itemset mining algorithms on MapReduce. In: Pro-
ceedings of the 6th ACM International Conference on
Ubiquitous Information Management and Communi-
cation, Kuala Lumpur, Malaysia, 2012, P. 76.
67. Riondato M, DeBrabant J, Fonseca R, Upfal E.
PARMA: a parallel randomized algorithm for
approximate association rules mining. In: Proceedings
of the 21st ACM International Conference on Infor-
mation and Knowledge Management, Maui, HI,
USA, 2012, 8594.
68. Moens S, Aksehirli E, Goethals B. Frequent itemset
mining for big data. In: Proceedings of the IEEE Int
Conf Big Data, Santa Clara, CA, USA,
2013:111118.
69. Baralis E, Cerquitelli T, Chiusano S. P-Mine: parallel
itemset mining on large datasets. In: Proceedings of
the IEEE 29th International Conference on Data
Engineering Workshops, Brisbane, Australia, 2013,
266271.
70. Kolias V, Kolias C, Anagnostopoulos I, Kayafas E.
RuleMR: classication rule discovery with
MapReduce. In: Proceedings of the IEEE Int Conf
Big Data, Washington DC, USA, 2014:2028.
71. Qiu H, Gu R, Yuan C, Huang Y. YAFIM: a parallel
frequent itemset mining algorithm with spark. In:
Proceedings of the IEEE International Parallel and
Distributed Processing Symposium Workshops
(IPDPSW), Phoenix, AZ, USA, 2014, 16641671.
72. Zhang F, Liu M, Gui F, Shen W, Shami A, Ma Y. A
distributed frequent itemset mining algorithm using
spark for big data analytics. Cluster Comput 2015,
18(4):14931501.
73. Feng X, Zhao J, Zhang Z. MapReduce-based H-
Mine algorithm. In: Proceedings of the International
Conference on Instrumentation and Measurement,
Computer,Communication and Control, Harbin,
China, 2015, 17551760.
74. Apiletti D, Baralis E, Cerquitelli T, Garza P,
Michiardi P, Pulvirenti F. PaMPa-HD: a parallel
MapReduce-based frequent pattern miner for high-
dimensional data. In: Proceedings of the IEEE Inter-
national Conference on Data Mining Workshop,
Atlantic City, New Jersey, 2015, 839846.
75. Kaul SRM, Kashyap A. R-Apriori: an efcient
apriori-based algorithm on spark. In: Proceedings of
the 8th ACM Workshop on Ph.D.in Information
and Knowledge Management, 2015, 2734.
76. Lin KW, Chung SH, Lin CC. A fast and distributed
algorithm for mining frequent patterns in congested
networks. Computing 2016, 98(3):235256.
77. Salah S, Akbarinia R, Masseglia F. A highly scalable
parallel algorithm for maximally informative k-
itemset mining. Knowl Inf Syst 2017, 50(1):126.
78. Zaharia M, Chowdhury M, Das T, Dave A, Ma J.
Resilient distributed datasets: a fault-tolerant abstrac-
tion for in-memory cluster computing. In: Proceed-
ings of the 9th USENIX Conference on Networked
Systems Design and Implementation, San Jose, CA,
USA, 2012:22.
79. Wang J, Han J. BIDE efcient mining of frequent
closed sequences. In: Proceedings of the Int Conf
Data Eng, Boston, MA, USA, 2004:7990.
80. Yan X, Han J, Afshar R. CloSpan: mining closed
sequential patterns in large datasets. In: Proceedings
of the SIAM International Conference on Data Min-
ing, San Francisco, CA, USA, 2003:166177.
81. Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U.
FreeSpan: frequent pattern-projected sequential pat-
tern mining. In: Proceedings of the ACM SIGKDD
Int Conf Knowl Discov Data Min, Boston, MA,
USA, 2000:355359.
82. Han J, Pei J, Mortazavi-Asl B, Pinto H, Chen Q,
Dayal U, Hsu MC. PrexSpan: mining sequential pat-
terns efciently by prex-projected pattern growth.
In: Proceedings of the 17th International Conference
on Data Engineering, Heidelberg, Germany, 2001,
215224.
Overview wires.wiley.com/dmkd
16 of 19 © 2017 Wi l ey Peri o d i c a ls, In c . Volum e 7 , N o vembe r / D e c ember 2 0 1 7
83. Zaki MJ. SPADE: an efcient algorithm for mining
frequent sequences. Mach Learn 2001, 42(1):3160.
84. Shintani T, Kitsuregawa M. Mining algorithms for
sequential patterns in parallel-hash-based approach.
In: Proceedings of the Pacic-Asia Conf Knowl Dis-
cov Data Min, Melbourne, Australia, 1998:283294.
85. MV Joshi, G Karypis, and V Kumar. Parallel algo-
rithms for mining sequential associations: issues and
challenges. Technical Report under preparation,
Department of Computer Science, University of Min-
nesota, vol. 119, 1999.
86. Zaki MJ. Parallel sequence mining on shared-memory
machines. J Parallel Distrib Comput 2001, 61
(3):401426.
87. Wang K, Xu Y, Yu J. Scalable sequential pattern min-
ing for biological sequences. In: Proceedings of the
ACM Int Conf Inf Knowl Manage, Washington, DC,
USA, 2004:178187.
88. Guralnik V, Karypis G. Parallel tree-projection-based
sequence mining algorithms. Parallel Comput 2004,
30(4):443472.
89. Cong S, Han J, Padua D. Parallel mining of closed
sequential patterns. In: Proceedings of the
ACM SIGKDD Int Conf Knowl Dis in Data Mining,
Chicago, IL, USA, 2005:562567.
90. Qiao S, Tang C, Dai S, Zhu M, Peng J, Li H, Ku Y.
PartSpan: parallel sequence mining of trajectory pat-
terns. Int Conf Fuzzy Syst Knowl Discov, Shandong,
China, 2008:363367.
91. Qiao S, Li T, Peng J, Qiu J. Parallel sequential pattern
mining of massive trajectory data. Int J Comput Intell
Syst 2010, 3(3):343356.
92. Miliaraki I, Berberich K, Gemulla R, Zoupanos S.
Mind the gap: large-scale frequent sequence mining.
In: Proceedings of the ACM SIGMOD Int Conf Man-
age Data, New York, USA, 2013:797808.
93. Sahli M, Mansour E, Kalnis P. Parallel motif extrac-
tion from very long sequences. In: Proceedings of the
22nd ACM International Conference on Information
and Knowledge Management, San Francisco, CA,
USA, 2013, 549558.
94. Liao VCC, Chen MS. DFSP: a depth-rst spelling
algorithm for sequential pattern mining of biological
sequences. Knowl Inf Syst 2014, 38(3):623639.
95. Ge J, Xia Y, Wang J. Mining uncertain sequential
patterns in iterative MapReduce. In: Proceedings of
the Pacic-Asia Conference on Knowledge Discovery
and Data Mining, Ho Chi Minh City, Vietnam,
2015, 243254.
96. Beedkar K, Gemulla R. Lash: Large-scale sequence
mining with hierarchies. In: Proceedings of the ACM
SIGMOD International Conference on Management
of Data, Melbourne, VIC, Australia, 2015, 491503.
97. Ge J, Xia Y. Distributed sequential pattern mining in
large scale uncertain databases. In: Proceedings of the
Pacic-Asia Conference on Knowledge Discovery and
Data Mining, Auckland, New Zealand, 2016, 1729.
98. Dean J, Ghemawat S. MapReduce: simplied data
processing on large clusters. Commun ACM 2008, 51
(1):107113.
99. Malewicz G, Austern MH, Bik AJC, Dehnert JC,
Horn I, Leiser N, Czajkowski G. Pregel: a system for
large-scale graph processing. In: Proceedings of the
ACM SIGMOD Int Conf Manage Data, Indianapo-
lis, Indiana, USA 2010:135146.
100. Low Y. GraphLab: a distributed abstraction for large
scale machine learning. Doctoral Dissertation, Uni-
versity of California, Berkeley, CA, 2013.
101. Gonzalez JE, Xin RS, Dave A, Crankshaw D,
Franklin MJ, Stoica I. GraphX: graph processing in a
distributed dataow framework. In: Proceedings of
the USENIX Symposium on Operating Systems
Design and Implementation (OSDI), Broomeld,
CO, USA, 2014, 599613.
102. Meinl T, Worlein M, Fischer I, Philippsen M. Mining
molecular datasets on symmetric multiprocessor sys-
tems. In: Proceedings of the IEEE Int Conf Syst Man
Cybern, Taipei, Taiwan, 2006:12691274.
103. Wang C, Parthasarathy S. Parallel algorithms for
mining frequent structural motifs in scientic data.
In: Proceedings of the 18th ACM Annual Interna-
tional Conference on Supercomputing, Saint Malo,
France, 2004, 3140.
104. Liu Y, Jiang X, Chen H, Ma J, Zhang X.
MapReduce-based pattern nding algorithm applied
in motif detection for prescription compatibility net-
work. In: Proceedings of the International Workshop
on Advanced Parallel Processing Technologies,
Shanghai, China, 2009, 341355.
105. Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C.
PowerGraph: distributed graph-parallel computation
on natural graphs. In: Proceedings of the 10th USE-
NIX Symposium on Operating Systems Design and
Implementation (OSDI), Hollywood, CA, USA,
2012, 1730.
106. Lu W, Chen G, Tung AKH, Zhao F. Efciently
extracting frequent subgraphs using MapReduce. In:
Proceedings of the IEEE Int Conf Big Data, Santa
Clara Marriott, California, USA, 2013:639647.
107. Lin W, Xiao X, Ghinita G. Large-scale frequent sub-
graph mining in MapReduce. In: Proceedings of the
30th IEEE Int Conf on Data Engineering, Chicago,
IL, USA, 2014, 844855.
108. Zhu X, Han W, Chen W. Gridgraph: large-scale
graph processing on a single machine using 2-level
hierarchical partitioning. In: Proceedings of the USE-
NIX Annual Technical Conference, Santa Clara, CA,
USA, 2015, 375386.
109. Lee H, Shao B, Kang U. Fast graph mining with
hbase. Inform Sci 2015, 315:5666.
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 17 of 19
110. Teixeira CHC, Fonseca AJ, Serani M, Siganos G,
Zaki MJ, Aboulnaga A. Arabesque: a system for dis-
tributed graph mining. In: Proceedings of the 25th
ACM Symposium on Operating Systems Principles,
Monterey, California, USA, 2015, 425440.
111. Talukder N, Zaki MJ. A distributed approach for
graph mining in massive networks. Data Min Knowl
Discov 2016, 30(5): 10241052.
112. Shirkhorshidi AS, Aghabozorgi S, Wah TY,
Herawan T. Big data clustering: a review. In: Pro-
ceedings of the Int Conf Comput Sci Appl, Guimar-
aes, Portugal, 2014:707720.
113. Younis O, Fahmy S. Distributed clustering in ad-hoc
sensor networks: a hybrid, energy-efcient approach.
In: Proceedings of the Annual Joint Conference of the
IEEE Computer and Communications Societies,
Hong Kong, China, vol. 1, 2004.
114. Zhou A, Cao F, Yan Y, Sha C. Distributed data stream
clustering: a fast em-based approach. In: Proceedings of
the 23rd IEEE International Conference on Data Engi-
neering, Istanbul, Turkey, 2007, 736745.
115. Visalakshi NK, Thangavel K. Distributed data clus-
tering: a comparative analysis. Found Comput Intell
vol 6. Springer Berlin Heidelberg, 2009, 371397.
116. Zhao W, Ma H, He Q. Parallel k-means clustering
based on MapReduce. In: Proceedings of the
IEEE Int Conf Cloud Comput, Bangalore, India,
2009:674679.
117. A Dave, W Lu, J Jackson, and R Barga. Cloudclustering:
toward an iterative data processing pattern on the cloud.
In: Parallel and Distributed Processing Workshops and
PhD Forum, Anchorage, Alaska, USA,
2011:11321137.
118. Eyal I, Keidar I, Rom R. Distributed data clustering
in sensor networks. Distrib Comput 2011, 24
(5):207222.
119. Forero PA, Cano A, Giannakis GB. Distributed clus-
tering using wireless sensor networks. IEEE J Sel Top
Signal Process 2011, 5(4):707724.
120. Bahmani B, Moseley B, Vattani A, Kumar R. Scalable
K-means++. VLDB Endowment 2012, 5(7):622633.
121. Liang Y, Balcan MF, Kanchanapally V. Distributed
PCA and K-means clustering. In: Proceedings of the
Big Learning Workshop at NIPS, 2013.
122. Han J, Luo M. Bootstrapping k-means for big
data analysis. In: Proceedings of the IEEE Int
Conf Big Data, Washington DC, USA,
2014:591596.
123. Cui X, Zhu P, Yang X, Li K, Ji C. Optimized big data
K-means clustering using MapReduce. J Supercomput
2014, 70(3):12491259.
124. Xu Y, Qu W, Li Z, Min G, Li K, Liu Z. Efcient k-
means++ approximation with MapReduce. IEEE Trans
Parallel Distrib Syst 2014, 25(12):31353144.
125. MF Balcan, Y Liang, L Song, and D Woodruff, Com-
munication efcient distributed kernel principal com-
ponent analysis, 2015. arXiv preprint
arXiv:1503.06858.
126. Mashayekhi H, Habibi J, Khalafbeigi T, Voulgaris S,
Steen MV. GDCluster: a general decentralized cluster-
ing algorithm. IEEE Trans Knowl Data Eng 2015,
27(7):18921905.
127. Wold S, Esbensen K, Geladi P. Principal components
analysis. Chemom Intel Lab Syst 1987, 2(15):3752.
128. Schölkopf B, Smola A, Müller KR. Kernel principal com-
ponent analysis. In: ProceedingsoftheIntConfArtif
Neural Netw, Lausanne, Switzerland, 1997:583588.
129. Clifton C, Kantarcioglu M, Vaidya J, Lin X. Tools
for privacy preserving distributed data mining.
ACM SIGKDD Explor Newslett 2002, 4(2):2834.
130. Kantarcioglu M, Clifton C. Privacy-preserving dis-
tributed mining of association rules on horizontally
partitioned data. IEEE Trans Knowl Data Eng 2004,
16(9):10261037.
131. Luo C, Pereira AL, Chung SM. Distributed mining of
maximal frequent itemsets on a data grid system.
J Supercomput 2006, 37(1):7190.
132. Zhong S. Privacy-preserving algorithms for distribu-
ted mining of frequent itemsets. Inform Sci 2007,
177:490503.
133. Kargupta H, Das K, Liu K. Multiparty, privacy pre-
serving distributed data mining using a game theo-
retic framework. In: Proceedings of the European
Conference on Principles of Data Mining and Knowl-
edge Discovery, Warsaw, Poland, 2007, 523531.
134. Yakut I, Polat H. Privacy-preserving hybrid collabo-
rative ltering on cross distributed data. Knowl Inf
Syst 2012, 30(2):405433.
135. Kaleli C, Polat H. Privacy-preserving SOM-based
recommendations on horizontally distributed data.
Knowl-Based Syst 2012, 33:124135.
136. Li Y, Chen M, Li Q, Zhang W. Enabling multilevel
trust in privacy preserving data mining. IEEE Trans
Knowl Data Eng 2012, 24(9):15981612.
137. Chun JY, Hong D, Jeong IR, Lee DH. Privacy-
preserving disjunctive normal form operations on dis-
tributed sets. Inform Sci 2013, 231:113122.
138. Zhang F, Rong C, Zhao G, Wu J, Wu X. Privacy-
preserving two-party distributed association rules
mining on horizontally partitioned data. In: Proceed-
ings of the Int Conf Cloud Comput Big Data,
FuZhou, China, 2013:633640.
139. Tassa T. Secure mining of association rules in hori-
zontally distributed databases. IEEE Trans Knowl
Data Eng 2014, 26(4):970983.
140. Bhuyan HK, Kamila NK. Privacy preserving sub-
feature selection in distributed data mining. Appl Soft
Comput 2015, 36:552569.
Overview wires.wiley.com/dmkd
18 of 19 © 2017 Wi l ey Peri o d i c a ls, In c . Volum e 7 , N o vembe r / D e c ember 2 0 1 7
141. Qin Z, Ren K, Yu T, Weng J. DPCode: privacy-
preserving frequent visual patterns publication on
cloud. IEEE Trans Multimedia 2016, 18(5):929939.
142. Lu R, Zhu H, Liu X, Liu J, Shao J. Toward efcient
and privacy-preserving computing in big data era.
IEEE Netw 2014, 28(4):4650.
143. Malik MB, Ghazi MA, Ali R. Privacy preserving
data mining techniques: current scenario and
future prospects. In: Proceedings of the Int Conf
Comput Commun Technol, Allahabad, India,
2012:2632.
144. Parthasarathy S, Ghoting A, Otey M. A survey of dis-
tributed mining of data streams. Data Streams
2007:289307.
145. Xu L, Jiang C, Wang J, Yuan J, Ren Y. Information
security in big data-privacy and data mining.
IEEE Access 2014, 2:11491176.
WIREs Data Mining and Knowledge Discovery Data mining in distributed environment
Vo l u m e 7 , N ovemb e r / D e cembe r 2 0 1 7 © 2017 W i l e y P e r iodi c a l s , I n c. 19 of 19