Collaborative Learning with Heterogeneous
Local Models: A Rule-based Knowledge Fusion
Approach
Ying Pang, Haibo Zhang, Jeremiah D. Deng, Lizhi Peng, and Fei Teng
Abstract—Federated Learning (FL) has emerged as a promising collaborative learning paradigm that enables the training of machine
learning models across decentralized devices, while keeping the training data localized to preserve user privacy. However, the heterogeneity in
both decentralized training data and distributed computing resources has posed significant challenges to the design of effective and
efficient FL schemes. Most existing solutions either focus on tackling a single type of heterogeneity, or are unable to fully support
model heterogeneity with low communication overhead, fast convergence, and good interpretability. In this paper, we present CloREF,
a novel rule-based collaborative learning framework that allows devices in FL to use completely different local learning models to cater
to both data and resource heterogeneity. In CloREF, each rule is represented as a linear model, which provides good interpretability.
Each participating device chooses a local model and trains it using its local data. The decision boundary of each trained local model
is then approximated using a set of rules, which effectively bridges the gap arising from model heterogeneity. All participating devices
collaborate to select the optimal set of rules as the global model, employing evolutionary optimization to effectively fuse the knowledge
acquired from all local models. Experimental results on both synthesized and real-world datasets demonstrate that the rules generated
by our proposed method can mimic the behaviors of various learning models with high fidelity (>0.95 in most tests), and CloREF
gives competitive performance in accuracy, AUC, and communication overhead, compared with both the best-performing model trained
centrally and several state-of-the-art model-heterogeneous federated learning schemes.
Index Terms—collaborative learning, heterogeneous participants, rule extraction, federated learning, knowledge fusion.
1 INTRODUCTION
Federated Learning (FL) is a new paradigm of dis-
tributed machine learning that enables multiple parties
to collaboratively build a high-performance prediction
model [1]. In FL, a central server guides the training pro-
cess, enabling participating devices to iteratively refine the
learning model using their local data and share the acquired
knowledge through model aggregation on the server. This
eliminates the need to share local training data, thereby
protecting user privacy. FL has enabled a wide range of
applications, including Google’s Gboard [2], anomaly de-
tection [3], smart health [4], and recommender system [5].
A widely adopted implementation of FL requires all local
models to employ the same architecture so as to produce a
single global prediction model. This brings several limi-
tations. Firstly, most implementations of FL use the Deep
Neural Network (DNN) model, whose number of parame-
ters to be learned grows significantly with the increase in
network size. This requires all participants to have similar
capabilities in terms of computation, memory, power and
network bandwidth. However, such capabilities can differ
significantly at the participating devices. Hence, the global
model complexity will be constrained by the least capable
participating device. Secondly, the datasets held by differ-
Y. Pang, H. Zhang and J. D. Deng are with the School of Computing,
University of Otago, New Zealand, E-mail: haibo.zhang@otago.ac.nz
L. Peng is with the School of Information Science and Engineering, University
of Jinan, China
F. Teng is with the School of Computer Science and Artificial Intelligence,
Southwest Jiaotong University, China
ent participants might have diverse distributions, making
it reasonable for different participants to apply different
types of machine learning models. Even if all local datasets
have a similar distribution, the performance of collaborative
learning could be improved by exploiting the diversity of
learning models just as in ensemble learning, since different
models can have different capabilities in characterising the
decision boundaries. Apart from the limitation on homo-
geneity, highly complex models such as DNNs also have
low interpretability. Many real-world applications (e.g., IoT
and finance) generate tabular data with low or moderate di-
mensions, making it unnecessary and even unsuitable to use
over-complicated and less-interpretable learning models.
Few FL schemes can truly achieve model heterogeneity
that allows heterogeneous local models to be first trained at
the participants and then aggregated at the central server.
The main challenge lies in the interpretation and merging of
knowledge embedded in heterogeneous models. To tackle
this challenge, one approach is knowledge distillation (KD)
[6], [7], which enables the global model (student model) to
obtain knowledge from heterogeneous local models (teacher
models) by aggregating local models’ predictions on a shared
proxy dataset. However, the requirement for a shared
proxy dataset compromises privacy preservation and may
be infeasible for participants with limited storage capacity.
Another approach is model splitting (MS) [8], [9], which
allows participants to train only sub-parts of the global
model according to their available resources. However, due
to the mismatch in model complexity, small models may not
carry the whole knowledge embedded in large models [10].
In this paper, we present a novel “Collaborative learn-
ing with optimized Rule Extraction and Fusion” (CloREF)
framework, which empowers each individual participant to
choose a customized local learning model, aligning with the
characteristics of its local training data as well as its storage
and computing capabilities. Since different participants can
choose completely different local learning models such as
SVM, naive Bayesian classifier, and neural networks, it is
infeasible to perform model aggregation using traditional
methods such as parameter averaging. To address this chal-
lenge, CloREF uses decision rules, represented as linear
models, to uniformly represent the knowledge embedded
in heterogeneous learning models. Specifically, each par-
ticipating device firstly trains its learning model using its
local data, and then approximates the decision boundary of
the trained local model using a set of rules. Subsequently,
all participants work under the coordination of the central
server to jointly select the optimal set of rules as the global
model. Since the final global model is a set of decision rules,
CloREF can provide better interpretability in comparison
with neural network based schemes. The main contributions
of this paper are summarized as follows:
We propose a novel collaborative learning framework
CloREF, which is able to construct an effective global
prediction model by fusing the knowledge learned by
multiple participants using heterogeneous local models.
We propose a new rule extraction method that uses mul-
tiple linear models to approximate the decision boundary
of a trained local model. Experimental results show that
our method can approximate complex decision bound-
aries and mimic the behaviours of a variety of learning
models with fidelity larger than 0.95 in most tests.
We propose a rule fusion method that employs an evolu-
tionary optimization scheme to select the best rule set that
maximizes the model performance. Experimental results
show that our method can effectively select a small set
of rules to provide competitive and sometimes even bet-
ter performance in comparison with the best-performing
model trained centrally with the whole training dataset.
We compare CloREF with four state-of-the-art FL
schemes. Experimental results demonstrate that CloREF
has fast convergence and low communication cost. More-
over, it is robust to non-IID data and can achieve compet-
itive performance in comparison with baseline schemes.
The rest of the paper is organised as follows: Sect. 2
reviews the related works. Sect. 3 gives an overview of
CloREF. Sect. 4 and Sect. 5 present the key elements of
CloREF on rule extraction and fusion. Empirical evaluation
results are presented and analysed in Sect. 6. Finally, we
conclude the paper and discuss future work in Sect. 7.
2 RELATED WORK
2.1 Distributed learning and federated learning
Traditional distributed learning schemes typically focus on
distributing the load of training a learning model to multiple
processing nodes and are not much concerned about privacy
issues [11]. FL was introduced in [12] as a distributed
learning model, which embodies the principles of focused
collection and data minimization but mitigates the systemic
privacy risks and costs in centralized machine learning. The
“FedAvg” algorithm proposed in [12] has become the most
widely adopted FL baseline where a global model is trained
by periodically aggregating the parameters learned by the
participants based on stochastic gradient descent. Subse-
quent studies along this line have been conducted to address
the challenges faced by FL. Among them, the heterogene-
ity problem has recently received much attention. Existing
works on heterogeneity in FL can be categorized into two
classes: data heterogeneity and model heterogeneity.
Data heterogeneity, also known as the non-IID data
problem, means non-identical data distribution across par-
ticipants. Data augmentation [13] and client scheduling [14]
aim to reduce the difference in data distributions through
data sampling or adaptively selecting the participating
clients at each round according to their data distributions.
Regularization loss [15] constrains the parameter diver-
gence between the local model and the global model by
adding a regularization loss to the local cost function.
Federated meta-learning [14], federated multi-task learning
[16], and federated transfer learning [17] adopt techniques
to build personalized models for individual participants,
with the key idea of first training a global model and then
adapting it to each participant using its local data [18].
Model heterogeneity arises from the context where the
storage, computational, and communication capabilities are
different among participants [15]. Only a few works have
been conducted to achieve complete model heterogeneity
due to the difficulty in interpreting and merging hetero-
geneous models. We categorize the existing solutions for
this problem into the knowledge distillation (KD)-based
approach and the model splitting (MS)-based approach.
The KD-based approach improves the global model by
aggregating the class scores or logit outputs from local
models on a proxy dataset. FedMD [6] is a representative im-
plementation of this approach. In FedMD, each participant
computes the class scores on a public dataset and uploads
them to the central server. The central server broadcasts the
aggregated class scores as an augmented dataset for local
training. Then, local models learn on the augmented dataset
and local dataset iteratively for personalization. However,
the existence of a public dataset may compromise privacy
preservation and is impractical for weak participants with
limited storage. FedDF [19] relieves the storage pressure at
participants by performing distillation only at the server
side. Model heterogeneity is achieved by clustering the
local model architectures and training each type of model
individually at the central server. However, this method still
requires a public dataset. FedGen [20] is a good attempt
to achieve data-free KD-based heterogeneous FL. However,
it focuses more on data heterogeneity and still requires all
local models to employ the same architecture.
The MS-based approach splits the global model into
different sub-models with various complexity levels and
allocates them to each participant according to their system
resources. For a local model, HeteroFL [8] and FedRolex
[21] select a subnetwork of the global model based on a
shrinkage ratio. Consequently, different local models have
the same depth as the global model but different widths.
Fig. 1. The CloREF framework for collaborative learning with heterogeneous local models.
Then, the central server aggregates the overlapped weight
matrices to fuse the knowledge learned from diverse local
models. Intuitively, large models are more difficult to train
to convergence than small models because fewer of their
parameters are covered by the overlapped aggregation.
Furthermore, because of the mismatch in model
architectures, small models may not carry the whole knowl-
edge of large models. In addition, since local models are
sub-models of a global model, the architecture of local
models is constrained by that of the global model. Therefore,
compared with the KD-based approach, MS-based approach
removes the dependence on a proxy dataset but achieves a
lower degree of model heterogeneity.
Our scheme differs from existing solutions since it allows
participants to build truly heterogeneous learning mod-
els and integrates knowledge acquired by the participants
through rule extraction and fusion. Instead of transmitting
large amounts of model parameters, only the rules are
exchanged. Hence, our scheme incurs a lower communication cost
compared with the standard FL framework.
2.2 Rule extraction
Existing rule extraction techniques can be categorized into
white box explanation methods and black box explanation
methods. White box explanation methods extract rules by
directly interpreting the strengths of connection weights,
model architecture, and other parameters [22]. The main
drawback is that they cannot preserve privacy since both
the model architecture and the training data are exposed.
Black box explanation methods, such as LIME [23] and
LMENA [24], do not require internal information of the
machine learning models. They extract rules by observing
the effect of various inputs on the model outputs. However,
they can only interpret some local areas of a complex learn-
ing model but cannot mimic its overall behaviors.
Unlike existing rule extraction schemes, our method
aims to approximate the whole decision boundary gener-
ated by a learning model using multiple linear rules. Since
the global model in our method is a set of linear rules, our
method can explain how each decision is made by showing
the coefficients of the used linear rules, as these coefficients
can represent the importance of different features.
3 OVERVIEW OF CLOREF
We focus on binary classification because there are many
real-world binary classification tasks such as anomaly de-
tection and disease diagnosis. Moreover, a multi-class clas-
sification task can be decomposed into multiple binary clas-
sification problems using transformation techniques such as
One-vs-Rest and One-vs-One [25]. Each participant has a
private dataset to train a local learning model. Different
participants can use different local models such as Sup-
port Vector Machine (SVM), naive Bayesian classifier, and
Multi-Layer Perceptron (MLP). We assume all local models
can output prediction probabilities either directly or via
probability calibration, e.g., Platt scaling [26] or Isotonic
Regression [27] for SVMs, and Kearns-Mansour splitting
criterion [28] for decision trees. All participants agree on
the set of data features and the learning objective.
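As a concrete illustration of this calibration assumption, the following sketch shows one way to obtain calibrated probabilities from an SVM using scikit-learn's CalibratedClassifierCV; the model choice and settings here are illustrative, not prescribed by the paper.

```python
# A minimal sketch of probability calibration for a local SVM model,
# assuming scikit-learn; X_train/y_train/X_test are placeholders.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

base = SVC(kernel="rbf")  # plain SVC exposes decision values, not probabilities
clf = CalibratedClassifierCV(base, method="sigmoid", cv=5)  # Platt scaling
# clf.fit(X_train, y_train)
# proba = clf.predict_proba(X_test)[:, 1]   # P(y = 1 | x), used as H(x)
```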
As illustrated in Fig. 1, our rule-based collaborative
learning scheme contains three key components:
Local training and rule extraction: Each rule is rep-
resented using a linear model. Each participant firstly
trains its local model using its private dataset and then
extracts a set of rules to mimic the behaviors of the trained
local model with high fidelity. The linear model based
rule representation enables the fusion of rules extracted
from heterogeneous machine learning models. Moreover,
a rule has only a few parameters, which can reduce the
complexity on both rule extraction and rule exchange.
Rule exchange: Only parameters of the linear models are
exchanged between participants and the central server,
thereby preserving the privacy of the participants. The
small number of parameters to be exchanged also enables
quick rule exchange even in bandwidth-hungry scenarios.
Rule fusion and selection: The rules received from mul-
tiple participants can be complementary, redundant, or
even contradictory. A rule fusion and selection strategy
designed based on Population-Based Incremental Learn-
ing (PBIL) is employed at the server to select the best set of
rules that maximizes the performance of the global model.
The selected rule set then becomes the global model and
is sent back to the participants for classifying unseen samples.
Concept drift may occur in many applications of ma-
chine learning, which can degrade the performance of a
well-trained model. If the training data is updated and new
rules are generated at local participants, the above process
can be repeated to update the global model.
4 LOCAL TRAINING AND RULE EXTRACTION
The key motivation of our rule extraction method is that we
can mimic the behaviors of a trained learning model if we
can model its decision boundary. In this section, we present
a rule extraction method that can extract rules to mimic the
behaviors of a trained learning model with high fidelity. Fig.
2 shows a four-step routine employed for rule extraction.
These steps are elaborated as follows.
Fig. 2. The flowchart of rule extraction: a) synthesize samples, b) cluster, c) fit linear models, d) determine $\omega$.
4.1 Synthesizing boundary samples
Let $H(\mathbf{x})$ represent the hypothesis function of a trained
learning model, where $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ denotes a data
sample with $n$-dimensional features. Since it is not always feasible
to model $H(\mathbf{x})$ with a mathematical function, we detect the
sketch of $H(\mathbf{x})$ for a trained model using synthetic samples
generated with Particle Swarm Optimization (PSO) [29]. In
our method, each particle $\mathbf{x}$ represents a candidate boundary
sample whose value is randomly initialized in the search
space. Particles with $H(\mathbf{x}) = \theta$ are samples on the decision
boundary. Hence, we aim to find particles that satisfy

$$\min_{\mathbf{x} \in \chi} |H(\mathbf{x}) - \theta|, \quad \chi \subset \mathbb{R}^n, \tag{1}$$

where $\chi$ is the search space of the boundary samples.
We design a PSO-based boundary sample generation algorithm
(refer to [30] for pseudocode) that repeatedly executes
PSO to generate the required number of synthetic boundary
samples. In each execution of PSO, at most one synthetic
boundary sample can be generated. The reason for this
design is two-fold: firstly, each execution of PSO can quickly
converge because there are infinitely many boundary samples
and the fitness value is the only optimization constraint;
secondly, the generated synthetic samples will spread over
the entire decision boundary instead of huddling together,
owing to the random initialization of the swarm population
and the random parameters used in the velocity update.
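The following is a minimal numpy sketch of one PSO execution for Eq. (1), assuming H returns $P(y=1\,|\,\mathbf{x})$ for a batch of samples (e.g., a wrapper around a trained model's predict_proba); the swarm hyper-parameters are illustrative, not the exact settings of the paper.

```python
# Sketch of one PSO execution of the boundary-sample synthesis in Sect. 4.1.
import numpy as np

def synthesize_boundary_sample(H, bounds, theta=0.5, n_particles=20,
                               n_iters=50, w=0.7, c1=1.5, c2=1.5, tol=1e-3):
    """Run one PSO execution; return one boundary sample or None."""
    lo, hi = bounds                                  # each of shape (n_features,)
    dim = lo.shape[0]
    x = np.random.uniform(lo, hi, size=(n_particles, dim))   # positions
    v = np.zeros_like(x)                                      # velocities
    fit = lambda pts: np.abs(H(pts) - theta)                  # objective of Eq. (1)
    pbest, pbest_val = x.copy(), fit(x)
    g = pbest[np.argmin(pbest_val)]                           # global best
    for _ in range(n_iters):
        r1, r2 = np.random.rand(2, n_particles, 1)            # random velocity factors
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        val = fit(x)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        g = pbest[np.argmin(pbest_val)]
        if pbest_val.min() < tol:        # close enough to the decision boundary
            return g
    return None                          # this execution produced no sample
```

Repeated independent executions of this routine, each with a freshly randomized swarm, accumulate the $N_S$ boundary samples used in the next step.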
4.2 Clustering and linear model fitting
Since $H(\mathbf{x})$ can be complex, we propose to use multiple
linear models to fit the $H(\mathbf{x})$ of a trained learning model, where
each linear model is called a rule. In this step, the synthetic
boundary samples are first divided into groups, and then
the samples in each group are used to fit a linear model, as
illustrated in Fig. 2(b) and Fig. 2(c), where ellipses represent
clusters and solid lines represent the fitted linear models.
4.2.1 Rule representation
A rule is defined by a triple $L(\mathbf{x}) : \langle l(\mathbf{x}), \omega, \mathbf{c} \rangle$. $l(\mathbf{x})$ is
a linear function representing a segment of the decision boundary:

$$l(\mathbf{x}) : \mathbf{a}^T\mathbf{x} + b, \tag{2}$$

where $\mathbf{a}^T = [a_1, a_2, \ldots, a_n]$ is the coefficient vector and $b$
is a constant. $\omega \in \{-1, 1\}$ is a sign used to represent the
relationship between the prediction of $L(\mathbf{x})$ and the output
of $l(\mathbf{x})$. Specifically, $\omega \cdot l(\mathbf{x})$ is the final prediction generated
by rule $L(\mathbf{x})$. For example, if $\omega = -1$, the prediction given
by rule $L(\mathbf{x})$ will be the inverse of $l(\mathbf{x})$'s output. $\mathbf{c}$ is the
centroid of the cluster of samples used to fit $l(\mathbf{x})$. Since
multiple rules may be needed to mimic $H(\mathbf{x})$, $\mathbf{c}$ is used to
determine which rule should be applied to classify a given
sample. Suppose the set of rules extracted from a trained
model is $\mathcal{L}(\mathbf{x}) = \{L_i(\mathbf{x})\},\ i = 1, \cdots, m$. Given a test sample
$\mathbf{x}_j$, the rule used to classify $\mathbf{x}_j$, denoted by $L_k(\mathbf{x}_j)$, is the
one whose centroid has the minimum Euclidean distance to $\mathbf{x}_j$:

$$L_k(\mathbf{x}_j) = \arg\min_{L_i(\mathbf{x}) \in \mathcal{L}(\mathbf{x})} \|\mathbf{x}_j - \mathbf{c}_i\|_2. \tag{3}$$

As shown in Fig. 2(c), the decision boundary is approximated
with three linear models. Since the test sample $\mathbf{x}_t$ is
closest to $\mathbf{c}_2$, $l_2$ is used to predict $\mathbf{x}_t$. The final prediction is
$u(\omega \cdot l_k(\mathbf{x}_j))$, where $u(\cdot)$ is the step function given in Eq. (5).
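A minimal sketch of this rule representation and the nearest-centroid dispatch of Eqs. (2)-(3), with illustrative names:

```python
# Hypothetical Rule container and rule dispatch; not the authors' code.
import numpy as np
from dataclasses import dataclass

@dataclass
class Rule:
    a: np.ndarray   # coefficient vector of l(x) = a^T x + b
    b: float        # intercept
    omega: int      # sign in {-1, +1}
    c: np.ndarray   # centroid of the cluster used to fit l(x)

def predict(rules, x):
    """Classify x with the rule whose centroid is nearest (Eq. (3))."""
    k = np.argmin([np.linalg.norm(x - r.c) for r in rules])
    r = rules[k]
    return 1 if r.omega * (r.a @ x + r.b) >= 0 else 0   # u(omega * l_k(x))
```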
4.2.2 Rule Measurement
To measure how well $\mathcal{L}(\mathbf{x})$ mimics the behaviors of the
trained learning model, we define the fidelity of $\mathcal{L}(\mathbf{x})$ as [31]:

$$Fid(\mathcal{L}(\mathbf{x})) = \frac{1}{|X|}\sum_{\mathbf{x}_j \in X} 1 - |u(H(\mathbf{x}_j) - \theta) - u(\omega \cdot l_k(\mathbf{x}_j))|, \tag{4}$$

where $X$ is the set of training samples, $H(\mathbf{x}_j)$ is the probability
of $\mathbf{x}_j$ belonging to the positive class given by the trained
learning model, $\theta$ is the threshold on prediction probability
for distinguishing the two classes, and $u(\cdot)$ is a step function:

$$u(x) = \begin{cases} 1, & \text{if } x \geq 0; \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

According to Eq. (4), the higher the fidelity, the better the
extracted rules mimic the trained model.
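A short sketch of Eq. (4), reusing the hypothetical predict helper above; H is assumed to return $P(y=1\,|\,\mathbf{x})$ for a batch of samples:

```python
# Fidelity of a rule set against a trained model (Eq. (4)).
import numpy as np

def fidelity(rules, H, X, theta=0.5):
    model_labels = (H(X) >= theta).astype(int)               # u(H(x_j) - theta)
    rule_labels = np.array([predict(rules, x) for x in X])   # u(omega * l_k(x_j))
    return float(np.mean(model_labels == rule_labels))       # agreement rate
```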
4.2.3 Clustering optimization and linear model fitting
We use K-means to generate the initial clusters for the $N_S$
boundary samples. Since it is challenging to choose an appropriate
$K$, we initially generate a relatively large number
of clusters and then merge and split them to obtain the
optimal clusters based on the quality of the fitted linear
models. We set $K$ to $\lfloor N_S/n \rfloor$, where $N_S$ is the number of
boundary samples and $n$ is the feature dimension, since each
cluster needs at least $n$ samples to fit a linear model.
We use the fidelity defined in Eq. (4) and the $R^2$ score defined
below to measure the quality of a fitted linear model:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{j=1}^{t}(H(\mathbf{x}^S_j) - l(\mathbf{x}^S_j))^2}{\sum_{j=1}^{t}(H(\mathbf{x}^S_j) - \overline{H(\mathbf{x}^S)})^2}, \tag{6}$$

where $SS_{res}$ represents the sum of squares of residuals with
respect to the fitted values, and $SS_{tot}$ represents the sum of
squares with respect to the average value. $R^2$ ranges in $[0, 1]$;
the larger the value, the better the linear model fits.
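As one concrete realization of the per-cluster fitting step, a hyperplane can be fitted by total least squares, taking the direction of least variance as the normal vector; this is a sketch under that assumption, not necessarily the paper's exact fitting procedure:

```python
# Fit one rule's hyperplane l(x) = a^T x + b to a cluster of boundary
# samples via a total-least-squares (PCA) fit.
import numpy as np

def fit_hyperplane(samples):
    c = samples.mean(axis=0)                  # cluster centroid (the rule's c)
    _, _, vt = np.linalg.svd(samples - c)     # principal directions
    a = vt[-1]                                # least-variance direction = normal
    b = -float(a @ c)                         # so that l(c) = a^T c + b = 0
    return a, b, c
```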
The pseudocode for generating linear models is given
in Algo. 1. Two user-defined thresholds ($T_{split}$ and $T_{merge}$)
are used to control splitting and merging, respectively. If the
$R^2$ score for a cluster is lower than $T_{split}$, we use K-means
to split it into two clusters $CLU_{i1}$ and $CLU_{i2}$ (line 9). To
properly fit each cluster to a linear model, the number of
samples in a cluster must be no less than the number of
features ($n$). If both clusters have fewer than $n$ samples, the
original cluster is further tested based on fidelity (lines 10-11).
If only one of the two clusters has fewer than $n$ samples, that
cluster is treated as noise and discarded. Otherwise, the two
new clusters are fitted and their $R^2$ scores are calculated to
check whether they need to be further split (lines 14-18). For
testing with fidelity (lines 19-28), a cluster in $CLU_{fidelity\,test}$
is kept only when including it increases the fidelity. For merging
(lines 29-38), a cluster is merged with a neighboring cluster
if the merged cluster has an $R^2$ score no less than $T_{merge}$.
4.3 Determining the sign ω
Since each $l(\mathbf{x})$ is fitted without using the label information,
a sign $\omega$ needs to be associated with each $l(\mathbf{x})$ to indicate
which side is positive. To determine $\omega$ for a given $l_i(\mathbf{x})$, a
set of synthetic samples called probing samples is generated
subject to the following two constraints. (1) The probing samples
are distributed on the normal line of $l_i(\mathbf{x})$ that crosses
the centroid of the corresponding cluster used to fit $l_i(\mathbf{x})$.
This is because linear models may not fit the corners of the
decision boundary well. For example, $\mathbf{x}^{P+}_a$ and $\mathbf{x}^{P-}_a$ in
Fig. 2(d) are located at a corner of the decision boundary,
and using these two samples as probing samples may yield
an incorrect $\omega$. (2) Among all centroids, a probing sample
must be closest to the centroid of $l_i(\mathbf{x})$ to avoid
interference from other parts of the decision boundary. For
example, $\mathbf{x}^{P+}_{b2}$ should not be used as a probing sample for
$l_1(\mathbf{x})$. Let $d^{min}_i$ be the minimum Euclidean distance between
the centroid $\mathbf{c}_i$ of $l_i(\mathbf{x})$ and the other centroids. The probing
samples for $l_i(\mathbf{x})$ are then generated in a pairwise way by
repeating the following two steps: (1) randomly choose a
value $\beta$ from the range $(0, d^{min}_i/2)$ as the distance between
the probing sample and $\mathbf{c}_i$; (2) generate a pair of probing
samples $\langle \mathbf{x}^{P+}, \mathbf{x}^{P-} \rangle$ along the normal vector of $l_i(\mathbf{x})$
as follows: $\mathbf{x}^{P+} = \mathbf{c}_i + \beta\frac{\mathbf{a}}{\|\mathbf{a}\|}$, $\mathbf{x}^{P-} = \mathbf{c}_i - \beta\frac{\mathbf{a}}{\|\mathbf{a}\|}$, where
$\mathbf{a}$ is a normal vector of the hyperplane formed by $l_i(\mathbf{x})$.
Hence, $\mathbf{x}^{P+}$ and $\mathbf{x}^{P-}$ are located on different sides of the
hyperplane and have the same distance to $\mathbf{c}_i$, as illustrated
in Fig. 2(d). The sign $\omega$ determined by $\langle \mathbf{x}^{P+}, \mathbf{x}^{P-} \rangle$
is then calculated as follows:

$$\omega = \begin{cases} 1, & \text{if } (H(\mathbf{x}^{P+}) - \theta) \cdot l(\mathbf{x}^{P+}) > 0 \text{ and } (H(\mathbf{x}^{P-}) - \theta) \cdot l(\mathbf{x}^{P-}) > 0; \\ -1, & \text{if } (H(\mathbf{x}^{P+}) - \theta) \cdot l(\mathbf{x}^{P+}) < 0 \text{ and } (H(\mathbf{x}^{P-}) - \theta) \cdot l(\mathbf{x}^{P-}) < 0. \end{cases} \tag{7}$$

If the values of $\omega$ determined by more than 80% of
the probing samples are consistent, this result is accepted;
otherwise, the current set of probing samples is discarded
and a new set is generated. If $\omega$ cannot be determined after
generating the set of probing samples more than ten times,
the current $l(\mathbf{x})$ is considered to seriously deviate from the
decision boundary and is discarded.
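A sketch of the pairwise probing test of Eq. (7); here (a, b, c) denote a fitted hyperplane and its centroid as in the earlier sketch, d_min is the distance from c to the nearest other centroid, and H is assumed to return $P(y=1\,|\,\mathbf{x})$ for a single sample:

```python
# Determine the sign omega of one rule via probing-sample pairs (Eq. (7)).
import numpy as np

def determine_omega(H, a, b, c, d_min, theta=0.5, n_pairs=20, agree=0.8):
    n = a / np.linalg.norm(a)                 # unit normal of the hyperplane
    votes = []
    for _ in range(n_pairs):
        beta = np.random.uniform(0.0, d_min / 2)
        xp, xm = c + beta * n, c - beta * n   # one pair across the plane
        sp = (H(xp) - theta) * (a @ xp + b)
        sm = (H(xm) - theta) * (a @ xm + b)
        if sp > 0 and sm > 0:
            votes.append(1)                   # Eq. (7), first case
        elif sp < 0 and sm < 0:
            votes.append(-1)                  # Eq. (7), second case
    for w in (1, -1):
        if votes.count(w) > agree * n_pairs:  # >80% of pairs consistent
            return w
    return None   # caller regenerates probing samples or discards the rule
```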
Algorithm 1: Clustering and linear model fitting.
Input: S - synthesized boundary samples; T_split; T_merge; H(x)
Output: L_filter(x) - filtered linear models
 1: CLU = K-means(S, K), where K = ⌊N_S/n⌋;
 2: foreach CLU_i in CLU do
 3:   l_i(x) ← fit CLU_i;
 4:   if R²_i ≥ T_split then
 5:     add CLU_i to CLU_filter;
 6:     remove CLU_i from CLU;
 7:     add l_i(x) to L_filter(x);
 8:   else
 9:     {CLU_i1, CLU_i2} ← K-means(k = 2, CLU_i);
10:     if |CLU_i1| < n and |CLU_i2| < n then
11:       add CLU_i to CLU_fidelity-test;
12:       remove CLU_i from CLU;
13:     else
14:       foreach CLU_j in {CLU_i1, CLU_i2} do
15:         if |CLU_j| ≥ n then
16:           add CLU_j to CLU;
17:         else
18:           discard CLU_j;
19: Fid_filter ← evaluate L_filter(x) with H(x);
20: foreach CLU_i in CLU_fidelity-test do
21:   l_i(x) ← fit CLU_i;
22:   L(x) ← add l_i(x) to L_filter;
23:   Fid ← evaluate L(x) with H(x);
24:   if Fid > Fid_filter then
25:     add CLU_i to CLU_filter;
26:     L_filter(x) = L(x);
27:     Fid_filter = Fid;
28:   remove CLU_i from CLU_fidelity-test;
29: p = 0;
30: while p < |CLU_filter| do
31:   CLU_q = nearest cluster of CLU_p;
32:   l_pq(x) ← fit (CLU_p ∪ CLU_q);
33:   R²_pq ← evaluate l_pq(x);
34:   if R²_pq ≥ T_merge then
35:     CLU_filter[p] = CLU_p ∪ CLU_q;
36:     remove CLU_q from CLU_filter;
37:     add l_pq(x) to L_filter(x);
38:   else p = p + 1;
5 RULE FUSION AND SELECTION
Since the datasets used to train the local learning models can
have a similar distribution with some degrees of diversity,
there might be many redundant or even contradictory rules.
Hence, simply concatenating the rules generated by all local
models will not be very effective in providing the best
performance. To address this issue, we propose a PBIL-
based Rule Fusion and Selection (PBIL-RFS) scheme to select
a set of rules that maximizes the Area Under the ROC Curve
(AUC). In PBIL-RFS, the best set of rules is calculated itera-
tively and in each iteration the server needs to communicate
with the participants in order to compute AUC. Therefore,
reducing the number of rules before running PBIL-RFS can
significantly reduce the communication overhead and speed
up the convergence of PBIL-RFS. We also observe that
reducing the number of rules is very effective in avoiding
overfitting. For these reasons, we further design two methods
to simplify the rules. One method is to merge the redundant
rules before PBIL-RFS based on a new rule distance metric,
and the other method is to use a modified fitness function
to select a small number of rules to maximize classification
performance and minimize redundancy at the same time.
5.1 PBIL-RFS
In PBIL-RFS, the rules received from local participants are
pooled and selected at the server, but the evaluation of
the selected rules is done at participants in a distributed
manner since only the participants have training data. The
whole process of PBIL-RFS is illustrated in solid boxes in
Fig. 3. Initially, the server merges the rules received from
all participants, shuffles them, and then sends the whole
set of rules to all participants. However, in the following
optimization process, only the gene strings instead of the
selected rules need to be sent to local participants, which
can significantly reduce the amount of communication.
Coding: We encode rule selection using a binary “gene”
code, $\mathbf{v} = [v_1, v_2, \cdots, v_{N_R}]$, with $N_R$ being the total number
of rules. For each bit, $v_i = 1$ indicates that the $i$-th rule is
tentatively chosen, and $v_i = 0$ means it is removed.

Population generation and evaluation: PBIL keeps a probability
vector $\boldsymbol{\pi} = [\pi_1, \pi_2, \cdots, \pi_{N_R}]$. To generate the genes, $v_i$
is assigned “1” with probability $\pi_i$ and “0” with probability
$1 - \pi_i$. Initially, each $\pi_i$ is set to 0.5. In each iteration, a
population of $N_{gene}$ genes is generated based on $\boldsymbol{\pi}$ and
sent to all participants. Each participant first decodes these
genes to generate $N_{gene}$ rule sets to be evaluated. The fitness
values are AUC scores calculated by using the rule sets to
classify the local training data. The AUCs are then sent back
to the server and aggregated separately for each gene.

Update: The best solution up to the current generation, $\mathbf{v}^b$,
is used to update the probability vector as follows:

$$\boldsymbol{\pi}^{t+1} = \boldsymbol{\pi}^t(1 - \delta) + \delta \mathbf{v}^b, \tag{8}$$

where $\pi^t_i$ is the probability of the $t$-th generation in bit
position $i$, $v^b_i$ is the value of the $i$-th position in the best
solution, and $\delta \in (0, 1)$ is the learning rate. To help preserve
diversity, mutation is performed on the probability vector
as follows:

$$\pi^{t+1}_i = \pi^t_i(1 - \mu) + r\mu, \tag{9}$$

where $\mu$ controls the extent of mutation, and $r$ takes 0
or 1 randomly. To avoid losing the optimum, the current
optimum is carried over into the next generation.

The generation, evaluation, and update procedures are
repeated until the disparity in fitness values between two
consecutive generations is smaller than a threshold $P$.
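A minimal sketch of this loop, with `evaluate` standing in for the distributed per-gene AUC aggregation across participants; the stopping rule follows the fitness-disparity threshold $P$ described above:

```python
# PBIL-style rule selection loop (Eqs. (8)-(9)); a sketch, not the
# authors' implementation.
import numpy as np

def pbil_select(n_rules, evaluate, n_genes=20, delta=0.02, mu=0.2, P=1e-4,
                max_gen=500):
    pi = np.full(n_rules, 0.5)                     # probability vector
    best_v, best_f, prev_f = None, -np.inf, None
    for _ in range(max_gen):
        pop = (np.random.rand(n_genes, n_rules) < pi).astype(int)
        if best_v is not None:
            pop[0] = best_v                        # keep the current optimum
        fits = np.array([evaluate(v) for v in pop])
        g = int(np.argmax(fits))
        if fits[g] > best_f:
            best_v, best_f = pop[g].copy(), fits[g]
        pi = pi * (1 - delta) + delta * best_v     # Eq. (8): pull toward best
        r = np.random.randint(0, 2, n_rules)
        pi = pi * (1 - mu) + r * mu                # Eq. (9): mutation
        if prev_f is not None and abs(best_f - prev_f) < P:
            break                                  # fitness has converged
        prev_f = best_f
    return best_v
```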
5.2 Merging rules before PBIL-RFS
To determine which rules can be merged, we need a metric
to measure the similarity of two rules. The $R^2$ score used in
Algo. 1 is no longer suitable for this task, because it relies on
data samples, which are unavailable at the server for privacy
reasons. Hence, we propose a new distance metric,
called “rule distance”, to directly measure the similarity
between two rules without using any data sample.
Fig. 3. PBIL-based rule fusion and selection. The subfigure in the upper left is an example of the coding and decoding scheme.
For any two rules $L_i(\mathbf{x}) : \langle l_i(\mathbf{x}), \omega_i, \mathbf{c}_i \rangle$ and
$L_j(\mathbf{x}) : \langle l_j(\mathbf{x}), \omega_j, \mathbf{c}_j \rangle$, the distance between them, denoted by
$RD(L_i(\mathbf{x}), L_j(\mathbf{x}))$, is defined as follows:

$$RD(L_i(\mathbf{x}), L_j(\mathbf{x})) = CD(l_i(\mathbf{x}), l_j(\mathbf{x})) + \|\mathbf{c}_i - \mathbf{c}_j\|, \tag{10}$$

where $CD(l_i(\mathbf{x}), l_j(\mathbf{x}))$ represents the Cosine distance between
the elements $l_i(\mathbf{x}) : \mathbf{a}_i^T\mathbf{x} + b_i$ and $l_j(\mathbf{x}) : \mathbf{a}_j^T\mathbf{x} + b_j$,
defined as follows:

$$CD(l_i(\mathbf{x}), l_j(\mathbf{x})) = 2\left(1 - \frac{\langle \mathbf{a}_i, \mathbf{a}_j \rangle}{\|\mathbf{a}_i\|\,\|\mathbf{a}_j\|}\right), \tag{11}$$

where $\langle \cdot \rangle$ represents the inner product and $\|\cdot\|$ denotes
the Euclidean norm. It measures the cosine of the angle
between the two vectors $\mathbf{a}_i$ and $\mathbf{a}_j$ and determines whether
the two vectors point in roughly the same direction.
Note that the distance between $b_i$ and $b_j$ is not considered
because its information is carried by $\mathbf{c}_i$ and $\mathbf{c}_j$.

To ensure both types of distances have the same weight,
we normalize each distance to $[0, 1]$ using Min-Max normalization
individually before summing them. The pseudocode
for merging rules is given in Algo. 2. Two rules
$L_i(\mathbf{x})$ and $L_j(\mathbf{x})$ will be merged if (1) $\omega_i = \omega_j$, and (2)
$RD(L_i(\mathbf{x}), L_j(\mathbf{x}))$ is no larger than a pre-defined threshold
$\theta_m$. The new rule generated by merging, denoted by
$L_m : \langle l_m(\mathbf{x}), \omega_m, \mathbf{c}_m \rangle$, is defined as follows:

$$l_m(\mathbf{x}) = \frac{1}{2}(\mathbf{a}_i + \mathbf{a}_j)^T\mathbf{x} + \frac{1}{2}(b_i + b_j), \tag{12}$$

$$\mathbf{c}_m = \frac{1}{2}(\mathbf{c}_i + \mathbf{c}_j), \tag{13}$$

$$\omega_m = \omega_i = \omega_j. \tag{14}$$
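A sketch of Eqs. (10)-(14), reusing the hypothetical Rule dataclass from Sect. 4 and omitting, for brevity, the min-max normalization of the two distance terms over the rule pool:

```python
# Rule distance and pairwise merge; a sketch, not the authors' code.
import numpy as np

def rule_distance(ri, rj):
    cos = (ri.a @ rj.a) / (np.linalg.norm(ri.a) * np.linalg.norm(rj.a))
    cd = 2 * (1 - cos)                          # Eq. (11): cosine distance
    return cd + np.linalg.norm(ri.c - rj.c)     # Eq. (10): add centroid distance

def merge_rules(ri, rj):
    assert ri.omega == rj.omega                 # Eq. (14): same sign required
    return Rule(a=(ri.a + rj.a) / 2,            # Eq. (12): average coefficients
                b=(ri.b + rj.b) / 2,
                omega=ri.omega,
                c=(ri.c + rj.c) / 2)            # Eq. (13): average centroids
```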
5.3 Regularizing rule selection
If two rule sets have the same AUC value at the current
generation of PBIL, it is reasonable to give a higher fitness
value to the one with a smaller number of rules. Besides
maximizing the performance of the selected rule set mea-
sured by the average AUC value across all participants, it is
desired to reduce the complexity of the global model, hence
avoiding potential overfitting. We propose the following
regularized fitness function to evaluate a rule set $S$ in PBIL:

$$\Phi(S) = \alpha \cdot \frac{1}{N_P}\sum_{i=1}^{N_P} AUC_i(S) - (1 - \alpha)\frac{|S|}{N_R}, \tag{15}$$

where $N_P$ is the number of participants, $N_R$ is the total
number of rules, and $\alpha \in [0, 1]$ is a weighting coefficient.

Algorithm 2: Algorithm for merging rules.
Input: L - set of rules to merge; θ_m - user-defined merging threshold
Output: L - the set of rules after merging
 1: i = 0;
 2: while i < |L| do
 3:   RD_ik ← RD(L_i, L_k), ∀ L_k ∈ L, using Eq. (10);
 4:   find the best-matching rule L_j: j = argmin_{k≠i} RD_ik;
 5:   if RD_ij ≤ θ_m and ω_i == ω_j then
 6:     L_m ← merge L_i and L_j using Eqs. (12)-(14);
 7:     L_i ← L_m;
 8:     remove L_j from L;
 9:   else
10:     i = i + 1;
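A one-function sketch of Eq. (15); auc_scores holds the AUCs reported by the $N_P$ participants for one gene, and n_selected is $|S|$:

```python
# Regularized fitness of a selected rule set (Eq. (15)).
import numpy as np

def fitness(auc_scores, n_selected, n_rules_total, alpha=0.9):
    return alpha * float(np.mean(auc_scores)) \
           - (1 - alpha) * n_selected / n_rules_total
```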
5.4 Speeding-up local rule evaluation
In each iteration of PBIL-RFS, the local datasets are used to
evaluate the selected rules rather than to train the local models.
This implies that, by using only representative samples,
it is possible to achieve the same level of performance as
using all samples, while reducing the running time of rule
evaluation and improving communication efficiency. Hence,
we under-sample each local dataset to generate a set of
representative samples for rule evaluation.
Samples close to the decision boundaries are commonly
considered more important than those far away from them.
Therefore, we assign higher selection probabilities to samples
that are closer to the decision boundaries. More specifically, we
measure the weight of each sample using the information
obtained from its K-nearest neighbors. Assuming
$X^{NN} = \{\mathbf{x}^{NN}_1, \cdots, \mathbf{x}^{NN}_K\}$ is the set of K-nearest
neighbors of data sample $\mathbf{x}_i$ and $Y^{NN} = \{y^{NN}_1, \ldots, y^{NN}_K\}$ is
the corresponding label set of $X^{NN}$, the weight $w_i$ assigned
to the $i$-th sample is calculated as follows:

$$w_i = \frac{|\{\mathbf{x}^{NN}_j \mid \mathbf{x}^{NN}_j \in X^{NN},\ y_i \neq y^{NN}_j\}|}{K}. \tag{16}$$
The representative samples are selected by choosing the
samples with larger weights. Since selecting only the
samples near the decision boundary may not sketch the
distribution of a dataset well, we also randomly select a small
portion (e.g., 10%) of samples from the remaining dataset
and add them to the previously selected samples.
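A sketch of the weighting step of Eq. (16) using scikit-learn's NearestNeighbors; the extra neighbor accounts for each query point returning itself:

```python
# Boundary-aware sample weights for undersampling (Eq. (16)).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def boundary_weights(X, y, k=5):
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is each point itself
    neighbor_labels = y[idx[:, 1:]]        # labels of the K nearest neighbors
    return (neighbor_labels != y[:, None]).mean(axis=1)   # disagreement rate
```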
6 EVALUATION
In this section, we evaluate the effectiveness and efficiency
of CloREF in comparison with several state-of-the-art so-
lutions through extensive experiments. We first evaluate
CloREF in terms of the quality of rule extraction, the effectiveness
of rule fusion and selection, convergence, and communication
cost using 8 two-class datasets. After that, we demonstrate
the robustness of CloREF on non-IID datasets. Finally, we
use anomaly detection in an IoT-based smart home as a case
study to show the practicability of CloREF.
6.1 Experiment setup
Datasets and models: We evaluate CloREF on the following
8 public two-class datasets provided in the KEEL [32]
and UCI [33] repositories: wisconsin, glass-0-1-2-3_vs_4-5-6,
segment0, yeast1, page-blocks0, vehicle1, pimaImb, and KDDCup99.
Five learning models, namely SVM, naive Bayesian
classifier (NB), MLP, logistic regression classifier (LR), and
logistic regression classifier fitted via SGD (SGD), are used
as local models. Some models are tested with different
configurations, e.g., SVM with different kernels and MLP
with different numbers of hidden layers and neurons. All
models are implemented using scikit-learn [34].
Methodology and metrics: For each experiment, 5-fold
cross validation is used. The training dataset is further split
into sub-datasets, each used as the private dataset of one
participant. Specifically, the dataset is first shuffled and
then randomly split into subsets with the same number of
samples in each. To ensure each participant has enough
samples to train its learning model, the number of participants
varies according to the size of the original dataset. The type of
local model is randomly selected from
a predefined model list. We compare CloREF with the
following baseline schemes:
FedAvg [12]: Used as a baseline for federated learning with
homogeneous settings.
FedMD [6]: The state-of-the-art FL algorithm with hetero-
geneous settings. For a fair comparison, we replace the
CNN with a fully connected neural network in the source-
code provided by the authors. One subset of the training
dataset is used as the public dataset to be shared with all
participants. Hence, the number of participants in FedMD
is always one less than that in other compared schemes.
HeteroFL [8]: Each participant trains a static subnetwork
of the global deep neural network by selecting a subset of
neurons in each hidden layer based on a shrinkage ratio.
FedRolex [21]: Different from HeteroFL, it employs a rolling
sub-model extraction scheme to guarantee that different
parts of the global model can be evenly trained.
Centralized Model: The best-performing learning model
centrally trained with all training data.
All Rules: An ablated version of CloREF without rule
fusion and selection at the server.
The following four metrics are used to evaluate CloREF in
comparison with the baselines: fidelity (defined in Eq. (4)),
G-means (GM), accuracy (ACC), and AUC. All experiments
were run on a server with 40 CPU cores and 64GB RAM.
The CPU model is Intel(R) Xeon(R) E5-2630 v4 @ 2.20GHz.
Parameter setting: Table 1 gives the setting of key param-
eters. The local models of FedMD are decided by random
selection from a pre-defined neural network list. HeteroFL
and FedRolex use the same architecture of global models
as FedAvg (given in Table 3). The local models of HeteroFL
and FedRolex are extracted from the global model based
on the given model shrinkage ratios. We follow the original
papers to set the model shrinkage ratios for HeteroFL and
FedRolex, only changing the multiplying factor from 0.5 to
0.8 to ensure every sub-model has enough neurons since our
global model is simpler.
Fig. 4. Fidelity of extracted rules at local models on five datasets (wisconsin, glass-0-1-2-3_vs_4-5-6, segment0, pimaImb, vehicle1); the solid and dashed lines show the accuracy of the trained learning model and of the extracted rules, respectively. Models are color-coded by their corresponding algorithms.
TABLE 1
The setting of key parameters.

Methods              | Parameters                    | Values
PSO                  | θ, N_S, N_P, N_G              | 0.5, 20n, 20, 50
Clustering           | T_split, T_merge              | 0.75, 0.95
PBIL-RFS             | N_gene, δ, µ, P, θ_m, α       | 20, 0.02, 0.2, 1.0e-4, 0.02, 0.9
Common FL settings   | communication rounds          | 100
                     | B, E, LR                      | 10, 5, 0.01
                     | loss                          | binary crossentropy
                     | activation functions          | (relu, sigmoid)
FedMD                | pre-defined neural networks   | [(20, 20, 1), (10, 10, 10, 1), (10, 10, 1), (5, 5, 5, 1), (5, 5, 1)]
HeteroFL, FedRolex   | model shrinkage ratio         | {1, 0.8, 0.64, 0.512, 0.40}
6.2 Results of rule extraction
We use fidelity and accuracy to evaluate how well the ex-
tracted rules can mimic the behaviors of the trained learning
model. Fig. 4 shows the results for five datasets (results from
other datasets show similar trends). The y-axis shows the
fidelity. The solid line and the dashed line represent the
accuracy of the trained learning model and the extracted
rules, respectively. One key observation is the achieved
high fidelity (>0.95 for most of the tests), which means
the extracted rules are able to well mimic the behaviors
of different machine learning models. Our method achieves
very high and stable fidelity for SGD and LR, both being
linear learning models. For the non-linear models such as
NB and MLP, our method also achieves very good fidelity.
It can also be seen that the fidelity of the extracted rules in
mimicking the same learning model varies across datasets
due to the difference in the complexity of the decision
boundary. Another key observation is that there is no close
relationship between fidelity and accuracy, because fidelity
can still be very high if both the trained learning model
and the extracted rules give the wrong classification results.
For example, the fidelity on vehi1 and pima is much higher
than the achieved accuracy due to the low separability of
these two datasets. The accuracy of the extracted rules is
higher than that of the trained learning model on se0. One
possible reason is that se0 is a linearly separable dataset and
some complex models may result in overfitting. This can be
verified by the results of the SGD model. SGD is a simple
linear model, but it achieves better performance than that of
the non-linear models on se0. Since the number of extracted
rules for this dataset is only two or three, the decision
boundary would be simplified and hence the overfitting
problem could be avoided to some extent.
6.3 Results of fusion and selection
In Sect. 6.3.1 and Sect. 6.3.2, we perform a series of ablation
studies to validate the effectiveness of the key components
of PBIL-RFS. In Sect. 6.3.3, the overall performance of PBIL-
RFS is evaluated by comparing it with four baselines using
two synthetic datasets and eight real-world datasets.
6.3.1 Benefits of merging rules
Fig. 5 shows the performance of our rule fusion and selection
scheme on the vehi1, pima, and yea1 datasets with
different settings of the rule merging threshold $\theta_m$ in Algo. 2
and the weighting parameter $\alpha$ in Eq. (15). The results
in the first column show that, as expected, the average number
of rules after merging decreases as $\theta_m$ increases.
When $\theta_m \leq 0.02$, merging rules before PBIL-RFS does not
reduce AUC, as can be seen from the results given in the
second column. When $\theta_m$ is further increased, the achieved
AUC drops, especially when $\theta_m \geq 0.1$. This is because a
larger $\theta_m$ may over-merge rules, resulting in a final rule set
with poor fidelity to the decision boundary of the original
local model. It is worth noting that, when $\alpha$ is decreased
to give a smaller weight to AUC in the fitness function,
AUC drops only slightly. However, the number of rules after
PBIL-RFS is significantly reduced when $\alpha$ is decreased from
1 to 0.7, as shown in the third column. This demonstrates
the efficiency of PBIL-RFS in removing redundant rules.
The last column shows the relative time cost compared to
the baseline where both rule merging and regularization are
disabled (i.e., $\theta_m = 0$ and $\alpha = 1$). It can be observed that
the relative time cost decreases with the decrease of $\alpha$ and
the increase of $\theta_m$, because both reduce the number of rules.
6.3.2 Benefits of speeding-up local rule evaluation
Table 2 shows the relative results of fusing rules using
undersampled data compared to using all data samples,
with means and standard deviations computed over 5 independent
runs. 0.4B denotes that 40 percent of all samples are
sampled according to their weights calculated by Eq. (16),
whereas 0.1R denotes that 10 percent of all samples are
randomly sampled from the samples left by 0.4B. The
sampling ratio for each original dataset is determined according
to its size, with a smaller sampling ratio for large datasets
(e.g., se0, pa0, kdd).
Fig. 5. The performance with different values of $\alpha$ and $\theta_m$ on (a) vehicle1, (b) pimaImb, and (c) yeast1 (per column: number of rules after merging, AUC, number of rules after PBIL-RFS, and relative time cost). Different colors of lines represent different values of $\theta_m$.
TABLE 2
Comparison results of fusing rules using undersampled samples versus all samples.

Dataset | Sampling Ratio | AUC_US/AUC_All | T_US/T_All
wisc    | 0.4B, 0.1R     | 1.002 (±0.01)  | 0.528 (±0.07)
glas    | 0.4B, 0.1R     | 0.997 (±0.05)  | 0.529 (±0.06)
vehi1   | 0.4B, 0.1R     | 1.019 (±0.03)  | 0.428 (±0.02)
pima    | 0.4B, 0.1R     | 1.029 (±0.04)  | 0.526 (±0.10)
yea1    | 0.4B, 0.1R     | 1.000 (±0.00)  | 0.516 (±0.03)
se0     | 0.2B, 0.05R    | 1.005 (±0.01)  | 0.250 (±0.04)
pa0     | 0.2B, 0.05R    | 1.021 (±0.01)  | 0.258 (±0.03)
kdd     | 0.04B, 0.01R   | 1.003 (±0.00)  | 0.048 (±0.01)
We use $AUC_{US}/AUC_{All}$ to represent the relative AUC value,
where $AUC_{US}$ and $AUC_{All}$ are the AUC values achieved
using undersampled samples and all samples, respectively.
Similarly, we use $T_{US}/T_{All}$ to represent the relative running
time of the whole rule fusion process using undersampled
samples compared to using all samples.
Two interesting points can be observed from Table 2.
Firstly, all the relative AUC values are greater than or equal
to one except for the glas dataset. This demonstrates that using
an appropriate proportion of undersampled samples to fuse
rules does not reduce performance and can even improve it
to some extent. One potential reason is that using too many
samples to fuse rules might result in over-complicated decision
boundaries and therefore reduce the ability to generalize.
Secondly, the reduction in running time is almost proportional
to the undersampling ratio.
6.3.3 Overall performance of PBIL-RFS
Fig. 6. Decision boundaries generated by four learning models (RBF SVM, Random Forest, Multi-Layer Perceptron, Naive Bayes) and a set of fused rules (CloREF) in two-dimensional space on two synthetic datasets.
To show how well the rules after fusion and selection ap-
proximate the decision boundaries, Fig. 6 gives the decision
boundaries generated by CloREF and four other learning
models for two 2D synthesized datasets. The first dataset
circles, generated by Scikit-learn’s make_circles utility, has
a spherical decision boundary. The second dataset, spirals,
is generated using make_moons in Scikit-learn. The circles
dataset is split into five subsets and the spirals dataset is split
into four subsets. For CloREF, each local model is trained
with one subset of the whole training dataset. For all the
other four learning models, they are trained using the whole
training dataset. Owing to the advantage of using multiple
linear models, it can be seen that CloREF is capable of
approximating complex decision boundaries. For the spirals
dataset, the accuracy achieved by CloREF (value given at
the right bottom of each sub-figure) is better than that of
other machine learning models, which demonstrates that
CloREF is able to fuse the knowledge learned by different
machine learning models and generate effective decision
boundaries. For circles, the accuracy achieved by CloREF is
slightly lower than that of Random Forest and Naive Bayes
but higher than that of RBF SVM and MLP. This is because
the boundary samples in the circles dataset are more difficult
to separate as some of them overlap. As we pursue
compact rules in the fusion process, some low-quality
rules are merged or discarded, resulting in a slight
drop in accuracy. For spirals, even though the structure of
its decision boundaries is much more complex, CloREF achieves
the best accuracy as there are clear boundaries between the
samples in the two classes. In addition, it can be seen from
Fig. 6 that, even though all models are trained using the
same dataset, the decision boundaries generated by different
learning models and the resulting accuracies are different.
This also demonstrates the benefits of using heterogeneous
learning models at participants to increase diversity.

TABLE 3
The average results of collaborative learning on eight real-world datasets in full feature space. #Rules represents the number of rules. Average performance and standard deviation (in brackets) are listed for two metrics, accuracy and AUC, over the columns: Average Participants, All Rules, CloREF, Centralized Model, FedMD, FedAvg, HeteroFL, FedRolex.

yea1 (5): #Rules = 1.0 (±0.0); Centralized Model = SGD; FL network = fully con(10,10)
  Accuracy: 0.701 (±0.01) | 0.701 (±0.02) | 0.683 (±0.04) | 0.720 (±0.04) | 0.732 (±0.01) | 0.736 (±0.03) | 0.733 (±0.04) | 0.736 (±0.03)
  AUC:      0.673 (±0.01) | 0.642 (±0.06) | 0.699 (±0.03) | 0.677 (±0.05) | 0.642 (±0.02) | 0.641 (±0.03) | 0.652 (±0.05) | 0.649 (±0.06)

glas (4): #Rules = 13.2 (±4.02); Centralized Model = SVM (rbf); FL network = fully con(10,10,10)
  Accuracy: 0.916 (±0.02) | 0.916 (±0.05) | 0.949 (±0.02) | 0.944 (±0.04) | 0.882 (±0.03) | 0.925 (±0.03) | 0.925 (±0.05) | 0.950 (±0.02)
  AUC:      0.882 (±0.01) | 0.891 (±0.07) | 0.926 (±0.03) | 0.923 (±0.09) | 0.838 (±0.06) | 0.904 (±0.04) | 0.903 (±0.05) | 0.933 (±0.03)

pa0 (19): #Rules = 21.4 (±6.74); Centralized Model = MLP (10,10); FL network = fully con(10,10)
  Accuracy: 0.922 (±0.02) | 0.921 (±0.01) | 0.934 (±0.01) | 0.951 (±0.02) | 0.928 (±0.01) | 0.953 (±0.00) | 0.946 (±0.01) | 0.944 (±0.01)
  AUC:      0.745 (±0.03) | 0.705 (±0.07) | 0.816 (±0.03) | 0.908 (±0.02) | 0.747 (±0.03) | 0.853 (±0.02) | 0.788 (±0.03) | 0.746 (±0.03)

se0 (14): #Rules = 3.8 (±2.14); Centralized Model = MLP (10,10,10); FL network = fully con(10,10,10)
  Accuracy: 0.971 (±0.01) | 0.912 (±0.06) | 0.991 (±0.01) | 0.994 (±0.00) | 0.975 (±0.01) | 0.997 (±0.00) | 0.997 (±0.00) | 0.998 (±0.00)
  AUC:      0.923 (±0.05) | 0.933 (±0.03) | 0.986 (±0.01) | 0.992 (±0.00) | 0.932 (±0.02) | 0.993 (±0.01) | 0.994 (±0.01) | 0.996 (±0.00)

veh1 (5): #Rules = 13.6 (±4.22); Centralized Model = MLP (10,10,10); FL network = fully con(10,10,10)
  Accuracy: 0.727 (±0.02) | 0.739 (±0.03) | 0.746 (±0.05) | 0.804 (±0.05) | 0.725 (±0.02) | 0.782 (±0.03) | 0.782 (±0.04) | 0.792 (±0.05)
  AUC:      0.626 (±0.03) | 0.640 (±0.04) | 0.705 (±0.04) | 0.669 (±0.12) | 0.570 (±0.06) | 0.641 (±0.12) | 0.629 (±0.12) | 0.625 (±0.11)

wisc (5): #Rules = 16 (±0.80); Centralized Model = MLP (10,10); FL network = fully con(10,10)
  Accuracy: 0.966 (±0.01) | 0.961 (±0.01) | 0.970 (±0.01) | 0.968 (±0.01) | 0.954 (±0.01) | 0.967 (±0.01) | 0.969 (±0.01) | 0.970 (±0.01)
  AUC:      0.968 (±0.01) | 0.960 (±0.01) | 0.972 (±0.01) | 0.970 (±0.01) | 0.952 (±0.01) | 0.965 (±0.01) | 0.969 (±0.01) | 0.970 (±0.01)

pima (5): #Rules = 21.2 (±6.62); Centralized Model = MLP (10,10); FL network = fully con(10,10)
  Accuracy: 0.706 (±0.02) | 0.707 (±0.06) | 0.738 (±0.03) | 0.757 (±0.02) | 0.700 (±0.01) | 0.753 (±0.03) | 0.733 (±0.03) | 0.747 (±0.04)
  AUC:      0.684 (±0.02) | 0.680 (±0.06) | 0.720 (±0.03) | 0.723 (±0.04) | 0.676 (±0.03) | 0.718 (±0.06) | 0.721 (±0.04) | 0.715 (±0.09)

kdd (14): #Rules = 53.2 (±12.27); Centralized Model = SVM (poly); FL network = fully con(5,5,5)
  Accuracy: 0.932 (±0.16) | 0.675 (±0.31) | 0.985 (±0.01) | 0.999 (±0.00) | 0.975 (±0.02) | 0.999 (±0.00) | 0.987 (±0.02) | 0.995 (±0.00)
  AUC:      0.939 (±0.15) | 0.724 (±0.31) | 0.984 (±0.02) | 0.998 (±0.00) | 0.984 (±0.02) | 0.998 (±0.00) | 0.973 (±0.05) | 0.993 (±0.00)
Table 3 compares CloREF with seven baselines on the
eight datasets using 5-fold cross validation. FedAvg is also
included to show the performance of CloREF
compared with homogeneous FL schemes. The number of
participants for each dataset is given in the round brackets
following the dataset name. The values for the performance
metrics are the means over 5 independent runs with stan-
dard deviations given in the following round brackets. The
Average Participants column shows the average values over
all participants. We have the following observations:
• The performance of CloREF is better than the average performance of the individual participants on all datasets, which means our fusion and selection strategy can effectively fuse the knowledge learned by different participants. Hence, each participant can benefit from knowledge that it cannot learn from its local data.
• CloREF outperforms All Rules on all datasets, indicating that PBIL-RFS can effectively remove contradictory rules.
• All the results of CloREF are better than those of FedMD. Moreover, this performance is achieved without relying on a public shared dataset as in FedMD.
• We pay more attention to the AUC score in this analysis, as it is the optimization objective of CloREF. Compared with HeteroFL and FedRolex, CloREF achieves better AUC on 5 of the 8 datasets. Although FedRolex shows an advantage in accuracy, its communication cost is significantly higher than that of CloREF, as shown in Table 4. In addition, the standard deviation of the results for CloREF is smaller than that of the other FL models, in particular on the veh1 dataset, indicating the robustness of CloREF. This is because, unlike HeteroFL and FedRolex, which restrict local models to sub-models of the global model with the same depth, CloREF allows each local model to be whatever type of model best suits its local data. Moreover, the global model of CloREF is more lightweight, yet offers comparable performance and the advantage of interpretability over the other FL models. It is worth mentioning that the AUC achieved by CloREF is very close to that of FedAvg, showing that CloREF does not sacrifice much performance in pursuit of model heterogeneity.
• CloREF gains comparable, and sometimes (on glas and wisc) even slightly better, performance than the Centralized Model. The main reason is that CloREF can better mine knowledge from the local data through model heterogeneity, and can effectively fuse knowledge and avoid overfitting through PBIL-RFS.
Fig. 7. Convergence on yea1: (a) PSO-based boundary sample generation; (b) PBIL-based rule fusion (x-axis: generation; y-axis: fitness value). Each curve represents one run of the algorithm.
6.4 Results on convergence
Two iterative computational components, boundary sample
generation and rule fusion & selection, are designed based on
PSO and PBIL, respectively. Both methods have proven convergence
properties. Fig. 7 shows the convergence curves
on the yea1 dataset, where each optimization was run
multiple times. It can be seen that both optimization components
converge quickly owing to their simple optimization
objectives. The same convergence trend is observed on
the other datasets.
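To make the PBIL-based rule fusion & selection concrete, the following minimal sketch shows one way the probability vector over rules could be updated from participant-reported fitness values. The function name, parameter defaults, and fitness interface are illustrative assumptions rather than the exact CloREF implementation; $\delta$ and $\mu$ play the roles of the learning-rate and mutation parameters discussed in Sect. 6.9.

```python
import numpy as np

def pbil_rule_selection(fitness, n_rules, n_gene=20, n_iter=400,
                        delta=0.1, mu=0.02, mut_shift=0.05, seed=0):
    """Minimal PBIL sketch for selecting a subset of rules.

    fitness: callable mapping a boolean mask over rules to a score
             (in CloREF, the AUC evaluated by the participants).
    """
    rng = np.random.default_rng(seed)
    prob = np.full(n_rules, 0.5)              # P(rule i is selected)
    best_mask, best_fit = None, -np.inf
    for _ in range(n_iter):
        # Sample a population of gene codes (one bit per rule).
        pop = rng.random((n_gene, n_rules)) < prob
        scores = np.array([fitness(mask) for mask in pop])
        if scores.max() > best_fit:
            best_fit, best_mask = scores.max(), pop[scores.argmax()].copy()
        # Pull the probability vector towards the best gene of this round.
        prob = (1 - delta) * prob + delta * pop[scores.argmax()]
        # Random mutation keeps the search exploring.
        flip = rng.random(n_rules) < mu
        prob[flip] = (1 - mut_shift) * prob[flip] + mut_shift * rng.random(flip.sum())
    return best_mask, best_fit
```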
6.5 Communication and computation analysis
According to the definition of a rule in Sect. 4, suppose
parameter $b$ and each element in $a$ and $c$ are of type
'float32', and parameter $\omega$ is of type 'char'. Then
$8n+5$ bytes are needed to represent one rule, where $n$
is the feature dimension. Suppose there are $p$ participants
and each participant extracts $l$ rules. The amount of
communication for rule uploading and downloading at each
participant is $(8n+5)(p+1)l$ bytes. During the rule fusion
and selection process, each participant needs to iteratively
receive the gene codes from and upload the calculated AUC
(4 bytes) to the server. Suppose each population has $N_{gene}$
genes and the number of iterations is $N_i$. Since the length of
each gene is $\frac{pl}{8}$ bytes, the amount of communication at each
participant in this process is $N_i N_{gene}(\frac{pl}{8}+4)$ bytes.
In FedAvg, the amount of communication at each participant
depends only on the architecture of the global neural
network model. Suppose $N_w$ is the number of weights, $N_i$
is the number of iterations, and each weight is of type 'float32'.
The communication cost at each participant is $8 N_w N_i$
bytes. The calculation of communication for HeteroFL and
FedRolex is the same as for FedAvg, except that the
values of $N_w$ in HeteroFL and FedRolex are smaller, since
each participant trains a sub-model.
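The two cost expressions above are easy to check numerically. The sketch below encodes them directly; it assumes that the $l$ values reported in Table 4 are total rule counts over all participants and that 1 KB = 1000 bytes, under which the computed figures track Table 4 closely.

```python
def cloref_cc_kb(n, p, l_total, n_iter=100, n_gene=20):
    """Per-participant communication of CloREF in KB.

    Rule exchange: (8n+5)(p+1)l bytes with l rules per participant.
    PBIL phase: per iteration, n_gene gene codes of l_total/8 bytes
    each are received and a 4-byte AUC is returned per gene.
    """
    l = l_total / p
    rule_bytes = (8 * n + 5) * (p + 1) * l
    pbil_bytes = n_iter * n_gene * (l_total / 8 + 4)
    return (rule_bytes + pbil_bytes) / 1000

def fedavg_cc_kb(n, hidden, n_iter=100):
    """Per-participant communication of FedAvg in KB: 8 * Nw * Ni bytes
    (upload + download of float32 weights), counting the weights and
    biases of an MLP with a single output unit."""
    layers = [n] + list(hidden) + [1]
    n_w = sum((a + 1) * b for a, b in zip(layers, layers[1:]))
    return 8 * n_w * n_iter / 1000

# glas: ~25.7 KB vs 25.1 KB reported in Table 4; FedAvg matches exactly.
print(cloref_cc_kb(n=9, p=4, l_total=51))        # ~25.66
print(fedavg_cc_kb(n=9, hidden=(10, 10, 10)))    # 264.8
```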
Table 4 compares the communication cost of CloREF
and other FL models. In our experiments, HeteroFL and
FedRolex use the same global model and shrinkage ratios.
Hence, they have the same communication cost as depicted
by H/F in Table 4. The number of iterations is set to 100 in
all experiments. $N_{gene}$ in CloREF is set to 20. The values
of the other key parameters are given in the table. It is
worth noting that the communication cost of CloREF is
significantly lower than that of other FL models because
(1) in most cases, the rule set is much simpler than a neural network model, and (2) one rule occupies only one bit in the gene code, whereas a weight for a neuron connection needs 4 bytes.

TABLE 4
Communication cost (CC) and storage (ST), measured in KB. Arc represents the architecture of the MLP used by FedAvg.

          glas        yea1     vehi1       pima     wis      se0         kdd      pa0
  p       4           5        5           5        5        14          14       19
  n       9           9        18          8        5        19          41       10
CloREF
  l       51          12.4     53.6        66.8     16.8     32          156.6    68.4
  CC      25.1        12.0     30.3        29.5     13.4     20.9        100.6    30.5
  ST      0.4         0.1      0.4         0.5      0.1      0.3         1.2      0.5
FedAvg
  Arc     (10,10,10)  (10,10)  (10,10,10)  (10,10)  (10,10)  (10,10,10)  (5,5,5)  (10,10)
  CC      264.8       186.7    342.2       178.1    186.7    350.8       246.9    195.3
  ST      2.6         1.9      3.4         1.8      1.9      3.5         2.5      2.0
H/F
  CC      152.0       113.6    206.2       107.6    113.6    212.3       169.3    119.6
  ST      1.5         1.1      2.0         1.1      1.1      2.1         1.7      1.2
Communication failure and result pollution are common
challenges for distributed learning. Compared with tradi-
tional FL, CloREF has at least two advantages: (1) as long as
a participant can successfully send its rules to the server, the
knowledge it learns from its local data could be integrated
into the global model, even if the participant can’t join
the rule evaluation process due to communication failure.
(2) In each iteration, our model transmits less data and
requires less communication time than standard FL, making
it more robust to temporary communication failures. The
synchronous fusion process also avoids result pollution,
since the evaluation score must be computed based on the
genes sent by the server in each iteration.
The time complexity of our PSO-based boundary sample
generation method is $O(N_S N_G N_P)$ [30], where $N_S$ is the
number of synthetic boundary samples, and $N_P$ and $N_G$
are the number of particles and the maximum number
of generations in each execution of PSO, respectively. As
shown in Fig. 7(a), our PSO-based method converges fast
and thus will not add much computation overhead to the
participants. In the worst case, the time complexity of Algo. 1
is $O(|CLU| \log(|CLU|))$, where $|CLU|$ is the number of
initial clusters.
initial clusters. In contrast to FedAvg with deep neural net-
works, which require a substantial amount of computation
for gradient calculation in each round, both the genera-
tion of boundary samples and model fitting require only
a single execution in CloREF. Hence, although additional
operations are required on the client side in CloREF, they
will not add much computing burden to
the clients. Table 4 also shows that the storage requirement
for model parameters at each client in CloREF is much less
than the requirements in FedAvg, and HeteroFL/FedRolex.
6.6 Case studies on IID and non-IID datasets
Theoretically, CloREF is not sensitive to the client data distribution
because each participant first builds its local learning model
independently on its local dataset, and the server then merges
the extracted rules. To verify the
robustness of CloREF in dealing with non-IID data, we first
compare the decision boundaries generated by CloREF on the
spirals dataset under both IID and non-IID distributions.
We split the spirals dataset into four subsets in the follow-
ing two ways: (1) random sampling so that the four subsets
Fig. 8. Classification boundaries on different sub-datasets: (a)–(d) at participants A–D; (e) after rule fusion. First row: random splitting (accuracies 0.97, 0.93, 0.84, and 0.93 at participants A–D, and 0.97 for CloREF); second row: cross-splitting (0.79, 0.52, 0.56, 0.70, and 0.97, respectively).
have similar distributions (i.e., IID); (2) cross-splitting so
that each subset contains a quarter of the whole spiral
dataset (i.e., non-IID). Four different models, MLP (5,5),
SVM (RBF), NB, SVM (POLY), are trained using the four
subsets. The decision boundaries generated by local rules
and the final fused rule set are shown in Fig. 8, where the
accuracy of the corresponding set of rules on the same test
dataset is shown in the lower right corner. It can be seen that,
for both local data distributions, CloREF can effectively fuse
the knowledge learned by all participants to generate similar
decision boundaries and the final global models achieve
the same accuracy. Furthermore, the accuracy achieved by
CloREF is higher than that of the best global model as shown
in Fig. 6, i.e., the best model that has the luxury of being
directly trained using the whole dataset.
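For reproducibility, the sketch below illustrates the two 4-way splits used above. Treating cross-splitting as a spatial quadrant split is our assumption; the text only states that each subset holds a quarter of the whole spiral dataset.

```python
import numpy as np

def split_four_ways(X, y, mode="iid", seed=0):
    """Return four (X, y) subsets under the two splitting strategies."""
    rng = np.random.default_rng(seed)
    if mode == "iid":                      # random sampling
        idx = rng.permutation(len(X))
        return [(X[p], y[p]) for p in np.array_split(idx, 4)]
    # Non-IID cross-split: one spatial quadrant per participant
    # (our reading of "a quarter of the whole spiral dataset").
    right = X[:, 0] >= np.median(X[:, 0])
    upper = X[:, 1] >= np.median(X[:, 1])
    quads = [right & upper, right & ~upper, ~right & upper, ~right & ~upper]
    return [(X[q], y[q]) for q in quads]
```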
6.7 Case study on anomaly detection for smart home
Most IoT devices are resource-constrained and may
not be able to deploy a complicated or homogeneous
learning model. In addition, data generated by different
IoT devices may exhibit statistical heterogeneity, making
it desirable to customize their models based on the local data
distribution through model heterogeneity. In this experiment,
we use the public N-BaIoT dataset [36] to demonstrate
the advantages of our method for anomaly detection in a
smart home system. N-BaIoT was created by collecting the
network traffic of 9 IoT devices infected by two common IoT
botnets, BASHLITE and Mirai. The specifications of each device
and the attacks it suffered are given in Table 5.
The experimental settings are the same as shown in
Table 1 except that the number of generations in PBIL-
RFS and the communication round are both set to 500. The
experimental results are shown in Table 6, with mean and
standard deviation computed over 5-fold cross validation.
To better evaluate the performance of trained models, we
use a global test dataset that contains all types of attacks.
As can be seen from Table 6, the local models, especially those of
Cam4 and WC, perform significantly worse than the global
model built by CloREF when tested on the global dataset.
Because the local models are trained only on their local attacks,
they are less capable of recognizing attacks they have not
seen, whereas CloREF can fuse the knowledge learned from
different IoT devices. DB1’s local model also achieves much
better performance than WC’s local model even though the
types of attacks they suffered are the same. This is because
(1) the normal behaviour of a doorbell is much simpler than
the normal behaviour of a webcam, making it much easier
to distinguish from abnormal behaviours; (2) DB1 and WC
have different data distributions, as revealed by t-SNE analysis [37].

Fig. 9. Boxplot for the importance of features over 5-fold cross-validation (x-axis: feature number, ordered 15, 16, 2, 5, 3, 1, 6, 9, 10, 17, 4, 8, 7, 23, 12, 11, 13, 14, 19, 20, 22, 21, 18).
Another advantage of CloREF over DNNs is its direct
interpretability. From the coefficients of the linear rules, we
can directly know which features play significant roles in
making decisions without needing to employ an explana-
tion method, e.g., [24]. Since the reasoning process of the
learning model is transparent, we can decide whether to
trust or reject the decisions made by the learning model.
This is crucial for applications such as network security, au-
tonomous driving, and healthcare, where accepting a wrong
decision could have irreversible consequences. To demon-
strate the consistency on interpretability, Table 7 shows the
top 5 important features selected by our model and other
feature selection methods. We use two types of feature
selection methods: (1) univariate statistical tests, and (2)
white-box explanation method (i.e., checking the coefficient
for each feature assigned by the trained model). Moreover,
we also compare the important features selected in [35].
Overall, the important features selected by our model and
the other methods overlap to a large extent, which demonstrates
the consistency and effectiveness of the feature selection by our
model. In addition, Fig. 9 shows the boxplot of the feature
importance values of the global rule set over 5-fold cross-validation,
which demonstrates the stability of CloREF.
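Since every rule is a linear model, a feature-importance score can be read off the rule coefficients directly. The aggregation below, per-rule L1 normalisation followed by averaging over the global rule set, is an illustrative assumption; the paper derives importance from the coefficients but does not spell out the exact formula behind Fig. 9.

```python
import numpy as np

def feature_importance(rule_coefs):
    """Aggregate importance over the global rule set (a sketch).

    rule_coefs: (n_rules, n_features) array, one coefficient vector
    per linear rule. Each rule is L1-normalised so that large-scale
    rules do not dominate, then importances are averaged.
    """
    a = np.abs(np.asarray(rule_coefs, dtype=float))
    a /= a.sum(axis=1, keepdims=True) + 1e-12
    return a.mean(axis=0)

# Ranking features as in Fig. 9 (1-based feature numbers):
# order = np.argsort(-feature_importance(coefs)) + 1
```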
6.8 Experiments on multiclass classification
We now demonstrate the extension of CloREF for multiclass
classification using the One-vs-One approach [25]. At each
participant, one binary classifier is trained for each pair of
classes. At the server, the received rules are merged and
selected at a per-class-pair level, using the same approach
introduced in Sect. 6.3, to generate the best set of global rules
for each pair of classes. When testing an unseen sample at
a local participant, majority voting is used to determine the
class the sample belongs to. If there is more than one winner,
we randomly choose one from the winning classes.
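A minimal sketch of this inference step is shown below. The pair_classifiers interface is hypothetical: each entry stands for the fused global rule set of one class pair, exposed as a callable that returns the winning class for a sample.

```python
import numpy as np
from itertools import combinations

def ovo_predict(sample, pair_classifiers, classes, seed=None):
    """One-vs-One inference with majority voting and random tie-breaking."""
    rng = np.random.default_rng(seed)
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        winner = pair_classifiers[pair](sample)  # apply the pair's fused rules
        votes[winner] += 1
    top = max(votes.values())
    winners = [c for c, v in votes.items() if v == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)
```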
Dataset and models: We use the well-known electrocardio-
gram (ECG) dataset provided by MIT-BIH [38] for the mul-
ticlass classification evaluation. Following the recommen-
dation of the Association for the Advancement of Medical
Instrumentation (AAMI), samples in the MIT-BIH dataset
TABLE 5
Detailed specifications of the N-BaIoT dataset used in this experiment.
Device Device type Abbrev. BASHLITE Mirai
Combo Junk Scan TCP UDP Ack Scan Syn UDP UDPPlain
Danmini Doorbell DB1 X X X X X
Ennio Doorbell DB2 X X X X X
Ecobee Thermostat TM X X X X X X X X
Philips B120N/10 Baby monitor BM X X X X X X X
Provision PT-737E Security camera Cam1 X X X X X X X
Provision PT-838 Security camera Cam2 X X X X X X X
SimpleHome XCS7-1002-WHT Security camera Cam3 X X X X X X X
SimpleHome XCS7-1003-WHT Security camera Cam4 X X X X X X X
Samsung SNH 1011 N Webcam WC X X X X X
TABLE 6
Results of nine local models and CloREF on the N-BaIoT dataset.
DB1 DB2 TM BM Cam1 Cam2 Cam3 Cam4 WC CloREF
ACC 0.998(±0.00) 0.994(±0.01) 0.923(±0.13) 0.997(±0.00) 0.957(±0.01) 0.978(±0.00) 0.959(±0.00) 0.928(±0.04) 0.915(±0.01) 0.998(±0.00)
GM 0.995(±0.00) 0.979(±0.04) 0.942(±0.08) 0.993(±0.01) 0.973(±0.00) 0.986(±0.00) 0.975(±0.00) 0.876(±0.19) 0.614(±0.03) 0.997(±0.00)
AUC 0.995(±0.00) 0.980(±0.04) 0.946(±0.07) 0.993(±0.01) 0.974(±0.00) 0.986(±0.00) 0.975(±0.00) 0.897(±0.14) 0.689(±0.02) 0.997(±0.00)
TABLE 7
Top important features selected by different methods.
Feature Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Univariate
F-test X X X X X
χ2-test X X X X X
MutualInformation X X X X X
White-box
Explanation
LinearSVC X X X X X
RandomForest X X X X X X
SGDClassifier X X X X X
LogisticRegression X X X X X
AdaBoost X X X X X
ExtraTrees X X X X X
GBDT X X X X X
Mafarja et al. [35] X X X X X X
CloREF X X X X X
can be categorized into five heartbeat classes: N, SVEB, VEB,
F, and Q. Because the dataset contains original ECG signals,
we segment the signals into beats and extract some features
using the algorithm proposed by Guerra et al. [39]. In this
set of experiments, we use the Wavelets (23 features) and
Our Morph (4 features) feature descriptors. We also follow
the common practice in ECG classification and divide the
dataset into a training set and a testing set using the inter-
patient scheme given in [40], with both the training and
testing sets containing data from 22 records with a similar
proportion of beat types. As both the training and testing
sets are fixed, we calculate the average results of five runs.
The detailed specification of this dataset is given in Table 8.
We split the dataset into five subsets for five participants.
We use MLP(20,20,20) as the global model for FedAvg, since
it gives the best performance in our experiments with MLP.
Also, we only use MLP(20,20,20) and MLP(10,10,10) as the
local models in FedMD since we observed that using more
heterogeneous models decreases the stability of FedMD.
The other parameter settings for CloREF, FedAvg, FedMD,
HeteroFL and FedRolex are the same as those in Table 1.
Metrics: Given the highly skewed class distribution in this
dataset, using accuracy alone is no longer appropriate for
ensuring a fair comparison. Following most existing ECG
classification studies, we employ three additional metrics:
the macro-average of recall and of precision over the classes,
and the jκ index [41]. The jκ index is a special metric used in
ECG analysis to evaluate the discrimination of the most
important arrhythmias (SVEB and VEB), taking into
account the misclassification and imbalance among all the
considered classes. A higher jκ index indicates better
discrimination. We prioritize these three metrics over accuracy,
given that in healthcare scenarios the consequence of failing
to detect a positive case in time is irreversible. Besides,
AAMI recommends no reward or penalty for classifying an
F sample as a VEB sample [39]. Hence, the corresponding
cells of the confusion matrix are not counted in our analysis.
Results analysis: The results on the ECG dataset are given
in Table 9. First and foremost, the results demonstrate that
CloREF can be effectively extended to tackle multiclass
classification problems. As can be seen in Table 9, CloREF
achieves the highest scores on all three key metrics (recall,
precision, and jκ index), demonstrating its superior ability in
dealing with class imbalance. Even though FedRolex attains
the highest accuracy, its considerably low recall suggests
that it effectively identifies negative (majority) samples but
struggles to identify positive (minority) samples. Similarly,
the precision scores for positive classes such as SVEB and F
given by FedMD and FedAvg are much lower than those of
CloREF. One possible reason for the superiority of CloREF
is that, to perform multiclass classification, CloREF takes the
OVO strategy to build multiple binary classifiers, potentially
mitigating the negative impact on minority classes caused
by small sample sizes. Because one subset serves as the
TABLE 8
Detailed specifications of the ECG dataset.

Class        N      SVEB  VEB   F    Q
DS1 (Train)  45842  944   3788  414  0
DS2 (Test)   44743  1837  3447  388  8
TABLE 9
The average results on the multiclass ECG dataset.

Method       Accuracy  Recall  Precision  jκ index
Centralized  0.858     0.480   0.375      0.387
FedAvg       0.877     0.468   0.349      0.382
FedMD        0.743     0.422   0.299      0.216
HeteroFL     0.862     0.364   0.316      0.227
FedRolex     0.904     0.378   0.409      0.345
CloREF       0.839     0.505   0.462      0.434
proxy dataset, FedMD’s performance is notably compro-
mised in comparison to the other methods, primarily due
to the reduction in the number of training samples.
6.9 Discussion on hyperparameter settings
We discuss the configuration of the hyperparameters in our
model within two categories: server-side and client-side.
Server-side: parameters in this category include $(\alpha, \theta_m)$,
used for controlling rule selection and merging, and $(\delta, \mu, N_{gene})$,
used for controlling the learning rate, the extent of
mutation, and the number of genes in PBIL. Our experimental
results given in Sect. 6.3 show that a moderately large $\alpha$
(e.g., 0.9) and a small $\theta_m$ (e.g., 0.02) can significantly reduce
the number of rules with a nearly imperceptible decrease
in AUC. Moreover, the configurations for both $\alpha$ and $\theta_m$
are not sensitive to datasets. Hence, we recommend $\alpha = 0.9$
and $\theta_m = 0.02$ as the empirical optimal configuration. The other
parameters in this group ($\delta$, $\mu$, $N_{gene}$) are related to the PBIL
algorithm. In our experiments, we consistently observe that
their default settings given in Table 1 lead to fast and stable
convergence, as demonstrated in Sect. 6.4.
Client-side: parameters in this category include $(\theta, N_S, N_P, N_G)$,
used for synthesizing boundary samples, and $(T_{split}, T_{merge})$,
used for generating decision rules. Our experiments
show that the generic setting of $\theta$ in Eq. (4), that is, the
median value (i.e., 0.5) of the prediction given by a local
model, already leads to satisfactory performance. $N_S$ is
the required number of synthetic boundary samples. It is
initially set to $nm$, where $n$ is the feature dimension and $m$
is the expected number of rules, with a default value of
20. If the fidelity of the generated rules is unsatisfactory, $N_S$
can be tuned to generate more boundary samples. $N_P$ and
$N_G$ are the number of particles and the maximum number
of generations in each execution of PSO, and their empirical
settings are given in Table 1. Since the optimization objective
of our PSO-based method is very simple, it converges very
fast, as shown in Fig. 7, which also facilitates the fine-tuning
of these two parameters. For the parameters $(T_{split}, T_{merge})$,
their default settings are given in Table 1. In practice, we can
slightly increase the values of these two parameters around
the default values if the observed fidelity on a dataset is
below 0.9.
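For convenience, the recommendations above can be collected into a single configuration sketch. Only the values stated explicitly in this section are filled in; parameters whose defaults live in Table 1 (e.g., $N_P$, $N_G$, $\delta$, $\mu$, $T_{split}$, $T_{merge}$) are left as placeholders.

```python
# Server-side recommendations (Sect. 6.9).
SERVER_DEFAULTS = {
    "alpha": 0.9,      # rule selection: moderately large alpha
    "theta_m": 0.02,   # rule merging: small threshold
    "n_gene": 20,      # PBIL population size used in our experiments
    # delta, mu: see Table 1 (not reproduced here)
}

# Client-side recommendations (Sect. 6.9).
def default_client_params(n_features, m=20):
    """theta = 0.5 (median prediction in Eq. (4)); N_S = n * m."""
    return {"theta": 0.5, "n_s": n_features * m}
```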
7 CONCLUSION
In this paper, we proposed a rule-based collaborative learn-
ing framework CloREF that enables multiple heterogeneous
participants to build a global model without sharing their
local data. Besides the advantages of being lightweight and
interpretable, it also gives competitive performance on a
range of benchmarks. As a new attempt in collaborative
learning, it has some limitations. The fidelity of the rules
extracted at the participants is sometimes unsatisfactory. To
address this, we intend to investigate the use of other distance
metrics such as the Mahalanobis distance to improve the
quality of boundary sample clustering, which may generate
local linear models that are stable and well-performing.
Also, CloREF may not work well on very high-dimensional
data such as images and text due to the overhead in gener-
ating boundary samples. We will investigate the feasibility
of using other linear regression techniques such as elastic
net [42] to better cope with high-dimensional data. Further-
more, we will investigate personalizing the global rule set to
make it adaptive to local datasets and apply the developed
framework to more real-world, large-scale problems.
REFERENCES
[1] P. Kairouz et al., “Advances and open problems in federated
learning,” Foundations and Trends® in Machine Learning, vol. 14,
no. 1–2, pp. 1–210, 2021.
[2] A. Hard et al., “Federated learning for mobile keyboard predic-
tion,” 2018, arXiv:1811.03604.
[3] L. Cui et al., “Security and privacy-enhanced federated learning
for anomaly detection in iot infrastructures,” IEEE Transactions on
Industrial Informatics, vol. 18, no. 5, pp. 3492–3500, 2022.
[4] D. Y. Zhang, Z. Kou, and D. Wang, “Fedsens: A federated learning
approach for smart health sensing with class imbalance in resource
constrained edge computing,” in Proceedings of IEEE Conference on
Computer Communications, 2021, pp. 1–10.
[5] P. Zhou et al., “A privacy-preserving distributed contextual fed-
erated online learning framework with big data support in social
recommender systems,” IEEE Transactions on Knowledge and Data
Engineering, vol. 33, no. 3, pp. 824–838, 2021.
[6] D. Li and J. Wang, “FedMD: Heterogenous federated learning via
model distillation,” 2019, arXiv 1910.03581.
[7] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation
for robust model fusion in federated learning,” Advances in Neural
Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
[8] E. Diao et al., “HeteroFL: Computation and communication effi-
cient federated learning for heterogeneous clients,” in Inter. Conf.
on Learning Representations, 2021.
[9] M. G. Arivazhagan et al., “Federated learning with personalization
layers,” 2019, arXiv:1912.00818.
[10] R. Liu et al., “No one left behind: Inclusive federated learning
over heterogeneous devices,” in Proc. of the 28th SIGKDD Conf.
on Knowledge Discovery and Data Mining, August 14–18, 2022.
[11] J. Verbraeken et al., “A survey on distributed machine learning,”
ACM Comput. Surv., vol. 53, no. 2, pp. 1–33, March 2020.
[12] B. McMahan et al., “Communication-Efficient Learning of Deep
Networks from Decentralized Data,” in Proc. of the 20th Inter. Conf.
on Artificial Intelligence and Statistics, vol. 54, 2017, pp. 1273–1282.
[13] M. Duan et al., “Self-balancing federated learning with global
imbalanced data in mobile systems,” IEEE Transactions on Parallel
and Distributed Systems, vol. 32, no. 1, pp. 59–71, 2020.
[14] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated
learning: A meta-learning approach,” 2020, arXiv:2002.07948.
[15] T. Li et al., “Federated optimization in heterogeneous networks,”
in Proc. of Machine Learning and Systems, vol. 2, 2020, pp. 429–450.
[16] O. Marfoq et al., “Federated multi-task learning under a mixture of
distributions,” in Proc. of Advances in Neural Information Processing
Systems, vol. 34, 2021, pp. 15 434–15 447.
[17] Y. Liu et al., “A secure federated transfer learning framework,”
IEEE Intelligent Systems, vol. 35, no. 4, pp. 70–82, 2020.
15
[18] A. Z. Tan, H. Yu, L. Cui, and Q. Yang, “Towards personalized
federated learning,” IEEE Transactions on Neural Networks and
Learning Systems, pp. 1–17, 2022.
[19] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation
for robust model fusion in federated learning,” Advances in Neural
Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
[20] Z. Zhu, J. Hong, and J. Zhou, “Data-free knowledge distillation for
heterogeneous federated learning,” in Proc. of the 38th Inter. Conf.
on Machine Learning, vol. 139, 18–24 Jul 2021, pp. 12 878–12 889.
[21] S. Alam et al., “FedRolex: Model-heterogeneous federated learning
with rolling sub-model extraction,” in Advances in Neural Informa-
tion Processing Systems, 2022.
[22] W. Samek, T. Wiegand, and K.-R. M¨
uller, “Explainable artificial
intelligence: Understanding, visualizing and interpreting deep
learning models,” 2017, arXiv:1708.08296.
[23] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘why should I trust you?’:
Explaining the predictions of any classifier,” in Proc. of the 22nd
ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining,
2016, p. 1135–1144.
[24] W. Guo et al., “LEMNA: Explaining deep learning based security
applications,” in Proc. of the 2018 ACM SIGSAC Conf. on Computer
and Communications Security, 2018, pp. 364–379.
[25] D. Tax and R. Duin, “Using two-class classifiers for multiclass
classification,” in Proceedings of Inter. Conf. on Pattern Recognition,
vol. 2, 2002, pp. 124–127.
[26] J. Platt et al., “Probabilistic outputs for support vector machines
and comparisons to regularized likelihood methods,” Advances in
large margin classifiers, vol. 10, no. 3, pp. 61–74, 1999.
[27] B. Zadrozny and C. Elkan, “Obtaining calibrated probability esti-
mates from decision trees and naive bayesian classifiers,” in ICML,
vol. 1, 2001, pp. 609–616.
[28] M. Kearns and Y. Mansour, “On the boosting ability of top-down
decision tree learning algorithms,” in Proc. of the 28th Annual ACM
symposium on Theory of Computing, 1996, pp. 459–468.
[29] J. Kennedy et al., “Particle swarm optimization,” in Proc. of Inter-
national Conference on Neural Networks, vol. 4, 1995, pp. 1942–1948.
[30] Y. Pang et al., “Rule-based collaborative learning with heteroge-
neous local learning models,” in Pacific-Asia Conf. on Knowledge
Discovery and Data Mining. Springer, 2022, pp. 639–651.
[31] R. Guidotti et al., “A survey of methods for explaining black box
models,” ACM Computing Surveys, vol. 51, no. 5, pp. 1–42, 2018.
[32] J. Alcalá-Fdez et al., “KEEL data-mining software tool: data set
repository, integration of algorithms and experimental analysis
framework,” Journal of Multiple-Valued Logic and Soft Computing,
vol. 17, no. 2–3, pp. 255–287, 2011.
[33] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
[Online]. Available: http://archive.ics.uci.edu/ml
[34] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,”
Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[35] M. Mafarja et al., “Augmented whale feature selection for iot
attacks: Structure, analysis and applications,” Future Generation
Computer Systems, vol. 112, pp. 18–40, 2020.
[36] Y. Meidan et al., “N-baiot—network-based detection of iot bot-
net attacks using deep autoencoders,” IEEE Pervasive Computing,
vol. 17, no. 3, pp. 12–22, 2018.
[37] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,”
Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579–2605, 2008.
[38] G. B. Moody and R. G. Mark, “The impact of the MIT-BIH
arrhythmia database,” IEEE Engineering in Medicine and Biology
Magazine, vol. 20, no. 3, pp. 45–50, 2001.
[39] V. Mondéjar-Guerra et al., “Heartbeat classification fusing temporal and
morphological information of ECGs via ensemble of classifiers,”
Biomedical Signal Processing and Control, vol. 47, pp. 41–48, 2018.
[40] P. De Chazal et al., “Automatic classification of heartbeats using
ecg morphology and heartbeat interval features,” IEEE transactions
on biomedical engineering, vol. 51, no. 7, pp. 1196–1206, 2004.
[41] T. Mar et al., “Optimization of ecg classification by means of fea-
ture selection,” IEEE Transactions on Biomedical Engineering, vol. 58,
no. 8, pp. 2168–2177, 2011.
[42] H. Zou and T. Hastie, “Regularization and variable selection via
the elastic net,” Journal of the Royal Statistical Society: Series B
(Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
Ying Pang is a Ph.D. candidate from the Depart-
ment of Computer Science, University of Otago,
New Zealand. Prior to the University of Otago,
she received her B.E. and MA.Sc. degrees in
Computer Science and Technology from Univer-
sity of Jinan, China, in 2016 and 2019, respec-
tively. Her research generally centers around
distributed optimization, federated learning, in-
terpretation of learning models, and imbalanced
learning.
Haibo Zhang received the M.Sc. degree in com-
puter science from Shandong Normal University,
Jinan, China, in 2005, and the Ph.D. degree in
computer science from the University of Ade-
laide, Adelaide, Australia, in 2009. From 2009
to 2010, he was a Postdoctoral Research As-
sociate with the Automatic Control Laboratory,
School of Electrical Engineering, Royal Institute
of Technology, Stockholm, Sweden. He is cur-
rently an Associate Professor with the Depart-
ment of Computer Science, University of Otago,
Dunedin, New Zealand. His current research interests include wireless
networks, distributed computing, and distributed edge intelligence. He
has published over 100 peer-reviewed papers, and been on the program
committee for many international conferences.
Jeremiah D. Deng obtained the B.E. degree
from the University of Electronic Science and
Technology of China in 1989, and the M.Eng.
and D.Eng. from the South China University of
Technology, China, in 1992 and 1995, respec-
tively, the latter co-supervised at the University
of Hong Kong. He is now an Associate Pro-
fessor with the School of Computing, University
of Otago. Dr. Deng’s research interests include
machine learning, pattern recognition, and mod-
eling and optimization of computer networks. He
has published over 100 refereed research papers in international con-
ference proceedings and journals.
Lizhi Peng received the B.Eng.
degree from Xi'an Jiaotong University in 1998,
and the Ph.D. degree from Harbin Institute of Tech-
nology in 2015. He was a visiting scholar at the
Department of Computer Science of the University
of Otago, New Zealand, in 2017. He is currently
a Professor of School of Information Science and
Engineering, University of Jinan, Jinan, China.
His current research interests include machine
learning, evolutionary computing, computer net-
work, and parallel computing.
Fei Teng received the B.S. degree from South-
west Jiaotong University, Chengdu, China, in
2006, and the Ph.D. degree in computer science
and technology from Ecole Centrale de Paris,
Paris, France, in 2011. She is currently an Asso-
ciate Professor in the School of Computing and
Artificial Intelligence at Southwest Jiaotong Uni-
versity. Her research interests are edge comput-
ing, machine learning, medical informatics and
data mining.