Collaborative Learning with Heterogeneous
Local Models: A Rule-based Knowledge Fusion
Approach
Ying Pang, Haibo Zhang, Jeremiah D. Deng, Lizhi Peng, and Fei Teng
Abstract—Federated Learning (FL) has emerged as a promising collaborative learning paradigm that enables the training of machine
learning models across decentralized devices, while keeping the training data localized to preserve user privacy. However, the heterogeneity in
both decentralized training data and distributed computing resources has posed significant challenges to the design of effective and
efficient FL schemes. Most existing solutions either focus on tackling a single type of heterogeneity, or are unable to fully support
model heterogeneity with low communication overhead, fast convergence, and good interpretability. In this paper, we present CloREF,
a novel rule-based collaborative learning framework that allows devices in FL to use completely different local learning models to cater
to both data and resource heterogeneity. In CloREF, each rule is represented as a linear model, which provides good interpretability.
Each participating device chooses a local model and trains it using its local data. The decision boundary of each trained local model
is then approximated using a set of rules, which effectively bridges the gap arising from model heterogeneity. All participating devices
collaborate to select the optimal set of rules as the global model, employing evolutionary optimization to effectively fuse the knowledge
acquired from all local models. Experimental results on both synthesized and real-world datasets demonstrate that the rules generated
by our proposed method can mimic the behaviors of various learning models with high fidelity (>0.95 in most tests), and CloREF
gives competitive performance in accuracy, AUC, and communication overhead, compared with both the best-performing model trained
centrally and several state-of-the-art model-heterogeneous federated learning schemes.
Index Terms—collaborative learning, heterogeneous participants, rule extraction, federated learning, knowledge fusion.
1 INTRODUCTION
Federated Learning (FL) is a new paradigm of dis-
tributed machine learning that enables multiple parties
to collaboratively build a high-performance prediction
model [1]. In FL, a central server guides the training pro-
cess, enabling participating devices to iteratively refine the
learning model using their local data and share the acquired
knowledge through model aggregation on the server. This
eliminates the need to share local training data, thereby
protecting user privacy. FL has enabled a wide range of
applications, including Google’s Gboard [2], anomaly de-
tection [3], smart health [4], and recommender system [5].
A widely adopted implementation of FL requires all local
models to employ the same architecture so as to produce a
single global prediction model. This brings several limi-
tations. Firstly, most implementations of FL use the Deep
Neural Network (DNN) model, whose number of parame-
ters to be learned grows significantly with the increase in
network size. This requires all participants to have similar
capabilities in terms of computation, memory, power and
network bandwidth. However, such capabilities can differ
significantly at the participating devices. Hence, the global
model complexity will be constrained by the least capable
participating device. Secondly, the datasets held by differ-
Y. Pang, H. Zhang and J. D. Deng are with the School of Computing,
University of Otago, New Zealand, E-mail: haibo.zhang@otago.ac.nz
L. Peng is with the School of Information Science and Engineering, University
of Jinan, China
F. Teng is with the School of Computer Science and Artificial Intelligence,
Southwest Jiaotong University, China
ent participants might have diverse distributions, making
it reasonable for different participants to apply different
types of machine learning models. Even if all local datasets
have a similar distribution, the performance of collaborative
learning could be improved by exploiting the diversity of
learning models just as in ensemble learning, since different
models can have different capabilities in characterising the
decision boundaries. Apart from the limitation on homo-
geneity, highly complex models such as DNNs also have
low interpretability. Many real-world applications (e.g., IoT
and finance) generate tabular data with low or moderate di-
mensions, making it unnecessary and even unsuitable to use
over-complicated and less-interpretable learning models.
Few FL schemes can truly achieve model heterogeneity
that allows heterogeneous local models to be first trained at
the participants and then aggregated at the central server.
The main challenge lies in the interpretation and merging of
knowledge embedded in heterogeneous models. To tackle
this challenge, one approach is knowledge distillation (KD)
[6], [7], which enables the global model (student model) to
obtain knowledge from heterogeneous local models (teacher
models) by aggregating local models’ predictions on a shared
proxy dataset. However, the requirement for a shared
proxy dataset compromises privacy preservation and may
be infeasible for participants with limited storage capacity.
Another approach is model splitting (MS) [8], [9], which
allows participants to train only sub-parts of the global
model according to their available resources. However, due
to the mismatch in model complexity, small models may not
carry the whole knowledge embedded in large models [10].
In this paper, we present a novel “Collaborative learn-
ing with optimized Rule Extraction and Fusion” (CloREF)
framework, which empowers each individual participant to
choose a customized local learning model, aligning with the
characteristics of its local training data as well as its storage
and computing capabilities. Since different participants can
choose completely different local learning models such as
SVM, naive Bayesian classifier, and neural networks, it is
infeasible to perform model aggregation using traditional
methods such as parameter averaging. To address this chal-
lenge, CloREF uses decision rules, represented as linear
models, to uniformly represent the knowledge embedded
in heterogeneous learning models. Specifically, each par-
ticipating device firstly trains its learning model using its
local data, and then approximates the decision boundary of
the trained local model using a set of rules. Subsequently,
all participants work under the coordination of the central
server to jointly select the optimal set of rules as the global
model. Since the final global model is a set of decision rules,
CloREF can provide better interpretability in comparison
with neural network based schemes. The main contributions
of this paper are summarized as follows:
We propose a novel collaborative learning framework
CloREF, which is able to construct an effective global
prediction model by fusing the knowledge learned by
multiple participants using heterogeneous local models.
We propose a new rule extraction method that uses mul-
tiple linear models to approximate the decision boundary
of a trained local model. Experimental results show that
our method can approximate complex decision bound-
aries and mimic the behaviours of a variety of learning
models with fidelity larger than 0.95 in most tests.
We propose a rule fusion method that employs an evolu-
tionary optimization scheme to select the best rule set that
maximizes the model performance. Experimental results
show that our method can effectively select a small set
of rules to provide competitive and sometimes even bet-
ter performance in comparison with the best-performing
model trained centrally with the whole training dataset.
We compare CloREF with four state-of-the-art FL
schemes. Experimental results demonstrate that CloREF
has fast convergence and low communication cost. More-
over, it is robust to non-IID data and can achieve compet-
itive performance in comparison with baseline schemes.
The rest of the paper is organised as follows: Sect. 2
reviews the related works. Sect. 3 gives an overview of
CloREF. Sect. 4 and Sect. 5 present the key elements of
CloREF on rule extraction and fusion. Empirical evaluation
results are presented and analysed in Sect. 6. Finally, we
conclude the paper and discuss future work in Sect. 7.
2 RELATED WORK
2.1 Distributed learning and federated learning
Traditional distributed learning schemes typically focus on
distributing the load of training a learning model to multiple
processing nodes and are not much concerned about privacy
issues [11]. FL was introduced in [12] as a distributed
learning model, which embodies the principles of focused
collection and data minimization but mitigates the systemic
privacy risks and costs in centralized machine learning. The
“FedAvg” algorithm proposed in [12] has become the most
widely adopted FL baseline where a global model is trained
by periodically aggregating the parameters learned by the
participants based on stochastic gradient descent. Subse-
quent studies along this line have been conducted to address
the challenges faced by FL. Among them, the heterogene-
ity problem has recently received much attention. Existing
works on heterogeneity in FL can be categorized into two
classes: data heterogeneity and model heterogeneity.
Data heterogeneity, also known as the non-IID data
problem, means non-identical data distribution across par-
ticipants. Data augmentation [13] and client scheduling [14]
aim to reduce the difference in data distributions through
data sampling or adaptively selecting the participating
clients at each round according to their data distributions.
Regularization loss [15] constrains the parameter diver-
gence between the local model and the global model by
adding a regularization loss to the local cost function.
Federated meta-learning [14], federated multi-task learning
[16], and federated transfer learning [17] adopt techniques
to build personalized models for individual participants,
with the key idea of first training a global model and then
adapting it to each participant using its local data [18].
Model heterogeneity arises from the context where the
storage, computational, and communication capabilities are
different among participants [15]. Only a few works have
been conducted to achieve complete model heterogeneity
due to the difficulty in interpreting and merging hetero-
geneous models. We categorize the existing solutions for
this problem into the knowledge distillation (KD)-based
approach and the model splitting (MS)-based approach.
The KD-based approach improves the global model by
aggregating the class scores or logit outputs from local
models on a proxy dataset. FedMD [6] is a representative im-
plementation of this approach. In FedMD, each participant
computes the class scores on a public dataset and uploads
them to the central server. The central server broadcasts the
aggregated class scores as an augmented dataset for local
training. Then, local models learn on the augmented dataset
and local dataset iteratively for personalization. However,
the existence of a public dataset may compromise privacy
preservation and is impractical for weak participants with
limited storage. FedDF [19] relieves the storage pressure at
participants by performing distillation only at the server
side. Model heterogeneity is achieved by clustering the
local model architectures and training each type of model
individually at the central server. However, this method still
requires a public dataset. FedGen [20] is a good attempt
to achieve data-free KD-based heterogeneous FL. However,
it focuses more on data heterogeneity and still requires all
local models to employ the same architecture.
The MS-based approach splits the global model into
different sub-models with various complexity levels and
allocates them to each participant according to their system
resources. For a local model, HeteroFL [8] and FedRolex
[21] select a subnetwork of the global model based on a
shrinkage ratio. Consequently, different local models have
the same depth as the global model but different widths.
Fig. 1. The CloREF framework for collaborative learning with heterogeneous local models.
Then, the central server aggregates the overlapped weight
matrices to fuse the knowledge learned from diverse local
models. Intuitively, large models are more difficult to train
to convergence than small models because fewer of their
parameters are covered by the overlapped aggregation.
Furthermore, because of the mismatch in model
architectures, small models may not carry the whole knowl-
edge of large models. In addition, since local models are
sub-models of a global model, the architecture of local
models is constrained by that of the global model. Therefore,
compared with the KD-based approach, MS-based approach
removes the dependence on a proxy dataset but achieves a
lower degree of model heterogeneity.
Our scheme differs from existing solutions since it allows
participants to build truly heterogeneous learning mod-
els and integrates knowledge acquired by the participants
through rule extraction and fusion. Instead of transmitting
large amounts of model parameters, only the rules are
exchanged. Hence, our scheme incurs a lower communication cost
compared with the standard FL framework.
2.2 Rule extraction
Existing rule extraction techniques can be categorized into
white box explanation methods and black box explanation
methods. White box explanation methods extract rules by
directly interpreting the strengths of connection weights,
model architecture, and other parameters [22]. The main
drawback is that they cannot preserve privacy since both
the model architecture and the training data are exposed.
Black box explanation methods, such as LIME [23] and
LMENA [24], do not require internal information of the
machine learning models. They extract rules by observing
the effect of various inputs on the model outputs. However,
they can only interpret some local areas of a complex learn-
ing model but cannot mimic its overall behaviors.
Unlike existing rule extraction schemes, our method
aims to approximate the whole decision boundary gener-
ated by a learning model using multiple linear rules. Since
the global model in our method is a set of linear rules, our
method can explain how each decision is made by showing
the coefficients of the used linear rules, as these coefficients
can represent the importance of different features.
3 OVERVIEW OF CLOREF
We focus on binary classification because there are many
real-world binary classification tasks such as anomaly de-
tection and disease diagnosis. Moreover, a multi-class clas-
sification task can be decomposed into multiple binary clas-
sification problems using transformation techniques such as
One-vs-Rest and One-vs-One [25]. Each participant has a
private dataset to train a local learning model. Different
participants can use different local models such as Sup-
port Vector Machine (SVM), naive Bayesian classifier, and
Multi-Layer Perceptron (MLP). We assume all local models
can output prediction probabilities either directly or via
probability calibration, e.g., Platt scaling [26] or Isotonic
Regression [27] for SVMs, and Kearns-Mansour splitting
criterion [28] for decision trees. All participants agree on
the set of data features and the learning objective.
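As a concrete illustration of this calibration assumption, the following sketch shows one way to obtain calibrated probabilities from an SVM using scikit-learn's CalibratedClassifierCV; the model choice and settings here are illustrative, not prescribed by the paper.

```python
# A minimal sketch of probability calibration for a local SVM model,
# assuming scikit-learn; X_train/y_train/X_test are placeholders.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

base = SVC(kernel="rbf")  # plain SVC exposes decision values, not probabilities
clf = CalibratedClassifierCV(base, method="sigmoid", cv=5)  # Platt scaling
# clf.fit(X_train, y_train)
# proba = clf.predict_proba(X_test)[:, 1]   # P(y = 1 | x), used as H(x)
```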
As illustrated in Fig. 1, our rule-based collaborative
learning scheme contains three key components:
Local training and rule extraction: Each rule is rep-
resented using a linear model. Each participant firstly
trains its local model using its private dataset and then
extracts a set of rules to mimic the behaviors of the trained
local model with high fidelity. The linear model based
rule representation enables the fusion of rules extracted
from heterogeneous machine learning models. Moreover,
a rule has only a few parameters, which can reduce the
complexity on both rule extraction and rule exchange.
Rule exchange: Only parameters of the linear models are
exchanged between participants and the central server,
thereby preserving the privacy of the participants. The
small number of parameters to be exchanged also enables
quick rule exchange even in bandwidth-hungry scenarios.
Rule fusion and selection: The rules received from mul-
tiple participants can be complementary, redundant, or
even contradictory. A rule fusion and selection strategy
designed based on Population-Based Incremental Learn-
ing (PBIL) is employed at the server to select the best set of
rules that maximizes the performance of the global model.
The selected rule set then becomes the global model and
is sent back to the participants for classifying unseen samples.
Concept drift may occur in many applications of ma-
chine learning, which can degrade the performance of a
well-trained model. If the training data is updated and new
rules are generated at local participants, the above process
can be repeated to update the global model.
4 LOCAL TRAINING AND RULE EXTRACTION
The key motivation of our rule extraction method is that we
can mimic the behaviors of a trained learning model if we
can model its decision boundary. In this section, we present
a rule extraction method that can extract rules to mimic the
behaviors of a trained learning model with high fidelity. Fig.
2 shows a four-step routine employed for rule extraction.
These steps are elaborated as follows.
Fig. 2. The flowchart of rule extraction: a) synthesize samples, b) cluster, c) fit linear models, d) determine $\omega$.
4.1 Synthesizing boundary samples
Let $H(\mathbf{x})$ represent the hypothesis function of a trained
learning model, where $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ denotes a data
sample with $n$-dimensional features. Since it is not always feasible
to model $H(\mathbf{x})$ with a mathematical function, we detect the
sketch of $H(\mathbf{x})$ for a trained model using synthetic samples
generated with Particle Swarm Optimization (PSO) [29]. In
our method, each particle $\mathbf{x}$ represents a candidate boundary
sample whose value is randomly initialized in the search
space. Particles with $H(\mathbf{x}) = \theta$ are samples on the decision
boundary. Hence, we aim to find particles that satisfy

$$\min_{\mathbf{x} \in \chi} |H(\mathbf{x}) - \theta|, \quad \chi \subset \mathbb{R}^n, \tag{1}$$

where $\chi$ is the search space of the boundary samples.
We design a PSO-based boundary sample generation algorithm
(refer to [30] for pseudocode) that repeatedly executes
PSO to generate the required number of synthetic boundary
samples. In each execution of PSO, at most one synthetic
boundary sample can be generated. The reason for this
design is two-fold: firstly, each execution of PSO can quickly
converge because there are infinitely many boundary samples
and the fitness value is the only optimization constraint;
secondly, the generated synthetic samples will spread over
the entire decision boundary instead of huddling together,
owing to the random initialization of the swarm population
and the random parameters used in the velocity update.
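The following is a minimal numpy sketch of one PSO execution for Eq. (1), assuming H returns $P(y=1\,|\,\mathbf{x})$ for a batch of samples (e.g., a wrapper around a trained model's predict_proba); the swarm hyper-parameters are illustrative, not the exact settings of the paper.

```python
# Sketch of one PSO execution of the boundary-sample synthesis in Sect. 4.1.
import numpy as np

def synthesize_boundary_sample(H, bounds, theta=0.5, n_particles=20,
                               n_iters=50, w=0.7, c1=1.5, c2=1.5, tol=1e-3):
    """Run one PSO execution; return one boundary sample or None."""
    lo, hi = bounds                                  # each of shape (n_features,)
    dim = lo.shape[0]
    x = np.random.uniform(lo, hi, size=(n_particles, dim))   # positions
    v = np.zeros_like(x)                                      # velocities
    fit = lambda pts: np.abs(H(pts) - theta)                  # objective of Eq. (1)
    pbest, pbest_val = x.copy(), fit(x)
    g = pbest[np.argmin(pbest_val)]                           # global best
    for _ in range(n_iters):
        r1, r2 = np.random.rand(2, n_particles, 1)            # random velocity factors
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        val = fit(x)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        g = pbest[np.argmin(pbest_val)]
        if pbest_val.min() < tol:        # close enough to the decision boundary
            return g
    return None                          # this execution produced no sample
```

Repeated independent executions of this routine, each with a freshly randomized swarm, accumulate the $N_S$ boundary samples used in the next step.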
4.2 Clustering and linear model fitting
Since $H(\mathbf{x})$ can be complex, we propose to use multiple
linear models to fit the $H(\mathbf{x})$ of a trained learning model, where
each linear model is called a rule. In this step, the synthetic
boundary samples are first divided into groups, and then
the samples in each group are used to fit a linear model, as
illustrated in Fig. 2(b) and Fig. 2(c), where ellipses represent
clusters and solid lines represent the fitted linear models.
4.2.1 Rule representation
A rule is defined by a triple $L(\mathbf{x}) : \langle l(\mathbf{x}), \omega, \mathbf{c} \rangle$. $l(\mathbf{x})$ is
a linear function representing a segment of the decision boundary:

$$l(\mathbf{x}) : \mathbf{a}^T\mathbf{x} + b, \tag{2}$$

where $\mathbf{a}^T = [a_1, a_2, \ldots, a_n]$ is the coefficient vector and $b$
is a constant. $\omega \in \{-1, 1\}$ is a sign used to represent the
relationship between the prediction of $L(\mathbf{x})$ and the output
of $l(\mathbf{x})$. Specifically, $\omega \cdot l(\mathbf{x})$ is the final prediction generated
by rule $L(\mathbf{x})$. For example, if $\omega = -1$, the prediction given
by rule $L(\mathbf{x})$ will be the inverse of $l(\mathbf{x})$'s output. $\mathbf{c}$ is the
centroid of the cluster of samples used to fit $l(\mathbf{x})$. Since
multiple rules may be needed to mimic $H(\mathbf{x})$, $\mathbf{c}$ is used to
determine which rule should be applied to classify a given
sample. Suppose the set of rules extracted from a trained
model is $\mathcal{L}(\mathbf{x}) = \{L_i(\mathbf{x})\},\ i = 1, \cdots, m$. Given a test sample
$\mathbf{x}_j$, the rule used to classify $\mathbf{x}_j$, denoted by $L_k(\mathbf{x}_j)$, is the
one whose centroid has the minimum Euclidean distance to $\mathbf{x}_j$:

$$L_k(\mathbf{x}_j) = \arg\min_{L_i(\mathbf{x}) \in \mathcal{L}(\mathbf{x})} \|\mathbf{x}_j - \mathbf{c}_i\|_2. \tag{3}$$

As shown in Fig. 2(c), the decision boundary is approximated
with three linear models. Since the test sample $\mathbf{x}_t$ is
closest to $\mathbf{c}_2$, $l_2$ is used to predict $\mathbf{x}_t$. The final prediction is
$u(\omega \cdot l_k(\mathbf{x}_j))$, where $u(\cdot)$ is the step function given in Eq. (5).
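A minimal sketch of this rule representation and the nearest-centroid dispatch of Eqs. (2)-(3), with illustrative names:

```python
# Hypothetical Rule container and rule dispatch; not the authors' code.
import numpy as np
from dataclasses import dataclass

@dataclass
class Rule:
    a: np.ndarray   # coefficient vector of l(x) = a^T x + b
    b: float        # intercept
    omega: int      # sign in {-1, +1}
    c: np.ndarray   # centroid of the cluster used to fit l(x)

def predict(rules, x):
    """Classify x with the rule whose centroid is nearest (Eq. (3))."""
    k = np.argmin([np.linalg.norm(x - r.c) for r in rules])
    r = rules[k]
    return 1 if r.omega * (r.a @ x + r.b) >= 0 else 0   # u(omega * l_k(x))
```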
4.2.2 Rule Measurement
To measure how well $\mathcal{L}(\mathbf{x})$ mimics the behaviors of the
trained learning model, we define the fidelity of $\mathcal{L}(\mathbf{x})$ as [31]:

$$Fid(\mathcal{L}(\mathbf{x})) = \frac{1}{|X|}\sum_{\mathbf{x}_j \in X} 1 - |u(H(\mathbf{x}_j) - \theta) - u(\omega \cdot l_k(\mathbf{x}_j))|, \tag{4}$$

where $X$ is the set of training samples, $H(\mathbf{x}_j)$ is the probability
of $\mathbf{x}_j$ belonging to the positive class given by the trained
learning model, $\theta$ is the threshold on prediction probability
for distinguishing the two classes, and $u(\cdot)$ is a step function:

$$u(x) = \begin{cases} 1, & \text{if } x \geq 0; \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

According to Eq. (4), the higher the fidelity, the better the
extracted rules mimic the trained model.
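A short sketch of Eq. (4), reusing the hypothetical predict helper above; H is assumed to return $P(y=1\,|\,\mathbf{x})$ for a batch of samples:

```python
# Fidelity of a rule set against a trained model (Eq. (4)).
import numpy as np

def fidelity(rules, H, X, theta=0.5):
    model_labels = (H(X) >= theta).astype(int)               # u(H(x_j) - theta)
    rule_labels = np.array([predict(rules, x) for x in X])   # u(omega * l_k(x_j))
    return float(np.mean(model_labels == rule_labels))       # agreement rate
```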
4.2.3 Clustering optimization and linear model fitting
We use K-means to generate the initial clusters for the $N_S$
boundary samples. Since it is challenging to choose an appropriate
$K$, we initially generate a relatively large number
of clusters and then merge and split them to obtain the
optimal clusters based on the quality of the fitted linear
models. We set $K$ to $\lfloor N_S/n \rfloor$, where $N_S$ is the number of
boundary samples and $n$ is the feature dimension, since each
cluster needs at least $n$ samples to fit a linear model.
We use the fidelity defined in Eq. (4) and the $R^2$ score defined
below to measure the quality of a fitted linear model:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{j=1}^{t}(H(\mathbf{x}^S_j) - l(\mathbf{x}^S_j))^2}{\sum_{j=1}^{t}(H(\mathbf{x}^S_j) - \overline{H(\mathbf{x}^S)})^2}, \tag{6}$$

where $SS_{res}$ represents the sum of squares of residuals with
respect to the fitted values, and $SS_{tot}$ represents the sum of
squares with respect to the average value. $R^2$ ranges in $[0, 1]$;
the larger the value, the better the linear model fits.
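As one concrete realization of the per-cluster fitting step, a hyperplane can be fitted by total least squares, taking the direction of least variance as the normal vector; this is a sketch under that assumption, not necessarily the paper's exact fitting procedure:

```python
# Fit one rule's hyperplane l(x) = a^T x + b to a cluster of boundary
# samples via a total-least-squares (PCA) fit.
import numpy as np

def fit_hyperplane(samples):
    c = samples.mean(axis=0)                  # cluster centroid (the rule's c)
    _, _, vt = np.linalg.svd(samples - c)     # principal directions
    a = vt[-1]                                # least-variance direction = normal
    b = -float(a @ c)                         # so that l(c) = a^T c + b = 0
    return a, b, c
```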
The pseudocode for generating linear models is given
in Algo. 1. Two user-defined thresholds ($T_{split}$ and $T_{merge}$)
are used to control splitting and merging, respectively. If the
$R^2$ score for a cluster is lower than $T_{split}$, we use K-means
to split it into two clusters $CLU_{i1}$ and $CLU_{i2}$ (line 9). To
properly fit each cluster to a linear model, the number of
samples in a cluster must be no less than the number of
features ($n$). If both clusters have fewer than $n$ samples, the
original cluster is further tested based on fidelity (lines 10-11).
If only one of the two clusters has fewer than $n$ samples, that
cluster is treated as noise and discarded. Otherwise, the two
new clusters are fitted and their $R^2$ scores are calculated to
check whether they need to be further split (lines 14-18). For
testing with fidelity (lines 19-28), a cluster in $CLU_{fidelity\,test}$
is kept only when including it increases the fidelity. For merging
(lines 29-38), a cluster is merged with a neighboring cluster
if the merged cluster has an $R^2$ score no less than $T_{merge}$.
4.3 Determining the sign ω
Since each $l(\mathbf{x})$ is fitted without using the label information,
a sign $\omega$ needs to be associated with each $l(\mathbf{x})$ to indicate
which side is positive. To determine $\omega$ for a given $l_i(\mathbf{x})$, a
set of synthetic samples called probing samples is generated
subject to the following two constraints. (1) The probing samples
are distributed on the normal line of $l_i(\mathbf{x})$ that crosses
the centroid of the corresponding cluster used to fit $l_i(\mathbf{x})$.
This is because linear models may not fit the corners of the
decision boundary well. For example, $\mathbf{x}^{P+}_a$ and $\mathbf{x}^{P-}_a$ in
Fig. 2(d) are located at a corner of the decision boundary,
and using these two samples as probing samples may yield
an incorrect $\omega$. (2) Among all centroids, a probing sample
must be closest to the centroid of $l_i(\mathbf{x})$ to avoid
interference from other parts of the decision boundary. For
example, $\mathbf{x}^{P+}_{b2}$ should not be used as a probing sample for
$l_1(\mathbf{x})$. Let $d^{min}_i$ be the minimum Euclidean distance between
the centroid $\mathbf{c}_i$ of $l_i(\mathbf{x})$ and the other centroids. The probing
samples for $l_i(\mathbf{x})$ are then generated in a pairwise way by
repeating the following two steps: (1) randomly choose a
value $\beta$ from the range $(0, d^{min}_i/2)$ as the distance between
the probing sample and $\mathbf{c}_i$; (2) generate a pair of probing
samples $\langle \mathbf{x}^{P+}, \mathbf{x}^{P-} \rangle$ along the normal vector of $l_i(\mathbf{x})$
as follows: $\mathbf{x}^{P+} = \mathbf{c}_i + \beta\frac{\mathbf{a}}{\|\mathbf{a}\|}$, $\mathbf{x}^{P-} = \mathbf{c}_i - \beta\frac{\mathbf{a}}{\|\mathbf{a}\|}$, where
$\mathbf{a}$ is a normal vector of the hyperplane formed by $l_i(\mathbf{x})$.
Hence, $\mathbf{x}^{P+}$ and $\mathbf{x}^{P-}$ are located on different sides of the
hyperplane and have the same distance to $\mathbf{c}_i$, as illustrated
in Fig. 2(d). The sign $\omega$ determined by $\langle \mathbf{x}^{P+}, \mathbf{x}^{P-} \rangle$
is then calculated as follows:

$$\omega = \begin{cases} 1, & \text{if } (H(\mathbf{x}^{P+}) - \theta) \cdot l(\mathbf{x}^{P+}) > 0 \text{ and } (H(\mathbf{x}^{P-}) - \theta) \cdot l(\mathbf{x}^{P-}) > 0; \\ -1, & \text{if } (H(\mathbf{x}^{P+}) - \theta) \cdot l(\mathbf{x}^{P+}) < 0 \text{ and } (H(\mathbf{x}^{P-}) - \theta) \cdot l(\mathbf{x}^{P-}) < 0. \end{cases} \tag{7}$$

If the values of $\omega$ determined by more than 80% of
the probing samples are consistent, this result is accepted;
otherwise, the current set of probing samples is discarded
and a new set is generated. If $\omega$ cannot be determined after
generating the set of probing samples more than ten times,
the current $l(\mathbf{x})$ is considered to seriously deviate from the
decision boundary and is discarded.
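A sketch of the pairwise probing test of Eq. (7); here (a, b, c) denote a fitted hyperplane and its centroid as in the earlier sketch, d_min is the distance from c to the nearest other centroid, and H is assumed to return $P(y=1\,|\,\mathbf{x})$ for a single sample:

```python
# Determine the sign omega of one rule via probing-sample pairs (Eq. (7)).
import numpy as np

def determine_omega(H, a, b, c, d_min, theta=0.5, n_pairs=20, agree=0.8):
    n = a / np.linalg.norm(a)                 # unit normal of the hyperplane
    votes = []
    for _ in range(n_pairs):
        beta = np.random.uniform(0.0, d_min / 2)
        xp, xm = c + beta * n, c - beta * n   # one pair across the plane
        sp = (H(xp) - theta) * (a @ xp + b)
        sm = (H(xm) - theta) * (a @ xm + b)
        if sp > 0 and sm > 0:
            votes.append(1)                   # Eq. (7), first case
        elif sp < 0 and sm < 0:
            votes.append(-1)                  # Eq. (7), second case
    for w in (1, -1):
        if votes.count(w) > agree * n_pairs:  # >80% of pairs consistent
            return w
    return None   # caller regenerates probing samples or discards the rule
```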
Algorithm 1: Clustering and linear model fitting.
Input: S - synthesized boundary samples; T_split; T_merge; H(x)
Output: L_filter(x) - filtered linear models
 1: CLU = K-means(S, K), where K = ⌊N_S/n⌋;
 2: foreach CLU_i in CLU do
 3:   l_i(x) ← fit CLU_i;
 4:   if R²_i ≥ T_split then
 5:     add CLU_i to CLU_filter;
 6:     remove CLU_i from CLU;
 7:     add l_i(x) to L_filter(x);
 8:   else
 9:     {CLU_i1, CLU_i2} ← K-means(k = 2, CLU_i);
10:     if |CLU_i1| < n and |CLU_i2| < n then
11:       add CLU_i to CLU_fidelity-test;
12:       remove CLU_i from CLU;
13:     else
14:       foreach CLU_j in {CLU_i1, CLU_i2} do
15:         if |CLU_j| ≥ n then
16:           add CLU_j to CLU;
17:         else
18:           discard CLU_j;
19: Fid_filter ← evaluate L_filter(x) with H(x);
20: foreach CLU_i in CLU_fidelity-test do
21:   l_i(x) ← fit CLU_i;
22:   L(x) ← add l_i(x) to L_filter;
23:   Fid ← evaluate L(x) with H(x);
24:   if Fid > Fid_filter then
25:     add CLU_i to CLU_filter;
26:     L_filter(x) = L(x);
27:     Fid_filter = Fid;
28:   remove CLU_i from CLU_fidelity-test;
29: p = 0;
30: while p < |CLU_filter| do
31:   CLU_q = nearest cluster of CLU_p;
32:   l_pq(x) ← fit (CLU_p ∪ CLU_q);
33:   R²_pq ← evaluate l_pq(x);
34:   if R²_pq ≥ T_merge then
35:     CLU_filter[p] = CLU_p ∪ CLU_q;
36:     remove CLU_q from CLU_filter;
37:     add l_pq(x) to L_filter(x);
38:   else p = p + 1;
5 RULE FUSION AND SELECTION
Since the datasets used to train the local learning models can
have a similar distribution with some degrees of diversity,
there might be many redundant or even contradictory rules.
Hence, simply concatenating the rules generated by all local
models will not be very effective in providing the best
performance. To address this issue, we propose a PBIL-
based Rule Fusion and Selection (PBIL-RFS) scheme to select
a set of rules that maximizes the Area Under the ROC Curve
(AUC). In PBIL-RFS, the best set of rules is calculated itera-
tively and in each iteration the server needs to communicate
with the participants in order to compute AUC. Therefore,
reducing the number of rules before running PBIL-RFS can
significantly reduce the communication overhead and speed
up the convergence of PBIL-RFS. We also observe that
reducing the number of rules is very effective in avoiding
overfitting. For these reasons, we further design two methods
to simplify the rules. One method is to merge the redundant
rules before PBIL-RFS based on a new rule distance metric,
and the other method is to use a modified fitness function
to select a small number of rules to maximize classification
performance and minimize redundancy at the same time.
5.1 PBIL-RFS
In PBIL-RFS, the rules received from local participants are
pooled and selected at the server, but the evaluation of
the selected rules is done at participants in a distributed
manner since only the participants have training data. The
whole process of PBIL-RFS is illustrated in solid boxes in
Fig. 3. Initially, the server merges the rules received from
all participants, shuffles them, and then sends the whole
set of rules to all participants. However, in the following
optimization process, only the gene strings instead of the
selected rules need to be sent to local participants, which
can significantly reduce the amount of communication.
Coding: We encode rule selection using a binary “gene”
code, $\mathbf{v} = [v_1, v_2, \cdots, v_{N_R}]$, with $N_R$ being the total number
of rules. For each bit, $v_i = 1$ indicates that the $i$-th rule is
tentatively chosen, and $v_i = 0$ means it is removed.

Population generation and evaluation: PBIL keeps a probability
vector $\boldsymbol{\pi} = [\pi_1, \pi_2, \cdots, \pi_{N_R}]$. To generate the genes, $v_i$
is assigned “1” with probability $\pi_i$ and “0” with probability
$1 - \pi_i$. Initially, each $\pi_i$ is set to 0.5. In each iteration, a
population of $N_{gene}$ genes is generated based on $\boldsymbol{\pi}$ and
sent to all participants. Each participant first decodes these
genes to generate $N_{gene}$ rule sets to be evaluated. The fitness
values are AUC scores calculated by using the rule sets to
classify the local training data. The AUCs are then sent back
to the server and aggregated separately for each gene.

Update: The best solution up to the current generation, $\mathbf{v}^b$,
is used to update the probability vector as follows:

$$\boldsymbol{\pi}^{t+1} = \boldsymbol{\pi}^t(1 - \delta) + \delta \mathbf{v}^b, \tag{8}$$

where $\pi^t_i$ is the probability of the $t$-th generation in bit
position $i$, $v^b_i$ is the value of the $i$-th position in the best
solution, and $\delta \in (0, 1)$ is the learning rate. To help preserve
diversity, mutation is performed on the probability vector
as follows:

$$\pi^{t+1}_i = \pi^t_i(1 - \mu) + r\mu, \tag{9}$$

where $\mu$ controls the extent of mutation, and $r$ takes 0
or 1 randomly. To avoid losing the optimum, the current
optimum is carried over into the next generation.

The generation, evaluation, and update procedures are
repeated until the disparity in fitness values between two
consecutive generations is smaller than a threshold $P$.
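A minimal sketch of this loop, with `evaluate` standing in for the distributed per-gene AUC aggregation across participants; the stopping rule follows the fitness-disparity threshold $P$ described above:

```python
# PBIL-style rule selection loop (Eqs. (8)-(9)); a sketch, not the
# authors' implementation.
import numpy as np

def pbil_select(n_rules, evaluate, n_genes=20, delta=0.02, mu=0.2, P=1e-4,
                max_gen=500):
    pi = np.full(n_rules, 0.5)                     # probability vector
    best_v, best_f, prev_f = None, -np.inf, None
    for _ in range(max_gen):
        pop = (np.random.rand(n_genes, n_rules) < pi).astype(int)
        if best_v is not None:
            pop[0] = best_v                        # keep the current optimum
        fits = np.array([evaluate(v) for v in pop])
        g = int(np.argmax(fits))
        if fits[g] > best_f:
            best_v, best_f = pop[g].copy(), fits[g]
        pi = pi * (1 - delta) + delta * best_v     # Eq. (8): pull toward best
        r = np.random.randint(0, 2, n_rules)
        pi = pi * (1 - mu) + r * mu                # Eq. (9): mutation
        if prev_f is not None and abs(best_f - prev_f) < P:
            break                                  # fitness has converged
        prev_f = best_f
    return best_v
```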
5.2 Merging rules before PBIL-RFS
To determine which rules can be merged, we need a metric
to measure the similarity of two rules. The $R^2$ score used in
Algo. 1 is no longer suitable for this task, because it relies on
data samples, which are unavailable at the server for privacy
reasons. Hence, we propose a new distance metric,
called “rule distance”, to directly measure the similarity
between two rules without using any data sample.
Fig. 3. PBIL-based rule fusion and selection. The subfigure in the upper left is an example of the coding and decoding scheme.
For any two rules $L_i(\mathbf{x}) : \langle l_i(\mathbf{x}), \omega_i, \mathbf{c}_i \rangle$ and
$L_j(\mathbf{x}) : \langle l_j(\mathbf{x}), \omega_j, \mathbf{c}_j \rangle$, the distance between them, denoted by
$RD(L_i(\mathbf{x}), L_j(\mathbf{x}))$, is defined as follows:

$$RD(L_i(\mathbf{x}), L_j(\mathbf{x})) = CD(l_i(\mathbf{x}), l_j(\mathbf{x})) + \|\mathbf{c}_i - \mathbf{c}_j\|, \tag{10}$$

where $CD(l_i(\mathbf{x}), l_j(\mathbf{x}))$ represents the Cosine distance between
the elements $l_i(\mathbf{x}) : \mathbf{a}_i^T\mathbf{x} + b_i$ and $l_j(\mathbf{x}) : \mathbf{a}_j^T\mathbf{x} + b_j$,
defined as follows:

$$CD(l_i(\mathbf{x}), l_j(\mathbf{x})) = 2\left(1 - \frac{\langle \mathbf{a}_i, \mathbf{a}_j \rangle}{\|\mathbf{a}_i\|\,\|\mathbf{a}_j\|}\right), \tag{11}$$

where $\langle \cdot \rangle$ represents the inner product and $\|\cdot\|$ denotes
the Euclidean norm. It measures the cosine of the angle
between the two vectors $\mathbf{a}_i$ and $\mathbf{a}_j$ and determines whether
the two vectors point in roughly the same direction.
Note that the distance between $b_i$ and $b_j$ is not considered
because its information is carried by $\mathbf{c}_i$ and $\mathbf{c}_j$.

To ensure both types of distances have the same weight,
we normalize each distance to $[0, 1]$ using Min-Max normalization
individually before summing them. The pseudocode
for merging rules is given in Algo. 2. Two rules
$L_i(\mathbf{x})$ and $L_j(\mathbf{x})$ will be merged if (1) $\omega_i = \omega_j$, and (2)
$RD(L_i(\mathbf{x}), L_j(\mathbf{x}))$ is no larger than a pre-defined threshold
$\theta_m$. The new rule generated by merging, denoted by
$L_m : \langle l_m(\mathbf{x}), \omega_m, \mathbf{c}_m \rangle$, is defined as follows:

$$l_m(\mathbf{x}) = \frac{1}{2}(\mathbf{a}_i + \mathbf{a}_j)^T\mathbf{x} + \frac{1}{2}(b_i + b_j), \tag{12}$$

$$\mathbf{c}_m = \frac{1}{2}(\mathbf{c}_i + \mathbf{c}_j), \tag{13}$$

$$\omega_m = \omega_i = \omega_j. \tag{14}$$
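A sketch of Eqs. (10)-(14), reusing the hypothetical Rule dataclass from Sect. 4 and omitting, for brevity, the min-max normalization of the two distance terms over the rule pool:

```python
# Rule distance and pairwise merge; a sketch, not the authors' code.
import numpy as np

def rule_distance(ri, rj):
    cos = (ri.a @ rj.a) / (np.linalg.norm(ri.a) * np.linalg.norm(rj.a))
    cd = 2 * (1 - cos)                          # Eq. (11): cosine distance
    return cd + np.linalg.norm(ri.c - rj.c)     # Eq. (10): add centroid distance

def merge_rules(ri, rj):
    assert ri.omega == rj.omega                 # Eq. (14): same sign required
    return Rule(a=(ri.a + rj.a) / 2,            # Eq. (12): average coefficients
                b=(ri.b + rj.b) / 2,
                omega=ri.omega,
                c=(ri.c + rj.c) / 2)            # Eq. (13): average centroids
```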
5.3 Regularizing rule selection
If two rule sets have the same AUC value at the current
generation of PBIL, it is reasonable to give a higher fitness
value to the one with a smaller number of rules. Besides
maximizing the performance of the selected rule set mea-
sured by the average AUC value across all participants, it is
desired to reduce the complexity of the global model, hence
avoiding potential overfitting. We propose the following
regularized fitness function to evaluate a rule set $S$ in PBIL:

$$\Phi(S) = \alpha \cdot \frac{1}{N_P}\sum_{i=1}^{N_P} AUC_i(S) - (1 - \alpha)\frac{|S|}{N_R}, \tag{15}$$

where $N_P$ is the number of participants, $N_R$ is the total
number of rules, and $\alpha \in [0, 1]$ is a weighting coefficient.

Algorithm 2: Algorithm for merging rules.
Input: L - set of rules to merge; θ_m - user-defined merging threshold
Output: L - the set of rules after merging
 1: i = 0;
 2: while i < |L| do
 3:   RD_ik ← RD(L_i, L_k), ∀ L_k ∈ L, using Eq. (10);
 4:   find the best-matching rule L_j: j = argmin_{k≠i} RD_ik;
 5:   if RD_ij ≤ θ_m and ω_i == ω_j then
 6:     L_m ← merge L_i and L_j using Eqs. (12)-(14);
 7:     L_i ← L_m;
 8:     remove L_j from L;
 9:   else
10:     i = i + 1;
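A one-function sketch of Eq. (15); auc_scores holds the AUCs reported by the $N_P$ participants for one gene, and n_selected is $|S|$:

```python
# Regularized fitness of a selected rule set (Eq. (15)).
import numpy as np

def fitness(auc_scores, n_selected, n_rules_total, alpha=0.9):
    return alpha * float(np.mean(auc_scores)) \
           - (1 - alpha) * n_selected / n_rules_total
```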
5.4 Speeding-up local rule evaluation
In each iteration of PBIL-RFS, the local datasets are used to
evaluate the selected rules rather than to train the local models.
This implies that, by using only representative samples,
it is possible to achieve the same level of performance as
using all samples, while reducing the running time of rule
evaluation and improving communication efficiency. Hence,
we under-sample each local dataset to generate a set of
representative samples for rule evaluation.
Samples close to the decision boundaries are commonly
considered more important than those far away from them.
Therefore, we assign higher selection probabilities to samples
that are closer to the decision boundaries. More specifically, we
measure the weight of each sample using the information
obtained from its K-nearest neighbors. Assuming
$X^{NN} = \{\mathbf{x}^{NN}_1, \cdots, \mathbf{x}^{NN}_K\}$ is the set of K-nearest
neighbors of data sample $\mathbf{x}_i$ and $Y^{NN} = \{y^{NN}_1, \ldots, y^{NN}_K\}$ is
the corresponding label set of $X^{NN}$, the weight $w_i$ assigned
to the $i$-th sample is calculated as follows:

$$w_i = \frac{|\{\mathbf{x}^{NN}_j \mid \mathbf{x}^{NN}_j \in X^{NN},\ y_i \neq y^{NN}_j\}|}{K}. \tag{16}$$
The representative samples are selected by choosing the
samples with larger weights. Since selecting only the
samples near the decision boundary may not sketch the
distribution of a dataset well, we also randomly select a small
portion (e.g., 10%) of samples from the remaining dataset
and add them to the previously selected samples.
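A sketch of the weighting step of Eq. (16) using scikit-learn's NearestNeighbors; the extra neighbor accounts for each query point returning itself:

```python
# Boundary-aware sample weights for undersampling (Eq. (16)).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def boundary_weights(X, y, k=5):
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is each point itself
    neighbor_labels = y[idx[:, 1:]]        # labels of the K nearest neighbors
    return (neighbor_labels != y[:, None]).mean(axis=1)   # disagreement rate
```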
6 EVALUATION
In this section, we evaluate the effectiveness and efficiency
of CloREF in comparison with several state-of-the-art so-
lutions through extensive experiments. We first evaluate
CloREF in terms of the quality of rule extraction, the effectiveness
of rule fusion and selection, convergence, and communication
cost using 8 two-class datasets. After that, we demonstrate
the robustness of CloREF on non-IID datasets. Finally, we
use anomaly detection in an IoT-based smart home as a case
study to show the practicability of CloREF.
6.1 Experiment setup
Datasets and models: We evaluate CloREF on the following
8 public two-class datasets provided in the KEEL [32]
and UCI [33] repositories: wisconsin, glass-0-1-2-3_vs_4-5-6,
segment0, yeast1, page-blocks0, vehicle1, pimaImb, and KDDCup99.
Five learning models, namely SVM, naive Bayesian
classifier (NB), MLP, logistic regression classifier (LR), and
logistic regression classifier fitted via SGD (SGD), are used
as local models. Some models are tested with different
configurations, e.g., SVM with different kernels and MLP
with different numbers of hidden layers and neurons. All
models are implemented using scikit-learn [34].
Methodology and metrics: For each experiment, 5-fold
cross validation is used. The training dataset is further split
into sub-datasets, each used as the private dataset of one
participant. Specifically, the dataset is first shuffled and
then randomly split into subsets with the same number of
samples in each. To ensure each participant has enough
samples to train its learning model, the number of participants
varies according to the size of the original dataset. The type of
local model is randomly selected from
a predefined model list. We compare CloREF with the
following baseline schemes:
FedAvg [12]: Used as a baseline for federated learning with
homogeneous settings.
FedMD [6]: The state-of-the-art FL algorithm with hetero-
geneous settings. For a fair comparison, we replace the
CNN with a fully connected neural network in the source-
code provided by the authors. One subset of the training
dataset is used as the public dataset to be shared with all
participants. Hence, the number of participants in FedMD
is always one less than that in other compared schemes.
HeteroFL [8]: Each participant trains a static subnetwork
of the global deep neural network by selecting a subset of
neurons in each hidden layer based on a shrinkage ratio.
FedRolex [21]: Different from HeteroFL, it employs a rolling
sub-model extraction scheme to guarantee that different
parts of the global model can be evenly trained.
Centralized Model: The best-performing learning model
centrally trained with all training data.
All Rules: An ablated version of CloREF without rule
fusion and selection at the server.
The following four metrics are used to evaluate CloREF in
comparison with the baselines: fidelity (defined in Eq. (4)),
G-means (GM), accuracy (ACC), and AUC. All experiments
were run on a server with 40 CPU cores and 64GB RAM.
The CPU model is Intel(R) Xeon(R) E5-2630 v4 @ 2.20GHz.
Parameter setting: Table 1 gives the setting of key param-
eters. The local models of FedMD are decided by random
selection from a pre-defined neural network list. HeteroFL
and FedRolex use the same architecture of global models
as FedAvg (given in Table 3). The local models of HeteroFL
and FedRolex are extracted from the global model based
on the given model shrinkage ratios. We follow the original
papers to set the model shrinkage ratios for HeteroFL and
FedRolex, only changing the multiplying factor from 0.5 to
0.8 to ensure every sub-model has enough neurons since our
global model is simpler.
Fig. 4. Fidelity of extracted rules at local models on five datasets (wisconsin, glass-0-1-2-3_vs_4-5-6, segment0, pimaImb, vehicle1); the solid and dashed lines show the accuracy of the trained learning model and of the extracted rules, respectively. Models are color-coded by their corresponding algorithms.
TABLE 1
The setting of key parameters.

Methods              | Parameters                    | Values
PSO                  | θ, N_S, N_P, N_G              | 0.5, 20n, 20, 50
Clustering           | T_split, T_merge              | 0.75, 0.95
PBIL-RFS             | N_gene, δ, µ, P, θ_m, α       | 20, 0.02, 0.2, 1.0e-4, 0.02, 0.9
Common FL settings   | communication rounds          | 100
                     | B, E, LR                      | 10, 5, 0.01
                     | loss                          | binary crossentropy
                     | activation functions          | (relu, sigmoid)
FedMD                | pre-defined neural networks   | [(20, 20, 1), (10, 10, 10, 1), (10, 10, 1), (5, 5, 5, 1), (5, 5, 1)]
HeteroFL, FedRolex   | model shrinkage ratio         | {1, 0.8, 0.64, 0.512, 0.40}
6.2 Results of rule extraction
We use fidelity and accuracy to evaluate how well the ex-
tracted rules can mimic the behaviors of the trained learning
model. Fig. 4 shows the results for five datasets (results from
other datasets show similar trends). The y-axis shows the
fidelity. The solid line and the dashed line represent the
accuracy of the trained learning model and the extracted
rules, respectively. One key observation is the achieved
high fidelity (>0.95 for most of the tests), which means
the extracted rules are able to well mimic the behaviors
of different machine learning models. Our method achieves
very high and stable fidelity for SGD and LR, both being
linear learning models. For the non-linear models such as
NB and MLP, our method also achieves very good fidelity.
It can also be seen that the fidelity of the extracted rules in
mimicking the same learning model varies across datasets
due to the difference in the complexity of the decision
boundary. Another key observation is that there is no close
relationship between fidelity and accuracy, because fidelity
can still be very high if both the trained learning model
and the extracted rules give the wrong classification results.
For example, the fidelity on vehi1 and pima is much higher
than the achieved accuracy due to the low separability of
these two datasets. The accuracy of the extracted rules is
higher than that of the trained learning model on se0. One
possible reason is that se0 is a linearly separable dataset and
some complex models may result in overfitting. This can be
verified by the results of the SGD model. SGD is a simple
linear model, but it achieves better performance than that of
the non-linear models on se0. Since the number of extracted
rules for this dataset is only two or three, the decision
boundary would be simplified and hence the overfitting
problem could be avoided to some extent.
6.3 Results of fusion and selection
In Sect. 6.3.1 and Sect. 6.3.2, we perform a series of ablation
studies to validate the effectiveness of the key components
of PBIL-RFS. In Sect. 6.3.3, the overall performance of PBIL-
RFS is evaluated by comparing it with four baselines using
two synthetic datasets and eight real-world datasets.
6.3.1 Benefits of merging rules
Fig. 5 shows the performance of our rule fusion and selection
scheme on the vehi1, pima, and yea1 datasets with
different settings of the rule merging threshold $\theta_m$ in Algo. 2
and the weighting parameter $\alpha$ in Eq. (15). The results
in the first column show that, as expected, the average number
of rules after merging decreases as $\theta_m$ increases.
When $\theta_m \leq 0.02$, merging rules before PBIL-RFS does not
reduce AUC, as can be seen from the results given in the
second column. When $\theta_m$ is further increased, the achieved
AUC drops, especially when $\theta_m \geq 0.1$. This is because a
larger $\theta_m$ may over-merge rules, resulting in a final rule set
with poor fidelity to the decision boundary of the original
local model. It is worth noting that, when $\alpha$ is decreased
to give a smaller weight to AUC in the fitness function,
AUC drops only slightly. However, the number of rules after
PBIL-RFS is significantly reduced when $\alpha$ is decreased from
1 to 0.7, as shown in the third column. This demonstrates
the efficiency of PBIL-RFS in removing redundant rules.
The last column shows the relative time cost compared to
the baseline where both rule merging and regularization are
disabled (i.e., $\theta_m = 0$ and $\alpha = 1$). It can be observed that
the relative time cost decreases with the decrease of $\alpha$ and
the increase of $\theta_m$, because both reduce the number of rules.
6.3.2 Benefits of speeding-up local rule evaluation
Table 2 shows the relative results of fusing rules using
undersampled data compared to using all data samples,
with means and standard deviations computed over 5 independent
runs. 0.4B denotes that 40 percent of all samples are
sampled according to their weights calculated by Eq. (16),
whereas 0.1R denotes that 10 percent of all samples are
randomly sampled from the samples left by 0.4B. The
sampling ratio for each original dataset is determined according
to its size, with a smaller sampling ratio for large datasets
(e.g., se0, pa0, kdd).
Fig. 5. The performance with different values of $\alpha$ and $\theta_m$ on (a) vehicle1, (b) pimaImb, and (c) yeast1 (per column: number of rules after merging, AUC, number of rules after PBIL-RFS, and relative time cost). Different colors of lines represent different values of $\theta_m$.
TABLE 2
Comparison results of fusing rules using undersampled samples versus all samples.

Dataset | Sampling Ratio | AUC_US/AUC_All | T_US/T_All
wisc    | 0.4B, 0.1R     | 1.002 (±0.01)  | 0.528 (±0.07)
glas    | 0.4B, 0.1R     | 0.997 (±0.05)  | 0.529 (±0.06)
vehi1   | 0.4B, 0.1R     | 1.019 (±0.03)  | 0.428 (±0.02)
pima    | 0.4B, 0.1R     | 1.029 (±0.04)  | 0.526 (±0.10)
yea1    | 0.4B, 0.1R     | 1.000 (±0.00)  | 0.516 (±0.03)
se0     | 0.2B, 0.05R    | 1.005 (±0.01)  | 0.250 (±0.04)
pa0     | 0.2B, 0.05R    | 1.021 (±0.01)  | 0.258 (±0.03)
kdd     | 0.04B, 0.01R   | 1.003 (±0.00)  | 0.048 (±0.01)
We use $AUC_{US}/AUC_{All}$ to represent the relative AUC value,
where $AUC_{US}$ and $AUC_{All}$ are the AUC values achieved
using undersampled samples and all samples, respectively.
Similarly, we use $T_{US}/T_{All}$ to represent the relative running
time of the whole rule fusion process using undersampled
samples compared to using all samples.
Two interesting points can be observed from Table 2.
Firstly, all the relative AUC values are greater than or equal
to one except for the glas dataset. This demonstrates that using
an appropriate proportion of undersampled samples to fuse
rules does not reduce performance and can even improve it
to some extent. One potential reason is that using too many
samples to fuse rules might result in over-complicated decision
boundaries and therefore reduce the ability to generalize.
Secondly, the reduction in running time is almost proportional
to the undersampling ratio.
6.3.3 Overall performance of PBIL-RFS
Fig. 6. Decision boundaries generated by four learning models (RBF SVM, Random Forest, Multi-Layer Perceptron, Naive Bayes) and a set of fused rules (CloREF) in two-dimensional space on two synthetic datasets.
To show how well the rules after fusion and selection ap-
proximate the decision boundaries, Fig. 6 gives the decision
boundaries generated by CloREF and four other learning
models for two 2D synthesized datasets. The first dataset
circles, generated by Scikit-learn’s make_circles utility, has
a spherical decision boundary. The second dataset, spirals,
is generated using make_moons in Scikit-learn. The circles
dataset is split into five subsets and the spirals dataset is split
into four subsets. For CloREF, each local model is trained
with one subset of the whole training dataset. For all the
other four learning models, they are trained using the whole
training dataset. Owing to the advantage of using multiple
linear models, it can be seen that CloREF is capable of
approximating complex decision boundaries. For the spirals
dataset, the accuracy achieved by CloREF (value given at
the right bottom of each sub-figure) is better than that of
other machine learning models, which demonstrates that
CloREF is able to fuse the knowledge learned by different
machine learning models and generate effective decision
boundaries. For circles, the accuracy achieved by CloREF is
slightly lower than that of Random Forest and Naive Bayes
but higher than that of RBF SVM and MLP. This is because
the boundary samples in the circles dataset are more difficult
to separate as some of them overlap. As we pursue
compact rules in the fusion process, some low-quality
rules are merged or discarded, resulting in a slight
drop in accuracy. For spirals, even though the structure of
its decision boundaries is much more complex, CloREF achieves
the best accuracy as there are clear boundaries between the
samples in the two classes. In addition, it can be seen from
Fig. 6 that, even though all models are trained using the
same dataset, the decision boundaries generated by different
learning models and the resulting accuracies are different.
This also demonstrates the benefits of using heterogeneous
learning models at participants to increase diversity.

TABLE 3
The average results of collaborative learning on eight real-world datasets in full feature space. #Rules represents the number of rules. Average performance and standard deviation (in brackets) are listed for two metrics, accuracy and AUC, over the columns: Average Participants, All Rules, CloREF, Centralized Model, FedMD, FedAvg, HeteroFL, FedRolex.

yea1 (5): #Rules = 1.0 (±0.0); Centralized Model = SGD; FL network = fully con(10,10)
  Accuracy: 0.701 (±0.01) | 0.701 (±0.02) | 0.683 (±0.04) | 0.720 (±0.04) | 0.732 (±0.01) | 0.736 (±0.03) | 0.733 (±0.04) | 0.736 (±0.03)
  AUC:      0.673 (±0.01) | 0.642 (±0.06) | 0.699 (±0.03) | 0.677 (±0.05) | 0.642 (±0.02) | 0.641 (±0.03) | 0.652 (±0.05) | 0.649 (±0.06)

glas (4): #Rules = 13.2 (±4.02); Centralized Model = SVM (rbf); FL network = fully con(10,10,10)
  Accuracy: 0.916 (±0.02) | 0.916 (±0.05) | 0.949 (±0.02) | 0.944 (±0.04) | 0.882 (±0.03) | 0.925 (±0.03) | 0.925 (±0.05) | 0.950 (±0.02)
  AUC:      0.882 (±0.01) | 0.891 (±0.07) | 0.926 (±0.03) | 0.923 (±0.09) | 0.838 (±0.06) | 0.904 (±0.04) | 0.903 (±0.05) | 0.933 (±0.03)

pa0 (19): #Rules = 21.4 (±6.74); Centralized Model = MLP (10,10); FL network = fully con(10,10)
  Accuracy: 0.922 (±0.02) | 0.921 (±0.01) | 0.934 (±0.01) | 0.951 (±0.02) | 0.928 (±0.01) | 0.953 (±0.00) | 0.946 (±0.01) | 0.944 (±0.01)
  AUC:      0.745 (±0.03) | 0.705 (±0.07) | 0.816 (±0.03) | 0.908 (±0.02) | 0.747 (±0.03) | 0.853 (±0.02) | 0.788 (±0.03) | 0.746 (±0.03)

se0 (14): #Rules = 3.8 (±2.14); Centralized Model = MLP (10,10,10); FL network = fully con(10,10,10)
  Accuracy: 0.971 (±0.01) | 0.912 (±0.06) | 0.991 (±0.01) | 0.994 (±0.00) | 0.975 (±0.01) | 0.997 (±0.00) | 0.997 (±0.00) | 0.998 (±0.00)
  AUC:      0.923 (±0.05) | 0.933 (±0.03) | 0.986 (±0.01) | 0.992 (±0.00) | 0.932 (±0.02) | 0.993 (±0.01) | 0.994 (±0.01) | 0.996 (±0.00)

veh1 (5): #Rules = 13.6 (±4.22); Centralized Model = MLP (10,10,10); FL network = fully con(10,10,10)
  Accuracy: 0.727 (±0.02) | 0.739 (±0.03) | 0.746 (±0.05) | 0.804 (±0.05) | 0.725 (±0.02) | 0.782 (±0.03) | 0.782 (±0.04) | 0.792 (±0.05)
  AUC:      0.626 (±0.03) | 0.640 (±0.04) | 0.705 (±0.04) | 0.669 (±0.12) | 0.570 (±0.06) | 0.641 (±0.12) | 0.629 (±0.12) | 0.625 (±0.11)

wisc (5): #Rules = 16 (±0.80); Centralized Model = MLP (10,10); FL network = fully con(10,10)
  Accuracy: 0.966 (±0.01) | 0.961 (±0.01) | 0.970 (±0.01) | 0.968 (±0.01) | 0.954 (±0.01) | 0.967 (±0.01) | 0.969 (±0.01) | 0.970 (±0.01)
  AUC:      0.968 (±0.01) | 0.960 (±0.01) | 0.972 (±0.01) | 0.970 (±0.01) | 0.952 (±0.01) | 0.965 (±0.01) | 0.969 (±0.01) | 0.970 (±0.01)

pima (5): #Rules = 21.2 (±6.62); Centralized Model = MLP (10,10); FL network = fully con(10,10)
  Accuracy: 0.706 (±0.02) | 0.707 (±0.06) | 0.738 (±0.03) | 0.757 (±0.02) | 0.700 (±0.01) | 0.753 (±0.03) | 0.733 (±0.03) | 0.747 (±0.04)
  AUC:      0.684 (±0.02) | 0.680 (±0.06) | 0.720 (±0.03) | 0.723 (±0.04) | 0.676 (±0.03) | 0.718 (±0.06) | 0.721 (±0.04) | 0.715 (±0.09)

kdd (14): #Rules = 53.2 (±12.27); Centralized Model = SVM (poly); FL network = fully con(5,5,5)
  Accuracy: 0.932 (±0.16) | 0.675 (±0.31) | 0.985 (±0.01) | 0.999 (±0.00) | 0.975 (±0.02) | 0.999 (±0.00) | 0.987 (±0.02) | 0.995 (±0.00)
  AUC:      0.939 (±0.15) | 0.724 (±0.31) | 0.984 (±0.02) | 0.998 (±0.00) | 0.984 (±0.02) | 0.998 (±0.00) | 0.973 (±0.05) | 0.993 (±0.00)
Table 3 compares CloREF with seven baselines on the
eight datasets using 5-fold cross validation. FedAvg is also
included to show the performance of CloREF
compared with homogeneous FL schemes. The number of
participants for each dataset is given in the round brackets
following the dataset name. The values for the performance
metrics are the means over 5 independent runs with stan-
dard deviations given in the following round brackets. The
Average Participants column shows the average values over
all participants. We have the following observations:
• The performance of CloREF is better than the average performance of the individual participants on all datasets, which means our fusion and selection strategy can effectively fuse the knowledge learned by different participants. Hence, each participant can benefit from knowledge that it cannot learn from its local data.
• CloREF outperforms All Rules on all datasets, indicating that PBIL-RFS can effectively remove contradictory rules.
• All the results of CloREF are better than those of FedMD. Moreover, this performance is achieved without relying on a public shared dataset as in FedMD.
• We pay more attention to the AUC score in this analysis, as it is the optimization objective of CloREF. Compared with HeteroFL and FedRolex, CloREF achieves better AUC on 5 of the 8 datasets. Although FedRolex shows an advantage in accuracy, its communication cost is significantly higher than that of CloREF, as shown in Table 4. In addition, the standard deviation of the results for CloREF is smaller than that of the other FL models, in particular on the veh1 dataset, indicating the robustness of CloREF. This is because, unlike HeteroFL and FedRolex, which restrict local models to sub-models of the global model with the same depth, CloREF allows each local model to be whatever type of model best suits its local data. Moreover, the global model of CloREF is more lightweight, yet offers comparable performance and the advantage of interpretability over the other FL models. It is worth mentioning that the AUC achieved by CloREF is very close to that of FedAvg, showing that CloREF does not sacrifice much performance in pursuit of model heterogeneity.
• CloREF gains comparable, and sometimes (on glas and wisc) even slightly better, performance than the Centralized Model. The main reason is that CloREF can better mine knowledge from the local data through model heterogeneity, and can effectively fuse knowledge and avoid overfitting through PBIL-RFS.
Fig. 7. Convergence on yea1: (a) PSO-based boundary sample generation; (b) PBIL-based rule fusion (x-axis: generation; y-axis: fitness value). Each curve represents one run of the algorithm.
6.4 Results on convergence
Two iterative computational components, boundary sample
generation and rule fusion & selection, are designed based on
PSO and PBIL, respectively. Both methods have proven convergence
properties. Fig. 7 shows the convergence curves
on the yea1 dataset, where each optimization was run
multiple times. It can be seen that both optimization components
converge quickly owing to their simple optimization
objectives. The same convergence trend is observed on
the other datasets.
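To make the PBIL-based rule fusion & selection concrete, the following minimal sketch shows one way the probability vector over rules could be updated from participant-reported fitness values. The function name, parameter defaults, and fitness interface are illustrative assumptions rather than the exact CloREF implementation; $\delta$ and $\mu$ play the roles of the learning-rate and mutation parameters discussed in Sect. 6.9.

```python
import numpy as np

def pbil_rule_selection(fitness, n_rules, n_gene=20, n_iter=400,
                        delta=0.1, mu=0.02, mut_shift=0.05, seed=0):
    """Minimal PBIL sketch for selecting a subset of rules.

    fitness: callable mapping a boolean mask over rules to a score
             (in CloREF, the AUC evaluated by the participants).
    """
    rng = np.random.default_rng(seed)
    prob = np.full(n_rules, 0.5)              # P(rule i is selected)
    best_mask, best_fit = None, -np.inf
    for _ in range(n_iter):
        # Sample a population of gene codes (one bit per rule).
        pop = rng.random((n_gene, n_rules)) < prob
        scores = np.array([fitness(mask) for mask in pop])
        if scores.max() > best_fit:
            best_fit, best_mask = scores.max(), pop[scores.argmax()].copy()
        # Pull the probability vector towards the best gene of this round.
        prob = (1 - delta) * prob + delta * pop[scores.argmax()]
        # Random mutation keeps the search exploring.
        flip = rng.random(n_rules) < mu
        prob[flip] = (1 - mut_shift) * prob[flip] + mut_shift * rng.random(flip.sum())
    return best_mask, best_fit
```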
6.5 Communication and computation analysis
According to the definition of a rule in Sect. 4, suppose
parameter $b$ and each element in $a$ and $c$ are of type
'float32', and parameter $\omega$ is of type 'char'. Then
$8n+5$ bytes are needed to represent one rule, where $n$
is the feature dimension. Suppose there are $p$ participants
and each participant extracts $l$ rules. The amount of
communication for rule uploading and downloading at each
participant is $(8n+5)(p+1)l$ bytes. During the rule fusion
and selection process, each participant needs to iteratively
receive the gene codes from and upload the calculated AUC
(4 bytes) to the server. Suppose each population has $N_{gene}$
genes and the number of iterations is $N_i$. Since the length of
each gene is $\frac{pl}{8}$ bytes, the amount of communication at each
participant in this process is $N_i N_{gene}(\frac{pl}{8}+4)$ bytes.
In FedAvg, the amount of communication at each participant
depends only on the architecture of the global neural
network model. Suppose $N_w$ is the number of weights, $N_i$
is the number of iterations, and each weight is of type 'float32'.
The communication cost at each participant is $8 N_w N_i$
bytes. The calculation of communication for HeteroFL and
FedRolex is the same as for FedAvg, except that the
values of $N_w$ in HeteroFL and FedRolex are smaller, since
each participant trains a sub-model.
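The two cost expressions above are easy to check numerically. The sketch below encodes them directly; it assumes that the $l$ values reported in Table 4 are total rule counts over all participants and that 1 KB = 1000 bytes, under which the computed figures track Table 4 closely.

```python
def cloref_cc_kb(n, p, l_total, n_iter=100, n_gene=20):
    """Per-participant communication of CloREF in KB.

    Rule exchange: (8n+5)(p+1)l bytes with l rules per participant.
    PBIL phase: per iteration, n_gene gene codes of l_total/8 bytes
    each are received and a 4-byte AUC is returned per gene.
    """
    l = l_total / p
    rule_bytes = (8 * n + 5) * (p + 1) * l
    pbil_bytes = n_iter * n_gene * (l_total / 8 + 4)
    return (rule_bytes + pbil_bytes) / 1000

def fedavg_cc_kb(n, hidden, n_iter=100):
    """Per-participant communication of FedAvg in KB: 8 * Nw * Ni bytes
    (upload + download of float32 weights), counting the weights and
    biases of an MLP with a single output unit."""
    layers = [n] + list(hidden) + [1]
    n_w = sum((a + 1) * b for a, b in zip(layers, layers[1:]))
    return 8 * n_w * n_iter / 1000

# glas: ~25.7 KB vs 25.1 KB reported in Table 4; FedAvg matches exactly.
print(cloref_cc_kb(n=9, p=4, l_total=51))        # ~25.66
print(fedavg_cc_kb(n=9, hidden=(10, 10, 10)))    # 264.8
```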
Table 4 compares the communication cost of CloREF
and other FL models. In our experiments, HeteroFL and
FedRolex use the same global model and shrinkage ratios.
Hence, they have the same communication cost as depicted
by H/F in Table 4. The number of iterations is set to 100 in
all experiments. $N_{gene}$ in CloREF is set to 20. The values
of the other key parameters are given in the table. It is
worth noting that the communication cost of CloREF is
significantly lower than that of other FL models because
(1) in most cases, the rule set is much simpler than a neural network model, and (2) one rule occupies only one bit in the gene code, whereas a weight for a neuron connection needs 4 bytes.

TABLE 4
Communication cost (CC) and storage (ST), measured in KB. Arc represents the architecture of the MLP used by FedAvg.

          glas        yea1     vehi1       pima     wis      se0         kdd      pa0
  p       4           5        5           5        5        14          14       19
  n       9           9        18          8        5        19          41       10
CloREF
  l       51          12.4     53.6        66.8     16.8     32          156.6    68.4
  CC      25.1        12.0     30.3        29.5     13.4     20.9        100.6    30.5
  ST      0.4         0.1      0.4         0.5      0.1      0.3         1.2      0.5
FedAvg
  Arc     (10,10,10)  (10,10)  (10,10,10)  (10,10)  (10,10)  (10,10,10)  (5,5,5)  (10,10)
  CC      264.8       186.7    342.2       178.1    186.7    350.8       246.9    195.3
  ST      2.6         1.9      3.4         1.8      1.9      3.5         2.5      2.0
H/F
  CC      152.0       113.6    206.2       107.6    113.6    212.3       169.3    119.6
  ST      1.5         1.1      2.0         1.1      1.1      2.1         1.7      1.2
Communication failure and result pollution are common
challenges for distributed learning. Compared with tradi-
tional FL, CloREF has at least two advantages: (1) as long as
a participant can successfully send its rules to the server, the
knowledge it learns from its local data could be integrated
into the global model, even if the participant can’t join
the rule evaluation process due to communication failure.
(2) In each iteration, our model transmits less data and
requires less communication time than standard FL, making
it more robust to temporary communication failures. The
synchronous fusion process also avoids result pollution,
since the evaluation score must be computed based on the
genes sent by the server in each iteration.
The time complexity of our PSO-based boundary sample
generation method is $O(N_S N_G N_P)$ [30], where $N_S$ is the
number of synthetic boundary samples, and $N_P$ and $N_G$
are the number of particles and the maximum number
of generations in each execution of PSO, respectively. As
shown in Fig. 7(a), our PSO-based method converges fast
and thus will not add much computation overhead to the
participants. In the worst case, the time complexity of Algo. 1
is $O(|CLU| \log(|CLU|))$, where $|CLU|$ is the number of
initial clusters.
initial clusters. In contrast to FedAvg with deep neural net-
works, which require a substantial amount of computation
for gradient calculation in each round, both the genera-
tion of boundary samples and model fitting require only
a single execution in CloREF. Hence, although additional
operations are required on the client side in CloREF, they
will not add much computing burden to
the clients. Table 4 also shows that the storage requirement
for model parameters at each client in CloREF is much less
than the requirements in FedAvg, and HeteroFL/FedRolex.
6.6 Case studies on IID and non-IID datasets
Theoretically, CloREF is not sensitive to the client data distribution
because each participant first builds its local learning model
independently on its local dataset, and the server then merges
the extracted rules. To verify the
robustness of CloREF in dealing with non-IID data, we first
compare the decision boundaries generated by CloREF on the
spirals dataset under both IID and non-IID distributions.
We split the spirals dataset into four subsets in the follow-
ing two ways: (1) random sampling so that the four subsets
Fig. 8. Classification boundaries on different sub-datasets: (a)–(d) at participants A–D; (e) after rule fusion. First row: random splitting (accuracies 0.97, 0.93, 0.84, and 0.93 at participants A–D, and 0.97 for CloREF); second row: cross-splitting (0.79, 0.52, 0.56, 0.70, and 0.97, respectively).
have similar distributions (i.e., IID); (2) cross-splitting so
that each subset contains a quarter of the whole spiral
dataset (i.e., non-IID). Four different models, MLP (5,5),
SVM (RBF), NB, SVM (POLY), are trained using the four
subsets. The decision boundaries generated by local rules
and the final fused rule set are shown in Fig. 8, where the
accuracy of the corresponding set of rules on the same test
dataset is shown in the lower right corner. It can be seen that,
for both local data distributions, CloREF can effectively fuse
the knowledge learned by all participants to generate similar
decision boundaries and the final global models achieve
the same accuracy. Furthermore, the accuracy achieved by
CloREF is higher than that of the best global model as shown
in Fig. 6, i.e., the best model that has the luxury of being
directly trained using the whole dataset.
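For reproducibility, the sketch below illustrates the two 4-way splits used above. Treating cross-splitting as a spatial quadrant split is our assumption; the text only states that each subset holds a quarter of the whole spiral dataset.

```python
import numpy as np

def split_four_ways(X, y, mode="iid", seed=0):
    """Return four (X, y) subsets under the two splitting strategies."""
    rng = np.random.default_rng(seed)
    if mode == "iid":                      # random sampling
        idx = rng.permutation(len(X))
        return [(X[p], y[p]) for p in np.array_split(idx, 4)]
    # Non-IID cross-split: one spatial quadrant per participant
    # (our reading of "a quarter of the whole spiral dataset").
    right = X[:, 0] >= np.median(X[:, 0])
    upper = X[:, 1] >= np.median(X[:, 1])
    quads = [right & upper, right & ~upper, ~right & upper, ~right & ~upper]
    return [(X[q], y[q]) for q in quads]
```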
6.7 Case study on anomaly detection for smart home
Most IoT devices are resource-constrained and may
not be able to deploy a complicated or homogeneous
learning model. In addition, data generated by different
IoT devices may exhibit statistical heterogeneity, making
it desirable to customize their models based on the local data
distribution through model heterogeneity. In this experiment,
we use the public N-BaIoT dataset [36] to demonstrate
the advantages of our method for anomaly detection in a
smart home system. N-BaIoT was created by collecting the
network traffic of 9 IoT devices infected by two common IoT
botnets, BASHLITE and Mirai. The specifications of each device
and the attacks it suffered are given in Table 5.
The experimental settings are the same as shown in
Table 1 except that the number of generations in PBIL-
RFS and the communication round are both set to 500. The
experimental results are shown in Table 6, with mean and
standard deviation computed over 5-fold cross validation.
To better evaluate the performance of trained models, we
use a global test dataset that contains all types of attacks.
As can be seen from Table 6, the local models, especially those of
Cam4 and WC, perform significantly worse than the global
model built by CloREF when tested on the global dataset.
Because the local models are trained only on their local attacks,
they are less capable of recognizing attacks they have not
seen, whereas CloREF can fuse the knowledge learned from
different IoT devices. DB1’s local model also achieves much
better performance than WC’s local model even though the
types of attacks they suffered are the same. This is because
(1) the normal behaviour of a doorbell is much simpler than
the normal behaviour of a webcam, making it much easier
to distinguish from abnormal behaviours; (2) DB1 and WC
have different data distributions, as revealed by t-SNE analysis [37].

Fig. 9. Boxplot for the importance of features over 5-fold cross-validation (x-axis: feature number, ordered 15, 16, 2, 5, 3, 1, 6, 9, 10, 17, 4, 8, 7, 23, 12, 11, 13, 14, 19, 20, 22, 21, 18).
Another advantage of CloREF over DNNs is its direct
interpretability. From the coefficients of the linear rules, we
can directly know which features play significant roles in
making decisions without needing to employ an explana-
tion method, e.g., [24]. Since the reasoning process of the
learning model is transparent, we can decide whether to
trust or reject the decisions made by the learning model.
This is crucial for applications such as network security, au-
tonomous driving, and healthcare, where accepting a wrong
decision could have irreversible consequences. To demon-
strate the consistency on interpretability, Table 7 shows the
top 5 important features selected by our model and other
feature selection methods. We use two types of feature
selection methods: (1) univariate statistical tests, and (2)
white-box explanation method (i.e., checking the coefficient
for each feature assigned by the trained model). Moreover,
we also compare the important features selected in [35].
Overall, the important features selected by our model and
the other methods overlap to a large extent, which demonstrates
the consistency and effectiveness of the feature selection by our
model. In addition, Fig. 9 shows the boxplot of the feature
importance values of the global rule set over 5-fold cross-validation,
which demonstrates the stability of CloREF.
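Since every rule is a linear model, a feature-importance score can be read off the rule coefficients directly. The aggregation below, per-rule L1 normalisation followed by averaging over the global rule set, is an illustrative assumption; the paper derives importance from the coefficients but does not spell out the exact formula behind Fig. 9.

```python
import numpy as np

def feature_importance(rule_coefs):
    """Aggregate importance over the global rule set (a sketch).

    rule_coefs: (n_rules, n_features) array, one coefficient vector
    per linear rule. Each rule is L1-normalised so that large-scale
    rules do not dominate, then importances are averaged.
    """
    a = np.abs(np.asarray(rule_coefs, dtype=float))
    a /= a.sum(axis=1, keepdims=True) + 1e-12
    return a.mean(axis=0)

# Ranking features as in Fig. 9 (1-based feature numbers):
# order = np.argsort(-feature_importance(coefs)) + 1
```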
6.8 Experiments on multiclass classification
We now demonstrate the extension of CloREF for multiclass
classification using the One-vs-One approach [25]. At each
participant, one binary classifier is trained for each pair of
classes. At the server, the received rules are merged and
selected at a per-class-pair level, using the same approach
introduced in Sect. 6.3, to generate the best set of global rules
for each pair of classes. When testing an unseen sample at
a local participant, majority voting is used to determine the
class the sample belongs to. If there is more than one winner,
we randomly choose one from the winning classes.
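A minimal sketch of this inference step is shown below. The pair_classifiers interface is hypothetical: each entry stands for the fused global rule set of one class pair, exposed as a callable that returns the winning class for a sample.

```python
import numpy as np
from itertools import combinations

def ovo_predict(sample, pair_classifiers, classes, seed=None):
    """One-vs-One inference with majority voting and random tie-breaking."""
    rng = np.random.default_rng(seed)
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        winner = pair_classifiers[pair](sample)  # apply the pair's fused rules
        votes[winner] += 1
    top = max(votes.values())
    winners = [c for c, v in votes.items() if v == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)
```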
Dataset and models: We use the well-known electrocardio-
gram (ECG) dataset provided by MIT-BIH [38] for the mul-
ticlass classification evaluation. Following the recommen-
dation of the Association for the Advancement of Medical
Instrumentation (AAMI), samples in the MIT-BIH dataset
TABLE 5
Detailed specifications of the N-BaIoT dataset used in this experiment.
Device Device type Abbrev. BASHLITE Mirai
Combo Junk Scan TCP UDP Ack Scan Syn UDP UDPPlain
Danmini Doorbell DB1 X X X X X
Ennio Doorbell DB2 X X X X X
Ecobee Thermostat TM X X X X X X X X
Philips B120N/10 Baby monitor BM X X X X X X X
Provision PT-737E Security camera Cam1 X X X X X X X
Provision PT-838 Security camera Cam2 X X X X X X X
SimpleHome XCS7-1002-WHT Security camera Cam3 X X X X X X X
SimpleHome XCS7-1003-WHT Security camera Cam4 X X X X X X X
Samsung SNH 1011 N Webcam WC X X X X X
TABLE 6
Results of nine local models and CloREF on the N-BaIoT dataset.
DB1 DB2 TM BM Cam1 Cam2 Cam3 Cam4 WC CloREF
ACC 0.998(±0.00) 0.994(±0.01) 0.923(±0.13) 0.997(±0.00) 0.957(±0.01) 0.978(±0.00) 0.959(±0.00) 0.928(±0.04) 0.915(±0.01) 0.998(±0.00)
GM 0.995(±0.00) 0.979(±0.04) 0.942(±0.08) 0.993(±0.01) 0.973(±0.00) 0.986(±0.00) 0.975(±0.00) 0.876(±0.19) 0.614(±0.03) 0.997(±0.00)
AUC 0.995(±0.00) 0.980(±0.04) 0.946(±0.07) 0.993(±0.01) 0.974(±0.00) 0.986(±0.00) 0.975(±0.00) 0.897(±0.14) 0.689(±0.02) 0.997(±0.00)
TABLE 7
Top important features selected by different methods.
Feature Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Univariate
F-test X X X X X
χ2-test X X X X X
MutualInformation X X X X X
White-box
Explanation
LinearSVC X X X X X
RandomForest X X X X X X
SGDClassifier X X X X X
LogisticRegression X X X X X
AdaBoost X X X X X
ExtraTrees X X X X X
GBDT X X X X X
Mafarja et al. [35] X X X X X X
CloREF X X X X X
can be categorized into five heartbeat classes: N, SVEB, VEB,
F, and Q. Because the dataset contains original ECG signals,
we segment the signals into beats and extract some features
using the algorithm proposed by Guerra et al. [39]. In this
set of experiments, we use the Wavelets (23 features) and
Our Morph (4 features) feature descriptors. We also follow
the common practice in ECG classification and divide the
dataset into a training set and a testing set using the inter-
patient scheme given in [40], with both the training and
testing sets containing data from 22 records with a similar
proportion of beat types. As both the training and testing
sets are fixed, we calculate the average results of five runs.
The detailed specification of this dataset is given in Table 8.
We split the dataset into five subsets for five participants.
We use MLP(20,20,20) as the global model for FedAvg, since
it gives the best performance in our experiments with MLP.
Also, we only use MLP(20,20,20) and MLP(10,10,10) as the
local models in FedMD since we observed that using more
heterogeneous models decreases the stability of FedMD.
The other parameter settings for CloREF, FedAvg, FedMD,
HeteroFL and FedRolex are the same as those in Table 1.
Metrics: Given the highly skewed class distribution in this
dataset, using accuracy alone is no longer appropriate for
ensuring a fair comparison. Following most existing ECG
classification studies, we employ three additional metrics:
the macro-average of recall and of precision over the classes,
and the jκ index [41]. The jκ index is a special metric used in
ECG analysis to evaluate the discrimination of the most
important arrhythmias (SVEB and VEB), taking into
account the misclassification and imbalance among all the
considered classes. A higher jκ index indicates better
discrimination. We prioritize these three metrics over accuracy,
given that in healthcare scenarios the consequence of failing
to detect a positive case in time is irreversible. Besides,
AAMI recommends no reward or penalty for classifying an
F sample as a VEB sample [39]. Hence, the corresponding
cells of the confusion matrix are not counted in our analysis.
Results analysis: The results on the ECG dataset are given
in Table 9. First and foremost, the results demonstrate that
CloREF can be effectively extended to tackle multiclass
classification problems. As can be seen in Table 9, CloREF
achieves the highest scores on all three key metrics (recall,
precision, and jκ index), demonstrating its superior ability in
dealing with class imbalance. Even though FedRolex attains
the highest accuracy, its considerably low recall suggests
that it effectively identifies negative (majority) samples but
struggles to identify positive (minority) samples. Similarly,
the precision scores for positive classes such as SVEB and F
given by FedMD and FedAvg are much lower than those of
CloREF. One possible reason for the superiority of CloREF
is that, to perform multiclass classification, CloREF takes the
OVO strategy to build multiple binary classifiers, potentially
mitigating the negative impact on minority classes caused
by small sample sizes. Because one subset serves as the
TABLE 8
Detailed specifications of the ECG dataset.

Class        N      SVEB  VEB   F    Q
DS1 (Train)  45842  944   3788  414  0
DS2 (Test)   44743  1837  3447  388  8
TABLE 9
The average results on the multiclass ECG dataset.

Method       Accuracy  Recall  Precision  jκ index
Centralized  0.858     0.480   0.375      0.387
FedAvg       0.877     0.468   0.349      0.382
FedMD        0.743     0.422   0.299      0.216
HeteroFL     0.862     0.364   0.316      0.227
FedRolex     0.904     0.378   0.409      0.345
CloREF       0.839     0.505   0.462      0.434
proxy dataset, FedMD’s performance is notably compro-
mised in comparison to the other methods, primarily due
to the reduction in the number of training samples.
6.9 Discussion on hyperparameter settings
We discuss the configuration of the hyperparameters in our
model within two categories: server-side and client-side.
Server-side: parameters in this category include $(\alpha, \theta_m)$,
used for controlling rule selection and merging, and $(\delta, \mu, N_{gene})$,
used for controlling the learning rate, the extent of
mutation, and the number of genes in PBIL. Our experimental
results given in Sect. 6.3 show that a moderately large $\alpha$
(e.g., 0.9) and a small $\theta_m$ (e.g., 0.02) can significantly reduce
the number of rules with a nearly imperceptible decrease
in AUC. Moreover, the configurations for both $\alpha$ and $\theta_m$
are not sensitive to datasets. Hence, we recommend $\alpha = 0.9$
and $\theta_m = 0.02$ as the empirical optimal configuration. The other
parameters in this group ($\delta$, $\mu$, $N_{gene}$) are related to the PBIL
algorithm. In our experiments, we consistently observe that
their default settings given in Table 1 lead to fast and stable
convergence, as demonstrated in Sect. 6.4.
Client-side: parameters in this category include $(\theta, N_S, N_P, N_G)$,
used for synthesizing boundary samples, and $(T_{split}, T_{merge})$,
used for generating decision rules. Our experiments
show that the generic setting of $\theta$ in Eq. (4), that is, the
median value (i.e., 0.5) of the prediction given by a local
model, already leads to satisfactory performance. $N_S$ is
the required number of synthetic boundary samples. It is
initially set to $nm$, where $n$ is the feature dimension and $m$
is the expected number of rules, with a default value of
20. If the fidelity of the generated rules is unsatisfactory, $N_S$
can be tuned to generate more boundary samples. $N_P$ and
$N_G$ are the number of particles and the maximum number
of generations in each execution of PSO, and their empirical
settings are given in Table 1. Since the optimization objective
of our PSO-based method is very simple, it converges very
fast, as shown in Fig. 7, which also facilitates the fine-tuning
of these two parameters. For the parameters $(T_{split}, T_{merge})$,
their default settings are given in Table 1. In practice, we can
slightly increase the values of these two parameters around
the default values if the observed fidelity on a dataset is
below 0.9.
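For convenience, the recommendations above can be collected into a single configuration sketch. Only the values stated explicitly in this section are filled in; parameters whose defaults live in Table 1 (e.g., $N_P$, $N_G$, $\delta$, $\mu$, $T_{split}$, $T_{merge}$) are left as placeholders.

```python
# Server-side recommendations (Sect. 6.9).
SERVER_DEFAULTS = {
    "alpha": 0.9,      # rule selection: moderately large alpha
    "theta_m": 0.02,   # rule merging: small threshold
    "n_gene": 20,      # PBIL population size used in our experiments
    # delta, mu: see Table 1 (not reproduced here)
}

# Client-side recommendations (Sect. 6.9).
def default_client_params(n_features, m=20):
    """theta = 0.5 (median prediction in Eq. (4)); N_S = n * m."""
    return {"theta": 0.5, "n_s": n_features * m}
```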
7 CONCLUSION
In this paper, we proposed a rule-based collaborative learn-
ing framework CloREF that enables multiple heterogeneous
participants to build a global model without sharing their
local data. Besides the advantages of being lightweight and
interpretable, it also gives competitive performance on a
range of benchmarks. As a new attempt in collaborative
learning, it has some limitations. The fidelity of the rules
extracted at the participants is sometimes unsatisfactory. To
address this, we intend to investigate the use of other distance
metrics such as the Mahalanobis distance to improve the
quality of boundary sample clustering, which may generate
local linear models that are stable and well-performing.
Also, CloREF may not work well on very high-dimensional
data such as images and text due to the overhead in gener-
ating boundary samples. We will investigate the feasibility
of using other linear regression techniques such as elastic
net [42] to better cope with high-dimensional data. Further-
more, we will investigate personalizing the global rule set to
make it adaptive to local datasets and apply the developed
framework to more real-world, large-scale problems.
REFERENCES
[1] P. Kairouz et al., “Advances and open problems in federated
learning,” Foundations and Trends® in Machine Learning, vol. 14,
no. 1–2, pp. 1–210, 2021.
[2] A. Hard et al., “Federated learning for mobile keyboard predic-
tion,” 2018, arXiv:1811.03604.
[3] L. Cui et al., “Security and privacy-enhanced federated learning
for anomaly detection in iot infrastructures,” IEEE Transactions on
Industrial Informatics, vol. 18, no. 5, pp. 3492–3500, 2022.
[4] D. Y. Zhang, Z. Kou, and D. Wang, “Fedsens: A federated learning
approach for smart health sensing with class imbalance in resource
constrained edge computing,” in Proceedings of IEEE Conference on
Computer Communications, 2021, pp. 1–10.
[5] P. Zhou et al., “A privacy-preserving distributed contextual fed-
erated online learning framework with big data support in social
recommender systems,” IEEE Transactions on Knowledge and Data
Engineering, vol. 33, no. 3, pp. 824–838, 2021.
[6] D. Li and J. Wang, “FedMD: Heterogenous federated learning via
model distillation,” 2019, arXiv 1910.03581.
[7] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation
for robust model fusion in federated learning,” Advances in Neural
Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
[8] E. Diao et al., “HeteroFL: Computation and communication effi-
cient federated learning for heterogeneous clients,” in Inter. Conf.
on Learning Representations, 2021.
[9] M. G. Arivazhagan et al., “Federated learning with personalization
layers,” 2019, arXiv:1912.00818.
[10] R. Liu et al., “No one left behind: Inclusive federated learning
over heterogeneous devices,” in Proc. of the 28th SIGKDD Conf.
on Knowledge Discovery and Data Mining, August 14–18, 2022.
[11] J. Verbraeken et al., “A survey on distributed machine learning,”
ACM Comput. Surv., vol. 53, no. 2, pp. 1–33, March 2020.
[12] B. McMahan et al., “Communication-Efficient Learning of Deep
Networks from Decentralized Data,” in Proc. of the 20th Inter. Conf.
on Artificial Intelligence and Statistics, vol. 54, 2017, pp. 1273–1282.
[13] M. Duan et al., “Self-balancing federated learning with global
imbalanced data in mobile systems,” IEEE Transactions on Parallel
and Distributed Systems, vol. 32, no. 1, pp. 59–71, 2020.
[14] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated
learning: A meta-learning approach,” 2020, arXiv:2002.07948.
[15] T. Li et al., “Federated optimization in heterogeneous networks,”
in Proc. of Machine Learning and Systems, vol. 2, 2020, pp. 429–450.
[16] O. Marfoq et al., “Federated multi-task learning under a mixture of
distributions,” in Proc. of Advances in Neural Information Processing
Systems, vol. 34, 2021, pp. 15 434–15 447.
[17] Y. Liu et al., “A secure federated transfer learning framework,”
IEEE Intelligent Systems, vol. 35, no. 4, pp. 70–82, 2020.
15
[18] A. Z. Tan, H. Yu, L. Cui, and Q. Yang, “Towards personalized
federated learning,” IEEE Transactions on Neural Networks and
Learning Systems, pp. 1–17, 2022.
[19] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation
for robust model fusion in federated learning,” Advances in Neural
Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
[20] Z. Zhu, J. Hong, and J. Zhou, “Data-free knowledge distillation for
heterogeneous federated learning,” in Proc. of the 38th Inter. Conf.
on Machine Learning, vol. 139, 18–24 Jul 2021, pp. 12 878–12 889.
[21] S. Alam et al., “FedRolex: Model-heterogeneous federated learning
with rolling sub-model extraction,” in Advances in Neural Informa-
tion Processing Systems, 2022.
[22] W. Samek, T. Wiegand, and K.-R. M¨
uller, “Explainable artificial
intelligence: Understanding, visualizing and interpreting deep
learning models,” 2017, arXiv:1708.08296.
[23] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘why should I trust you?’:
Explaining the predictions of any classifier,” in Proc. of the 22nd
ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining,
2016, p. 1135–1144.
[24] W. Guo et al., “LEMNA: Explaining deep learning based security
applications,” in Proc. of the 2018 ACM SIGSAC Conf. on Computer
and Communications Security, 2018, pp. 364–379.
[25] D. Tax and R. Duin, “Using two-class classifiers for multiclass
classification,” in Proceedings of Inter. Conf. on Pattern Recognition,
vol. 2, 2002, pp. 124–127.
[26] J. Platt et al., “Probabilistic outputs for support vector machines
and comparisons to regularized likelihood methods,” Advances in
large margin classifiers, vol. 10, no. 3, pp. 61–74, 1999.
[27] B. Zadrozny and C. Elkan, “Obtaining calibrated probability esti-
mates from decision trees and naive bayesian classifiers,” in ICML,
vol. 1, 2001, pp. 609–616.
[28] M. Kearns and Y. Mansour, “On the boosting ability of top-down
decision tree learning algorithms,” in Proc. of the 28th Annual ACM
symposium on Theory of Computing, 1996, pp. 459–468.
[29] J. Kennedy et al., “Particle swarm optimization,” in Proc. of Inter-
national Conference on Neural Networks, vol. 4, 1995, pp. 1942–1948.
[30] Y. Pang et al., “Rule-based collaborative learning with heteroge-
neous local learning models,” in Pacific-Asia Conf. on Knowledge
Discovery and Data Mining. Springer, 2022, pp. 639–651.
[31] R. Guidotti et al., “A survey of methods for explaining black box
models,” ACM Computing Surveys, vol. 51, no. 5, pp. 1–42, 2018.
[32] J. Alcalá-Fdez et al., “KEEL data-mining software tool: data set
repository, integration of algorithms and experimental analysis
framework,” Journal of Multiple-Valued Logic and Soft Computing,
vol. 17, no. 2–3, pp. 255–287, 2011.
[33] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
[Online]. Available: http://archive.ics.uci.edu/ml
[34] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,”
Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[35] M. Mafarja et al., “Augmented whale feature selection for iot
attacks: Structure, analysis and applications,” Future Generation
Computer Systems, vol. 112, pp. 18–40, 2020.
[36] Y. Meidan et al., “N-baiot—network-based detection of iot bot-
net attacks using deep autoencoders,” IEEE Pervasive Computing,
vol. 17, no. 3, pp. 12–22, 2018.
[37] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,”
Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579–2605, 2008.
[38] G. B. Moody and R. G. Mark, “The impact of the MIT-BIH
arrhythmia database,” IEEE Engineering in Medicine and Biology
Magazine, vol. 20, no. 3, pp. 45–50, 2001.
[39] V. Mondéjar-Guerra et al., “Heartbeat classification fusing temporal and
morphological information of ECGs via ensemble of classifiers,”
Biomedical Signal Processing and Control, vol. 47, pp. 41–48, 2018.
[40] P. De Chazal et al., “Automatic classification of heartbeats using
ecg morphology and heartbeat interval features,” IEEE transactions
on biomedical engineering, vol. 51, no. 7, pp. 1196–1206, 2004.
[41] T. Mar et al., “Optimization of ecg classification by means of fea-
ture selection,” IEEE Transactions on Biomedical Engineering, vol. 58,
no. 8, pp. 2168–2177, 2011.
[42] H. Zou and T. Hastie, “Regularization and variable selection via
the elastic net,” Journal of the Royal Statistical Society: Series B
(Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
Ying Pang is a Ph.D. candidate from the Depart-
ment of Computer Science, University of Otago,
New Zealand. Prior to the University of Otago,
she received her B.E. and MA.Sc. degrees in
Computer Science and Technology from Univer-
sity of Jinan, China, in 2016 and 2019, respec-
tively. Her research generally centers around
distributed optimization, federated learning, in-
terpretation of learning models, and imbalanced
learning.
Haibo Zhang received the M.Sc. degree in com-
puter science from Shandong Normal University,
Jinan, China, in 2005, and the Ph.D. degree in
computer science from the University of Ade-
laide, Adelaide, Australia, in 2009. From 2009
to 2010, he was a Postdoctoral Research As-
sociate with the Automatic Control Laboratory,
School of Electrical Engineering, Royal Institute
of Technology, Stockholm, Sweden. He is cur-
rently an Associate Professor with the Depart-
ment of Computer Science, University of Otago,
Dunedin, New Zealand. His current research interests include wireless
networks, distributed computing, and distributed edge intelligence. He
has published over 100 peer-reviewed papers, and been on the program
committee for many international conferences.
Jeremiah D. Deng obtained the B.E. degree
from the University of Electronic Science and
Technology of China in 1989, and the M.Eng.
and D.Eng. from the South China University of
Technology, China, in 1992 and 1995, respec-
tively, the latter co-supervised at the University
of Hong Kong. He is now an Associate Pro-
fessor with the School of Computing, University
of Otago. Dr. Deng’s research interests include
machine learning, pattern recognition, and mod-
eling and optimization of computer networks. He
has published over 100 refereed research papers in international con-
ference proceedings and journals.
Lizhi Peng received the B.Eng.
degree from Xi'an Jiaotong University in 1998,
and the Ph.D. degree from Harbin Institute of Tech-
nology in 2015. He was a visiting scholar at the
Department of Computer Science of the University
of Otago, New Zealand, in 2017. He is currently
a Professor of School of Information Science and
Engineering, University of Jinan, Jinan, China.
His current research interests include machine
learning, evolutionary computing, computer net-
work, and parallel computing.
Fei Teng received the B.S. degree from South-
west Jiaotong University, Chengdu, China, in
2006, and the Ph.D. degree in computer science
and technology from Ecole Centrale de Paris,
Paris, France, in 2011. She is currently an Asso-
ciate Professor in the School of Computing and
Artificial Intelligence at Southwest Jiaotong Uni-
versity. Her research interests are edge comput-
ing, machine learning, medical informatics and
data mining.