Article

Reducing leakage in distributed deep learning for sensitive health data

Abstract

For distributed machine learning with health data, we demonstrate how minimizing the distance correlation between raw data and intermediary representations (smashed data) reduces leakage of sensitive raw data patterns during client communications while maintaining model accuracy. Leakage (measured using KL divergence between the input and the intermediate representation) is the risk associated with the invertibility of intermediary representations; it can prevent resource-poor health organizations from using distributed deep learning services. We demonstrate that our method reduces leakage, in terms of distance correlation between raw data and communication payloads, from the order of 0.95 to 0.19 and from 0.92 to 0.33 during training with image datasets, while maintaining a similar classification accuracy.
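To make the regularization concrete, the sketch below (PyTorch, with illustrative names and an illustrative weighting coefficient alpha; it is a minimal reconstruction of the idea in the abstract, not the authors' released code) penalizes the sample distance correlation between a batch of raw inputs and the corresponding smashed data alongside the usual task loss.
```python
import torch
import torch.nn.functional as F

def distance_correlation(x, z, eps=1e-9):
    """Sample distance correlation between two batches of representations (in [0, 1])."""
    x = x.flatten(start_dim=1).float()
    z = z.flatten(start_dim=1).float()
    a = torch.cdist(x, x, p=2)                      # pairwise distances of raw data
    b = torch.cdist(z, z, p=2)                      # pairwise distances of smashed data
    A = a - a.mean(0, keepdim=True) - a.mean(1, keepdim=True) + a.mean()
    B = b - b.mean(0, keepdim=True) - b.mean(1, keepdim=True) + b.mean()
    dcov2 = (A * B).mean()                          # squared sample distance covariance
    dvar2_x = (A * A).mean()
    dvar2_z = (B * B).mean()
    dcor2 = dcov2 / (torch.sqrt(dvar2_x * dvar2_z) + eps)
    return torch.sqrt(torch.clamp(dcor2, min=0.0))

def nopeek_style_loss(x, smashed, logits, labels, alpha=0.1):
    # Task loss plus a leakage penalty on the client-to-server payload (smashed data).
    return F.cross_entropy(logits, labels) + alpha * distance_correlation(x, smashed)
```
In a split setting, the distance-correlation term would be computed on the client side, where both the raw inputs and the smashed data are available.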

... However, recent works [7,13,22] have reported that existing SL frameworks are not as effective as intended. As shown in Fig. 1 (c), the attackers are still able to recover the sensitive data such as raw inputs since prior privacy metrics [1,7,10,20] only focus on the input data in the inference process. Therefore, attaining a better privacy-utility balance under computational constraints in both training and inference stages is needed to securely deploy complex DNNs on edge devices. ...
... We achieve this by designing an adaptive DNN partitioning strategy and integrating a distance correlation-based privacy metric into the model training. Compared to the conventional distance correlation work [20], our work can be easily extended to different types of neural network models, including both sequential and non-sequential architectures. We evaluate awareSL with different DNN models on multiple image datasets. ...
... Research efforts [1,7,8,15,20] have been made to reduce the privacy leakage of inference data in SL through homomorphic encryption and measurable privacy metrics. For instance, homomorphic encryption [8,15] allows the ML model to perform inference directly on the encrypted data without intermediate decryption or prior knowledge of the private key, which prevents the reconstruction of sensitive information. ...
Chapter
Full-text available
With recent advances in deep neural networks (DNNs), there is a significant increase in IoT applications leveraging AI with edge-cloud infrastructures. Nevertheless, deploying large DNN models on resource-constrained edge devices is still challenging due to limitations in computation, power, and application-specific privacy requirements. Existing model partitioning methods, which deploy a partial DNN on an edge device while processing the remaining portion of the DNN on the cloud, mainly emphasize communication and power efficiency. However, DNN partitioning based on the privacy requirements and resource budgets of edge devices has not been sufficiently explored in the literature. In this paper, we propose awareSL, a model partitioning framework that splits DNN models based on the computational resources available on edge devices, preserving the privacy of input samples while maintaining high accuracy. In our evaluation of multiple DNN architectures, awareSL effectively identifies the split points that adapt to resource budgets of edge devices. Meanwhile, we demonstrate the privacy-preserving capability of awareSL against existing input reconstruction attacks without sacrificing inference accuracy in image classification tasks.
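As a rough, hypothetical illustration of resource-budgeted partitioning (not the actual awareSL strategy), a split point could be chosen greedily so that the cumulative cost of the client-side layers stays within the edge device's budget:
```python
def choose_split_point(layer_costs, device_budget):
    """Return the number of leading layers to keep on the edge device.

    layer_costs: per-layer cost estimates (e.g., FLOPs or memory), hypothetical values.
    device_budget: the edge device's resource budget in the same unit.
    """
    cumulative, split = 0.0, 0
    for i, cost in enumerate(layer_costs):
        if cumulative + cost > device_budget:
            break
        cumulative += cost
        split = i + 1
    return split  # layers [0, split) stay on the device, the rest run on the server
```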
... SL has been proposed as a privacy-preserving implementation of collaborative learning [47,46,36] and has gained particular interest due to its efficiency and simplicity. For user data privacy, SL relies on the fact that only intermediate activation maps are shared between the parties. ...
... Although the previous studies [18,47,36] assumed that SL is designed to protect the intellectual property of the shared model and to reduce the risk of inference attacks perpetrated by a malicious server, recent studies have shown that these assumptions are false and privacy leakage risks do exist in SL. In [46], the authors analyzed the privacy leakage of SL and found a considerable leakage from the split layer in the 2D CNN model. Also, the authors in paper [1] performed experiments on the 1D CNN model and found that sharing the activation from the split layer results in severe privacy leakage. ...
... In addition, countermeasures have also been introduced to achieve privacy-preserving deep learning. For example, Vepakomma et al. [46] proposed a method for limiting data reconstruction in split neural networks (SplitNN) by minimizing the distance correlation between the input data and the intermediate tensors during model training. Also, to protect the SL from property inference attack, one approach is to use secure aggregation protocol that can protect the privacy of the clients' data while still allowing for collaborative learning [34]. ...
Preprint
Full-text available
The popularity of Deep Learning (DL) makes the privacy of sensitive data more imperative than ever. As a result, various privacy-preserving techniques have been implemented to preserve user data privacy in DL. Among various privacy-preserving techniques, collaborative learning techniques, such as Split Learning (SL) have been utilized to accelerate the learning and prediction process. Initially, SL was considered a promising approach to data privacy. However, subsequent research has demonstrated that SL is susceptible to many types of attacks and, therefore, it cannot serve as a privacy-preserving technique. Meanwhile, countermeasures using a combination of SL and encryption have also been introduced to achieve privacy-preserving deep learning. In this work, we propose a hybrid approach using SL and Homomorphic Encryption (HE). The idea behind it is that the client encrypts the activation map (the output of the split layer between the client and the server) before sending it to the server. Hence, during both forward and backward propagation, the server cannot reconstruct the client's input data from the intermediate activation map. This improvement is important as it reduces privacy leakage compared to other SL-based works, where the server can gain valuable information about the client's input. In addition, on the MIT-BIH dataset, our proposed hybrid approach using SL and HE yields faster training time (about 6 times) and significantly reduced communication overhead (almost 160 times) compared to other HE-based approaches, thereby offering improved privacy protection for sensitive data in DL.
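A minimal sketch of the encrypt-before-sending step, assuming the open-source TenSEAL library's CKKS interface; the parameters and helper names are illustrative and not the paper's exact configuration:
```python
import tenseal as ts

# Client-side CKKS context (the secret key stays on the client).
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

def encrypt_activation_map(activation_values):
    # activation_values: a flat list of split-layer outputs produced by the client model.
    return ts.ckks_vector(ctx, activation_values)

enc = encrypt_activation_map([0.12, -0.03, 0.87, 0.45])
payload = enc.serialize()   # what actually travels to the server
# The server can operate homomorphically on the ciphertext (e.g., add or scale it)
# without seeing plaintext; only the client can call enc.decrypt() to recover values.
```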
... However, the server can still reconstruct the private data of the devices from the received activations due to the high correlation between the activations and the input when the allocated device-side model is too shallow [3], [4]. Although one can reduce the possibility of data leakage by increasing the device-side model, the training will become computationally intensive for resource-constrained devices. ...
... The work in [3] demonstrated that data leakage can happen when training convolutional neural networks in SL. In [4], the authors proposed a novel SL algorithm to enhance data privacy by minimizing the distance correlation between the intermediate activations and the input data. Meanwhile, in [6], the authors studied the use of SL at inference stage over wireless networks and the impact of non-iid datasets on its performance. ...
... As α decreases, the correlation between the input data and an intermediate layer output, i.e., activations a d,k , increases. Hence, it is possible to reconstruct input data from activations as shown in [3] and [4]. In other words, an honest-but-curious server can do model inversion attack during training to restore private input data [12]. ...
Preprint
Full-text available
Split learning (SL) is an emergent distributed learning framework which can mitigate the computation and wireless communication overhead of federated learning. It splits a machine learning model into a device-side model and a server-side model at a cut layer. Devices only train their allocated model and transmit the activations of the cut layer to the server. However, SL can lead to data leakage as the server can reconstruct the input data using the correlation between the input and intermediate activations. Although allocating more layers to a device-side model can reduce the possibility of data leakage, this will lead to more energy consumption for resource-constrained devices and more training time for the server. Moreover, non-iid datasets across devices will reduce the convergence rate leading to increased training time. In this paper, a new personalized SL framework is proposed. For this framework, a novel approach for choosing the cut layer that can optimize the tradeoff between the energy consumption for computation and wireless transmission, training time, and data privacy is developed. In the considered framework, each device personalizes its device-side model to mitigate non-iid datasets while sharing the same server-side model for generalization. To balance the energy consumption for computation and wireless transmission, training time, and data privacy, a multiplayer bargaining problem is formulated to find the optimal cut layer between devices and the server. To solve the problem, the Kalai-Smorodinsky bargaining solution (KSBS) is obtained using the bisection method with the feasibility test. Simulation results show that the proposed personalized SL framework with the cut layer from the KSBS can achieve the optimal sum utilities by balancing the energy consumption, training time, and data privacy, and it is also robust to non-iid datasets.
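For reference, the basic split-training step described above (device-side forward to the cut layer, server-side forward/backward, gradient of the cut-layer activations returned to the device) can be sketched in PyTorch as follows; the models, optimizers, and loss are placeholders:
```python
import torch
import torch.nn.functional as F

def split_training_step(device_model, server_model, x, y, device_opt, server_opt):
    # Device side: forward up to the cut layer and send the activations.
    activations = device_model(x)
    sent = activations.detach().requires_grad_(True)   # what travels over the network

    # Server side: finish the forward pass, compute the loss, update server parameters.
    server_opt.zero_grad()
    loss = F.cross_entropy(server_model(sent), y)
    loss.backward()
    server_opt.step()

    # Device side: receive the gradient of the cut-layer activations and finish backprop.
    device_opt.zero_grad()
    activations.backward(sent.grad)
    device_opt.step()
    return loss.item()
```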
... To address these challenges, various modified forms of decentralised learning have been proposed. Split federated learning 24 shares only some layers of a deep learning model to prevent model inversion attacks. Tiered Federated Learning 25 groups participants based on their training time into temporal tiers, while Generative Adversarial Networks-based federated learning 26 generates synthetic data to address data imbalance issues. ...
... The shared layers, a section of hidden layers in a NN, adapted from split learning 24,[27][28][29] , form the basic mode of sharing intelligence between the clients (see Fig. 1c). An individual copy of the layers is held by each client in the orbit, trained, and transferred with one other client after each melioration. ...
Article
Full-text available
A novel collaborative and continual learning across a network of decentralised healthcare units, avoiding identifiable data-sharing capacity, is proposed. Currently available methodologies, such as federated learning and swarm learning, have demonstrated decentralised learning. However, the majority of them face shortcomings that affect their performance and accuracy. These shortcomings include a non-uniform rate of data accumulation, non-uniform patient demographics, biased human labelling, and erroneous or malicious training data. A novel method to reduce such shortcomings is proposed in the present work through selective grouping and displacing of actors in a network of many entities for intra-group sharing of learning with inter-group accessibility. The proposed system, known as Orbital Learning, incorporates various features from split learning and ensemble learning for a robust and secure performance of supervised models. A digital embodiment of the information quality and flow within a decentralised network, this platform also acts as a digital twin of healthcare network. An example of ECG classification for arrhythmia with 6 clients is used to analyse its performance and is compared against federated learning. In this example, four separate experiments are conducted with varied configurations, such as varied age demographics and clients with data tampering. The results obtained show an average area under receiver operating characteristic curve (AUROC) of 0.819 (95% CI 0.784–0.853) for orbital learning whereas 0.714 (95% CI 0.692–0.736) for federated learning. This result shows an increase in overall performance and establishes that the proposed system can address the majority of the issues faced by existing decentralised learning methodologies. Further, a scalability demo conducted establishes the versatility and scalability of this platform in handling state-of-the-art large language models.
... However, this raises serious concerns regarding user data privacy, leading to the need for privacy-preserving machine learning (PPML) solutions [9]. Split Learning (SL) and Federated Learning (FL) are two PPML methods that rely on training ML models on decentralized data sources [19]. In FL [22], every client runs a copy of the entire model on its data. ...
... Different studies showed the possibility of privacy leakage in SL. In [19], the authors analyzed the privacy leakage of SL and found a considerable leakage from the split layer in the 2D CNN model. Furthermore, the authors mentioned that it is possible to reduce the distance correlation (a measure of dependence) between the split layer and raw data by slightly scaling the weights of all layers before the split. ...
Chapter
Full-text available
Split learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate Activation Maps (AMs) and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing AMs could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies homomorphic encryption on the AMs before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.
... SL partially transfers computation to the central server by splitting the whole model into multiple network portions, which are then trained separately on distributed clients and one or more central server entities. By sharing the cut-layer representations (smashed data) [6], clients and the server conduct forward and backward propagation on the global model. SL is, therefore, more suitable for resource-constrained devices than FL. ...
... We assume that the server and all participating clients are honest-but-curious, which means that they follow the scheme but may try to infer private information of other devices during training. [Algorithm 2: Privacy-preserving split federated learning (PPSFL) — pseudocode listing (TA generates SIFE and MIFE key pairs for each client; the server then iterates over aggregation rounds, clients in parallel, and local epochs) elided.] ...
Preprint
Full-text available
In this paper, we propose a novel and efficient privacy-preserving split federated learning (PPSFL) framework that achieves both privacy protection and model accuracy with reasonable computational and communication cost. We describe the implementations of PPSFL on Multi-layer Perceptron (MLP) and Convolutional Neural Network (CNN) models with distributed clients to evaluate the performance of PPSFL.
... Split Learning (SL) and Federated Learning (FL) are the two methods of collaboratively training a model derived from distributed data sources without sharing raw data [7]. In FL, every client runs a copy of the entire model on its data. ...
... Different studies showed the possibility of privacy leakage in SL. In [7], the authors analyzed the privacy leakage of SL and found a considerable leakage from the split layer in the 2D CNN model. Furthermore, the authors mentioned that it is possible to reduce the distance correlation (a measure of dependence) between the split layer and raw data by slightly scaling the weights of all layers before the split. ...
Preprint
Full-text available
Split Learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.
... Another one is that the attacker has no information about the client network. Compared with the existing assumption, our threat model is more pragmatic than assumptions in related works [46], [47], where the adversary is assumed to have direct access to leaked pairs of private data and the smashed data. • Attacker's capability: To conduct a stealthy backdoor attack, we impose more strict limitations on the attacker's capabilities compared to existing threat models. ...
Preprint
As a novel privacy-preserving paradigm aimed at reducing client computational costs and achieving data utility, split learning has garnered extensive attention and found widespread application across various fields, including smart health and smart transportation, among others. While recent studies have primarily concentrated on addressing privacy leakage concerns in split learning, such as inference attacks and data reconstruction, the exploration of security issues (e.g., backdoor attacks) within the framework of split learning has been comparatively limited. Nonetheless, the security vulnerability within the context of split learning poses a serious threat and can give rise to grave security implications, such as illegal impersonation against a face recognition model. Therefore, in this paper, we propose a stealthy backdoor attack strategy (namely SBAT) tailored to the without-label-sharing split learning architecture, which unveils the inherent security vulnerability of split learning. We posit the existence of a potential attacker on the server side aiming to introduce a backdoor into the training model, while exploring two scenarios: one with known client network architecture and the other with unknown architecture. Diverging from traditional backdoor attack methods that manipulate the training data and labels, we constructively conduct the backdoor attack by injecting the trigger embedding into the server network. Specifically, our SBAT achieves a higher level of attack stealthiness by refraining from modifying any intermediate parameters (e.g., gradients) during training and instead executing all malicious operations post-training.
... To handle such scenarios, several distributed learning approaches (DLA) are being created to address the existing challenges of privacy leakage, client-sensitive data, etc. [1]. To minimize the danger of privacy leakage [7], DLA uses private data such as text inputs, audio recordings, medical records, etc., while keeping the data local. In a distributed learning framework, there are multiple clients and training processes. ...
Conference Paper
Split Learning (SL) and Federated Learning (FL) are two popular distributed machine learning approaches that aim at preserving clients' privacy. In this work, we investigate the impact of data poisoning attacks, specifically label flipping attacks, clean label attacks and backdoor attacks, on the behaviour of the SL model. We employ data poisoning attacks on the CIFAR10 dataset, considering 10 clients and ResNet18 as the baseline model, obtaining a training accuracy of 98.22% and a testing accuracy of 83.99%. Based on the study conducted for the aforementioned attack scenarios, we observe that backdoor attacks are the most powerful and cause a drop of approximately 29.82% in accuracy, compared to a drop of 12.5% for clean label attacks and a negligible drop in the case of label flipping. We also studied the impact of each attack on the model's accuracy by altering the proportion of poisonous clients. In the future, we plan to extend the study by proposing defence strategies against these attack scenarios.
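A label-flipping poisoning of the kind studied here can be simulated in a few lines; the poisoning fraction and class count below are arbitrary illustrative values:
```python
import random

def flip_labels(samples, num_classes=10, poison_fraction=0.2, seed=0):
    """Randomly re-label a fraction of (x, y) samples to a wrong class."""
    rng = random.Random(seed)
    poisoned = []
    for x, y in samples:
        if rng.random() < poison_fraction:
            y = rng.choice([c for c in range(num_classes) if c != y])
        poisoned.append((x, y))
    return poisoned
```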
... Both [3] and [4] advocate for split learning, emphasizing its applicability in collaborative training on limited healthcare data with a focus on data privacy. The challenges of this approach become evident when [9] introduces a method aimed at curtailing data leakage in distributed health data learning. Furthermore, [10] identifies potential privacy leakages within split learning. ...
Conference Paper
Full-text available
This study presents Weighted Sampled Split Learning (WSSL), an innovative framework tailored to bolster privacy, robustness, and fairness in distributed machine learning systems. Unlike traditional approaches, WSSL disperses the learning process among multiple clients, thereby safeguarding data confidentiality. Central to WSSL's efficacy is its utilization of weighted sampling. This approach ensures equitable learning by tactically selecting influential clients based on their contributions. Our evaluation of WSSL spanned various client configurations and employed two distinct datasets: Human Gait Sensor and CIFAR-10. We observed three primary benefits: heightened model accuracy, enhanced robustness, and maintained fairness across diverse client compositions. Notably, our distributed frameworks consistently surpassed centralized counterparts, registering accuracy peaks of 82.63% and 75.51% for the Human Gait Sensor and CIFAR-10 datasets, respectively. These figures contrast with the top accuracies of 81.12% and 58.60% achieved by centralized systems. Collectively, our findings champion WSSL as a potent and scalable successor to conventional centralized learning, marking it as a pivotal stride forward in privacy-focused, resilient, and impartial distributed machine learning.
... For supervised model training, the node transmits the labels used for classifying each item of training data to the server. NoPeekNN [21] is used to reduce the likelihood of training data being compromised and leaked. ...
Article
Full-text available
With more personal devices being connected to the internet, individuals are becoming increasingly concerned about privacy. Therefore, it is important to develop machine learning algorithms that can use customer data to create personalized models while still adhering to strict privacy laws. In this paper, we propose a robust solution to this problem in a distributed, asynchronous environment with a verifiable convergence rate. Our proposed framework trains a Convolutional Neural Network on each client and sends the feature embeddings to other clients for data aggregation. This allows each client to train a deep-learning model on feature embeddings gathered from various clients in a single communication cycle. We provide a detailed description of the architecture and execution of our suggested approach. Our technique’s effectiveness is evaluated by comparing it to the top central training and federated learning (FL) algorithms, and our tests on diverse datasets demonstrate that our method outperforms FL in terms of accuracy and is comparable to central training algorithms. Our findings also show that our proposed method reduces data transfer by over 75% compared to FL, resulting in significant bandwidth savings. As a result, model training can assist companies with high security and data protection concerns in setting up reliable collaboration platforms without requiring a central service provider.
... To infer the main client's model parameters and raw data, the proxy client must be able to eavesdrop on the smashed data sent to all other proxy clients and invert all client-side model parameters, a highly unlikely possibility if the main client and other proxy clients' networks are configured with a sufficiently large number of layers [76]. However, smaller main client networks may be susceptible to this issue, which can be controlled by modifying the loss function at the client side [77]. The main client is also unable to infer the proxy clients' model parameters, as it only has access to the gradients of the smashed data and the main-client-side updates, respectively. ...
Article
Full-text available
In Federated Learning (FL), the size of local models matters. On the one hand, it is logical to use large-capacity neural networks in pursuit of high performance. On the other hand, deep convolutional neural networks (CNNs) are exceedingly parameter-hungry, which makes memory a significant bottleneck when training large-scale CNNs on hardware-constrained devices such as smartphones or wearables sensors. Current state-of-the-art (SOTA) FL approaches either only test their convergence properties on tiny CNNs with inferior accuracy or assume clients have the adequate processing power to train large models, which remains a formidable obstacle in actual practice. To overcome these issues, we introduce FedDCT, a novel distributed learning paradigm that enables the usage of large, high-performance CNNs on resource-limited edge devices. As opposed to traditional FL approaches, which require each client to train the full-size neural network independently during each training round, the proposed FedDCT allows a cluster of several clients to collaboratively train a large deep learning model by dividing it into an ensemble of several small sub-models and train them on multiple devices in parallel while maintaining privacy. In this collaborative training process, clients from the same cluster can also learn from each other, further improving their ensemble performance. In the aggregation stage, the server takes a weighted average of all the ensemble models trained by all the clusters. FedDCT reduces the memory requirements and allows low-end devices to participate in FL. We empirically conduct extensive experiments on standardized datasets, including CIFAR-10, CIFAR-100, and two real-world medical datasets HAM10000 and VAIPE. Experimental results show that FedDCT outperforms a set of current SOTA FL methods with interesting convergence behaviors. Furthermore, compared to other existing approaches, FedDCT achieves higher accuracy and substantially reduces the number of communication rounds (with 4-8 times fewer memory requirements) to achieve the desired accuracy on the testing dataset without incurring any extra training cost on the server side.
... Recently, studies show that in FL, even though the raw data (feature and label) is not shared, sensitive information can still be leaked from the gradients and intermediate embeddings communicated between parties. For example, (Vepakomma et al. 2019) and (Sun et al. 2021) showed that the server's raw features can be leaked from the forward cut layer embedding. In addition, (Li et al. 2022) studied the label leakage problem but the leakage source was the backward gradients rather than forward embeddings. ...
Article
Federated learning (FL) has gained significant attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple participants. The prior work on FL has mostly studied how to protect label privacy during model training. However, model evaluation in FL might also lead to the potential leakage of private label information. In this work, we propose an evaluation algorithm that can accurately compute the widely used AUC (area under the curve) metric when using the label differential privacy (DP) in FL. Through extensive experiments, we show our algorithms can compute accurate AUCs compared to the ground truth. The code is available at https://github.com/bytedance/fedlearner/tree/master/example/privacy/DPAUC
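For intuition, label differential privacy on binary labels is commonly instantiated with randomized response, as in the generic sketch below (this is not necessarily the exact mechanism used in the paper, and the AUC debiasing step is omitted):
```python
import numpy as np

def randomized_response(labels, epsilon, rng=None):
    """Flip each binary label with probability 1 / (1 + e^epsilon), giving epsilon-label-DP."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    keep_prob = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    flip = rng.random(labels.shape) >= keep_prob
    return np.where(flip, 1 - labels, labels)
```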
... vFL is often considered to be privacy-oriented, because during training process participants only exchange intermediate hidden representations and gradients rather than raw features and labels. However, recent studies revealed that it still suffers from potential privacy leakage risks: (1) label inference attack [8,20,31], which means that a honest-but-curious non-label party can successfully infer private labels, and (2) input reconstruction attack [32,34], which means that the label party can reconstruct the raw input features of the non-label party. ...
Preprint
Full-text available
Conversion rate (CVR) estimation aims to predict the probability of a conversion event after a user has clicked an ad. Typically, the online publisher has user browsing interests and click feedback, while the demand-side advertising platform collects users' post-click behaviors such as dwell time and conversion decisions. To estimate CVR accurately and protect data privacy better, vertical federated learning (vFL) is a natural solution to combine the two sides' advantages for training models without exchanging raw data. Both CVR estimation and applied vFL algorithms have attracted increasing research attention. However, standardized and systematic evaluations are missing: due to the lack of standardized datasets, existing studies adopt public datasets to simulate a vFL setting via hand-crafted feature partition, which brings challenges to fair comparison. We introduce FedAds, the first benchmark for CVR estimation with vFL, to facilitate standardized and systematic evaluations for vFL algorithms. It contains a large-scale real-world dataset collected from Alibaba's advertising platform, as well as systematic evaluations of both the effectiveness and privacy aspects of various vFL algorithms. Besides, we also explore incorporating unaligned data in vFL to improve effectiveness, and develop perturbation operations to better protect privacy. We hope that future research work in vFL and CVR estimation benefits from the FedAds benchmark.
... Many defense methods have also been proposed. Some apply distance correlation to reduce the irrelevant information contained in intermediate results to alleviate the risk of inference [43], [45], [49], [51]. Others defend against inference attacks by adding random noise to intermediate results [9], [13], [33], [38], [41], [47], [53]. ...
... Label inference attack. Label leakage problem has been widely studied for VFL [11,27,42]. Although the results show that the malicious client succeeds to infer some label information, these studies are case-specific and have some limitations. ...
Preprint
Generative Adversarial Networks (GANs) have achieved state-of-the-art results in tabular data synthesis, under the presumption of directly accessible training data. Vertical Federated Learning (VFL) is a paradigm which allows distributed training of a machine learning model with clients possessing unique features pertaining to the same individuals, where tabular data learning is the primary use case. However, it is unknown whether tabular GANs can be learned in VFL. The demand for secure data transfer among clients and the GAN during training and data synthesis poses an extra challenge. The conditional vector for tabular GANs is a valuable tool to control specific features of the generated data, but it contains sensitive information from real data, risking privacy guarantees. In this paper, we propose GTV, a VFL framework for tabular GANs, whose key components are the generator, the discriminator, and the conditional vector. GTV proposes a unique distributed training architecture for the generator and discriminator to access training data in a privacy-preserving manner. To accommodate the conditional vector into training without privacy leakage, GTV designs a training-with-shuffling mechanism to ensure that no party can reconstruct training data with the conditional vector. We evaluate the effectiveness of GTV in terms of synthetic data quality and overall training scalability. Results show that GTV can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients and different numbers of clients.
... Through [63], [64], [65], various research directions have emerged with the goal of making distributed learning processes more scalable, secure, and privacy-preserving. Additionally, other research focuses on employing distributed DL for processing and learning from sensitive data such as health data [66] or data from public institutions [67]. ...
Preprint
Full-text available
Robotic systems are more connected, networked, and distributed than ever. New architectures that comply with the de facto robotics middleware standard, ROS 2, have recently emerged to fill the gap in terms of hybrid systems deployed from edge to cloud. This paper reviews new architectures and technologies that enable containerized robotic applications to seamlessly run at the edge or in the cloud. We also overview systems that include solutions ranging from extensions to ROS 2 tooling to the integration of Kubernetes and ROS 2. Another important trend is robot learning, and how new simulators and cloud simulations are enabling, e.g., large-scale reinforcement learning or distributed federated learning solutions. This has also enabled deeper integration of continuous integration and continuous deployment (CI/CD) pipelines for robotic systems development, going beyond standard software unit tests with simulated tests to build and validate code automatically. We discuss the current technology readiness and list the potential new application scenarios that are becoming available. Finally, we discuss the current challenges in distributed robotic systems and list open research questions in the field.
... Recently, studies show that in FL, even though the raw data (feature and label) is not shared, sensitive information can still be leaked from the gradients and intermediate embeddings communicated between parties. For example, [40] and [39] showed that the server's raw features can be leaked from the forward cut layer embedding. In addition, [26] studied the label leakage problem but the leakage source was the backward gradients rather than forward embeddings. ...
Preprint
Federated learning (FL) has gained significant attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple participants. The prior work on FL has mostly studied how to protect label privacy during model training. However, model evaluation in FL might also lead to potential leakage of private label information. In this work, we propose an evaluation algorithm that can accurately compute the widely used AUC (area under the curve) metric when using the label differential privacy (DP) in FL. Through extensive experiments, we show our algorithms can compute accurate AUCs compared to the ground truth.
... In this case, only the outputs of the cut layer are shared between users and the parameter server, no raw data is shared so that the user privacy and security are protected. SL was first proposed to be applied in medical applications [84], [85], where a model is trained with the sensitive health data from different hospitals. ...
Preprint
The cloud-based solutions are becoming inefficient due to considerably large time delays, high power consumption, security and privacy concerns caused by billions of connected wireless devices and typically zillions bytes of data they produce at the network edge. A blend of edge computing and Artificial Intelligence (AI) techniques could optimally shift the resourceful computation servers closer to the network edge, which provides the support for advanced AI applications (e.g., video/audio surveillance and personal recommendation system) by enabling intelligent decision making on computing at the point of data generation as and when it is needed, and distributed Machine Learning (ML) with its potential to avoid the transmission of large dataset and possible compromise of privacy that may exist in cloud-based centralized learning. Therefore, AI is envisioned to become native and ubiquitous in future communication and networking systems. In this paper, we conduct a comprehensive overview of recent advances in distributed intelligence in wireless networks under the umbrella of native-AI wireless networks, with a focus on the basic concepts of native-AI wireless networks, on the AI-enabled edge computing, on the design of distributed learning architectures for heterogeneous networks, on the communication-efficient technologies to support distributed learning, and on the AI-empowered end-to-end communications. We highlight the advantages of hybrid distributed learning architectures compared to the state-of-art distributed learning techniques. We summarize the challenges of existing research contributions in distributed intelligence in wireless networks and identify the potential future opportunities.
... In addition, the study also uses auxiliary information in order to improve the reconstruction performance. In this regard, NoPeekNN [26] limited the distance correlation between the intermediate tensors and the input data during the training process of splitNN. The method was specifically designed for autoencoders to limit the reconstruction of the input data, but has not been applied or tested concerning model inversion attacks. ...
Article
Full-text available
The past decade has seen a rapid adoption of Artificial Intelligence (AI), specifically deep learning networks, in the Internet of Medical Things (IoMT) ecosystem. However, it has been shown recently that deep learning networks can be exploited by adversarial attacks that make IoMT vulnerable not only to data theft but also to the manipulation of medical diagnosis. The existing studies consider adding noise to the raw IoMT data or model parameters, which not only reduces the overall performance concerning medical inferences but also is ineffective against the likes of the deep leakage from gradients method. In this work, we propose a proximal gradient split learning (PSGL) method for defense against model inversion attacks. The proposed method intentionally attacks the IoMT data when undergoing the deep neural network training process at the client side. We propose the use of the proximal gradient method to recover gradient maps and a decision-level fusion strategy to improve the recognition performance. Extensive analysis shows that PGSL not only provides an effective defense mechanism against model inversion attacks but also helps in improving the recognition performance on publicly available datasets. We report 14.0%, 17.9%, and 36.9% gains in accuracy over reconstructed and adversarially attacked images, respectively.
... Empirically, computation and communication take up most of the time in distributed training. In addition, recently the split learning [34][35][36][37][38][39] becomes more and more popular since it focuses on the privacy [40] of data and collaborative training. However, the core of split learning is pipeline parallelism. ...
Article
Full-text available
Pipeline parallelism is an efficient way to speed up the training of deep neural networks (DNNs) by partitioning the model and pipelining the training process across a cluster of workers in distributed systems. In this paper, we propose a new pipeline parallelization approach (Q-FB pipeline) for distributed deep learning, which can achieve both high training speed and high hardware utilization. The major novelty of Q-FB pipeline lies in a mechanism that can parallelize the backpropagation training without loss of precision. Since the parameters update of the backward phase depends on the error calculated in the forward phase, paralleling the backpropagation process naively will hurt the model’s convergence behaviour. To provide convergence guarantees, Q-FB pipeline lets the forward phase and backward phase execute in parallel on different processors with the techniques of shared model memory and accumulated gradients update. To overcome the communication bottleneck, Q-FB pipeline compresses both activations and gradients before transferring them to other workers. We adopt an activation quantization scheme for reducing traffic in the forward phase and propose a gradient compression algorithm (2-Step GC algorithm) for reducing communication costs in the backward phase. Experiments at both small and large computing clusters (e.g. Tianhe-2 supercomputer system) show that Q-FB pipeline can effectively accelerate the training process without loss in convergence or precision.
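Activation compression of the kind mentioned (quantize before transfer, dequantize on the receiving worker) can be illustrated with plain uniform 8-bit quantization; this is a generic scheme, not the paper's specific codec:
```python
import torch

def quantize_uint8(t):
    """Uniform affine quantization of an activation tensor to 8 bits for transfer."""
    lo, hi = t.min(), t.max()
    scale = torch.clamp(hi - lo, min=1e-8) / 255.0
    q = torch.round((t - lo) / scale).to(torch.uint8)
    return q, lo, scale

def dequantize_uint8(q, lo, scale):
    # Reconstruct an approximation of the original activations on the receiving worker.
    return q.float() * scale + lo
```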
... The effectiveness of their proposal is proportional to the Stepwise parameters and it also exhibits a trade-off between accuracy and privacy preservation. Studying another approach, in NoPeek [19], [20], Vepakomma et al. have shown that by reducing the distance correlation between the intermediary representations and raw data, information leakage from reconstruction attacks could be prevented. The authors focus on distance correlation metric and white-box reconstruction attacks without considering direct leakage from visual invertibility. ...
Preprint
Full-text available
Split learning (SL) enables data privacy preservation by allowing clients to collaboratively train a deep learning model with the server without sharing raw data. However, SL still has limitations such as potential data privacy leakage and high computation at clients. In this study, we propose to binarize the SL local layers for faster computation (up to 17.5 times less forward-propagation time in both training and inference phases on mobile devices) and reduced memory usage (up to 32 times less memory and bandwidth requirements). More importantly, the binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with merely a small degradation in model accuracy. To further enhance the privacy preservation, we also propose two novel approaches: 1) training with additional local leak loss and 2) applying differential privacy, which could be integrated separately or concurrently into the B-SL model. Experimental results with different datasets have affirmed the advantages of the B-SL models compared with several benchmark models. The effectiveness of B-SL models against feature-space hijacking attack (FSHA) is also illustrated. Our results have demonstrated B-SL models are promising for lightweight IoT/mobile applications with high privacy-preservation requirements such as mobile healthcare applications.
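Binarizing the local (client-side) layers is typically built on a sign function with a straight-through estimator for the backward pass, roughly as below; this is the standard construction rather than the exact B-SL variant:
```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign-binarize activations/weights; straight-through estimator in backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients only where the input lies within the clipping range [-1, 1].
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

binarize = BinarizeSTE.apply   # usage: z_bin = binarize(z)
```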
... It is worth noting that Vepakomma et al. (2019) further demonstrated the minimisation of distance correlation between the original data and the intermediary representation. This reduces the leakage of sensitive raw data patterns during client communication, while maintaining the accuracy of the model. ...
Article
Full-text available
Federated learning (FL) plays an important role in the development of smart cities. With the evolution of big data and artificial intelligence, issues related to data privacy and protection have emerged, which can be solved by FL. In this paper, the current developments in FL and its applications in various fields are reviewed. With a comprehensive investigation, the latest research on the application of FL is discussed for various fields in smart cities. We explain the current developments in FL in fields such as the Internet of Things (IoT), transportation, communications, finance, and medicine. First, we introduce the background, definition, and key technologies of FL. Then, we review key applications and the latest results. Finally, we discuss the future applications and research directions of FL in smart cities.
... Recently, studies show that in vFL, even though the raw data (feature and label) is not shared, sensitive information can still be leaked from the gradients and intermediate embeddings communicated between parties. For example, [27] and [26] showed that the server's raw features can be leaked from the forward cut layer embedding. In addition, [17] studied the label leakage problem but the leakage source was the backward gradients rather than forward embeddings. ...
Preprint
Federated learning has gained great attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple parties. As a sub-category, vertical federated learning (vFL) focuses on the scenario where features and labels are split into different parties. The prior work on vFL has mostly studied how to protect label privacy during model training. However, model evaluation in vFL might also lead to potential leakage of private label information. One mitigation strategy is to apply label differential privacy (DP) but it gives bad estimations of the true (non-private) metrics. In this work, we propose two evaluation algorithms that can more accurately compute the widely used AUC (area under curve) metric when using label DP in vFL. Through extensive experiments, we show our algorithms can achieve more accurate AUCs compared to the baselines.
... Prior works provide MI resistance for SFL inference by protecting intermediate activations [9,[20][21][22] or confidence score (intermediate activations of last softmax layer) [23][24][25]. However, MI resistance at training time is significantly more difficult. ...
Preprint
Full-text available
This work aims to tackle Model Inversion (MI) attack on Split Federated Learning (SFL). SFL is a recent distributed training scheme where multiple clients send intermediate activations (i.e., feature map), instead of raw data, to a central server. While such a scheme helps reduce the computational load at the client end, it opens itself to reconstruction of raw data from intermediate activation by the server. Existing works on protecting SFL only consider inference and do not handle attacks during training. So we propose ResSFL, a Split Federated Learning Framework that is designed to be MI-resistant during training. It is based on deriving a resistant feature extractor via attacker-aware training, and using this extractor to initialize the client-side model prior to standard SFL training. Such a method helps in reducing the computational complexity due to use of strong inversion model in client-side adversarial training as well as vulnerability of attacks launched in early training epochs. On CIFAR-100 dataset, our proposed framework successfully mitigates MI attack on a VGG-11 model with a high reconstruction Mean-Square-Error of 0.050 compared to 0.005 obtained by the baseline system. The framework achieves 67.5% accuracy (only 1% accuracy drop) with very low computation overhead. Code is released at: https://github.com/zlijingtao/ResSFL.
... Enabling technologies such as model parallelism, gradient compression, and Split Learning-based systems preserve privacy by minimising the similarity between the raw data and the intermediary activation vector sent from one server to another. Dependence metrics such as distance correlation, pairwise correlation, and mutual information score [242,261,262] can be utilised to quantify the leakage between the raw data and the intermediary activation vector. Such a metric ranges from 0 to 1, where 0 implies the raw data are independent of the intermediary activation vector. ...
Preprint
Full-text available
Deep Learning-based models have been widely investigated, and they have demonstrated significant performance on non-trivial tasks such as speech recognition, image processing, and natural language understanding. However, this is at the cost of substantial data requirements. Considering the widespread proliferation of edge devices (e.g. Internet of Things devices) over the last decade, Deep Learning in the edge paradigm, such as device-cloud integrated platforms, is required to leverage its superior performance. Moreover, it is suitable from the data requirements perspective in the edge paradigm because the proliferation of edge devices has resulted in an explosion in the volume of generated and collected data. However, there are difficulties due to other requirements such as high computation, high latency, and high bandwidth caused by Deep Learning applications in real-world scenarios. In this regard, this survey paper investigates Deep Learning at the edge, its architecture, enabling technologies, and model adaption techniques, where edge servers and edge devices participate in deep learning training and inference. For simplicity, we call this paradigm the All-in EDGE paradigm. Besides, this paper presents the key performance metrics for Deep Learning at the All-in EDGE paradigm to evaluate various deep learning techniques and choose a suitable design. Moreover, various open challenges arising from the deployment of Deep Learning at the All-in EDGE paradigm are identified and discussed.
... Several studies provided solutions for the lack of sufficient data due to the privacy challenges in the medical imaging domain. [117][118][119][120][121][122][123] For instance, Sheller et al developed a supervised DNN in a federated way for semantic segmentation of brain gliomas from magnetic resonance imaging scans. 118 Chang et al 123 simulated a distributed DNN in which multiple participants collaboratively update model weights using training heuristics such as single weight transfer and cyclical weight transfer (CWT). ...
Article
Full-text available
Background Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. Objectives However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. Method This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. Conclusion As the most promising direction, we suggest combining federated machine learning as a more scalable approach with other additional privacy-preserving techniques. This would allow to merge the advantages to provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.
... We show that minimizing DCOR minimizes their Kullback-Leibler divergence, which is a measure of the invertibility of the smashed data in information theory. For simplicity, we use the distance covariance (DCOV), which is an unnormalized DCOR [18], the Kullback-Leibler divergence D_KL, and the cross entropy H to build the connection. ...
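For reference, the standard sample estimators behind DCOV and DCOR (following the distance-correlation literature cited above as [18]) are:
```latex
% A and B are the doubly centered pairwise Euclidean distance matrices of the
% batches X (raw data) and Z (smashed data); n is the batch size.
\[
  \widehat{\mathrm{DCOV}}^2(X,Z) \;=\; \frac{1}{n^2}\sum_{i,j=1}^{n} A_{ij}\,B_{ij},
  \qquad
  \widehat{\mathrm{DCOR}}^2(X,Z) \;=\;
  \frac{\widehat{\mathrm{DCOV}}^2(X,Z)}
       {\sqrt{\widehat{\mathrm{DCOV}}^2(X,X)\,\widehat{\mathrm{DCOV}}^2(Z,Z)}}.
\]
```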
Preprint
Federated learning is a distributed machine learning mechanism where local devices collaboratively train a shared global model under the orchestration of a central server, while keeping all private data decentralized. In the system, model parameters and its updates are transmitted instead of raw data, and thus the communication bottleneck has become a key challenge. Besides, recent larger and deeper machine learning models also pose more difficulties in deploying them in a federated environment. In this paper, we design a federated two-stage learning framework that augments prototypical federated learning with a cut layer on devices and uses sign-based stochastic gradient descent with the majority vote method on model updates. Cut layer on devices learns informative and low-dimension representations of raw data locally, which helps reduce global model parameters and prevents data leakage. Sign-based SGD with the majority vote method for model updates also helps alleviate communication limitations. Empirically, we show that our system is an efficient and privacy preserving federated learning scheme and suits for general application scenarios.
... mechanisms [2], [3] can be used for data analysis through multi-party collaborative learning. Unfortunately, in most techniques, parties exchange their data and models in a direct and insecure fashion, leading to the compromise of data privacy [4], [5]. In a system like the smart grid, data privacy has great value. ...
Article
The smart power grid is a critical energy infrastructure where real-time electricity usage data is collected to predict future energy requirements. Existing prediction models focus on centralized frameworks, where the collected data from various Home Area Networks (HANs) are forwarded to a central server. This process leads to cybersecurity threats. This paper proposes a Federated Learning (FL) based model with privacy preservation of smart grid data using serverless cloud computing. The model considers Blockchain-enabled Dew Servers (BDS) in each HAN for local data storage and local model training. Advanced perturbation and normalization techniques are used to reduce the adverse impact of irregular workloads on the training results. Experiments conducted on benchmark datasets demonstrate that the proposed model minimizes the computation and communication costs and the attacking probability, and improves the test accuracy. Overall, the proposed model enables smart grids with robust privacy preservation and high accuracy.
... A comparison between collaborative and non-collaborative training modes is carried out, and the impact of the number of clients on the performance of both modes is investigated in [62]. The privacy of SplitNN is enhanced in [63] by minimizing the distance correlation between the intermediate features and the input data to reduce leakage. An empirical evaluation and comparison of federated learning and SplitNN with imbalanced and non-independent and identically distributed (non-IID) data in real-world IoT settings, in terms of performance and overhead (training time, communication overhead, power consumption, and memory usage), is presented in [64]. ...
Article
Full-text available
To extract knowledge from the large amounts of data collected by edge devices, a traditional cloud-based approach that requires data upload may not be feasible due to communication bandwidth limitations as well as the privacy and security concerns of end-users. A novel privacy-preserving edge intelligent computing framework for image classification in IoT is proposed to address these challenges. Specifically, an autoencoder is trained unsupervised at each edge device individually, and the obtained latent vectors are then transmitted to the edge server for the training of a classifier. This framework reduces the communication overhead and protects end-users' data. Compared to federated learning, the training of the classifier in the proposed framework is not subject to the constraints of the edge devices, and the autoencoder can be trained independently at each edge device without any server involvement. Compared to collaborative intelligence approaches such as SplitNN, the proposed method does not suffer from the high communication cost observed in SplitNN. Furthermore, the privacy of end-users' data is protected by transmitting latent vectors, without the additional cost of encryption. Experimental results provide insights into the image classification performance vs. various design parameters, such as the data compression ratio of the autoencoder and the model complexity.
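A minimal PyTorch sketch of the workflow this abstract describes, under assumed layer sizes: the autoencoder is trained locally on the device with a reconstruction loss, and only the latent vectors would be shipped to the edge server for classifier training.

```python
import torch
import torch.nn as nn

class EdgeAutoencoder(nn.Module):
    # Hypothetical encoder/decoder sizes; only encoder outputs leave the device.
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = EdgeAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                      # stand-in for a local batch of flattened images

recon, z = model(x)                          # one unsupervised local training step
loss = nn.functional.mse_loss(recon, x)
opt.zero_grad(); loss.backward(); opt.step()

payload = model.encoder(x).detach()          # only latent vectors are sent to the edge server
```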
... In consequence, multiple research directions have emerged to make distributed learning processes more scalable, secure, and privacy-preserving through [20,21,22]. Additionally, other research efforts are directed towards utilizing distributed DL for processing and learning from sensitive data such as health data [23] or medical data from multiple private or public institutions [24]. ...
Article
Full-text available
Autonomous systems are becoming inherently ubiquitous with the advancement of computing and communication solutions enabling low-latency offloading and real-time collaboration of distributed devices. Decentralized technologies with blockchain and distributed ledger technologies (DLTs) are playing a key role. At the same time, advances in deep learning (DL) have significantly raised the degree of autonomy and level of intelligence of robotic and autonomous systems. While these technological revolutions were taking place, rising concerns about data security and end-user privacy have become an inescapable research consideration. Federated learning (FL) is a promising solution to privacy-preserving DL at the edge, with an inherently distributed nature, learning on isolated data islands and communicating only model updates. However, FL by itself does not provide the levels of security and robustness required by today's standards in distributed autonomous systems. This survey covers applications of FL to autonomous robots, analyzes the role of DLT and FL for these systems, and introduces the key background concepts and considerations in current research.
... To run supervised model training, the node has to send its classification labels for each training data item to the server. NoPeekNN [24] reduces the risk of training data leakage. To run SL with multiple nodes, the node model must be shared between all involved nodes. ...
... It is worth noting that Vepakomma et al. (2019) further demonstrated the minimisation of distance correlation between the original data and the intermediary representation. This reduces the leakage of sensitive raw data patterns during client communication while maintaining the accuracy of the model. ...
Article
Full-text available
Federated learning (FL) plays an important role in the development of smart cities. With the evolution of big data and artificial intelligence, issues related to data privacy and protection have emerged, which can be solved by FL. In this paper, the current developments in FL and its applications in various fields are reviewed. With a comprehensive investigation, the latest research on the application of FL is discussed for various fields in smart cities. We explain the current developments in FL in fields, such as the Internet of Things (IoT), transportation, communications, finance, and medicine. First, we introduce the background, definition, and key technologies of FL. Then, we review key applications and the latest results. Finally, we discuss the future applications and research directions of FL in smart cities.
Article
Mobile Edge Computing (MEC) has great potential to facilitate cheap and fast customer behavior analysis (CBA). Model splitting, widely adopted in collaborative learning for MEC, partitions CBA models between customer devices and the edge servers in a layer-wise manner to support efficient distributed learning. However, the split-model architecture (SMA) is vulnerable to data reconstruction attacks that leak privacy through intermediate data, and the measurement of this risk remains unexplored. In this paper, we propose a privacy risk measurement framework, called InvMetrics, for split model-based CBA systems, which assesses the degree of privacy leakage from both the CBA owners' and the regulators' perspectives. For CBA owners, we propose a privacy metric, Distance Loss (DLoss), based on distance correlation, which is computationally efficient and thus eligible for deployment on customers' devices. For third-party evaluators, we propose Uncertainty Loss (ULoss), based on entropy, which can measure privacy risk without accessing raw data. Evaluation results on three CBA datasets and one image dataset demonstrate that the InvMetrics framework with DLoss and ULoss can accurately measure privacy risk and is more efficient than the state of the art.
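The distance-correlation side of DLoss mirrors the DCOR computation sketched earlier. For the entropy-based idea behind ULoss, a crude, hypothetical proxy is the Shannon entropy of quantized intermediate features, which needs no access to the raw inputs; this is only an illustration of the kind of quantity involved, not the InvMetrics definition.

```python
import numpy as np

def feature_entropy(smashed, bins=32):
    # Histogram-based Shannon entropy (in bits) of flattened intermediate features.
    # Used here only as a hypothetical stand-in for an entropy-style indicator that
    # can be computed from the smashed data alone, without the raw inputs.
    hist, _ = np.histogram(smashed.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(2)
print(feature_entropy(rng.normal(size=(64, 32))))   # entropy of a toy feature batch
```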
Article
Motivated by the advancing computational capacity of distributed end-user equipment (UE), as well as the increasing concerns about sharing private data, there has been considerable recent interest in machine learning (ML) and artificial intelligence (AI) that can be processed on distributed UEs. Specifically, in this paradigm, parts of an ML process are outsourced to multiple distributed UEs. Then, the processed information is aggregated on a certain level at a central server, which turns a centralized ML process into a distributed one and brings about significant benefits. However, this new distributed ML paradigm raises new risks in terms of privacy and security issues. In this article, we provide a survey of the emerging security and privacy risks of distributed ML from a unique perspective of information exchange levels, which are defined according to the key steps of an ML process, i.e., we consider the following levels: 1) the level of preprocessed data; 2) the level of learning models; 3) the level of extracted knowledge; and 4) the level of intermediate results. We explore and analyze the potential of threats for each information exchange level based on an overview of current state-of-the-art attack mechanisms and then discuss the possible defense methods against such threats. Finally, we complete the survey by providing an outlook on the challenges and possible directions for future research in this critical area.
Article
Split learning (SL) enables data privacy preservation by allowing clients to collaboratively train a deep learning model with the server without sharing raw data. However, SL still has limitations such as potential data privacy leakage and high computation for clients. In this paper, we propose to binarize the SL local layers for faster computation (up to 17.5 times less forward-propagation time in both training and inference phases on mobile devices) and reduced memory usage (up to 32 times less memory and bandwidth requirements). More importantly, the binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with merely a small degradation in model accuracy. To further enhance privacy preservation, we also propose two novel approaches: 1) training with additional local leak loss and 2) applying differential privacy, which could be integrated separately or concurrently into the B-SL model. Experimental results with different datasets have affirmed the benefits of the B-SL models compared with several benchmark models. The effectiveness of B-SL models against feature-space hijacking attack (FSHA) is also illustrated. Our results have demonstrated B-SL models are promising for lightweight IoT/mobile applications with high privacy-preservation requirements such as mobile healthcare applications.
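To illustrate the binarization idea (a generic sketch, not the authors' B-SL architecture), the snippet below binarizes the client-side activations with a straight-through estimator so that the smashed data sent to the server is restricted to {-1, +1} while gradients still flow through the local layers.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    # Forward pass: sign(); backward pass: straight-through estimator
    # (gradient passes unchanged where |x| <= 1 and is zeroed elsewhere).
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()

local_layer = torch.nn.Linear(16, 8)            # hypothetical client-side layer
x = torch.randn(4, 16)
smashed = BinarizeSTE.apply(local_layer(x))     # {-1, +1} activations leave the client
smashed.sum().backward()                        # gradients still reach the local layer
```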
Article
Full-text available
Cloud-based solutions are becoming inefficient due to considerable time delays, high power consumption, and the security and privacy concerns caused by billions of connected wireless devices and the zillions of bytes of data they produce at the network edge. A blend of edge computing and Artificial Intelligence (AI) techniques could optimally shift resourceful computation servers closer to the network edge, providing support for advanced AI applications (e.g., video/audio surveillance and personal recommendation systems) by enabling intelligent decision making at the point of data generation as and when it is needed, and distributed Machine Learning (ML), with its potential to avoid the transmission of large datasets and the possible compromise of privacy that may exist in cloud-based centralized learning. Besides, the deployment of AI techniques to redesign end-to-end communication is attracting attention to improve communication performance. Therefore, the interaction of AI and wireless communications gives rise to a new concept, named native AI wireless networks. In this paper, we conduct a comprehensive overview of recent advances in distributed intelligence in wireless networks under the umbrella of native AI wireless networks, with a focus on the design of distributed learning architectures for heterogeneous networks, on AI-enabled edge computing, on the communication-efficient technologies that support distributed learning, and on AI-empowered end-to-end communications. We highlight the advantages of hybrid distributed learning architectures compared to state-of-the-art distributed learning techniques. We summarize the challenges of existing research contributions in distributed intelligence in wireless networks and identify potential future opportunities.
Article
Communication and computation are often viewed as separate tasks. This approach is very effective from the perspective of engineering as isolated optimizations can be performed. However, for many computation-oriented applications, the main interest is a function of the local information at the devices, rather than the local information itself. In such scenarios, information theoretical results show that harnessing the interference in a multiple access channel for computation, i.e., over-the-air computation (OAC), can provide a significantly higher achievable computation rate than separating communication and computation tasks. Moreover, the gap between OAC and separation in terms of computation rate increases with more participating nodes. Given this motivation, in this study, we provide a comprehensive survey on practical OAC methods. After outlining fundamentals related to OAC, we discuss the available OAC schemes with their pros and cons. We provide an overview of the enabling mechanisms for achieving reliable computation in the wireless channel. Finally, we summarize the potential applications of OAC and point out some future directions.
Article
To address the privacy leakage that occurs when many IoT devices are utilized to train centralized models, a new distributed learning framework known as federated learning was created, in which devices train models together while keeping their private datasets local. In a federated learning setup, a central aggregator coordinates the efforts of several clients working together to solve machine learning problems. The privacy of each device's data is protected by this setup's decentralized training data. Federated learning reduces the systemic privacy issues and costs of traditional centralized machine learning systems by emphasizing local processing and model transfer. Client information is stored locally and cannot be copied or shared. By utilizing a centralized server, federated learning enables each participant's device to collect data locally for training purposes before sending the resulting model to the server for aggregation and subsequent distribution. As a means of providing a comprehensive review and encouraging further research into the topic, we survey work on federated learning from five different vantage points: data partitioning, privacy method, machine learning model, communication architecture, and systems heterogeneity. We then organize the issues facing federated learning today and the potential avenues for future study. Finally, we provide a brief overview of the features of existing federated learning and discuss how it is currently being used in the field.
Chapter
In many cooperative systems (e.g., autonomous vehicles, robotics, hospital networks), data are privately and heterogeneously distributed among devices with various computational constraints, and no party has a global view of the data or device distribution. Federated Neural Architecture Search (FedNAS) was previously proposed to adapt Neural Architecture Search (NAS) to Federated Learning (FL) so as to provide both privacy and model performance for such uninspectable and heterogeneous systems. However, these approaches mostly apply to scenarios where parties share the same data attributes and comparable computation resources. In this chapter, we present Self-supervised Vertical Federated Neural Architecture Search (SS-VFNAS) for automating FL where participants have heterogeneous data and resource constraints, a common cross-silo scenario. SS-VFNAS not only simultaneously optimizes all parties' model architectures and parameters for the best global performance under a vertical FL (VFL) framework using only a small set of aligned and labeled data, but also preserves each party's locally optimal model architecture under a self-supervised NAS framework. We demonstrate that SS-VFNAS is a promising framework with superior performance, communication efficiency, and privacy, capable of generating high-performance and highly transferable heterogeneous architectures with only limited overlapping samples, providing practical solutions for designing collaborative systems with both limited data and resource constraints.
Article
In smart grids, a major challenge is how to effectively utilize consumers' energy consumption data while preserving security and privacy. In this paper, we tackle this challenging issue and focus on energy theft detection, which is very important for smart grids. Specifically, we note that most existing energy theft detection schemes are centralized, which may be unscalable and, more importantly, may make it very difficult to protect data privacy. To address this issue, we propose a novel privacy-preserving federated learning framework for energy theft detection, namely FedDetect. In our framework, we consider a federated learning system that consists of a data center, a control center, and multiple detection stations. In this system, each detection station can only observe data from local consumers, who can use a local differential privacy (LDP) scheme to process their data to preserve privacy. To facilitate the training of the model, we design a secure protocol so that detection stations can send encrypted training parameters to the control center and the data center, which then use homomorphic encryption to calculate the aggregated parameters and return updated model parameters to the detection stations. We prove the security of the proposed protocol with a rigorous security analysis. To detect energy theft, we design a deep learning model based on the state-of-the-art temporal convolutional network (TCN). Finally, we conduct extensive data-driven experiments using a real energy consumption dataset. The experimental results demonstrate that the proposed federated learning framework achieves high detection accuracy with small computation overhead.
Chapter
In the distributed collaborative machine learning (DCML) paradigm, federated learning (FL) has recently attracted much attention due to its applications in health, finance, and the latest innovations such as Industry 4.0 and smart vehicles. FL provides privacy-by-design. It trains a machine learning model collaboratively over several distributed clients (ranging from two to millions), such as mobile phones, without sharing their raw data with any other participant. In practical scenarios, not all clients have sufficient computing resources (e.g., the Internet of Things), the machine learning model may have millions of parameters, and its privacy between the server and the clients during training/testing is a prime concern (e.g., among rival parties). In this regard, FL is not sufficient, so split learning (SL) was introduced. SL is reliable in these scenarios as it splits a model into multiple portions, distributes them among clients and the server, and trains/tests the respective model portions to accomplish full model training/testing. In SL, the participants share neither their data nor their model portions with any other parties, and usually a smaller network portion is assigned to the clients, where the data resides. Recently, a hybrid of FL and SL, called splitfed learning, was introduced to combine the benefits of both FL (faster training/testing time) and SL (model splitting and training). Following the developments from FL to SL, and considering the importance of SL, this chapter is designed to provide extensive coverage of SL and its variants. The coverage includes fundamentals, existing findings, integration with privacy measures such as differential privacy, open problems, and code implementation.
Article
Full-text available
A Hilbert space embedding of distributions---in short, kernel mean embedding---has recently emerged as a powerful machinery for probabilistic modeling, statistical inference, machine learning, and causal discovery. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It gave rise to a great deal of research and novel applications of positive definite kernels. The goal of this survey is to give a comprehensive review of existing works and recent advances in this research area, and to discuss some of the most challenging issues and open problems that could potentially lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels which forms the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures which prompts a wide range of applications such as kernel two-sample testing, independent testing, group anomaly detection, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform sum, product, and Bayes' rules---which are ubiquitous in graphical model, probabilistic inference, and reinforcement learning---in a non-parametric way using the new representation of distributions in RKHS. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions.
Article
Full-text available
Székely, Rizzo and Bakirov (2007) and Székely and Rizzo (2009), in two seminal papers, introduced the powerful concept of distance correlation as a measure of dependence between sets of random variables. We study in this paper an affinely invariant version of the distance correlation and an empirical version of that distance correlation, and we establish the consistency of the empirical quantity. In the case of subvectors of a multivariate normally distributed random vector, we provide exact expressions for the distance correlation in both finite-dimensional and asymptotic settings. To illustrate our results, we consider time series of wind vectors at the Stateline wind energy center in Oregon and Washington, and we derive the empirical auto- and cross-distance correlation functions between wind vectors at distinct meteorological stations.
Article
Full-text available
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, Maximum Mean Discrepancies (MMD), i.e., distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with the semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.
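In the Euclidean case the correspondence can be stated compactly. The LaTeX sketch below uses the distance-induced kernel with an arbitrary fixed base point z_0; it should be read as a sketch of the claim (constants depend on the convention chosen), not as a restatement of the paper's exact notation.

```latex
% Distance-induced kernel for the Euclidean semimetric, with a fixed base point z_0:
\begin{align*}
k(x, y) &= \tfrac{1}{2}\bigl(\lVert x - z_0\rVert + \lVert y - z_0\rVert - \lVert x - y\rVert\bigr), \\
\mathrm{MMD}_k^2(P, Q)
  &= \mathbb{E}\lVert X - Y\rVert
   - \tfrac{1}{2}\,\mathbb{E}\lVert X - X'\rVert
   - \tfrac{1}{2}\,\mathbb{E}\lVert Y - Y'\rVert
   = \tfrac{1}{2}\,\mathcal{E}(P, Q),
\end{align*}
% where X, X' \sim P and Y, Y' \sim Q are independent copies and \mathcal{E} denotes the energy distance.
```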
Article
Full-text available
This paper is concerned with screening features in ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve the SIS. Fan and Lv (2008) established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example.
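As a toy illustration of screening by distance correlation (not the DC-SIS procedure itself, whose thresholding rules are more involved), one can rank predictors by their marginal distance correlation with the response and keep the top few. The sketch below assumes the third-party dcor package is installed; the data-generating model is hypothetical.

```python
import numpy as np
import dcor  # third-party package providing distance_correlation (assumed installed)

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
# Toy response depending only on features 0 and 1 (one nonlinearly, one quadratically).
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# Marginal distance correlation of each predictor with the response.
scores = np.array([dcor.distance_correlation(X[:, j], y) for j in range(p)])
top = np.argsort(scores)[::-1][:5]   # keep the 5 highest-scoring predictors
print(top, scores[top])
```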
Conference Paper
Full-text available
We propose an independence criterion based on the eigenspectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a clearly defined population quantity which the empirical estimate approaches in the large sample limit, with exponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not suffer from slow learning rates. Finally, we show in the context of independent component analysis (ICA) that the performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently published ICA methods.
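A compact sketch of the empirical HSIC estimate with Gaussian kernels follows; the bandwidth and data are hypothetical, and the example is not tied to the ICA experiments in the paper.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    # Gaussian (RBF) Gram matrix between rows of x.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic_biased(x, y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / (n - 1)^2, with H the centering matrix.
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)

rng = np.random.default_rng(4)
x = rng.normal(size=(200, 2))
# Dependent pair (x, x^2) versus an independent pair: the first value should be larger.
print(hsic_biased(x, x ** 2), hsic_biased(x, rng.normal(size=(200, 2))))
```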
Article
Full-text available
Protecting implantable medical devices against attack without compromising patient health requires balancing security and privacy goals with traditional goals such as safety and utility. Implantable medical devices monitor and treat physiological conditions within the body. These devices - including pacemakers, implantable cardiac defibrillators (ICDs), drug delivery systems, and neurostimulators - can help manage a broad range of ailments, such as cardiac arrhythmia, diabetes, and Parkinson's disease. IMDs' pervasiveness continues to swell, with upward of 25 million US citizens currently reliant on them for life-critical functions. Growth is spurred by geriatric care of the aging baby-boomer generation, and new therapies continually emerge for chronic conditions ranging from pediatric type 1 diabetes to anorgasmia and other sexual dysfunctions. Moreover, the latest IMDs support delivery of telemetry for remote monitoring over long-range, high-bandwidth wireless links, and emerging devices will communicate with other interoperating IMDs.
Article
Full-text available
Distance correlation is a new measure of dependence between random vectors. Distance covariance and distance correlation are analogous to product-moment covariance and correlation, but unlike the classical definition of correlation, distance correlation is zero only if the random vectors are independent. The empirical distance dependence measures are based on certain Euclidean distances between sample elements rather than sample moments, yet have a compact representation analogous to the classical covariance and correlation. Asymptotic properties and applications in testing independence are discussed. Implementation of the test and Monte Carlo results are also presented.
Book
A Hilbert space embedding of a distribution—in short, a kernel mean embedding—has recently emerged as a powerful tool for machine learning and statistical inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original “feature map” common to support vector machines (SVMs) and other kernel methods. In addition to the classical applications of kernel methods, the kernel mean embedding has found novel applications in fields ranging from probabilistic modeling to statistical inference, causal discovery, and deep learning. Kernel Mean Embedding of Distributions: A Review and Beyond provides a comprehensive review of existing work and recent advances in this research area, and discusses some of the most challenging issues and open problems that could potentially lead to new research directions. The targeted audience includes graduate students and researchers in machine learning and statistics who are interested in the theory and applications of kernel mean embeddings.
Article
In domains such as health care and finance, shortage of labeled data and computational resources is a critical issue while developing machine learning algorithms. To address the issue of labeled data scarcity in training and deployment of neural network-based systems, we propose a new technique to train deep neural networks over several data sources. Our method allows for deep neural networks to be trained using data from multiple entities in a distributed fashion. We evaluate our algorithm on existing datasets and show that it obtains performance which is similar to a regular neural network trained on a single machine. We further extend it to incorporate semi-supervised learning when training with few labeled samples, and analyze any security concerns that may arise. Our algorithm paves the way for distributed training of deep neural networks in data sensitive applications when raw data may not be shared directly.
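The following PyTorch sketch shows the basic split-training handshake this line of work builds on: the client runs its layers up to a cut, sends the activations (and labels), the server finishes the forward/backward pass, and the gradient at the cut is returned to the client. Layer sizes, optimizers, and the single-client setup are hypothetical simplifications, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

# Client-side and server-side model portions (hypothetical sizes).
client_net = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
server_net = nn.Sequential(nn.Linear(128, 10))
client_opt = torch.optim.SGD(client_net.parameters(), lr=0.1)
server_opt = torch.optim.SGD(server_net.parameters(), lr=0.1)

x, y = torch.rand(32, 784), torch.randint(0, 10, (32,))   # raw data never leaves the client

# Client forward pass up to the cut layer; only the smashed data (and labels) are sent.
smashed = client_net(x)
sent = smashed.detach().requires_grad_(True)

# Server completes the forward pass, computes the loss, and backpropagates to the cut.
loss = nn.functional.cross_entropy(server_net(sent), y)
server_opt.zero_grad(); loss.backward(); server_opt.step()

# Client receives the gradient at the cut and finishes backpropagation locally.
client_opt.zero_grad(); smashed.backward(sent.grad); client_opt.step()
```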
Article
In this article, we deal with the problem of inferring causal directions when the data are on a discrete domain. By considering the distribution of the cause P(X) and the conditional distribution mapping cause to effect P(Y|X) as independent random variables, we propose to infer the causal direction by comparing the distance correlation between P(X) and P(Y|X) with the distance correlation between P(Y) and P(X|Y). We infer that X causes Y if the dependence coefficient between P(X) and P(Y|X) is smaller. Experiments are performed to show the performance of the proposed method.
Article
In our work, we propose a novel formulation for supervised dimensionality reduction based on a nonlinear dependency criterion called Statistical Distance Correlation, Székely et al. (2007). We propose an objective which is free of distributional assumptions on regression variables and of regression model assumptions. Our proposed formulation is based on learning a low-dimensional feature representation $\mathbf{z}$, which maximizes the squared sum of Distance Correlations between low-dimensional features $\mathbf{z}$ and response $y$, and also between features $\mathbf{z}$ and covariates $\mathbf{x}$. We propose a novel algorithm to optimize our proposed objective using the Generalized Minimization Maximization method of Parizi et al. (2015). We show superior empirical results on multiple datasets, proving the effectiveness of our proposed approach over several relevant state-of-the-art supervised dimensionality reduction methods.
Article
Information technology can improve the quality, efficiency, and cost of healthcare. In this survey, we examine the privacy requirements of mobile computing technologies that have the potential to transform healthcare. Such mHealth technology enables physicians to remotely monitor patients' health, and enables individuals to manage their own health more easily. Despite these advantages, privacy is essential for any personal monitoring technology. Through an extensive survey of the literature, we develop a conceptual privacy framework for mHealth, itemize the privacy properties needed in mHealth systems, and discuss the technologies that could support privacy-sensitive mHealth systems. We end with a list of open research questions.
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
Stanford. The democratization of health care. In Stanford Medicine 2018 Health Trends Report, 2018.
Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564, 2018a.
Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
Jakub Konečnỳ, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.