Article

Reducing leakage in distributed deep learning for sensitive health data

Abstract

For distributed machine learning with health data, we demonstrate how minimizing the distance correlation between raw data and intermediary representations (smashed data) reduces leakage of sensitive raw data patterns during client communications while maintaining model accuracy. Leakage (measured using KL divergence between the input and the intermediate representation) is the risk associated with the invertibility of intermediary representations; it can prevent resource-poor health organizations from using distributed deep learning services. We demonstrate that our method reduces leakage, in terms of distance correlation between raw data and communication payloads, from the order of 0.95 to 0.19 and from 0.92 to 0.33 during training with image datasets, while maintaining a similar classification accuracy.
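To make the regularization concrete, the sketch below (PyTorch, with illustrative names and an illustrative weighting coefficient alpha; it is a minimal reconstruction of the idea in the abstract, not the authors' released code) penalizes the sample distance correlation between a batch of raw inputs and the corresponding smashed data alongside the usual task loss.
```python
import torch
import torch.nn.functional as F

def distance_correlation(x, z, eps=1e-9):
    """Sample distance correlation between two batches of representations (in [0, 1])."""
    x = x.flatten(start_dim=1).float()
    z = z.flatten(start_dim=1).float()
    a = torch.cdist(x, x, p=2)                      # pairwise distances of raw data
    b = torch.cdist(z, z, p=2)                      # pairwise distances of smashed data
    A = a - a.mean(0, keepdim=True) - a.mean(1, keepdim=True) + a.mean()
    B = b - b.mean(0, keepdim=True) - b.mean(1, keepdim=True) + b.mean()
    dcov2 = (A * B).mean()                          # squared sample distance covariance
    dvar2_x = (A * A).mean()
    dvar2_z = (B * B).mean()
    dcor2 = dcov2 / (torch.sqrt(dvar2_x * dvar2_z) + eps)
    return torch.sqrt(torch.clamp(dcor2, min=0.0))

def nopeek_style_loss(x, smashed, logits, labels, alpha=0.1):
    # Task loss plus a leakage penalty on the client-to-server payload (smashed data).
    return F.cross_entropy(logits, labels) + alpha * distance_correlation(x, smashed)
```
In a split setting, the distance-correlation term would be computed on the client side, where both the raw inputs and the smashed data are available.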

... However, recent works [7,13,22] have reported that existing SL frameworks are not as effective as intended. As shown in Fig. 1 (c), the attackers are still able to recover the sensitive data such as raw inputs since prior privacy metrics [1,7,10,20] only focus on the input data in the inference process. Therefore, attaining a better privacy-utility balance under computational constraints in both training and inference stages is needed to securely deploy complex DNNs on edge devices. ...
... We achieve this by designing an adaptive DNN partitioning strategy and integrating a distance correlation-based privacy metric into the model training. Compared to the conventional distance correlation work [20], our work can be easily extended to different types of neural network models, including both sequential and non-sequential architectures. We evaluate awareSL with different DNN models on multiple image datasets. ...
... Research efforts [1,7,8,15,20] have been made to reduce the privacy leakage of inference data in SL through homomorphic encryption and measurable privacy metrics. For instance, homomorphic encryption [8,15] allows the ML model to perform inference directly on the encrypted data without intermediate decryption or prior knowledge of the private key, which prevents the reconstruction of sensitive information. ...
Chapter
Full-text available
With recent advances in deep neural networks (DNNs), there is a significant increase in IoT applications leveraging AI with edge-cloud infrastructures. Nevertheless, deploying large DNN models on resource-constrained edge devices is still challenging due to limitations in computation, power, and application-specific privacy requirements. Existing model partitioning methods, which deploy a partial DNN on an edge device while processing the remaining portion of the DNN on the cloud, mainly emphasize communication and power efficiency. However, DNN partitioning based on the privacy requirements and resource budgets of edge devices has not been sufficiently explored in the literature. In this paper, we propose awareSL, a model partitioning framework that splits DNN models based on the computational resources available on edge devices, preserving the privacy of input samples while maintaining high accuracy. In our evaluation of multiple DNN architectures, awareSL effectively identifies the split points that adapt to resource budgets of edge devices. Meanwhile, we demonstrate the privacy-preserving capability of awareSL against existing input reconstruction attacks without sacrificing inference accuracy in image classification tasks.
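As a rough, hypothetical illustration of resource-budgeted partitioning (not the actual awareSL strategy), a split point could be chosen greedily so that the cumulative cost of the client-side layers stays within the edge device's budget:
```python
def choose_split_point(layer_costs, device_budget):
    """Return the number of leading layers to keep on the edge device.

    layer_costs: per-layer cost estimates (e.g., FLOPs or memory), hypothetical values.
    device_budget: the edge device's resource budget in the same unit.
    """
    cumulative, split = 0.0, 0
    for i, cost in enumerate(layer_costs):
        if cumulative + cost > device_budget:
            break
        cumulative += cost
        split = i + 1
    return split  # layers [0, split) stay on the device, the rest run on the server
```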
... SL has been proposed as a privacy-preserving implementation of collaborative learning [47,46,36] and has gained particular interest due to its efficiency and simplicity. For user data privacy, SL relies on the fact that only intermediate activation maps are shared between the parties. ...
... Although the previous studies [18,47,36] assumed that SL is designed to protect the intellectual property of the shared model and to reduce the risk of inference attacks perpetrated by a malicious server, recent studies have shown that these assumptions are false and privacy leakage risks do exist in SL. In [46], the authors analyzed the privacy leakage of SL and found a considerable leakage from the split layer in the 2D CNN model. Also, the authors in paper [1] performed experiments on the 1D CNN model and found that sharing the activation from the split layer results in severe privacy leakage. ...
... In addition, countermeasures have also been introduced to achieve privacy-preserving deep learning. For example, Vepakomma et al. [46] proposed a method for limiting data reconstruction in split neural networks (SplitNN) by minimizing the distance correlation between the input data and the intermediate tensors during model training. Also, to protect the SL from property inference attack, one approach is to use secure aggregation protocol that can protect the privacy of the clients' data while still allowing for collaborative learning [34]. ...
Preprint
Full-text available
The popularity of Deep Learning (DL) makes the privacy of sensitive data more imperative than ever. As a result, various privacy-preserving techniques have been implemented to preserve user data privacy in DL. Among various privacy-preserving techniques, collaborative learning techniques, such as Split Learning (SL) have been utilized to accelerate the learning and prediction process. Initially, SL was considered a promising approach to data privacy. However, subsequent research has demonstrated that SL is susceptible to many types of attacks and, therefore, it cannot serve as a privacy-preserving technique. Meanwhile, countermeasures using a combination of SL and encryption have also been introduced to achieve privacy-preserving deep learning. In this work, we propose a hybrid approach using SL and Homomorphic Encryption (HE). The idea behind it is that the client encrypts the activation map (the output of the split layer between the client and the server) before sending it to the server. Hence, during both forward and backward propagation, the server cannot reconstruct the client's input data from the intermediate activation map. This improvement is important as it reduces privacy leakage compared to other SL-based works, where the server can gain valuable information about the client's input. In addition, on the MIT-BIH dataset, our proposed hybrid approach using SL and HE yields faster training time (about 6 times) and significantly reduced communication overhead (almost 160 times) compared to other HE-based approaches, thereby offering improved privacy protection for sensitive data in DL.
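A minimal sketch of the encrypt-before-sending step, assuming the open-source TenSEAL library's CKKS interface; the parameters and helper names are illustrative and not the paper's exact configuration:
```python
import tenseal as ts

# Client-side CKKS context (the secret key stays on the client).
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

def encrypt_activation_map(activation_values):
    # activation_values: a flat list of split-layer outputs produced by the client model.
    return ts.ckks_vector(ctx, activation_values)

enc = encrypt_activation_map([0.12, -0.03, 0.87, 0.45])
payload = enc.serialize()   # what actually travels to the server
# The server can operate homomorphically on the ciphertext (e.g., add or scale it)
# without seeing plaintext; only the client can call enc.decrypt() to recover values.
```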
... However, the server can still reconstruct the private data of the devices from the received activations due to the high correlation between the activations and the input when the allocated device-side model is too shallow [3], [4]. Although one can reduce the possibility of data leakage by increasing the device-side model, the training will become computationally intensive for resource-constrained devices. ...
... The work in [3] demonstrated that data leakage can happen when training convolutional neural networks in SL. In [4], the authors proposed a novel SL algorithm to enhance data privacy by minimizing the distance correlation between the intermediate activations and the input data. Meanwhile, in [6], the authors studied the use of SL at inference stage over wireless networks and the impact of non-iid datasets on its performance. ...
... As α decreases, the correlation between the input data and an intermediate layer output, i.e., activations a d,k , increases. Hence, it is possible to reconstruct input data from activations as shown in [3] and [4]. In other words, an honest-but-curious server can do model inversion attack during training to restore private input data [12]. ...
Preprint
Full-text available
Split learning (SL) is an emergent distributed learning framework which can mitigate the computation and wireless communication overhead of federated learning. It splits a machine learning model into a device-side model and a server-side model at a cut layer. Devices only train their allocated model and transmit the activations of the cut layer to the server. However, SL can lead to data leakage as the server can reconstruct the input data using the correlation between the input and intermediate activations. Although allocating more layers to a device-side model can reduce the possibility of data leakage, this will lead to more energy consumption for resource-constrained devices and more training time for the server. Moreover, non-iid datasets across devices will reduce the convergence rate leading to increased training time. In this paper, a new personalized SL framework is proposed. For this framework, a novel approach for choosing the cut layer that can optimize the tradeoff between the energy consumption for computation and wireless transmission, training time, and data privacy is developed. In the considered framework, each device personalizes its device-side model to mitigate non-iid datasets while sharing the same server-side model for generalization. To balance the energy consumption for computation and wireless transmission, training time, and data privacy, a multiplayer bargaining problem is formulated to find the optimal cut layer between devices and the server. To solve the problem, the Kalai-Smorodinsky bargaining solution (KSBS) is obtained using the bisection method with the feasibility test. Simulation results show that the proposed personalized SL framework with the cut layer from the KSBS can achieve the optimal sum utilities by balancing the energy consumption, training time, and data privacy, and it is also robust to non-iid datasets.
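For reference, the basic split-training step described above (device-side forward to the cut layer, server-side forward/backward, gradient of the cut-layer activations returned to the device) can be sketched in PyTorch as follows; the models, optimizers, and loss are placeholders:
```python
import torch
import torch.nn.functional as F

def split_training_step(device_model, server_model, x, y, device_opt, server_opt):
    # Device side: forward up to the cut layer and send the activations.
    activations = device_model(x)
    sent = activations.detach().requires_grad_(True)   # what travels over the network

    # Server side: finish the forward pass, compute the loss, update server parameters.
    server_opt.zero_grad()
    loss = F.cross_entropy(server_model(sent), y)
    loss.backward()
    server_opt.step()

    # Device side: receive the gradient of the cut-layer activations and finish backprop.
    device_opt.zero_grad()
    activations.backward(sent.grad)
    device_opt.step()
    return loss.item()
```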
... To address these challenges, various modified forms of decentralised learning have been proposed. Split federated learning 24 shares only some layers of a deep learning model to prevent model inversion attacks. Tiered Federated Learning 25 groups participants based on their training time into temporal tiers, while Generative Adversarial Networks-based federated learning 26 generates synthetic data to address data imbalance issues. ...
... The shared layers, a section of hidden layers in a NN, adapted from split learning 24,[27][28][29] , form the basic mode of sharing intelligence between the clients (see Fig. 1c). An individual copy of the layers is held by each client in the orbit, trained, and transferred with one other client after each melioration. ...
Article
Full-text available
A novel collaborative and continual learning across a network of decentralised healthcare units, avoiding identifiable data-sharing capacity, is proposed. Currently available methodologies, such as federated learning and swarm learning, have demonstrated decentralised learning. However, the majority of them face shortcomings that affect their performance and accuracy. These shortcomings include a non-uniform rate of data accumulation, non-uniform patient demographics, biased human labelling, and erroneous or malicious training data. A novel method to reduce such shortcomings is proposed in the present work through selective grouping and displacing of actors in a network of many entities for intra-group sharing of learning with inter-group accessibility. The proposed system, known as Orbital Learning, incorporates various features from split learning and ensemble learning for a robust and secure performance of supervised models. A digital embodiment of the information quality and flow within a decentralised network, this platform also acts as a digital twin of healthcare network. An example of ECG classification for arrhythmia with 6 clients is used to analyse its performance and is compared against federated learning. In this example, four separate experiments are conducted with varied configurations, such as varied age demographics and clients with data tampering. The results obtained show an average area under receiver operating characteristic curve (AUROC) of 0.819 (95% CI 0.784–0.853) for orbital learning whereas 0.714 (95% CI 0.692–0.736) for federated learning. This result shows an increase in overall performance and establishes that the proposed system can address the majority of the issues faced by existing decentralised learning methodologies. Further, a scalability demo conducted establishes the versatility and scalability of this platform in handling state-of-the-art large language models.
... However, this raises serious concerns regarding user data privacy, leading to the need for privacy-preserving machine learning (PPML) solutions [9]. Split Learning (SL) and Federated Learning (FL) are two PPML methods that rely on training ML models on decentralized data sources [19]. In FL [22], every client runs a copy of the entire model on its data. ...
... Different studies showed the possibility of privacy leakage in SL. In [19], the authors analyzed the privacy leakage of SL and found a considerable leakage from the split layer in the 2D CNN model. Furthermore, the authors mentioned that it is possible to reduce the distance correlation (a measure of dependence) between the split layer and raw data by slightly scaling the weights of all layers before the split. ...
Chapter
Full-text available
Split learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate Activation Maps (AMs) and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing AMs could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies homomorphic encryption on the AMs before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.
... SL partially transfers computation to the central server by splitting the whole model into multiple network portions, which are then trained separately on distributed clients and one or more central server entities. By sharing the cut-layer representations (smashed data) [6], clients and the server conduct forward and backward propagation on the global model. SL is, therefore, more suitable for resource-constrained devices than FL. ...
... We assume that the server and all participating clients are honest-but-curious, which means that they follow the scheme but may try to infer private information of other devices during training. [Algorithm 2: Privacy-preserving split federated learning (PPSFL) — pseudocode listing (TA generates SIFE and MIFE key pairs for each client; the server then iterates over aggregation rounds, clients in parallel, and local epochs) elided.] ...
Preprint
Full-text available
In this paper, we propose a novel and efficient privacy-preserving split federated learning (PPSFL) framework that achieves both privacy protection and model accuracy with reasonable computational and communication cost. We describe the implementations of PPSFL on Multi-layer Perceptron (MLP) and Convolutional Neural Network (CNN) models with distributed clients to evaluate the performance of PPSFL.
... Split Learning (SL) and Federated Learning (FL) are the two methods of collaboratively training a model derived from distributed data sources without sharing raw data [7]. In FL, every client runs a copy of the entire model on its data. ...
... Different studies showed the possibility of privacy leakage in SL. In [7], the authors analyzed the privacy leakage of SL and found a considerable leakage from the split layer in the 2D CNN model. Furthermore, the authors mentioned that it is possible to reduce the distance correlation (a measure of dependence) between the split layer and raw data by slightly scaling the weights of all layers before the split. ...
Preprint
Full-text available
Split Learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.
... Another one is that the attacker has no information about the client network. Compared with the existing assumption, our threat model is more pragmatic than assumptions in related works [46], [47], where the adversary is assumed to have direct access to leaked pairs of private data and the smashed data. • Attacker's capability: To conduct a stealthy backdoor attack, we impose more strict limitations on the attacker's capabilities compared to existing threat models. ...
Preprint
As a novel privacy-preserving paradigm aimed at reducing client computational costs and achieving data utility, split learning has garnered extensive attention and found widespread application across various fields, including smart health and smart transportation, among others. While recent studies have primarily concentrated on addressing privacy leakage concerns in split learning, such as inference attacks and data reconstruction, the exploration of security issues (e.g., backdoor attacks) within the framework of split learning has been comparatively limited. Nonetheless, the security vulnerability within the context of split learning poses a serious threat and can give rise to grave security implications, such as illegal impersonation against a face recognition model. Therefore, in this paper, we propose a stealthy backdoor attack strategy (namely SBAT) tailored to the without-label-sharing split learning architecture, which unveils the inherent security vulnerability of split learning. We posit the existence of a potential attacker on the server side aiming to introduce a backdoor into the training model, while exploring two scenarios: one with known client network architecture and the other with unknown architecture. Diverging from traditional backdoor attack methods that manipulate the training data and labels, we constructively conduct the backdoor attack by injecting the trigger embedding into the server network. Specifically, our SBAT achieves a higher level of attack stealthiness by refraining from modifying any intermediate parameters (e.g., gradients) during training and instead executing all malicious operations post-training.
... To handle such scenarios, several distributed learning approaches (DLA) are being created to address the existing challenges of privacy leakage, client-sensitive data, etc. [1]. To minimize the danger of privacy leakage [7], DLA uses private data such as text inputs, audio recordings, medical records, etc., while keeping the data local. In a distributed learning framework, there are multiple clients and training processes. ...
Conference Paper
Split Learning (SL) and Federated Learning (FL) are two popular distributed machine learning approaches that aim at preserving clients' privacy. In this work, we investigate the impact of data poisoning attacks, specifically label flipping attacks, clean label attacks and backdoor attacks, on the behaviour of the SL model. We employ data poisoning attacks on the CIFAR10 dataset, considering 10 clients and ResNet18 as the baseline model, obtaining a training accuracy of 98.22% and a testing accuracy of 83.99%. Based on the study conducted for the aforementioned attack scenarios, we observe that backdoor attacks are the most powerful and cause a drop of approximately 29.82% in accuracy, compared to a drop of 12.5% for clean label attacks and a negligible drop in the case of label flipping. We also studied the impact of each attack on the model's accuracy by altering the proportion of poisonous clients. In the future, we plan to extend the study by proposing defence strategies against these attack scenarios.
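A label-flipping poisoning of the kind studied here can be simulated in a few lines; the poisoning fraction and class count below are arbitrary illustrative values:
```python
import random

def flip_labels(samples, num_classes=10, poison_fraction=0.2, seed=0):
    """Randomly re-label a fraction of (x, y) samples to a wrong class."""
    rng = random.Random(seed)
    poisoned = []
    for x, y in samples:
        if rng.random() < poison_fraction:
            y = rng.choice([c for c in range(num_classes) if c != y])
        poisoned.append((x, y))
    return poisoned
```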
... Both [3] and [4] advocate for split learning, emphasizing its applicability in collaborative training on limited healthcare data with a focus on data privacy. The challenges of this approach become evident when [9] introduces a method aimed at curtailing data leakage in distributed health data learning. Furthermore, [10] identifies potential privacy leakages within split learning. ...
Conference Paper
Full-text available
This study presents Weighted Sampled Split Learning (WSSL), an innovative framework tailored to bolster privacy, robustness, and fairness in distributed machine learning systems. Unlike traditional approaches, WSSL disperses the learning process among multiple clients, thereby safeguarding data confidentiality. Central to WSSL's efficacy is its utilization of weighted sampling. This approach ensures equitable learning by tactically selecting influential clients based on their contributions. Our evaluation of WSSL spanned various client configurations and employed two distinct datasets: Human Gait Sensor and CIFAR-10. We observed three primary benefits: heightened model accuracy, enhanced robustness, and maintained fairness across diverse client compositions. Notably, our distributed frameworks consistently surpassed centralized counterparts, registering accuracy peaks of 82.63% and 75.51% for the Human Gait Sensor and CIFAR-10 datasets, respectively. These figures contrast with the top accuracies of 81.12% and 58.60% achieved by centralized systems. Collectively, our findings champion WSSL as a potent and scalable successor to conventional centralized learning, marking it as a pivotal stride forward in privacy-focused, resilient, and impartial distributed machine learning.
... For supervised model training, the node transmits the labels used for classifying each item of training data to the server. NoPeekNN [21] is used to reduce the likelihood of training data being compromised and leaked. ...
Article
Full-text available
With more personal devices being connected to the internet, individuals are becoming increasingly concerned about privacy. Therefore, it is important to develop machine learning algorithms that can use customer data to create personalized models while still adhering to strict privacy laws. In this paper, we propose a robust solution to this problem in a distributed, asynchronous environment with a verifiable convergence rate. Our proposed framework trains a Convolutional Neural Network on each client and sends the feature embeddings to other clients for data aggregation. This allows each client to train a deep-learning model on feature embeddings gathered from various clients in a single communication cycle. We provide a detailed description of the architecture and execution of our suggested approach. Our technique’s effectiveness is evaluated by comparing it to the top central training and federated learning (FL) algorithms, and our tests on diverse datasets demonstrate that our method outperforms FL in terms of accuracy and is comparable to central training algorithms. Our findings also show that our proposed method reduces data transfer by over 75% compared to FL, resulting in significant bandwidth savings. As a result, model training can assist companies with high security and data protection concerns in setting up reliable collaboration platforms without requiring a central service provider.
... To infer the main client's model parameters and raw data, the proxy client must be able to eavesdrop on the smashed data sent to all other proxy clients and invert all client-side model parameters, a highly unlikely possibility if the main client and other proxy clients' networks are configured with a sufficiently large number of layers [76]. However, smaller main client networks may be susceptible to this issue, which can be controlled by modifying the loss function at the client side [77]. The main client is also unable to infer the proxy clients' model parameters, as it only has access to the gradients of the smashed data and the main-client-side updates, respectively. ...
Article
Full-text available
In Federated Learning (FL), the size of local models matters. On the one hand, it is logical to use large-capacity neural networks in pursuit of high performance. On the other hand, deep convolutional neural networks (CNNs) are exceedingly parameter-hungry, which makes memory a significant bottleneck when training large-scale CNNs on hardware-constrained devices such as smartphones or wearables sensors. Current state-of-the-art (SOTA) FL approaches either only test their convergence properties on tiny CNNs with inferior accuracy or assume clients have the adequate processing power to train large models, which remains a formidable obstacle in actual practice. To overcome these issues, we introduce FedDCT, a novel distributed learning paradigm that enables the usage of large, high-performance CNNs on resource-limited edge devices. As opposed to traditional FL approaches, which require each client to train the full-size neural network independently during each training round, the proposed FedDCT allows a cluster of several clients to collaboratively train a large deep learning model by dividing it into an ensemble of several small sub-models and train them on multiple devices in parallel while maintaining privacy. In this collaborative training process, clients from the same cluster can also learn from each other, further improving their ensemble performance. In the aggregation stage, the server takes a weighted average of all the ensemble models trained by all the clusters. FedDCT reduces the memory requirements and allows low-end devices to participate in FL. We empirically conduct extensive experiments on standardized datasets, including CIFAR-10, CIFAR-100, and two real-world medical datasets HAM10000 and VAIPE. Experimental results show that FedDCT outperforms a set of current SOTA FL methods with interesting convergence behaviors. Furthermore, compared to other existing approaches, FedDCT achieves higher accuracy and substantially reduces the number of communication rounds (with 4-8 times fewer memory requirements) to achieve the desired accuracy on the testing dataset without incurring any extra training cost on the server side.
... Recently, studies show that in FL, even though the raw data (feature and label) is not shared, sensitive information can still be leaked from the gradients and intermediate embeddings communicated between parties. For example, (Vepakomma et al. 2019) and (Sun et al. 2021) showed that the server's raw features can be leaked from the forward cut layer embedding. In addition, (Li et al. 2022) studied the label leakage problem but the leakage source was the backward gradients rather than forward embeddings. ...
Article
Federated learning (FL) has gained significant attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple participants. The prior work on FL has mostly studied how to protect label privacy during model training. However, model evaluation in FL might also lead to the potential leakage of private label information. In this work, we propose an evaluation algorithm that can accurately compute the widely used AUC (area under the curve) metric when using the label differential privacy (DP) in FL. Through extensive experiments, we show our algorithms can compute accurate AUCs compared to the ground truth. The code is available at https://github.com/bytedance/fedlearner/tree/master/example/privacy/DPAUC
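For intuition, label differential privacy on binary labels is commonly instantiated with randomized response, as in the generic sketch below (this is not necessarily the exact mechanism used in the paper, and the AUC debiasing step is omitted):
```python
import numpy as np

def randomized_response(labels, epsilon, rng=None):
    """Flip each binary label with probability 1 / (1 + e^epsilon), giving epsilon-label-DP."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    keep_prob = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    flip = rng.random(labels.shape) >= keep_prob
    return np.where(flip, 1 - labels, labels)
```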
... vFL is often considered to be privacy-oriented, because during training process participants only exchange intermediate hidden representations and gradients rather than raw features and labels. However, recent studies revealed that it still suffers from potential privacy leakage risks: (1) label inference attack [8,20,31], which means that a honest-but-curious non-label party can successfully infer private labels, and (2) input reconstruction attack [32,34], which means that the label party can reconstruct the raw input features of the non-label party. ...
Preprint
Full-text available
Conversion rate (CVR) estimation aims to predict the probability of a conversion event after a user has clicked an ad. Typically, the online publisher has user browsing interests and click feedback, while the demand-side advertising platform collects users' post-click behaviors such as dwell time and conversion decisions. To estimate CVR accurately and protect data privacy better, vertical federated learning (vFL) is a natural solution to combine the two sides' advantages for training models without exchanging raw data. Both CVR estimation and applied vFL algorithms have attracted increasing research attention. However, standardized and systematic evaluations are missing: due to the lack of standardized datasets, existing studies adopt public datasets to simulate a vFL setting via hand-crafted feature partition, which brings challenges to fair comparison. We introduce FedAds, the first benchmark for CVR estimation with vFL, to facilitate standardized and systematic evaluations for vFL algorithms. It contains a large-scale real-world dataset collected from Alibaba's advertising platform, as well as systematic evaluations of both the effectiveness and privacy aspects of various vFL algorithms. Besides, we also explore incorporating unaligned data in vFL to improve effectiveness, and develop perturbation operations to better protect privacy. We hope that future research work in vFL and CVR estimation benefits from the FedAds benchmark.
... Many defense methods have also been proposed. Some apply distance correlation to reduce the irrelevant information contained in intermediate results to alleviate the risk of inference [43], [45], [49], [51]. Others defend against inference attacks by adding random noise to intermediate results [9], [13], [33], [38], [41], [47], [53]. ...
... Label inference attack. Label leakage problem has been widely studied for VFL [11,27,42]. Although the results show that the malicious client succeeds to infer some label information, these studies are case-specific and have some limitations. ...
Preprint
Generative Adversarial Networks (GANs) have achieved state-of-the-art results in tabular data synthesis, under the presumption of directly accessible training data. Vertical Federated Learning (VFL) is a paradigm which allows distributed training of a machine learning model with clients possessing unique features pertaining to the same individuals, where tabular data learning is the primary use case. However, it is unknown whether tabular GANs can be learned in VFL. The demand for secure data transfer among clients and the GAN during training and data synthesis poses an extra challenge. The conditional vector for tabular GANs is a valuable tool to control specific features of the generated data, but it contains sensitive information from real data, risking privacy guarantees. In this paper, we propose GTV, a VFL framework for tabular GANs, whose key components are the generator, the discriminator, and the conditional vector. GTV proposes a unique distributed training architecture for the generator and discriminator to access training data in a privacy-preserving manner. To accommodate the conditional vector into training without privacy leakage, GTV designs a training-with-shuffling mechanism to ensure that no party can reconstruct training data with the conditional vector. We evaluate the effectiveness of GTV in terms of synthetic data quality and overall training scalability. Results show that GTV can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients and different numbers of clients.
... Through [63], [64], [65], various research directions have emerged with the goal of making distributed learning processes more scalable, secure, and privacy-preserving. Additionally, other research focuses on employing distributed DL for processing and learning from sensitive data such as health data [66] or data from public institutions [67]. ...
Preprint
Full-text available
Robotic systems are more connected, networked, and distributed than ever. New architectures that comply with the de facto robotics middleware standard, ROS 2, have recently emerged to fill the gap in terms of hybrid systems deployed from edge to cloud. This paper reviews new architectures and technologies that enable containerized robotic applications to seamlessly run at the edge or in the cloud. We also overview systems that include solutions ranging from extensions to ROS 2 tooling to the integration of Kubernetes and ROS 2. Another important trend is robot learning, and how new simulators and cloud simulations are enabling, e.g., large-scale reinforcement learning or distributed federated learning solutions. This has also enabled deeper integration of continuous integration and continuous deployment (CI/CD) pipelines for robotic systems development, going beyond standard software unit tests with simulated tests to build and validate code automatically. We discuss the current technology readiness and list the potential new application scenarios that are becoming available. Finally, we discuss the current challenges in distributed robotic systems and list open research questions in the field.
... Recently, studies show that in FL, even though the raw data (feature and label) is not shared, sensitive information can still be leaked from the gradients and intermediate embeddings communicated between parties. For example, [40] and [39] showed that the server's raw features can be leaked from the forward cut layer embedding. In addition, [26] studied the label leakage problem but the leakage source was the backward gradients rather than forward embeddings. ...
Preprint
Federated learning (FL) has gained significant attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple participants. The prior work on FL has mostly studied how to protect label privacy during model training. However, model evaluation in FL might also lead to potential leakage of private label information. In this work, we propose an evaluation algorithm that can accurately compute the widely used AUC (area under the curve) metric when using the label differential privacy (DP) in FL. Through extensive experiments, we show our algorithms can compute accurate AUCs compared to the ground truth.
... In this case, only the outputs of the cut layer are shared between users and the parameter server, no raw data is shared so that the user privacy and security are protected. SL was first proposed to be applied in medical applications [84], [85], where a model is trained with the sensitive health data from different hospitals. ...
Preprint
The cloud-based solutions are becoming inefficient due to considerably large time delays, high power consumption, security and privacy concerns caused by billions of connected wireless devices and typically zillions bytes of data they produce at the network edge. A blend of edge computing and Artificial Intelligence (AI) techniques could optimally shift the resourceful computation servers closer to the network edge, which provides the support for advanced AI applications (e.g., video/audio surveillance and personal recommendation system) by enabling intelligent decision making on computing at the point of data generation as and when it is needed, and distributed Machine Learning (ML) with its potential to avoid the transmission of large dataset and possible compromise of privacy that may exist in cloud-based centralized learning. Therefore, AI is envisioned to become native and ubiquitous in future communication and networking systems. In this paper, we conduct a comprehensive overview of recent advances in distributed intelligence in wireless networks under the umbrella of native-AI wireless networks, with a focus on the basic concepts of native-AI wireless networks, on the AI-enabled edge computing, on the design of distributed learning architectures for heterogeneous networks, on the communication-efficient technologies to support distributed learning, and on the AI-empowered end-to-end communications. We highlight the advantages of hybrid distributed learning architectures compared to the state-of-art distributed learning techniques. We summarize the challenges of existing research contributions in distributed intelligence in wireless networks and identify the potential future opportunities.
... In addition, the study also uses auxiliary information in order to improve the reconstruction performance. In this regard, NoPeekNN [26] limited the distance correlation between the intermediate tensors and the input data during the training process of splitNN. The method was specifically designed for autoencoders to limit the reconstruction of the input data, but has not been applied or tested concerning model inversion attacks. ...
Article
Full-text available
The past decade has seen a rapid adoption of Artificial Intelligence (AI), specifically deep learning networks, in the Internet of Medical Things (IoMT) ecosystem. However, it has been shown recently that deep learning networks can be exploited by adversarial attacks that make IoMT vulnerable not only to data theft but also to the manipulation of medical diagnosis. The existing studies consider adding noise to the raw IoMT data or model parameters, which not only reduces the overall performance concerning medical inferences but also is ineffective against the likes of the deep leakage from gradients method. In this work, we propose a proximal gradient split learning (PSGL) method for defense against model inversion attacks. The proposed method intentionally attacks the IoMT data when undergoing the deep neural network training process at the client side. We propose the use of the proximal gradient method to recover gradient maps and a decision-level fusion strategy to improve the recognition performance. Extensive analysis shows that PGSL not only provides an effective defense mechanism against model inversion attacks but also helps in improving the recognition performance on publicly available datasets. We report 14.0%, 17.9%, and 36.9% gains in accuracy over reconstructed and adversarially attacked images, respectively.
... Empirically, computation and communication take up most of the time in distributed training. In addition, recently the split learning [34][35][36][37][38][39] becomes more and more popular since it focuses on the privacy [40] of data and collaborative training. However, the core of split learning is pipeline parallelism. ...
Article
Full-text available
Pipeline parallelism is an efficient way to speed up the training of deep neural networks (DNNs) by partitioning the model and pipelining the training process across a cluster of workers in distributed systems. In this paper, we propose a new pipeline parallelization approach (Q-FB pipeline) for distributed deep learning, which can achieve both high training speed and high hardware utilization. The major novelty of Q-FB pipeline lies in a mechanism that can parallelize the backpropagation training without loss of precision. Since the parameters update of the backward phase depends on the error calculated in the forward phase, paralleling the backpropagation process naively will hurt the model’s convergence behaviour. To provide convergence guarantees, Q-FB pipeline lets the forward phase and backward phase execute in parallel on different processors with the techniques of shared model memory and accumulated gradients update. To overcome the communication bottleneck, Q-FB pipeline compresses both activations and gradients before transferring them to other workers. We adopt an activation quantization scheme for reducing traffic in the forward phase and propose a gradient compression algorithm (2-Step GC algorithm) for reducing communication costs in the backward phase. Experiments at both small and large computing clusters (e.g. Tianhe-2 supercomputer system) show that Q-FB pipeline can effectively accelerate the training process without loss in convergence or precision.
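Activation compression of the kind mentioned (quantize before transfer, dequantize on the receiving worker) can be illustrated with plain uniform 8-bit quantization; this is a generic scheme, not the paper's specific codec:
```python
import torch

def quantize_uint8(t):
    """Uniform affine quantization of an activation tensor to 8 bits for transfer."""
    lo, hi = t.min(), t.max()
    scale = torch.clamp(hi - lo, min=1e-8) / 255.0
    q = torch.round((t - lo) / scale).to(torch.uint8)
    return q, lo, scale

def dequantize_uint8(q, lo, scale):
    # Reconstruct an approximation of the original activations on the receiving worker.
    return q.float() * scale + lo
```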
... The effectiveness of their proposal is proportional to the Stepwise parameters and it also exhibits a trade-off between accuracy and privacy preservation. Studying another approach, in NoPeek [19], [20], Vepakomma et al. have shown that by reducing the distance correlation between the intermediary representations and raw data, information leakage from reconstruction attacks could be prevented. The authors focus on distance correlation metric and white-box reconstruction attacks without considering direct leakage from visual invertibility. ...
Preprint
Full-text available
Split learning (SL) enables data privacy preservation by allowing clients to collaboratively train a deep learning model with the server without sharing raw data. However, SL still has limitations such as potential data privacy leakage and high computation at clients. In this study, we propose to binarize the SL local layers for faster computation (up to 17.5 times less forward-propagation time in both training and inference phases on mobile devices) and reduced memory usage (up to 32 times less memory and bandwidth requirements). More importantly, the binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with merely a small degradation in model accuracy. To further enhance the privacy preservation, we also propose two novel approaches: 1) training with additional local leak loss and 2) applying differential privacy, which could be integrated separately or concurrently into the B-SL model. Experimental results with different datasets have affirmed the advantages of the B-SL models compared with several benchmark models. The effectiveness of B-SL models against feature-space hijacking attack (FSHA) is also illustrated. Our results have demonstrated B-SL models are promising for lightweight IoT/mobile applications with high privacy-preservation requirements such as mobile healthcare applications.
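Binarizing the local (client-side) layers is typically built on a sign function with a straight-through estimator for the backward pass, roughly as below; this is the standard construction rather than the exact B-SL variant:
```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign-binarize activations/weights; straight-through estimator in backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients only where the input lies within the clipping range [-1, 1].
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

binarize = BinarizeSTE.apply   # usage: z_bin = binarize(z)
```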
... It is worth noting that Vepakomma et al. (2019) further demonstrated the minimisation of distance correlation between the original data and the intermediary representation. This reduces the leakage of sensitive raw data patterns during client communication, while maintaining the accuracy of the model. ...
Article
Full-text available
Federated learning (FL) plays an important role in the development of smart cities. With the evolution of big data and artificial intelligence, issues related to data privacy and protection have emerged, which can be solved by FL. In this paper, the current developments in FL and its applications in various fields are reviewed. With a comprehensive investigation, the latest research on the application of FL is discussed for various fields in smart cities. We explain the current developments in FL in fields such as the Internet of Things (IoT), transportation, communications, finance, and medicine. First, we introduce the background, definition, and key technologies of FL. Then, we review key applications and the latest results. Finally, we discuss the future applications and research directions of FL in smart cities.
... Recently, studies show that in vFL, even though the raw data (feature and label) is not shared, sensitive information can still be leaked from the gradients and intermediate embeddings communicated between parties. For example, [27] and [26] showed that the server's raw features can be leaked from the forward cut layer embedding. In addition, [17] studied the label leakage problem but the leakage source was the backward gradients rather than forward embeddings. ...
Preprint
Federated learning has gained great attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple parties. As a sub-category, vertical federated learning (vFL) focuses on the scenario where features and labels are split into different parties. The prior work on vFL has mostly studied how to protect label privacy during model training. However, model evaluation in vFL might also lead to potential leakage of private label information. One mitigation strategy is to apply label differential privacy (DP) but it gives bad estimations of the true (non-private) metrics. In this work, we propose two evaluation algorithms that can more accurately compute the widely used AUC (area under curve) metric when using label DP in vFL. Through extensive experiments, we show our algorithms can achieve more accurate AUCs compared to the baselines.
... Prior works provide MI resistance for SFL inference by protecting intermediate activations [9,[20][21][22] or confidence score (intermediate activations of last softmax layer) [23][24][25]. However, MI resistance at training time is significantly more difficult. ...
Preprint
Full-text available
This work aims to tackle Model Inversion (MI) attack on Split Federated Learning (SFL). SFL is a recent distributed training scheme where multiple clients send intermediate activations (i.e., feature map), instead of raw data, to a central server. While such a scheme helps reduce the computational load at the client end, it opens itself to reconstruction of raw data from intermediate activation by the server. Existing works on protecting SFL only consider inference and do not handle attacks during training. So we propose ResSFL, a Split Federated Learning Framework that is designed to be MI-resistant during training. It is based on deriving a resistant feature extractor via attacker-aware training, and using this extractor to initialize the client-side model prior to standard SFL training. Such a method helps in reducing the computational complexity due to use of strong inversion model in client-side adversarial training as well as vulnerability of attacks launched in early training epochs. On CIFAR-100 dataset, our proposed framework successfully mitigates MI attack on a VGG-11 model with a high reconstruction Mean-Square-Error of 0.050 compared to 0.005 obtained by the baseline system. The framework achieves 67.5% accuracy (only 1% accuracy drop) with very low computation overhead. Code is released at: https://github.com/zlijingtao/ResSFL.
... Enabling technologies such as model parallelism, gradient compression, and Split Learning-based systems preserve privacy by minimising the similarity between the raw data and the intermediary activation vector sent from one server to another. Dependence metrics such as distance correlation, pairwise correlation, and mutual information score [242,261,262] can be utilised to quantify the leakage between the raw data and the intermediary activation vector. Such a metric ranges from 0 to 1, where 0 implies the raw data are independent of the intermediary activation vector. ...
Preprint
Full-text available
Deep Learning-based models have been widely investigated, and they have demonstrated significant performance on non-trivial tasks such as speech recognition, image processing, and natural language understanding. However, this is at the cost of substantial data requirements. Considering the widespread proliferation of edge devices (e.g. Internet of Things devices) over the last decade, Deep Learning in the edge paradigm, such as device-cloud integrated platforms, is required to leverage its superior performance. Moreover, it is suitable from the data requirements perspective in the edge paradigm because the proliferation of edge devices has resulted in an explosion in the volume of generated and collected data. However, there are difficulties due to other requirements such as high computation, high latency, and high bandwidth caused by Deep Learning applications in real-world scenarios. In this regard, this survey paper investigates Deep Learning at the edge, its architecture, enabling technologies, and model adaption techniques, where edge servers and edge devices participate in deep learning training and inference. For simplicity, we call this paradigm the All-in EDGE paradigm. Besides, this paper presents the key performance metrics for Deep Learning at the All-in EDGE paradigm to evaluate various deep learning techniques and choose a suitable design. Moreover, various open challenges arising from the deployment of Deep Learning at the All-in EDGE paradigm are identified and discussed.
... Several studies provided solutions for the lack of sufficient data due to the privacy challenges in the medical imaging domain. [117][118][119][120][121][122][123] For instance, Sheller et al developed a supervised DNN in a federated way for semantic segmentation of brain gliomas from magnetic resonance imaging scans. 118 Chang et al 123 simulated a distributed DNN in which multiple participants collaboratively update model weights using training heuristics such as single weight transfer and cyclical weight transfer (CWT). ...
Article
Full-text available
Background Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. Objectives However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. Method This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. Conclusion As the most promising direction, we suggest combining federated machine learning as a more scalable approach with other additional privacy-preserving techniques. This would allow to merge the advantages to provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.
... We show that minimizing DCOR minimizes their Kullback-Leibler divergence, which is a measure of the invertibility of the smashed data in information theory. For simplicity, we use the distance covariance (DCOV), which is an unnormalized DCOR [18], the Kullback-Leibler divergence D_KL, and the cross entropy H to build the connection. ...
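For reference, the standard sample estimators behind DCOV and DCOR (following the distance-correlation literature cited above as [18]) are:
```latex
% A and B are the doubly centered pairwise Euclidean distance matrices of the
% batches X (raw data) and Z (smashed data); n is the batch size.
\[
  \widehat{\mathrm{DCOV}}^2(X,Z) \;=\; \frac{1}{n^2}\sum_{i,j=1}^{n} A_{ij}\,B_{ij},
  \qquad
  \widehat{\mathrm{DCOR}}^2(X,Z) \;=\;
  \frac{\widehat{\mathrm{DCOV}}^2(X,Z)}
       {\sqrt{\widehat{\mathrm{DCOV}}^2(X,X)\,\widehat{\mathrm{DCOV}}^2(Z,Z)}}.
\]
```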
Preprint
Federated learning is a distributed machine learning mechanism where local devices collaboratively train a shared global model under the orchestration of a central server, while keeping all private data decentralized. In the system, model parameters and its updates are transmitted instead of raw data, and thus the communication bottleneck has become a key challenge. Besides, recent larger and deeper machine learning models also pose more difficulties in deploying them in a federated environment. In this paper, we design a federated two-stage learning framework that augments prototypical federated learning with a cut layer on devices and uses sign-based stochastic gradient descent with the majority vote method on model updates. Cut layer on devices learns informative and low-dimension representations of raw data locally, which helps reduce global model parameters and prevents data leakage. Sign-based SGD with the majority vote method for model updates also helps alleviate communication limitations. Empirically, we show that our system is an efficient and privacy preserving federated learning scheme and suits for general application scenarios.
... mechanisms [2], [3] can be used for data analysis through multi-party collaborative learning. Unfortunately, in most techniques, parties exchange their data and models in a direct and insecure fashion, leading to the compromise of data privacy [4], [5]. In a system like the smart grid, data privacy has great value. ...
Article
The smart power grid is a critical energy infrastructure where real-time electricity usage data is collected to predict future energy requirements. Existing prediction models focus on centralized frameworks, where the collected data from various Home Area Networks (HANs) are forwarded to a central server. This process leads to cybersecurity threats. This paper proposes a Federated Learning (FL) based model with privacy preservation of smart grid data using serverless cloud computing. The model considers Blockchain-enabled Dew Servers (BDS) in each HAN for local data storage and local model training. Advanced perturbation and normalization techniques are used to reduce the adverse impact of irregular workloads on the training results. Experiments conducted on benchmark datasets demonstrate that the proposed model minimizes the computation and communication costs and the attacking probability, and improves the test accuracy. Overall, the proposed model enables smart grids with robust privacy preservation and high accuracy.
... A comparison between collaborative and non-collaborative training modes is carried out, and the impact of the number of clients on the performance of both modes is investigated in [62]. The privacy of SplitNN is enhanced in [63] by minimizing the distance correlation between the intermediate features and the input data to reduce leakage. An empirical evaluation and comparison of federated learning and SplitNN with imbalanced and non-independent and identically distributed (non-IID) data in real-world IoT settings, in terms of performance and overhead (training time, communication overhead, power consumption, and memory usage), is presented in [64]. ...
Article
Full-text available
To extract knowledge from the large amounts of data collected by edge devices, a traditional cloud-based approach that requires data upload may not be feasible due to communication bandwidth limitations as well as the privacy and security concerns of end-users. A novel privacy-preserving edge intelligent computing framework for image classification in IoT is proposed to address these challenges. Specifically, an autoencoder is trained unsupervised at each edge device individually, and the obtained latent vectors are then transmitted to the edge server for the training of a classifier. This framework reduces the communication overhead and protects end-users' data. Compared to federated learning, the training of the classifier in the proposed framework is not subject to the constraints of the edge devices, and the autoencoder can be trained independently at each edge device without any server involvement. Compared to collaborative intelligence approaches such as SplitNN, the proposed method does not suffer from the high communication cost observed in SplitNN. Furthermore, the privacy of end-users' data is protected by transmitting latent vectors, without the additional cost of encryption. Experimental results provide insights into the image classification performance vs. various design parameters, such as the data compression ratio of the autoencoder and the model complexity.
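A minimal PyTorch sketch of the workflow this abstract describes, under assumed layer sizes: the autoencoder is trained locally on the device with a reconstruction loss, and only the latent vectors would be shipped to the edge server for classifier training.

```python
import torch
import torch.nn as nn

class EdgeAutoencoder(nn.Module):
    # Hypothetical encoder/decoder sizes; only encoder outputs leave the device.
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = EdgeAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                      # stand-in for a local batch of flattened images

recon, z = model(x)                          # one unsupervised local training step
loss = nn.functional.mse_loss(recon, x)
opt.zero_grad(); loss.backward(); opt.step()

payload = model.encoder(x).detach()          # only latent vectors are sent to the edge server
```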
... In consequence, multiple research directions have emerged to make distributed learning processes more scalable, secure, and privacy-preserving through [20,21,22]. Additionally, other research efforts are directed towards utilizing distributed DL for processing and learning from sensitive data such as health data [23] or medical data from multiple private or public institutions [24]. ...
Article
Full-text available
Autonomous systems are becoming inherently ubiquitous with the advancement of computing and communication solutions enabling low-latency offloading and real-time collaboration of distributed devices. Decentralized technologies with blockchain and distributed ledger technologies (DLTs) are playing a key role. At the same time, advances in deep learning (DL) have significantly raised the degree of autonomy and level of intelligence of robotic and autonomous systems. While these technological revolutions were taking place, rising concerns about data security and end-user privacy have become an inescapable research consideration. Federated learning (FL) is a promising solution to privacy-preserving DL at the edge, with an inherently distributed nature, learning on isolated data islands and communicating only model updates. However, FL by itself does not provide the levels of security and robustness required by today's standards in distributed autonomous systems. This survey covers applications of FL to autonomous robots, analyzes the role of DLT and FL for these systems, and introduces the key background concepts and considerations in current research.
... To run supervised model training, the node has to send its classification labels for each training data item to the server. NoPeekNN [24] reduces the risk of training data leakage. To run SL with multiple nodes, the node model must be shared between all involved nodes. ...
... It is worth noting that Vepakomma et al. (2019) further demonstrated the minimisation of distance correlation between the original data and the intermediary representation. This reduces the leakage of sensitive raw data patterns during client communication while maintaining the accuracy of the model. ...
Article
Full-text available
Federated learning (FL) plays an important role in the development of smart cities. With the evolution of big data and artificial intelligence, issues related to data privacy and protection have emerged, which can be solved by FL. In this paper, the current developments in FL and its applications in various fields are reviewed. With a comprehensive investigation, the latest research on the application of FL is discussed for various fields in smart cities. We explain the current developments in FL in fields, such as the Internet of Things (IoT), transportation, communications, finance, and medicine. First, we introduce the background, definition, and key technologies of FL. Then, we review key applications and the latest results. Finally, we discuss the future applications and research directions of FL in smart cities.
Article
Mobile Edge Computing (MEC) has great potential to facilitate cheap and fast customer behavior analysis (CBA). Model splitting, widely adopted in collaborative learning for MEC, partitions CBA models between customer devices and the edge servers in a layer-wise manner to support efficient distributed learning. However, the split-model architecture (SMA) is vulnerable to data reconstruction attacks that leak privacy through intermediate data, and the measurement of this risk remains unexplored. In this paper, we propose a privacy risk measurement framework, called InvMetrics, for split model-based CBA systems, which assesses the degree of privacy leakage from both the CBA owners' and the regulators' perspectives. For CBA owners, we propose a privacy metric, Distance Loss (DLoss), based on distance correlation, which is computationally efficient and thus eligible for deployment on customers' devices. For third-party evaluators, we propose Uncertainty Loss (ULoss), based on entropy, which can measure privacy risk without accessing raw data. Evaluation results on three CBA datasets and one image dataset demonstrate that the InvMetrics framework with DLoss and ULoss can accurately measure privacy risk and is more efficient than the state of the art.
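The distance-correlation side of DLoss mirrors the DCOR computation sketched earlier. For the entropy-based idea behind ULoss, a crude, hypothetical proxy is the Shannon entropy of quantized intermediate features, which needs no access to the raw inputs; this is only an illustration of the kind of quantity involved, not the InvMetrics definition.

```python
import numpy as np

def feature_entropy(smashed, bins=32):
    # Histogram-based Shannon entropy (in bits) of flattened intermediate features.
    # Used here only as a hypothetical stand-in for an entropy-style indicator that
    # can be computed from the smashed data alone, without the raw inputs.
    hist, _ = np.histogram(smashed.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(2)
print(feature_entropy(rng.normal(size=(64, 32))))   # entropy of a toy feature batch
```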
Article
Motivated by the advancing computational capacity of distributed end-user equipment (UE), as well as the increasing concerns about sharing private data, there has been considerable recent interest in machine learning (ML) and artificial intelligence (AI) that can be processed on distributed UEs. Specifically, in this paradigm, parts of an ML process are outsourced to multiple distributed UEs. Then, the processed information is aggregated on a certain level at a central server, which turns a centralized ML process into a distributed one and brings about significant benefits. However, this new distributed ML paradigm raises new risks in terms of privacy and security issues. In this article, we provide a survey of the emerging security and privacy risks of distributed ML from a unique perspective of information exchange levels, which are defined according to the key steps of an ML process, i.e., we consider the following levels: 1) the level of preprocessed data; 2) the level of learning models; 3) the level of extracted knowledge; and 4) the level of intermediate results. We explore and analyze the potential of threats for each information exchange level based on an overview of current state-of-the-art attack mechanisms and then discuss the possible defense methods against such threats. Finally, we complete the survey by providing an outlook on the challenges and possible directions for future research in this critical area.
Article
Split learning (SL) enables data privacy preservation by allowing clients to collaboratively train a deep learning model with the server without sharing raw data. However, SL still has limitations such as potential data privacy leakage and high computation for clients. In this paper, we propose to binarize the SL local layers for faster computation (up to 17.5 times less forward-propagation time in both training and inference phases on mobile devices) and reduced memory usage (up to 32 times less memory and bandwidth requirements). More importantly, the binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with merely a small degradation in model accuracy. To further enhance privacy preservation, we also propose two novel approaches: 1) training with additional local leak loss and 2) applying differential privacy, which could be integrated separately or concurrently into the B-SL model. Experimental results with different datasets have affirmed the benefits of the B-SL models compared with several benchmark models. The effectiveness of B-SL models against feature-space hijacking attack (FSHA) is also illustrated. Our results have demonstrated B-SL models are promising for lightweight IoT/mobile applications with high privacy-preservation requirements such as mobile healthcare applications.
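To illustrate the binarization idea (a generic sketch, not the authors' B-SL architecture), the snippet below binarizes the client-side activations with a straight-through estimator so that the smashed data sent to the server is restricted to {-1, +1} while gradients still flow through the local layers.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    # Forward pass: sign(); backward pass: straight-through estimator
    # (gradient passes unchanged where |x| <= 1 and is zeroed elsewhere).
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()

local_layer = torch.nn.Linear(16, 8)            # hypothetical client-side layer
x = torch.randn(4, 16)
smashed = BinarizeSTE.apply(local_layer(x))     # {-1, +1} activations leave the client
smashed.sum().backward()                        # gradients still reach the local layer
```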
Article
Full-text available
Cloud-based solutions are becoming inefficient due to considerable time delays, high power consumption, and the security and privacy concerns caused by billions of connected wireless devices and the zillions of bytes of data they produce at the network edge. A blend of edge computing and Artificial Intelligence (AI) techniques could optimally shift resourceful computation servers closer to the network edge, providing support for advanced AI applications (e.g., video/audio surveillance and personal recommendation systems) by enabling intelligent decision making at the point of data generation as and when it is needed, and distributed Machine Learning (ML), with its potential to avoid the transmission of large datasets and the possible compromise of privacy that may exist in cloud-based centralized learning. Besides, the deployment of AI techniques to redesign end-to-end communication is attracting attention to improve communication performance. Therefore, the interaction of AI and wireless communications gives rise to a new concept, named native AI wireless networks. In this paper, we conduct a comprehensive overview of recent advances in distributed intelligence in wireless networks under the umbrella of native AI wireless networks, with a focus on the design of distributed learning architectures for heterogeneous networks, on AI-enabled edge computing, on the communication-efficient technologies that support distributed learning, and on AI-empowered end-to-end communications. We highlight the advantages of hybrid distributed learning architectures compared to state-of-the-art distributed learning techniques. We summarize the challenges of existing research contributions in distributed intelligence in wireless networks and identify potential future opportunities.
Article
Communication and computation are often viewed as separate tasks. This approach is very effective from the perspective of engineering as isolated optimizations can be performed. However, for many computation-oriented applications, the main interest is a function of the local information at the devices, rather than the local information itself. In such scenarios, information theoretical results show that harnessing the interference in a multiple access channel for computation, i.e., over-the-air computation (OAC), can provide a significantly higher achievable computation rate than separating communication and computation tasks. Moreover, the gap between OAC and separation in terms of computation rate increases with more participating nodes. Given this motivation, in this study, we provide a comprehensive survey on practical OAC methods. After outlining fundamentals related to OAC, we discuss the available OAC schemes with their pros and cons. We provide an overview of the enabling mechanisms for achieving reliable computation in the wireless channel. Finally, we summarize the potential applications of OAC and point out some future directions.
Article
To address the privacy leakage that occurs when many IoT devices are utilized to train centralized models, a new distributed learning framework known as federated learning was created, in which devices train models together while keeping their private datasets local. In a federated learning setup, a central aggregator coordinates the efforts of several clients working together to solve machine learning problems. The privacy of each device's data is protected by this setup's decentralized training data. Federated learning reduces the systemic privacy issues and costs of traditional centralized machine learning systems by emphasizing local processing and model transfer. Client information is stored locally and cannot be copied or shared. By utilizing a centralized server, federated learning enables each participant's device to collect data locally for training purposes before sending the resulting model to the server for aggregation and subsequent distribution. As a means of providing a comprehensive review and encouraging further research into the topic, we survey work on federated learning from five different vantage points: data partitioning, privacy method, machine learning model, communication architecture, and systems heterogeneity. We then organize the issues facing federated learning today and the potential avenues for future study. Finally, we provide a brief overview of the features of existing federated learning and discuss how it is currently being used in the field.
Chapter
In many cooperative systems (e.g., autonomous vehicles, robotics, hospital networks), data are privately and heterogeneously distributed among devices with various computational constraints, and no party has a global view of the data or device distribution. Federated Neural Architecture Search (FedNAS) was previously proposed to adapt Neural Architecture Search (NAS) to Federated Learning (FL) so as to provide both privacy and model performance for such uninspectable and heterogeneous systems. However, these approaches mostly apply to scenarios where parties share the same data attributes and comparable computation resources. In this chapter, we present Self-supervised Vertical Federated Neural Architecture Search (SS-VFNAS) for automating FL where participants have heterogeneous data and resource constraints, a common cross-silo scenario. SS-VFNAS not only simultaneously optimizes all parties' model architectures and parameters for the best global performance under a vertical FL (VFL) framework using only a small set of aligned and labeled data, but also preserves each party's locally optimal model architecture under a self-supervised NAS framework. We demonstrate that SS-VFNAS is a promising framework with superior performance, communication efficiency, and privacy, capable of generating high-performance and highly transferable heterogeneous architectures with only limited overlapping samples, providing practical solutions for designing collaborative systems with both limited data and resource constraints.
Article
In smart grids, a major challenge is how to effectively utilize consumers' energy consumption data while preserving security and privacy. In this paper, we tackle this challenging issue and focus on energy theft detection, which is very important for smart grids. Specifically, we note that most existing energy theft detection schemes are centralized, which may be unscalable and, more importantly, may make it very difficult to protect data privacy. To address this issue, we propose a novel privacy-preserving federated learning framework for energy theft detection, namely FedDetect. In our framework, we consider a federated learning system that consists of a data center, a control center, and multiple detection stations. In this system, each detection station can only observe data from local consumers, who can use a local differential privacy (LDP) scheme to process their data to preserve privacy. To facilitate the training of the model, we design a secure protocol so that detection stations can send encrypted training parameters to the control center and the data center, which then use homomorphic encryption to calculate the aggregated parameters and return updated model parameters to the detection stations. We prove the security of the proposed protocol with a rigorous security analysis. To detect energy theft, we design a deep learning model based on the state-of-the-art temporal convolutional network (TCN). Finally, we conduct extensive data-driven experiments using a real energy consumption dataset. The experimental results demonstrate that the proposed federated learning framework achieves high detection accuracy with small computation overhead.
Chapter
In the distributed collaborative machine learning (DCML) paradigm, federated learning (FL) has recently attracted much attention due to its applications in health, finance, and the latest innovations such as Industry 4.0 and smart vehicles. FL provides privacy-by-design. It trains a machine learning model collaboratively over several distributed clients (ranging from two to millions), such as mobile phones, without sharing their raw data with any other participant. In practical scenarios, not all clients have sufficient computing resources (e.g., the Internet of Things), the machine learning model may have millions of parameters, and its privacy between the server and the clients during training/testing is a prime concern (e.g., among rival parties). In this regard, FL is not sufficient, so split learning (SL) was introduced. SL is reliable in these scenarios as it splits a model into multiple portions, distributes them among clients and the server, and trains/tests the respective model portions to accomplish full model training/testing. In SL, the participants share neither their data nor their model portions with any other parties, and usually a smaller network portion is assigned to the clients, where the data resides. Recently, a hybrid of FL and SL, called splitfed learning, was introduced to combine the benefits of both FL (faster training/testing time) and SL (model splitting and training). Following the developments from FL to SL, and considering the importance of SL, this chapter is designed to provide extensive coverage of SL and its variants. The coverage includes fundamentals, existing findings, integration with privacy measures such as differential privacy, open problems, and code implementation.
Article
Full-text available
A Hilbert space embedding of distributions---in short, kernel mean embedding---has recently emerged as a powerful machinery for probabilistic modeling, statistical inference, machine learning, and causal discovery. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It gave rise to a great deal of research and novel applications of positive definite kernels. The goal of this survey is to give a comprehensive review of existing works and recent advances in this research area, and to discuss some of the most challenging issues and open problems that could potentially lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels which forms the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures which prompts a wide range of applications such as kernel two-sample testing, independent testing, group anomaly detection, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform sum, product, and Bayes' rules---which are ubiquitous in graphical model, probabilistic inference, and reinforcement learning---in a non-parametric way using the new representation of distributions in RKHS. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions.
Article
Full-text available
Székely, Rizzo and Bakirov (2007) and Székely and Rizzo (2009), in two seminal papers, introduced the powerful concept of distance correlation as a measure of dependence between sets of random variables. We study in this paper an affinely invariant version of the distance correlation and an empirical version of that distance correlation, and we establish the consistency of the empirical quantity. In the case of subvectors of a multivariate normally distributed random vector, we provide exact expressions for the distance correlation in both finite-dimensional and asymptotic settings. To illustrate our results, we consider time series of wind vectors at the Stateline wind energy center in Oregon and Washington, and we derive the empirical auto- and cross-distance correlation functions between wind vectors at distinct meteorological stations.
Article
Full-text available
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, Maximum Mean Discrepancies (MMD), i.e., distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with the semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.
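In the Euclidean case the correspondence can be stated compactly. The LaTeX sketch below uses the distance-induced kernel with an arbitrary fixed base point z_0; it should be read as a sketch of the claim (constants depend on the convention chosen), not as a restatement of the paper's exact notation.

```latex
% Distance-induced kernel for the Euclidean semimetric, with a fixed base point z_0:
\begin{align*}
k(x, y) &= \tfrac{1}{2}\bigl(\lVert x - z_0\rVert + \lVert y - z_0\rVert - \lVert x - y\rVert\bigr), \\
\mathrm{MMD}_k^2(P, Q)
  &= \mathbb{E}\lVert X - Y\rVert
   - \tfrac{1}{2}\,\mathbb{E}\lVert X - X'\rVert
   - \tfrac{1}{2}\,\mathbb{E}\lVert Y - Y'\rVert
   = \tfrac{1}{2}\,\mathcal{E}(P, Q),
\end{align*}
% where X, X' \sim P and Y, Y' \sim Q are independent copies and \mathcal{E} denotes the energy distance.
```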
Article
Full-text available
This paper is concerned with screening features in ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve the SIS. Fan and Lv (2008) established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example.
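As a toy illustration of screening by distance correlation (not the DC-SIS procedure itself, whose thresholding rules are more involved), one can rank predictors by their marginal distance correlation with the response and keep the top few. The sketch below assumes the third-party dcor package is installed; the data-generating model is hypothetical.

```python
import numpy as np
import dcor  # third-party package providing distance_correlation (assumed installed)

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
# Toy response depending only on features 0 and 1 (one nonlinearly, one quadratically).
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# Marginal distance correlation of each predictor with the response.
scores = np.array([dcor.distance_correlation(X[:, j], y) for j in range(p)])
top = np.argsort(scores)[::-1][:5]   # keep the 5 highest-scoring predictors
print(top, scores[top])
```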
Conference Paper
Full-text available
We propose an independence criterion based on the eigenspectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a clearly defined population quantity which the empirical estimate approaches in the large sample limit, with exponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not suffer from slow learning rates. Finally, we show in the context of independent component analysis (ICA) that the performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently published ICA methods.
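A compact sketch of the empirical HSIC estimate with Gaussian kernels follows; the bandwidth and data are hypothetical, and the example is not tied to the ICA experiments in the paper.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    # Gaussian (RBF) Gram matrix between rows of x.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic_biased(x, y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / (n - 1)^2, with H the centering matrix.
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)

rng = np.random.default_rng(4)
x = rng.normal(size=(200, 2))
# Dependent pair (x, x^2) versus an independent pair: the first value should be larger.
print(hsic_biased(x, x ** 2), hsic_biased(x, rng.normal(size=(200, 2))))
```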
Article
Full-text available
Protecting implantable medical devices against attack without compromising patient health requires balancing security and privacy goals with traditional goals such as safety and utility. Implantable medical devices monitor and treat physiological conditions within the body. These devices - including pacemakers, implantable cardiac defibrillators (ICDs), drug delivery systems, and neurostimulators - can help manage a broad range of ailments, such as cardiac arrhythmia, diabetes, and Parkinson's disease. IMDs' pervasiveness continues to swell, with upward of 25 million US citizens currently reliant on them for life-critical functions. Growth is spurred by geriatric care of the aging baby-boomer generation, and new therapies continually emerge for chronic conditions ranging from pediatric type 1 diabetes to anorgasmia and other sexual dysfunctions. Moreover, the latest IMDs support delivery of telemetry for remote monitoring over long-range, high-bandwidth wireless links, and emerging devices will communicate with other interoperating IMDs.
Article
Full-text available
Distance correlation is a new measure of dependence between random vectors. Distance covariance and distance correlation are analogous to product-moment covariance and correlation, but unlike the classical definition of correlation, distance correlation is zero only if the random vectors are independent. The empirical distance dependence measures are based on certain Euclidean distances between sample elements rather than sample moments, yet have a compact representation analogous to the classical covariance and correlation. Asymptotic properties and applications in testing independence are discussed. Implementation of the test and Monte Carlo results are also presented.
Book
A Hilbert space embedding of a distribution—in short, a kernel mean embedding—has recently emerged as a powerful tool for machine learning and statistical inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original “feature map” common to support vector machines (SVMs) and other kernel methods. In addition to the classical applications of kernel methods, the kernel mean embedding has found novel applications in fields ranging from probabilistic modeling to statistical inference, causal discovery, and deep learning. Kernel Mean Embedding of Distributions: A Review and Beyond provides a comprehensive review of existing work and recent advances in this research area, and discusses some of the most challenging issues and open problems that could potentially lead to new research directions. The targeted audience includes graduate students and researchers in machine learning and statistics who are interested in the theory and applications of kernel mean embeddings.
Article
In domains such as health care and finance, shortage of labeled data and computational resources is a critical issue while developing machine learning algorithms. To address the issue of labeled data scarcity in training and deployment of neural network-based systems, we propose a new technique to train deep neural networks over several data sources. Our method allows for deep neural networks to be trained using data from multiple entities in a distributed fashion. We evaluate our algorithm on existing datasets and show that it obtains performance which is similar to a regular neural network trained on a single machine. We further extend it to incorporate semi-supervised learning when training with few labeled samples, and analyze any security concerns that may arise. Our algorithm paves the way for distributed training of deep neural networks in data sensitive applications when raw data may not be shared directly.
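The following PyTorch sketch shows the basic split-training handshake this line of work builds on: the client runs its layers up to a cut, sends the activations (and labels), the server finishes the forward/backward pass, and the gradient at the cut is returned to the client. Layer sizes, optimizers, and the single-client setup are hypothetical simplifications, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

# Client-side and server-side model portions (hypothetical sizes).
client_net = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
server_net = nn.Sequential(nn.Linear(128, 10))
client_opt = torch.optim.SGD(client_net.parameters(), lr=0.1)
server_opt = torch.optim.SGD(server_net.parameters(), lr=0.1)

x, y = torch.rand(32, 784), torch.randint(0, 10, (32,))   # raw data never leaves the client

# Client forward pass up to the cut layer; only the smashed data (and labels) are sent.
smashed = client_net(x)
sent = smashed.detach().requires_grad_(True)

# Server completes the forward pass, computes the loss, and backpropagates to the cut.
loss = nn.functional.cross_entropy(server_net(sent), y)
server_opt.zero_grad(); loss.backward(); server_opt.step()

# Client receives the gradient at the cut and finishes backpropagation locally.
client_opt.zero_grad(); smashed.backward(sent.grad); client_opt.step()
```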
Article
In this article, we deal with the problem of inferring causal directions when the data are on a discrete domain. By considering the distribution of the cause P(X) and the conditional distribution mapping cause to effect P(Y|X) as independent random variables, we propose to infer the causal direction by comparing the distance correlation between P(X) and P(Y|X) with the distance correlation between P(Y) and P(X|Y). We infer that X causes Y if the dependence coefficient between P(X) and P(Y|X) is smaller. Experiments are performed to show the performance of the proposed method.
Article
In our work, we propose a novel formulation for supervised dimensionality reduction based on a nonlinear dependency criterion called Statistical Distance Correlation, Székely et al. (2007). We propose an objective which is free of distributional assumptions on regression variables and of regression model assumptions. Our proposed formulation is based on learning a low-dimensional feature representation $\mathbf{z}$, which maximizes the squared sum of Distance Correlations between low-dimensional features $\mathbf{z}$ and response $y$, and also between features $\mathbf{z}$ and covariates $\mathbf{x}$. We propose a novel algorithm to optimize our proposed objective using the Generalized Minimization Maximization method of Parizi et al. (2015). We show superior empirical results on multiple datasets, proving the effectiveness of our proposed approach over several relevant state-of-the-art supervised dimensionality reduction methods.
Article
Information technology can improve the quality, efficiency, and cost of healthcare. In this survey, we examine the privacy requirements of mobile computing technologies that have the potential to transform healthcare. Such mHealth technology enables physicians to remotely monitor patients' health, and enables individuals to manage their own health more easily. Despite these advantages, privacy is essential for any personal monitoring technology. Through an extensive survey of the literature, we develop a conceptual privacy framework for mHealth, itemize the privacy properties needed in mHealth systems, and discuss the technologies that could support privacy-sensitive mHealth systems. We end with a list of open research questions.
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
Stanford. The democratization of health care. In Stanford Medicine 2018 Health Trends Report, 2018.
Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564, 2018a.
Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
Jakub Konečnỳ, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.