Conference Paper

LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning


Abstract

This paper presents a new class of gradient methods for distributed machine learning that adaptively skip the gradient calculations to learn with reduced communication and computation. Simple rules are designed to detect slowly-varying gradients and, therefore, trigger the reuse of outdated gradients. The resultant gradient-based algorithms are termed Lazily Aggregated Gradient — justifying our acronym LAG used henceforth. Theoretically, the merits of this contribution are: i) the convergence rate is the same as batch gradient descent in strongly convex, convex, and nonconvex cases; and, ii) if the distributed datasets are heterogeneous (quantified by certain measurable constants), the communication rounds needed to achieve a targeted accuracy are reduced thanks to the adaptive reuse of lagged gradients. Numerical experiments on both synthetic and real data corroborate a significant communication reduction compared to alternatives.
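To make the lazy-aggregation idea concrete, the following Python sketch shows a simplified worker-side trigger of the kind the abstract describes: a fresh gradient is uploaded only when it has changed enough relative to how much the model has been moving. The function names, the constant xi, and the drift window are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def lag_worker_update(grad_fn, theta, last_sent_grad, recent_param_diffs,
                      num_workers, xi=0.5):
    """Decide whether a worker uploads a fresh gradient or lets the server
    keep reusing its lagged one. The threshold form is a simplified stand-in
    for the paper's rule; grad_fn, xi, and the drift window are illustrative.
    """
    new_grad = grad_fn(theta)
    # How much the global model has moved over the last few rounds.
    model_drift = sum(float(np.sum(d ** 2)) for d in recent_param_diffs)
    # How much this worker's gradient has changed since its last upload.
    grad_change = float(np.sum((new_grad - last_sent_grad) ** 2))
    if grad_change >= xi * model_drift / (num_workers ** 2):
        return new_grad, True     # upload: the change is too large to ignore
    return last_sent_grad, False  # skip: the server reuses the lagged gradient
```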


... 3) The synchronization interval is adjusted in lazy aggregation mode, i.e., only performing synchronization when the gradient update exceeds a predefined threshold (reducing unnecessary synchronization overhead) [14], [32]. The threshold can also be dynamically adjusted according to different data distributions at the local devices/workers [35]. ...
... The threshold can also be dynamically adjusted according to different data distributions at the local devices/workers [35]. In [14], [32], model synchronization occurs when the threshold is reached, whereas traditional FL optimization [33] uses the computing power and storage capacity of the device as a basis for adjusting the frequency of local updates. ...
Article
Full-text available
Geo-decentralized federated learning (FL) can empower fully distributed model training for future large-scale 6G networks. Without the centralized parameter server, the peer-to-peer model synchronization in geo-decentralized FL would incur excessive communication overhead. Some existing studies optimized synchronization interval for communication efficiency, but may not be applicable to latency-constrained geo-decentralized FL. This paper first proposes the synchronization interval optimization for latency-constrained geo-decentralized FL. The problem is formulated to maximize the model training accuracy within a time window under communication/computation constraints. We mathematically derive the convergence bound by jointly considering data heterogeneity, network topology and communication/computation resources. By minimizing the convergence bound, we optimize the synchronization interval based on the approximated system consistency metric. Extensive experiments on MNIST, Fashion-MNIST and CIFAR10 datasets validate the superiority of the proposed approach by achieving up to 30% higher accuracy than the state-of-the-art benchmarks.
... The second class of approaches focuses on the reduction of the communication iterations by eliminating the communication between some of the workers and the master node in some iterations [16]. The work [16] has proposed lazily aggregated gradient (LAG) for communication-efficient distributed learning in master-worker architectures. In LAG, each worker reports its gradient vector to the master node only if the gradient changes from the last communication iteration are large enough. ...
Article
Full-text available
This paper investigates efficient distributed training of a Federated Learning (FL) model over a wireless network of devices. The communication iterations of the distributed training algorithm may be substantially degraded or even blocked by the effects of the devices’ background traffic, packet losses, congestion, or latency. We abstract the communication-computation impacts as an ‘iteration cost’ and propose a cost-aware causal FL algorithm (FedCau) to tackle this problem. We propose an iteration-termination method that trades off the training performance and networking costs. We apply our approach when workers use the slotted-ALOHA, carrier-sense multiple access with collision avoidance (CSMA/CA), and orthogonal frequency-division multiple access (OFDMA) protocols. We show that, given a total cost budget, the training performance degrades as either the background communication traffic or the dimension of the training problem increases. Our results demonstrate the importance of proactively designing optimal cost-efficient stopping criteria to avoid unnecessary communication-computation costs to achieve a marginal FL training improvement. We validate our method by training and testing FL over the MNIST and CIFAR-10 datasets. Finally, we apply our approach to existing communication-efficient FL methods from the literature, achieving further efficiency. We conclude that cost-efficient stopping criteria are essential for the success of practical FL over wireless networks.
... A synchronous way of communication between client and server in an FL system is examined in the work of [20] that exchanges Gradient Descent parameters and finally uses the global aggregation of these gradient parameters at each learning step on the aggregator server. No deep learning loss functions are used, and no client participation decision is an option in this method. ...
... When the amount of training data is large, it is usually computationally prohibitive to compute the gradient of the loss function defined on the entire local client dataset. In cases such as our work, stochastic gradient descent is used [8,20], which evaluates the gradient of the loss function on a randomly sampled subset of the locally available training data, called a mini-batch. Each mini-batch can be considered a small collection of samples with no overlap between batches. ...
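For readers unfamiliar with the mini-batch procedure referenced above, a minimal sketch follows; grad_fn and all names are illustrative placeholders rather than any specific library API.

```python
import numpy as np

def minibatch_sgd_step(params, X, y, grad_fn, lr=0.01, batch_size=32, rng=None):
    """One mini-batch SGD step as described above: the gradient is evaluated
    on a randomly sampled subset of the local data rather than the full
    client dataset."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
    grad = grad_fn(params, X[idx], y[idx])  # gradient on the mini-batch only
    return params - lr * grad
```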
Article
Full-text available
Cloud computing and relevant emerging technologies have provided conventional methods for processing edge-produced data in a centralized manner. Presently, there is a tendency to offload processing tasks as close to the edge as possible to reduce the costs and network bandwidth used. In this direction, we find efforts that materialize this paradigm by introducing distributed deep learning methods and the so-called Federated Learning (FL). Such distributed architectures are valuable assets in terms of efficiently managing resources and eliciting predictions that can be used for proactive adaptation of distributed applications. In this work, we focus on deep learning local loss functions in multi-cloud environments. We introduce the MulticloudFL system, which enhances forecasting accuracy in dynamic settings by applying two new methods that improve prediction accuracy on application and resource monitoring metrics. The proposed algorithm’s performance is evaluated via various experiments that confirm the quality and benefits of the MulticloudFL system, as it improves the prediction accuracy on time-series data while reducing the bandwidth requirements and privacy risks during the training process.
... Compared to LAG. Corresponding to eq. (70) in [10], LAG defines a Lyapunov function ...
... Compared to LAG. According to eq. (50) in [10], we have that ...
Preprint
Full-text available
The widespread adoption of Federated Learning (FL), a privacy-preserving distributed learning methodology, has been impeded by the challenge of high communication overheads, typically arising from the transmission of large-scale models. Existing adaptive quantization methods, designed to mitigate these overheads, operate under the impractical assumption of uniform device participation in every training round. Additionally, these methods are limited in their adaptability due to the necessity of manual quantization level selection and often overlook biases inherent in local devices' data, thereby affecting the robustness of the global model. In response, this paper introduces AQUILA (adaptive quantization of lazily-aggregated gradients), a novel adaptive framework devised to effectively handle these issues, enhancing the efficiency and robustness of FL. AQUILA integrates a sophisticated device selection method that prioritizes the quality and usefulness of device updates. Utilizing the exact global model stored by devices, it enables a more precise device selection criterion, reduces model deviation, and limits the need for hyperparameter adjustments. Furthermore, AQUILA presents an innovative quantization criterion, optimized to improve communication efficiency while assuring model convergence. Our experiments demonstrate that AQUILA significantly decreases communication costs compared to existing methods, while maintaining comparable model performance across diverse non-homogeneous FL settings, such as Non-IID data and heterogeneous model architectures.
... Many studies have explored the application of federated learning (FL) in wireless networks [20], addressing critical issues such as wireless resource management [21]- [27], compression and sparsification [28]- [32], and training algorithm design [33]- [35]. However, these studies rarely consider the unique characteristics of vehicular networks, such as high mobility and rapidly changing channel conditions. ...
Preprint
Leveraging the computing and sensing capabilities of vehicles, vehicular federated learning (VFL) has been applied to edge training for connected vehicles. The dynamic and interconnected nature of vehicular networks presents unique opportunities to harness direct vehicle-to-vehicle (V2V) communications, enhancing VFL training efficiency. In this paper, we formulate a stochastic optimization problem to optimize the VFL training performance, considering the energy constraints and mobility of vehicles, and propose a V2V-enhanced dynamic scheduling (VEDS) algorithm to solve it. The model aggregation requirements of VFL and the limited transmission time due to mobility result in a stepwise objective function, which presents challenges in solving the problem. We thus propose a derivative-based drift-plus-penalty method to convert the long-term stochastic optimization problem to an online mixed integer nonlinear programming (MINLP) problem, and provide a theoretical analysis to bound the performance gap between the online solution and the offline optimal solution. Further analysis of the scheduling priority reduces the original problem into a set of convex optimization problems, which are efficiently solved using the interior-point method. Experimental results demonstrate that compared with the state-of-the-art benchmarks, the proposed algorithm enhances the image classification accuracy on the CIFAR-10 dataset by 3.18% and reduces the average displacement errors on the Argoverse trajectory prediction dataset by 10.21%.
... However, the large-batch approach leads to poor generalization [26,40,56], a challenge addressed by the post-local SGD method [40], which divides training into two phases: BSP-SGD followed by Local-SGD with a fixed number of steps. In the Lazily Aggregated Gradient (LAG) algorithm [8], a different approach was taken, using only new gradients from some selected workers and reusing the outdated gradients from the rest, which essentially skips communication rounds. Federated Averaging (FedAvg) [44] is another representative communication-efficient Local-SGD algorithm, which is a pivotal method in Federated Learning (FL) [30]. ...
Preprint
Full-text available
Driven by the ever-growing volume and decentralized nature of data, coupled with the escalating size of modern models, distributed deep learning (DDL) has been entrenched as the preferred paradigm for training. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, DDL algorithms typically waste valuable bandwidth, and make themselves less practical in bandwidth-constrained federated settings, by relying on overly simplistic, periodic, and rigid synchronization schedules. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Remarkably, FDA achieves this without sacrificing convergence speed - in stark contrast to the trade-offs encountered in the field. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
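As a rough illustration of synchronization triggered by model variance, the toy check below synchronizes only when the spread of the workers' local models exceeds a threshold; FDA's actual monitored quantity and its communication-light estimation are not reproduced here, and all names are illustrative.

```python
import numpy as np

def should_synchronize(local_models, threshold):
    """Toy variance-based trigger: synchronize only when the spread of the
    workers' local models grows too large."""
    stacked = np.stack([np.ravel(m) for m in local_models])  # (workers, params)
    variance = float(np.mean(np.var(stacked, axis=0)))       # mean per-parameter variance
    return variance > threshold
```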
... Specifically, parallel ideas for distributed reinforcement learning mainly include four types: data parallelism [133], model parallelism [134], pipeline parallelism [135], and hybrid parallelism [136]. Data parallelism mainly targets large-dataset, small-model scenarios by partitioning the dataset into several parts, which solves the problem that a single device's memory is limited and cannot store all the data, but it cannot solve the memory overflow problem triggered by the large scale of the network model. ...
Article
Full-text available
Extensive research has been carried out on reinforcement learning methods. The core idea of reinforcement learning is to learn by means of trial and error, and it has been successfully applied to robotics, autonomous driving, gaming, healthcare, resource management, and other fields. However, when building reinforcement learning solutions at the edge, there are not only the challenges of data hunger and insufficient computational resources but also the difficulty that a single reinforcement learning method cannot meet the model's requirements in terms of efficiency, generalization, robustness, and so on. These solutions rely on expert knowledge for the design of edge-side integrated reinforcement learning methods, and they lack high-level system architecture design to support their wider generalization and application. Therefore, in this paper, instead of surveying reinforcement learning systems, we survey the most commonly used options for each part of the architecture from the point of view of integrated application. We present the characteristics of traditional reinforcement learning in several aspects and design a corresponding integration framework based on them. In this process, we provide a complete primer on the design of reinforcement learning architectures while also demonstrating the flexibility of the various parts of the architecture to be adapted to the characteristics of different edge tasks. Overall, reinforcement learning has become an important tool in intelligent decision making, but it still faces many challenges in practical application in edge computing. The aim of this paper is to provide researchers and practitioners with a new, integrated perspective to better understand and apply reinforcement learning in edge decision-making tasks.
... The gTop-k gradient sparsification method in [19] reduces communication cost based on the Top-k method in [18]. [20] develops a method based on [21] that adaptively compresses the size of exchanged model gradients via quantization. ...
Conference Paper
Federated Learning (FL) has been successfully adopted for distributed training and inference of large-scale Deep Neural Networks (DNNs). However, DNNs are characterized by an extremely large number of parameters, thus yielding significant challenges in exchanging these parameters among distributed nodes and managing the memory. Although recent DNN compression methods (e.g., sparsification, pruning) tackle such challenges, they do not holistically consider an adaptively controlled reduction of parameter exchange while maintaining high accuracy levels. We therefore contribute a novel FL framework (coined FedDIP), which combines (i) dynamic model pruning with error feedback to eliminate redundant information exchange, which contributes to significant performance improvement, with (ii) incremental regularization that can achieve extreme sparsity of models. We provide convergence analysis of FedDIP and report on a comprehensive performance and comparative assessment against state-of-the-art methods using benchmark data sets and DNN models. Our results showcase that FedDIP not only controls the model sparsity but efficiently achieves similar or better performance compared to other model pruning methods adopting incremental regularization during distributed model training. The code is available at: https://github.com/EricLoong/feddip.
... Therefore, it is difficult to identify and detect other types of malicious IIoT attacks relevant to IIoT security applications, such as man-in-the-middle. Though the work of the authors in [16] leveraged the benefits of FL on modern IIoT traffic datasets and achieved better accuracy, they did not consider the unsupervised nature of anomaly detection in distributed datasets. Additionally, as in most other works applying FL to IoT/IIoT security, they used vanilla federated learning, which by default is synchronous and assumes that devices have the same computing power. ...
Preprint
Full-text available
In a network of many IoT devices that each collect data, training a machine learning model would normally involve transmitting the data to a central server, which requires strict privacy rules. However, some owners are reluctant to let their data leave the company due to data security concerns. Federated learning (FL), as a distributed machine learning approach, performs training of a machine learning model on the device that gathered the data itself. In this scenario, data is not shared over the network for training purposes. FedAvg, as one of the FL algorithms, permits a model to be copied to participating devices during a training session. The devices can be chosen at random, and a device may be dropped. The resulting models are sent to the coordinating server, which then averages the models from the devices that finished training. The process is repeated until a desired model accuracy is achieved. By doing this, the FL approach solves the privacy problem for IoT/IIoT devices that hold sensitive data for their owners. In this paper, we leverage the benefits of FL and implement the FedAvg algorithm on a recent dataset that represents modern IoT/IIoT device networks. The results were almost the same as those of the centralized machine learning approach. We also evaluate some shortcomings of FedAvg, such as the unfairness that occurs during training when struggling devices do not participate in every stage of training. This inefficient training of the local or global model could lead to a high number of false alarms in intrusion detection systems for IoT/IIoT gadgets developed using FedAvg. Hence, after evaluating the FedAvg deep autoencoder against a centralized deep autoencoder, we further propose and design a Fair FedAvg algorithm that will be evaluated in future work.
... Considerable research has been carried out to address these challenges, mainly in two directions: algorithmic and communication. Algorithmic methods range from reducing communication bandwidth by updating only UEs with significant training improvement [30] to compressing gradient vectors via quantization [31] or adopting a momentum method in the sparse update to accelerate training [32]. Communication methods include adapting the number of locally computing steps to the variance of the global gradient [33][34][35] or scheduling the maximum number of UEs in a given time frame [36]. ...
Preprint
Full-text available
These days with the rising computational capabilities of wireless user equipment such as smart phones, tablets, and vehicles, along with growing concerns about sharing private data, a novel machine learning model called federated learning (FL) has emerged. FL enables the separation of data acquisition and computation at the central unit, which is different from centralized learning that occurs in a data center. FL is typically used in a wireless edge network where communication resources are limited and unreliable. Bandwidth constraints necessitate scheduling only a subset of UEs for updates in each iteration, and because the wireless medium is shared, transmissions are susceptible to interference and are not assured. The article discusses the significance of Machine Learning in wireless communication and highlights Federated Learning (FL) as a novel approach that could play a vital role in future mobile networks, particularly 6G and beyond.
Article
Full-text available
As network technology advances, there is an increasing need for a trusted new-generation information management system. Blockchain technology provides a decentralized, transparent, and tamper-proof foundation. Meanwhile, data islands have become a significant obstacle for machine learning applications. Although federated learning (FL) ensures data privacy protection, server-side security concerns persist. Traditional methods have employed a blockchain system in FL frameworks to maintain a tamper-proof global model database. In this context, we propose pFedBASC, a novel personalized federated learning (pFL) framework with a blockchain-assisted semi-centralized architecture. This approach, tailored for Internet of Things (IoT) scenarios, constructs a semi-centralized IoT structure and utilizes trusted network connections to support FL. We concentrate on designing the aggregation process and FL algorithm, as well as the block structure. To address data heterogeneity and communication costs, we propose a pFL method called FedHype. In this method, each client is assigned a compact hypernetwork (HN) alongside a normal target network (TN) whose parameters are generated by the HN. Clients pull other clients’ HNs for local aggregation to personalize their TNs, reducing communication costs. Furthermore, FedHype can be integrated with other existing algorithms, enhancing its functionality. Experimental results reveal that pFedBASC effectively tackles data heterogeneity issues while maintaining accuracy, communication efficiency, and robustness.
Article
Distributed estimation has attracted great attention in the last few decades. In the problem of distributed estimation, a set of nodes estimate some parameter from noisy measurements. To leverage joint effort, the nodes communicate with each other in the estimation process. The communications consume bandwidth and energy resources, and these resources are often limited in real-world applications. To cope with the resource constraints, the event-triggered mechanism is proposed and widely adopted. It only allows signals to be transmitted if they carry a significant amount of information. Various criteria for determining whether the information is significant lead to different trigger rules. With these rules, resources can be saved. However, in the meantime, some inter-event information, not as important but still of certain use, is unavailable to the neighbors. The absence of this inter-event information may affect the algorithm performance. Considering this, in this paper, we propose an inter-event information retrieval scheme to recover certain untransmitted information, which, to the best of our knowledge, is the first work to do so. We design an approach for inter-event information retrieval, and formulate and solve an optimization problem with a closed-form solution to acquire the information. With more information at hand, the performance degradation caused by the event-triggered mechanism can be alleviated. We derive sufficient conditions for convergence of the overall algorithm. We also demonstrate the advantages of the proposed scheme by simulation experiments.
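A minimal sketch of the basic event-triggered transmission rule discussed in this abstract is given below; the threshold delta and the class structure are illustrative assumptions, and the article's inter-event information retrieval scheme is not shown.

```python
import numpy as np

class EventTriggeredNode:
    """Basic event-triggered rule: broadcast the local estimate only when it
    has drifted far enough from the last transmitted value."""

    def __init__(self, estimate, delta=0.1):
        self.estimate = np.asarray(estimate, dtype=float)
        self.last_sent = self.estimate.copy()
        self.delta = delta  # trigger threshold (illustrative)

    def maybe_transmit(self):
        innovation = np.linalg.norm(self.estimate - self.last_sent)
        if innovation > self.delta:          # information deemed significant
            self.last_sent = self.estimate.copy()
            return self.estimate             # sent to neighbors
        return None                          # stay silent this step
```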
Article
Federated learning (FL) is a machine learning paradigm that targets model training without gathering the local data dispersed over various data sources. Standard FL, which employs a single server, can only support a limited number of users, leading to degraded learning capability. In this work, we consider a multi-server FL framework, referred to as Confederated Learning (CFL), in order to accommodate a larger number of users. A CFL system is composed of multiple networked edge servers, with each server connected to an individual set of users. Decentralized collaboration among servers is leveraged to harness all users’ data for model training. Due to the potentially massive number of users involved, it is crucial to reduce the communication overhead of the CFL system. We propose a stochastic gradient method for distributed learning in the CFL framework. The proposed method incorporates a conditionally-triggered user selection (CTUS) mechanism as the central component to effectively reduce communication overhead. Relying on a delicately designed triggering condition, the CTUS mechanism allows each server to select only a small number of users to upload their gradients, without significantly jeopardizing the convergence performance of the algorithm. Our theoretical analysis reveals that the proposed algorithm enjoys a linear convergence rate. Simulation results show that it achieves substantial improvement over state-of-the-art algorithms in terms of communication efficiency.
Chapter
In this chapter, we first introduce the preliminaries of FL. In particular, we introduce federated averaging and two personalized FL algorithms. Then, we introduce four important performance metrics to quantify FL performance over wireless networks and analyze how wireless factors affect these metrics. Finally, we present the research directions and industry interest in designing communication-efficient FL over wireless networks.
Article
We consider the problem of online stochastic optimization in a distributed setting with M clients connected through a central server. We develop a distributed online learning algorithm that achieves order-optimal cumulative regret with low communication cost, measured in the total number of bits transmitted over the entire learning horizon. This is in contrast to existing studies, which focus on the offline measure of simple regret for learning efficiency. The holistic measure of communication cost also departs from the prevailing approach that separately tackles the communication frequency and the number of bits in each communication round.
Article
Federated learning (FL), as an emerging distributed machine learning paradigm, allows a mass of edge devices to collaboratively train a global model while preserving privacy. In this tutorial, we focus on FL via over-the-air computation (AirComp), which is proposed to reduce the communication overhead for FL over wireless networks at the cost of compromised learning performance due to model aggregation error arising from channel fading and noise. We first provide a comprehensive study on the convergence of AirComp-based FedAvg (AirFedAvg) algorithms under both strongly convex and non-convex settings with constant and diminishing learning rates in the presence of data heterogeneity. Through convergence and asymptotic analysis, we characterize the impact of aggregation error on the convergence bound and provide insights for system design with convergence guarantees. Then we derive convergence rates for AirFedAvg algorithms for strongly convex and non-convex objectives. For different types of local updates that can be transmitted by edge devices (i.e., local model, gradient, and model difference), we reveal that transmitting the local model in AirFedAvg may cause divergence in the training procedure. In addition, we consider more practical signal processing schemes to improve the communication efficiency and further extend the convergence analysis to different forms of model aggregation error caused by these signal processing schemes. Extensive simulation results under different settings of objective functions, transmitted local information, and communication schemes verify the theoretical conclusions.
Chapter
Communication bottleneck has been identified as a significant issue in large-scale training of machine learning models over a network. Recently, several approaches have been proposed to mitigate this issue, using gradient compression and infrequent communication based techniques. This chapter summarizes two communication efficient algorithms, Qsparse-local-SGD and SQuARM-SGD, for distributed and decentralized settings, respectively. These algorithms utilize composed sparsification and quantization operators for aggressive compression, along with local iteration for communication efficiency. We provide theoretical convergence guarantees for these algorithms for smooth non-convex objectives. We also review extensive numerical experiments of our methods for training ResNet architectures and compare them against the state-of-the-art methods in their respective settings. The content of this chapter is based on the papers: Basu et al. (Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In: NeurIPS, 2019), Basu et al. (IEEE J Sel Areas Inf Theory 1(1):217–226, 2020), Singh et al. (SPARQ-SGD: Event-triggered and compressed communication in decentralized optimization. In: IEEE Control and Decision Conference (CDC), 2020), Singh et al. (IEEE Trans Autom Control, 2022), Singh et al. (IEEE J Sel Areas Inf Theory 2(3):954–969, 2021).
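The composition of sparsification and quantization that this chapter builds on can be sketched as follows; the operators chosen here (top-k plus a scaled-sign quantizer) are one plausible instantiation under stated assumptions, and the error-feedback and local-iteration machinery of the cited algorithms is omitted.

```python
import numpy as np

def topk_sparsify(v, k):
    """Keep only the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def scaled_sign_quantize(v):
    """Scaled-sign quantizer: one sign bit per entry plus a single scale."""
    nonzero = v[v != 0]
    scale = np.mean(np.abs(nonzero)) if nonzero.size else 0.0
    return scale * np.sign(v)

def compress(update, k):
    """Composed compression: sparsify first, then quantize the survivors."""
    return scaled_sign_quantize(topk_sparsify(update, k))
```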
Article
The widespread adoption of edge computing has revolutionized data processing by decentralizing computational power, resulting in faster response times, decreased network latency, and enhanced privacy. This paradigm shift is crucial due to the increasing demand for real-time analytics and the growth of Internet of Things devices. Optimizing resource allocation and decision-making in on-board computing presents challenges, especially in dynamic and heterogeneous environments. Distributed optimization techniques address these challenges by allowing multiple agents to collaborate and make decisions based on local knowledge. This paper focuses on distributed optimization in multi-agent systems with time-varying communication networks and explores the advantages of considering symmetric group action to reach global optimization. In particular, we focus our study on a class of Integer Linear Programming (ILP) problems for modeling real-world optimization problems in decentralized environments. The paper introduces a novel approach that leverages group actions and probabilistic selection of initial states to enhance convergence to desirable solutions. The method guarantees feasibility while minimizing computational effort for the agents. Additionally, numerical simulations have been conducted to validate the proposed algorithm, demonstrating its efficacy in optimizing resource allocation for a class of scheduling problems.
Article
Federated learning (FL) embraces the concepts of targeted data gathering and training, and it can reduce many of the systemic privacy costs and hazards associated with traditional machine learning frameworks. However, with the low latency requirements of sixth generation (6G) wireless communication networks and the Internet of Things (IoT) networks, the convergence delay of FL dramatically influences the overall system performance. In order to solve this urgent and challenging problem, in this paper, a joint client scheduling and wireless resource allocation algorithm is proposed, named SCSBA, which considers system heterogeneity, client heterogeneity, and the fairness of client participation to reduce the latency resulting from the heterogeneous communication conditions and computation capabilities among clients with non-identically and independently distributed (Non-IID) data distributions. Specifically, a Stackelberg leader-follower game is first formulated in which the server decides the price of the single quota for participating in the FL process in every communication round and the clients decide whether to participate in FL. Then the equilibrium solution of the game is derived and proved. In addition, a bandwidth allocation algorithm based on the covariance matrix adaptation evolutionary strategy (CMA-ES) is designed to minimize the time delay of each communication round. The simulation results verify the effectiveness of the proposed strategy for reducing the time latency of FL processes with heterogeneous clients, i.e., FedAvg and FedOpt.
Article
The widespread adoption of Federated Learning (FL), a privacy-preserving distributed learning methodology, has been impeded by the challenge of high communication overheads, typically arising from the transmission of large-scale models. Existing adaptive quantization methods, designed to mitigate these overheads, operate under the impractical assumption of uniform device participation. Additionally, these methods are limited in their adaptability due to the necessity of manual quantization level selection and often overlook biases inherent in local devices' data, thereby affecting the robustness of the global model. In response, this paper introduces AQUILA (adaptive quantization in device selection strategy), a novel adaptive framework devised to effectively handle these issues, enhancing the efficiency and robustness of FL. AQUILA integrates a sophisticated device selection method that prioritizes the quality and usefulness of device updates. Utilizing the exact global model stored by devices enables a more precise device selection criterion, reduces model deviation, and limits the need for hyperparameter adjustments. Furthermore, AQUILA presents an innovative quantization criterion, optimized to improve communication efficiency while assuring model convergence. Our experiments demonstrate that AQUILA significantly decreases communication costs compared to existing methods, while maintaining comparable model performance across diverse non-homogeneous FL settings, such as Non-IID data and heterogeneous model architectures.
Article
A wireless federated learning system is investigated by allowing a server and multiple workers to exchange uncoded information via orthogonal wireless channels. Since the workers frequently upload local gradients to the server via band-limited channels, the uplink transmission from the workers to the server becomes a communication bottleneck. Therefore, a one-shot distributed principal component analysis (PCA) is leveraged to reduce the dimension of uploaded gradients to relieve the communication bottleneck. A PCA-based wireless federated learning (PCA-WFL) algorithm and its accelerated version (i.e., PCA-AWFL) are proposed based on the low-dimensional gradients and Nesterov’s momentum. For the non-convex empirical risk, a finite-time analysis is performed to quantify the impacts of system hyper-parameters on the convergence of the PCA-WFL and PCA-AWFL algorithms. The PCA-AWFL algorithm is theoretically certified to converge faster than the PCA-WFL algorithm. Besides, the convergence rates of the PCA-WFL and PCA-AWFL algorithms quantitatively reveal the linear speedup with respect to the number of workers over the vanilla gradient descent algorithm. Numerical results are used to demonstrate the improved convergence rates of the proposed PCA-WFL and PCA-AWFL algorithms over the benchmarks.
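A rough sketch of the gradient dimension reduction underlying this approach is shown below; it uses a centralized SVD as a stand-in for the one-shot distributed PCA, the wireless transmission details are not modeled, and all names are illustrative.

```python
import numpy as np

def pca_basis(grad_history, r):
    """Estimate r principal directions from a matrix of past gradients
    (rows = gradients)."""
    centered = grad_history - grad_history.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:r]                          # (r, d) orthonormal rows

def compress_gradient(grad, basis):
    return basis @ grad                    # upload r coefficients instead of d

def reconstruct_gradient(coeffs, basis):
    return basis.T @ coeffs                # server-side approximation
```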
Article
Deep learning's widespread adoption in various fields has made distributed training across multiple computing nodes essential. However, frequent communication between nodes can significantly slow down training speed, creating a bottleneck in distributed training. To address this issue, researchers are focusing on communication optimization algorithms for distributed deep learning systems. In this paper, we propose a standard that systematically classifies all communication optimization algorithms based on mathematical modeling, which is not achieved by existing surveys in the field. We categorize existing works into four categories based on the optimization strategies of communication: communication masking, communication compression, communication frequency reduction, and hybrid optimization. Finally, we discuss potential future challenges and research directions in the field of communication optimization algorithms for distributed deep learning systems.
Chapter
As an emerging paradigm, federated learning (FL) trains a shared global model by multi-party collaboration without leaking privacy, since no private data is transmitted between the server and clients. However, it still faces two challenges: statistical heterogeneity and communication efficiency. To tackle them simultaneously, we propose pFedLHNs, which assigns each client both a small hypernetwork (HN) and a large target network (NN) whose parameters are generated by the hypernetwork. Each client pulls other clients’ hypernetworks from the server for local aggregation to personalize its local target model and only exchanges the small hypernetwork with other clients via the central server to reduce communication costs. Besides, the server also aggregates received local hypernetworks to construct a global hypernetwork and uses it to initialize newly joining out-of-distribution (OOD) clients for cold start. Extensive experiments on three datasets with Non-IID distributions demonstrate the superiority of pFedLHNs in the trade-off between model accuracy and communication efficiency. The case studies justify its tolerance to statistical heterogeneity and new OOD clients.
Article
A major challenge of applying zeroth-order (ZO) methods is the high query complexity, especially when queries are costly. We propose a novel gradient estimation technique for ZO methods based on adaptive lazy queries that we term LAZO. Unlike the classic one-point or two-point gradient estimation methods, LAZO develops two alternative ways to check the usefulness of old queries from previous iterations, and then adaptively reuses them to construct low-variance gradient estimates. We rigorously establish that, through judiciously reusing the old queries, LAZO can reduce the variance of stochastic gradient estimates so that it not only saves queries per iteration but also achieves the regret bound of the symmetric two-point method. We evaluate the numerical performance of LAZO, and demonstrate the low-variance property and the performance gain of LAZO in both regret and query complexity relative to several existing ZO methods. The idea of LAZO is general and can be applied to other variants of ZO methods.
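The following sketch illustrates the idea of lazily reusing an old zeroth-order query; the reuse test and scaling here are illustrative assumptions rather than LAZO's exact criterion, and f is any user-supplied scalar objective.

```python
import numpy as np

def lazy_zo_grad(f, x, mu=1e-3, prev=None, reuse_tol=0.1, rng=None):
    """Form a zeroth-order gradient estimate, reusing the cached query from a
    previous iterate when it is judged still useful, which saves one of the
    two function evaluations of the step."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    f_plus = f(x + mu * u)                           # always one fresh query
    if prev is not None and np.linalg.norm(x - prev["x"]) < reuse_tol:
        g = (f_plus - prev["f"]) / mu * u            # reuse the cached query
    else:
        g = (f_plus - f(x - mu * u)) / (2.0 * mu) * u  # standard two-point
    return g, {"x": x, "f": f_plus}                  # cache for possible reuse
```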
Article
Federated learning (FL) is an emerging distributed machine learning paradigm that aims to realize model training without gathering the data from data sources to a central processing unit. A traditional FL framework consists of a central server as well as a number of computing devices (aka workers). Training a model under the FL framework usually consumes a massive amount of communication resources because the server and devices need to communicate with each other frequently. To alleviate the communication burden, we, in this paper, propose to adaptively sparsify the gradient vector which is transmitted to the server by each device, thus significantly reducing the amount of information that needs to be sent to the central server. The proposed algorithm is built on the sparsified SAGA, a well-known variance-reduced stochastic algorithm. For the proposed algorithm, after the gradient vector is sparsified using conventional sparsification operators, an adaptive sparsification step is further added to identify the most informative elements in the sparsified gradient vector. Convergence analysis indicates that the proposed algorithm enjoys a linear convergence rate. Numerical results show that the adaptive sparsification mechanism can substantially improve the communication efficiency. Specifically, to achieve the same classification accuracy, the proposed method can reduce the communication overhead by at least 60% as compared with existing state-of-the-art sparsification-based methods.
Article
Full-text available
This paper introduces CompSkipDSGD, a new algorithm for distributed stochastic gradient descent that aims to improve communication efficiency by compressing and selectively skipping communication. In addition to compression, CompSkipDSGD allows both workers and the server to skip communication in any iteration of the training process and reserve it for future iterations without significantly decreasing testing accuracy. Our experimental results on the large-scale ImageNet dataset demonstrate that CompSkipDSGD can save hundreds of gigabytes of communication while maintaining similar levels of accuracy compared to state-of-the-art algorithms. The experimental results are supported by a theoretical analysis that demonstrates the convergence of CompSkipDSGD under established assumptions. Overall, CompSkipDSGD could be useful for reducing communication costs in distributed deep learning and enabling the use of large-scale datasets and models in complex environments.
Article
To distill more information from training data, more parameters are introduced into machine learning models. As a result, communication becomes the bottleneck of Distributed Machine Learning (DML) systems. To alleviate the communication resource contention among DML jobs in machine learning clusters, which prolongs the time to train machine learning models, JointPS is proposed in this paper. JointPS first minimizes the completion time of a single training epoch for each DML job via jointly optimizing the parameter server placement and flow scheduling, and predicts the number of remaining training epochs for each DML job by leveraging a dynamic model fitting method. Then, JointPS can estimate the remaining time to complete each DML job. According to such estimation, JointPS schedules DML jobs following the Minimum Remaining Time First (MRTF) principle to minimize the average job completion time. To the best of our knowledge, JointPS should be the first work that minimizes the average completion time of network-intensive DML training jobs by jointly optimizing the parameter server placement and flow scheduling without modifying the DML models and training procedures. Through both testbed experiments and extensive simulations, we demonstrate that JointPS can reduce the average completion time of DML jobs by up to 88% compared with state-of-the-art technology.