Figure 1
Simple network topology with two aggregation switches and four racks illustrating the tradeoff between bandwidth usage (pack servers together, (a)) and fault-tolerance (spread servers across racks, (b)). Grayed boxes indicate parts of the cluster network each allocation is using. Assuming only racks as fault domains, (a) has worst-case survival of 0.5 (four of the eight servers survive a failure of a rack), while (b) has worst-case survival of 0.75 (six of the eight servers survive).

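For concreteness, the worst-case survival numbers in the caption can be reproduced with a short sketch; the rack names and allocations below are invented to mirror layouts (a) and (b):

```python
from collections import Counter

def worst_case_survival(allocation):
    """Fraction of a service's servers that survive the failure of the
    single worst fault domain (e.g., a rack)."""
    per_domain = Counter(allocation)          # servers per fault domain
    total = sum(per_domain.values())
    worst_loss = max(per_domain.values())     # losing the most-loaded domain
    return (total - worst_loss) / total

# (a) eight servers packed into two racks, four per rack
packed = ["rack1"] * 4 + ["rack2"] * 4
# (b) eight servers spread across four racks, two per rack
spread = ["rack1", "rack1", "rack2", "rack2",
          "rack3", "rack3", "rack4", "rack4"]

print(worst_case_survival(packed))   # 0.5
print(worst_case_survival(spread))   # 0.75
```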

Source publication
Article
Full-text available
Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff...

Contexts in source publication

Context 1
... performs the best. When ignoring the number of server moves, CUT+FT+BW achieves the best performance (see Figure 10). This algorithm achieves 30%−60% reduction in bandwidth usage in the core of the network, while at the same time improving FT by 40%−120%. ...
Context 2
... of comparing the performance of the algorithm on single configurations of problem parameters, we compare the entire achievable tradeoff boundaries for these algorithms. In other words, we run the algorithm with different values of the parameters and plot the BW and FT achieved (see Figure 10). The solid line in the figure clearly represents the best algorithm, since its performance curve "dominates" the respective curves of the other two algorithms. ...
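A minimal sketch of extracting such an achievable tradeoff boundary from a set of (BW, FT) results is shown below; the data points are invented, and lower BW / higher FT are assumed to be the preferred directions:

```python
def tradeoff_boundary(points):
    """Keep only the non-dominated (BW, FT) points: lower core bandwidth (BW)
    and higher fault tolerance (FT) are both better."""
    boundary = []
    for bw, ft in sorted(points, key=lambda p: (p[0], -p[1])):
        # every earlier point has BW <= bw, so this point survives only if
        # its FT beats everything kept so far
        if not boundary or ft > boundary[-1][1]:
            boundary.append((bw, ft))
    return boundary

# hypothetical (bandwidth, fault tolerance) results from runs with different parameters
runs = [(0.40, 0.55), (0.45, 0.80), (0.60, 0.70), (0.70, 0.90), (0.55, 0.85)]
print(tradeoff_boundary(runs))   # [(0.4, 0.55), (0.45, 0.8), (0.55, 0.85), (0.7, 0.9)]
```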
Context 3
... these algorithms first use the minimum k-way cut to reduce the bandwidth at the core, followed by performing gradient descent that improves fault tolerance, gradient descent that improves fault tolerance and bandwidth, and randomizing the low-talking services, respectively. We show results for one of the datacenters in Figures 10 and 11(left), and results for the remaining three datacenters in Figure 13a. We note that we have examined ways to incorporate FT optimization directly within the cut procedure. ...
Context 4
... these algorithms first use the minimum k-way cut to reduce the bandwidth at the core, followed by performing gradient descent that improves fault tolerance, gradient descent that improves fault tolerance and bandwidth, and randomizing the low-talking services, respectively. We show results for one of the datacenters in Figures 10 and 11(left), and results for the remaining three datacenters in Figure 13a. We note that we have examined ways to incorporate FT optimization directly within the cut procedure. ...
Context 5
... performs the minimum k-way cut (reaching the lower-left point in Figure 11(left)), followed by the steepest descent algorithm that only considers improvement in fault tolerance. We executed the algorithm many times, each time allowing it to swap an increasing number of servers. ...
Context 6
... executed the algorithm many times, each time allowing it to swap an increasing number of servers. The resulting BW and FT metrics of the obtained server allocations are shown in Figure 10 (in particular, the CUT+FT curve). The diagonal line in this figure represents the achievable tradeoff boundary for this algorithm; by changing the total number of performed swaps, we can control the tradeoff between BW and FT. ...
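A rough sketch of the CUT+FT procedure described in the contexts above follows; this is not the authors' implementation, and `initial_cut`, `fault_tolerance`, and the allocation dictionary are placeholder names assumed for illustration:

```python
import itertools

def cut_then_ft(servers, swap_budget, initial_cut, fault_tolerance):
    """Rough sketch of CUT+FT: start from a bandwidth-minimizing k-way cut,
    then greedily apply the best fault-tolerance-improving server swap,
    up to a swap budget. `initial_cut` and `fault_tolerance` are placeholders."""
    allocation = initial_cut(servers)          # e.g., result of a min k-way cut
    for _ in range(swap_budget):
        best_gain, best_swap = 0.0, None
        for a, b in itertools.combinations(servers, 2):
            if allocation[a] == allocation[b]:
                continue                       # swapping within a rack is a no-op
            candidate = dict(allocation)
            candidate[a], candidate[b] = candidate[b], candidate[a]
            gain = fault_tolerance(candidate) - fault_tolerance(allocation)
            if gain > best_gain:
                best_gain, best_swap = gain, candidate
        if best_swap is None:                  # no strictly improving swap left
            break
        allocation = best_swap
    return allocation
```

Re-running this with increasing values of `swap_budget` and recording the resulting BW and FT would yield the kind of tradeoff boundary discussed above.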
Context 7
... results of this algorithm depend on the value of α; higher values of α put more weight on improvement of bandwidth at the cost of not improving fault tolerance as much. Figure 11 shows the progress of this algorithm for three different values of α. By running the algorithm until convergence with several different values of α, we obtain the "benchmark boundary" to which other algorithms can be compared (see the solid line in Figure 10). ...
Context 8
... 11 shows the progress of this algorithm for three different values of α. By running the algorithm until convergence with several different values of α, we obtain the "benchmark boundary" to which other algorithms can be compared (see the solid line in Figure 10). Because this algorithm is not optimizing over a convex function, it is not guaranteed to reach the global optimum. ...
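The exact form of the α-weighted objective is not given in this excerpt; a common weighted combination, assumed here purely for illustration, would be:

```python
def combined_cost(allocation, alpha, core_bandwidth, fault_tolerance):
    """Assumed form of the weighted objective driving the descent steps:
    alpha close to 1 emphasizes core bandwidth, alpha close to 0 emphasizes
    fault tolerance. Both metric callables are placeholders and should
    return values normalized to a comparable scale."""
    bw = core_bandwidth(allocation)    # lower is better
    ft = fault_tolerance(allocation)   # higher is better
    return alpha * bw - (1.0 - alpha) * ft
```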
Context 9
... first performs the minimum k-way cut, followed by randomizing the allocation of the least-communicating services responsible for a total of y% of the total traffic in the cluster. The achievable tradeoff boundary for this algorithm (see footnote 8) is in Figure 10. This algorithm achieves performance close to the CUT+FT+BW algorithm, but it does not optimize the bandwidth of the low-talking services nor the fault tolerance of the high-talking ones, which explains the gap between these two algorithms. ...
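A sketch of the randomization step, with an invented per-service traffic map and placement dictionary, might look like this:

```python
import random

def randomize_low_talkers(placement, service_traffic, racks, y_percent):
    """Re-place the least-communicating services that together account for
    y_percent of total traffic; `placement` maps service -> rack."""
    total = sum(service_traffic.values())
    budget = total * y_percent / 100.0
    accumulated = 0.0
    # walk services from the lowest-traffic upwards
    for service in sorted(service_traffic, key=service_traffic.get):
        if accumulated + service_traffic[service] > budget:
            break
        accumulated += service_traffic[service]
        placement[service] = random.choice(racks)   # random re-placement
    return placement
```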
Context 10
... algorithm starts from the current server allocation and performs steepest descent moves on the cost function that considers the fault tolerance and bandwidth. The progress of this algorithm for different values of α is shown in Figure 11(right); as in CUT+FT+BW, using larger α skews the optimization towards optimizing bandwidth. In this figure, each marker corresponds to moving approximately an additional 2% of servers. ...
Context 11
... that improvement is significant at the beginning and then slows down. Figure 12 shows the achievable tradeoff boundaries of FT+BW for different fractions of the cluster that are required to move. For example, notice that we obtain significant improvements by moving... [Footnote 8: Using y = 0, 25, 50, 60, 70, 80, 85, 90, 95, 98, 99, 99.9, 100.] ...
Context 12
... just 5% of the cluster. Moving 29% of the cluster achieves results similar to moving most of the machines using the CUT+FT+BW algorithm (see the outer double line in Figure 12). Results for three additional datacenters are presented in Figure 13(b). ...
Context 13
... 29% of the cluster achieves results similar to moving most of the machines using the CUT+FT+BW algorithm (see the outer double line in Figure 12). Results for three additional datacenters are presented in Figure 13(b). ...
Context 14
... notice that when running FT+BW until convergence (see Figure 13a), it achieves results close to CUT+FT+BW even without the global optimization of graph cut. This is significant, because it means we can use FT+BW incrementally (e.g., move 2% of the servers every day) and still reach similar performance as CUT+FT+BW that reshuffles the whole datacenter at once. ...
Context 15
... fault tolerance was reduced, stayed the same, and was improved for 7%, 35%, and 58% of services, respectively. Finally, Figure 14(left) shows the changes of bandwidth and fault tolerance for all services with reduced fault tolerance. Again, a few services contributed significantly to the 47% drop in bandwidth, but paid for it by being spread across fewer fault domains. ...
Context 16
... α = 0.1, FT+BW achieved a 26% reduction in bandwidth usage, but improved the fault tolerance by 140%. In this case, fault tolerance was reduced only for 2.7% of the services (see the right plot in Figure 14) and the magnitude of the reduction was much smaller than for α = 1.0. This demonstrates how the value of α controls the tradeoff between fault tolerance and bandwidth usage. ...
Context 17
... say that a service is affected by a potential hardware failure if its worst-case survival is less than a certain threshold H. We use H = 30%, which is used in the alert sys... [Figure 14 caption: The relative change in core bandwidth (x-axis) and fault tolerance (y-axis) for all services (circles) that actually reduced their fault tolerance, for α=1 (left) and α=0.1 (right).] ...
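The affected-service check described here reduces to a few lines; the sketch below assumes a mapping from each service to the fault domains of its servers (an invented data structure):

```python
from collections import Counter

def affected_services(service_allocations, threshold=0.30):
    """Services whose worst-case survival falls below the threshold H
    (H = 30% in the excerpt). `service_allocations` maps each service name
    to the list of fault domains its servers occupy."""
    affected = []
    for name, domains in service_allocations.items():
        per_domain = Counter(domains)
        survival = 1.0 - max(per_domain.values()) / len(domains)
        if survival < threshold:
            affected.append(name)
    return affected
```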

Similar publications

Article
Full-text available
A fundamental goal of data center networking is to efficiently interconnect a large number of servers at low equipment cost. Several server-centric network structures for data centers have been proposed. They, however, are not truly expandable and suffer from a low degree of regularity and symmetry. Inspired by the commodity servers in today's dat...
Article
Full-text available
The data center network connecting the servers in a data center plays a crucial role in orchestrating the infrastructure to deliver peak performance to users. In order to meet high performance and reliability requirements, the data center network is usually constructed of a massive number of network devices and links to achieve 1:1 oversubscription...

Citations

... For the smooth operation of various online services and applications, the DCN should provide high bandwidth, low latency, and high throughput for hosted applications. It has some special characteristics compared with other IP networks, such as low transmission delay, auto-scaling, a many-to-one communication traffic pattern with high bandwidth, multi-rooted tree topology, and shallow buffered switches [2,3,8,9]. Due to these special characteristics of DCN, traditional TCP congestion control algorithms perform poorly [2,3]. Due to the distributed nature of applications hosted in data centers, their performance is significantly affected by the communication network used in the data centers. ...
... The increase in latency may be a loss for an online sales application, as users switch to some other application. The traffic inside the DCN should have the following objectives: the efficiency of the data center depends on the performance of the DCN [8]. So, knowing the traffic characteristics inside the DCN is very important. ...
Article
Full-text available
In recent years, Data Center Networks (DCN) have become a very popular platform for hosting various online services and applications, such as e-commerce, social networking, large-scale computing, and web searching, due to their cost-effective and efficient service provisioning. DCN, online services, and applications typically require minimal latency in any information exchange. Moreover, compared with Internet traffic, the nature of traffic in DCN is bursty, delay-sensitive, and throughput-sensitive. For this reason, state-of-the-art TCP congestion control algorithms perform poorly and suffer from problems such as TCP Incast, TCP Outcast, Pseudo-Congestion Effect, Buffer pressure, and Queue build-up. To improve the performance of DCN, various congestion control algorithms have been proposed in recent years. This paper summarizes the reasons why state-of-the-art TCP congestion control algorithms perform poorly and presents an overview of the recently proposed congestion control algorithms for DCN, followed by a comparative summary of their performance.
... Unfortunately, this tuning process is non-trivial as performance characteristics are not only highly dependent on the workloads, but also on the underlying data center architecture. To make matters worse, more and more big data systems opt for disaggregated in-memory and virtual disk storage that crosses a network where interactions are complex [4,14,33], the topology is constantly changing due to failures [6,9,17,22], and next-generation designs are increasingly sophisticated [3,37,44,45]. Prior work has resulted in complex solutions that either fail to provide portable performance [20,23,29-31] or are difficult to reason about [5]. ...
Preprint
Full-text available
Cloud data centers are rapidly evolving. At the same time, large-scale data analytics applications require non-trivial performance tuning that is often specific to the applications, workloads, and data center infrastructure. We propose TeShu, which makes network shuffling an extensible unified service layer common to all data analytics. Since an optimal shuffle depends on a myriad of factors, TeShu introduces parameterized shuffle templates, instantiated by accurate and efficient sampling that enables TeShu to dynamically adapt to different application workloads and data center layouts. Our experimental results with real-world graph workloads show that TeShu efficiently enables shuffling optimizations that improve performance and adapt to a variety of scenarios.
... Another group of work focuses on explicit network scheduling and job placement, where the primary objective is to localize most of the traffic flow between tasks and balance the network utilization across the cluster [11,19,2,5,46,36,8]. Such frameworks have more fine-grained information about the network I/O, but lack tight integration between network flows and application-level requirements. ...
... Equation (2) implies that the length of a pipelineable-only path is dominated by the pipelineable task with the longest execution time, as shown in Fig. 5. Moreover, we can observe that the maximum throughput of the flow can also be restricted by the CPU processing speed when pipelining is used. ...
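Equation (2) itself is not reproduced in this excerpt; the toy calculation below, with invented task times, merely illustrates the stated property that a pipelineable-only path is dominated by its longest task:

```python
# Hypothetical per-task execution times (seconds) along one path.
task_times = [0.8, 2.5, 1.1]

sequential_length = sum(task_times)   # no pipelining: tasks run back to back
pipelined_length = max(task_times)    # steady-state pipelined path is bounded
                                      # by its slowest (bottleneck) task

print(sequential_length, pipelined_length)   # 4.4 2.5
```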
Preprint
Distributed applications, such as database queries and distributed training, consist of both compute and network tasks. DAG-based abstraction primarily targets compute tasks and has no explicit network-level scheduling. In contrast, Coflow abstraction collectively schedules network flows among compute tasks but lacks the end-to-end view of the application DAG. Because of the dependencies and interactions between these two types of tasks, it is sub-optimal to only consider one of them. We argue that co-scheduling of both compute and network tasks can help applications towards the globally optimal end-to-end performance. However, none of the existing abstractions can provide fine-grained information for co-scheduling. We propose MXDAG, an abstraction to treat both compute and network tasks explicitly. It can capture the dependencies and interactions of both compute and network tasks leading to improved application performance.
... The best way to place VMs on PMs is not merely packing the maximum number of VMs into the minimum number of PMs, because in this case, in addition to the energy consumption resulting in high costs and CO2 emissions to the environment, important criteria must be considered in VMC approaches, such as migration overhead [30-32], performance [7,33-35], SLAv [23,36], cooling [5,37,38], thermal and temperature [21,30,39], ON-OFF cycles [22,40], VMs affinity [7,28,29,41], reliability [19,20], the hardware cost and its longevity [22,23], load balancing [24,25], NBW [27,42-44], and resources utilization [8,45,46]. In other words, to implement the VMC algorithms, reducing power consumption along with the mentioned criteria must be considered for holistic efficiency in the cloud. ...
... We choose the most power-efficient PM for VM placement, with the condition that it does not become overloaded after the migration. In many previous articles, exact [20-23], greedy [28,42,43], and evolutionary [44,51,66] methods have been used for this phase of VMC. ...
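A minimal sketch of that greedy selection, assuming invented PM records with a power-efficiency score and CPU capacity (an illustration, not the cited algorithm):

```python
def place_vm(vm_cpu, pms):
    """Pick the most power-efficient PM that stays below capacity after
    hosting the VM; each PM is a dict with invented fields."""
    candidates = [pm for pm in pms if pm["used_cpu"] + vm_cpu <= pm["cpu_capacity"]]
    if not candidates:
        return None                           # no feasible host for this VM
    best = max(candidates, key=lambda pm: pm["power_efficiency"])
    best["used_cpu"] += vm_cpu
    return best["name"]

pms = [
    {"name": "pm1", "cpu_capacity": 32, "used_cpu": 28, "power_efficiency": 0.9},
    {"name": "pm2", "cpu_capacity": 32, "used_cpu": 10, "power_efficiency": 0.7},
]
print(place_vm(8, pms))   # pm1 would overload, so pm2 is chosen
```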
Article
Full-text available
Cloud Computing Systems (CCSs) provide computing capability through the Internet. They enable organizations or individuals to have computing power without deploying and maintaining their own Information Technology infrastructure. As a cloud is realized on a vast scale, it consumes an enormous amount of energy. The migration pattern in which several Virtual Machines (VMs) are placed on a minimum number of active Physical Machines is called VM Consolidation (VMC). Thus, this technique can be a practical approach for balancing electricity consumption and other QoS requirements in CCSs. In particular, VMC must meet service quality requirements while minimizing both energy consumption and Service Level Agreement violations in CCSs. This paper presents a systematic survey of VMC in CCSs with particular attention to the VMC phases, metrics, objectives, migration patterns, optimization methods, and evaluation approaches. Our review is based on the past literature, with a focus on the types of hardware metrics, software metrics, objectives, algorithms, and architectures of VMC in CCSs.
... (2) different types of jobs, CPU- or IO-intensive, whose tasks have different input data sizes, which can significantly affect the performance of the Hadoop scheduler and limit the overall throughput of the system. Therefore, in a heterogeneous system with multiple tasks belonging to various jobs, designing an efficient scheduling algorithm is a vital challenge [9-12]. ...
Article
Full-text available
In the context of MapReduce task scheduling, many algorithms mainly focus on the scheduling of Reduce tasks with the assumption that scheduling of Map tasks is already done. However, in cloud deployments of MapReduce, the input data is located on remote storage, which indicates the importance of the scheduling of Map tasks as well. In this paper, we propose a two-stage Map and Reduce task scheduler for heterogeneous environments, called TMaR. TMaR schedules Map and Reduce tasks on the servers that minimize the task finish time in each stage, respectively. We employ a dynamic partition binder for Reduce tasks in the Reduce stage to lighten the shuffling traffic. Indeed, TMaR minimizes the makespan of a batch of tasks in heterogeneous environments while considering the network traffic. The simulation results demonstrate that TMaR outperforms Hadoop-stock and Hadoop-A in terms of makespan and network traffic, and improves performance by an average of 29%, 36%, and 14% on the Wordcount, Sort, and Grep benchmarks, respectively. Besides, the power reduction of TMaR is up to 12%.
... Bodík et al. [16] presented an allocation plan that improves service survivability while reducing the bandwidth bottleneck in the core of the data center network. Their proposal improves fault tolerance by spreading out VMs across multiple fault domains while minimizing overall bandwidth usage. ...
Article
Full-text available
The virtualization of the data center network is one of the technologies that enable performance guarantees, greater flexibility, and improved utilization of infrastructure resources in cloud computing. One of the key issues in the management of a virtual data center (VDC) is VDC embedding, which deals with the efficient mapping of required virtual network resources onto the shared resources of the infrastructure provider (InP). In this paper, we propose a new VDC embedding algorithm that differs from previous works in several aspects. First, the provision of robustness for the data center infrastructure is one of the critical requirements of cloud technology; however, this challenge has not been considered in the related literature. In order to analyze and evaluate the robustness of the infrastructure network, classical and spectral graph robustness metrics are employed. Second, in order to avoid imbalanced mapping and increase the efficiency of infrastructure resources, four node attributes, besides the dynamic resource capacity, are exploited to compute the node mapping potential. The TOPSIS technique is used for node ranking to increase compatibility with the ideal solution. Third, unlike previous works in which the node and link mapping phases are performed separately, in the proposed algorithm the virtual network is mapped to the physical network in a single step. Fourth, we also consider resources for network nodes (switches or routers). For these purposes, a multi-objective mathematical optimization problem is formulated with the two goals of maximizing infrastructure network robustness and minimizing the long-term average cost-to-revenue ratio of mapping for InPs. Finally, a new single-stage online VDCE algorithm based on NSGAII (a non-dominated sorting-based genetic algorithm) is presented, where node mapping is TOP-MANR based and edge mapping is based on the shortest path. The fat-tree topology is considered for the substrate and virtual networks, and these two networks are modeled as weighted undirected graphs.
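TOPSIS is a standard multi-criteria ranking technique; the generic sketch below, with invented node attributes and weights rather than the paper's actual criteria, shows how candidate nodes could be ranked by closeness to the ideal solution:

```python
import numpy as np

def topsis_rank(scores, weights, benefit):
    """Generic TOPSIS ranking: `scores` is an (alternatives x criteria) matrix,
    `weights` sums to 1, and `benefit[j]` is True when a higher value of
    criterion j is better."""
    norm = scores / np.linalg.norm(scores, axis=0)        # vector-normalize columns
    weighted = norm * weights
    ideal = np.where(benefit, weighted.max(axis=0), weighted.min(axis=0))
    anti = np.where(benefit, weighted.min(axis=0), weighted.max(axis=0))
    d_ideal = np.linalg.norm(weighted - ideal, axis=1)
    d_anti = np.linalg.norm(weighted - anti, axis=1)
    closeness = d_anti / (d_ideal + d_anti)                # 1.0 = ideal solution
    return np.argsort(-closeness)                          # best candidate first

# Invented example: 3 candidate nodes, criteria = [free CPU, degree, load]
scores = np.array([[16.0, 4.0, 0.3],
                   [ 8.0, 6.0, 0.1],
                   [32.0, 2.0, 0.7]])
weights = np.array([0.5, 0.3, 0.2])
benefit = np.array([True, True, False])   # lower load is better
print(topsis_rank(scores, weights, benefit))
```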
... On one hand, to meet the growing demand for in-memory data processing and increase the competitive edge in service provisioning, cloud providers have started to plan and upgrade their Infrastructure as a Service (IaaS) to include large-memory virtual machines with terabytes (TBs) of memory in their high-end offerings [1,2]. On the other hand, memory utilization imbalance and temporal memory usage variations are frequently observed and reported in virtualized clouds [3-11] and production datacenters [12-16]. The memory upgrade trend further exacerbates the memory utilization imbalance problems. ...
... First, accurate memory allocation is hard, as peak memory variations happen under different application types, workload inputs, data characteristics, and traffic patterns. Second, applications often over-estimate their requirements or attempt to allocate for peak memory usage [3,7,8,12], resulting in severely imbalanced memory usage across virtual servers (e.g., VMs, containers, or executors), and underutilization on the local host node and remote nodes across the cluster [11,18]. It has been suggested in [20] that main memory will be viewed as secondary storage and secondary caches to processors. ...
... Recent work has looked at making better use of the bandwidth by the design of advanced transport protocols [13,38], flow scheduling algorithms [12,20], and job placement strategies [17,27]. Advanced transport protocols aim to mitigate congestion, scheduling algorithms can lead to higher throughput, and careful job placement and execution strategies reduce inter-rack traffic. ...
Conference Paper
Data center networks are designed to interconnect large clusters of servers. However, their static, rack-based architecture poses many constraints. For instance, due to over-subscription, bandwidth tends to be highly unbalanced---while servers in the same rack enjoy full bisection bandwidth through a top-of-rack (ToR) switch, servers across racks have much more constrained bandwidth. This translates to a series of performance issues for modern cloud applications. In this paper, we propose a rackless data center (RDC) architecture that removes this fixed "rack boundary". We achieve this by inserting circuit switches at the edge layer, and dynamically reconfiguring the circuits to allow servers from different racks to form "locality groups". RDC optimizes the topology between servers and edge switches based on the changing workloads, and achieves lower flow completion times and improved load balance for realistic workloads.
... All tree-like network topologies such as VL2 [30] and FatTree [36] can be modeled as hierarchy trees just like the one shown in Figure 5 [37]. On such a network, only data transmitted across aggregation domains consumes core bandwidth. ...
... For example, both GFS [1] and HDFS [2] put two of the three replicas into the same rack to reduce the core bandwidth consumed by data writes. The authors of [37] also proposed to place services that communicate a lot with each other in the same domain. As far as we know, LAR is the first data reconstruction mechanism that has focused on reducing core bandwidth usage. ...
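The rule that only cross-domain traffic consumes core bandwidth can be expressed in a few lines; the domain map and traffic matrix below are invented:

```python
def core_bandwidth(traffic, domain_of):
    """Sum only the traffic whose endpoints sit in different aggregation
    domains; intra-domain traffic never crosses the network core."""
    return sum(volume for (src, dst), volume in traffic.items()
               if domain_of[src] != domain_of[dst])

# Invented example: two racks (aggregation domains) and three flows.
domain_of = {"s1": "rack1", "s2": "rack1", "s3": "rack2"}
traffic = {("s1", "s2"): 10.0,   # stays inside rack1
           ("s1", "s3"): 4.0,    # crosses the core
           ("s2", "s3"): 1.0}    # crosses the core
print(core_bandwidth(traffic, domain_of))   # 5.0
```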
Article
Full-text available
Many modern distributed storage systems adopt erasure coding to protect data from frequent server failures for cost reasons. Reconstructing data in failed servers efficiently is vital to these erasure-coded storage systems. To this end, tree-structured reconstruction mechanisms, where blocks are transmitted and combined through a reconstruction tree, have been proposed. However, existing tree-structured reconstruction mechanisms build reconstruction trees from the perspective of available network bandwidths between servers, which are fluctuating and difficult to measure. Besides, these reconstruction mechanisms cannot reduce data transmission. In this study, we overcome these limitations by proposing LAR, a locality-aware tree-structured reconstruction mechanism. LAR builds reconstruction trees from the perspective of data locality, which is stable and easy to obtain. More importantly, by building reconstruction trees that combine blocks closer to each other first, LAR can reduce the data transmitted through the network core and hence speed up reconstruction. We prove that a minimum spanning tree is an optimal reconstruction tree that minimizes core bandwidth usage. We also design and implement a general reconstruction framework that supports all tree-structured reconstruction mechanisms and nearly all erasure codes. Large-scale simulations on commonly deployed network topologies show that LAR consumes 20%-61% less core bandwidth than previous reconstruction mechanisms. Thorough experiments on a testbed consisting of 40 physical servers show that LAR improves proactive recovery throughput by at least 23% and improves degraded read rate by up to 68%.
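The claim that a minimum spanning tree minimizes core bandwidth usage suggests a simple construction; below is a generic Prim-style sketch over a placeholder locality distance between the servers holding the needed blocks (an illustration of the idea, not LAR's actual implementation):

```python
import heapq

def reconstruction_tree(servers, distance):
    """Prim's algorithm: grow a minimum spanning tree over the servers that
    hold the blocks needed for reconstruction; `distance(a, b)` is a
    placeholder locality metric (e.g., 0 same rack, 1 same pod, 2 across pods)."""
    root, *rest = servers
    in_tree = {root}
    edges = [(distance(root, s), root, s) for s in rest]
    heapq.heapify(edges)
    tree = []
    while edges and len(in_tree) < len(servers):
        d, u, v = heapq.heappop(edges)
        if v in in_tree:
            continue
        in_tree.add(v)
        tree.append((u, v, d))
        for w in servers:
            if w not in in_tree:
                heapq.heappush(edges, (distance(v, w), v, w))
    return tree
```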
... For instance, a measurement study on DCN traffic characteristics shows that more than 98% of the links are utilized less than 1% [4]. Therefore, recent work has considered ways of making better use of the DCN bandwidth, including using advanced transport protocols [3,13], optimized flow scheduling [2,7], and intelligent job placement [5,10]. They attack the problem from different angles-advanced transport protocols can mitigate congestion more effectively, better scheduling algorithms can lead to higher throughput, and careful job placement and execution strategies could reduce inter-rack traffic. ...