Fig 10 - uploaded by Marc Frincu
Load balancing example.

Source publication
Conference Paper
Full-text available
The MapReduce programming model, due to its simplicity and scalability, has become an essential tool for processing large data volumes in distributed environments. Recent Stream Processing Systems (SPS) extend this model to provide low-latency analysis of high-velocity continuous data streams. However, integrating MapReduce with streaming poses cha...

Similar publications

Article
Full-text available
Although Video-On-Demand (VOD) has been in existence for years, its cross-platform applicability in cloud service environments is still in increasing need. In this paper, an Adaptive Video-On-Demand (AVOD) framework that is suitable for private cloud environments is proposed. Private cloud has the key advantage of satisfying the real need of both c...
Article
Full-text available
Cloud manufacturing (CMfg) is a new service-oriented manufacturing paradigm in which shared resources are integrated and encapsulated as manufacturing services. When a single service is not able to meet some manufacturing requirement, a composition of multiple services is then required via CMfg. Service composition and optimal selection (SCOS) is a...
Conference Paper
Full-text available
IaaS Cloud systems enable the Cloud provider to overbook his data centre by selling more virtual resources than physical resources available. This approach works if on average the resource utilisation of a virtual machine is lower than the virtual machine boundaries. If this assumption is violated only locally, Cloud users will experience performan...

Citations

... While stream processing systems were originally envisaged to use only stateless operators, the use of stateful operators has grown to accommodate a greater range of complex stream processing such as large graph processing [35,6], machine learning [36,23] and general parallel processing [16]; where undertaking these computations "in stream" is known to be a more effective approach than offloading to a third party system [15]. Checkpointing is argued as a more efficient alternative to state replication [8,21,19,33], which becomes more evident as the state size grows. ...
Preprint
State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and is evidently more efficient than replication as state size grows. However current systems use a nominal value for the checkpoint interval, indicative of assuming roughly 1 failure every 19 days, which does not take into account the salient aspects of the checkpoint process, nor the system scale, and can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization -- the fraction of total time available for the system to do useful work -- that incorporates checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology and message delay. Our model yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters, interestingly showing it to be dependent only on checkpoint cost and failure rate. We confirm the accuracy and efficacy of our model through experiments with Apache Flink, where we obtain improvements in system utilization for every case, especially as the system size increases. Our model provides a solid theoretical basis for the analysis and optimization of more elaborate checkpointing approaches.
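The trade-off this abstract describes can be illustrated with a first-order sketch. The model below is the classic Young-style approximation, not the paper's exact derivation (which additionally accounts for failure detection, topology depth, and message delay); `checkpoint_cost`, `failure_rate`, and `restart_cost` are illustrative parameters.

```python
import math

def optimal_checkpoint_interval(checkpoint_cost: float, failure_rate: float) -> float:
    """First-order optimal interval (Young's approximation):
    tau* = sqrt(2 * C / lambda). Note it depends only on checkpoint
    cost and failure rate, matching the paper's observation."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)

def utilization(interval: float, checkpoint_cost: float,
                failure_rate: float, restart_cost: float) -> float:
    """Approximate fraction of time spent on useful work: per interval
    we pay the checkpoint cost, and each failure (rate lambda) costs
    on average half an interval of lost work plus the restart cost."""
    lost_fraction = (checkpoint_cost / interval
                     + failure_rate * (interval / 2.0 + restart_cost))
    return max(0.0, 1.0 - lost_fraction)
```

For example, with a 5 s checkpoint cost and one failure every ~2.8 hours (1e-4 failures/s), the sketch suggests checkpointing roughly every 316 s, and utilization drops if the interval is halved or doubled.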
... Stream MapReduce by Brito et al. [19] introduces "windowed reducers" to output a stream of results according to a window policy. Kumbhare et al. [81] extend Stream MapReduce by methods for adaptive load-balancing, runtime elasticity and fault tolerance. Beyond approaches to adapt a streaming model in MapReduce, Apache Flink [20], Apache Spark [154] and AJIRA [137] support batch and stream processing. ...
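The "windowed reducer" idea quoted above can be sketched in a few lines: accumulate (key, value) pairs and emit one reduced record per key each time a window closes. This is a minimal tumbling-window illustration, not the actual Stream MapReduce API; all names are illustrative.

```python
from collections import defaultdict

class TumblingWindowReducer:
    """Minimal sketch of a windowed reducer: buffer values per key and
    emit (window_end, key, reduced_value) records whenever an event's
    timestamp crosses the current window boundary."""

    def __init__(self, size, reduce_fn):
        self.size = size            # window length in time units
        self.reduce_fn = reduce_fn  # e.g. sum, for a windowed count
        self.window_end = size
        self.state = defaultdict(list)

    def on_event(self, timestamp, key, value):
        results = []
        # Close every window that ended before this event arrived.
        while timestamp >= self.window_end:
            results.extend(self._flush())
            self.window_end += self.size
        self.state[key].append(value)
        return results

    def _flush(self):
        out = [(self.window_end, k, self.reduce_fn(vs))
               for k, vs in sorted(self.state.items())]
        self.state.clear()
        return out
```

Feeding timestamped counts through `on_event` yields one reduced stream element per key per closed window, which is exactly the "stream of results according to a window policy" behavior described.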
Article
Stream Processing (SP) has evolved as the leading paradigm to process and gain value from the high volume of streaming data produced, e.g., in the domain of the Internet of Things. An SP system is a middleware that deploys a network of operators between data sources, such as sensors, and the consuming applications. SP systems typically face intense and highly dynamic data streams. Parallelization and elasticity enable SP systems to process these streams with continuous high quality of service. The current research landscape provides a broad spectrum of methods for parallelization and elasticity in SP. Each method makes specific assumptions and focuses on particular aspects. However, the literature lacks a comprehensive overview and categorization of the state of the art in SP parallelization and elasticity, which is necessary to consolidate the state of the research and to plan future research directions on this basis. Therefore, in this survey, we study the literature and develop a classification of current methods for both parallelization and elasticity in SP systems.
... In contrast, in our approach (see next section) the splitter transmits at most one migration message per replica, which represents a negligible delay in the distribution activity. Ref. [25] targets streaming applications in the MapReduce framework. They have developed an asynchronous checkpointing technique to efficiently migrate the state partitions. ...
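The asynchronous-checkpointing idea mentioned in the snippet above can be sketched as follows: the operator thread takes a cheap snapshot of its state partition and hands it to a background thread, which serializes it for migration while processing continues. This is an illustrative sketch, not the cited paper's mechanism; the shallow copy only protects against top-level mutations.

```python
import pickle
import queue
import threading

class AsyncCheckpointer:
    """Serialize state snapshots off the processing thread's critical
    path, so checkpointing a partition for migration does not block
    stream processing."""

    def __init__(self):
        self._q = queue.Queue()
        self.serialized = []  # checkpoints ready to be migrated
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def checkpoint(self, state: dict) -> None:
        # A shallow copy is the only work on the processing path;
        # expensive serialization happens in the background thread.
        self._q.put(dict(state))

    def _run(self):
        while True:
            snap = self._q.get()
            if snap is None:  # shutdown sentinel
                break
            self.serialized.append(pickle.dumps(snap))

    def close(self):
        self._q.put(None)
        self._worker.join()
```

Because the snapshot is taken synchronously, later mutations to the live state do not leak into an already-queued checkpoint.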
Article
Full-text available
Data stream processing applications have a long-running nature (24/7) with workload conditions that may exhibit wide variations at run-time. Elasticity is the term coined to describe the capability of applications to dynamically change their resource usage in response to workload fluctuations. This paper focuses on strategies for elastic data stream processing targeting multicore systems. The key idea is to exploit Model Predictive Control, a control-theoretic method that takes into account the system behavior over a future time horizon in order to decide the best reconfiguration to execute. We design a set of energy-aware proactive strategies, optimized for throughput and latency QoS requirements, which regulate the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support offered by modern multicore CPUs. We evaluate our strategies in a high-frequency trading application fed by synthetic and real-world workload traces. We introduce specific properties to effectively compare different elastic approaches, and the results show that our strategies achieve the best outcome.
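The core MPC decision in this abstract, choosing a (cores, frequency) pair that meets predicted load at minimum energy over a horizon, can be sketched with a toy controller. The candidate enumeration, the `per_core_rate` service model, and the cubic power model are illustrative assumptions, not the paper's actual controller.

```python
from itertools import product

def mpc_reconfigure(predicted_load, core_options, freq_options,
                    per_core_rate, horizon=3):
    """Toy MPC step: among all (cores, frequency) candidates, pick the
    lowest-power one whose aggregate service rate covers the predicted
    load over the next `horizon` steps. Returns None if no candidate
    can sustain the load."""
    demand = max(predicted_load[:horizon])  # worst case over horizon
    best, best_power = None, float("inf")
    for n, f in product(core_options, freq_options):
        if n * per_core_rate(f) < demand:
            continue  # would violate the throughput QoS requirement
        power = n * f ** 3  # classic cubic DVFS power approximation
        if power < best_power:
            best, best_power = (n, f), power
    return best
```

Note how the controller prefers more cores at a lower frequency when that is cheaper under the cubic power model, which is the typical DVFS trade-off the paper's strategies exploit.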
Article
Fault-tolerance is an essential part of a stream processing system that guarantees data analysis can continue even after failures. State-of-the-art distributed stream processing systems use checkpointing to support fault-tolerance for stateful computations, where the state of the computations is periodically persisted. However, the frequency of performing checkpoints impacts the performance (utilization, latency, and throughput) of the system, as the checkpointing process consumes resources and time that could be used for actual computations. In practice, systems are often configured to perform checkpoints based on crude values, ignoring factors such as checkpoint and restart costs, leading to suboptimal performance. In our previous work, we proposed a theoretical optimal checkpoint interval that maximizes system utilization for stream processing systems to minimize the impact of checkpointing on system performance. In this article, we investigate the practical benefits of our proposed theoretical optimal by conducting experiments in a real-world cloud setting using different streaming applications; we use Apache Flink, a well-known stream processing system, for our experiments. The experiment results demonstrate that an optimal interval can achieve better utilization, confirming the practicality of the theoretical model when applied to real-world applications. We observed utilization improvements from 10% to 200% for a range of failure rates from 0.3 failures per hour to 0.075 failures per minute. Moreover, we explore how the performance measures latency and throughput are affected by the optimal interval. Our observations demonstrate that significant improvements can be achieved using the optimal interval for both latency and throughput.
Conference Paper
Full-text available
Speeding, slowing down, and sudden acceleration are the leading causes of fatal accidents on highways. Anomalous driving behavior detection can improve road safety by informing drivers who are in the vicinity of dangerous vehicles. However, detecting abnormal driving behavior at the city-scale in a centralized fashion results in considerable network and computation load, which would significantly restrict the scalability of the system. In this paper, we propose CAD3, a distributed collaborative system for road-aware and driver-aware anomalous driving detection. CAD3 considers a decentralized deployment of edge computation nodes on the roadside and combines collaborative and context-aware computation with low-latency communication to detect and inform nearby drivers of unsafe behaviors of other vehicles in real-time. Adjacent edge nodes collaborate to improve the detection of abnormal driving behavior at the city-scale. We evaluate CAD3 with a physical testbed implementation. We emulate realistic driving scenarios from a real driving data set of 3,000 vehicles, 214,000 trips, and 18 million trajectories of private cars in Shenzhen, China. At the microscopic (road) level, CAD3 significantly improves the accuracy of detection and lowers the number of potential accidents caused by false negatives by up to four times and 24 times compared to distributed standalone and centralized models, respectively. CAD3 can scale up to 256 vehicles connected to a single node while keeping the end-to-end latency under 50 ms and the required bandwidth below 5 Mbps. At the mesoscopic (driver-trip) level, CAD3 performs stable and accurate detection over time, owing to local RSU interaction. With a dense deployment of edge nodes, CAD3 can scale up to the size of Shenzhen, a megalopolis of 12 million inhabitants with over 2 million concurrent vehicles at peak hours.
Article
In peer-to-peer (P2P) networks, free-riders and redundant streams including overlapped and folded streams dramatically degrade playback quality and network performance, respectively. Although a locality-aware P2P live video can reduce the topological complexity, it cannot effectively avoid redundant streams while denying free-riders. In this paper, we first model free-rider, redundant streams and a distance-driven P2P system. Based on that model, a distance-driven alliance algorithm is proposed to construct not only an alliance that directly prevents any utility gains of free-riders through inter-user constraints but also a small-world network or a multicast tree that effectively reduces redundant streams. Finally, simulations confirm its advantages in functionality and performance over several existing strategies and distance-driven P2P live video systems.
Article
Time-evolving stream datasets exist ubiquitously in many real-world applications, where their inherent hot keys often evolve over time. Nevertheless, few existing solutions can provide efficient load balancing on these time-evolving datasets while preserving low memory overhead. In this paper, we present a novel grouping approach (named FISH), which provides efficient time-evolving stream processing at scale. The key insight of this work is that the keys of time-evolving stream data can have a skewed distribution within any bounded time interval. This enables accurately identifying the recent hot keys for real-time load balance within a bounded scope. We therefore propose an epoch-based recent hot key identification with specialized intra-epoch frequency counting (for maintaining low memory overhead) and inter-epoch hotness decaying (for suppressing superfluous computation). We also propose to heuristically infer the accurate information of remote workers through computation rather than communication, for cost-efficient worker assignment. We have integrated our approach into Apache Storm. Our results on a cluster of 128 nodes, for both synthetic and real-world stream datasets, show that FISH significantly outperforms the state of the art, reducing average and 99th-percentile latency by 87.12% and 76.34% (vs. W-Choices) and memory overhead by 99.96% (vs. Shuffle Grouping).
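The two ingredients named in this abstract, intra-epoch frequency counting and inter-epoch hotness decaying, can be sketched with a small tracker. The class name, decay factor, and pruning threshold are illustrative assumptions, not FISH's actual parameters.

```python
from collections import Counter

class EpochHotKeyTracker:
    """Count key frequencies within the current epoch; at each epoch
    boundary, fold them into an exponentially decayed hotness score so
    keys that were hot long ago gradually fade out."""

    def __init__(self, decay=0.5, top_k=2):
        self.decay = decay
        self.top_k = top_k
        self.epoch_counts = Counter()  # cheap intra-epoch counting
        self.hotness = Counter()       # decayed cross-epoch hotness

    def observe(self, key):
        self.epoch_counts[key] += 1

    def end_epoch(self):
        # Inter-epoch hotness decaying: older evidence loses weight,
        # and near-zero entries are pruned to keep memory bounded.
        for k in list(self.hotness):
            self.hotness[k] *= self.decay
            if self.hotness[k] < 1e-6:
                del self.hotness[k]
        self.hotness.update(self.epoch_counts)
        self.epoch_counts.clear()

    def hot_keys(self):
        return [k for k, _ in self.hotness.most_common(self.top_k)]
```

A key that dominated an old epoch is quickly overtaken once a new key becomes hot, which is the behavior needed to balance load on streams whose hot keys shift over time.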