Conference Paper

The potential dangers of causal consistency and an explicit solution


Abstract

Causal consistency is the strongest consistency model that is available in the presence of partitions and provides useful semantics for human-facing distributed services. Here, we expose its serious and inherent scalability limitations due to write propagation requirements and traditional dependency tracking mechanisms. As an alternative to classic potential causality, we advocate the use of explicit causality, or application-defined happens-before relations. Explicit causality, a subset of potential causality, tracks only relevant dependencies and reduces several of the potential dangers of causal consistency.


... In the context of full replication, several causally consistent shared memory systems have been designed, including Lazy Replication [22], COPS [24], GentleRain [12], Orbe [11], SwiftCloud [39], Occult [27], and CausalSpartan [35]. Recently, there has also been growing interest in partial replication due to the potential storage efficiencies that can be attained [2,4,6,7,9,16,17,19,25,27]. For full replication, it suffices to use a vector timestamp [8,14,26] of length equal to the number of replicas [22] to achieve causal consistency. ...
... Several researchers have observed that partial replication requires a larger amount of metadata to track causal dependencies [2,9,16,24]. For partial replication, in general, the timestamp (or metadata) overhead is expected to be larger than that for full replication in order to avoid false dependencies, as will be explained below. ...
... This may result in high bandwidth usage. (2) Simulating full replication introduces unnecessary dependencies (which we call false dependencies) among the update messages. For instance, if update u_x on register x depends on update u_y on register y, i.e., u_x can only be applied after u_y is applied, then any replica that receives u_x first will wait for the receipt of u_y, even if register y is virtual and not stored locally. ...
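To make the mechanism sketched in these excerpts concrete, here is a minimal Python sketch of causal delivery under full replication with a vector timestamp of length equal to the number of replicas; all class and method names are illustrative, not taken from any of the cited systems. The buffering step is precisely the wait described above, and under partial replication it can be triggered even by updates to registers that are not stored locally.

```python
class Replica:
    """Causal delivery under full replication: each of n replicas keeps a
    vector timestamp of length n; entry i counts updates applied from
    replica i. Illustrative sketch only."""

    def __init__(self, rid: int, n: int):
        self.rid = rid
        self.vt = [0] * n      # vector timestamp
        self.store = {}        # register -> value
        self.pending = []      # buffered updates with unmet dependencies

    def local_write(self, key, value):
        """Apply a local write and return the update to ship to peers."""
        self.vt[self.rid] += 1
        self.store[key] = value
        return (key, value, list(self.vt), self.rid)

    def _ready(self, deps, origin):
        # Causally ready: this is the next update from its origin, and every
        # other entry of the writer's timestamp is already covered locally.
        return deps[origin] == self.vt[origin] + 1 and all(
            deps[i] <= self.vt[i] for i in range(len(self.vt)) if i != origin)

    def remote_write(self, key, value, deps, origin):
        """Buffer the update, then apply everything that has become ready."""
        self.pending.append((key, value, deps, origin))
        progress = True
        while progress:
            progress = False
            for u in list(self.pending):
                k, v, d, o = u
                if self._ready(d, o):
                    self.pending.remove(u)
                    self.store[k] = v
                    self.vt[o] = d[o]
                    progress = True
```

For example, an update shipped by replica 0 with timestamp [2, 1, 0] stays buffered at another replica until that replica has applied replica 0's first update and replica 1's first update — a false dependency whenever those earlier writes are semantically unrelated.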
Conference Paper
The focus of this paper is on causal consistency in a partially replicated distributed shared memory (DSM) system that provides the abstraction of shared read/write registers. Maintaining causal consistency in distributed shared memory systems has received significant attention in the past, mostly for full replication, wherein each replica stores a copy of all the registers in the shared memory. To ensure causal consistency, all causally preceding updates must be performed before an update is performed at any given replica. Therefore, some mechanism for tracking causal dependencies is required, such as vector timestamps with the number of vector elements being equal to the number of replicas in the context of full replication. In this paper, we investigate causal consistency in partially replicated systems, wherein each replica may store only a subset of the shared registers. Building on the past work, this paper makes three key contributions:
• We present a necessary condition on the metadata (which we refer to as a timestamp) that must be maintained by each replica to be able to track causality accurately. The necessary condition identifies a set of directed edges in a share graph that a replica's timestamp must keep track of.
• We present an algorithm for achieving causal consistency using a timestamp that matches the above necessary condition, thus showing that the condition is necessary and sufficient.
• We define a measure of timestamp space size and present a lower bound (in bits) on the size of the timestamps. The lower bound matches our algorithm in several special cases.
... In practice, the burden of checking and enforcing correctness constraints is left, often ambiguously, to the programmer, frequently resulting in error-prone distributed applications [33]. Hence, a new breed of research works and products has strived to bridge the divide between correctness and scalability, by offering optimized implementations of strong semantic abstractions [192,257] or by exposing tradeoffs matching specific application semantics [34,125]. ...
... Hence, new models have been devised to account for various combinations of fault tolerance concerns and application invariants. Researchers have been striving to formulate the minimum requirements in terms of correctness and, therefore, coordination, to allow for the design of fast yet functional distributed systems [34,30]. Furthermore, an ongoing and exciting research trend has been tackling this issue leveraging different tools and stack layers, spanning from programming languages [16] to data structures [213] and application-level static checkers [219,125]. ...
... Recent work by Bailis et al. [34] promotes the use of explicit application-level causality, which is a subset of potential causality, for building highly available distributed systems that would entail less overhead in terms of coordination and metadata maintenance. Furthermore, an increasing body of research has been drawing attention to causal consistency, considered an optimal tradeoff between user-perceived correctness and coordination overhead, especially in mobile or geo-replicated applications [177,36,252]. ...
Thesis
Engineering distributed systems is an onerous task: the design goals of performance, correctness and reliability are intertwined in complex tradeoffs, which have been outlined by multiple theoretical results. These tradeoffs have become increasingly important as computing and storage have shifted towards distributed architectures. Additionally, the general lack of systematic approaches to tackle distribution in modern programming tools has worsened these issues — especially as nowadays most programmers have to take on the challenges of distribution. As a result, there exists an evident divide between programming abstractions, application requirements and storage semantics, which hinders the work of designers and developers. This thesis presents a set of contributions towards the overarching goal of designing reliable distributed storage systems, by examining these issues through the prism of consistency. We begin by providing a uniform, declarative framework to formally define consistency semantics. We use this framework to describe and compare over fifty non-transactional consistency semantics proposed in previous literature. The declarative and composable nature of this framework allows us to build a partial order of consistency models according to their semantic strength. We show the practical benefits of composability by designing and implementing Hybris, a storage system that leverages different models and semantics to improve over the weak consistency generally offered by public cloud storage platforms. We demonstrate Hybris' efficiency and show that it can tolerate arbitrary faults of cloud stores at the cost of tolerating outages. Finally, we propose a novel technique to verify the consistency guarantees offered by real-world storage systems. This technique leverages our declarative approach to consistency: we consider consistency semantics as invariants over graph representations of storage system executions. A preliminary implementation proves this approach practical and useful in improving over the state-of-the-art on consistency verification.
... This ordering can be either total [15] or partial. Partial orderings can be categorized into causal dependencies [?] or explicitly stated relationships [5]. ...
... Detecting that receiving a message is safe is a complex task. This decision, and collecting the required metadata, can be space- and time-consuming [5,8,13,20]. Moreover, inspecting physical timestamps is not enough, since time does not necessarily move forward uniformly on all nodes in a distributed system. ...
... Generally, for most types of applications, this is not the case. Hence, another optimization would be to explicitly specify relationships between items [5]. In the best case, this can reduce the local space required to O(1). ...
Conference Paper
Full-text available
Causal message delivery, i.e. the requirement that messages are delivered in an order respecting their causal (logical) dependencies, is often mandated in the distributed setting. So far, causal message delivery has been implemented by augmenting messages with metadata that allows the receiver (or the platform) to re-order, and if necessary hold back, messages upon receipt before processing. We propose that causal message delivery can be achieved by construction, simply by organizing the nodes of the distributed application into a tree topology, and without the need for any metadata in the messages. We present our ideas informally through an example application and then develop a formal model and prove that causal message delivery is preserved in tree-based networks.
... For any operations a and b, a → b means that a is a dependency of b. Causal relationships are captured in two ways: via potential or explicit causality [37]. ...
... Under explicit causality, user interfaces can be provided to application developers to define causal relationships between operations [11,27]. Therefore, the number of dependencies tracked under explicit causality is often much smaller than under potential causality, which results in better performance and less metadata [37]. Considering its benefits, we choose to track explicit causality in this work. ...
... Otherwise, the client program will ignore causal relationships between comments unless it captures potential causality. The example also indicates a limitation of employing explicit causality: application programmers must consider how to merge a capturing strategy into their application logic to capture explicit dependencies [37]. For example, the programmers need to define the causal relationship of a comment and its replied comment besides implementing the basic functionality of comment replying. ...
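The explicit-causality interface described in these excerpts can be pictured with a small sketch: the application names each operation's dependencies, so the store tracks only those instead of the full potential-causality history. This is a hypothetical API written for illustration, not the interface of any cited system.

```python
class ExplicitCausalStore:
    """Writes carry only application-declared dependencies; metadata
    scales with what the programmer names, not the session history."""

    def __init__(self):
        self.applied = set()   # ids of operations already visible
        self.store = {}
        self.pending = []      # operations waiting on declared deps

    def put(self, op_id, key, value, deps=()):
        self.pending.append((op_id, key, value, tuple(deps)))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for op in list(self.pending):
                op_id, key, value, deps = op
                if all(d in self.applied for d in deps):
                    self.pending.remove(op)
                    self.store[key] = value
                    self.applied.add(op_id)
                    progress = True

# The comment/reply scenario from the excerpt: the programmer declares
# that the reply depends on the comment it answers, and nothing else.
store = ExplicitCausalStore()
store.put("c1", "post/42/comment/1", "Great post!")
store.put("c2", "post/42/comment/2", "I agree!", deps=("c1",))
```

As the excerpt notes, the burden falls on the programmer: if the reply's dependency on "c1" is not declared, the store is free to make the reply visible before the comment it answers.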
Article
Full-text available
The tradeoff between consistency and availability is inevitable when designing distributed data stores, and today's cloud services often choose high availability instead of strong consistency, leading to visible inconsistencies for clients. Convergent causal consistency is one of the strongest consistency models that remain available during system partitions, and it can also satisfy human perception of causality between events. In this paper, we present CoCaCo, a distributed key-value store that provides convergent causal consistency with asynchronous replication, since it is able to provide cloud services' desired properties, including high performance and availability. Moreover, CoCaCo can efficiently guarantee causal consistency by performing dependency checking only when handling read operations. We implement CoCaCo based on Cassandra and our experimental results indicate that CoCaCo provides performance comparable to eventually consistent Cassandra.
... This is due to the large number of false dependencies inevitably introduced when compressing metadata [21,22] (a false dependency is created when two concurrent operations are serialized as an artifact of the metadata management). The opposite happens with Cure, which exhibits a low (constant) visibility latency penalty but severely penalizes throughput due to the computation and storage overhead associated with metadata management [9,26]. ...
... Furthermore, current solutions are not designed to fully take advantage of partial geo-replication, a setting of practical relevance [19,23]. The culprit is that causal graphs are not easily partitionable, which may force sites to manage not only the metadata associated with the data items stored locally, but also the metadata associated with items stored remotely [9,39,53]. This fact exacerbates the problem of false dependencies, forcing solutions to delay the visibility of remote updates due to updates on data items that are not even replicated locally. ...
... The decoupling between data and metadata management is key in the design of SATURN. First, it relieves the datastore from managing consistency across datacenters, a task that may be highly costly [9,26]. Second, this separation permits SATURN to handle heavier loads independently of the size of the managed data. ...
Conference Paper
This paper presents the design, implementation, and evaluation of Saturn, a metadata service for geo-replicated systems. Saturn can be used in combination with several distributed and replicated data services to ensure that remote operations are made visible in an order that respects causality, a requirement central to many consistency criteria. Saturn addresses two key unsolved problems inherent to previous approaches. First, it eliminates the tradeoff between throughput and data freshness, when deciding what metadata to use for tracking causality. Second, it enables genuine partial replication, a key property to ensure scalability when the number of geo-locations increases. Saturn addresses these challenges while keeping metadata size constant, independently of the number of clients, servers, data partitions, and locations. By decoupling metadata management from data dissemination, and by using clever metadata propagation techniques, it ensures that the throughput and visibility latency of updates on a given item are (mostly) shielded from operations on other items or locations. We evaluate Saturn in Amazon EC2 using realistic benchmarks under both full and partial geo-replication. Results show that weakly consistent datastores can lean on Saturn to upgrade their consistency guarantees to causal consistency with a negligible penalty on performance.
... In the context of full replication, there has been significant effort in designing and implementing causally consistent shared memory systems, such as Lazy Replication [16], COPS [18], Orbe [8], SwiftCloud [32] and GentleRain [9]. While much of the past work on shared memory has addressed full replication, there is growing interest in partial replication [5,19,2,18,13,6], due to the potential storage efficiencies that can be attained with partial replication. Several researchers have observed that partial replication can require a large amount of metadata in order to track causal dependencies accurately under partial replication [2,18,13,6]. ...
... While much of the past work on shared memory has addressed full replication, there is growing interest in partial replication [5,19,2,18,13,6], due to the potential storage efficiencies that can be attained with partial replication. Several researchers have observed that partial replication can require a large amount of metadata in order to track causal dependencies accurately under partial replication [2,18,13,6]. Most relevant to this paper is the work of Helary and Milani [13,23]. ...
... Since e_{jk} ∉ E_{a_q} and Condition (1) for replica a_q does not hold, Condition (2) for a_q must be true. There may be two cases for Condition (2) for a_q to be true. In the first case, there exists a_{q'} with q' > q such that w ∈ X_{b_p a_{q'}}, which contradicts the fact that a_q has the largest subscript. ...
Article
Distributed shared memory systems maintain multiple replicas of the shared memory locations. Maintaining causal consistency in such systems has received significant attention in the past. However, much of the previous literature focuses on full replication wherein each replica stores a copy of all the locations in the shared memory. In this paper, we investigate causal consistency in partially replicated systems, wherein each replica may store only a subset of the shared data. To achieve causal consistency, it is necessary to ensure that, before an update is performed at any given replica, all causally preceding updates must also be performed. Achieving this goal requires some mechanism to track causal dependencies. In the context of full replication, this goal is often achieved using vector timestamps, with the number of vector elements being equal to the number of replicas. Building on the past work, this paper makes three key contributions:
1. We develop lower bounds on the size of the timestamps that must be maintained in order to achieve causal consistency in partially replicated systems. The size of the timestamps is a function of the manner in which the replicas share data, and the set of replicas accessed by each client.
2. We present an algorithm to achieve causal consistency in partially replicated systems using simple vector timestamps.
3. We present some optimizations to improve the overhead of the timestamps required with partial replication.
... In [6], Bailis et al. identify a critical trade-off between staleness ("visibility latency") and write throughput. As throughput utilization across replicas increases, creating longer queues, new data takes longer to arrive. ...
... Ladin et al. [43] proposed one of the first replicated systems to offer causal order. Their approach is quite interesting for us because, as in [6], they allow applications to specify the ordering they want (among "causal", "immediate" and "forced"), and their replica update scheme is lazy because the authors use gossip to optimize the replication mechanisms. While this ensures causality when applications require it, the lazy approach causes stale reads. ...
Article
Full-text available
We consider a setting where applications, such as websites or games, need causal access to objects available in geo-replicated cloud data stores. Common ways of implementing causal consistency involve hiding objects while waiting for their dependencies or waiting for server replicas to synchronize. To minimize delays and retrieve objects faster, applications may try to reach different server replicas at once. This entails a cost because providers charge for each reading request, including reading misses where the causal copy of the object is unavailable. Therefore, latency and cost are conflicting goals, which we control by selecting where to read and when. We formulate this challenge as a multi-criteria optimization problem and propose five non-dominated reading strategies, four of which are Pareto optimal, in a setting constrained to two server replicas. We validate these solutions on the following real cloud storage services: AWS S3, DynamoDB and MongoDB. Savings of as much as 50% on reading costs, with no significant or even a positive impact on latency, demonstrate that both clients and cloud providers could benefit from richer services compatible with these retrieval strategies.
... In [6], Bailis et al. identify a critical trade-off between staleness ("visibility latency") and write throughput. As throughput utilization across replicas increases, creating longer queues, new data takes longer to arrive. ...
... Ladin et al. [40] proposed one of the first replicated systems to offer causal order. Their approach is quite interesting for us because, as in [6], they allow applications to specify the ordering they want (among "causal", "immediate" and "forced"), and their replica update scheme is lazy because the authors use gossip to optimize the replication mechanisms. While this ensures causality when applications require it, the lazy approach causes stale reads. ...
Preprint
Full-text available
We consider a setting where applications, such as websites or games, need causal access to objects available in geo-replicated cloud data stores. Common ways of implementing causal consistency involve hiding objects while waiting for their dependencies or waiting for server replicas to synchronize. To minimize delays and retrieve objects faster, applications may try to reach different server replicas at once. This entails a cost because providers charge for each reading request, including reading misses where the causal copy of the object is unavailable. Therefore, latency and cost are conflicting goals, which we control by selecting where to read and when. We formulate this challenge as a multi-criteria optimization problem and propose five non-dominated reading strategies to solve it, four of which are Pareto optimal. We validate these solutions on the following real cloud storage services: AWS S3, DynamoDB and MongoDB. Savings of as much as 50% on reading costs, with no significant or even a positive impact on latency, demonstrate that both clients and cloud providers could benefit from richer services compatible with these retrieval strategies.
... Various causally-consistent algorithms exist [19,44]. One problematic aspect of these algorithms is that the metadata associated with tracking dependencies can be a burden [5,10,48]. This happens because such algorithms track all potential causal dependencies. ...
... This happens because such algorithms track all potential causal dependencies. In our AT2 algorithm for the message-passing model (Figure 4) we track dependencies explicitly [10], permitting a more efficient implementation with a smaller set of dependencies. More concretely, we specify that each transfer outgoing from an account only depends on previous transfers outgoing from and incoming to that (and only that) account, ignoring the transfers that affect other (irrelevant) accounts. ...
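The per-account dependency rule quoted above can be illustrated with a short sketch; the names are made up for exposition and this is not the AT2 authors' code. A transfer outgoing from an account lists as dependencies only the earlier transfers outgoing from and incoming to that same account.

```python
class Account:
    """Explicit, per-account dependency tracking in the spirit of the
    excerpt: transfers touching other accounts are never dependencies."""

    def __init__(self, name: str, balance: int = 0):
        self.name = name
        self.balance = balance
        self.history = []      # ids of transfers touching this account

def issue_transfer(src: Account, dst: str, amount: int, tid: str):
    """Build a transfer message carrying only src's own history as deps."""
    if amount > src.balance:
        raise ValueError("insufficient funds")
    deps = list(src.history)   # only this account's past transfers
    src.balance -= amount
    src.history.append(tid)
    # Broadcast (tid, src.name, dst, amount, deps); a receiver applies the
    # transfer only after it has applied every transfer listed in deps.
    return (tid, src.name, dst, amount, deps)

def apply_incoming(dst: Account, amount: int, tid: str):
    """Record an incoming transfer; it becomes a dependency of later
    transfers outgoing from dst (since it may fund them)."""
    dst.balance += amount
    dst.history.append(tid)
```

Because deps never mentions transfers on unrelated accounts, two transfers between disjoint pairs of accounts can be applied in any order — exactly the saving over tracking all potential causality.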
Preprint
Many blockchain-based protocols, such as Bitcoin, implement a decentralized asset transfer (or exchange) system. As clearly stated in the original paper by Nakamoto, the crux of this problem lies in prohibiting any participant from engaging in double-spending. There seems to be a common belief that consensus is necessary for solving the double-spending problem. Indeed, whether it is for a permissionless or a permissioned environment, the typical solution uses consensus to build a totally ordered ledger of submitted transfers. In this paper we show that this common belief is false: consensus is not needed to implement a decentralized asset transfer system. We do so by introducing AT2 (Asynchronous Trustworthy Transfers), a class of consensusless algorithms. To show formally that consensus is unnecessary for asset transfers, we consider this problem first in the shared-memory context. We introduce AT2_SM, a wait-free algorithm that asynchronously implements asset transfer in the read-write shared-memory model. In other words, we show that the consensus number of an asset-transfer object is one. In the message-passing model with Byzantine faults, we introduce a generic asynchronous algorithm called AT2_MP and discuss two instantiations of this solution. First, AT2_D ensures deterministic guarantees and consequently targets small-scale deployments (tens to hundreds of nodes), typically in a permissioned environment. Second, AT2_P provides probabilistic guarantees and scales well to a very large system size (tens of thousands of nodes), ensuring logarithmic latency and communication complexity. Instead of consensus, we construct AT2_D and AT2_P on top of a broadcast primitive with causal ordering guarantees offering deterministic and probabilistic properties, respectively.
... The following consistency models (apart from Linearizability) do not consider staleness [34]. In fact, increasing strictness of ordering guarantees often leads to higher staleness values as updates may not be applied directly but are required to fulfill dependencies first (e.g., [3]). ...
... This both adds an overhead and increases staleness as updates cannot become visible right away. Bailis et al. [3] propose to minimize this impact by having the application explicitly define dependencies that need to be considered. A typical implementation uses vector clocks to identify (potential) causal dependencies. ...
Chapter
Full-text available
Due to the advent of eventually consistent storage systems, consistency has become a focus of research. Still, a clear overview of consistency in distributed systems is missing. In this work, we define and describe consistency, show how different consistency models and perspectives are related and briefly discuss how concrete consistency guarantees of a distributed storage system can be measured.
... More advanced protocols like Raft [22] and Paxos [14,16,17] improve these trade-offs in some dimensions compared to our examples, but they are still bound by fundamental limitations and trade-offs that exist in distributed systems. Furthermore, Bailis et al. [2] note that even assuming causal consistency globally can lead to scalability issues, because it leads to causal dependencies that are not required by the application semantics. ...
Preprint
Full-text available
Mixed-consistency programming models assist programmers in designing applications that provide high availability while still ensuring application-specific safety invariants. However, existing models often make specific system assumptions, such as building on a particular database system or having baked-in coordination strategies. This makes it difficult to apply these strategies in diverse settings, ranging from client/server to ad-hoc peer-to-peer networks. This work proposes a new strategy for building programmable coordination mechanisms based on the algebraic replicated data types (ARDTs) approach. ARDTs allow for simple and composable implementations of various protocols, while making minimal assumptions about the network environment. As a case study, two different locking protocols are presented, both implemented as ARDTs. In addition, we elaborate on our ongoing efforts to integrate the approach into the LoRe mixed-consistency programming language.
... Therefore, all processes based on this consistency share a solid view of the operations. This behavior resembles the serial, sequential behavior of a single server [57,11,58,59]. To order operations sequentially and to avoid conflicting operations, the linearizability model is used instead of sequential consistency [60]. ...
Preprint
The replication mechanism resolves some challenges with big data, such as data durability, data access, and fault tolerance. Yet, replication itself gives rise to another challenge, known as consistency in distributed systems. Scalability and availability are the challenging criteria upon which replication in distributed systems is based, and both in turn require consistency. Consistency in distributed computing systems has been employed in three different fields: system architecture, distributed databases, and distributed systems. Consistency models can be sorted, based on their applicability, from strong to weak. Our goal is to propose a novel viewpoint on the different consistency models utilized in distributed systems. This research proposes two different categorizations of consistency models. Initially, consistency models are categorized into three groups: data-centric, client-centric, and hybrid models. Each of these is then grouped into three subcategories of traditional, extended, and novel consistency models. Consequently, the concepts and procedures are expressed in mathematical terms, introduced in order to present our models' behavior without implementation. Moreover, we have surveyed different aspects of challenges with respect to consistency, i.e., availability, scalability, security, fault tolerance, latency, violation, and staleness, of which the latter two, violation and staleness, play the most pivotal roles in terms of consistency and trade-off balancing. Finally, the contribution of each consistency model and the growing need for them in distributed systems are investigated.
... Logically, the system should not allow an event that represents the request for reimbursement to be processed until a successful payment event has been observed. This connection between the two events can be interpreted as an explicit happens-before event causality that implies an order on the two events and, ultimately, an order between the operations responsible for proceeding with a payment and a reimbursement [1,17]. However, because functionalities are distributed across many different microservices and microservices are developed independently from each other, it is easy to overlook such causalities between events. ...
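A minimal sketch of enforcing such an application-defined happens-before edge appears below; the gate and event names are hypothetical, written only to illustrate holding back a reimbursement event until its payment event has been observed.

```python
from collections import defaultdict

class CausalGate:
    """Park an event until every explicitly declared cause has been
    observed; illustrative, not an actual microservice framework API."""

    def __init__(self):
        self.seen = set()
        self.parked = defaultdict(list)   # unmet cause -> waiting events

    def observe(self, event_id, payload, causes=()):
        missing = [c for c in causes if c not in self.seen]
        if missing:
            # Park under one unmet cause; re-examined once it arrives.
            self.parked[missing[0]].append((event_id, payload, tuple(causes)))
            return
        self._process(event_id, payload)

    def _process(self, event_id, payload):
        self.seen.add(event_id)
        print(f"processing {event_id}: {payload}")
        for eid, p, causes in self.parked.pop(event_id, []):
            self.observe(eid, p, causes)   # may still await other causes

gate = CausalGate()
gate.observe("reimburse-7", {"amount": 30}, causes=("payment-7",))  # parked
gate.observe("payment-7", {"amount": 30})   # releases the reimbursement
```

The excerpt's point stands: nothing in an event-driven microservice architecture declares the payment-to-reimbursement edge automatically; a developer must state it explicitly or the reimbursement may be processed first.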
Conference Paper
Full-text available
There is an emerging trend of migrating traditional service-oriented monolithic systems to the microservice architecture. However, this involves the separation of data previously contained in a single database into several databases tailored to specific domains. Developers are thus faced with a new challenge: features such as transaction processing, coordination, and consistency preservation, which were previously supported by the central database, must now be implemented in a decentralized, asynchronously communicating, distributed structure. Numerous prior studies show that these challenges are not met satisfactorily, resulting in inconsistent system states with potentially detrimental consequences. Therefore, we propose to design a coordination service that relies on clear event-based and data-centric formal semantics for microservices, specifying the interaction of cross-microservice transactions with their respective databases. Furthermore, we provide a formalization of consistency properties and outline how they can be used to support dynamic monitoring as well as enforcement of consistency properties, thereby providing robust microservice systems. The envisioned architecture can significantly alleviate the developers' burden of implementing complicated distributed algorithms to maintain consistency across decentralized databases.
... The complex interaction patterns found in microservices force developers to deal with the possible interleaving of event streams. Building on the observation that microservice applications already implement substantial data management logic at the application level, we take advantage of this fact to allow developers to explicitly define the relevant dependencies of events [2]. Through a library abstraction on top of the Kafka Streams API [6], we allow developers to specify dependencies cutting across distinct events, thus enjoying the benefit of not having to track all potential causality among events that have no dependencies on each other. ...
Conference Paper
Full-text available
Microservice architectures are an emerging paradigm for developing event-driven applications. By prescribing that an application is decomposed into small and independent components, each encapsulating its own state and communicating via asynchronous events, new components and events can be easily integrated into the system. However, by pursuing a model where events are generated and processed at the application level, developers have a hard time preventing arbitrary event interleavings from doing harm to application safety. To address these challenges, we start by analyzing event-driven microservice open-source applications to identify unsafe interleavings. Next, we categorize event-based constraints to address such unsafe encodings, providing an easy-to-use guide for microservice developers. Finally, we introduce StreamConstraints, a library built on top of Kafka Streams designed to enforce explicit event-based constraints defined by developers. We showcase StreamConstraints on the case of a popular event-driven microservice system, and demonstrate how it could benefit from event-based constraints to ensure application safety.
... This suggests that depending upon the requirements of the clients of the library, there is a tradeoff between consistency and correctness that can be effectively explored. It has long been known that Causal Consistency incurs a performance penalty [3] due to expensive dependency tracking, significant metadata storage, and long wait times for all causally dependent data to arrive. A number of recent approaches [9,14,28] have looked at improving the performance of Causal Consistency, mainly by reducing the amount of dependent data required. ...
Chapter
Full-text available
Geo-replicated systems provide a number of desirable properties such as globally low latency, high availability, scalability, and built-in fault tolerance. Unfortunately, programming correct applications on top of such systems has proven to be very challenging, in large part because of the weak consistency guarantees they offer. These complexities are exacerbated when we try to adapt existing highly-performant concurrent libraries developed for shared-memory environments to this setting. The use of these libraries, developed with performance and scalability in mind, is highly desirable. But, identifying a suitable notion of correctness to check their validity under a weakly consistent execution model has not been well-studied, in large part because it is problematic to naïvely transplant criteria such as linearizability that has a useful interpretation in a shared-memory context to a distributed one where the cost of imposing a (logical) global ordering on all actions is prohibitive. In this paper, we tackle these issues by proposing appropriate semantics and specifications for highly-concurrent libraries in a weakly-consistent, replicated setting. We use these specifications to develop a static analysis framework that can automatically detect correctness violations of library implementations parameterized with respect to the different consistency policies provided by the underlying system. We use our framework to analyze the behavior of a number of highly non-trivial library implementations of stacks, queues, and exchangers. Our results provide the first demonstration that automated correctness checking of concurrent libraries in a weakly geo-replicated setting is both feasible and practical.
... This suggests that depending upon the requirements of the clients of the library, there is a trade-off between consistency and correctness that can be effectively explored. It has long been known that Causal Consistency incurs a performance penalty [3] due to expensive dependency tracking, significant metadata storage, and long wait times for all causally dependent data to arrive. A number of recent approaches [26,9,13] have looked at improving the performance of Causal Consistency, mainly by reducing the amount of dependent data required. ...
Preprint
Full-text available
Geo-replicated systems provide a number of desirable properties such as globally low latency, high availability, scalability, and built-in fault tolerance. Unfortunately, programming correct applications on top of such systems has proven to be very challenging, in large part because of the weak consistency guarantees they offer. These complexities are exacerbated when we try to adapt existing highly-performant concurrent libraries developed for shared-memory environments to this setting. The use of these libraries, developed with performance and scalability in mind, is highly desirable. But, identifying a suitable notion of correctness to check their validity under a weakly consistent execution model has not been well-studied, in large part because it is problematic to naively transplant criteria such as linearizability that has a useful interpretation in a shared-memory context to a distributed one where the cost of imposing a (logical) global ordering on all actions is prohibitive. In this paper, we tackle these issues by proposing appropriate semantics and specifications for highly-concurrent libraries in a weakly-consistent, replicated setting. We use these specifications to develop a static analysis framework that can automatically detect correctness violations of library implementations parameterized with respect to the different consistency policies provided by the underlying system. We use our framework to analyze the behavior of a number of highly non-trivial library implementations of stacks, queues, and exchangers. Our results provide the first demonstration that automated correctness checking of concurrent libraries in a weakly geo-replicated setting is both feasible and practical.
... Since it is too expensive for most modern applications to track all possible causal dependencies in production systems, Bailis et al. proposed narrowing down the dependencies to the events that matter, passing the responsibility of defining causal events to the application [10]. In this approach, the data store focuses on enforcing causal relationships on the events explicitly requested by the application instead of enforcing them on the global history of events that ever occurred. ...
Conference Paper
MongoDB is a distributed database that supports replication and horizontal partitioning (sharding). MongoDB replica sets consist of a primary that accepts all client writes and then propagates those writes to the secondaries. Each member of the replica set contains the same set of data. For horizontal partitioning, each shard (or partition) is a replica set. This paper discusses the design and rationale behind MongoDB's implementation of a cluster-wide logical clock and causal consistency. The design leveraged ideas from across the research community to ensure that the implementation adds minimal processing overhead, tolerates possible operator errors, and gives protection against non-trusted client attacks. While the goal of the team was not to discover or test new algorithms, the practical implementation necessitated a novel combination of ideas from the research community on causal consistency, security, and minimal performance overhead at scale. This paper describes a large scale, practical implementation of causal consistency using a hybrid logical clock, adding the signing of logical time ranges to the protocol, and introducing performance optimizations necessary for systems at scale. The implementation seeks to define an event as a state change and as such must make forward progress guarantees even during periods of no state changes for a partition of data.
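Since the abstract above centers on a cluster-wide hybrid logical clock, a minimal sketch of the generic hybrid-logical-clock rules may help; this is the textbook algorithm, not MongoDB's actual implementation, and the method names are illustrative.

```python
import time

class HybridLogicalClock:
    """Timestamps are (l, c) pairs: l tracks the highest physical time
    observed, c breaks ties so that happens-before is preserved even
    while nodes' wall clocks drift."""

    def __init__(self):
        self.l = 0   # highest physical time observed so far
        self.c = 0   # logical counter within one value of l

    def _wall(self) -> int:
        return int(time.time() * 1000)   # wall-clock milliseconds

    def tick(self):
        """Advance for a local or send event."""
        pt = self._wall()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def merge(self, remote):
        """Advance on receipt of a remote timestamp."""
        rl, rc = remote
        pt = self._wall()
        if pt > self.l and pt > rl:
            self.l, self.c = pt, 0        # physical time dominates
        elif rl > self.l:
            self.l, self.c = rl, rc + 1   # remote clock is ahead
        elif self.l > rl:
            self.c += 1                   # local logical time is ahead
        else:
            self.c = max(self.c, rc) + 1  # equal l: advance past both counters
        return (self.l, self.c)
```

Comparing timestamps lexicographically as (l, c) pairs orders events consistently with causality while staying close to physical time, which is what allows such a clock to anchor causally consistent sessions at scale.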
... On the other hand, a probabilistic causal broadcast is not suitable for the implementation of the causal aggregation broadcast protocol presented in Chapter 4 of this thesis, because the proposed aggregation mechanism aims at combining as many causally related messages as possible into a single message, thus requiring precise knowledge about the chain of causal dependencies of a received message, not incomplete or partial knowledge. Bailis et al. (2012) discuss some scalability issues associated with traditional mechanisms used to track causal dependencies, in terms of the number of dependencies and the time necessary to check them. In contrast to the traditional concept of causality, where the entire history of preceding messages may affect a new one, they propose the use of explicit or application-defined causality: a subset of the causal history that reflects only causality at the application level. ...
Thesis
The Publish/Subscribe (Pub/Sub) paradigm enables nodes of a distributed system to disseminate information asynchronously. This thesis investigates how to provide a communication-efficient topic-based Pub/Sub system by addressing the problems of traffic overhead and message contention, present in several tree-based solutions. The proposed contributions build distributed spanning trees on top of a hypercube-like topology, such that the source of each message is the root of its own dynamically built spanning tree. Trees rooted at different nodes are organized differently. First, a causal broadcast protocol is proposed which reduces network traffic by aggregating messages without the use of timers. It exploits the causal relation between messages and path intersections between different trees. Unlike existing timer-based approaches, it does not increase delivery latency. The second contribution is a topic-based Pub/Sub system, VCube-PS, which ensures causal delivery order for messages published to the same topic and efficiently supports publication of messages to "hot topics", i.e., topics with high publication rates. Simulation results confirm that the proposed causal aggregation protocol reduces network traffic as well as delivery latencies, since there is less message contention. Compared to an approach that uses a single tree per topic, VCube-PS performs better when there is a high publication rate per topic, since it provides load balancing of publications.
... Therefore, all processes based on this consistency share a solid view of the operations. This behavior resembles the serial, sequential behavior of a single server [57,11,58,59]. To order operations sequentially and to avoid conflicting operations, the linearizability model is used instead of sequential consistency [60]. ...
Preprint
Full-text available
The replication mechanism resolves some challenges with big data, such as data durability, data access, and fault tolerance. Yet, replication itself gives rise to another challenge, known as consistency in distributed systems. Scalability and availability are the challenging criteria upon which replication in distributed systems is based, and both in turn require consistency. Consistency in distributed computing systems has been employed in three different fields: system architecture, distributed databases, and distributed systems. Consistency models can be sorted, based on their applicability, from strong to weak. Our goal is to propose a novel viewpoint on the different consistency models utilized in distributed systems. This research proposes two different categorizations of consistency models. Initially, consistency models are categorized into three groups: data-centric, client-centric, and hybrid models. Each of these is then grouped into three subcategories of traditional, extended, and novel consistency models. Consequently, the concepts and procedures are expressed in mathematical terms, introduced in order to present our models' behavior without implementation. Moreover, we have surveyed different aspects of challenges with respect to consistency, i.e., availability, scalability, security, fault tolerance, latency, violation, and staleness, of which the latter two, violation and staleness, play the most pivotal roles in terms of consistency and trade-off balancing. Finally, the contribution of each consistency model and the growing need for them in distributed systems are investigated.
... They proposed proof techniques to verify the sufficiency of user-specified consistency choices, or require user annotations to identify consistency choices and do not guarantee convergence [Balegas et al. 2015a]. Further, many approaches [Balegas et al. 2015a; Li et al. 2014] are crucially dependent on causal consistency as the weakest possible notion, while others have established the scalability limitations of causal consistency [Bailis et al. 2012a]. We will further survey related works in §9. Given a sequential object with its integrity properties, our goal is to automatically synthesize a correct-by-construction replicated object that guarantees integrity and convergence and avoids unnecessary coordination: synchronization and dependency tracking between operations. ...
Article
Full-text available
Distributed system replication is widely used as a means of fault-tolerance and scalability. However, it provides a spectrum of consistency choices that impose a dilemma for clients between correctness, responsiveness and availability. Given a sequential object and its integrity properties, we automatically synthesize a replicated object that guarantees state integrity and convergence and avoids unnecessary coordination. Our approach is based on a novel sufficient condition for integrity and convergence called well-coordination that requires certain orders between conflicting and dependent operations. We statically analyze the given sequential object to decide its conflicting and dependent methods and use this information to avoid coordination. We present novel coordination protocols that are parametric in terms of the analysis results and provide the well-coordination requirements. We implemented a tool called Hamsaz that can automatically analyze the given object, instantiate the protocols and synthesize replicated objects. We have applied Hamsaz to a suite of use-cases and synthesized replicated objects that are significantly more responsive than the strongly consistent baseline.
... Causality is a general-purpose formalism that specifies the causal history of a particular event in an arbitrary distributed system. However, causality is too general-purpose, as it fails to incorporate any semantics of the underlying distributed system [2]. As a consequence, the causal history of an event is an overapproximation of the cause of the event. ...
Conference Paper
Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event's causal history is an over-approximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input. This enables system developers to reason about the causes of events in real-world distributed systems. We observe that automatically extracting the wat-provenance of a state machine is often infeasible. Fortunately, many distributed systems components have simple interfaces from which a developer can directly specify wat-provenance using a technique we call wat-provenance specifications. Leveraging the theoretical foundations of wat-provenance, we implement a prototype distributed debugging framework called Watermelon.
... Blessing et al. [10] go further by eliminating the use of metadata carried by the messages. They exploit application-defined causal order [6] in order to organize the actors (processes) of an application into a tree topology that guarantees causal order delivery. The path used by the "causing" message must somehow be included in the path of the "caused" ones. ...
Conference Paper
Full-text available
A causal broadcast ensures that messages are delivered to all nodes (processes) preserving the causal relation of the messages. In this paper, we propose a causal broadcast protocol for distributed systems whose nodes are logically organized in a virtual hypercube-like topology called VCube. Messages are broadcast by dynamically building spanning trees rooted in the message's source node. By using multiple trees, the contention bottleneck of a single-root spanning tree approach is avoided. Furthermore, different trees can intersect at some nodes. Hence, by taking advantage of both the out-of-order reception of causally related messages at a node and these path intersections, a node can delay, for one or more of its children in the tree, the forwarding of messages whose causal dependencies it knows the children in question cannot yet satisfy. Such a delay does not induce any overhead. Experimental evaluation conducted on top of the PeerSim simulator confirms the communication effectiveness of our causal broadcast protocol in terms of latency and message traffic reduction.
... Bailis et al. study the overhead of replication and dependency tracking in geo-replicated CC systems [10]. By contrast, we investigate the inherent cost of latency-optimal CC designs, i.e., even in absence of (geo-)replication. ...
Conference Paper
Full-text available
Causal consistency is an attractive consistency model for geo-replicated data stores. It is provably the strongest model that tolerates network partitions. It avoids the long latencies associated with strong consistency, and, especially when using read-only transactions (ROTs), it prevents many of the anomalies of weaker consistency models. Recent work has shown that causal consistency allows "latency-optimal" ROTs, which are nonblocking, single-round and single-version in terms of communication. On the surface, this latency optimality is very appealing, as the vast majority of applications are assumed to have read-dominated workloads. In this paper, we show that such "latency-optimal" ROTs induce an extra overhead on writes that is so high that it actually jeopardizes performance even in read-dominated workloads. We show this result from a practical as well as from a theoretical angle. We present the Contrarian protocol that implements "almost latency-optimal" ROTs, but that does not impose on the writes any of the overheads incurred by latency-optimal protocols. In Contrarian, ROTs are nonblocking and single-version, but they require two rounds of client-server communication. We experimentally show that this protocol not only achieves higher throughput, but, surprisingly, also provides better latencies for all but the lowest loads and the most read-heavy workloads. We furthermore prove that the extra overhead imposed on writes by latency-optimal ROTs is inherent, i.e., it is not an artifact of the design we consider, and cannot be avoided by any implementation of latency-optimal ROTs. We show in particular that this overhead grows linearly with the number of clients.
... Ideally, all distributed systems would treat reading and writing operations on all data identically (strict consistency) [1], but implementing strict consistency is impossible due to delays in transferring and delivering data items. Therefore, various methods have been proposed to maintain data consistency in distributed systems, such as causal consistency [2]-[5]; we previously provided a model for causal consistency in distributed systems using hierarchical coloured Petri nets [6]. In these algorithms, operations executed in distributed systems follow a specific order, as in the sequential consistency model [7], the weak consistency model [8], and FIFO or PRAM consistency [7]. ...
Article
Full-text available
With regard to recent developments and the wide application of distributed systems, keeping data consistent has been considered a serious challenge in these systems. Colored Petri Nets (CPNs) have a high capacity for modeling various algorithms and proving them mathematically, and proving the presented models is of great importance. The importance of keeping consistency in distributed systems at different levels has long been recognized. Therefore, in this research, a hierarchical model for weak consistency along with UTC global time is presented in CPN Tools for the first time. The presented model is proved and implemented using the simulator provided in CPN Tools. In this study, we show how our method is modeled and coded in the ML language for distributed systems so that an acceptable level of weak consistency in distributed systems is obtained.
... Unfortunately, implementing causal consistency is costly due to the computation, communication, and storage overhead caused by metadata management [15,22,12]. A common solution to reduce this cost consists in compressing metadata by serializing sources of concurrency, which unavoidably creates false dependencies among concurrent events, increasing visibility latencies (the time interval between the instant an update is installed in its origin datacenter and the instant it becomes visible in remote datacenters). ...
Article
Full-text available
In this paper we propose a novel approach to manage the throughput vs latency tradeoff that emerges when managing updates in geo-replicated systems. Our approach consists in allowing full concurrency when processing local updates and using a deferred local serialisation procedure before shipping updates to remote datacenters. This strategy makes it possible to implement inexpensive mechanisms to ensure system consistency requirements while avoiding intrusive effects on update operations, a major performance limitation of previous systems. We have implemented our approach as a variant of Riak KV. Our extensive evaluation shows that we outperform sequencer-based approaches by almost an order of magnitude in the maximum achievable throughput. Furthermore, unlike previous sequencer-free solutions, our approach reaches nearly optimal remote update visibility latencies without limiting throughput.
... Compared to the large number of protocols assuming Complete Replication and Propagation (CRP), partial replication has received less attention. Several researchers have addressed challenges in achieving causal consistency under partial replication, mainly because of the large amount of metadata the system needs to keep track of in order to characterize accurate dependencies [2,13,8,4]. Tracking causal dependencies with a minimum amount of metadata is an interesting problem from both a theoretical and a practical perspective. ...
Article
Maintaining causal consistency in distributed shared memory systems using vector timestamps has received a lot of attention from both theoretical and practical perspectives. However, most of the previous literature focuses on full replication, where each data item is stored in all replicas, which may not be scalable due to the increasing amount of data. In this report, we investigate how to achieve causal consistency in partially replicated systems, where each replica may store a different set of data. We propose an algorithm that tracks causal dependencies via vector timestamps in the client-server model for partial replication. The cost of our algorithm in terms of timestamp size varies as a function of the manner in which the replicas share data, and the set of replicas accessed by each client. We also establish a connection between our algorithm and previous work on full replication.
... how causality is used to ensure consistency. In addition to being model-theoretic, his approach, unlike our work, is not based on explicit causality [3]. One particular motivation for confining the universal knowledge of a world of events to microcosms is scalability. ...
Article
Full-text available
Interactions between internet users are mediated by their devices and the common support infrastructure in data centres. Keeping track of causality amongst actions that take place in this distributed system is key to providing a seamless interaction where effects follow causes. Tracking causality in large-scale interactions is difficult due to the cost of keeping large quantities of metadata; it is even more challenging when dealing with resource-limited devices. In this paper, we focus on keeping partial knowledge of causality and address deduction from that knowledge. We provide the first proof-theoretic causality modelling for distributed partial knowledge. We prove computability and consistency results. We also prove that the partial knowledge gives rise to a weaker model than classical causality. We provide rules for offline deduction about causality and refute some related folklore. We define two notions of forward and backward bisimilarity between devices, using which we prove two important results. Namely, no matter the order of addition/removal, two devices deduce similarly about causality so long as: (1) the same causal information is fed to both; (2) they start bisimilar and erase the same causal information. Thanks to our establishment of forward and backward bisimilarity, respectively, proofs of the latter two results work by simple induction on length.
Article
In many scenarios, information must be disseminated over intermittently-connected environments when the network infrastructure becomes unavailable, e.g., during disasters where first responders need to send updates about critical tasks. If such updates pertain to a shared data set, dissemination consistency is important. This can be achieved through causal ordering and consensus. Popular consensus algorithms, e.g., Paxos, are most suited for connected environments. While some work has been done on designing consensus algorithms for intermittently-connected environments, such as the One-Third Rule (OTR) algorithm, there is still need to improve their efficiency and timely completion. We propose CoNICE, a framework to ensure consistent dissemination of updates among users in intermittently-connected, infrastructure-less environments. It achieves efficiency by exploiting hierarchical namespaces for faster convergence, and lower communication overhead. CoNICE provides three levels of consistency to users, namely replication, causality and agreement. It uses epidemic propagation to provide adequate replication ratios, and optimizes and extends Vector Clocks to provide causality. To ensure agreement, CoNICE extends OTR to also support long-term network fragmentation and decision invalidation scenarios; we define local and global consensus pertaining to within and across fragments respectively. We integrate CoNICE’s consistency preservation with a naming schema that follows a topic hierarchy-based dissemination framework, to improve functionality and performance. Using the Heard-Of model formalism, we prove CoNICE’s consensus to be correct. Our technique extends previously established proof methods for consensus in asynchronous environments. Performing city-scale simulation, we demonstrate CoNICE’s scalability in achieving consistency in convergence time, utilization of network resources, and reduced energy consumption.
Article
In this article we study the properties of distributed systems that mix eventual and strong consistency. We formalize such systems through acute cloud types (ACTs), abstractions similar to conflict-free replicated data types (CRDTs), which by default work in a highly available, eventually consistent fashion, but which also feature strongly consistent operations for tasks which require global agreement. Unlike other mixed-consistency solutions, ACTs can rely on efficient quorum-based protocols, such as Paxos. Hence, ACTs gracefully tolerate machine and network failures also for the strongly consistent operations. We formally study ACTs and demonstrate phenomena which are neither present in purely eventually consistent nor strongly consistent systems. In particular, we identify temporary operation reordering, which implies interim disagreement between replicas on the relative order in which the client requests were executed. When not handled carefully, this phenomenon may lead to undesired anomalies, including circular causality. We prove an impossibility result which states that temporary operation reordering is unavoidable in mixed-consistency systems with sufficiently complex semantics. Our result is startling, because it shows that apparent strengthening of the semantics of a system (by introducing strongly consistent operations to an eventually consistent system) results in the weakening of the guarantees on the eventually consistent operations.
Conference Paper
In many scenarios, information must be disseminated over intermittently-connected environments when network infrastructure becomes unavailable. Example scenarios include disasters in which first responders need to send updates about their tasks and provide critical information for search and rescue. If such updates pertain to a shared data set (e.g., pins on a map), their consistent dissemination is important. We can achieve this through causal ordering and consensus. Popular consensus algorithms, such as Paxos and Raft, are best suited for connected environments with reliable links. While some work has been done on designing consensus algorithms for intermittently-connected environments, such as the One-Third Rule (OTR) algorithm, there is a need to improve their efficiency and timely completion. We propose CoNICE, a framework to ensure consistent dissemination of updates among users in intermittently-connected, infrastructure-less environments. It achieves efficiency by exploiting hierarchical namespaces for faster convergence and lower communication overhead. CoNICE provides three levels of consistency to users' views, namely replication, causality and agreement. It uses epidemic propagation to provide adequate replication ratios, and optimizes and extends Vector Clocks to provide causality. To ensure agreement, CoNICE extends basic OTR to support long-term fragmentation and critical decision invalidation scenarios. We integrate the multi-level consistency schema of CoNICE with a naming schema that follows a topic hierarchy-based dissemination framework, to improve functionality and performance. Performing city-scale simulation experiments, we demonstrate that CoNICE is effective in achieving its consistency goals, and is efficient and scalable in the time for convergence and utilized network resources.
Article
Causal consistency has emerged as an attractive middle ground for architecting cloud storage systems, as it allows for high availability and low latency, while supporting stronger-than-eventual-consistency semantics. However, causally-consistent cloud storage systems have seen limited deployment in practice. A key factor is that these systems employ full replication of all the data in all the data centers (DCs), incurring high cost. A simple extension of current causal systems to support partial replication by clustering DCs into rings incurs availability and latency problems. We propose Karma, the first system to enable causal consistency for partitioned data stores while achieving the cost advantages of partial replication without the availability and latency problems of the simple extension. Our evaluation with 64 servers emulating 8 geo-distributed DCs shows that Karma (i) incurs much lower cost than a fully-replicated causal store (obviously due to the lower replication factor); and (ii) offers higher availability and better performance than the above partial-replication extension at similar costs.
Conference Paper
We posit that striving for distributed systems that provide "single system image" semantics is fundamentally flawed and at odds with how systems operate in the physical world. We realize the database as an optimization of this system: a required, essential optimization in practice that facilitates central data placement and ease of access to participants in a system. We motivate a new model of computation that is designed to address the problems of computation over "eventually consistent" information in a large-scale distributed system.
Article
Data replication is commonly used for fault tolerance in reliable distributed systems. In large-scale systems, it additionally provides low latency. Recently, causal consistency in such systems has received much attention. However, existing works assume the data is fully replicated. This greatly simplifies the design of the algorithms to implement causal consistency. In this paper, we propose that it can be advantageous to have partial replication of data, and we propose two algorithms for achieving causal consistency in systems where the data is only partially replicated. This work is the first to explore causal consistency for partially replicated distributed systems. We also give a special-case algorithm for causal consistency in the full-replication case. We give simulation results to show the performance of our algorithms and to present the advantage of partial replication over full replication.
Article
Modern replicated data stores aim to provide high availability, by immediately responding to client requests, often by implementing objects that expose concurrency. Such objects, for example, multi-valued registers (MVRs), do not have sequential specifications. This paper explores a recent model for replicated data stores that can be used to precisely specify causal consistency for such objects, and liveness properties like eventual consistency, without revealing details of the underlying implementation. The model is used to prove the following results: 1) An eventually consistent data store implementing MVRs cannot satisfy a consistency model strictly stronger than observable causal consistency (OCC). OCC is a model somewhat stronger than causal consistency, which captures executions in which client observations can use causality to infer concurrency of operations. This result holds under certain assumptions about the data store. 2) Under the same assumptions, an eventually consistent and causally consistent replicated data store must send messages of size linear in the size of the system: if s objects, each Ω(lg k) bits in size, are supported by n replicas, then there is an execution in which an Ω(min{n, s} · lg k)-bit message is sent.
Article
Databases can provide scalability by partitioning data across several servers. However, multipartition, multioperation transactional access is often expensive, employing coordination-intensive locking, validation, or scheduling mechanisms. Accordingly, many real-world systems avoid mechanisms that provide useful semantics for multipartition operations. This leads to incorrect behavior for a large class of applications including secondary indexing, foreign key enforcement, and materialized view maintenance. In this work, we identify a new isolation model—Read Atomic (RA) isolation—that matches the requirements of these use cases by ensuring atomic visibility: either all or none of each transaction’s updates are observed by other transactions. We present algorithms for Read Atomic Multipartition (RAMP) transactions that enforce atomic visibility while offering excellent scalability, guaranteed commit despite partial failures (via coordination-free execution), and minimized communication between servers (via partition independence). These RAMP transactions correctly mediate atomic visibility of updates and provide readers with snapshot access to database state by using limited multiversioning and by allowing clients to independently resolve nonatomic reads. We demonstrate that, in contrast with existing algorithms, RAMP transactions incur limited overhead—even under high contention—and scale linearly to 100 servers.
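To make the RAMP idea above concrete, here is a loose single-process sketch of the read-repair step that enforces atomic visibility. It is illustrative only: the paper's algorithms shard data across partitions and run a two-round commit, which this toy omits, and every identifier here is invented.

# Toy model of RAMP-style atomic visibility (a sketch, not the paper's code).
# Each version records the writing transaction's timestamp and the sibling
# keys written by that transaction; readers use this metadata to detect and
# repair fractured (non-atomic) reads.

versions = {}   # key -> list of {"ts": int, "value": ..., "siblings": set}

def write_tx(ts, writes):
    # Install all writes of one transaction under a single timestamp.
    sibs = set(writes)
    for k, v in writes.items():
        versions.setdefault(k, []).append(
            {"ts": ts, "value": v, "siblings": sibs})

def latest(key):
    return max(versions.get(key, []), key=lambda v: v["ts"], default=None)

def version_at(key, ts):
    return next((v for v in versions.get(key, []) if v["ts"] == ts), None)

def read_tx(keys):
    # Round 1: read the latest visible version of each key.
    got = {k: latest(k) for k in keys}
    # From the sibling metadata, compute the highest transaction timestamp
    # each key must reflect for the read set to be atomically visible.
    required = dict.fromkeys(keys, 0)
    for v in got.values():
        if v:
            for s in v["siblings"]:
                if s in required:
                    required[s] = max(required[s], v["ts"])
    # Round 2: fetch the specific missing versions (present in RAMP because
    # writers install all siblings before a transaction becomes visible).
    for k in keys:
        if required[k] and (got[k] is None or got[k]["ts"] < required[k]):
            got[k] = version_at(k, required[k])
    return {k: (v["value"] if v else None) for k, v in got.items()}

In the real system the round-2 repair fires when a round-1 read races a concurrent commit on another partition; here both rounds run against one in-memory map purely to show the metadata flow.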
Conference Paper
This paper presents the design, implementation, and evaluation of TARDiS (Transactional Asynchronously Replicated Divergent Store), a transactional key-value store explicitly designed for weakly-consistent systems. Reasoning about these systems is hard, as neither causal consistency nor per-object eventual convergence allow applications to deal satisfactorily with write-write conflicts. TARDiS instead exposes as its fundamental abstraction the set of conflicting branches that arise in weakly-consistent systems. To this end, TARDiS introduces a new concurrency control mechanism: branch-on-conflict. On the one hand, TARDiS guarantees that storage will appear sequential to any thread of execution that extends a branch, keeping application logic simple. On the other, TARDiS provides applications, when needed, with the tools and context necessary to merge branches atomically, when and how applications want. Since branch-on-conflict in TARDiS is fast, weakly-consistent applications can benefit from adopting this paradigm not only for operations issued by different sites, but also, when appropriate, for conflicting local operations. We find that TARDiS reduces coding complexity for these applications and that judicious branch-on-conflict can improve their local throughput at each site by two to eight times.
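A toy rendition of branch-on-conflict may make the abstraction concrete. This is our own loose sketch of the idea (full snapshots per state, no persistence or distribution), not TARDiS itself:

import itertools

_ids = itertools.count()

class State:
    # One node in the DAG of states; 'parents' has two entries for merges.
    def __init__(self, data, parents):
        self.id = next(_ids)
        self.data = data          # full key-value snapshot, for simplicity
        self.parents = parents

class BranchStore:
    def __init__(self):
        self.heads = [State({}, [])]   # one head per live branch

    def write(self, head, key, value):
        # Extending a current head stays sequential; writing against a state
        # that is no longer a head forks a new branch instead of clobbering
        # a concurrent update (branch-on-conflict).
        child = State({**head.data, key: value}, [head])
        if head in self.heads:
            self.heads[self.heads.index(head)] = child
        else:
            self.heads.append(child)
        return child

    def merge(self, a, b, resolve):
        # The application supplies 'resolve' to reconcile the two snapshots,
        # atomically replacing both branch heads with the merged state.
        merged = State(resolve(a.data, b.data), [a, b])
        self.heads = [h for h in self.heads if h not in (a, b)] + [merged]
        return merged

For example, resolve = lambda x, y: {**x, **y} gives a crude per-key last-writer-wins merge; TARDiS's point is that the application, not the store, gets to pick this policy with full knowledge of both branches.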
Conference Paper
In geo-replicated systems and the cloud, data replication provides fault tolerance and low latency. Causal consistency in such systems is an interesting consistency model. Most existing works assume the data is fully replicated because this greatly simplifies the design of the algorithms to implement causal consistency. Recently, we proposed causal consistency under partial replication because it reduces the number of messages used under a wide range of workloads. One drawback of partial replication is that its meta-data tends to be relatively large when the message size is small. In this paper, we propose approximate causal consistency, whereby we can reduce the meta-data at the cost of some violations of causal consistency. The amount of violations can be made arbitrarily small by controlling a tunable parameter that we call credits.
Conference Paper
Causal consistency is a consistency criterion of practical relevance in geo-replicated settings because it provides well-defined semantics in a scalable manner. In fact, it has been proved that causal consistency is the strongest consistency model that can be enforced in an always-available system. Previous approaches to providing causal consistency, which successfully tackle the problem under full geo-replication, have unveiled the inherent tradeoff between the concurrency that the system allows and the size of the metadata needed to enforce causality. When the metadata is compressed, information about concurrency may be lost, creating false dependencies, i.e., the encoding may suggest a causal relation that does not exist in reality. False dependencies may cause artificial delays when processing requests and decrease the quality of service experienced by the clients. Nevertheless, whether it is possible to design a scalable solution that uses only an almost negligible amount of metadata while still achieving high levels of concurrency under partial geo-replication, an increasingly relevant setting, remains a challenging and interesting open research question. This position paper reports on the ongoing development of Saturn, a metadata service for geo-replicated systems that aims at mitigating the effects of false dependencies while keeping the metadata size small (even for challenging settings such as partial geo-replication).
Chapter
Over the past decade, rapidly growing Internet-based services such as e-mail, blogging, social networking, search and e-commerce have substantially redefined the way consumers communicate, access content, share information and purchase products. Relational database management systems (RDBMS) have been considered the one-size-fits-all solution for data persistence and retrieval for decades. However, the ever-increasing need for scalability and new application requirements have created new challenges for traditional RDBMS. Recently, a new generation of low-cost, high-performance database software, aptly named NoSQL (Not Only SQL), has emerged to challenge the dominance of RDBMS. The main features of these systems include: the ability to scale horizontally, support for weaker consistency models, flexible schemas and data models, and simple low-level query interfaces. In this chapter, we explore recent advancements and the state of the art of Web-scale data management approaches. We discuss the advantages and disadvantages of several recently introduced approaches and their suitability for supporting certain classes of applications and end-users.
Conference Paper
Modern replicated data stores aim to provide high availability, by immediately responding to client requests, often by implementing objects that expose concurrency. Such objects, for example, multi-valued registers (MVRs), do not have sequential specifications. This paper explores a recent model for replicated data stores that can be used to precisely specify causal consistency for such objects, and liveness properties like eventual consistency, without revealing details of the underlying implementation. The model is used to prove the following results: An eventually consistent data store implementing MVRs cannot satisfy a consistency model strictly stronger than observable causal consistency (OCC). OCC is a model somewhat stronger than causal consistency, which captures executions in which client observations can use causality to infer concurrency of operations. This result holds under certain assumptions about the data store. Under the same assumptions, an eventually consistent and causally consistent replicated data store must send messages of unbounded size: if s objects are supported by n replicas, then, for every k > 1, there is an execution in which an Ω(min{n, s} · k)-bit message is sent.
Conference Paper
Linearizability, a widely-accepted correctness property for shared objects, is grounded in classical physics. Its definition assumes a total temporal order over invocation and response events, which is tantamount to assuming the existence of a global clock that determines the time of each event. By contrast, according to Einstein's theory of relativity, there can be no global clock: time itself is relative. For example, given two events A and B, one observer may perceive A occurring before B, another may perceive B occurring before A, and yet another may perceive A and B occurring simultaneously, with respect to local time. Here, we generalize linearizability for relativistic distributed systems using techniques that do not rely on a global clock. Our novel correctness property, called relativistic linearizability, is instead defined in terms of causality. However, in contrast to standard "causal consistency," our interpretation defines relativistic linearizability in a manner that retains the important locality property of linearizability. That is, a collection of shared objects behaves in a relativistically linearizable way if and only if each object individually behaves in a relativistically linearizable way.
Conference Paper
Full-text available
Partial replication is a way to increase the scalability of replicated systems: updates only need to be applied to a subset of the system's sites, thus allowing replicas to handle independent parts of the workload in parallel. In this paper, we propose P-Store, a partially replicated key-value store for wide area networks. In P-Store, each transaction T optimistically executes on one or more sites and is then certified to guarantee serializability of the execution. The certification protocol is genuine: it only involves sites that replicate data items read or written by T, and it incorporates a mechanism to minimize a convoy effect. P-Store makes thrifty use of an atomic multicast service to guarantee correctness: no messages need to be multicast during T's execution, and a single message is multicast to certify T. In case T is global, that is, its execution is distributed at different geographical locations, an extra vote phase is required. Our approach may offer better scalability than previously proposed solutions that either require multiple atomic multicast messages to execute T or are non-genuine. Experimental evaluations reveal that the convoy effect plays an important role even when one percent of the transactions are global. We also compare the scalability of our approach to a fully replicated solution as the proportion of global transactions and the number of sites vary.
Conference Paper
Full-text available
Whether they are modeling bookmarking behavior in Flickr or cascades of failure in large networks, models of diffusion often start with the assumption that a few nodes start long chain reactions, resulting in large-scale cascades. While reasonable under some conditions, this assumption may not hold for social media networks, where user engagement is high and information may enter a system from multiple disconnected sources. Using a dataset of 262,985 Facebook Pages and their associated fans, this paper provides an empirical investigation of diffusion through a large social media network. Although Facebook diffusion chains are often extremely long (chains of up to 82 levels have been observed), they are not usually the result of a single chain-reaction event. Rather, these diffusion chains are typically started by a substantial number of users. Large clusters emerge when hundreds or even thousands of short diffusion chains merge together. This paper presents an analysis of these diffusion chains using zero-inflated negative binomial regressions. We show that after controlling for distribution effects, there is no meaningful evidence that a start node's maximum diffusion chain length can be predicted with the user's demographics or Facebook usage characteristics (including the user's number of Facebook friends). This may provide insight into future research on public opinion formation.
Article
Full-text available
Unless computer-mediated communication systems are structured, users will be overloaded with information. But structure should be imposed by individuals and user groups according to their needs and abilities, rather than through general software features.
Article
Full-text available
A concurrent object is a data object shared by concurrent processes. Linearizability is a correctness condition for concurrent objects that exploits the semantics of abstract data types. It permits a high degree of concurrency, yet it permits programmers to specify and reason about concurrent objects using known techniques from the sequential domain. Linearizability provides the illusion that each operation applied by concurrent processes takes effect instantaneously at some point between its invocation and its response, implying that the meaning of a concurrent object's operations can be given by pre- and post-conditions. This paper defines linearizability, compares it to other correctness conditions, presents and demonstrates a method for proving the correctness of implementations, and shows how to reason about concurrent objects, given they are linearizable.
Conference Paper
Full-text available
We analyze the structure and evolution of discussion cascades in four popular websites: Slashdot, Barrapunto, Meneame and Wikipedia. Despite the big heterogeneities between these sites, a preferential attachment (PA) model with bias to the root can capture the temporal evolution of the observed trees and many of their statistical properties, namely, probability distributions of the branching factors (degrees), subtree sizes and certain correlations. The parameters of the model are learned efficiently using a novel maximum likelihood estimation scheme for PA and provide a figurative interpretation about the communication habits and the resulting discussion cascades on the four different websites.
Article
Full-text available
In a paper to be presented at the 1993 ACM Symposium on Operating Systems Principles, Cheriton and Skeen offer their understanding of causal and total ordering as a communication property. I find their paper highly critical of Isis, and unfairly so, for a number of reasons. In this paper I present some responses to their criticism, and also explain why I find their discussion of causal and total communication ordering to be distorted and incomplete. 1 Background. In a paper to be presented at the 1993 ACM Symposium on Operating Systems Principles, Cheriton and Skeen offer their understanding of causal and total ordering as a communication property. In this paper, I want to respond to their criticisms from the perspective of my work on Isis [Bir93, BJ87a, BJ87b], and the overall communication model that Isis employs. I assume that the reader is familiar with the Cheriton and Skeen paper, and the structure of this response roughly parallels the order of presentation that they use. 1 Isis...
Article
Full-text available
To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. In this paper, we propose lazy replication as a way to preserve consistency by exploiting the semantics of the service's operations to relax the constraints on ordering. Three kinds of operations are supported: operations for which the clients define the required order dynamically during the execution, operations for which the service defines the order, and operations that must be globally ordered with respect to both client ordered and service ordered operations. The method performs well in terms of response time, amount of stored state, number of messages, and availability. It is especially well suited to applications in which most operations require only the client-defined order.
Article
Parallel programs differ from sequential programs primarily in that the temporal relationships between events are only partially defined. However, for a given distributed computation, debugging utilities typically linearize the observed set of events into a total ordering, thus losing information and allowing potentially capturable temporal errors to escape detection. We explore use of the partially ordered relation “happened before” to augment both centralized and distributed parallel debuggers to ensure that such errors are always detected and that the results produced by the debugger are unaffected by the non-determinism inherent in the partial ordering. This greatly reduces the number of tests required during debugging. Assertions are based on time intervals, rather than treating events as dimensionless points.
Article
We examine the limits of consistency in fault-tolerant distributed storage systems. In particular, we identify fundamental tradeoffs among properties of consistency, availability, and convergence, and we close the gap between what is known to be impossible (i.e. CAP) and known systems that are highly available but that provide weaker consistency such as causal. Specifically, in the asynchronous model with omission failures and unreliable networks, we show the following tight bound: no consistency stronger than Real Time Causal Consistency (RTC) can be provided in an always-available, one-way convergent system, and RTC can be provided in an always-available, one-way convergent system. In the asynchronous, Byzantine-failure model, we show that it is impossible to implement many of the recently introduced fork-based consistency semantics without sacrificing either availability or convergence; notably, proposed systems allow Byzantine nodes to permanently partition correct nodes from one another. To address this limitation, we introduce bounded fork-join causal semantics that extend causal consistency to Byzantine environments while retaining availability and convergence.
Conference Paper
Data replication is used in distributed systems to improve availability, increase throughput and eliminate single points of failure. The cost of replication is that significant care and communication is required to maintain consistency among replicas. In some settings, such as distributed directory services, it is acceptable to have transient inconsistencies, in exchange for better performance, as long as a consistent view of the data is eventually established. For such services to be usable, it is important that the consistency guarantees are specified clearly.
Conference Paper
Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
Conference Paper
This paper describes Thread Arcs, a novel interactive visualization technique designed to help people use threads found in email. Thread Arcs combine the chronology of messages with the branching tree structure of a conversational thread in a mixed-model visualization (Venolia and Neustaedter 2003) that is stable and compact. By quickly scanning and interacting with Thread Arcs, people can see various attributes of conversations and find relevant messages in them easily. We tested this technique against other visualization techniques with users' own email in a functional prototype email client. Thread Arcs proved an excellent match for the types of threads found in users' email and for the qualities users wanted in small-scale visualizations. CR Categories: H.5.2 User Interfaces, H.5.3 Group and Organization Interfaces, I.3.6 Methodology and Techniques
Conference Paper
Causally and totally ordered communication support (CATOCS) has been proposed as important to provide as part of the basic building blocks for constructing reliable distributed systems. In this paper, we identify four major limitations to CATOCS, investigate the applicability of CATOCS to several classes of distributed applications in light of these limitations, and the potential impact of these facilities on communication scalability and robustness. From this investigation, we find limited merit and several potential problems in using CATOCS. The fundamental difficulty with the CATOCS is that it attempts to solve problems at the communication level in violation of the well-known "end-to-end" argument.
Conference Paper
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience where operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model, causal consistency with convergent conflict handling (causal+), which is the strongest achieved under these constraints. We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide area. A key contribution of COPS is its scalability: it can enforce causal dependencies between keys stored across an entire cluster, rather than a single server as in previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
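The dependency-checking step the COPS abstract describes is easy to illustrate. The following single-node sketch is written under our own simplifying assumptions (integer versions, one flat pending queue, no sharding) and is not the system's actual code:

# Sketch of causal dependency checking: a replica exposes a replicated write
# only once every write it causally depends on is already visible locally.

store = {}     # key -> value
visible = {}   # key -> highest version visible at this replica
pending = []   # remote writes whose dependencies are not yet satisfied

def deps_satisfied(deps):
    # deps: list of (key, version) pairs the incoming write depends on
    return all(visible.get(k, -1) >= v for k, v in deps)

def apply_remote_write(key, version, value, deps):
    pending.append((key, version, value, deps))
    progress = True
    while progress:                  # drain every write that became ready
        progress = False
        for w in list(pending):
            k, ver, val, d = w
            if deps_satisfied(d):
                store[k] = val
                visible[k] = max(visible.get(k, -1), ver)
                pending.remove(w)
                progress = True

A write of y that depends on (x, 3) thus buffers until version 3 of x has been applied locally; when compressed metadata over-approximates such dependencies, this buffering is exactly the false-dependency delay discussed in the Saturn entry above.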
Conference Paper
Although extensive studies have been conducted on online social networks (OSNs), it is not clear how to characterize information propagation and social influence, two types of important but not well defined social behavior. This paper presents a measurement study of 58M messages collected from 700K users on Twitter.com, a popular social medium. We analyze the propagation patterns of general messages and show how breaking news (Michael Jackson's death) spread through Twitter. Furthermore, we evaluate different social influences by examining their stabilities, assessments, and correlations. This paper addresses the complications as well as challenges we encounter when measuring message propagation and social influence on OSNs. We believe that our results here provide valuable insights for future OSN research.
Conference Paper
In this paper we model discussions in online political blogs. To do this, we extend Latent Dirichlet Allocation (Blei et al., 2003) in various ways to capture different characteristics of the data. Our models jointly describe the generation of the primary documents (posts) as well as the authorship and, optionally, the contents of the blog community's verbal reactions to each post (comments). We evaluate our model on a novel comment prediction task where the models are used to predict which blog users will leave comments on a given post. We also provide a qualitative discussion about what the models discover.
Article
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
Article
The CAP theorem's impact on modern distributed database system design is more limited than is often perceived. Another tradeoff, between consistency and latency, has had a more direct influence on several well-known DDBSs. A proposed new formulation, PACELC, unifies this tradeoff with CAP.
Article
Data replication is used in distributed systems to improve availability, increase throughput and eliminate single points of failure. The cost of replication is that significant care and communication is required to maintain consistency among replicas. In some settings, such as distributed directory services, it is acceptable to have transient inconsistencies, in exchange for better performance, as long as a consistent view of the data is eventually established. For such services to be usable, it is important that the consistency guarantees are specified clearly. We present a new specification for distributed data services that trades off immediate consistency guarantees for improved system availability and efficiency, while ensuring the long-term consistency of the data. An eventually-serializable data service maintains the requested operations in a partial order that gravitates over time towards a total order. It provides clear and unambiguous guarantees about the immediate and long-term behavior of the system. We also present an algorithm, based on the lazy replication strategy of Ladin, Liskov, Shrira, and Ghemawat (1992), that implements this specification. Our algorithm provides the external interface of the eventually-serializable data service specification, and generalizes their algorithm by allowing arbitrary operations and greater flexibility in specifying consistency requirements. In addition to correctness, we prove performance and fault-tolerance properties of this algorithm.
Article
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
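Lamport's construction is short enough to state in code. Here is a minimal sketch of our own, with a process-id tiebreak to obtain the total order the abstract mentions:

class LamportClock:
    # Scalar logical clock: respects happened-before, but unlike a vector
    # clock it cannot detect that two events were concurrent.
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        # Local event, or the stamp attached to an outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Advance past the sender's timestamp, then count the receive.
        self.time = max(self.time, msg_time) + 1
        return self.time

    def total_order_key(self):
        # Breaking ties by process id turns the partial order into the
        # total order used for the synchronization method in the paper.
        return (self.time, self.pid)

Because the clock is a single integer, its metadata cost is constant, which is why scalar timestamps keep reappearing in the causal-consistency systems listed above as a cheap, if lossy, alternative to full vector clocks.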
Conference Paper
Access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. This overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. In this paper we present a large-scale study of weblog comments and their relation to the posts. Using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access.
Article
We propose the first unsupervised approach to the problem of modeling dialogue acts in an open domain. Trained on a corpus of noisy Twitter conversations, our method discovers dialogue acts by clustering raw utterances. Because it accounts for the sequential behaviour of these acts, the learned model can provide insight into the shape of communication in a new medium. We address the challenge of evaluating the emergent model with a qualitative visualization and an intrinsic conversation ordering task. This work is inspired by a corpus of 1.3 million Twitter conversations, which will be made publicly available. This huge amount of data, available only because Twitter blurs the line between chatting and publishing, highlights the need to be able to adapt quickly to a new medium.
Article
The abstraction of a shared memory is of growing importance in distributed computing systems. Traditional memory consistency ensures that all processes agree on a common order of all operations on memory. Unfortunately, providing these guarantees entails access latencies that prevent scaling to large systems. This paper weakens such guarantees by defining causal memory, an abstraction that ensures that processes in a system agree on the relative ordering of operations that are causally related. Because causal memory is weakly consistent, it admits more executions, and hence more concurrency, than either atomic or sequentially consistent memories. This paper provides a formal definition of causal memory and gives an implementation for message-passing systems. In addition, it describes a practical class of programs that, if developed for a strongly consistent memory, run correctly with causal memory.
Article
Although information, news, and opinions continuously circulate in the worldwide social network, the actual mechanics of how any single piece of information spreads on a global scale have been a mystery. Here, we trace such information-spreading processes at a person-by-person level using methods to reconstruct the propagation of massively circulated Internet chain letters. We find that rather than fanning out widely, reaching many people in very few steps according to "small-world" principles, the progress of these chain letters proceeds in a narrow but very deep tree-like pattern, continuing for several hundred steps. This suggests a new and more complex picture for the spread of information through a social network. We describe a probabilistic model based on network clustering and asynchronous response times that produces trees with this characteristic structure on social-network data. Keywords: social networks, algorithms, epidemics, diffusion in networks.
Conference Paper
The need for high availability in distributed services requires that the data managed by the service be replicated. A major challenge in managing replicated data is ensuring consistency among the copies of the data. One way to guarantee consistency is to force operations to take effect in the same order at all sites. This approach, however, is often expensive. A novel method is designed for constructing logically centralized, highly available services to be used in a distributed environment. The method is intended for services that appear to clients to be logically centralized: in spite of the service's distributed implementation, it has the same observable behavior as a single copy. The semantics of the application implemented by the service is taken into account in order to weaken implementation constraints and thus improve response time and increase availability; constraints can be relaxed as long as clients cannot observe the difference. To illustrate how semantics can be used to relax constraints on operation orders, an electronic mail system is considered. The implementation of a distributed service based on partially ordered operations is discussed
Article
When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.
Article
Bayou's anti-entropy protocol for update propagation between weakly consistent storage replicas is based on pair-wise communication, the propagation of write operations, and a set of ordering and closure constraints on the propagation of the writes. The simplicity of the design makes the protocol very flexible, thereby providing support for diverse networking environments and usage scenarios. It accommodates a variety of policies for when and where to propagate updates. It operates over diverse network topologies, including low-bandwidth links. It is incremental. It enables replica convergence, and updates can be propagated using floppy disks and similar transportable media. Moreover, the protocol handles replica creation and retirement in a light-weight manner. Each of these features is enabled by only one or two of the protocol's design choices, and can be independently incorporated in other systems. This paper presents the anti-entropy protocol in detail, describing the design decisions and resulting features.
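The core of the anti-entropy protocol described above fits in a few lines. The following is a rough sketch under our own simplifications (a version vector standing in for the ordering and closure constraints, no accept-stamp reordering, no replica creation or retirement), not Bayou's actual protocol code:

class Replica:
    # Pair-wise anti-entropy: each replica keeps an ordered write log and a
    # version vector summarizing the writes it has already seen.
    def __init__(self, rid):
        self.rid = rid
        self.log = []     # (writer_id, seqno, op), in the order received
        self.seen = {}    # writer_id -> highest seqno received from them

    def local_write(self, op):
        seq = self.seen.get(self.rid, 0) + 1
        self.seen[self.rid] = seq
        self.log.append((self.rid, seq, op))

    def anti_entropy_to(self, other):
        # Send, in log order, exactly the writes the receiver is missing;
        # per-writer prefixes keep the propagation incremental, so a session
        # can resume cheaply over a low-bandwidth link.
        for writer, seq, op in self.log:
            if seq > other.seen.get(writer, 0):
                other.log.append((writer, seq, op))
                other.seen[writer] = seq

a, b = Replica("A"), Replica("B")
a.local_write("w1")
a.anti_entropy_to(b)      # b now holds w1 and its vector records A:1

Because each session only ships a suffix, the same exchange works over any transport, including the floppy-disk scenario the abstract mentions.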
Article
The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect of understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The relative merits and limitations of the different approaches are discussed, and their general feasibility is analyzed. Keywords: Distributed Computation, Causality, Distributed System, Causal Ordering, Logical Time, Vector Time, Global Predicate Detection, Distributed Debugging. 1 Introduction. Today, distributed and parallel systems are generally available, and their technology has reached a certain degree of maturity. Unfortunately, we still lack complete understanding of how to design, realize, and test the software for such system...