Conference Paper

The potential dangers of causal consistency and an explicit solution


Abstract

Causal consistency is the strongest consistency model that is available in the presence of partitions and provides useful semantics for human-facing distributed services. Here, we expose its serious and inherent scalability limitations due to write propagation requirements and traditional dependency tracking mechanisms. As an alternative to classic potential causality, we advocate the use of explicit causality, or application-defined happens-before relations. Explicit causality, a subset of potential causality, tracks only relevant dependencies and reduces several of the potential dangers of causal consistency.


... In the context of full replication, several causally consistent shared memory systems have been designed, including Lazy Replication [22], COPS [24], GentleRain [12], Orbe [11], SwiftCloud [39], Occult [27], and CausalSpartan [35]. Recently, there has also been growing interest in partial replication due to the potential storage efficiencies that can be attained [2,4,6,7,9,16,17,19,25,27]. For full replication, it suffices to use a vector timestamp [8,14,26] of length equal to the number of replicas [22] to achieve causal consistency. ...
... Several researchers have observed that partial replication requires a larger amount of metadata to track causal dependencies [2,9,16,24]. For partial replication, in general, the timestamp (or metadata) overhead is expected to be larger than that for full replication in order to avoid false dependencies, as will be explained below. ...
... This may result in high bandwidth usage. (2) Simulating full replication introduces unnecessary dependencies (which we call false dependencies) among the update messages. For instance, if update u_x on register x depends on update u_y on register y, i.e., u_x can only be applied after u_y is applied, then any replica that receives u_x first will wait for the receipt of u_y, even if register y is virtual and not stored locally. ...
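To make the mechanism sketched in these excerpts concrete, here is a minimal Python sketch of causal delivery under full replication with a vector timestamp of length equal to the number of replicas; all class and method names are illustrative, not taken from any of the cited systems. The buffering step is precisely the wait described above, and under partial replication it can be triggered even by updates to registers that are not stored locally.

```python
class Replica:
    """Causal delivery under full replication: each of n replicas keeps a
    vector timestamp of length n; entry i counts updates applied from
    replica i. Illustrative sketch only."""

    def __init__(self, rid: int, n: int):
        self.rid = rid
        self.vt = [0] * n      # vector timestamp
        self.store = {}        # register -> value
        self.pending = []      # buffered updates with unmet dependencies

    def local_write(self, key, value):
        """Apply a local write and return the update to ship to peers."""
        self.vt[self.rid] += 1
        self.store[key] = value
        return (key, value, list(self.vt), self.rid)

    def _ready(self, deps, origin):
        # Causally ready: this is the next update from its origin, and every
        # other entry of the writer's timestamp is already covered locally.
        return deps[origin] == self.vt[origin] + 1 and all(
            deps[i] <= self.vt[i] for i in range(len(self.vt)) if i != origin)

    def remote_write(self, key, value, deps, origin):
        """Buffer the update, then apply everything that has become ready."""
        self.pending.append((key, value, deps, origin))
        progress = True
        while progress:
            progress = False
            for u in list(self.pending):
                k, v, d, o = u
                if self._ready(d, o):
                    self.pending.remove(u)
                    self.store[k] = v
                    self.vt[o] = d[o]
                    progress = True
```

For example, an update shipped by replica 0 with timestamp [2, 1, 0] stays buffered at another replica until that replica has applied replica 0's first update and replica 1's first update — a false dependency whenever those earlier writes are semantically unrelated.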
Conference Paper
The focus of this paper is on causal consistency in a partially replicated distributed shared memory (DSM) system that provides the abstraction of shared read/write registers. Maintaining causal consistency in distributed shared memory systems has received significant attention in the past, mostly for full replication, wherein each replica stores a copy of all the registers in the shared memory. To ensure causal consistency, all causally preceding updates must be performed before an update is performed at any given replica. Therefore, some mechanism for tracking causal dependencies is required, such as vector timestamps with the number of vector elements being equal to the number of replicas in the context of full replication. In this paper, we investigate causal consistency in partially replicated systems, wherein each replica may store only a subset of the shared registers. Building on the past work, this paper makes three key contributions:
• We present a necessary condition on the metadata (which we refer to as a timestamp) that must be maintained by each replica to be able to track causality accurately. The necessary condition identifies a set of directed edges in a share graph that a replica's timestamp must keep track of.
• We present an algorithm for achieving causal consistency using a timestamp that matches the above necessary condition, thus showing that the condition is necessary and sufficient.
• We define a measure of timestamp space size and present a lower bound (in bits) on the size of the timestamps. The lower bound matches our algorithm in several special cases.
... In practice, the burden of checking and enforcing correctness constraints is left, often ambiguously, to the programmer, frequently resulting in error-prone distributed applications [33]. Hence, a new breed of research works and products has strived to bridge the divide between correctness and scalability, by offering optimized implementations of strong semantic abstractions [192,257] or by exposing tradeoffs matching specific application semantics [34,125]. ...
... Hence, new models have been devised to account for various combinations of fault tolerance concerns and application invariants. Researchers have been striving to formulate the minimum requirements in terms of correctness and, therefore, coordination, to allow for the design of fast yet functional distributed systems [34,30]. Furthermore, an ongoing and exciting research trend has been tackling this issue leveraging different tools and stack layers, spanning from programming languages [16] to data structures [213] and application-level static checkers [219,125]. ...
... Recent work by Bailis et al. [34] promotes the use of explicit application-level causality, which is a subset of potential causality, for building highly available distributed systems that would entail less overhead in terms of coordination and metadata maintenance. Furthermore, an increasing body of research has been drawing attention to causal consistency, considered an optimal tradeoff between user-perceived correctness and coordination overhead, especially in mobile or geo-replicated applications [177,36,252]. ...
Thesis
Engineering distributed systems is an onerous task: the design goals of performance, correctness and reliability are intertwined in complex tradeoffs, which have been outlined by multiple theoretical results. These tradeoffs have become increasingly important as computing and storage have shifted towards distributed architectures. Additionally, the general lack of systematic approaches to tackle distribution in modern programming tools has worsened these issues — especially as nowadays most programmers have to take on the challenges of distribution. As a result, there exists an evident divide between programming abstractions, application requirements and storage semantics, which hinders the work of designers and developers. This thesis presents a set of contributions towards the overarching goal of designing reliable distributed storage systems, by examining these issues through the prism of consistency. We begin by providing a uniform, declarative framework to formally define consistency semantics. We use this framework to describe and compare over fifty non-transactional consistency semantics proposed in previous literature. The declarative and composable nature of this framework allows us to build a partial order of consistency models according to their semantic strength. We show the practical benefits of composability by designing and implementing Hybris, a storage system that leverages different models and semantics to improve over the weak consistency generally offered by public cloud storage platforms. We demonstrate Hybris' efficiency and show that it can tolerate arbitrary faults of cloud stores at the cost of tolerating outages. Finally, we propose a novel technique to verify the consistency guarantees offered by real-world storage systems. This technique leverages our declarative approach to consistency: we consider consistency semantics as invariants over graph representations of storage system executions. A preliminary implementation proves this approach practical and useful in improving over the state-of-the-art on consistency verification.
... This ordering can be either total [15] or partial. Partial orderings can be categorized into causal dependencies [?] or explicitly stated relationships [5]. ...
... Detecting that receiving a message is safe is a complex task. This decision, and collecting the required metadata, can be space- and time-consuming [5,8,13,20]. Moreover, inspecting physical timestamps is not enough, since time does not necessarily move forward uniformly on all nodes in a distributed system. ...
... Generally, for most types of applications, this is not the case. Hence, another optimization would be to explicitly specify relationships between items [5]. In the best case, this can reduce the local space required to O(1). ...
Conference Paper
Full-text available
Causal message delivery, i.e. the requirement that messages are delivered in an order respecting their causal (logical) dependencies, is often mandated in the distributed setting. So far, causal message delivery has been implemented by augmenting messages with metadata that allows the receiver (or the platform) to re-order, and if necessary hold back, messages upon receipt before processing. We propose that causal message delivery can be achieved by construction, simply by organizing the nodes of the distributed application into a tree topology, and without the need for any metadata in the messages. We present our ideas informally through an example application and then develop a formal model and prove that causal message delivery is preserved in tree-based networks.
... For any operations a and b, a → b means that a is a dependency of b. Causal relationships are captured in two ways: via potential or explicit causality [37]. ...
... Under explicit causality, user interfaces can be provided to application developers to define causal relationships between operations [11,27]. Therefore, the number of dependencies tracked under explicit causality is often much smaller than under potential causality, which results in better performance and less metadata [37]. Considering its benefits, we choose to track explicit causality in this work. ...
... Otherwise, the client program will ignore causal relationships between comments unless it captures potential causality. The example also indicates a limitation of employing explicit causality: application programmers must consider how to merge a capturing strategy into their application logic to capture explicit dependencies [37]. For example, the programmers need to define the causal relationship of a comment and its replied comment besides implementing the basic functionality of comment replying. ...
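The explicit-causality interface described in these excerpts can be pictured with a small sketch: the application names each operation's dependencies, so the store tracks only those instead of the full potential-causality history. This is a hypothetical API written for illustration, not the interface of any cited system.

```python
class ExplicitCausalStore:
    """Writes carry only application-declared dependencies; metadata
    scales with what the programmer names, not the session history."""

    def __init__(self):
        self.applied = set()   # ids of operations already visible
        self.store = {}
        self.pending = []      # operations waiting on declared deps

    def put(self, op_id, key, value, deps=()):
        self.pending.append((op_id, key, value, tuple(deps)))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for op in list(self.pending):
                op_id, key, value, deps = op
                if all(d in self.applied for d in deps):
                    self.pending.remove(op)
                    self.store[key] = value
                    self.applied.add(op_id)
                    progress = True

# The comment/reply scenario from the excerpt: the programmer declares
# that the reply depends on the comment it answers, and nothing else.
store = ExplicitCausalStore()
store.put("c1", "post/42/comment/1", "Great post!")
store.put("c2", "post/42/comment/2", "I agree!", deps=("c1",))
```

As the excerpt notes, the burden falls on the programmer: if the reply's dependency on "c1" is not declared, the store is free to make the reply visible before the comment it answers.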
Article
Full-text available
The tradeoff between consistency and availability is inevitable when designing distributed data stores, and today's cloud services often choose high availability instead of strong consistency, leading to visible inconsistencies for clients. Convergent causal consistency is one of the strongest consistency models that remain available during system partitions, and it can also satisfy human perception of causality between events. In this paper, we present CoCaCo, a distributed key-value store that provides convergent causal consistency with asynchronous replication, since it is able to provide cloud services' desired properties, including high performance and availability. Moreover, CoCaCo can efficiently guarantee causal consistency by performing dependency checking only when handling read operations. We implement CoCaCo based on Cassandra and our experimental results indicate that CoCaCo provides performance comparable to eventually consistent Cassandra.
... This is due to the large number of false dependencies inevitably introduced when compressing metadata [21,22] (a false dependency is created when two concurrent operations are serialized as an artifact of the metadata management). The opposite happens with Cure, which exhibits a low (constant) visibility latency penalty but severely penalizes throughput due to the computation and storage overhead associated with metadata management [9,26]. ...
... Furthermore, current solutions are not designed to fully take advantage of partial geo-replication, a setting of practical relevance [19,23]. The culprit is that causal graphs are not easily partitionable, which may force sites to manage not only the metadata associated with the data items stored locally, but also the metadata associated with items stored remotely [9,39,53]. This fact exacerbates the problem of false dependencies, forcing solutions to delay the visibility of remote updates due to updates on data items that are not even replicated locally. ...
... The decoupling between data and metadata management is key in the design of SATURN. First, it relieves the datastore from managing consistency across datacenters, a task that may be highly costly [9,26]. Second, this separation permits SATURN to handle heavier loads independently of the size of the managed data. ...
Conference Paper
This paper presents the design, implementation, and evaluation of Saturn, a metadata service for geo-replicated systems. Saturn can be used in combination with several distributed and replicated data services to ensure that remote operations are made visible in an order that respects causality, a requirement central to many consistency criteria. Saturn addresses two key unsolved problems inherent to previous approaches. First, it eliminates the tradeoff between throughput and data freshness, when deciding what metadata to use for tracking causality. Second, it enables genuine partial replication, a key property to ensure scalability when the number of geo-locations increases. Saturn addresses these challenges while keeping metadata size constant, independently of the number of clients, servers, data partitions, and locations. By decoupling metadata management from data dissemination, and by using clever metadata propagation techniques, it ensures that the throughput and visibility latency of updates on a given item are (mostly) shielded from operations on other items or locations. We evaluate Saturn in Amazon EC2 using realistic benchmarks under both full and partial geo-replication. Results show that weakly consistent datastores can lean on Saturn to upgrade their consistency guarantees to causal consistency with a negligible penalty on performance.
... In the context of full replication, there has been significant effort in designing and implementing causally consistent shared memory systems, such as Lazy Replication [16], COPS [18], Orbe [8], SwiftCloud [32] and GentleRain [9]. While much of the past work on shared memory has addressed full replication, there is growing interest in partial replication [5,19,2,18,13,6], due to the potential storage efficiencies that can be attained with partial replication. Several researchers have observed that partial replication can require a large amount of metadata in order to track causal dependencies accurately under partial replication [2,18,13,6]. ...
... While much of the past work on shared memory has addressed full replication, there is growing interest in partial replication [5,19,2,18,13,6], due to the potential storage efficiencies that can be attained with partial replication. Several researchers have observed that partial replication can require a large amount of metadata in order to track causal dependencies accurately under partial replication [2,18,13,6]. Most relevant to this paper is the work of Helary and Milani [13,23]. ...
... Since e_{jk} ∉ E_{a_q} and Condition (1) for replica a_q does not hold, Condition (2) for a_q must be true. There may be two cases for Condition (2) for a_q to be true. In the first case, there exists a_{q'} with q' > q such that w ∈ X_{b_p a_{q'}}, which contradicts the fact that a_q has the largest subscript. ...
Article
Distributed shared memory systems maintain multiple replicas of the shared memory locations. Maintaining causal consistency in such systems has received significant attention in the past. However, much of the previous literature focuses on full replication wherein each replica stores a copy of all the locations in the shared memory. In this paper, we investigate causal consistency in partially replicated systems, wherein each replica may store only a subset of the shared data. To achieve causal consistency, it is necessary to ensure that, before an update is performed at any given replica, all causally preceding updates must also be performed. Achieving this goal requires some mechanism to track causal dependencies. In the context of full replication, this goal is often achieved using vector timestamps, with the number of vector elements being equal to the number of replicas. Building on the past work, this paper makes three key contributions:
1. We develop lower bounds on the size of the timestamps that must be maintained in order to achieve causal consistency in partially replicated systems. The size of the timestamps is a function of the manner in which the replicas share data, and the set of replicas accessed by each client.
2. We present an algorithm to achieve causal consistency in partially replicated systems using simple vector timestamps.
3. We present some optimizations to improve the overhead of the timestamps required with partial replication.
... In [6], Bailis et al. identify a critical trade-off between staleness ("visibility latency") and write throughput. As throughput utilization across replicas increases, creating longer queues, new data takes longer to arrive. ...
... Ladin et al. [43] proposed one of the first replicated systems to offer causal order. Their approach is quite interesting for us because, as in [6], they allow applications to specify the ordering they want (among "causal", "immediate" and "forced"), and their replica update scheme is lazy because the authors use gossip to optimize the replication mechanisms. While this ensures causality when applications require it, the lazy approach causes stale reads. ...
Article
Full-text available
We consider a setting where applications, such as websites or games, need causal access to objects available in geo-replicated cloud data stores. Common ways of implementing causal consistency involve hiding objects while waiting for their dependencies or waiting for server replicas to synchronize. To minimize delays and retrieve objects faster, applications may try to reach different server replicas at once. This entails a cost because providers charge for each reading request, including reading misses where the causal copy of the object is unavailable. Therefore, latency and cost are conflicting goals, which we control by selecting where to read and when. We formulate this challenge as a multi-criteria optimization problem and propose five non-dominated reading strategies, four of which are Pareto optimal, in a setting constrained to two server replicas. We validate these solutions on the following real cloud storage services: AWS S3, DynamoDB and MongoDB. Savings of as much as 50% on reading costs, with no significant or even a positive impact on latency, demonstrate that both clients and cloud providers could benefit from richer services compatible with these retrieval strategies.
... In [6], Bailis et al. identify a critical trade-off between staleness ("visibility latency") and write throughput. As throughput utilization across replicas increases, creating longer queues, new data takes longer to arrive. ...
... Ladin et al. [40] proposed one of the first replicated systems to offer causal order. Their approach is quite interesting for us because, as in [6], they allow applications to specify the ordering they want (among "causal", "immediate" and "forced"), and their replica update scheme is lazy because the authors use gossip to optimize the replication mechanisms. While this ensures causality when applications require it, the lazy approach causes stale reads. ...
Preprint
Full-text available
We consider a setting where applications, such as websites or games, need causal access to objects available in geo-replicated cloud data stores. Common ways of implementing causal consistency involve hiding objects while waiting for their dependencies or waiting for server replicas to synchronize. To minimize delays and retrieve objects faster, applications may try to reach different server replicas at once. This entails a cost because providers charge for each reading request, including reading misses where the causal copy of the object is unavailable. Therefore, latency and cost are conflicting goals, which we control by selecting where to read and when. We formulate this challenge as a multi-criteria optimization problem and propose five non-dominated reading strategies to solve it, four of which are Pareto optimal. We validate these solutions on the following real cloud storage services: AWS S3, DynamoDB and MongoDB. Savings of as much as 50% on reading costs, with no significant or even a positive impact on latency, demonstrate that both clients and cloud providers could benefit from richer services compatible with these retrieval strategies.
... Various causally-consistent algorithms exist [19,44]. One problematic aspect of these algorithms is that the metadata associated with tracking dependencies can be a burden [5,10,48]. This happens because such algorithms track all potential causal dependencies. ...
... This happens because such algorithms track all potential causal dependencies. In our AT2 algorithm for the message-passing model (Figure 4) we track dependencies explicitly [10], permitting a more efficient implementation with a smaller set of dependencies. More concretely, we specify that each transfer outgoing from an account only depends on previous transfers outgoing from and incoming to that (and only that) account, ignoring the transfers that affect other (irrelevant) accounts. ...
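The per-account dependency rule quoted above can be illustrated with a short sketch; the names are made up for exposition and this is not the AT2 authors' code. A transfer outgoing from an account lists as dependencies only the earlier transfers outgoing from and incoming to that same account.

```python
class Account:
    """Explicit, per-account dependency tracking in the spirit of the
    excerpt: transfers touching other accounts are never dependencies."""

    def __init__(self, name: str, balance: int = 0):
        self.name = name
        self.balance = balance
        self.history = []      # ids of transfers touching this account

def issue_transfer(src: Account, dst: str, amount: int, tid: str):
    """Build a transfer message carrying only src's own history as deps."""
    if amount > src.balance:
        raise ValueError("insufficient funds")
    deps = list(src.history)   # only this account's past transfers
    src.balance -= amount
    src.history.append(tid)
    # Broadcast (tid, src.name, dst, amount, deps); a receiver applies the
    # transfer only after it has applied every transfer listed in deps.
    return (tid, src.name, dst, amount, deps)

def apply_incoming(dst: Account, amount: int, tid: str):
    """Record an incoming transfer; it becomes a dependency of later
    transfers outgoing from dst (since it may fund them)."""
    dst.balance += amount
    dst.history.append(tid)
```

Because deps never mentions transfers on unrelated accounts, two transfers between disjoint pairs of accounts can be applied in any order — exactly the saving over tracking all potential causality.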
Preprint
Many blockchain-based protocols, such as Bitcoin, implement a decentralized asset transfer (or exchange) system. As clearly stated in the original paper by Nakamoto, the crux of this problem lies in prohibiting any participant from engaging in double-spending. There seems to be a common belief that consensus is necessary for solving the double-spending problem. Indeed, whether it is for a permissionless or a permissioned environment, the typical solution uses consensus to build a totally ordered ledger of submitted transfers. In this paper we show that this common belief is false: consensus is not needed to implement a decentralized asset transfer system. We do so by introducing AT2 (Asynchronous Trustworthy Transfers), a class of consensusless algorithms. To show formally that consensus is unnecessary for asset transfers, we consider this problem first in the shared-memory context. We introduce AT2_SM, a wait-free algorithm that asynchronously implements asset transfer in the read-write shared-memory model. In other words, we show that the consensus number of an asset-transfer object is one. In the message-passing model with Byzantine faults, we introduce a generic asynchronous algorithm called AT2_MP and discuss two instantiations of this solution. First, AT2_D ensures deterministic guarantees and consequently targets small-scale deployments (tens to hundreds of nodes), typically in a permissioned environment. Second, AT2_P provides probabilistic guarantees and scales well to a very large system size (tens of thousands of nodes), ensuring logarithmic latency and communication complexity. Instead of consensus, we construct AT2_D and AT2_P on top of a broadcast primitive with causal ordering guarantees offering deterministic and probabilistic properties, respectively.
... The following consistency models (apart from Linearizability) do not consider staleness [34]. In fact, increasing strictness of ordering guarantees often leads to higher staleness values as updates may not be applied directly but are required to fulfill dependencies first (e.g., [3]). ...
... This both adds an overhead and increases staleness as updates cannot become visible right away. Bailis et al. [3] propose to minimize this impact by having the application explicitly define dependencies that need to be considered. A typical implementation uses vector clocks to identify (potential) causal dependencies. ...
Chapter
Full-text available
Due to the advent of eventually consistent storage systems, consistency has become a focus of research. Still, a clear overview of consistency in distributed systems is missing. In this work, we define and describe consistency, show how different consistency models and perspectives are related and briefly discuss how concrete consistency guarantees of a distributed storage system can be measured.
... More advanced protocols like Raft [22] and Paxos [14,16,17] improve these trade-offs in some dimensions compared to our examples, but they are still bound by fundamental limitations and trade-offs that exist in distributed systems. Furthermore, Bailis et al. [2] note that even assuming causal consistency globally can lead to scalability issues, because it leads to causal dependencies that are not required by the application semantics. ...
Preprint
Full-text available
Mixed-consistency programming models assist programmers in designing applications that provide high availability while still ensuring application-specific safety invariants. However, existing models often make specific system assumptions, such as building on a particular database system or having baked-in coordination strategies. This makes it difficult to apply these strategies in diverse settings, ranging from client/server to ad-hoc peer-to-peer networks. This work proposes a new strategy for building programmable coordination mechanisms based on the algebraic replicated data types (ARDTs) approach. ARDTs allow for simple and composable implementations of various protocols, while making minimal assumptions about the network environment. As a case study, two different locking protocols are presented, both implemented as ARDTs. In addition, we elaborate on our ongoing efforts to integrate the approach into the LoRe mixed-consistency programming language.
... Therefore, all processes based on this consistency share a solid view of the operations. This behavior resembles the serial, sequential behavior of a single server [57,11,58,59]. To order operations sequentially and to avoid conflicting operations, the linearizability model is used instead of sequential consistency [60]. ...
Preprint
The replication mechanism resolves some challenges with big data, such as data durability, data access, and fault tolerance. Yet, replication itself gives rise to another challenge, known as consistency in distributed systems. Scalability and availability are the challenging criteria upon which replication in distributed systems is based, and both in turn require consistency. Consistency in distributed computing systems has been employed in three different fields: system architecture, distributed databases, and distributed systems. Consistency models can be sorted, based on their applicability, from strong to weak. Our goal is to propose a novel viewpoint on the different consistency models utilized in distributed systems. This research proposes two different categorizations of consistency models. Initially, consistency models are categorized into three groups: data-centric, client-centric, and hybrid models. Each of these is then grouped into three subcategories of traditional, extended, and novel consistency models. Consequently, the concepts and procedures are expressed in mathematical terms, introduced in order to present our models' behavior without implementation. Moreover, we have surveyed different aspects of challenges with respect to consistency, i.e., availability, scalability, security, fault tolerance, latency, violation, and staleness, of which the latter two, violation and staleness, play the most pivotal roles in terms of consistency and trade-off balancing. Finally, the contribution of each consistency model and the growing need for them in distributed systems are investigated.
... Logically, the system should not allow an event that represents the request for reimbursement to be processed until a successful payment event has been observed. This connection between the two events can be interpreted as an explicit happens-before event causality that implies an order on the two events and, ultimately, an order between the operations responsible for proceeding with a payment and a reimbursement [1,17]. However, because functionalities are distributed across many different microservices and microservices are developed independently from each other, it is easy to overlook such causalities between events. ...
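A minimal sketch of enforcing such an application-defined happens-before edge appears below; the gate and event names are hypothetical, written only to illustrate holding back a reimbursement event until its payment event has been observed.

```python
from collections import defaultdict

class CausalGate:
    """Park an event until every explicitly declared cause has been
    observed; illustrative, not an actual microservice framework API."""

    def __init__(self):
        self.seen = set()
        self.parked = defaultdict(list)   # unmet cause -> waiting events

    def observe(self, event_id, payload, causes=()):
        missing = [c for c in causes if c not in self.seen]
        if missing:
            # Park under one unmet cause; re-examined once it arrives.
            self.parked[missing[0]].append((event_id, payload, tuple(causes)))
            return
        self._process(event_id, payload)

    def _process(self, event_id, payload):
        self.seen.add(event_id)
        print(f"processing {event_id}: {payload}")
        for eid, p, causes in self.parked.pop(event_id, []):
            self.observe(eid, p, causes)   # may still await other causes

gate = CausalGate()
gate.observe("reimburse-7", {"amount": 30}, causes=("payment-7",))  # parked
gate.observe("payment-7", {"amount": 30})   # releases the reimbursement
```

The excerpt's point stands: nothing in an event-driven microservice architecture declares the payment-to-reimbursement edge automatically; a developer must state it explicitly or the reimbursement may be processed first.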
Conference Paper
Full-text available
There is an emerging trend of migrating traditional service-oriented monolithic systems to the microservice architecture. However, this involves the separation of data previously contained in a single database into several databases tailored to specific domains. Developers are thus faced with a new challenge: features such as transaction processing, coordination, and consistency preservation, which were previously supported by the central database, must now be implemented in a decentralized, asynchronously communicating, distributed structure. Numerous prior studies show that these challenges are not met satisfactorily, resulting in inconsistent system states with potentially detrimental consequences. Therefore, we propose to design a coordination service that relies on clear event-based and data-centric formal semantics for microservices, specifying the interaction of cross-microservice transactions with their respective databases. Furthermore, we provide a formalization of consistency properties and outline how they can be used to support dynamic monitoring as well as enforcement of consistency properties, thereby providing robust microservice systems. The envisioned architecture can significantly alleviate the developers' burden of implementing complicated distributed algorithms to maintain consistency across decentralized databases.
... The complex interaction patterns found in microservices force developers to deal with the possible interleaving of event streams. Building on the observation that microservice applications already implement substantial data management logic at the application level, we take advantage of this fact to allow developers to explicitly define the relevant dependencies of events [2]. Through a library abstraction on top of the Kafka Streams API [6], we allow developers to specify dependencies cutting across distinct events, thus enjoying the benefit of not having to track all potential causality among events that have no dependencies on each other. ...
Conference Paper
Full-text available
Microservice architectures are an emerging paradigm for developing event-driven applications. By prescribing that an application is decomposed into small and independent components, each encapsulating its own state and communicating via asynchronous events, new components and events can be easily integrated into the system. However, by pursuing a model where events are generated and processed at the application level, developers have a hard time preventing arbitrary event interleavings from doing harm to application safety. To address these challenges, we start by analyzing event-driven microservice open-source applications to identify unsafe interleavings. Next, we categorize event-based constraints to address such unsafe encodings, providing an easy-to-use guide for microservice developers. Finally, we introduce StreamConstraints, a library built on top of Kafka Streams designed to enforce explicit event-based constraints defined by developers. We showcase StreamConstraints on the case of a popular event-driven microservice system, and demonstrate how it could benefit from event-based constraints to ensure application safety.
... This suggests that depending upon the requirements of the clients of the library, there is a tradeoff between consistency and correctness that can be effectively explored. It has long been known that Causal Consistency incurs a performance penalty [3] due to expensive dependency tracking, significant metadata storage, and long wait times for all causally dependent data to arrive. A number of recent approaches [9,14,28] have looked at improving the performance of Causal Consistency, mainly by reducing the amount of dependent data required. ...
Chapter
Full-text available
Geo-replicated systems provide a number of desirable properties such as globally low latency, high availability, scalability, and built-in fault tolerance. Unfortunately, programming correct applications on top of such systems has proven to be very challenging, in large part because of the weak consistency guarantees they offer. These complexities are exacerbated when we try to adapt existing highly-performant concurrent libraries developed for shared-memory environments to this setting. The use of these libraries, developed with performance and scalability in mind, is highly desirable. But, identifying a suitable notion of correctness to check their validity under a weakly consistent execution model has not been well-studied, in large part because it is problematic to naïvely transplant criteria such as linearizability that has a useful interpretation in a shared-memory context to a distributed one where the cost of imposing a (logical) global ordering on all actions is prohibitive. In this paper, we tackle these issues by proposing appropriate semantics and specifications for highly-concurrent libraries in a weakly-consistent, replicated setting. We use these specifications to develop a static analysis framework that can automatically detect correctness violations of library implementations parameterized with respect to the different consistency policies provided by the underlying system. We use our framework to analyze the behavior of a number of highly non-trivial library implementations of stacks, queues, and exchangers. Our results provide the first demonstration that automated correctness checking of concurrent libraries in a weakly geo-replicated setting is both feasible and practical.
... This suggests that depending upon the requirements of the clients of the library, there is a trade-off between consistency and correctness that can be effectively explored. It has long been known that Causal Consistency incurs a performance penalty [3] due to expensive dependency tracking, significant metadata storage, and long wait times for all causally dependent data to arrive. A number of recent approaches [26,9,13] have looked at improving the performance of Causal Consistency, mainly by reducing the amount of dependent data required. ...
Preprint
Full-text available
Geo-replicated systems provide a number of desirable properties such as globally low latency, high availability, scalability, and built-in fault tolerance. Unfortunately, programming correct applications on top of such systems has proven to be very challenging, in large part because of the weak consistency guarantees they offer. These complexities are exacerbated when we try to adapt existing highly-performant concurrent libraries developed for shared-memory environments to this setting. The use of these libraries, developed with performance and scalability in mind, is highly desirable. But, identifying a suitable notion of correctness to check their validity under a weakly consistent execution model has not been well-studied, in large part because it is problematic to naively transplant criteria such as linearizability that has a useful interpretation in a shared-memory context to a distributed one where the cost of imposing a (logical) global ordering on all actions is prohibitive. In this paper, we tackle these issues by proposing appropriate semantics and specifications for highly-concurrent libraries in a weakly-consistent, replicated setting. We use these specifications to develop a static analysis framework that can automatically detect correctness violations of library implementations parameterized with respect to the different consistency policies provided by the underlying system. We use our framework to analyze the behavior of a number of highly non-trivial library implementations of stacks, queues, and exchangers. Our results provide the first demonstration that automated correctness checking of concurrent libraries in a weakly geo-replicated setting is both feasible and practical.
... Since it is too expensive for most modern applications to track all possible causal dependencies in production systems, Bailis et al. proposed narrowing down the dependencies to the events that matter, passing the responsibility of defining causal events to the application [10]. In this approach, the data store focuses on enforcing causal relationships on the events explicitly requested by the application instead of enforcing them on the global history of events that ever occurred. ...
Conference Paper
MongoDB is a distributed database that supports replication and horizontal partitioning (sharding). MongoDB replica sets consist of a primary that accepts all client writes and then propagates those writes to the secondaries. Each member of the replica set contains the same set of data. For horizontal partitioning, each shard (or partition) is a replica set. This paper discusses the design and rationale behind MongoDB's implementation of a cluster-wide logical clock and causal consistency. The design leveraged ideas from across the research community to ensure that the implementation adds minimal processing overhead, tolerates possible operator errors, and gives protection against non-trusted client attacks. While the goal of the team was not to discover or test new algorithms, the practical implementation necessitated a novel combination of ideas from the research community on causal consistency, security, and minimal performance overhead at scale. This paper describes a large scale, practical implementation of causal consistency using a hybrid logical clock, adding the signing of logical time ranges to the protocol, and introducing performance optimizations necessary for systems at scale. The implementation seeks to define an event as a state change and as such must make forward progress guarantees even during periods of no state changes for a partition of data.
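Since the abstract above centers on a cluster-wide hybrid logical clock, a minimal sketch of the generic hybrid-logical-clock rules may help; this is the textbook algorithm, not MongoDB's actual implementation, and the method names are illustrative.

```python
import time

class HybridLogicalClock:
    """Timestamps are (l, c) pairs: l tracks the highest physical time
    observed, c breaks ties so that happens-before is preserved even
    while nodes' wall clocks drift."""

    def __init__(self):
        self.l = 0   # highest physical time observed so far
        self.c = 0   # logical counter within one value of l

    def _wall(self) -> int:
        return int(time.time() * 1000)   # wall-clock milliseconds

    def tick(self):
        """Advance for a local or send event."""
        pt = self._wall()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def merge(self, remote):
        """Advance on receipt of a remote timestamp."""
        rl, rc = remote
        pt = self._wall()
        if pt > self.l and pt > rl:
            self.l, self.c = pt, 0        # physical time dominates
        elif rl > self.l:
            self.l, self.c = rl, rc + 1   # remote clock is ahead
        elif self.l > rl:
            self.c += 1                   # local logical time is ahead
        else:
            self.c = max(self.c, rc) + 1  # equal l: advance past both counters
        return (self.l, self.c)
```

Comparing timestamps lexicographically as (l, c) pairs orders events consistently with causality while staying close to physical time, which is what allows such a clock to anchor causally consistent sessions at scale.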
... On the other hand, a probabilistic causal broadcast is not suitable for the implementation of the causal aggregation broadcast protocol presented in Chapter 4 of this thesis, because the proposed aggregation mechanism aims at combining as many causally related messages as possible into a single message, thus requiring precise knowledge about the chain of causal dependencies of a received message, not incomplete or partial knowledge. Bailis et al. (2012) discuss some scalability issues associated with traditional mechanisms used to track causal dependencies, in terms of the number of dependencies and the time necessary to check them. In contrast to the traditional concept of causality, where the entire history of preceding messages may affect a new one, they propose the use of explicit or application-defined causality: a subset of the causal history that reflects only causality at the application level. ...
Thesis
The Publish/Subscribe (Pub/Sub) paradigm enables nodes of a distributed system to disseminate information asynchronously. This thesis investigates how to provide a communication-efficient topic-based Pub/Sub system by addressing the problems of traffic overhead and message contention, present in several tree-based solutions. The proposed contributions build distributed spanning trees on top of a hypercube-like topology, such that the source of each message is the root of its own dynamically built spanning tree. Trees rooted at different nodes are organized differently. First, a causal broadcast protocol is proposed which reduces network traffic by aggregating messages without the use of timers. It exploits the causal relation between messages and path intersections between different trees. Unlike existing timer-based approaches, it does not increase delivery latency. The second contribution is a topic-based Pub/Sub system, VCube-PS, which ensures causal delivery order for messages published to the same topic and efficiently supports publication of messages to "hot topics", i.e., topics with high publication rates. Simulation results confirm that the proposed causal aggregation protocol reduces network traffic as well as delivery latencies, since there is less message contention. Compared to an approach that uses a single tree per topic, VCube-PS performs better when there is a high publication rate per topic, since it provides load balancing of publications.
... Therefore, all processes based on this consistency share a solid view of the operations. This behavior resembles the serial, sequential behavior of a single server [57,11,58,59]. To order operations sequentially and to avoid conflicting operations, the linearizability model is used instead of sequential consistency [60]. ...
Preprint
Full-text available
The replication mechanism resolves some challenges with big data, such as data durability, data access, and fault tolerance. Yet, replication itself gives rise to another challenge, known as consistency in distributed systems. Scalability and availability are the challenging criteria upon which replication in distributed systems is based, and both in turn require consistency. Consistency in distributed computing systems has been employed in three different fields: system architecture, distributed databases, and distributed systems. Consistency models can be sorted, based on their applicability, from strong to weak. Our goal is to propose a novel viewpoint on the different consistency models utilized in distributed systems. This research proposes two different categorizations of consistency models. Initially, consistency models are categorized into three groups: data-centric, client-centric, and hybrid models. Each of these is then grouped into three subcategories of traditional, extended, and novel consistency models. Consequently, the concepts and procedures are expressed in mathematical terms, introduced in order to present our models' behavior without implementation. Moreover, we have surveyed different aspects of challenges with respect to consistency, i.e., availability, scalability, security, fault tolerance, latency, violation, and staleness, of which the latter two, violation and staleness, play the most pivotal roles in terms of consistency and trade-off balancing. Finally, the contribution of each consistency model and the growing need for them in distributed systems are investigated.
... They proposed proof techniques to verify the sufficiency of user-specified consistency choices, or require user annotations to identify consistency choices and do not guarantee convergence [Balegas et al. 2015a]. Further, many approaches [Balegas et al. 2015a; Li et al. 2014] are crucially dependent on causal consistency as the weakest possible notion, while others have established the scalability limitations of causal consistency [Bailis et al. 2012a]. We will further survey related works in §9. Given a sequential object with its integrity properties, our goal is to automatically synthesize a correct-by-construction replicated object that guarantees integrity and convergence and avoids unnecessary coordination: synchronization and dependency tracking between operations. ...
Article
Full-text available
Distributed system replication is widely used as a means of fault-tolerance and scalability. However, it provides a spectrum of consistency choices that impose a dilemma for clients between correctness, responsiveness and availability. Given a sequential object and its integrity properties, we automatically synthesize a replicated object that guarantees state integrity and convergence and avoids unnecessary coordination. Our approach is based on a novel sufficient condition for integrity and convergence called well-coordination that requires certain orders between conflicting and dependent operations. We statically analyze the given sequential object to decide its conflicting and dependent methods and use this information to avoid coordination. We present novel coordination protocols that are parametric in terms of the analysis results and provide the well-coordination requirements. We implemented a tool called Hamsaz that can automatically analyze the given object, instantiate the protocols and synthesize replicated objects. We have applied Hamsaz to a suite of use-cases and synthesized replicated objects that are significantly more responsive than the strongly consistent baseline.
... Causality is a general-purpose formalism that specifies the causal history of a particular event in an arbitrary distributed system. However, causality is too general-purpose, as it fails to incorporate any semantics of the underlying distributed system [2]. As a consequence, the causal history of an event is an overapproximation of the cause of the event. ...
Conference Paper
Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event's causal history is an over-approximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input. This enables system developers to reason about the causes of events in real-world distributed systems. We observe that automatically extracting the wat-provenance of a state machine is often infeasible. Fortunately, many distributed systems components have simple interfaces from which a developer can directly specify wat-provenance using a technique we call wat-provenance specifications. Leveraging the theoretical foundations of wat-provenance, we implement a prototype distributed debugging framework called Watermelon.
... Blessing et al. [10] go further by eliminating the use of metadata carried by the messages. They exploit application-defined causal order [6] in order to organize the actors (processes) of an application into a tree topology that guarantees causal order delivery. The path used by the "causing" message must somehow be included in the path of the "caused" ones. ...
Conference Paper
Full-text available
A causal broadcast ensures that messages are delivered to all nodes (processes) preserving the causal relation of the messages. In this paper, we propose a causal broadcast protocol for distributed systems whose nodes are logically organized in a virtual hypercube-like topology called VCube. Messages are broadcast by dynamically building spanning trees rooted in the message's source node. By using multiple trees, the contention bottleneck of a single-root spanning tree approach is avoided. Furthermore, different trees can intersect at some nodes. Hence, by taking advantage of both the out-of-order reception of causally related messages at a node and these path intersections, a node can delay, for one or more of its children in the tree, the forwarding of messages whose causal dependencies it knows the children in question cannot yet satisfy. Such a delay does not induce any overhead. Experimental evaluation conducted on top of the PeerSim simulator confirms the communication effectiveness of our causal broadcast protocol in terms of latency and message traffic reduction.
... Bailis et al. study the overhead of replication and dependency tracking in geo-replicated CC systems [10]. By contrast, we investigate the inherent cost of latency-optimal CC designs, i.e., even in absence of (geo-)replication. ...
Conference Paper
Full-text available
Causal consistency is an attractive consistency model for geo-replicated data stores. It is provably the strongest model that tolerates network partitions. It avoids the long latencies associated with strong consistency, and, especially when using read-only transactions (ROTs), it prevents many of the anomalies of weaker consistency models. Recent work has shown that causal consistency allows "latency-optimal" ROTs, which are nonblocking, single-round and single-version in terms of communication. On the surface, this latency optimality is very appealing, as the vast majority of applications are assumed to have read-dominated workloads. In this paper, we show that such "latency-optimal" ROTs induce an extra overhead on writes that is so high that it actually jeopardizes performance even in read-dominated workloads. We show this result from a practical as well as from a theoretical angle. We present the Contrarian protocol that implements "almost latency-optimal" ROTs, but that does not impose on the writes any of the overheads incurred by latency-optimal protocols. In Contrarian, ROTs are nonblocking and single-version, but they require two rounds of client-server communication. We experimentally show that this protocol not only achieves higher throughput, but, surprisingly, also provides better latencies for all but the lowest loads and the most read-heavy workloads. We furthermore prove that the extra overhead imposed on writes by latency-optimal ROTs is inherent, i.e., it is not an artifact of the design we consider, and cannot be avoided by any implementation of latency-optimal ROTs. We show in particular that this overhead grows linearly with the number of clients.
... Ideally, all distributed systems would treat reading and writing operations on all data identically (strict consistency) [1], but implementing strict consistency is impossible due to delays in transferring and delivering data items. Therefore, various methods have been proposed to maintain data consistency in distributed systems, such as causal consistency [2]-[5]; we previously provided a model for causal consistency in distributed systems using hierarchical coloured Petri nets [6]. In these algorithms, operations executed in distributed systems follow a specific order, as in the sequential consistency model [7], the weak consistency model [8], and FIFO or PRAM consistency [7]. ...
Article
Full-text available
With regard to recent developments and the wide application of distributed systems, keeping data consistent has been considered a serious challenge in these systems. Colored Petri Nets (CPNs) have a high capacity for modeling various algorithms and proving them mathematically, and proving the presented models is of great importance. The importance of keeping consistency in distributed systems at different levels has long been recognized. Therefore, in this research, a hierarchical model for weak consistency along with UTC global time is presented in CPN Tools for the first time. The presented model is proved and implemented using the simulator provided in CPN Tools. In this study, we show how our method is modeled and coded in the ML language for distributed systems so that an acceptable level of weak consistency in distributed systems is obtained.
... Unfortunately, implementing causal consistency is costly due to the computation, communication, and storage overhead caused by metadata management [15,22,12]. A common solution to reduce this cost consists in compressing metadata by serializing sources of concurrency, which unavoidably creates false dependencies among concurrent events, increasing visibility latencies (the time interval between the instant an update is installed in its origin datacenter and the instant it becomes visible in remote datacenters). ...
Article
Full-text available
In this paper we propose a novel approach to manage the throughput vs latency tradeoff that emerges when managing updates in geo-replicated systems. Our approach consists in allowing full concurrency when processing local updates and using a deferred local serialisation procedure before shipping updates to remote datacenters. This strategy makes it possible to implement inexpensive mechanisms to ensure system consistency requirements while avoiding intrusive effects on update operations, a major performance limitation of previous systems. We have implemented our approach as a variant of Riak KV. Our extensive evaluation shows that we outperform sequencer-based approaches by almost an order of magnitude in the maximum achievable throughput. Furthermore, unlike previous sequencer-free solutions, our approach reaches nearly optimal remote update visibility latencies without limiting throughput.
... Compared to the large number of protocols assuming Complete Replication and Propagation (CRP), partial replication has received less attention. Several researchers have addressed challenges in achieving causal consistency under partial replication, mainly because of the large amount of metadata the system needs to keep track of in order to characterize accurate dependencies [2,13,8,4]. Tracking causal dependencies with a minimum amount of metadata is an interesting problem from both a theoretical and a practical perspective. ...
Article
Maintaining causal consistency in distributed shared memory systems using vector timestamps has received a lot of attention from both theoretical and practical perspectives. However, most of the previous literature focuses on full replication, where each data item is stored in all replicas, which may not be scalable due to the increasing amount of data. In this report, we investigate how to achieve causal consistency in partially replicated systems, where each replica may store a different set of data. We propose an algorithm that tracks causal dependencies via vector timestamps in the client-server model for partial replication. The cost of our algorithm in terms of timestamp size varies as a function of the manner in which the replicas share data, and the set of replicas accessed by each client. We also establish a connection between our algorithm and previous work on full replication.
... how causality is used to ensure consistency. In addition to being model-theoretic, his approach, unlike our work, is not based on explicit causality [3]. One particular motivation for confining the universal knowledge of a world of events to microcosms is scalability. ...
Article
Full-text available
Interactions between internet users are mediated by their devices and the common support infrastructure in data centres. Keeping track of causality amongst actions that take place in this distributed system is key to providing a seamless interaction where effects follow causes. Tracking causality in large-scale interactions is difficult due to the cost of keeping large quantities of metadata; it is even more challenging when dealing with resource-limited devices. In this paper, we focus on keeping partial knowledge of causality and address deduction from that knowledge. We provide the first proof-theoretic causality modelling for distributed partial knowledge. We prove computability and consistency results. We also prove that the partial knowledge gives rise to a weaker model than classical causality. We provide rules for offline deduction about causality and refute some related folklore. We define two notions of forward and backward bisimilarity between devices, using which we prove two important results. Namely, no matter the order of addition/removal, two devices deduce similarly about causality so long as: (1) the same causal information is fed to both; (2) they start bisimilar and erase the same causal information. Thanks to our establishment of forward and backward bisimilarity, respectively, proofs of the latter two results work by simple induction on length.
Article
In many scenarios, information must be disseminated over intermittently-connected environments when the network infrastructure becomes unavailable, e.g., during disasters where first responders need to send updates about critical tasks. If such updates pertain to a shared data set, dissemination consistency is important. This can be achieved through causal ordering and consensus. Popular consensus algorithms, e.g., Paxos, are most suited for connected environments. While some work has been done on designing consensus algorithms for intermittently-connected environments, such as the One-Third Rule (OTR) algorithm, there is still need to improve their efficiency and timely completion. We propose CoNICE, a framework to ensure consistent dissemination of updates among users in intermittently-connected, infrastructure-less environments. It achieves efficiency by exploiting hierarchical namespaces for faster convergence, and lower communication overhead. CoNICE provides three levels of consistency to users, namely replication, causality and agreement. It uses epidemic propagation to provide adequate replication ratios, and optimizes and extends Vector Clocks to provide causality. To ensure agreement, CoNICE extends OTR to also support long-term network fragmentation and decision invalidation scenarios; we define local and global consensus pertaining to within and across fragments respectively. We integrate CoNICE’s consistency preservation with a naming schema that follows a topic hierarchy-based dissemination framework, to improve functionality and performance. Using the Heard-Of model formalism, we prove CoNICE’s consensus to be correct. Our technique extends previously established proof methods for consensus in asynchronous environments. Performing city-scale simulation, we demonstrate CoNICE’s scalability in achieving consistency in convergence time, utilization of network resources, and reduced energy consumption.
Article
In this article we study the properties of distributed systems that mix eventual and strong consistency. We formalize such systems through acute cloud types (ACTs), abstractions similar to conflict-free replicated data types (CRDTs), which by default work in a highly available, eventually consistent fashion, but which also feature strongly consistent operations for tasks which require global agreement. Unlike other mixed-consistency solutions, ACTs can rely on efficient quorum-based protocols, such as Paxos. Hence, ACTs gracefully tolerate machine and network failures also for the strongly consistent operations. We formally study ACTs and demonstrate phenomena which are neither present in purely eventually consistent nor strongly consistent systems. In particular, we identify temporary operation reordering, which implies interim disagreement between replicas on the relative order in which the client requests were executed. When not handled carefully, this phenomenon may lead to undesired anomalies, including circular causality. We prove an impossibility result which states that temporary operation reordering is unavoidable in mixed-consistency systems with sufficiently complex semantics. Our result is startling, because it shows that apparent strengthening of the semantics of a system (by introducing strongly consistent operations to an eventually consistent system) results in the weakening of the guarantees on the eventually consistent operations.
Conference Paper
In many scenarios, information must be disseminated over intermittently-connected environments when network infrastructure becomes unavailable. Example scenarios include disasters in which first responders need to send updates about their tasks and provide critical information for search and rescue. If such updates pertain to a shared data set (e.g., pins on a map), their consistent dissemination is important. We can achieve this through causal ordering and consensus. Popular consensus algorithms, such as Paxos and Raft, are best suited for connected environments with reliable links. While some work has been done on designing consensus algorithms for intermittently-connected environments, such as the One-Third Rule (OTR) algorithm, there is a need to improve their efficiency and timely completion. We propose CoNICE, a framework to ensure consistent dissemination of updates among users in intermittently-connected, infrastructure-less environments. It achieves efficiency by exploiting hierarchical namespaces for faster convergence and lower communication overhead. CoNICE provides three levels of consistency to users' views, namely replication, causality and agreement. It uses epidemic propagation to provide adequate replication ratios, and optimizes and extends Vector Clocks to provide causality. To ensure agreement, CoNICE extends basic OTR to support long-term fragmentation and critical decision invalidation scenarios. We integrate the multi-level consistency schema of CoNICE with a naming schema that follows a topic hierarchy-based dissemination framework, to improve functionality and performance. Performing city-scale simulation experiments, we demonstrate that CoNICE is effective in achieving its consistency goals, and is efficient and scalable in the time for convergence and utilized network resources.
Article
Causal consistency has emerged as an attractive middle ground for architecting cloud storage systems, as it allows for high availability and low latency, while supporting stronger-than-eventual-consistency semantics. However, causally-consistent cloud storage systems have seen limited deployment in practice. A key factor is that these systems employ full replication of all the data in all the data centers (DCs), incurring high cost. A simple extension of current causal systems to support partial replication by clustering DCs into rings incurs availability and latency problems. We propose Karma, the first system to enable causal consistency for partitioned data stores while achieving the cost advantages of partial replication without the availability and latency problems of the simple extension. Our evaluation with 64 servers emulating 8 geo-distributed DCs shows that Karma (i) incurs much lower cost than a fully-replicated causal store (obviously due to the lower replication factor); and (ii) offers higher availability and better performance than the above partial-replication extension at similar costs.
Conference Paper
We posit that striving for distributed systems that provide "single system image" semantics is fundamentally flawed and at odds with how systems operate in the physical world. We realize the database as an optimization of this system: a required, essential optimization in practice that facilitates central data placement and ease of access to participants in a system. We motivate a new model of computation that is designed to address the problems of computation over "eventually consistent" information in a large-scale distributed system.
Article
Data replication is commonly used for fault tolerance in reliable distributed systems. In large-scale systems, it additionally provides low latency. Recently, causal consistency in such systems has received much attention. However, existing works assume the data is fully replicated. This greatly simplifies the design of the algorithms to implement causal consistency. In this paper, we propose that it can be advantageous to have partial replication of data, and we propose two algorithms for achieving causal consistency in systems where the data is only partially replicated. This work is the first to explore causal consistency for partially replicated distributed systems. We also give a special-case algorithm for causal consistency in the full-replication case. We give simulation results to show the performance of our algorithms and to present the advantage of partial replication over full replication.
Article
Modern replicated data stores aim to provide high availability, by immediately responding to client requests, often by implementing objects that expose concurrency. Such objects, for example, multi-valued registers (MVRs), do not have sequential specifications. This paper explores a recent model for replicated data stores that can be used to precisely specify causal consistency for such objects, and liveness properties like eventual consistency, without revealing details of the underlying implementation. The model is used to prove the following results: 1) An eventually consistent data store implementing MVRs cannot satisfy a consistency model strictly stronger than observable causal consistency (OCC). OCC is a model somewhat stronger than causal consistency, which captures executions in which client observations can use causality to infer concurrency of operations. This result holds under certain assumptions about the data store. 2) Under the same assumptions, an eventually consistent and causally consistent replicated data store must send messages of size linear in the size of the system: if s objects, each Ω(lg k) bits in size, are supported by n replicas, then there is an execution in which an Ω(min{n, s} · lg k)-bit message is sent.
Article
Databases can provide scalability by partitioning data across several servers. However, multipartition, multioperation transactional access is often expensive, employing coordination-intensive locking, validation, or scheduling mechanisms. Accordingly, many real-world systems avoid mechanisms that provide useful semantics for multipartition operations. This leads to incorrect behavior for a large class of applications including secondary indexing, foreign key enforcement, and materialized view maintenance. In this work, we identify a new isolation model—Read Atomic (RA) isolation—that matches the requirements of these use cases by ensuring atomic visibility: either all or none of each transaction’s updates are observed by other transactions. We present algorithms for Read Atomic Multipartition (RAMP) transactions that enforce atomic visibility while offering excellent scalability, guaranteed commit despite partial failures (via coordination-free execution), and minimized communication between servers (via partition independence). These RAMP transactions correctly mediate atomic visibility of updates and provide readers with snapshot access to database state by using limited multiversioning and by allowing clients to independently resolve nonatomic reads. We demonstrate that, in contrast with existing algorithms, RAMP transactions incur limited overhead—even under high contention—and scale linearly to 100 servers.
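To make the RAMP idea above concrete, here is a loose single-process sketch of the read-repair step that enforces atomic visibility. It is illustrative only: the paper's algorithms shard data across partitions and run a two-round commit, which this toy omits, and every identifier here is invented.

# Toy model of RAMP-style atomic visibility (a sketch, not the paper's code).
# Each version records the writing transaction's timestamp and the sibling
# keys written by that transaction; readers use this metadata to detect and
# repair fractured (non-atomic) reads.

versions = {}   # key -> list of {"ts": int, "value": ..., "siblings": set}

def write_tx(ts, writes):
    # Install all writes of one transaction under a single timestamp.
    sibs = set(writes)
    for k, v in writes.items():
        versions.setdefault(k, []).append(
            {"ts": ts, "value": v, "siblings": sibs})

def latest(key):
    return max(versions.get(key, []), key=lambda v: v["ts"], default=None)

def version_at(key, ts):
    return next((v for v in versions.get(key, []) if v["ts"] == ts), None)

def read_tx(keys):
    # Round 1: read the latest visible version of each key.
    got = {k: latest(k) for k in keys}
    # From the sibling metadata, compute the highest transaction timestamp
    # each key must reflect for the read set to be atomically visible.
    required = dict.fromkeys(keys, 0)
    for v in got.values():
        if v:
            for s in v["siblings"]:
                if s in required:
                    required[s] = max(required[s], v["ts"])
    # Round 2: fetch the specific missing versions (present in RAMP because
    # writers install all siblings before a transaction becomes visible).
    for k in keys:
        if required[k] and (got[k] is None or got[k]["ts"] < required[k]):
            got[k] = version_at(k, required[k])
    return {k: (v["value"] if v else None) for k, v in got.items()}

In the real system the round-2 repair fires when a round-1 read races a concurrent commit on another partition; here both rounds run against one in-memory map purely to show the metadata flow.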
Conference Paper
This paper presents the design, implementation, and evaluation of TARDiS (Transactional Asynchronously Replicated Divergent Store), a transactional key-value store explicitly designed for weakly-consistent systems. Reasoning about these systems is hard, as neither causal consistency nor per-object eventual convergence allow applications to deal satisfactorily with write-write conflicts. TARDiS instead exposes as its fundamental abstraction the set of conflicting branches that arise in weakly-consistent systems. To this end, TARDiS introduces a new concurrency control mechanism: branch-on-conflict. On the one hand, TARDiS guarantees that storage will appear sequential to any thread of execution that extends a branch, keeping application logic simple. On the other, TARDiS provides applications, when needed, with the tools and context necessary to merge branches atomically, when and how applications want. Since branch-on-conflict in TARDiS is fast, weakly-consistent applications can benefit from adopting this paradigm not only for operations issued by different sites, but also, when appropriate, for conflicting local operations. We find that TARDiS reduces coding complexity for these applications and that judicious branch-on-conflict can improve their local throughput at each site by two to eight times.
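A toy rendition of branch-on-conflict may make the abstraction concrete. This is our own loose sketch of the idea (full snapshots per state, no persistence or distribution), not TARDiS itself:

import itertools

_ids = itertools.count()

class State:
    # One node in the DAG of states; 'parents' has two entries for merges.
    def __init__(self, data, parents):
        self.id = next(_ids)
        self.data = data          # full key-value snapshot, for simplicity
        self.parents = parents

class BranchStore:
    def __init__(self):
        self.heads = [State({}, [])]   # one head per live branch

    def write(self, head, key, value):
        # Extending a current head stays sequential; writing against a state
        # that is no longer a head forks a new branch instead of clobbering
        # a concurrent update (branch-on-conflict).
        child = State({**head.data, key: value}, [head])
        if head in self.heads:
            self.heads[self.heads.index(head)] = child
        else:
            self.heads.append(child)
        return child

    def merge(self, a, b, resolve):
        # The application supplies 'resolve' to reconcile the two snapshots,
        # atomically replacing both branch heads with the merged state.
        merged = State(resolve(a.data, b.data), [a, b])
        self.heads = [h for h in self.heads if h not in (a, b)] + [merged]
        return merged

For example, resolve = lambda x, y: {**x, **y} gives a crude per-key last-writer-wins merge; TARDiS's point is that the application, not the store, gets to pick this policy with full knowledge of both branches.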
Conference Paper
In geo-replicated systems and the cloud, data replication provides fault tolerance and low latency. Causal consistency in such systems is an interesting consistency model. Most existing works assume the data is fully replicated because this greatly simplifies the design of the algorithms to implement causal consistency. Recently, we proposed causal consistency under partial replication because it reduces the number of messages used under a wide range of workloads. One drawback of partial replication is that its meta-data tends to be relatively large when the message size is small. In this paper, we propose approximate causal consistency, whereby we can reduce the meta-data at the cost of some violations of causal consistency. The amount of violations can be made arbitrarily small by controlling a tunable parameter that we call credits.
Conference Paper
Causal consistency is a consistency criterion of practical relevance in geo-replicated settings because it provides well-defined semantics in a scalable manner. In fact, it has been proved that causal consistency is the strongest consistency model that can be enforced in an always-available system. Previous approaches to providing causal consistency, which successfully tackle the problem under full geo-replication, have unveiled the inherent tradeoff between the concurrency that the system allows and the size of the metadata needed to enforce causality. When the metadata is compressed, information about concurrency may be lost, creating false dependencies, i.e., the encoding may suggest a causal relation that does not exist in reality. False dependencies may cause artificial delays when processing requests and decrease the quality of service experienced by the clients. Nevertheless, whether it is possible to design a scalable solution that uses only an almost negligible amount of metadata while still achieving high levels of concurrency under partial geo-replication, an increasingly relevant setting, remains a challenging and interesting open research question. This position paper reports on the ongoing development of Saturn, a metadata service for geo-replicated systems that aims at mitigating the effects of false dependencies while keeping the metadata size small (even for challenging settings such as partial geo-replication).
Chapter
Over the past decade, rapidly growing Internet-based services such as e-mail, blogging, social networking, search and e-commerce have substantially redefined the way consumers communicate, access content, share information and purchase products. Relational database management systems (RDBMS) have been considered the one-size-fits-all solution for data persistence and retrieval for decades. However, the ever-increasing need for scalability and new application requirements have created new challenges for traditional RDBMS. Recently, a new generation of low-cost, high-performance database software, aptly named NoSQL (Not Only SQL), has emerged to challenge the dominance of RDBMS. The main features of these systems include: the ability to scale horizontally, support for weaker consistency models, flexible schemas and data models, and simple low-level query interfaces. In this chapter, we explore recent advancements and the state of the art of Web-scale data management approaches. We discuss the advantages and disadvantages of several recently introduced approaches and their suitability for supporting certain classes of applications and end-users.
Conference Paper
Modern replicated data stores aim to provide high availability, by immediately responding to client requests, often by implementing objects that expose concurrency. Such objects, for example, multi-valued registers (MVRs), do not have sequential specifications. This paper explores a recent model for replicated data stores that can be used to precisely specify causal consistency for such objects, and liveness properties like eventual consistency, without revealing details of the underlying implementation. The model is used to prove the following results: An eventually consistent data store implementing MVRs cannot satisfy a consistency model strictly stronger than observable causal consistency (OCC). OCC is a model somewhat stronger than causal consistency, which captures executions in which client observations can use causality to infer concurrency of operations. This result holds under certain assumptions about the data store. Under the same assumptions, an eventually consistent and causally consistent replicated data store must send messages of unbounded size: if s objects are supported by n replicas, then, for every k > 1, there is an execution in which an Ω(min{n, s} · k)-bit message is sent.
Conference Paper
Linearizability, a widely-accepted correctness property for shared objects, is grounded in classical physics. Its definition assumes a total temporal order over invocation and response events, which is tantamount to assuming the existence of a global clock that determines the time of each event. By contrast, according to Einstein's theory of relativity, there can be no global clock: time itself is relative. For example, given two events A and B, one observer may perceive A occurring before B, another may perceive B occurring before A, and yet another may perceive A and B occurring simultaneously, with respect to local time. Here, we generalize linearizability for relativistic distributed systems using techniques that do not rely on a global clock. Our novel correctness property, called relativistic linearizability, is instead defined in terms of causality. However, in contrast to standard "causal consistency," our interpretation defines relativistic linearizability in a manner that retains the important locality property of linearizability. That is, a collection of shared objects behaves in a relativistically linearizable way if and only if each object individually behaves in a relativistically linearizable way.
Conference Paper
Full-text available
Partial replication is a way to increase the scalability of replicated systems: updates only need to be applied to a subset of the system's sites, thus allowing replicas to handle independent parts of the workload in parallel. In this paper, we propose P-Store, a partially replicated key-value store for wide area networks. In P-Store, each transaction T optimistically executes on one or more sites and is then certified to guarantee serializability of the execution. The certification protocol is genuine: it only involves sites that replicate data items read or written by T, and it incorporates a mechanism to minimize a convoy effect. P-Store makes thrifty use of an atomic multicast service to guarantee correctness: no messages need to be multicast during T's execution, and a single message is multicast to certify T. In case T is global, that is, its execution is distributed at different geographical locations, an extra vote phase is required. Our approach may offer better scalability than previously proposed solutions that either require multiple atomic multicast messages to execute T or are non-genuine. Experimental evaluations reveal that the convoy effect plays an important role even when one percent of the transactions are global. We also compare the scalability of our approach to a fully replicated solution as the proportion of global transactions and the number of sites vary.
Conference Paper
Full-text available
Whether they are modeling bookmarking behavior in Flickr or cascades of failure in large networks, models of diffusion often start with the assumption that a few nodes start long chain reactions, resulting in large-scale cascades. While reasonable under some conditions, this assumption may not hold for social media networks, where user engagement is high and information may enter a system from multiple disconnected sources. Using a dataset of 262,985 Facebook Pages and their associated fans, this paper provides an empirical investigation of diffusion through a large social media network. Although Facebook diffusion chains are often extremely long (chains of up to 82 levels have been observed), they are not usually the result of a single chain-reaction event. Rather, these diffusion chains are typically started by a substantial number of users. Large clusters emerge when hundreds or even thousands of short diffusion chains merge together. This paper presents an analysis of these diffusion chains using zero-inflated negative binomial regressions. We show that after controlling for distribution effects, there is no meaningful evidence that a start node's maximum diffusion chain length can be predicted with the user's demographics or Facebook usage characteristics (including the user's number of Facebook friends). This may provide insight into future research on public opinion formation.
Article
Full-text available
Unless computer-mediated communication systems are structured, users will be overloaded with information. But structure should be imposed by individuals and user groups according to their needs and abilities, rather than through general software features.
Article
Full-text available
A concurrent object is a data object shared by concurrent processes. Linearizability is a correctness condition for concurrent objects that exploits the semantics of abstract data types. It permits a high degree of concurrency, yet it permits programmers to specify and reason about concurrent objects using known techniques from the sequential domain. Linearizability provides the illusion that each operation applied by concurrent processes takes effect instantaneously at some point between its invocation and its response, implying that the meaning of a concurrent object's operations can be given by pre- and post-conditions. This paper defines linearizability, compares it to other correctness conditions, presents and demonstrates a method for proving the correctness of implementations, and shows how to reason about concurrent objects, given they are linearizable.
Conference Paper
Full-text available
We analyze the structure and evolution of discussion cascades in four popular websites: Slashdot, Barrapunto, Meneame and Wikipedia. Despite the big heterogeneities between these sites, a preferential attachment (PA) model with bias to the root can capture the temporal evolution of the observed trees and many of their statistical properties, namely, probability distributions of the branching factors (degrees), subtree sizes and certain correlations. The parameters of the model are learned efficiently using a novel maximum likelihood estimation scheme for PA and provide a figurative interpretation about the communication habits and the resulting discussion cascades on the four different websites.
Article
Full-text available
In a paper to be presented at the 1993 ACM Symposium on Operating Systems Principles, Cheriton and Skeen offer their understanding of causal and total ordering as a communication property. I find their paper highly critical of Isis, and unfairly so, for a number of reasons. In this paper I present some responses to their criticism, and also explain why I find their discussion of causal and total communication ordering to be distorted and incomplete. 1 Background. In a paper to be presented at the 1993 ACM Symposium on Operating Systems Principles, Cheriton and Skeen offer their understanding of causal and total ordering as a communication property. In this paper, I want to respond to their criticisms from the perspective of my work on Isis [Bir93, BJ87a, BJ87b], and the overall communication model that Isis employs. I assume that the reader is familiar with the Cheriton and Skeen paper, and the structure of this response roughly parallels the order of presentation that they use. 1 Isis...
Article
Full-text available
To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. In this paper, we propose lazy replication as a way to preserve consistency by exploiting the semantics of the service's operations to relax the constraints on ordering. Three kinds of operations are supported: operations for which the clients define the required order dynamically during the execution, operations for which the service defines the order, and operations that must be globally ordered with respect to both client ordered and service ordered operations. The method performs well in terms of response time, amount of stored state, number of messages, and availability. It is especially well suited to applications in which most operations require only the client-defined order.
Article
Parallel programs differ from sequential programs primarily in that the temporal relationships between events are only partially defined. However, for a given distributed computation, debugging utilities typically linearize the observed set of events into a total ordering, thus losing information and allowing potentially capturable temporal errors to escape detection. We explore use of the partially ordered relation “happened before” to augment both centralized and distributed parallel debuggers to ensure that such errors are always detected and that the results produced by the debugger are unaffected by the non-determinism inherent in the partial ordering. This greatly reduces the number of tests required during debugging. Assertions are based on time intervals, rather than treating events as dimensionless points.
Article
We examine the limits of consistency in fault-tolerant distributed storage systems. In particular, we identify fundamental tradeoffs among properties of consistency, availability, and convergence, and we close the gap between what is known to be impossible (i.e. CAP) and known systems that are highly available but that provide weaker consistency such as causal. Specifically, in the asynchronous model with omission failures and unreliable networks, we show the following tight bound: no consistency stronger than Real Time Causal Consistency (RTC) can be provided in an always-available, one-way convergent system, and RTC can be provided in an always-available, one-way convergent system. In the asynchronous, Byzantine-failure model, we show that it is impossible to implement many of the recently introduced fork-based consistency semantics without sacrificing either availability or convergence; notably, proposed systems allow Byzantine nodes to permanently partition correct nodes from one another. To address this limitation, we introduce bounded fork-join causal semantics that extend causal consistency to Byzantine environments while retaining availability and convergence.
Conference Paper
Data replication is used in distributed systems to improve availability, increase throughput and eliminate single points of failure. The cost of replication is that significant care and communication is required to maintain consistency among replicas. In some settings, such as distributed directory services, it is acceptable to have transient inconsistencies, in exchange for better performance, as long as a consistent view of the data is eventually established. For such services to be usable, it is important that the consistency guarantees are specified clearly.
Conference Paper
Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
Conference Paper
This paper describes Thread Arcs, a novel interactive visualization technique designed to help people use threads found in email. Thread Arcs combine the chronology of messages with the branching tree structure of a conversational thread in a mixed-model visualization (Venolia and Neustaedter 2003) that is stable and compact. By quickly scanning and interacting with Thread Arcs, people can see various attributes of conversations and find relevant messages in them easily. We tested this technique against other visualization techniques with users' own email in a functional prototype email client. Thread Arcs proved an excellent match for the types of threads found in users' email and for the qualities users wanted in small-scale visualizations. CR Categories: H.5.2 User Interfaces, H.5.3 Group and Organization Interfaces, I.3.6 Methodology and Techniques
Conference Paper
Causally and totally ordered communication support (CATOCS) has been proposed as important to provide as part of the basic building blocks for constructing reliable distributed systems. In this paper, we identify four major limitations to CATOCS, investigate the applicability of CATOCS to several classes of distributed applications in light of these limitations, and the potential impact of these facilities on communication scalability and robustness. From this investigation, we find limited merit and several potential problems in using CATOCS. The fundamental difficulty with the CATOCS is that it attempts to solve problems at the communication level in violation of the well-known "end-to-end" argument.
Conference Paper
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience where operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model, causal consistency with convergent conflict handling (causal+), which is the strongest achieved under these constraints. We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide area. A key contribution of COPS is its scalability: it can enforce causal dependencies between keys stored across an entire cluster, rather than a single server as in previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
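The dependency-checking step the COPS abstract describes is easy to illustrate. The following single-node sketch is written under our own simplifying assumptions (integer versions, one flat pending queue, no sharding) and is not the system's actual code:

# Sketch of causal dependency checking: a replica exposes a replicated write
# only once every write it causally depends on is already visible locally.

store = {}     # key -> value
visible = {}   # key -> highest version visible at this replica
pending = []   # remote writes whose dependencies are not yet satisfied

def deps_satisfied(deps):
    # deps: list of (key, version) pairs the incoming write depends on
    return all(visible.get(k, -1) >= v for k, v in deps)

def apply_remote_write(key, version, value, deps):
    pending.append((key, version, value, deps))
    progress = True
    while progress:                  # drain every write that became ready
        progress = False
        for w in list(pending):
            k, ver, val, d = w
            if deps_satisfied(d):
                store[k] = val
                visible[k] = max(visible.get(k, -1), ver)
                pending.remove(w)
                progress = True

A write of y that depends on (x, 3) thus buffers until version 3 of x has been applied locally; when compressed metadata over-approximates such dependencies, this buffering is exactly the false-dependency delay discussed in the Saturn entry above.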
Conference Paper
Although extensive studies have been conducted on online social networks (OSNs), it is not clear how to characterize information propagation and social influence, two types of important but not well defined social behavior. This paper presents a measurement study of 58M messages collected from 700K users on Twitter.com, a popular social medium. We analyze the propagation patterns of general messages and show how breaking news (Michael Jackson's death) spread through Twitter. Furthermore, we evaluate different social influences by examining their stabilities, assessments, and correlations. This paper addresses the complications as well as challenges we encounter when measuring message propagation and social influence on OSNs. We believe that our results here provide valuable insights for future OSN research.
Conference Paper
In this paper we model discussions in online political blogs. To do this, we extend Latent Dirichlet Allocation (Blei et al., 2003) in various ways to capture different characteristics of the data. Our models jointly describe the generation of the primary documents (posts) as well as the authorship and, optionally, the contents of the blog community's verbal reactions to each post (comments). We evaluate our model on a novel comment prediction task where the models are used to predict which blog users will leave comments on a given post. We also provide a qualitative discussion about what the models discover.
Article
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
Article
The CAP theorem's impact on modern distributed database system design is more limited than is often perceived. Another tradeoff, between consistency and latency, has had a more direct influence on several well-known DDBSs. A proposed new formulation, PACELC, unifies this tradeoff with CAP.
Article
Data replication is used in distributed systems to improve availability, increase throughput and eliminate single points of failure. The cost of replication is that significant care and communication is required to maintain consistency among replicas. In some settings, such as distributed directory services, it is acceptable to have transient inconsistencies, in exchange for better performance, as long as a consistent view of the data is eventually established. For such services to be usable, it is important that the consistency guarantees are specified clearly. We present a new specification for distributed data services that trades off immediate consistency guarantees for improved system availability and efficiency, while ensuring the long-term consistency of the data. An eventually-serializable data service maintains the requested operations in a partial order that gravitates over time towards a total order. It provides clear and unambiguous guarantees about the immediate and long-term behavior of the system. We also present an algorithm, based on the lazy replication strategy of Ladin, Liskov, Shrira, and Ghemawat (1992), that implements this specification. Our algorithm provides the external interface of the eventually-serializable data service specification, and generalizes their algorithm by allowing arbitrary operations and greater flexibility in specifying consistency requirements. In addition to correctness, we prove performance and fault-tolerance properties of this algorithm.
Article
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
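Lamport's construction is short enough to state in code. Here is a minimal sketch of our own, with a process-id tiebreak to obtain the total order the abstract mentions:

class LamportClock:
    # Scalar logical clock: respects happened-before, but unlike a vector
    # clock it cannot detect that two events were concurrent.
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        # Local event, or the stamp attached to an outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Advance past the sender's timestamp, then count the receive.
        self.time = max(self.time, msg_time) + 1
        return self.time

    def total_order_key(self):
        # Breaking ties by process id turns the partial order into the
        # total order used for the synchronization method in the paper.
        return (self.time, self.pid)

Because the clock is a single integer, its metadata cost is constant, which is why scalar timestamps keep reappearing in the causal-consistency systems listed above as a cheap, if lossy, alternative to full vector clocks.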
Conference Paper
Access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. This overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. In this paper we present a large-scale study of weblog comments and their relation to the posts. Using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access.
Article
We propose the first unsupervised approach to the problem of modeling dialogue acts in an open domain. Trained on a corpus of noisy Twitter conversations, our method discovers dialogue acts by clustering raw utterances. Because it accounts for the sequential behaviour of these acts, the learned model can provide insight into the shape of communication in a new medium. We address the challenge of evaluating the emergent model with a qualitative visualization and an intrinsic conversation ordering task. This work is inspired by a corpus of 1.3 million Twitter conversations, which will be made publicly available. This huge amount of data, available only because Twitter blurs the line between chatting and publishing, highlights the need to be able to adapt quickly to a new medium.
Article
The abstraction of a shared memory is of growing importance in distributed computing systems. Traditional memory consistency ensures that all processes agree on a common order of all operations on memory. Unfortunately, providing these guarantees entails access latencies that prevent scaling to large systems. This paper weakens such guarantees by defining causal memory, an abstraction that ensures that processes in a system agree on the relative ordering of operations that are causally related. Because causal memory is weakly consistent, it admits more executions, and hence more concurrency, than either atomic or sequentially consistent memories. This paper provides a formal definition of causal memory and gives an implementation for message-passing systems. In addition, it describes a practical class of programs that, if developed for a strongly consistent memory, run correctly with causal memory.
Article
Although information, news, and opinions continuously circulate in the worldwide social network, the actual mechanics of how any single piece of information spreads on a global scale have been a mystery. Here, we trace such information-spreading processes at a person-by-person level using methods to reconstruct the propagation of massively circulated Internet chain letters. We find that rather than fanning out widely, reaching many people in very few steps according to "small-world" principles, the progress of these chain letters proceeds in a narrow but very deep tree-like pattern, continuing for several hundred steps. This suggests a new and more complex picture for the spread of information through a social network. We describe a probabilistic model based on network clustering and asynchronous response times that produces trees with this characteristic structure on social-network data. Keywords: social networks, algorithms, epidemics, diffusion in networks.
Conference Paper
The need for high availability in distributed services requires that the data managed by the service be replicated. A major challenge in managing replicated data is ensuring consistency among the copies of the data. One way to guarantee consistency is to force operations to take effect in the same order at all sites. This approach, however, is often expensive. A novel method is designed for constructing logically centralized, highly available services to be used in a distributed environment. The method is intended for services that appear to clients to be logically centralized: in spite of the service's distributed implementation, it has the same observable behavior as a single copy. The semantics of the application implemented by the service is taken into account in order to weaken implementation constraints and thus improve response time and increase availability; constraints can be relaxed as long as clients cannot observe the difference. To illustrate how semantics can be used to relax constraints on operation orders, an electronic mail system is considered. The implementation of a distributed service based on partially ordered operations is discussed
Article
When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.
Article
Bayou's anti-entropy protocol for update propagation between weakly consistent storage replicas is based on pair-wise communication, the propagation of write operations, and a set of ordering and closure constraints on the propagation of the writes. The simplicity of the design makes the protocol very flexible, thereby providing support for diverse networking environments and usage scenarios. It accommodates a variety of policies for when and where to propagate updates. It operates over diverse network topologies, including low-bandwidth links. It is incremental. It enables replica convergence, and updates can be propagated using floppy disks and similar transportable media. Moreover, the protocol handles replica creation and retirement in a light-weight manner. Each of these features is enabled by only one or two of the protocol's design choices, and can be independently incorporated in other systems. This paper presents the anti-entropy protocol in detail, describing the design decisions and resulting features.
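The core of the anti-entropy protocol described above fits in a few lines. The following is a rough sketch under our own simplifications (a version vector standing in for the ordering and closure constraints, no accept-stamp reordering, no replica creation or retirement), not Bayou's actual protocol code:

class Replica:
    # Pair-wise anti-entropy: each replica keeps an ordered write log and a
    # version vector summarizing the writes it has already seen.
    def __init__(self, rid):
        self.rid = rid
        self.log = []     # (writer_id, seqno, op), in the order received
        self.seen = {}    # writer_id -> highest seqno received from them

    def local_write(self, op):
        seq = self.seen.get(self.rid, 0) + 1
        self.seen[self.rid] = seq
        self.log.append((self.rid, seq, op))

    def anti_entropy_to(self, other):
        # Send, in log order, exactly the writes the receiver is missing;
        # per-writer prefixes keep the propagation incremental, so a session
        # can resume cheaply over a low-bandwidth link.
        for writer, seq, op in self.log:
            if seq > other.seen.get(writer, 0):
                other.log.append((writer, seq, op))
                other.seen[writer] = seq

a, b = Replica("A"), Replica("B")
a.local_write("w1")
a.anti_entropy_to(b)      # b now holds w1 and its vector records A:1

Because each session only ships a suffix, the same exchange works over any transport, including the floppy-disk scenario the abstract mentions.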
Article
The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect of understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The relative merits and limitations of the different approaches are discussed, and their general feasibility is analyzed. Keywords: Distributed Computation, Causality, Distributed System, Causal Ordering, Logical Time, Vector Time, Global Predicate Detection, Distributed Debugging. 1 Introduction. Today, distributed and parallel systems are generally available, and their technology has reached a certain degree of maturity. Unfortunately, we still lack complete understanding of how to design, realize, and test the software for such system...