Conference Paper

Bolt-on causal consistency


Abstract

We consider the problem of separating consistency-related safety properties from availability and durability in distributed data stores via the application of a "bolt-on" shim layer that upgrades the safety of an underlying general-purpose data store. This shim provides the same consistency guarantees atop a wide range of widely deployed but often inflexible stores. As causal consistency is one of the strongest consistency models that remain available during system partitions, we develop a shim layer that upgrades eventually consistent stores to provide convergent causal consistency. Accordingly, we leverage widely deployed eventually consistent infrastructure as a common substrate for providing causal guarantees. We describe algorithms and shim implementations that are suitable for a large class of application-level causality relationships and evaluate our techniques using an existing, production-ready data store and with real-world explicit causality relationships.
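Since the abstract only sketches the architecture, the following is a minimal illustrative sketch (in Python, with hypothetical names such as EventualStore and BoltOnShim; not the paper's actual algorithm or code) of the core idea: writes carry explicit dependency metadata, and a client-side shim reveals a value only once its dependencies are already visible locally.

```python
# Minimal illustrative sketch of a "bolt-on" shim over an eventually
# consistent key-value store. Hypothetical names; not the paper's code.

class EventualStore:
    """Stand-in for an eventually consistent store (e.g., a Cassandra-like API)."""
    def __init__(self):
        self._data = {}  # key -> (version, value, deps)

    def put(self, key, version, value, deps):
        current = self._data.get(key)
        if current is None or version > current[0]:  # last-writer-wins convergence
            self._data[key] = (version, value, deps)

    def get(self, key):
        return self._data.get(key)


class BoltOnShim:
    """Client-side shim: tracks explicit dependencies and only exposes
    values whose dependencies are covered by the local (shim) store."""
    def __init__(self, backing):
        self.backing = backing
        self.local = {}    # causally safe local store: key -> (version, value)
        self.context = {}  # this client's causal context: key -> version

    def write(self, key, value):
        version = self.context.get(key, 0) + 1
        deps = dict(self.context)  # everything read or written so far
        self.backing.put(key, version, value, deps)
        self.local[key] = (version, value)
        self.context[key] = version

    def read(self, key):
        item = self.backing.get(key)
        if item is None:
            return None
        version, value, deps = item
        # Reveal the new version only if every dependency is already visible
        # locally; otherwise fall back to the last causally safe value.
        if all(self.local.get(k, (0,))[0] >= v for k, v in deps.items()):
            self.local[key] = (version, value)
            self.context[key] = max(self.context.get(key, 0), version)
        return self.local.get(key, (None, None))[1]


if __name__ == "__main__":
    store = EventualStore()
    alice, bob = BoltOnShim(store), BoltOnShim(store)
    alice.write("post", "vacation photos")
    alice.write("comment", "great trip!")  # causally depends on "post"
    print(bob.read("comment"))             # None: hidden until "post" is visible
    print(bob.read("post"))                # "vacation photos"
    print(bob.read("comment"))             # "great trip!"
```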


... Causal Consistency (CC) [3] is weaker than SC, but stronger than EC, and has been proven to be the strongest type of consistency that can be achieved in a fault-tolerant, distributed system [31]. Informally, CC implies that readers cannot find a version of a data element before all the operations that led to that version are visible [5]. Whilst CC is sufficiently strong, and sufficiently performant, for most enterprise applications [39], we believe its adoption in the industry is attenuated by a number of issues, including: ...
... Conflict Handling. For a DDB to support high availability with low latency, WRITEs need to be accepted at any site without requiring coordination and consensus with other sites, at least in the critical path [5]. A state of conflict breaks data consistency between different sites. ...
... Each version is mapped to a value and a set of dependencies, encoded as ⟨key, version⟩ pairs, thus supporting READ and WRITE operations in a transactional context. Bolt-On [5] describes a custom middleware on top of Cassandra, a commercially available EC DB with a columnar data model which handles replication. Bolt-On implements explicit causality, offloading dependency tracking to the client. ...
Chapter
Data Consistency defines the validity of a data set according to some set of rules, and different levels of data consistency have been proposed. Causal consistency is the strongest type of consistency that can be achieved when data is stored in multiple locations, and fault tolerance is desired. D-Thespis is a distributed middleware that innovatively leverages the Actor model to implement causal consistency over an industry-standard relational database, whilst abstracting complexities for application developers behind a REST open-protocol interface. We propose the concept of elastic horizontal scalability, and propose systematic designs, algorithms and a correctness evaluation for a middleware that can be scaled to the needs of varying workloads whilst achieving causal consistency in a performant manner.
... A different proposal consists of using a shim layer atop standard cheap storage to control what the client can see. In Bolt-on [7], objects with missing causal dependencies are not made available to the client, i.e., they remain invisible until all the dependencies are also available. A common pattern we observe here is that causal system implementations tend to require some form of background synchronization before objects become available for applications. ...
... Non-blocking operation at the cost of staleness is also the option in Bolt-on [7], which uses local storage (shim) and the notion of causal cut to keep causally consistent data always available to the client. The causal cut [7] essentially keeps file dependencies ready for delivery. ...
... Non-blocking operation at the cost of staleness is also the option in Bolt-on [7], which uses local storage (shim) and the notion of causal cut to keep causally consistent data always available to the client. The causal cut [7] essentially keeps file dependencies ready for delivery. Bolt-on does not deliver a file to the application before having all the necessary files in the cut, or otherwise, the application could block while looking for a previous dependency file. ...
Article
Full-text available
We consider a setting where applications, such as websites or games, need causal access to objects available in geo-replicated cloud data stores. Common ways of implementing causal consistency involve hiding objects while waiting for their dependencies or waiting for server replicas to synchronize. To minimize delays and retrieve objects faster, applications may try to reach different server replicas at once. This entails a cost because providers charge for each reading request, including reading misses where the causal copy of the object is unavailable. Therefore, latency and cost are conflicting goals, which we control by selecting where to read and when. We formulate this challenge as a multi-criteria optimization problem and propose five non-dominated reading strategies, four of which are Pareto optimal, in a setting constrained to two server replicas. We validate these solutions on the following real cloud storage services: AWS S3, DynamoDB and MongoDB. Savings of as much as 50% on reading costs, with no significant or even a positive impact on latency, demonstrate that both clients and cloud providers could benefit from richer services compatible with these retrieval strategies.
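As a rough illustration of the cost/latency trade-off described in the abstract above, the sketch below shows one plausible retrieval strategy: poll the cheaper replica up to a deadline, then pay for reads at additional replicas. The function name and structure are assumptions for illustration; this is not one of the paper's five strategies.

```python
# Toy sketch of a cost/latency-aware read strategy in the spirit of the
# article above: try one replica first, and only pay for further reads
# if the causally required version has not shown up within a deadline.
# Purely illustrative; not one of the paper's actual strategies.
import time

def read_with_fallback(replicas, key, min_version, deadline_s=0.05, poll_s=0.01):
    """replicas: objects with .get(key) -> (version, value) or None, ordered
    from cheapest to most expensive. Returns (value, requests_paid)."""
    requests = 0
    start = time.monotonic()
    # Phase 1: poll the cheapest replica until the deadline expires.
    while time.monotonic() - start < deadline_s:
        requests += 1
        item = replicas[0].get(key)
        if item is not None and item[0] >= min_version:
            return item[1], requests
        time.sleep(poll_s)
    # Phase 2: fan out to the remaining replicas and take the first causal hit.
    for replica in replicas[1:]:
        requests += 1
        item = replica.get(key)
        if item is not None and item[0] >= min_version:
            return item[1], requests
    return None, requests  # causal copy unavailable everywhere (a read miss)
```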
... Causal Consistency (CC) [7] is weaker than SC, but stronger than EC, and has been proven to be the strongest type of consistency that can be achieved in a fault-tolerant, distributed system [8]. Informally, CC implies that readers cannot find a version of a data element before all the operations that led to that version are visible [9]. ...
... Conflict Handling. For a DDB to support high availability with low latency, WRITEs need to be accepted at any node without requiring coordination with other nodes, at least in the critical path [9]. A state of conflict breaks consistency between different nodes. ...
... Bolt-On [9] describes a custom middleware on top of Cassandra, a commercially-available EC DB with a columnar data model which handles replication. Bolt-On implements explicit causality, offloading dependency tracking to the client. ...
... Causal Consistency. Causal consistency [7] lies between these two endpoints and is an attractive model for building georeplicated data stores because it hits a sweet spot in this tradeoff between ease of programming and performance [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. On the one hand, it avoids the long latencies and the inability to tolerate network partitions of strong consistency. ...
... Since c_A has not established any dependencies yet, the snapshot is determined by the latest items received by p_x or, in their absence, the latest received heartbeat messages. In this example, the snapshot is encoded with the dependency vector [12, 10], which indicates that the snapshot includes all items that originated in DC_1 with timestamp less than or equal to 12 and all items that originated in DC_2 with timestamp less than or equal to 10. Next, p_x forwards the read request for y, together with the transaction snapshot, to p_y and returns X to c_A, as X is the latest received version of x that belongs to the snapshot. When p_y receives the read request, it has to wait until the latest item received from DC_2, which currently has the value 6, surpasses the snapshot value 10. ...
... Our work is primarily related to the vast literature on causally consistent systems, which includes COPS [8], Eiger [9], Bolt-on causal consistency [12], ChainReaction [11], Orbe [10], GentleRain [13], SwiftCloud [14], Saturn [49], Contrarian [15], Wren [16], CausalSpartan [50], COPS-SNOW [17], Cure [18] and PaRiS [51]. These systems differ in the mechanism they employ to achieve causal consistency. ...
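The waiting rule in the snapshot-read excerpt above can be made concrete with a small sketch: a partition answers a snapshot read only after the newest timestamps it has received from every data center cover the requested dependency vector. Class and field names are assumptions, not the cited system's code.

```python
# Illustrative sketch of the snapshot-vector wait described above: a
# partition answers a read only once the latest timestamps it has received
# from every data center cover the transaction snapshot. Hypothetical
# structure; not the cited system's actual implementation.
import threading

class Partition:
    def __init__(self, num_dcs):
        self.versions = {}           # key -> list of (origin_dc, ts, value)
        self.latest = [0] * num_dcs  # newest timestamp seen per data center
        self._cv = threading.Condition()

    def apply_remote(self, key, origin_dc, ts, value):
        with self._cv:
            self.versions.setdefault(key, []).append((origin_dc, ts, value))
            self.latest[origin_dc] = max(self.latest[origin_dc], ts)
            self._cv.notify_all()

    def heartbeat(self, origin_dc, ts):
        with self._cv:
            self.latest[origin_dc] = max(self.latest[origin_dc], ts)
            self._cv.notify_all()

    def snapshot_read(self, key, snapshot):
        """snapshot: dependency vector, e.g. [12, 10]. Blocks until the
        partition has received everything the snapshot may include."""
        with self._cv:
            self._cv.wait_for(lambda: all(l >= s for l, s in zip(self.latest, snapshot)))
            # Return the freshest version that belongs to the snapshot.
            candidates = [(ts, v) for dc, ts, v in self.versions.get(key, [])
                          if ts <= snapshot[dc]]
            return max(candidates, key=lambda t: t[0])[1] if candidates else None
```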
Article
Causal consistency (CC) is an attractive consistency model for geo-replicated data stores because it hits a sweet spot in the ease-of-programming versus performance trade-off. We present a new approach for implementing CC in geo-replicated data stores, which we call Optimistic Causal Consistency (OCC). OCC's main design goal is to maximize data freshness. The optimism in our approach lies in the fact that the updates replicated to a remote data center are made visible immediately, without checking if their causal dependencies have been received. Servers perform the dependency check needed to enforce CC only upon serving a client operation, rather than on receipt of a replicated data item as in existing systems. OCC offers a significant gain in data freshness, which is of crucial importance for various types of applications, such as real-time systems. OCC's potentially blocking behavior makes it vulnerable to network partitions. We therefore propose a recovery mechanism that allows an OCC system to fall back on a pessimistic protocol to continue operating during network partitions. We implement POCC, the first causally consistent geo-replicated multi-master key-value data store designed to maximize data freshness. We show that POCC improves data freshness, while offering comparable or better performance than its pessimistic counterparts.
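A minimal sketch of the optimistic idea, under the assumption of explicit per-item dependency metadata (names are illustrative, not POCC's implementation): replicated updates are installed immediately, and the dependency check is deferred to the moment a client read is served, which is where the potentially blocking behavior comes from.

```python
# Minimal sketch of the "optimistic" idea described above: replicated
# updates are installed immediately, and the causal-dependency check is
# deferred to the moment a client operation is served. Illustrative only.

class OptimisticReplica:
    def __init__(self):
        self.store = {}      # key -> (version, value, deps); freshest version wins
        self.installed = {}  # key -> highest version present locally

    def apply_remote(self, key, version, value, deps):
        # No dependency check here: the update becomes a candidate right away.
        if version > self.installed.get(key, 0):
            self.store[key] = (version, value, deps)
            self.installed[key] = version

    def deps_satisfied(self, deps):
        return all(self.installed.get(k, 0) >= v for k, v in deps.items())

    def read(self, key):
        """Check dependencies only now; a missing dependency means the caller
        must wait (or the system falls back to a pessimistic protocol)."""
        item = self.store.get(key)
        if item is None:
            return None
        version, value, deps = item
        if not self.deps_satisfied(deps):
            raise RuntimeError("dependencies not yet received; operation would block")
        return value
```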
... In response to the trade-offs implied by the CAP Theorem, weak consistency models were proposed, such as causal consistency [46], which is one of the most widely implemented weak models for distributed systems. Several implementations of different variants of causal consistency (such as causal convergence [50] and causal memory [10]) have been developed, e.g., [14,30,31,45,48,55,57]. ...
... [Figure 14: the general schema of the SC checking procedure using wSC] ... for the rest of the executions, it computes on average 99.97% of their kernel. As for CCM, we found that it computes the SC-kernel in only 0.7% of the same set of executions. We also found that wSC saturation orders on average 98.51% of the pairs of writes of an execution, and that CCM orders on average 97.89% of the pairs of writes. ...
Thesis
Nowadays, we are all end users of distributed systems. A distributed system is a collection of computers that improves performance by sharing resources. Indeed, with the massive growth of the internet, these systems have become necessary. Unfortunately, due to parallelism and communication latency over large networks, distributed systems may produce unexpected (inconsistent) behaviors if they are not correctly designed and implemented. For instance, a flight seat can be assigned to two users of a flight booking system at the same time. This thesis addresses the problem of verifying that an implementation of a concurrent/distributed system provides to its clients the expected consistency guarantees (i.e., strong, weak or eventual consistency). In particular, we consider the problem of testing concurrent/distributed systems to determine if they offer the consistency level expected by their users. For a given computation of a concurrent/distributed system, the test confirms the consistency or inconsistency of the system during that computation. We propose dynamic verification approaches with respect to some well-known consistency models, i.e., executing a large number of test programs and verifying them against a given consistency model. The main consistency criterion that we consider in this thesis is a fundamental model called sequential consistency. The verification problem for this model is known to be NP-hard. The reason is that, in order to prove that a computation conforms to this consistency model, one needs to find a total order on write operations that explains the execution; in the worst case, one therefore needs to enumerate all the possible total orders. We first focus on verifying conformance to consistency models that are checkable in polynomial time using saturation-based techniques. We consider causal consistency in its different variants. Then, we build on this work to propose an approach for verifying sequential consistency using a strong causal consistency variant. This approach is improved by proposing another, weaker model based on more natural and simpler saturation rules. These approaches make it possible to avoid systematically falling into the worst case, i.e., explicitly enumerating the exponential number of possible total orders between the computation's writes. These two approaches are then generalized to cover another consistency model, a relaxation of sequential consistency called total store ordering. The problem of verifying this model is also known to be NP-hard. The proposed generalizations use suitable models for approximating the total store ordering model. We implement all these approaches and perform benchmarks on real-life applications.
... Weaker consistency conditions overcome the limits and costs of linearizability by striking a balance between agreement, speed, and dynamicity within a system. Such conditions include PRAM consistency, causal consistency [37]- [39], and eventual consistency [12]. PRAM and causal consistency expect the local histories observed by each process to be plausible, regardless of the other processes, but they do not impose state convergence. ...
... Replacing Eqs. (35) to (37) in Eq. (34), we obtain $S^{r+1}_{0,0} = S^{r}_{0,0}\,(1-\beta_S)^{P^{r-1}_{0,0}+P^{r-1}_{1,1}+S^{r-1}_{0,0}-P^{r}_{0,0}-S^{r}_{0,0}-P^{r}_{1,1}} = S^{r}_{0,0}\,\bigl(1-\tfrac{f}{|S|}\bigr)^{P^{r-1}_{0,0}+P^{r-1}_{1,1}+S^{r-1}_{0,0}-P^{r}_{0,0}-S^{r}_{0,0}-P^{r}_{1,1}} = c\bigl(P^{r}_{0,0}, P^{r-1}_{0,0}, P^{r}_{1,1}, P^{r-1}_{1,1}, S^{r}_{0,0}, S^{r-1}_{\ldots}\bigr)$ ...
Article
Full-text available
Eventual consistency is a consistency model that favors liveness over safety. It is often used in large-scale distributed systems where models ensuring stronger safety incur performance that is too low to be deemed practical. Eventual consistency tends to be uniformly applied within a system, but we argue a demand exists for differentiated eventual consistency, e.g. in blockchain systems. We propose UPS to address this demand. UPS is a novel consistency mechanism that works in tandem with our novel two-phase epidemic broadcast protocol GPS to offer differentiated eventual consistency and delivery speed. We propose two complementary analyses of the broadcast protocol: a continuous analysis and a discrete analysis based on compartmental models used in epidemiology. Additionally, we propose the formal definition of a scalable consistency metric to measure the consistency trade-off at runtime. We evaluate UPS in two simulated worldwide settings: a one-million-node network and a network emulating that of the Ethereum blockchain. In both settings, UPS reduces inconsistencies experienced by a majority of the nodes and reduces the average message latency for the remaining nodes.
... Consider for example a shared online bulletin board, like a Facebook Timeline. Consider the following events, adapted from Bailis et al. [4]: The immediate causal relations are depicted by arrows in Fig. 1. The complete causal relation is the transitive closure of the depicted causal relations. ...
... Whether it physically occurred before the post in Singapore is irrelevant. Given such timestamped events, the behavior of the distributed system can then be defined by the numerical order of those timestamps. Once the timestamps are assigned, their numerical order is a semantic property, and the goal of a CET system design is to ensure that every component sees events in timestamp order. ...
Preprint
Full-text available
In distributed applications, Brewer's CAP theorem tells us that when networks become partitioned, there is a tradeoff between consistency and availability. Consistency is agreement on the values of shared variables across a system, and availability is the ability to respond to reads and writes accessing those shared variables. We quantify these concepts, giving numerical values to inconsistency and unavailability. Recognizing that network partitioning is not an all-or-nothing proposition, we replace the P in CAP with L, a numerical measure of apparent latency, and derive the CAL theorem, an algebraic relation between inconsistency, unavailability, and apparent latency. This relation shows that if latency becomes unbounded (e.g., the network becomes partitioned), then one of inconsistency and unavailability must also become unbounded, and hence the CAP theorem is a special case of the CAL theorem. We describe two distributed coordination mechanisms, which we have implemented as an extension of the Lingua Franca coordination language, that support arbitrary tradeoffs between consistency and availability as apparent latency varies. With centralized coordination, inconsistency remains bounded by a chosen numerical value at the cost that unavailability becomes unbounded under network partitioning. With decentralized coordination, unavailability remains bounded by a chosen numerical quantity at the cost that inconsistency becomes unbounded under network partitioning. Our centralized coordination mechanism is an extension of techniques that have historically been used for distributed simulation, an application where consistency is paramount. Our decentralized coordination mechanism is an extension of techniques that have been used in distributed databases when availability is paramount.
... Several efforts have been made to formalize causal consistency [16], [25], [39] [40], [7], [15], [10], [8], [38] and there are many implementations [9], [20], [21] satisfying this criterion as opposed to strong consistency (linearizability). ...
Chapter
Full-text available
We present a framework for efficient stateless model checking (SMC) of concurrent programs under three prominent models of causal consistency: CCv, CM, and CC. Our approach is based on exploring traces under the program-order and reads-from relations. Our SMC algorithm is provably optimal in the sense that it explores each combination of program-order and reads-from relations exactly once. We have implemented our framework in a tool called Conschecker. Experiments show that Conschecker performs well in detecting anomalies in classical distributed database benchmarks.
... So security is the biggest problem when storing and maintaining data in a cloud database [15], [16]. In this paper, a multi-level security architecture is proposed to overcome this problem and to enhance the consistency of data transactions [25], [26]. The consistency and security levels in cloud transactions are ensured by the proposed architecture. ...
Article
Full-text available
The cloud is a major source of data storage and computing power that does not require direct management by the user. Cloud computing is a popular option for IT industries, enterprises, and government sectors because it provides everything as a service based on user demand. Cloud computing is a good environment for handling the large amounts of data produced by social networks, health industries, transactional workloads, etc. However, the cloud has some issues during data transactions. Many researchers have proposed models and solutions for these problems, but maintaining consistency during a transaction, one of the important ACID properties, remains the biggest problem. Further, a secure architecture is another important issue in the cloud environment. This paper therefore proposes a secure architecture and an efficient D1FTBC approach for cloud data transactions. The performance analyses are evaluated at various levels. This research work may benefit transaction-processing applications such as banking, online reservations and shopping carts.
... In academia, the various citation systems attempt to record the acknowledgement of outside influences on one's own work, while listings of co-authorship on articles record the collaborative editing of a single document, at least in theory. Until recently this process was usually [6] one of individual academics writing at different times on the same document, sending it back and forth between collaborators. Only recently have we seen editors that allow the collaborative editing of texts in real time, and among these Fidus Writer is the only alternative known to us that is designed for non-technical users in the humanities and social sciences. ...
... Stronger models place less burden on programmers (by guaranteeing more invariants) and users (by exposing fewer anomalies) but constrain service performance. In this work, we propose two new consistency models. We say "reasonable application logic" because one can always write a middleware layer that implements a stronger consistency model atop a weaker one, e.g., by taking the ideas of bolt-on consistency [8] to an extreme. But the resulting middleware-on-service combination is simply an inefficient implementation of a service with the stronger model. ...
Preprint
Full-text available
Strictly serializable (linearizable) services appear to execute transactions (operations) sequentially, in an order consistent with real time. This restricts a transaction's (operation's) possible return values and in turn, simplifies application programming. In exchange, strictly serializable (linearizable) services perform worse than those with weaker consistency. Switching to such services, however, can break applications. This work introduces two new consistency models to ease this trade-off: regular sequential serializability (RSS) and regular sequential consistency (RSC). They are just as "strong" for applications; we prove any application invariant that holds when using a strictly serializable (linearizable) service also holds when using an RSS (RSC) service. Yet they are "weaker" for services; they allow new, better-performing designs. To demonstrate this, we design, implement, and evaluate variants of two systems, Spanner and Gryff, weakening their consistency to RSS and RSC, respectively. The new variants achieve better read-only transaction and read tail latency than their counterparts.
... The models based on eventual consistency are usually developed on top of anti-entropy protocols [15] (like epidemic algorithms [16]), which try to minimize the differences between the states of datastore nodes. Causal Consistency [17] is a stronger model than EA but it is usually used for maintaining multiple replicas of data [18]. A good overview of the consistency models of NoSQL databases can be found in [19]. ...
Article
Full-text available
Introducing a strong consistency model into NoSQL data storages is one of the most interesting issues nowadays. In spite of the CAP theorem, many NoSQL systems try to strengthen the consistency to their limits to better serve the business needs. However, many consistency-related problems that occur in popular data storages are impossible to overcome and enforce rebuilding the whole system from scratch. Additionally, providing scalability to those systems really complicates the matter. In this paper, a novel data storage architecture that supports strong consistency without losing scalability is proposed. It provides strong consistency according to the following requirements: high scalability, high availability, and high throughput. The proposed solution is based on the Scalable Distributed Two–Layer Data Store which has proven to be a very efficient NoSQL system. The proposed architecture takes into account the concurrent execution of operations and unfinished operations. The theoretical correctness of the architecture as well as experimental evaluation in comparison to traditional mechanisms like locking and versioning is also shown. Comparative analysis with popular NoSQL systems like MongoDB and MemCached is also presented. Obtained results show that the proposed architecture presents a very high performance in comparison to existing NoSQL systems.
... Contrary to strong consistency [15] (Linearizability [16] and Sequential Consistency [17]), causal consistency can be implemented in the presence of faults while ensuring availability. Several implementations of different variants of causal consistency (such as causal convergence [20] and causal memory [6,21]) have been developed i.e., [7,11,19,22,23]. However, the development of such implementations that meet both consistency requirements and availability and performance requirements is an extremely hard and error prone task. ...
Article
Full-text available
The CAP Theorem shows that (strong) consistency, availability, and partition tolerance are impossible to be ensured together. Causal consistency is one of the weak consistency models that can be implemented to ensure availability and partition tolerance in distributed systems. In this work, we propose a tool to check automatically the conformance of distributed/concurrent systems executions to causal consistency models. Our approach consists in reducing the problem of checking if an execution is causally consistent to solving datalog queries. The reduction is based on complete characterizations of the executions violating causal consistency in terms of the existence of cycles in suitably defined relations between the operations occurring in these executions. We have implemented the reduction in a testing tool for distributed databases, and carried out several experiments on real case studies, showing the efficiency of the suggested approach.
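The computational core behind such a reduction is a search for cycles in relations over the operations of an execution. The generic sketch below (not the tool's Datalog encoding) checks whether a set of ordering constraints, e.g., program order, reads-from, and relations derived from them, contains a cycle.

```python
# Bare-bones sketch of the cycle check underlying the characterization
# described above: a violation manifests as a cycle in suitably defined
# relations between operations. Illustrative; not the tool's Datalog encoding.

def has_cycle(nodes, edges):
    """nodes: iterable of operation ids; edges: set of (a, b) meaning a must precede b."""
    adjacency = {n: [] for n in nodes}
    for a, b in edges:
        adjacency[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GREY
        for m in adjacency[n]:
            if color[m] == GREY:             # back edge: cycle found
                return True
            if color[m] == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in nodes)


# Example: w1 -> r1 (reads-from), r1 -> w2 (program order), w2 -> w1 (derived)
print(has_cycle({"w1", "r1", "w2"}, {("w1", "r1"), ("r1", "w2"), ("w2", "w1")}))  # True
```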
... Informally, a system is causally consistent if reads return values consistent with any reads and writes that could have influenced them (in the sense of Lamport's potential causality relation [18]). Causal consistency is widely used to ensure consistency and high availability of data objects with minimal delay across a wide-area network [4,21,22]. Our work adapts this principle to network routing specifically, introducing improvements to reduce the extent and hence delays associated with network updates. ...
Conference Paper
Though centrally managed by a controller, a software-defined network (SDN) can still encounter routing inconsistencies among its switches due to the non-atomic updates to their forwarding tables. In this paper, we propose a new method to rectify these inconsistencies that is inspired by causal consistency, a consistency model for shared-memory systems. Applied to SDNs, causal consistency would imply that once a packet is matched to ("reads") a forwarding rule in a switch, it can be matched in downstream switches only to rules that are equally or more up-to-date. We propose and analyze a relaxed but functionally equivalent version of this property called suffix causal consistency (SCC) and evaluate an implementation of SCC in Open vSwitch and P4 switches, in conjunction with the Ryu and P4Runtime controllers. Our results show that SCC provides greater efficiency than competing consistent-update alternatives while offering consistency that is strong enough to ensure high-level routing properties (black-hole freedom, bounded looping, etc.).
... There has been a great deal of work both in industry and in the literature to support distributed transactions with varying levels of consistency and scalability. Over the years, many systems with reduced consistency levels have been proposed with the goal of overcoming the scalability challenges of traditional relational database systems [5,15,19,32,35,44,61,64,71]. For many applications, however, isolation levels below serializable permit dangerous anomalies, which may manifest as security vulnerabilities [73]. ...
Article
Developers are increasingly building applications that incorporate multiple data stores, for example to manage heterogeneous data. Often, these require transactional safety for operations across stores, but few systems support such guarantees. To solve this problem, we introduce Epoxy, a protocol for providing transactions across heterogeneous data stores. We make two contributions. First, we adapt multi-version concurrency control to a cross-data store setting, storing versioning information in record metadata and filtering reads with predicates on metadata so they only see record versions in a global transaction snapshot. Second, we show our design enables an atomic commit protocol that does not require that data stores implement the participant protocol of two-phase commit, requiring only durable writes. We implement Epoxy for five data stores: Postgres, Elasticsearch, MongoDB, Google Cloud Storage, and MySQL. We evaluate it by adapting TPC-C and microservice workloads to a multi-data store environment. We find it has comparable performance to the distributed transaction protocol XA on TPC-C while providing stronger guarantees like isolation, and has overhead of <10% compared to a non-transactional baseline on read-mostly microservice workloads and 72% on write-heavy workloads.
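The version-filtering idea can be illustrated with a small sketch: each record version carries transaction metadata, and a read applies a predicate so that only versions inside the reader's snapshot are visible. Field names here are assumptions, not Epoxy's actual schema.

```python
# Illustrative sketch of the multi-version filtering idea described above:
# each record version carries transaction metadata, and reads apply a
# predicate so they only see versions inside the reader's snapshot.
# Field names are assumptions, not Epoxy's actual schema.
INF = float("inf")

def visible(record, snapshot):
    """record: dict with 'begin_txn' and 'end_txn' (INF if not yet superseded).
    snapshot: set of transaction ids that had committed when the reader started."""
    created_visible = record["begin_txn"] in snapshot
    not_yet_deleted = record["end_txn"] == INF or record["end_txn"] not in snapshot
    return created_visible and not_yet_deleted

versions = [
    {"key": "acct-1", "value": 100, "begin_txn": 7, "end_txn": 9},
    {"key": "acct-1", "value": 80,  "begin_txn": 9, "end_txn": INF},
]
snapshot = {1, 2, 7}   # txn 9 committed after this reader's snapshot
print([v["value"] for v in versions if visible(v, snapshot)])  # [100]
```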
Article
Operation-based Conflict-free Replicated Data Types (op-based CRDTs) are a family of distributed data structures where all operations are designed to commute, so that replica states eventually converge. Additionally, op-based CRDTs require that operations be propagated between replicas in causal order. This paper presents a framework for verifying safety properties of CRDT implementations using separation logic. The framework consists of two libraries. One implements a Reliable Causal Broadcast (RCB) protocol so that replicas can exchange messages in causal order. A second “OpLib” library then uses RCB to simplify the creation and correctness proofs of op-based CRDTs. OpLib allows clients to implement new CRDTs as purely-functional data structures, without having to reason about network operations, concurrency control and mutable state, and without having to each time re-implement causal broadcast. Using OpLib, we have implemented 12 example CRDTs from the literature, including multiple versions of replicated registers and sets, two CRDT combinators for products and maps, and two example use cases of the map combinator. Our proofs are conducted in the Aneris distributed separation logic and are formalized in Coq. Our technique is the first work on verification of op-based CRDTs that satisfies both of the following properties: it is modular and targets executable implementations , as opposed to high-level protocols.
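As a concrete example of the kind of object such a framework targets, the sketch below shows an op-based observed-remove set: operations are split into a prepare step at the origin replica and an effect step applied at every replica, and they commute provided the broadcast layer delivers them in causal order. This is an illustrative Python rendering, not OpLib's verified Coq/Aneris code.

```python
# Tiny sketch of an op-based CRDT (an observed-remove set): operations
# commute provided the broadcast layer delivers them in causal order.
import uuid

class ORSet:
    def __init__(self):
        self.entries = {}  # element -> set of unique tags

    # -- generator side (at the origin replica) --------------------------
    def prepare_add(self, element):
        return ("add", element, uuid.uuid4().hex)

    def prepare_remove(self, element):
        # Remove only the tags observed locally; causal delivery guarantees
        # every other replica sees those adds before this remove arrives.
        return ("remove", element, frozenset(self.entries.get(element, set())))

    # -- effector side (applied at every replica, in causal order) -------
    def apply(self, op):
        kind, element, payload = op
        if kind == "add":
            self.entries.setdefault(element, set()).add(payload)
        else:
            self.entries[element] = self.entries.get(element, set()) - set(payload)

    def value(self):
        return {e for e, tags in self.entries.items() if tags}
```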
Article
Facebook's graph store TAO, like many other distributed data stores, traditionally prioritizes availability, efficiency, and scalability over strong consistency or isolation guarantees to serve its large, read-dominant workloads. As product developers build diverse applications on top of this system, they increasingly seek transactional semantics. However, providing advanced features for select applications while preserving the system's overall reliability and performance is a continual challenge. In this paper, we first characterize developer desires for transactions that have emerged over the years and describe the current failure-atomic (i.e., write) transactions offered by TAO. We then explore how to introduce an intuitive read transaction API. We highlight the need for atomic visibility guarantees in this API with a measurement study on potential anomalies that occur without stronger isolation for reads. Our analysis shows that 1 in 1,500 batched reads reflects partial transactional updates, which complicate the developer experience and lead to unexpected results. In response to our findings, we present the RAMP-TAO protocol, a variation based on the Read Atomic Multi-Partition (RAMP) protocols that can be feasibly deployed in production with minimal overhead while ensuring atomic visibility for a read-optimized workload at scale.
Article
In this article we study the properties of distributed systems that mix eventual and strong consistency. We formalize such systems through acute cloud types (ACTs), abstractions similar to conflict-free replicated data types (CRDTs), which by default work in a highly available, eventually consistent fashion, but which also feature strongly consistent operations for tasks which require global agreement. Unlike other mixed-consistency solutions, ACTs can rely on efficient quorum-based protocols, such as Paxos. Hence, ACTs gracefully tolerate machine and network failures also for the strongly consistent operations. We formally study ACTs and demonstrate phenomena which are neither present in purely eventually consistent nor strongly consistent systems. In particular, we identify temporary operation reordering , which implies interim disagreement between replicas on the relative order in which the client requests were executed. When not handled carefully, this phenomenon may lead to undesired anomalies, including circular causality. We prove an impossibility result which states that temporary operation reordering is unavoidable in mixed-consistency systems with sufficiently complex semantics. Our result is startling, because it shows that apparent strengthening of the semantics of a system (by introducing strongly consistent operations to an eventually consistent system) results in the weakening of the guarantees on the eventually consistent operations.
Article
Replicating data across geo-distributed datacenters is usually necessary for large scale cloud services to achieve high locality, durability and availability. One of the major challenges in such geo-replicated data services lies in consistency maintenance, which usually suffers from long latency due to costly coordination across datacenters. Among others, transaction chopping is an effective and efficient approach to address this challenge. However, existing chopping is conducted statically during programming, which is rigid and complex for developers. In this paper, we propose Dynamic Transaction Chopping (DTC), a novel technique that does transaction chopping and determines piecewise execution in a dynamic and automatic way. DTC mainly consists of two parts: a dynamic chopper to dynamically divide transactions into pieces according to the data partition scheme, and a conflict detection algorithm to check the safety of the dynamic chopping. Compared with existing techniques, DTC has several advantages: transparency to programmers, flexibility in conflict analysis, high degree of piecewise execution, and adaptability to data partition schemes. A prototype of DTC is implemented to verify the correctness of DTC and evaluate its performance. The experiment results show that our DTC technique can achieve much better performance than similar work.
Preprint
Full-text available
Modern online services rely on data stores that replicate their data across geographically distributed data centers. Providing strong consistency in such data stores results in high latencies and makes the system vulnerable to network partitions. The alternative of relaxing consistency violates crucial correctness properties. A compromise is to allow multiple consistency levels to coexist in the data store. In this paper we present UniStore, the first fault-tolerant and scalable data store that combines causal and strong consistency. The key challenge we address in UniStore is to maintain liveness despite data center failures: this could be compromised if a strong transaction takes a dependency on a causal transaction that is later lost because of a failure. UniStore ensures that such situations do not arise while paying the cost of durability for causal transactions only when necessary. We evaluate UniStore on Amazon EC2 using both microbenchmarks and a sample application. Our results show that UniStore effectively and scalably combines causal and strong consistency.
Article
We introduce consistency-aware durability or Cad, a new approach to durability in distributed storage that enables strong consistency while delivering high performance. We demonstrate the efficacy of this approach by designing cross-client monotonic reads, a novel and strong consistency property that provides monotonic reads across failures and sessions in leader-based systems; such a property can be particularly beneficial in geo-distributed and edge-computing scenarios. We build Orca, a modified version of ZooKeeper that implements Cad and cross-client monotonic reads. We experimentally show that Orca provides strong consistency while closely matching the performance of weakly consistent ZooKeeper. Compared to strongly consistent ZooKeeper, Orca provides significantly higher throughput (1.8--3.3×) and notably reduces latency, sometimes by an order of magnitude in geo-distributed settings. We also implement Cad in Redis and show that the performance benefits are similar to that of Cad's implementation in ZooKeeper.
Article
We present the first specification and verification of an implementation of a causally-consistent distributed database that supports modular verification of full functional correctness properties of clients and servers. We specify and reason about the causally-consistent distributed database in Aneris, a higher-order distributed separation logic for an ML-like programming language with network primitives for programming distributed systems. We demonstrate that our specifications are useful, by proving the correctness of small, but tricky, synthetic examples involving causal dependency and by verifying a session manager library implemented on top of the distributed database. We use Aneris's facilities for modular specification and verification to obtain a highly modular development, where each component is verified in isolation, relying only on the specifications (not the implementations) of other components. We have used the Coq formalization of the Aneris logic to formalize all the results presented in the paper in the Coq proof assistant.
Chapter
Distributed key-value stores (KVS) are distributed databases that enable fast access to data distributed across a network of nodes. Prominent examples include Amazon’s Dynamo, Facebook’s Cassandra, Google’s BigTable and LinkedIn’s Voldemort. The design of secure and private key-value stores is an important problem because these systems are being used to store an increasing amount of sensitive data. Encrypting data at rest and decrypting it before use, however, is not enough because each decryption exposes the data and increases its likelihood of being stolen. End-to-end encryption, where data is kept encrypted at all times, is the best way to ensure data confidentiality.
Chapter
Programming loosely connected distributed applications is a challenging endeavour. Loosely connected distributed applications such as geo-distributed stores and intermittently reachable IoT devices cannot afford to coordinate among all of the replicas in order to ensure data consistency due to prohibitive latency costs and the impossibility of coordination if availability is to be ensured. Thus, the state of the replicas evolves independently, making it difficult to develop correct applications. Existing solutions to this problem limit the data types that can be used in these applications, which neither offer the ability to compose them to construct more complex data types nor offer transactions.
Chapter
Robotics Cognitive Architectures (RCA) are becoming a key element in the design of robots that need to be aware of its surrounding space and of their role in it. This is especially important for robots that interact with people in household, eldercare or industrial collaborative scenarios. We have proposed in earlier works an RCA called CORTEX designed for social robots operating in HRI environments. One of CORTEX’s main elements is a working memory designed as a graph-like data structure that is accessed by all the computational modules in charge of some relevant function in the system. Our current implementation is based on the concept of a real-time database, where one of the modules stores, receives and publishes changes to all modules. In this paper, we propose a new design of this element based on the Conflict-free Distributed Replicated Data Types (CRDT) theory of distributed data types. The new working memory presents important advantages over existing designs that are demonstrated with several experiments.
Article
Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular due to ease-of-use and operational simplicity. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms while maintaining the key benefits of existing FaaS offerings. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.
Article
Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes. Unfortunately, their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: metadata operations such as listing objects are expensive, and consistency guarantees are limited. In this paper, we present Delta Lake, an open source ACID table storage layer over cloud object stores initially developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular datasets (e.g., the ability to quickly search billions of table partitions for those relevant to a query). It also leverages this design to provide high-level features such as automatic data layout optimization, upserts, caching, and audit logs. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.
Conference Paper
It is common for storage systems designed to run on edge datacenters to avoid the high latencies associated with geo-distribution by relying on eventually consistent models to replicate data. Eventual consistency works well for many edge applications because as long as the client interacts with the same replica, the storage system can provide session consistency, a stronger consistency model that has two additional important properties: (i) read-your-writes, where subsequent reads by a client that has updated an object will return the updated value or a newer one; and, (ii) monotonic reads, where if a client has seen a particular value for an object, subsequent reads will return the same value or a newer one. While session consistency does not guarantee that different clients will perceive updates in the same order, it nevertheless presents each individual client with an intuitive view of the world that is consistent with the client's own actions. Unfortunately, these consistency guarantees break down when a client interacts with multiple replicas housed on different datacenters over time, either as a result of application partitioning, or client or code mobility. SessionStore is a datastore for fog/edge computing that ensures session consistency on top of otherwise eventually consistent replicas. SessionStore enforces session consistency by grouping related data accesses into a session, and using a session-aware reconciliation algorithm to reconcile only the data that is relevant to the session when switching between replicas. This approach reduces data transfer and latency by up to 90% compared to full replica reconciliation.
Chapter
Irrespective of the server-side architecture, scalable data management is the primary challenge for high performance. Business and presentation logic can be designed to scale by virtue of stateless processing or by offloading the problem of state to a shared data store. Therefore, the requirements of high availability and elastic scalability depend on database systems.
Article
Full-text available
Brewer's conjecture, based on his experiences building infrastructure for some of the first Internet search engines at Inktomi, states that distributed systems requiring always-on, highly available operation cannot guarantee the illusion of coherent, consistent single-system operation in the presence of network partitions, which cut communication between active servers. Moreover, even without partitions, a system that chooses availability over consistency enjoys the benefits of low latency. If a server can safely respond to a user's request when it is partitioned from all other servers, then it can also respond to a user's request without contacting other servers even when it is able to do so. Eventual consistency offers an available alternative: given the CAP impossibility result, distributed-database designers sought weaker consistency models that would enable both availability and high performance.
Conference Paper
Full-text available
Applications often have consistency requirements beyond those guaranteed by the underlying eventually consistent storage system. In this work, we present an approach that guarantees monotonic-read consistency and read-your-writes consistency by running a special middleware component on the same server as the application. We evaluate our approach using both simulation and real-world experiments on cloud storage systems.
Conference Paper
Full-text available
There has been a great deal of hype about Amazon's simple storage service (S3). S3 provides infinite scalability and high availability at low cost. Currently, S3 is used mostly to store multi-media documents (videos, photos, audio) which are shared by a community of people and rarely updated. The purpose of this paper is to demonstrate the opportunities and limitations of using S3 as a storage system for general-purpose database applications which involve small objects and frequent updates. Read, write, and commit protocols are presented. Furthermore, the cost ($), performance, and consistency properties of such a storage system are studied.
Conference Paper
Full-text available
The Deuteronomy system supports efficient and scalable ACID transactions in the cloud by decomposing functions of a database storage engine kernel into: (a) a transactional component (TC) that manages transactions and their "logical" concurrency control and undo/redo recovery, but knows nothing about physical data location and (b) a data component (DC) that maintains a data cache and uses access methods to support a record-oriented interface with atomic operations, but knows nothing about transactions. The Deuteronomy TC can be applied to data anywhere (in the cloud, local, etc.) with a variety of deployments for both the TC and DC. In this paper, we describe the architecture of our TC, and the considerations that led to it. Preliminary experiments using an adapted TPC-W workload show good performance supporting ACID transactions for a wide range of DC latencies.
Conference Paper
Full-text available
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
Article
Full-text available
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. The Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.
Article
Full-text available
A design principle aimed at choosing the proper placement of functions among the modules of a distributed computer system is presented. The principle, called the end-to-end argument, suggests that certain functions usually placed at low levels of the system are often redundant or of little value compared to the cost of providing them at a low level. Examples include highly reliable communications links, encryption, duplicate message suppression, and delivery acknowledgement. It is argued that low level mechanisms to support these functions are justified only as performance enhancement.
Conference Paper
Full-text available
Online social networking sites like MySpace, Facebook, and Flickr have become a popular way to share and disseminate content. Their massive popularity has led to viral marketing techniques that attempt to spread content, products, and ideas on these sites. However, there is little data publicly available on viral propagation in the real world and few studies have characterized how information spreads over current online social networks. In this paper, we collect and analyze large-scale traces of information dissemination in the Flickr social network. Our analysis, based on crawls of the favorite markings of 2.5 million users on 11 million photos, aims at answering three key questions: (a) how widely does information propagate in the social network? (b) how quickly does information propagate? and (c) what is the role of word-of-mouth exchanges between friends in the overall propagation of information in the network? Contrary to viral marketing "intuition," we find that (a) even popular photos do not spread widely throughout the network, (b) even popular photos spread slowly through the network, and (c) information exchanged between friends is likely to account for over 50% of all favorite-markings, but with a significant delay at each hop.
Conference Paper
Full-text available
We discuss relationships between client-centric consistency models (known as session guarantees), and data-centric consistency models. The first group includes: read-your-writes guarantee, monotonic-writes guarantee, monotonic-reads guarantee and writes-follow-reads guarantee. The other group includes: atomic consistency, sequential consistency, causal consistency, processor consistency, PRAM consistency, weak consistency, release consistency, scope consistency and entry consistency. We use a consistent notation to present formal definitions of both kinds of consistency models in the context of replicated shared objects. Next, we prove a relationship between causal consistency model and client-centric consistency models. Apparently, causal consistency is similar to writes-follow-reads guarantee. We show that in fact causal consistency requires all common session guarantees, i.e. read-your-writes, monotonic-writes, monotonic-reads and writes-follow-reads to be preserved.
Conference Paper
Full-text available
Four per-session guarantees are proposed to aid users and applications of weakly consistent replicated data: “read your writes”, “monotonic reads”, “writes follow reads”, and “monotonic writes”. The intent is to present individual applications with a view of the database that is consistent with their own actions, even if they read and write from various, potentially inconsistent servers. The guarantees can be layered on existing systems that employ a read-any/write-any replication scheme while retaining the principal benefits of such a scheme, namely high availability, simplicity, scalability, and support for disconnected operation. These session guarantees were developed in the context of the Bayou project at Xerox PARC in which we are designing and building a replicated storage system to support the needs of mobile computing users who may be only intermittently connected.
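As an illustration only (not Bayou's mechanism), the sketch below shows how a client-side session can enforce read-your-writes and monotonic reads by tracking the write-ids it has issued and observed and refusing to read from a replica that has not applied them; the Replica and Session classes are hypothetical.

```python
# Minimal sketch, assuming each replica can report which write-ids it has applied.
import itertools

class Replica:
    """Hypothetical replica that remembers the write-ids it has applied."""
    _ids = itertools.count(1)

    def __init__(self):
        self.data = {}        # key -> (value, write_id)
        self.applied = set()  # write ids applied at this replica

    def apply_write(self, key, value):
        wid = next(Replica._ids)
        self.data[key] = (value, wid)
        self.applied.add(wid)
        return wid

    def read(self, key):
        return self.data.get(key, (None, None))

class Session:
    """Client-side session enforcing read-your-writes and monotonic reads."""
    def __init__(self):
        self.write_set = set()   # writes issued by this session
        self.read_set = set()    # writes already observed by this session

    def write(self, replica, key, value):
        self.write_set.add(replica.apply_write(key, value))

    def read(self, replica, key):
        # The replica must have applied everything this session wrote or saw.
        if not (self.write_set | self.read_set) <= replica.applied:
            raise RuntimeError("replica too stale for this session; try another")
        value, wid = replica.read(key)
        if wid is not None:
            self.read_set.add(wid)
        return value

r1, r2 = Replica(), Replica()
s = Session()
s.write(r1, "inbox", "hello")
print(s.read(r1, "inbox"))   # ok: r1 has applied the session's write
# s.read(r2, "inbox")        # would raise: r2 has not yet seen that write
```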
Article
Full-text available
In a paper to be presented at the 1993 ACM Symposium on Operating Systems Principles, Cheriton and Skeen offer their understanding of causal and total ordering as a communication property. I find their paper highly critical of Isis, and unfairly so, for a number of reasons. In this paper I present some responses to their criticism, and also explain why I find their discussion of causal and total communication ordering to be distorted and incomplete. I respond to their criticisms from the perspective of my work on Isis [Bir93, BJ87a, BJ87b] and the overall communication model that Isis employs. I assume that the reader is familiar with the Cheriton and Skeen paper; the structure of this response roughly parallels their order of presentation.
Article
Full-text available
To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. For some applications a weaker causal operation order can preserve consistency while providing better performance. This paper describes a new way of implementing causal operations. Our technique also supports two other kinds of operations: operations that are totally ordered with respect to one another, and operations that are totally ordered with respect to all other operations. The method performs well in terms of response time, operation processing capacity, amount of stored state, and number and size of messages; it does better than replication methods based on reliable multicast techniques.
Book
This two-volume work, first published in 1843, was John Stuart Mill's first major book. It reinvented the modern study of logic and laid the foundations for his later work in the areas of political economy, women's rights and representative government. In clear, systematic prose, Mill (1806–73) disentangles syllogistic logic from its origins in Aristotle and scholasticism and grounds it instead in processes of inductive reasoning. An important attempt at integrating empiricism within a more general theory of human knowledge, the work constitutes essential reading for anyone seeking a full understanding of Mill's thought. Continuing the discussion of induction, Volume 2 concludes with Book VI, 'On the Logic of the Moral Sciences', in which Mill applies empirical reasoning to human behaviour. A crucial early formulation of his thinking regarding free will and necessity, this book establishes the centrality of 'the social science' to Mill's philosophy.
Article
Geo-replicated storage provides copies of the same data at multiple, geographically distinct locations. Facebook, for example, geo-replicates its data (profiles, friends lists, likes, etc.) to data centers on the east and west coasts of the United States, and in Europe. In each data center, a tier of separate Web servers accepts browser requests and then handles those requests by reading and writing data from the storage system.
Article
The widespread use of clusters and Web farms has increased the importance of data replication. In this article, we show how to implement consistent and scalable data replication at the middleware level. We do this by combining transactional concurrency control with group communication primitives. The article presents different replication protocols, argues their correctness, describes their implementation as part of a generic middleware, Middle-R, and proves their feasibility with an extensive performance evaluation. The solution proposed is well suited for a variety of applications including Web farms and distributed object platforms.
Conference Paper
Causal consistency is the strongest consistency model that is available in the presence of partitions and provides useful semantics for human-facing distributed services. Here, we expose its serious and inherent scalability limitations due to write propagation requirements and traditional dependency tracking mechanisms. As an alternative to classic potential causality, we advocate the use of explicit causality, or application-defined happens-before relations. Explicit causality, a subset of potential causality, tracks only relevant dependencies and reduces several of the potential dangers of causal consistency.
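For concreteness, here is a minimal sketch of the explicit-causality idea (names and structure are illustrative, not the paper's API): the application attaches only the dependencies it cares about to each write, rather than the client's entire read history.

```python
# Minimal sketch: a write carries only application-declared happens-before edges.
import uuid

def new_write(value, explicit_deps=()):
    """Create a write tagged with application-chosen dependencies only."""
    return {
        "id": str(uuid.uuid4()),
        "value": value,
        "deps": set(explicit_deps),   # ids of writes this write must follow
    }

post = new_write("Anyone around for lunch?")
reply = new_write("Yes, 12:30?", explicit_deps={post["id"]})
# A causal store must make `post` visible before `reply`, but it needs to track
# nothing else the replying client happened to read (potential causality).
```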
Article
We examine the limits of consistency in fault-tolerant distributed storage systems. In particular, we identify fundamental tradeoffs among properties of consistency, availability, and convergence, and we close the gap between what is known to be impossible (i.e. CAP) and known systems that are highly available but that provide weaker consistency such as causal. Specifically, in the asynchronous model with omission failures and unreliable networks, we show the following tight bound: no consistency stronger than Real Time Causal Consistency (RTC) can be provided in an always-available, one-way convergent system, and RTC itself can be provided in such a system. In the asynchronous, Byzantine-failure model, we show that it is impossible to implement many of the recently introduced fork-based consistency semantics without sacrificing either availability or convergence; notably, proposed systems allow Byzantine nodes to permanently partition correct nodes from one another. To address this limitation, we introduce bounded fork join causal semantics that extends causal consistency to Byzantine environments while retaining availability and convergence.
Article
Data store replication results in a fundamental trade-off between operation latency and data consistency. In this paper, we examine this trade-off in the context of quorum-replicated data stores. Under partial, or non-strict quorum replication, a data store waits for responses from a subset of replicas before answering a query, without guaranteeing that read and write replica sets intersect. As deployed in practice, these configurations provide only basic eventual consistency guarantees, with no limit to the recency of data returned. However, anecdotally, partial quorums are often "good enough" for practitioners given their latency benefits. In this work, we explain why partial quorums are regularly acceptable in practice, analyzing both the staleness of data they return and the latency benefits they offer. We introduce Probabilistically Bounded Staleness (PBS) consistency, which provides expected bounds on staleness with respect to both versions and wall clock time. We derive a closed-form solution for versioned staleness as well as model real-time staleness for representative Dynamo-style systems under internet-scale production workloads. Using PBS, we measure the latency-consistency trade-off for partial quorum systems. We quantitatively demonstrate how eventually consistent systems frequently return consistent data within tens of milliseconds while offering significant latency benefits.
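As a rough, worked illustration of the partial-quorum trade-off (a simplification, not the paper's full PBS model): immediately after a write acknowledged by W of N replicas, a read that contacts R replicas can miss the new version whenever its read set avoids all W acknowledgers.

```python
# Back-of-the-envelope sketch: probability a read of R replicas sees only stale
# copies, right after a write acknowledged by W of N replicas and before any
# further propagation, is C(N-W, R) / C(N, R); it is 0 when R + W > N.
from math import comb

def p_stale(n, w, r):
    if r > n - w:
        return 0.0                      # read and write quorums must intersect
    return comb(n - w, r) / comb(n, r)

for (n, w, r) in [(3, 1, 1), (3, 2, 1), (3, 2, 2)]:
    print(f"N={n} W={w} R={r}: P(stale) = {p_stale(n, w, r):.3f}")
# N=3 W=1 R=1 -> 0.667; N=3 W=2 R=1 -> 0.333; N=3 W=2 R=2 -> 0.000 (strict quorum)
```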
Article
A formal definition for liveness properties is proposed. It is argued that this definition captures the intuition that liveness properties stipulate that ‘something good’ eventually happens during execution. A topological characterization of safety and liveness is given. Every property is shown to be the intersection of a safety property and a liveness property.
Conference Paper
Cloud SQL Server is an Internet-scale relational database service which is currently used by Microsoft-delivered services and also offered directly as a fully relational database service known as "SQL Azure". One of the principal design objectives in Cloud SQL Server was to provide true SQL support with full ACID transactions within controlled-scale "consistency domains" and provide a relaxed degree of consistency across consistency domains that would be viable for clusters of thousands of nodes. In this paper, we describe the implementation of Cloud SQL Server with an emphasis on this core design principle.
Conference Paper
Causally and totally ordered communication support (CATOCS) has been proposed as important to provide as part of the basic building blocks for constructing reliable distributed systems. In this paper, we identify four major limitations to CATOCS, investigate the applicability of CATOCS to several classes of distributed applications in light of these limitations, and the potential impact of these facilities on communication scalability and robustness. From this investigation, we find limited merit and several potential problems in using CATOCS. The fundamental difficulty with the CATOCS is that it attempts to solve problems at the communication level in violation of the well-known "end-to-end" argument.
Conference Paper
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience where operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model---causal consistency with convergent conflict handling, or causal+---that is the strongest achieved under these constraints. We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide-area. A key contribution of COPS is its scalability, which can enforce causal dependencies between keys stored across an entire cluster, rather than a single server like previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
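The core visibility check is easy to illustrate. The sketch below (a simplification, not COPS's actual protocol or data structures) buffers a replicated write until every ⟨key, version⟩ it depends on is already readable in the local cluster.

```python
# Minimal sketch of dependency checking before exposing replicated writes.
class LocalCluster:
    def __init__(self):
        self.visible = {}   # key -> highest version currently readable
        self.pending = []   # replicated writes waiting on dependencies

    def dep_satisfied(self, key, version):
        return self.visible.get(key, -1) >= version

    def receive_replicated_write(self, key, version, value, deps):
        self.pending.append((key, version, value, deps))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                key, version, value, deps = w
                if all(self.dep_satisfied(k, v) for k, v in deps):
                    self.visible[key] = max(self.visible.get(key, -1), version)
                    self.pending.remove(w)
                    progress = True

cluster = LocalCluster()
cluster.receive_replicated_write("reply", 1, "me too!", deps=[("post", 1)])
print(cluster.visible)   # {} -- the reply waits for its dependency
cluster.receive_replicated_write("post", 1, "who likes ice cream?", deps=[])
print(cluster.visible)   # {'post': 1, 'reply': 1}
```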
Conference Paper
We describe the design and implementation of Walter, a key-value store that supports transactions and replicates data across distant sites. A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. We use Walter to build a social networking application and port a Twitter-like application.
Conference Paper
Although extensive studies have been conducted on online social networks (OSNs), it is not clear how to characterize information propagation and social influence, two types of important but not well-defined social behavior. This paper presents a measurement study of 58M messages collected from 700K users on Twitter.com, a popular social medium. We analyze the propagation patterns of general messages and show how breaking news (Michael Jackson’s death) spread through Twitter. Furthermore, we evaluate different social influences by examining their stabilities, assessments, and correlations. This paper addresses the complications as well as challenges we encounter when measuring message propagation and social influence on OSNs. We believe that our results here provide valuable insights for future OSN research.
Conference Paper
We present PRACTI, a new approach for large-scale replication. PRACTI systems can replicate or cache any subset of data on any node (Partial Replication), provide a broad range of consistency guarantees (Arbitrary Consistency), and permit any node to send information to any other node (Topology Independence). A PRACTI architecture yields two significant advantages. First, by providing all three PRACTI properties, it enables better trade-offs than existing mechanisms that support at most two of the three desirable properties. The PRACTI approach thus exposes new points in the design space for replication systems. Second, the flexibility of PRACTI protocols simplifies the design of replication systems by allowing a single architecture to subsume a broad range of existing systems and to reduce development costs for new ones. To illustrate both advantages, we use our PRACTI prototype to emulate existing server replication, client-server, and object replication systems and to implement novel policies that improve performance for mobile users, web edge servers, and grid computing by as much as an order of magnitude.
Conference Paper
Cloud computing has emerged as a preferred platform for deploying scalable web-applications. With the growing scale of these applications and the data associated with them, scalable data management systems form a crucial part of the cloud infrastructure. Key-Value stores -- such as Bigtable, PNUTS, Dynamo, and their open source analogues-- have been the preferred data stores for applications in the cloud. In these systems, data is represented as Key-Value pairs, and atomic access is provided only at the granularity of single keys. While these properties work well for current applications, they are insufficient for the next generation web applications -- such as online gaming, social networks, collaborative editing, and many more -- which emphasize collaboration. Since collaboration by definition requires consistent access to groups of keys, scalable and consistent multi key access is critical for such applications. We propose the Key Group abstraction that defines a relationship between a group of keys and is the granule for on-demand transactional access. This abstraction allows the Key Grouping protocol to collocate control for the keys in the group to allow efficient access to the group of keys. Using the Key Grouping protocol, we design and implement G-Store which uses a key-value store as an underlying substrate to provide efficient, scalable, and transactional multi key access. Our implementation using a standard key-value store and experiments using a cluster of commodity machines show that G-Store preserves the desired properties of key-value stores, while providing multi key access functionality at a very low overhead.
Conference Paper
While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.
Article
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
Article
The CAP theorem's impact on modern distributed database system design is more limited than is often perceived. Another tradeoff—between consistency and latency—has had a more direct influence on several well-known DDBSs. A proposed new formulation, PACELC, unifies this tradeoff with CAP.
Article
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
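A minimal sketch of the logical-clock rules described here (illustrative Python, not taken from the paper): each process increments its counter on local events and sends, and takes the maximum on receive, so that timestamps respect the happened-before partial order.

```python
# Minimal sketch of Lamport logical clocks.
class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock            # timestamp carried by the outgoing message

    def receive(self, msg_ts):
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

p, q = Process("p"), Process("q")
p.local_event()                      # p: 1
ts = p.send()                        # p: 2, message carries 2
q.receive(ts)                        # q: max(0, 2) + 1 = 3
print(p.clock, q.clock)              # 2 3
```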
Article
Recently, several strategies have been proposed for transaction processing in partitioned distributed database systems with replicated data. These strategies are surveyed in light of the competing goals of maintaining correctness and achieving high availability. Extensions and combinations are then discussed, and guidelines are presented for selecting strategies for particular applications.
Article
We propose the first unsupervised approach to the problem of modeling dialogue acts in an open domain. Trained on a corpus of noisy Twitter conversations, our method discovers dialogue acts by clustering raw utterances. Because it accounts for the sequential behaviour of these acts, the learned model can provide insight into the shape of communication in a new medium. We address the challenge of evaluating the emergent model with a qualitative visualization and an intrinsic conversation ordering task. This work is inspired by a corpus of 1.3 million Twitter conversations, which will be made publicly available. This huge amount of data, available only because Twitter blurs the line between chatting and publishing, highlights the need to be able to adapt quickly to a new medium.
Article
The abstraction of a shared memory is of growing importance in distributed computing systems. Traditional memory consistency ensures that all processes agree on a common order of all operations on memory. Unfortunately, providing these guarantees entails access latencies that prevent scaling to large systems. This paper weakens such guarantees by defining causal memory, an abstraction that ensures that processes in a system agree on the relative ordering of operations that are causally related. Because causal memory is weakly consistent, it admits more executions, and hence more concurrency, than either atomic or sequentially consistent memories. This paper provides a formal definition of causal memory and gives an implementation for message-passing systems. In addition, it describes a practical class of programs that, if developed for a strongly consistent memory, run correctly with causal memory.
Article
Many distributed systems are now being developed to provide users with convenient access to data via some kind of communications network. In many cases it is desirable to keep the system functioning even when it is partitioned by network failures. A serious problem in this context is how one can support redundant copies of resources such as files (for the sake of reliability) while simultaneously monitoring their mutual consistency (the equality of multiple copies). This is difficult since network failures can lead to inconsistency, and disrupt attempts at maintaining consistency. In fact, even the detection of inconsistent copies is a nontrivial problem. Naive methods either 1) compare the multiple copies entirely or 2) perform simple tests which will diagnose some consistent copies as inconsistent. Here a new approach, involving version vectors and origin points, is presented and shown to detect single file, multiple copy mutual inconsistency effectively. The approach has been used in the design of LOCUS, a local network operating system at UCLA.
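The comparison step is easy to show in a few lines; this is a minimal sketch of version-vector dominance (illustrative only, not the LOCUS implementation): one copy supersedes another if it has seen at least as many updates from every site, otherwise the copies conflict.

```python
# Minimal sketch: compare two version vectors (site -> update count).
def dominates(a, b):
    sites = set(a) | set(b)
    return all(a.get(s, 0) >= b.get(s, 0) for s in sites)

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a is newer"
    if dominates(b, a):
        return "b is newer"
    return "conflict"    # independent updates on both sides of a partition

print(compare({"siteA": 2, "siteB": 1}, {"siteA": 1, "siteB": 1}))  # a is newer
print(compare({"siteA": 2, "siteB": 0}, {"siteA": 1, "siteB": 3}))  # conflict
```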
Article
A distributed system can be characterized by the fact that the global state is distributed and that a common time base does not exist. However, the notion of time is an important concept in everyday life of our decentralized "real world" and helps to solve problems like getting a consistent population census or determining the potential causality between events. We argue that a linearly ordered structure of time is not (always) adequate for distributed systems and propose a generalized non-standard model of time which consists of vectors of clocks. These clock vectors are partially ordered and form a lattice. By using timestamps and a simple clock update mechanism the structure of causality is represented in an isomorphic way. The new model of time has a close analogy to Minkowski's relativistic spacetime and leads among others to an interesting characterization of the global state problem. Finally, we present a new algorithm to compute a consistent global snapshot of a distributed system where messages may be received out of order.
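Complementing the version-vector comparison above, this sketch shows the per-process update rules for vector time (illustrative Python, not the paper's notation): each process increments its own entry on every event and merges componentwise on receive, so one event causally precedes another exactly when its vector is componentwise less than or equal and not identical.

```python
# Minimal sketch of vector clocks and the happened-before test they induce.
class VectorClockProcess:
    def __init__(self, pid, n):
        self.pid, self.v = pid, [0] * n

    def event(self):
        self.v[self.pid] += 1
        return tuple(self.v)

    def send(self):
        return self.event()          # timestamp attached to the outgoing message

    def receive(self, ts):
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        return self.event()

def happened_before(e, f):
    return all(a <= b for a, b in zip(e, f)) and e != f

p0, p1 = VectorClockProcess(0, 2), VectorClockProcess(1, 2)
e1 = p0.event()            # (1, 0)
m = p0.send()              # (2, 0)
f1 = p1.receive(m)         # (2, 1)
print(happened_before(e1, f1), happened_before(f1, e1))  # True False
```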
Article
When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.
Article
Bayou's anti-entropy protocol for update propagation between weakly consistent storage replicas is based on pair-wise communication, the propagation of write operations, and a set of ordering and closure constraints on the propagation of the writes. The simplicity of the design makes the protocol very flexible, thereby providing support for diverse networking environments and usage scenarios. It accommodates a variety of policies for when and where to propagate updates. It operates over diverse network topologies, including low-bandwidth links. It is incremental. It enables replica convergence, and updates can be propagated using floppy disks and similar transportable media. Moreover, the protocol handles replica creation and retirement in a light-weight manner. Each of these features is enabled by only one or two of the protocol's design choices, and can be independently incorporated in other systems. This paper presents the anti-entropy protocol in detail, describing the design decisions and resulting features.
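To illustrate the pair-wise propagation idea (a minimal sketch only, omitting Bayou's ordering constraints, commit protocol, and replica creation/retirement): the sender ships exactly those writes the receiver has not yet seen, tracked with a per-origin counter, so repeated sessions between pairs of replicas converge their logs.

```python
# Minimal sketch of pair-wise anti-entropy between weakly consistent replicas.
class Server:
    def __init__(self, name):
        self.name = name
        self.log = []                      # [(origin, seqno, op)] in arrival order
        self.seen = {}                     # origin -> highest seqno received

    def local_write(self, op):
        seq = self.seen.get(self.name, 0) + 1
        self.seen[self.name] = seq
        self.log.append((self.name, seq, op))

    def anti_entropy_to(self, other):
        # Send, in log order, every write the other server has not yet seen.
        for origin, seq, op in self.log:
            if seq > other.seen.get(origin, 0):
                other.log.append((origin, seq, op))
                other.seen[origin] = seq

a, b = Server("A"), Server("B")
a.local_write("set x=1")
a.local_write("set x=2")
a.anti_entropy_to(b)
print(b.log)   # B now holds both of A's writes, in order
```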
Article
Current commercial databases allow application programmers to trade off consistency for performance. However, existing definitions of weak consistency levels are either imprecise or they disallow efficient implementation techniques such as optimism. Ruling out these techniques is especially unfortunate because commercial databases support optimistic mechanisms. Furthermore, optimism is likely to be the implementation technique of choice in the geographically distributed and mobile systems of the future. This thesis presents the first implementation-independent specifications of existing ANSI isolation levels and a number of levels that are widely used in commercial systems, e.g., Cursor Stability, Snapshot Isolation. It also specifies a variety of guarantees for predicate-based operations in an implementation-independent manner. Two new levels are defined that provide useful consistency guarantees to application writers; one is the weakest level that ensures consistent reads, while th...
Article
The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The relative merits and limitations of the different approaches are discussed, and their general feasibility is analyzed. Keywords: Distributed Computation, Causality, Distributed System, Causal Ordering, Logical Time, Vector Time, Global Predicate Detection, Distributed Debugging
Benchmarking Cassandra with YCSB. cassandra-user mailing list
  • M. Klems
Make data useful. https://sites.google.com/site/glinden/Home/StanfordDataMining
  • G. Linden
Combination of all available comment datasets: mefi, askme, meta, music. User count from usernames
  • Metafilter Infodump
Project Voldemort: Reliable distributed storage
  • A. Feinberg
Twitter hits 400 million tweets per day, mostly mobile. CNET, 2012
  • D. Farber
Choosing consistency
  • W. Vogels
Social computing data repository at ASU
  • R. Zafarani
  • H. Liu
Eventually consistent
  • W. Vogels