Using Abstraction To Improve Fault Tolerance
Miguel Castro
Microsoft Research Ltd.
1 Guildhall St., Cambridge CB2 3NH, UK
mcastro@microsoft.com
Rodrigo Rodrigues and Barbara Liskov
MIT Laboratory for Computer Science
545 Technology Sq., Cambridge, MA 02139, USA
{rodrigo,liskov}@lcs.mit.edu
Abstract
Software errors are a major cause of outages and they
are increasingly exploited in malicious attacks. Byzantine
fault tolerance allows replicated systems to mask some soft-
ware errors but it is expensive to deploy. This paper de-
scribes a replication technique, BFTA, which uses abstrac-
tion to reduce the cost of Byzantine fault tolerance and to
improve its ability to mask software errors. BFTA reduces
cost because it enables reuse of off-the-shelf service imple-
mentations. It improves availability because each replica
can be repaired periodically using an abstract view of the
state stored by correct replicas, and because each replica
can run distinct or non-deterministic service implementa-
tions, which reduces the probability of common mode fail-
ures. We built an NFS service that allows each replica to
run a different operating system. This example suggests that
BFTA can be used in practice: the replicated file system
required only a modest amount of new code, and preliminary
performance results indicate that it performs comparably
to the off-the-shelf implementations that it wraps.
1. Introduction
There is a growing demand for highly-available systems
that provide correct service without interruptions. These
systems must tolerate software errors because these are a
major cause of outages [7]. Furthermore, there is an in-
creasing number of malicious attacks that exploit software
errors to gain control or deny access to systems that provide
important services.
This paper proposes a replication technique, BFTA, that
combines Byzantine fault tolerance [12] with work on data
abstraction [11]. Byzantine fault tolerance allows a replicated
service to tolerate arbitrary behavior from faulty replicas,
e.g., the behavior caused by a software bug, or the behavior
of a replica that is controlled by an attacker. Abstraction
hides implementation details to enable the reuse of
off-the-shelf implementations of important services (e.g., file
systems, databases, or HTTP daemons) and to improve the
ability to mask software errors.
This research was partially supported by DARPA under contract
F30602-98-1-0237 monitored by the Air Force Research Laboratory.
We extended the BFT library [1, 2] to implement BFTA.
The original BFT library provides Byzantine fault tolerance
with good performance and strong correctness guarantees if
no more than $1/3$ of the replicas fail within a small window
of vulnerability. However, it requires all replicas to run the
same service implementation and to update their state in a
deterministic way. Therefore, it cannot tolerate determinis-
tic software errors that cause all replicas to fail concurrently
and it complicates reuse of existing service implementations
because it requires extensive modifications to ensure identi-
cal values for the state of each replica.
The BFTA library and methodology described in this paper
correct these problems: they enable replicas to run different
or non-deterministic implementations. The methodology
is based on the concepts of abstract specification and
abstraction function from work on data abstraction [11]. We
start by defining a common abstract specification for the
service, which specifies an abstract state and describes how
each operation manipulates the state. Then we implement
a conformance wrapper for each distinct implementation to
make it behave according to the common specification. The
last step is to implement an abstraction function (and one of
its inverses) to map from the concrete state of each imple-
mentation to the common abstract state (and vice versa).
Our methodology offers several important advantages.
Reuse of existing code. BFTA implements a form of state
machine replication [14, 10], which allows replication of
services that perform arbitrary computations, but requires
determinism: all replicas must produce the same sequence
of results when they process the same sequence of opera-
tions. Most off-the-shelf implementations of services fail
to satisfy this condition. For example, many implementa-
tions produce timestamps by reading local clocks, which
can cause the states of replicas to diverge. The conformance
wrapper and the abstract state conversions enable the reuse
of existing implementations without modifications. Fur-
thermore, these implementations can be non-deterministic,
which reduces the probability of common mode failures.
Software rejuvenation. It has been observed [9] that there
is a correlation between the length of time software runs and
the probability that it fails. BFTA combines proactive re-
covery [2] with abstraction to counter this problem. Repli-
cas are recovered periodically even if there is no reason to
suspect they are faulty. Recoveries are staggered such that
the service remains available during rejuvenation to enable
frequent recoveries. When a replica is recovered, it is re-
booted and restarted from a clean state. Then it is brought
up to date using a correct copy of the abstract state that
is obtained from the group of replicas. Abstraction may
improve availability by hiding corrupt concrete states, and
it enables proactive recovery when replicas do not run the
same code or run code that is non-deterministic.
Opportunistic N-version programming. Replication is
not useful when there is a strong positive correlation be-
tween the failure probabilities of the different replicas, e.g.,
deterministic software bugs cause all replicas to fail at the
same time when they run the same code. BFTA enables an
opportunistic form of N-version programming [3]: replicas
can run distinct, off-the-shelf implementations of the
service. This is a viable option for many common services,
e.g., relational databases, HTTP daemons, file systems, and
operating systems. In all these cases, competition has led to
four or more distinct implementations that were developed
and are maintained separately but have similar (although not
identical) functionality. Furthermore, the technique is made
easier by the existence of standards that provide identical
interfaces to different implementations, e.g., ODBC [6] and
NFS [5]. We can also leverage the large effort towards stan-
dardizing data representations using XML.
It is widely believed that the benefits of N-version pro-
gramming [3] do not justify its high cost [7]. It is better
to invest the same amount of money on better development,
verification, and testing of a single implementation. But op-
portunistic N-version programming achieves low cost due
to economies of scale without compromising the quality of
individual implementations. Since each off-the-shelf imple-
mentation is sold to a large number of customers, the ven-
dors can amortize the cost of producing a high quality im-
plementation. Additionally, taking advantage of interoper-
ability standards keeps the cost of writing the conformance
wrappers and state conversion functions low.
The paper explains the methodology by walking through
an example, the implementation of a replicated file service
where replicas run different operating systems and file systems.
For this methodology to be successful, the conformance wrapper
and the state conversion functions must be simple, to reduce the
likelihood of introducing more errors, and must introduce low
overhead. Experimental results indicate that this is true in our
example.
The remainder of the paper is organized as follows. Sec-
tion 2 provides an overview of the BFTA methodology and
library. Section 3 explains how we applied the methodol-
ogy to build the replicated file system. Section 4 presents
our conclusions and some preliminary results.
2. The BFTA Technique
This section provides an overview of our replication tech-
nique. It starts by describing the methodology that we use to
build a replicated system from existing service implemen-
tations. It ends with a description of the BFTA library.
2.1. Methodology
The goal is to build a replicated system by reusing a
set of off-the-shelf implementations, $I_1, \ldots, I_n$, of some
service. Ideally, we would like $n$ to equal the number of
replicas so that each replica can run a different implementation
to reduce the probability of simultaneous failures. But the
technique is useful even with a single implementation.
Although off-the-shelf implementations of the same service
offer roughly the same functionality, they behave differently:
they implement different specifications, $S_1, \ldots, S_n$,
using different representations of the service state. Even the
behavior of different replicas that run the same implementation
may be different when the specification they implement
is not strong enough to ensure deterministic behavior. For
instance, the specification of the NFS protocol [5] allows
implementations to choose arbitrary values for file handles.
BFTA, like any form of state machine replication, re-
quires determinism: replicas must produce the same se-
quence of results when they execute the same sequence of
operations. We achieve determinism by defining a common
abstract specification, $S$, for the service that is strong
enough to ensure deterministic behavior. This specification
defines the abstract state, an initial state value, and the be-
havior of each service operation.
The specification is defined without knowledge of the internals
of each implementation, unlike the technique sketched in [13].
It is sufficient to treat them as
black boxes, which is important to enable the use of existing
implementations. Additionally, the abstract state captures
only what is visible to the client rather than mimicking what
is common in the concrete states of the different implemen-
tations. This simplifies the abstract state and improves the
effectiveness of our software rejuvenation technique.
The next step is to implement conformance wrappers,
$C_1, \ldots, C_n$, one for each of $I_1, \ldots, I_n$. The conformance
wrappers implement the common specification $S$. The
implementation of each wrapper $C_i$ is a veneer that invokes the
operations offered by $I_i$ to implement the operations in $S$;
in implementing these operations it makes use of a conformance
rep that stores whatever additional information is
needed to allow the translation from the concrete behavior
of the implementation to the abstract behavior.
The final step is to implement the abstraction function
and one of its inverses. These functions allow state transfer
among the replicas. State transfer is used to repair faulty
replicas, and also to bring slow replicas up-to-date when
messages they are missing have been garbage collected. For
state transfer to work, replicas must agree on the value of the
state of the service after executing a sequence of operations;
they will not agree on the value of the concrete state but our
methodology ensures that they will agree on the value of
the abstract state. The abstraction function is used to con-
vert the concrete state stored by a replica into the abstract
state, which is transferred to another replica. The receiving
replica uses the inverse function to convert the abstract state
into its own concrete state representation.
To enable efficient state transfer between replicas, the
abstract state is defined as an array of variable-sized objects.
We explain how this representation enables efficient state
transfer in Section 2.2.
There is an important trend that simplifies the methodology.
Market forces push vendors towards extending their
products to offer interfaces that implement standard
specifications for interoperability, e.g., ODBC [6]. Usually, a
standard specification $S'$ cannot be used as the common
specification $S$ because it is too weak to ensure deterministic
behavior. But it can be used as a basis for $S$. Because
$S$ and $S'$ are similar, three things follow: it is relatively
easy to implement conformance wrappers and state conversion
functions, this code can be mostly reused across service
implementations, and most client code can use the replicated
system without modification.
2.2. Library
The BFTA library extends BFT with the features necessary
to support the methodology. Figure 1 presents a summary
of the library’s interface.
Client call:
int invoke(Byz_req *req, Byz_rep *rep,
           bool read_only);
Execution upcall:
int execute(Byz_req *req, Byz_rep *rep,
            int client, Byz_buffer *non_det);
Checkpointing:
void modify(int nobjs, int *objs);
State conversion upcalls:
int get_obj(int i, char **obj);
void put_objs(int nobjs, char **objs,
              int *is, int *szs);
Figure 1. BFTA Interface and Upcalls
The invoke procedure is called by the client to invoke
an operation on the replicated service. This procedure car-
ries out the client side of the replication protocol and returns
the result when enough replicas have responded. When the
library needs to execute an operation at a replica, it makes
an upcall to an execute procedure that is implemented
by the conformance wrapper for the service implementation
run by the replica.
To perform state transfer in the presence of Byzantine
faults, it is necessary to be able to prove that the state being
transferred is correct. Otherwise, faulty replicas could cor-
rupt the state of out-of-date but correct replicas. (A detailed
discussion of this point can be found in [2].) Consequently,
replicas cannot discard a copy of the state produced after
executing a request until they know that the state produced
by executing later requests can be proven correct. Repli-
cas could keep a copy of the state after executing each re-
quest but this would be too expensive. Instead replicas keep
just the current version of the concrete state plus copies of
the abstract state produced every k-th request (e.g., k=128).
These copies are called checkpoints.
As mentioned earlier, to implement checkpointing and
state transfer efficiently, we require that the abstract state
be encoded as an array of objects. Creating checkpoints by
making full copies of the abstract state would be too ex-
pensive. Instead, the library uses copy-on-write such that
checkpoints only contain the objects whose value is dif-
ferent in the current abstract state. Similarly, transferring
a complete checkpoint to bring a recovering or out-of-date
replica up to date would be too expensive. The library em-
ploys a hierarchical state partition scheme to transfer state
efficiently. When a replica is fetching state, it recurses down
a hierarchy of meta-data to determine which partitions are
out of date. When it reaches the leaves of the hierarchy
(which are the abstract objects), it fetches only the objects
that are corrupt or out of date.
To implement state transfer, each replica must provide
the library with two upcalls, which implement the abstraction
function and one of its inverses. get_obj receives an
object index $i$, allocates a buffer, obtains the value of the
abstract object with index $i$, and places that value in the
buffer. It returns the size of that object and a pointer to
the buffer. put_objs receives a vector of objects with the
corresponding indices and sizes. It causes the application
to update its concrete state using the new values for the
abstract objects passed as arguments. The library guarantees
that the put_objs upcall is invoked with an argument that
brings the abstract state of the replica to a consistent value
(i.e., the value of a valid checkpoint). This is important to
allow encodings of the abstract state with dependencies
between objects, e.g., it allows objects to describe the meaning
of other objects.
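The two upcalls can be illustrated with a minimal sketch. It assumes a toy service whose concrete state is an array of counters, and an abstract encoding of each counter as four big-endian bytes; the array, the encoding, and NOBJS are hypothetical choices made for illustration, and only the two signatures come from Figure 1.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NOBJS 4                      /* hypothetical number of abstract objects */

/* Hypothetical concrete state: counters in host byte order. */
static uint32_t concrete[NOBJS];

/* Abstraction function for object i: encode the counter as 4 big-endian
   bytes so that every replica produces identical bytes for equal values.
   Returns the object size, or -1 if allocation fails. */
int get_obj(int i, char **obj)
{
    assert(i >= 0 && i < NOBJS);
    unsigned char *buf = malloc(4);
    if (buf == NULL)
        return -1;
    uint32_t v = concrete[i];
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
    *obj = (char *)buf;
    return 4;
}

/* Inverse abstraction function: decode each abstract object and install
   it in the concrete state at the given index. */
void put_objs(int nobjs, char **objs, int *is, int *szs)
{
    for (int k = 0; k < nobjs; k++) {
        assert(is[k] >= 0 && is[k] < NOBJS && szs[k] == 4);
        const unsigned char *b = (const unsigned char *)objs[k];
        concrete[is[k]] = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
                          ((uint32_t)b[2] << 8) | (uint32_t)b[3];
    }
}
```

In this sketch the caller owns the buffer returned by get_obj and frees it; who frees the buffer in the real library is not stated here, so that is an assumption too.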
Each time the execute upcall is about to modify an
object in the abstract state it is required to invoke a modify
procedure, which is supplied by the library, passing the
object index as argument. This is used to implement
copy-on-write to create checkpoints incrementally: the library
invokes get_obj with the appropriate index and keeps the
copy of the object until the corresponding checkpoint can
be discarded.
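A minimal sketch of this copy-on-write step follows. It assumes a single outstanding checkpoint and a toy abstract state of integers; the ckpt table, start_checkpoint, and copied_objects are hypothetical helpers, and the real library keeps several checkpoints and more bookkeeping than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NOBJS 8                       /* hypothetical number of abstract objects */

static int state[NOBJS];              /* toy abstract state: one int per object */

/* Value each object had when the checkpoint was taken, saved lazily
   (hypothetical layout). */
static struct { bool saved; char *data; int size; } ckpt[NOBJS];

/* Toy abstraction function with the Figure 1 signature. */
int get_obj(int i, char **obj)
{
    *obj = malloc(sizeof state[i]);
    if (*obj == NULL)
        return -1;
    memcpy(*obj, &state[i], sizeof state[i]);
    return (int)sizeof state[i];
}

/* Library side of modify: called BEFORE the objects change. The first
   call for an object after a checkpoint saves its old value; later
   calls leave the saved copy untouched. */
void modify(int nobjs, int *objs)
{
    for (int k = 0; k < nobjs; k++) {
        int i = objs[k];
        assert(i >= 0 && i < NOBJS);
        if (ckpt[i].saved)
            continue;
        ckpt[i].size = get_obj(i, &ckpt[i].data);
        ckpt[i].saved = (ckpt[i].size >= 0);
    }
}

/* Take a new checkpoint: discard the copies kept for the previous one. */
void start_checkpoint(void)
{
    for (int i = 0; i < NOBJS; i++) {
        free(ckpt[i].data);
        ckpt[i].data = NULL;
        ckpt[i].saved = false;
    }
}

/* Number of objects the checkpoint had to copy so far. */
int copied_objects(void)
{
    int n = 0;
    for (int i = 0; i < NOBJS; i++)
        n += ckpt[i].saved ? 1 : 0;
    return n;
}
```

The checkpoint value of an object is its saved copy if one exists and its current value otherwise, so the cost of a checkpoint is proportional to the number of objects modified since it was taken.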
BFTA implements a form of state machine replication
that requires replicas to behave deterministically. The methodology
uses abstraction to hide most of the non-determinism
in the implementations it reuses. However, many services
involve forms of non-determinism that cannot be hidden by
abstraction. For instance, in the case of the NFS service, the
time-last-modified for each file is set by reading the server’s
local clock. If this were done independently at each replica,
the states of the replicas would diverge. The library pro-
vides a mechanism [1] for replicas to agree on these non-
deterministic values, which are then passed as arguments to
the execute procedure.
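The effect on the wrapper can be sketched as follows. The sketch assumes that the agreed value is a 64-bit timestamp carried in the non_det buffer, and it defines stand-in buffer types because the layout of Byz_req, Byz_rep, and Byz_buffer is not given here; both are assumptions, and the agreement protocol itself [1] is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the library's buffer types. */
typedef struct { char *contents; int size; } Byz_buffer;
typedef Byz_buffer Byz_req;
typedef Byz_buffer Byz_rep;

static int64_t last_modified;   /* toy service state: one time-last-modified */

/* Execute a toy "touch" operation. The timestamp comes from the value the
   replicas agreed on, never from the local clock, so all correct replicas
   store the same time-last-modified. Returns 0 on success, -1 on error. */
int execute(Byz_req *req, Byz_rep *rep, int client, Byz_buffer *non_det)
{
    (void)req;
    (void)client;
    int64_t agreed;
    assert(rep != NULL);
    if (non_det == NULL || non_det->size != (int)sizeof agreed)
        return -1;
    if (rep->size < (int)sizeof agreed)
        return -1;
    memcpy(&agreed, non_det->contents, sizeof agreed);
    last_modified = agreed;
    /* The reply carries the abstract timestamp. */
    memcpy(rep->contents, &last_modified, sizeof last_modified);
    rep->size = (int)sizeof last_modified;
    return 0;
}
```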
Proactive recovery periodically restarts each replica from
a correct, up-to-date checkpoint of the abstract state that is
obtained from the other replicas. Recoveries are staggered
so that fewer than $1/3$ of the replicas recover at the same time.
This allows the other replicas to continue processing client
requests during the recovery. Additionally, it should reduce
the likelihood of simultaneous failures due to aging problems
because at any instant fewer than $1/3$ of the replicas
have been running for the same period of time.
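One simple way to stagger recoveries is to space their start times evenly over the recovery period. The sketch below assumes such a fixed schedule, with replica i starting at offset i*period/n and every recovery taking the same number of ticks; the schedule and the function names are illustrative assumptions, not a description of how the library's watchdog timers are set.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed schedule: replica i (0 <= i < n) starts a recovery at times
   i*period/n + k*period for every integer k, and each recovery lasts
   `duration` ticks. Returns true if replica i is recovering at time t. */
bool is_recovering(int i, int n, long period, long duration, long t)
{
    assert(n > 0 && i >= 0 && i < n && period > 0 && t >= 0);
    long offset = (long)i * period / n;
    /* Position of t within replica i's current period, in [0, period). */
    long phase = ((t - offset) % period + period) % period;
    return phase < duration;
}

/* Number of replicas recovering at time t under the same schedule. */
int recovering_count(int n, long period, long duration, long t)
{
    int c = 0;
    for (int i = 0; i < n; i++)
        if (is_recovering(i, n, period, duration, t))
            c++;
    return c;
}
```

With this schedule, at most one replica recovers at a time as long as a recovery takes no longer than period/n, which keeps the number of recovering replicas below one third for any group of four or more.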
Recoveries are triggered by a watchdog timer. When
a replica is recovered, it reboots after saving the replica-
tion protocol state and the concrete service state to disk.
The protocol state includes the abstract objects that were
copied by the incremental checkpointing mechanism. Then
the replica is restarted, and the conformance rep is recon-
structed using the information that was saved to disk. Next,
the library uses the hierarchical state transfer mechanism to
compare the value of the abstract state it currently stores
with the abstract state values stored by the other replicas.
This is efficient: the replica uses cryptographic hashes stored
in the state partition tree to determine which abstract objects
are out-of-date or corrupt and it only fetches the value of
these objects.
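The comparison can be sketched with a two-level tree. The sketch assumes fixed-size toy objects, four objects per partition, and FNV-1a as a stand-in digest; FNV-1a is not a cryptographic hash and would not be safe against a faulty replica, and the tree shape is likewise an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOBJS  16                 /* hypothetical number of abstract objects */
#define PART   4                  /* objects per partition (assumed) */
#define NPARTS (NOBJS / PART)

/* FNV-1a, a NON-cryptographic stand-in for the cryptographic hash. */
static uint64_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

typedef struct {
    uint64_t leaf[NOBJS];         /* digest of each abstract object */
    uint64_t part[NPARTS];        /* digest of the leaf digests in a partition */
} HashTree;

/* Build the digests bottom-up over toy objects (one int32_t each). */
void build_tree(const int32_t *objs, HashTree *t)
{
    for (int i = 0; i < NOBJS; i++)
        t->leaf[i] = fnv1a(&objs[i], sizeof objs[i]);
    for (int p = 0; p < NPARTS; p++)
        t->part[p] = fnv1a(&t->leaf[p * PART], PART * sizeof t->leaf[0]);
}

/* Recurse from partitions to leaves: partitions whose digests match are
   skipped entirely. Writes the indices of differing objects to out (room
   for NOBJS entries) and the number of leaf comparisons to *leaf_cmps. */
int find_stale(const HashTree *mine, const HashTree *good,
               int *out, int *leaf_cmps)
{
    int n = 0;
    *leaf_cmps = 0;
    for (int p = 0; p < NPARTS; p++) {
        if (mine->part[p] == good->part[p])
            continue;
        for (int i = p * PART; i < (p + 1) * PART; i++) {
            (*leaf_cmps)++;
            if (mine->leaf[i] != good->leaf[i])
                out[n++] = i;
        }
    }
    return n;
}
```

Only the objects listed in out need to be fetched, and only the leaves of partitions that differ are examined.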
The object values fetched by the replica could be supplied
to put_objs to update the concrete state, but the
concrete state might still be corrupt. For example, an
implementation may have a memory leak and simply calling
put_objs will not free unreferenced memory. In fact,
implementations will not typically offer an interface that can
be used to fix all corrupt data structures in their concrete
state. Therefore, it is better to restart the implementation
from a clean initial concrete state and use the abstract state
to bring it up-to-date.
3. An example: File System
This section illustrates the methodology using a repli-
cated file system as an example. The file system is based on
the NFS protocol [5]. Its replicas can run different operating
systems and file system implementations.
3.1. Abstract Specification
The common abstract specification is based on the spec-
ification of the NFS protocol [5]. The abstract file service
state consists of a fixed-size array of pairs with an object and
a generation number. Each object has a unique identifier,
oid, which is obtained by concatenating its index in the ar-
ray and its generation number. The generation number is in-
cremented every time the entry is assigned to a new object.
There are four types of objects: files, whose data is a byte
array; directories, whose data is a sequence of <name, oid>
pairs ordered lexicographically; symbolic links, whose data
is a small character string; and special null objects, which
indicate an entry is free. All non-null objects have meta-data,
which includes the attributes in the NFS fattr structure.
Each entry in the array is encoded using XDR [4]. The
object with index 0 is a directory object that corresponds to
the root of the file system tree that was mounted.
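The oid scheme can be sketched as follows. The 32-bit widths for index and generation number, the Entry layout, and the lowest-free-index allocation rule are all assumptions; that rule is just one example of a deterministic procedure for assigning oids.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Oid;   /* assumed layout: index in high 32 bits, generation in low 32 */

typedef enum { OBJ_NULL = 0, OBJ_FILE, OBJ_DIR, OBJ_SYMLINK } ObjType;

/* One entry of the abstract state array (object data omitted). */
typedef struct { ObjType type; uint32_t gen; } Entry;

/* Concatenate array index and generation number into an oid. */
Oid make_oid(uint32_t index, uint32_t gen)
{
    return ((Oid)index << 32) | gen;
}

uint32_t oid_index(Oid oid) { return (uint32_t)(oid >> 32); }
uint32_t oid_gen(Oid oid)   { return (uint32_t)(oid & 0xffffffffu); }

/* Assign the lowest free entry to a new object and bump its generation
   number. Returns 0 and stores the new oid, or -1 if the array is full. */
int alloc_entry(Entry *table, uint32_t n, ObjType type, Oid *oid)
{
    assert(type != OBJ_NULL);
    for (uint32_t i = 0; i < n; i++) {
        if (table[i].type != OBJ_NULL)
            continue;
        table[i].type = type;
        table[i].gen++;
        *oid = make_oid(i, table[i].gen);
        return 0;
    }
    return -1;
}

/* Mark the entry free; its generation number is kept for the next use. */
void free_entry(Entry *table, Oid oid)
{
    table[oid_index(oid)].type = OBJ_NULL;
}
```

Because the generation number changes each time an entry is reused, a stale oid held by a client never matches the new occupant of the same array slot.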
The operations in the common specification are those de-
fined by the NFS protocol. There are operations to read and
write each type of non-null object. The file handles used by
the clients are the oids of the corresponding objects. To en-
sure deterministic behavior, we define a deterministic pro-
cedure to assign oids, and require that directory entries re-
turned to a client be ordered lexicographically.
The abstraction hides many details: the allocation of file
blocks, the representation of large files and directories, and
the persistent storage medium and how it is accessed. This
is desirable for simplicity and performance, and to improve
resilience to software faults due to aging.
3.2. Conformance Wrapper
The conformance wrapper for the file service processes
NFS protocol operations and interacts with an off-the-shelf
file system implementation also using the NFS protocol as
illustrated in Figure 2. A file system exported by the repli-
cated file service is mounted on the client machine like any
regular NFS file system. Application processes run unmod-
ified and interact with the mounted file system through the
NFS client in the kernel. We rely on user-level relay processes
to mediate communication between the standard NFS
client and the replicas. A relay receives NFS protocol re-
quests, calls the invoke procedure of our replication li-
brary, and sends the result back to the NFS client. The
replication library invokes the execute procedure imple-
mented by the conformance wrapper to run each NFS re-
quest.
The conformance rep consists of an array that corresponds
to the one in the abstract state but it does not store copies
of the objects; instead each array entry contains the gener-
ation number, the file handle assigned to the object by the
underlying NFS server, and the value of the timestamps in
the object’s abstract meta-data. Empty entries store a null
file handle. The rep also contains a map from file handles
to oids to aid in processing replies efficiently.
The wrapper processes each NFS request received from a
client as follows. It translates the file handles in the request,
which encode oids, into the corresponding NFS server file
handles. Then it sends the modified request to the underlying
NFS server. The server processes the request and returns
a reply.
Figure 2. Software Architecture (client: Andrew benchmark, kernel NFS client, relay, and replication library; each of replicas 1 to n: replication library, conformance wrapper with state conversion, and unmodified NFS daemon)
The wrapper parses the reply and updates the confor-
mance rep. If the operation created a new object, the wrap-
per allocates a new entry in the array in the conformance
rep, increments the generation number, and updates the en-
try to contain the file handle assigned to the object by the
NFS server. If any object is deleted, the wrapper marks
its entry in the array free. In both cases, the reverse map
from file handles to oids is updated. The wrapper must also
update the abstract timestamps in the array entries corresponding
to objects that were accessed. We use the library
to agree on the timestamp value that is assigned to each
operation [1]. This value is one of the arguments to the execute
procedure implemented by the wrapper.
Finally, the wrapper returns a modified reply to the client,
using the map to translate file handles to oids and replacing
the concrete timestamp values by the abstract ones. When
handling readdir calls, the wrapper reads the entire directory
and sorts it lexicographically to ensure the client receives
identical replies from all replicas.
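Two of the reply-processing steps can be sketched as follows. The sketch assumes a linear-search map from file handles to oids and byte-wise ordering of names via strcmp; HandleMapping, DirEntry, and both function names are hypothetical. The 32-byte handle size is the fixed opaque handle size of NFS version 2 [5].

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FHSIZE 32   /* NFS version 2 file handles are 32 opaque bytes [5] */

/* Hypothetical entry of the reverse map kept in the conformance rep. */
typedef struct { unsigned char fh[FHSIZE]; uint64_t oid; } HandleMapping;

/* Hypothetical directory entry as returned to the client. */
typedef struct { const char *name; uint64_t oid; } DirEntry;

/* Translate a server file handle into an oid.
   Returns 0 and stores the oid if the handle is known, -1 otherwise. */
int fh_to_oid(const HandleMapping *map, size_t n,
              const unsigned char *fh, uint64_t *oid)
{
    assert(fh != NULL && oid != NULL);
    for (size_t k = 0; k < n; k++) {
        if (memcmp(map[k].fh, fh, FHSIZE) == 0) {
            *oid = map[k].oid;
            return 0;
        }
    }
    return -1;
}

/* Compare two entries by name, byte by byte. */
static int by_name(const void *a, const void *b)
{
    const DirEntry *x = a, *y = b;
    return strcmp(x->name, y->name);
}

/* Put a directory listing into the order required by the specification,
   whatever order the underlying server returned it in. */
void sort_directory(DirEntry *entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], by_name);
}
```

Since names within a directory are unique, the sorted order is the same at every replica even though qsort is not a stable sort.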
3.3. State Conversions
The abstraction function in the file service is implemented
as follows. For each file system object, it uses the file han-
dle stored in the conformance rep to invoke the NFS server
to obtain the data and meta-data for the object. Then it re-
places the concrete timestamp values by the abstract ones,
converts the file handles in directory entries to oids, and
sorts the directories lexicographically.
The inverse abstraction function in the file service works
as follows. For each file system object $o$ it receives, there
are three possible cases depending on the state of the entry
$e$ that corresponds to $o$ in the conformance rep: (1) $e$ contains
$o$’s generation number, (2) $e$ is not free and does not contain
$o$’s generation number, (3) $e$ is free.
In the first case, objects that changed can be updated using
the file handle in $e$ to make calls to the NFS server. This
is done differently for different types of objects. For files,
it is sufficient to issue a setattr and a write to update
the file’s meta-data and data, and for symbolic links, it is
sufficient to update their meta-data. Updating directories is
slightly trickier. The inverse abstraction function reads the
entire directory from the NFS server, computes its current
abstract value, and compares this value with $o$. Nothing is
done for entries that did not change. Entries that are not
present in $o$ or point to a different object are removed by
issuing the appropriate calls to the NFS server. Then entries
that are new or different in $o$ are created, but if the object
they refer to does not exist in the current abstract state, it is
first created using the value for the object that is supplied to
put_objs.
In the second case, the NFS server is invoked to remove
the object and then the function proceeds as in case 3.
In the third case, the NFS server is invoked to create the
object (initially in a separate unlinked directory) and the
object’s data and meta-data are updated as in case 1. It is
guaranteed that the directories that point to the object will be
processed; the object is then linked to those directories and
removed from the unlinked directory. When new objects are
created, their file handles are recorded in the conformance
wrapper’s data structures.
3.4. Proactive Recovery
NFS file handles are volatile: the same file system ob-
ject may have a different file handle after the NFS server
restarts. For proactive recovery to work efficiently, we need
a persistent identifier for objects in the concrete file system
state that can be used to compute the abstraction function
during recovery.
The NFS specification states that each object is uniquely
identified by a pair of meta-data attributes: <fsid, fileid>.
We solve the problem above by maintaining an additional
map from <fsid, fileid> pairs to the corresponding oids. This
map is saved to disk asynchronously when a checkpoint is
created and synchronously before a proactive recovery. Af-
ter rebooting, the replica that is recovering reads the map
from disk. Then it traverses the file system’s directory tree
depth first from the root. It reads each object, uses the map
to obtain its oid, and uses the cryptographic hashes from the
state transfer protocol to check if the object is up-to-date. If
the object is out-of-date or corrupt, it is fetched from an-
other replica.
Instead of simply calling put_objs with the new object
values, we intend to start an NFS server on a second empty
disk and bring it up-to-date incrementally as we obtain the
value of the abstract objects. This has the advantage of im-
proving fault-tolerance as discussed in Section 2.2. Addi-
tionally, it can improve disk locality by clustering blocks
from the same file and files that are in the same directory.
This is not done in the current prototype.
4. Conclusion
Software errors are a major cause of outages and they are
increasingly exploited in malicious attacks to gain control
or deny access to important services. Byzantine fault tolerance
allows replicated systems to mask some software errors
but it has been expensive to deploy. We have described
a replication technique, BFTA, which uses abstraction to re-
duce the cost of deploying Byzantine fault tolerance and to
improve its ability to mask software errors.
BFTA reduces cost because it enables reuse of off-the-
shelf service implementations without modifications, and it
improves resilience to software errors by enabling opportunistic
N-version programming and software rejuvenation
through proactive recovery.
Opportunistic N-version programming runs distinct, off-
the-shelf implementations at each replica to reduce the prob-
ability of common mode failures. To apply this technique,
it is necessary to define a common abstract behavioral spec-
ification for the service and to implement appropriate con-
version functions for the state, requests, and replies of each
implementation in order to make it behave according to the
common specification. These tasks are greatly simplified by
basing the common specification on standards for the inter-
operability of software from different vendors; these stan-
dards appear to be common, e.g., ODBC [6], and NFS [5].
Opportunistic N-version programming improves on previ-
ous N-version programming techniques by avoiding the high
development, testing, and maintenance costs without com-
promising the quality of individual versions.
Additionally, we provide a mechanism to repair faulty
replicas. Proactive recovery allows the system to remain
available provided no more than $1/3$ of the replicas become
faulty and corrupt the abstract state (in a correlated way)
within a window of vulnerability. Abstraction may enable
more than $1/3$ of the replicas to be faulty because it can
hide corrupt items in concrete states of faulty replicas.
The paper described a replicated NFS file system imple-
mented using our technique. The conformance wrapper and
the state conversion functions in our prototype are simple:
they have 1105 semi-colons, which is two orders of magnitude
less than the size of the Linux 2.2 kernel. This suggests
that they are unlikely to introduce new bugs.
We ran a scaled-up version of the Andrew benchmark [8,
2] (which generates 1 GB of data) to compare the perfor-
mance of our replicated file system and the off-the-shelf
implementation of NFS in Linux 2.2 that it wraps. Our
performance results indicate that the overhead introduced
by our technique is low; it is approximately 30% for this
benchmark with a window of vulnerability of 17 minutes.
These preliminary results suggest that BFTA can be used
in practice. As future work, it would be important to run
experiments that apply BFTA to more challenging services,
e.g., a relational database. It would also be important to
run fault injection experiments to evaluate the availability
improvements afforded by our technique.
References
[1] M. Castro and B. Liskov. Practical Byzantine Fault Toler-
ance. In Proceedings of the Third Symposium on Operat-
ing Systems Design and Implementation, New Orleans, LA,
February 1999.
[2] M. Castro and B. Liskov. Proactive Recovery in a Byzantine-
Fault-Tolerant System. In Proceedings of the Fourth Sympo-
sium on Operating Systems Design and Implementation, San
Diego, CA, October 2000.
[3] L. Chen and A. Avizienis. N-Version Programming: A Fault-
Tolerance Approach to Reliability of Software Operation. In
Fault Tolerant Computing, FTCS-8, pages 3–9, 1978.
[4] Network Working Group Request for Comments: 1014.
XDR: External Data Representation Standard, June 1987.
[5] Network Working Group Request for Comments: 1094.
NFS: Network File System Protocol Specification, March
1989.
[6] Kyle Geiger. Inside ODBC. Microsoft Press, 1995.
[7] J. Gray and D. Siewiorek. High-Availability Computer Sys-
tems. IEEE Computer, 24(9):39–48, September 1991.
[8] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satya-
narayanan, R. Sidebotham, and M. West. Scale and perfor-
mance in a distributed file system. ACM Transactions on
Computer Systems, 6(1):51–81, February 1988.
[9] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton. Software
Rejuvenation: Analysis, Module and Applications. In Fault-
Tolerant Computing, FTCS-25, pages 381–390, Pasadena,
CA, June 1995.
[10] L. Lamport. Time, Clocks, and the Ordering of Events
in a Distributed System. Communications of the ACM,
21(7):558–565, July 1978.
[11] B. Liskov and J. Guttag. Program Development in Java:
Abstraction, Specification, and Object-Oriented Design.
Addison-Wesley, 2000.
[12] M. Pease, R. Shostak, and L. Lamport. Reaching Agreement
in the Presence of Faults. Journal of the ACM, 27(2):228–
234, April 1980.
[13] A. Romanovsky. Abstract Object State and Version Re-
covery in N-Version Programming. In TOOLS Europe’99,
Nancy, France, June 1999.
[14] F. Schneider. Implementing Fault-Tolerant Services Using
the State Machine Approach: A Tutorial. ACM Computing
Surveys, 22(4):299–319, December 1990.