SUPPORTING FAULT-TOLERANT PARALLEL
PROGRAMMING IN LINDA
(Ph.D. Dissertation)
David Edward Bakken
TR 94-23
August 8, 1994
Department of Computer Science
The University of Arizona
Tucson, Arizona 85721
This work was supported in part by the National Science Foundation under grant CCR-9003161 and the
Office of Naval Research under grant N00014-91-J-1015.
SUPPORTING FAULT-TOLERANT PARALLEL
PROGRAMMING IN LINDA
by
David Edward Bakken
Copyright © David Edward Bakken 1994
A Dissertation Submitted to the Faculty of the
DEPARTMENT OF COMPUTER SCIENCE
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
1994
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at The University of Arizona and is deposited in the University Library
to be made available to borrowers under the rules of the Library.
Brief quotations from this dissertation are allowable without special permission, pro-
vided that accurate acknowledgment of source is made. Requests for permission for
extended quotation from or reproduction of the manuscript in whole or in part may be
granted by the copyright holder.
SIGNED:
ACKNOWLEDGMENTS
I wish to express sincere, profound, and profuse thanks to my advisor, Rick Schlichting.
His encouragement, guidance, friendship, and humor have been a delight to experience.
He has greatly honed my research and writing skills and has been a researcher to venerate
and emulate. I thank Greg Andrews for all he taught me in so many different ways about
concurrent programming, research, and writing. I thank him and Larry Peterson for their
contributions to my research and for serving on my committee. And I thank Mary Bailey
for serving as a substitute on my final committee and for reviewing this dissertation.
Others have contributed directly to this research. I thank Vic Thomas and Shivakant
Mishra for their many discussions regarding this research, and Shivakant for developing
Consul. I thank Jerry Leichter, LRW Systems, and GTE Laboratories for providing the
Linda precompiler that was used as a starting point for the FT-Linda precompiler. I thank
Rob Rubin, Rick Snodgrass, Dennis Shasha, and Tom Wilkes for their useful comments
on FT-Linda. I thank Dhananjay Mahajan and Sandy Miller for their work on Consul.
I thank Ron Riter and Frank McCormick for teaching me so much about software and
systems during my time at Boeing; these lessons were invaluable in graduate school
and will no doubt continue to be so during the rest of my career. I thank Brad Glade for
suggesting read-only tuple spaces. I thank the National Science Foundation for supporting
this work through grant CCR-9003161 and the Office of Naval Research through grant
N00014-91-J-1015.
Many people have made my stay in Tucson delightful. I thank the department’s
system administrators and office staff for keeping things running so very smoothly. I
thank my fellow graduate students Doug Arnett, Nina Bhatti, Peter Bigot, Wanda Chiu,
Peter Druschel, Curtis Dyreson, Vincent Freeh, Patrick Homer, Clint Jeffery, Nick Kline,
David Lowenthal, Vic Thomas, Beth Weiss, and Andrey Yeatts for their friendship. I
thank Anwar Abdulla, Yousef Akeel, Harvey Bostrom, Nancy Nelson, and Nayla Yateem
for making my stay in Tucson more pleasant, each in their own ways. I am thankful for
the sweet fellowship of the members of the prayer group, especially Fred and Ruth Fox,
Derek and Shannon Parks, and Craig and Mary Bell.
I thank my parents for their love and support over the years, and I thank my brother
as well as Gram and Joanne. And I can never hope to thank my dear wife, Beth, or my
precious children, Abby and Adam, enough for their love and encouragement during my
studies. Enduring my grad school habit was difficult for them all.
Finally,and foremost, I thank God for providing me a way to heaven through His Son,
Jesus Christ.
Dedicated to the memory of my twin brother
Gregory Harold Bakken
January 2, 1961 – August 16, 1987
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT

CHAPTER 1: INTRODUCTION
    1.1 Motivation for Parallel Programming
    1.2 Architectures for Parallel Programming
    1.3 Simplifying Parallel Programming
        1.3.1 Fault Tolerance Abstractions
        1.3.2 Process Coordination Mechanisms
    1.4 Linda
    1.5 FT-Linda
    1.6 Dissertation Outline

CHAPTER 2: LINDA AND FAILURES
    2.1 Linda Overview
    2.2 Problems with Linda Semantics
    2.3 Problems with Failures
    2.4 Implementing Stability and Atomicity
    2.5 Summary

CHAPTER 3: FT-LINDA
    3.1 Stable Tuple Spaces
    3.2 Features for Atomic Execution
        3.2.1 Atomic Guarded Statement
        3.2.2 Atomic Tuple Transfer
    3.3 Tuple Space Semantics
    3.4 Related Work
    3.5 Possible Extensions to FT-Linda
        3.5.1 Additional Tuple Space Attributes
        3.5.2 Nested AGS
        3.5.3 Notification of AGS Branch Execution
        3.5.4 Tuple Space Clocks
        3.5.5 Tuple Space Partitions
        3.5.6 Guard Expressions
        3.5.7 TS Creation in an AGS
    3.6 Summary

CHAPTER 4: PROGRAMMING WITH FT-LINDA
    4.1 Highly Dependable Systems
        4.1.1 Replicated Server
        4.1.2 Recoverable Server
        4.1.3 General Transaction Facility
    4.2 Parallel Applications
        4.2.1 Fault-Tolerant Divide and Conquer
        4.2.2 Barriers
    4.3 Handling Main Process Failures
    4.4 Summary

CHAPTER 5: IMPLEMENTATION AND PERFORMANCE
    5.1 Overview
    5.2 Major Data Structures
    5.3 FT-LCC
    5.4 AGS Request Processing
        5.4.1 General Case
        5.4.2 Examples
    5.5 Rationale for AGS Restrictions
        5.5.1 Dataflow Restrictions
        5.5.2 Blocking Operations in the AGS Body
        5.5.3 Function Calls, Expressions, and Conditional Execution
        5.5.4 Restrictions in Similar Languages
        5.5.5 Summary
    5.6 Initial Performance Results
    5.7 Optimizations
    5.8 Future Extensions
        5.8.1 Reintegration of Failed Hosts
        5.8.2 Non-Full Replication
        5.8.3 Network Partitions
    5.9 Summary

CHAPTER 6: CONCLUSIONS
    6.1 Summary
    6.2 Future Work

APPENDIX A: FT-LINDA IMPLEMENTATION NOTES
APPENDIX B: FT-LINDA REPLICATED SERVER
APPENDIX C: FT-LINDA RECOVERABLE SERVER EXAMPLE
APPENDIX D: FT-LINDA GENERAL TRANSACTION MANAGER EXAMPLE
    D.1 Specification
    D.2 Manager
    D.3 Sample User
    D.4 User Template (transaction_client.c)
APPENDIX E: FT-LINDA BAG-OF-TASKS EXAMPLE
APPENDIX F: FT-LINDA DIVIDE AND CONQUER EXAMPLE
APPENDIX G: FT-LINDA BARRIER EXAMPLE
APPENDIX H: MAJOR DATA STRUCTURES

REFERENCES
LIST OF FIGURES

2.1  Distributed Variables with Linda
2.2  Bag-of-Tasks Worker
2.3  State Machine Approach
3.1  Lost Tuple Solution for (Static) Bag-of-Tasks Worker
3.2  Bag-of-Tasks Monitor Process
3.3  AGS Disjunction
3.4  Fault-Tolerant (Dynamic) Bag-of-Tasks Worker
4.1  Replicated Server Client Request
4.2  Server Replica
4.3  Recoverable Server
4.4  Recoverable Server Monitor Process
4.5  Transaction Facility Interface
4.6  Transaction Facility Initialization and Finalization Procedures
4.7  Transaction Initialization
4.8  Modifying a Transaction Variable
4.9  Transaction Abort and Commit
4.10 Printing an Atomic Snapshot of all Variables
4.11 Transaction Monitor Process
4.12 Linda Divide and Conquer Worker
4.13 FT-Linda Divide and Conquer Worker
4.14 Tree-structured barrier
4.15 Linda Shared Counter Barrier Initialization
4.16 Linda Shared Counter Barrier Worker
4.17 Linda Barrier
4.18 FT-Linda Barrier Initialization
4.19 FT-Linda Barrier Worker
4.20 FT-Linda Barrier Monitor
4.21 Simple Result Synthesis
4.22 Complex Result Synthesis
5.1  Runtime Structure
5.2  Tuple Hash Table
5.3  Blocked Hash Table
5.4  FT-LCC Structure
5.5  Outer AGS GC Fragment for count Update
5.6  Inner AGS GC Fragment for count Update
5.7  AGS Request Message Flow
5.8  Non-Full Replication
LIST OF TABLES

3.1  Linda Ops and their FT-Linda Equivalents
5.1  FT-Linda Parameter Parsing
5.2  FT-Linda Operations on Various Architectures (µsec)
ABSTRACT
As people are becoming increasingly dependent on computerized systems, the need
for these systems to be dependable is also increasing. However, programming dependable
systems is difficult, especially when parallelism is involved. This is due in part to the fact
that very few high-level programming languages support both fault-tolerance and parallel
programming.
This dissertation addresses this problem by presenting FT-Linda, a high-level language
for programming fault-tolerant parallel programs. FT-Linda is based on Linda, a language
for programming parallel applications whose most notable feature is a distributed shared
memory called tuple space. FT-Linda extends Linda by providing support to allow a
program to tolerate failures in the underlying computing platform. The distinguishing
features of FT-Linda are stable tuple spaces and atomic execution of multiple tuple space
operations. The former is a type of stable storage in which tuple values are guaranteed to
persist across failures, while the latter allows collections of tuple operations to be executed
in an all-or-nothing fashion despite failures and concurrency. Example FT-Linda programs
are given for both dependable systems and parallel applications.
The design and implementation of FT-Linda are presented in detail. The key technique
used is the replicated state machine approach to constructing fault-tolerant distributed
programs. Here, tuple space is replicated to provide failure resilience, and the replicas are
sent a message describing the atomic sequence of tuple space operations to perform. This
strategy allows an efficient implementation in which only a single multicast message is
needed for each atomic sequence of tuple space operations.
An implementation of FT-Linda for a network of workstations is also described.
FT-Linda is being implemented using Consul, a communication substrate that supports
fault-tolerant distributed programming. Consul is built in turn with the x-kernel, an
operating system kernel that provides support for composing network protocols. Each of
the components of the implementation has been built and tested.
CHAPTER 1
INTRODUCTION
Computers are being relied upon to assist humans in an ever-increasing number and
variety of ways. Further, these applications have become increasingly sophisticated and
powerful. To keep pace with the corresponding increased demand for faster computers,
computer manufacturers have built parallel machines that allow subproblems to be solved
simultaneously. Parallel programs running on these computers are now used to solve a
wide variety of problems, including scientific computations, engineering design programs,
financial transaction systems, and much more [Akl89, And91].
Parallel programming is much harder than sequential programming for a number of
reasons. Coordinating the processors is difficult and error-prone. Also, there are different
architectures that run parallel programs, and programs written on one cannot generally be
run on another. Finally, one of the processors involved in the parallel computation could
fail, leaving the others in an erroneous state.
Different abstractions and programming languages have been developed to help pro-
grammers cope with the difficulties of parallel programming. One such language is Linda,
which provides a powerful abstraction for communication and synchronization called tu-
ple space. However, Linda does not allow programmers to compensate for the failures of
a subset of the computers involved in a given computation. This limits its usefulness for
many long-running scientific computations and for dependable systems, which are ones
that are relied upon to function properly with a very high probability. This dissertation ad-
dresses Linda’s lack of provisions for dealing with failures by analyzing the failure modes
of typical Linda applications and by proposing extensions that allow Linda programs to
cope with such failures.
1.1 Motivation for Parallel Programming
Parallel programming involves writing programs that consist of multiple processes exe-
cuting simultaneously. This activity includes subdividing the problem to be solved into
subproblems, and then programming the sequential solution to each subproblem. The
latter requires programming the ways in which the processes that solve the subprob-
lems communicate and coordinate. This communication and coordination can be through
physically shared memory or by message passing.
There are three main motivations for parallel programming: performance, economics,
and dependability. First, while computers have been increasing in speed, the demands
of computer users have kept up with these increases, and in some cases surpassed them.
In fact, for many important problems, even today’s fastest uniprocessor supercomputer
cannot execute as fast as its users would like and could profitably use. Examples of
such problems include environmental and economic modelling, real-time speech and
sight recognition, weather forecasting, molecular modelling, and aircraft testing [Akl89].
Parallel programming enables these problems to be solved more efficiently by performing
multiple tasks simultaneously.
The second reason for parallel programming is economics: It is generally cheaper to
build a parallel computer with the same or even more aggregate computational power than
a fast uniprocessor. The former can be constructed of relatively inexpensive, off-the-shelf
components, because each processor can be much slower than the uniprocessor’s sole
processor. The latter, however, must use low-volume components constructed from more
exotic and expensive materials to achieve significantly better performance than can be
achieved with a single off-the-shelf component [HP90].
The third reason for parallel programming involves dependability. A system or com-
ponent is said to be dependable if reliance can justifiably be placed on the service it
delivers [Lap91]. Computers are increasingly relied upon in situations where their fail-
ures could result in economic loss or even loss of life. For example, the 1990 failure
of an AT&T long distance network was caused by the malfunction of a single computer
[Neu92, Jac90]. In such applications, the dependability of computers is a vital concern.
Architectures on which parallel programs are executed have redundancy inherent in their
multiple processors, and thus they offer at least the potential to tolerate some failures
while still providing the service or finishing the computation that has been assigned to
them. Uniprocessors, on the other hand, have only one processor and thus suffer from a
single point of failure, i.e., the failure of the processor can cause the failure of the entire
machine.
1.2 Architectures for Parallel Programming
A number of different architecturesfor parallel programming have been developed. They
vary mainly in the number of processors and the ways in which the processors can
communicate with each other. We examine three categories in the spectrum of parallel
computer architectures that together comprise the vast majority of parallel computers in use
today: shared memory multiprocessors, distributed memory computers, and distributed
systems.
The first architecture for parallel programming is the shared memory multiprocessor,
sometimes called a closely coupled machine or simply a multiprocessor. Multiprocessors
are parallel computers with multiple processors that share the same memory unit. The
processes in a parallel program running on such a computer communicate by reading from
and writing to a shared memory that is fast to access. Since processes on different CPUs
communicate through this low-latency shared memory, interprocess
communication (IPC) is very fast on these machines. However, the path to the shared
memory, its bus, is a bottleneck if too many processors are added, and thus multiprocessors
do not scale well.
The next architecture is the distributed memory computer, also known as the multi-
computer. Here, each processor has fast, private memory. However, a processor may
not access memory on another processor. Rather, processors communicate by sending
messages over high-speed message passing hardware. This message passing hardware
does not have a single bottleneck as a multiprocessor does, and thus multicomputers scale
better than multiprocessors in the number of processors. However, a multicomputer's IPC
latency is higher than a multiprocessor’s.
The third architecture for parallel programming is the distributed system. A dis-
tributed system consists of multiple computers connected by a network. However, unlike
the multiprocessor or the multicomputer, these computers are physically separated. Pro-
cesses on different computers typically communicate with latencies on the order of a few
milliseconds for computers in the same building and on the order of a few seconds or more
for computers thousands of miles away. This IPC latency is much higher than a
multicomputer's, but the physical separation of the processors in a distributed system
offers advantages for dependable computing. Examples of distributed systems include
workstation clusters connected by a local area network (LAN), machines on the Internet
cooperating to solve a problem, and the computers on an airplane used for controlling its
various functions such as adjusting its control surfaces and plotting its course.
1.3 Simplifying Parallel Programming
Parallel programming is increasing in both popularity and diversity, but writing parallel
programs is difficult for a number of reasons. First, parallel programs are simply more
complicated; they have to deal with concurrency, something that sequential programs do
not have to do [And91]. For example, a procedure performing an operation on a complex
data structure that operates correctly on a uniprocessor will generally not function properly
if multiple processes simultaneously call that procedure. A second reason that parallel
programming is more difficult concerns coping with partial failures, i.e., the failure of
part of the underlying computing platform. Some applications have to be able to tolerate
partial failures and keep executing correctly. This makes programming more difficult;
for example, the uniprocessor procedure mentioned above that operates on a complex
data structure would likely leave that data structure in an erroneous state if it failed
while executing in the middle of the procedure body. A final reason why writing parallel
programs is more difficult regards portability. Programs that are written for one kind of
processor or one kind of parallel architecture will not generally run on another. Portability
is a serious problem given the high cost of developing software and the frequency at which
new parallel computers become available.
Abstractions to simplify the task of the programmer fall in two main categories: fault
tolerance and process coordination. Fault tolerance abstractions provide powerful models
and mechanisms for the programmer to deal with failures in the underlying computing
platform [MS92, Jal94, Cri91]. Process coordination mechanisms allow multiple processes
to communicate and coordinate [And91]. These two categories overlap, but they
differ enough that they are described separately below.
1.3.1 Fault Tolerance Abstractions
Fault-tolerant abstractions allow a programmer to construct fault-tolerant software, i.e.,
software that can continue to provide service despite failures. These abstractions come
in many forms and are structured so that each abstraction is built on lower-level ones.
Failure models specify a component’s acceptable behavior, especially the ways in which
it may fail, i.e., deviate from its specified behavior. All other abstractions described
below assume a given failure model; i.e., they will tolerate specific failures. Fault-
tolerant services fall under two categories. One kind provides functionality similar to
that which standard hardware or operating systems provide but with improved semantics
in the presence of failures. The other kind of fault-tolerant service provides consistent
information to all processes involved in a parallel computation. Finally, fault tolerance
structuring paradigms are canonical program structuring techniques to help simplify the
construction of fault tolerant software.
Failure Models
Failure models provide a way for reasoning about the behavior of components in the
presence of failures by specifying assumptions about the effects of such failures. Failure
models form a hierarchy, from weak to strong [MS92]: a given failure model permits
all failures in all stronger models plus some additional ones. A weak failure assumption
assumes little about the behavior of the components and is thus closer to the real situation
with standard, off-the-shelf components. However, it is much more difficult for the
programmer to use, because there are many ways (and combinations thereof) in which the
components may fail. Indeed, one of the major challenges is to develop failure models
and other abstractions that are powerful enough to be useful yet simple enough to allow
an efficient implementation.
The weakest failure model is Byzantine or arbitrary [LSP82]. Here components may
fail in arbitrary ways. The timing failure model assumes a component will respond to
an input with the correct value, but not necessarily within a given timing specification
[CASD85]. The omission failure model assumes that a component may fail to respond
to an input [CASD85]. A stronger model yet is crash or fail-silent, where a component
fails without making any incorrect state transitions [PSB+88]. The strongest model is
fail-stop, which adds to the crash model the assumption that the component fails in a way
that is detectable by other components [SS83].
A system with a given failure model can be constructed from components with a weaker
failure model. For example, [Cri91] gives an example of constructing a storage service
that will tolerate one read omission failure from two elementary storage servers with
read omission failure semantics. Also, the TCP/IP protocol is built on top of unreliable
protocols, yet it provides reliable, ordered delivery of a sequence of bytes [Com88].
Fault-Tolerant Services
One category of fault-tolerant service provides functionality similar to standard hardware
or operating systems, but with improved semantics in the presence of failures. This
category includes stable storage, atomic actions, and resilient processes. The contents
of a stable storage are guaranteed to survive failures [Lam81]. Atomic actions allow
a number of computations to be seen as an indivisible unit by other processes despite
concurrency and failure [Lam81]. A resilient process can be restarted and then continue
to correctly execute even if the processor on which it is executing fails; techniques to
construct resilient processes, for example checkpointing, are discussed in [MS92].
The other category of fault-tolerant service provides consistent information to all
processes involved in a parallel program, which is especially necessary in a distributed
system with no memory or clock shared among the processors. The abstractions that
these services offer are generally concerned with determining and preserving the causal
relation among various events that occur on different processors in a distributed system
[Lam78]. That is, if events a and b occur at times T_a and T_b, respectively, could a have
affected the execution of b? This is simple to ascertain if both a and b occur on the same
processor, because T_a and T_b were obtained from the same local clock. However, if they
occurred on different machines in a distributed system, we cannot naively assume that the
clocks are synchronized to reflect potential causality.
The fault-tolerant services that provide consistent information include common global
time, multicast, and membership services. Common global time provides a distributed
clock service that maintains potential causality among events despite failures [RSB90].
For example, in the scenario above, if T_a ≥ T_b, and both were read from the common
global time service, then event a could not have affected the execution of b, because a did
not happen before b. A multicast service delivers a message to each process in a group of
processes such that the message is delivered to each process in a consistent order relative to
all other messages despite failures and concurrency [BJ87, CASD85, PBS89, GMS91].
all other messages despite failures and concurrency [BJ87, CASD85, PBS89, GMS91].
This is very important in many different kinds of fault-tolerant programs, especially those
constructed using the state machine approach described below. Finally, a membership
service provides processes in a group consistent information about the set of functioning
processors at any given time [VM90].
Fault Tolerance Structuring Paradigms
Fault tolerance structuring paradigms are canonical program structuring techniques that
have been developed in conjunction with the above abstractions and services to help pro-
grammers structure fault-tolerant distributed programs. As such, each paradigm applies
to a number and variety of different applications. This greatly reduces the complexity of
developing such programs.
Three major paradigms are object/action, primary/backup, and state machines. The
object/action paradigm involves passive objects that export actions, which are operations
to modify the long-lived state of the object [Gra86]. These actions are serializable,
which means that the effect of any concurrent execution of actions on the same object is
equivalent to some serial sequence. They are also recoverable, i.e. executed completely
or not at all. These actions are called transactions in the database context.
The primary/backup paradigm features a service implemented by multiple processes;
the primary process is active, while the backup processes are passive [AD76, BMST92].
Only the active process responds to requests for the service; in the case of the failure
of the primary process, one backup process will become the primary, starting from a
checkpointed state.
A replicated state machine consists of a collection of servers [Sch90]. Each client
request is multicast to each server replica, where it is processed deterministically. Because
each replica processes each request, the state machine approach is sometimes called active
replication, while the primary/backup paradigm is called passive replication. The state
machine approach is described in more detail in Section 2.4.
1.3.2 Process Coordination Mechanisms
Fault-tolerance abstractions help a programmer write parallel programs that can cope with
failures; process coordination mechanisms provide the programmer with techniques that
allow different processes to communicate and synchronize. Process coordination mecha-
nisms include remote procedure call, message passing, shared memories, and coordination
languages. These mechanisms may be provided to the programmer either through an ex-
plicitly parallel language or through libraries of procedures for a sequential language.
Parallel programming languages provide high-level abstractions for programmers to use.
These generally permit a cleaner integration of the sequential and concurrent features of
the language than do libraries of procedures. Examples of such languages include Ada
[DoD83], Orca [Bal90], and SR [AO93]. Libraries of procedures allow the program-
mer to leverage experience with an existing language and reuse existing code. They
can also support many languages, allowing processes written with different languages to
communicate.
Remote procedure call (RPC) [Nel81, BN84] is like a normal procedure call, ex-
cept the invocation statement and the procedure body are executed by two different
processes, potentially on different machines. RPC is a fundamental building block for
distributed systems today, including client/server interactions and many distributed op-
erating systems such as Amoeba [TM81]. It is also supported in many distributed
programming languages, including Aeolus [WL86], Argus [LS83], Avalon [HW87], and
Emerald [BHJL86]. Message passing allows processes on different computers to ex-
change messages to communicate and synchronize. PVM [Sun90] and MPI [For93b,
For93a] are two of the more well-known message passing libraries; many
parallel programming languages explicitly provide message passing as well [And91].
A shared memory is memory that can be accessed by more than one process. It
can be implemented in a multiprocessor by simply providing a shared memory bus that
connects all processors to the memory. A distributed shared memory (DSM) provides the
illusion of a physically shared memory that is available to all processes in a distributed
system [NL91]. The regions of the DSM are either replicated on all computers or are
passed among them as needed; this is done by the language’s implementation and is
transparent to the programmer. DSMs thus allow distributed systems and multicomputers
to be programmed much more like a multiprocessor. Munin is an example of a system
that implements a DSM [CBZ91].
Coordination languages augment an existing computational language such as FOR-
TRAN or C to allow different processes to communicate in a uniform fashion with each
other and with their environment [GC92]. A coordination language is completely orthog-
onal to the computational language. Coordination languages thus provide in a uniform
fashion many of the services such as process-process and process-environment commu-
nication typically provided by an operating system. A coordination language can thus be
used by a process to perform I/O, communicate with a user, or communicate directly with
another process, regardless of the location and lifetime of that process.
1.4 Linda
Linda is a coordination language that provides a collection of primitives for process
creation and interprocess communication that can be added to existing languages [Gel85,
CG89]. The main abstraction provided by Linda is tuple space (TS), an associative shared
memory. The abstraction is implemented by the Linda implementation transparently to
user processes.
The tuple space primitives provided by Linda allow processes to deposit tuples into TS
and to withdraw or read tuples whose contents match a pattern. Thus, tuple space provides
for associative access, in which information is retrieved by content rather than address.
Tuple space also provides temporal decoupling, because communicating processes do not
have to exist at the same time, and spatial decoupling, because processes do not have to
know each other's identity. These properties make Linda especially easy for application
programmers to use [CS93, ACG86, CG88, CG90]. Linda implementations are available
on a number of different architectures [CG86, Lei89, Bjo92, SBA93, CG93, Zen90]
and for a number of different languages [Lei89, Jel90, Has92, Cia93, SC91]. Linda
has been used in many real-world applications, including VLSI design, oil exploration,
pharmaceutical research, fluid-flow systems, and stock purchase and analysis programs
[CS93].
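
As a small, concrete illustration of this coordination style, the following sketch is written in the same pseudocode used for the figures in Chapter 2; the process names and the "reading" tuple are purely illustrative and are not drawn from any application in this dissertation. One process deposits a named tuple and another withdraws it by content, without either process naming the other or needing to exist at the same time.

process producer
    # deposit a tuple with logical name "reading" and integer value 72
    out("reading", 72)
end proc

process consumer
    int r
    # block until a tuple named "reading" with an integer field exists,
    # then withdraw it and assign that field to r
    in("reading", ?r)
end proc
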
Despite these advantages, one significant deficiency of Linda is that failures can cause
Linda programs to fail. For example, the semantics of tuple space primitives are not
well-defined should a processor crash, nor are features provided that allow programmers
to deal with the effect of such a failure. The impact of these omissions has been twofold.
First, programmers who use Linda to write general parallel applications cannot take
advantage of fault-tolerance techniques, for example, to recover a long-running scientific
application after a failure. Second, despite the language’s overall appeal, it has generally
been impossible to use Linda to write critical applications such as process control or
telephone switching in which dealing with failures is crucial.
1.5 FT-Linda
In this dissertation, we describe FT-Linda, a version of Linda that provides support for
programming fault-tolerant parallel applications. To do this, FT-Linda includes two major
enhancements: stable tuple spaces and support for atomic execution of TS operations.
The former is a type of stable storage in which the contents are guaranteed to persist
across failures; the latter allows collections of tuple operations to be executed in an all-
or-nothing fashion despite failures and concurrency. Also, we provide additional Linda
primitives to allow tuples to be transferred between tuple spaces atomically. The specific
design of these enhancements has been based on examining common program structuring
paradigms, both those used for Linda and those found in fault-tolerant applications, to
determine what features are needed in each situation. In addition, the design is in keeping
with the “minimalist” philosophy of Linda and permits an efficient implementation. These
features help distinguish FT-Linda from other efforts aimed at introducing fault-tolerance
into Linda [Xu88, XL89, Kam90, AS91, Kam91, CD94, CKM92, PTHR93, JS94].
FT-Linda is being implemented using Consul, a communication substrate for building
fault-tolerant systems [Mis92, MPS93b, MPS93a], and the x-kernel, an operating system
kernel that provides support for composing network protocols [HP91]. Stable TSs are
realized using the replicated state machine approach, where tuples are replicated on
multiple processors to provide failure resilience and then updated using atomic multicast.
Atomic execution of multiple tuple space operations is achieved by using a single multicast
message to update replicas. The focus in the implementation has been on tolerating
processor crash failures, though the language design is general and could be implemented
assuming other failure models. Further, the current implementation focuses on distributed
systems, because their inherent redundancy and physical separation make them attractive
candidates for dependable computing.
The major contributions of this dissertation are:
- An analysis of the failure modes of typical Linda programs and of Linda's atomicity
  requirements, i.e., at what level its operations should be realized in an "all or nothing"
  fashion despite failures and concurrency.

- Extensions to Linda to help tolerate failures, namely stable tuple spaces and atomic
  execution of multiple tuple space operations.

- An efficient implementation design for the resulting language, FT-Linda. An atomic
  sequence of tuple space operations requires only one multicast message, and FT-Linda's
  cost of processing an operation is comparable to that of an optimized Linda
  implementation.

Other contributions include better semantics for Linda, even in the absence of failures;
multiple tuple spaces with attributes that allow different kinds of tuple spaces to be created;
primitives to allow the transfer of tuples between different tuple spaces; and disjunction,
which allows one of a number of matching tuples to be selected.
1.6 Dissertation Outline
The remainder of this dissertation is organized as follows. Chapter 2 describes Linda
in detail and discusses its failure modes. Alternatives for extending Linda to deal with
failures are described; this also serves as a motivation for the next chapter and further
delineates the contributions of this dissertation.
Chapter 3 describes FT-Linda, our version of Linda that helps programs tolerate failures. It
describes in detail FT-Linda’s stable tuple spaces and atomic execution of multiple tuple
space operations. The way these features can be used to help tolerate failures is illustrated
using a series of examples.
Chapter 4 gives more examples of how both applications and system programs written
in FT-Linda can tolerate failures. It then shows how FT-Linda programs can handle the
failure of the main or master process.
Chapter 5 describes the implementation of FT-Linda. First it describes the compiler,
and then how the runtime system is implemented on top of Consul and the x-kernel.
It then describes initial performance results, optimizations, and future extensions to the
implementation.
Chapter 6 offers some concluding remarks and future research directions for FT-Linda.
CHAPTER 2
LINDA AND FAILURES
Linda is a coordination language that is simple to use yet powerful. However, problems
with its semantics and its lack of facilities to handle failures make it inappropriate for
long-running scientific applications and dependable systems, domains for which it would
otherwise be suitable [WL88, CGM92, PTHR93, BLL94].
This chapter provides the background information and motivation for our extensions
to Linda. First, it gives an overview of Linda. Next, it discusses semantic limitations of
Linda that are a problem even in the absence of failures. It then examines the problems
Linda programs have in the presence of failures, followed by design alternatives for
providing fault-tolerance to Linda; this motivates our particular design choices. These
design choices are tightly intertwined with the particular language extensions we offer
and with the way in which we implement them, topics for Chapters 3 and 5, respectively.
2.1 Linda Overview
Linda is a parallel programming language based on a communication abstraction known
as tuple space (TS) [Gel85, ACG86, CG89, GC92, CG90]. Tuple space is an associative
(i.e., content-addressable) unordered bag of data elements called tuples. Processes are
created in the context of a given TS, which they use as a means for communicating and
synchronizing. A tuple consists of a logical name and zero or more values. An example
of a tuple is ("N", 100, true). Here, the tuple consists of the logical name "N" and the data
values 100 and true. Tuples are immutable, i.e., they cannot be changed once placed in
the TS by a process.
The basic operations defined on a TS are the deposit and withdrawal of tuples. The
out operation deposits a tuple. For example, executing
    out("N", 100, true)

causes the tuple ("N", 100, true) to be deposited. As another example, if i equals 100 and
boolvar equals true, then the operation

    out("N", i, boolvar)

will also cause the same tuple, ("N", 100, true), to be deposited. An out operation is
asynchronous, i.e. the process performing the out need only wait for the arguments to
be evaluated and marshalled; it does not have to block until the corresponding tuple is
actually deposited into TS.
The in operation withdraws a tuple matching the specified parameters, which collec-
tively are known as the template of the operation. To match a tuple, the in must have the
same number of parameters, and each parameter must match the corresponding value in
the tuple. The parameters to in can either be actuals or formals. An actual is a value that
can be specified as either a literal value or with a variable's name; in the latter the value of
the variable will be used as the actual. To match a value in a tuple, the value of an actual
in a template must be of the same type and value as the corresponding value in the tuple.
A formal is either a variable or type name, and is distinguished from an actual by the
use of a question mark as the first character. Formals automatically match any value of
the same type in a tuple. If the formal is a variable name, then the operation assigns the
corresponding value from the tuple to the variable. This is the mechanism through which
different processes in a Linda program receive data from other processes. For example, if
i is declared as an integer variable and b as a boolean, then executing

    in("N", ?i, ?b)

will withdraw a tuple named "N" that has an integer as its first argument and a boolean as
its second (and last) one, and will assign to i and b the respective values from the tuple. If
there is no such tuple, then the process will block until one is present. As another example,
executing

    in("N", 100, true)

will withdraw a tuple named "N" whose second argument is an integer with value 100 and
whose third argument is a boolean with value true. Here, the parameters of in are all
actuals, so executing this operation does not give the program new data, like it would if it
had formal variables. Such an operation may, however, be useful for synchronization.
Formals are legal in out operations, but such usage is rare in practice. In this case,
because an out does not withdraw or inspect a tuple, the formal variable is not assigned
a value. Rather, a valueless type is placed in the appropriate field in the tuple; it must be
matched by an actual, not a formal, in the template for the corresponding Linda operation
that withdraws or inspects the tuple.
The Linda rd operation is like in except it does not withdraw the tuple. That is, it
waits for a matching tuple and assigns values from the tuple to formal variables in its
template. Unlike in, however, it leaves the matching tuple in TS.
Operations rdp and inp are non-blocking versions of rd and in, respectively. If an
appropriate tuple is found when one of these operations is executed, true is returned as
the functional value and the operation behaves like its blocking counterpart; if
no such tuple is found, false is returned and the values of any formal parameters remain
unchanged.
The final Linda primitive is eval, which is used for process creation. In particular,
invoking

    eval(func, func(args))

will create a process to execute func with args as parameters, and also deposit an active
tuple in TS. This tuple may not be matched by any TS operation, but when func returns
the tuple will become a normal "inactive" tuple containing the return value of func.
Few applications seem to need the coupling of process creation and tuple depositing
that eval offers, and it has a number of semantic and implementation problems [Lei89].
As a result, it is not examined further in this dissertation; FT-Linda provides an alternative
mechanism for process creation.
Finally, TS may outlive any process in the program and may even persist after termi-
nation of the program that created it. Thus, Linda allows temporal uncoupling because
two processes that communicate do not have to have overlapping lifetimes. Linda also
allows spatial uncoupling because processes need not know the identity of processes with
which they communicate (if they do need to know, a process identifier or handle can be
included in the tuple; see Section 4.1.2 for an example). These properties allow Linda
programs to be powerful, yet simple and flexible.
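
As a summary of these operations, the following fragment is a sketch in the same pseudocode style as the figures in this chapter; the tuple names ("config", "halt") and the variable names are illustrative only. It reads a shared tuple with rd without consuming it, and then uses the boolean result of the non-blocking inp to poll for a control tuple. Note that, under the best effort semantics discussed in the next section, such a poll is not guaranteed to find a matching tuple that exists elsewhere in TS.

process poller
    int limit
    bool stop
    stop := false
    # read, but do not withdraw, a shared configuration tuple
    rd("config", ?limit)
    while not stop do
        # ... perform one unit of work, bounded by limit ...
        # poll for a "halt" tuple; inp returns false if no match is found
        if inp("halt") then
            stop := true
        end if
    end while
end proc
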
2.2 Problems with Linda Semantics
While Linda has many advantages, it is not without its deficiencies. Consider the boolean
operations inp and rdp. When they return false, they are making a very strong statement
about the global state of TS: they are indicating that no matching tuple exists anywhere
in TS. This has historically been considered too expensive to implement for a distributed
system or on a multicomputer [Lei89, Bjo92]. Thus, implementations of Linda for such
systems generally do not offer inp and rdp or they offer a weaker best effort semantics.
Here, inp and rdp are not guaranteed to find a matching tuple; they may return false even
if a matching tuple exists in TS.
Also, because out is asynchronous, there is no guarantee as to when the tuple it
specifies will be deposited into TS. This, coupled with a best effort semantics for inp and
rdp, makes some Linda code deceptive. Consider the following example from [Lei89,
Bjo92], which is used to argue that inp and rdp should not be supported:
    process P1                      process P2
        out("A")                        in("B")
        out("B")                        if (inp("A")) then
                                            print("Must succeed!")
The problem is that a naive programmer may believe that this code is synchronized properly
so that the inp must succeed. However, this is not the case with asynchronous outs,
with best effort semantics for inp, or both. With asynchronous outs, the tuple "B" may
be deposited in TS before "A", so "A" is not yet present in TS when the inp checks for it.
And with best effort semantics for inp, inp is not guaranteed to find the tuple "A", even
if it is present in TS.
Note that these semantic problems occur even in the absence of failures. They are
discussed further in Section 3.3, where alternatives are provided.
2.3 Problems with Failures
Linda programs also have problems in the presence of failures. In particular, as noted in
Chapter 1, the effects of processor failures, i.e., crash failures, on TS are not considered in
standard definitions of the language or most implementations. In examining how Linda is
currently used to program parallel applications and would likely be used for fault-tolerant
applications, two fundamental deficiencies are apparent. The first is lack of tuple stability.
That is, the language contains no provisions for guaranteeing that tuples will remain
available following a processor failure. Given that tuple space is the means by which
processes communicate and synchronize, it is easy to imagine the problems that would
be caused if certain key tuples are lost or corrupted by failure. Moreover, a stable storage
facility is a key requirement for the use of many fault-tolerance techniques. For example,
checkpoint and recovery is a technique based on saving key values in stable storage so that
an application process can recover to some intermediate state following a failure [KT87].
This technique cannot be implemented with Linda in the absence of tuple stability, given
that TS is the only means for interprocess communication in a Linda program.
The second deficiency can be characterized as lack of sufficient atomicity. Informally,
a computation that modifies shared state is atomic if, from the perspective of other
computations, all its modifications appear to take place instantaneously despite concurrent
access and failures. In Linda, of course, the shared state is the TS and the computations in
question are TS operations. The key here is that Linda provides only single-op atomicity,
i.e., atomic execution for only a single TS operation. Thus, the intermediate states resulting
from a series of TS operations may be visible to other processes.
Providing multi-op atomicity, a means to execute multiple TS operations atomically, is
important for using Linda to program fault-tolerant applications. For example, distributed
consensus, in which multiple processes in a distributed system reach agreement on some
common value, is an important building block for many fault-tolerant systems [TS92].
    Initialization
        out("count", value)

    Inspection
        rd("count", ?value)

    Updating
        in("count", ?oldvalue)
        out("count", newvalue)

Figure 2.1: Distributed Variables with Linda
However, Linda with single-op atomicity has been shown to be insufficient to reach
distributed consensus with more than two processes in the presence of failures or with
arbitrarily slow (or busy) processors [Seg93]. The key is lack of sufficient (multi-op)
atomicity.
Even typical Linda programs cannot be structured to handle failures with only single-
op atomicity. To illustrate this, we consider specific problems that arise in two common
Linda programming paradigms: the distributed variable and the bag-of-tasks. Both these
paradigms are used to solve a wide variety of problems, meaning that deficiencies in these
two paradigms are applicable to a large class of Linda programs.
Distributed Variable
The simplest paradigm is the distributed variable. Here, one tuple containing the name
and value of the variable is kept in TS. Figure 2.1 shows how typical operations on such
a variable named count might be implemented. The first operation initializes the variable
count to value. To inspect the value of count, rd is used as shown. Finally, updating the
value involves withdrawing the tuple and depositing a new tuple with the new value. Note
that the tuple must be withdrawn with in and not just read with rd to guarantee mutually
exclusive access to the variable and uniqueness of the resulting tuple.
Unfortunately, if the possibility of a processor crash is considered, the protocol has a
window of vulnerability. Specifically, if the processor executing the process in question
fails after withdrawing the old tuple but before replacing it with the new one, that tuple
will be lost because it is present only in the volatile memory of the failed processor. We
call this problem the lost tuple problem. The result is that processes attempting to inspect
or modify the distributed variable will block forever. The problem is due to the inability
to execute the in and subsequent out as an atomic unit with respect to failures.
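
The window can be made concrete with the following interleaving, written in the same pseudocode style; the process names P and Q and the processor name H are illustrative. P runs the update protocol of Figure 2.1 on a processor that crashes between the in and the out, after which no "count" tuple exists anywhere in the system.

process P                        # runs on processor H
    in("count", ?oldvalue)       # tuple withdrawn; it now exists only in H's volatile memory
    newvalue := oldvalue + 1
    # <-- H crashes here: the tuple is irretrievably lost (the lost tuple problem)
    out("count", newvalue)       # never executed
end proc

process Q                        # runs on another processor
    rd("count", ?value)          # blocks forever; no matching tuple will ever reappear
end proc
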
process worker
    while true do
        in("subtask", ?subtask_args)
        calc(subtask_args, var result_args)
        for (all new subtasks created by this subtask)
            out("subtask", new_subtask_args)
        out("result", result_args)
    end while
end proc

Figure 2.2: Bag-of-Tasks Worker
Bag-of-Tasks
Linda lends itself nicely to a method of parallel programming called the bag-of-tasks
or replicated worker programming paradigm [ACG86, CG88]. In this paradigm, the
task to be solved is partitioned into independent subtasks. These subtasks are placed
in a shared data structure called a bag, and each process in a pool of identical workers
then repeatedly retrieves a subtask description from the bag, solves it, and outputs the
solution. In solving it, the process may use only the subtask arguments and possibly non-
varying global data, which means that the same answer will be computed regardless of the
processor that computes it and the time at which it is computed. Among the advantages
of this programming approach are transparent scalability, automatic load balancing, and
ease of utilizing idle workstation cycles [GK92, CGKW93, Kam94]. And, as we show in
Chapter 3, it can easily be extended to tolerate failures.
Realizing this approach in Linda is done by having the TS function as the bag. The
TS is seeded with subtask tuples, where each such tuple contains arguments that describe
the given subtask to be solved. The collection of subtask tuples can thus be viewed as
describing the entire problem.
The actions taken by a generic worker are shown in Figure 2.2. The initial step is
to withdraw a tuple describing the subtask to be performed; the logical name "subtask"
is used as a distinguishing label to identify tuples containing subtask arguments. The
worker computes the results, which are subsequently output to TS with the identifying
label "result". Also, any new subtask tuples that this subtask generates are placed into TS
(this would actually be done in the procedure calc, but is shown outside the procedure for
clarity). If the computation portion of any worker in the program generates new subtask
tuples, then we say that the solution uses a dynamic bag-of-tasks structure. If no new
subtask tuples are generated, then we call the solution a static bag-of-tasks structure; in
this case, a master process is assumed to subdivide the problem and seed the TS with all
appropriate subtask tuples.
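As a rough illustration of this structure (an assumption-laden sketch rather than Linda code: a Python queue stands in for TS, and calc is a stub), identical workers repeatedly pull subtask descriptions from a shared bag and emit results:

    import queue
    import threading

    bag = queue.Queue()      # stand-in for TS seeded with ("subtask", args) tuples
    results = queue.Queue()  # stand-in for the ("result", args) tuples in TS

    def calc(args):
        """Illustrative subtask: depends only on its arguments, so any worker
        computes the same answer at any time."""
        return args * args

    def worker():
        while True:
            label, args = bag.get()                  # in("subtask", ?args)
            if label == "stop":                      # illustrative termination convention
                break
            results.put(("result", calc(args)))      # out("result", result_args)

    # A master seeds the bag (static bag-of-tasks) and starts identical workers.
    for n in range(10):
        bag.put(("subtask", n))
    workers = [threading.Thread(target=worker) for _ in range(4)]
    for w in workers:
        w.start()
    for _ in workers:
        bag.put(("stop", None))
    for w in workers:
        w.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    print(sorted(collected))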
Termination detection—ascertaining that all subtasks have been computed—can be accomplished by any one of a number of techniques, including worker deadlock or by keeping a count of the number of subtasks that have been computed. The actual way in which the program is terminated is irrelevant for our purposes, and so is ignored hereafter.
The bag-of-tasks paradigm suffers from two problems when failures are considered.
The first is again the lost tuple problem. Specifically, if the processor fails after a worker
has withdrawn the subtask tuple but before depositing the result tuple, that result will
never be computed. This is because, during this period, the only representation of that
tuple is in the (volatile) memory of the processor. When the processor fails, then, that
subtask tuple is irretrievably lost.
The second is a somewhat different problem, which we call the duplicate
tuple problem. This problem occurs if the processor fails after the worker has generated
some new subtasks but before it has deposited the result tuple. In this case, assuming the
lost tuple problem is solved, another worker will later process the subtask, generating the
same new subtasks that are already in TS. Such an occurrence can lead to the program
producing incorrect results—for example, the process that consumes the result tuples may
expect a fixed number of result tuples—in addition to wasted computation. The cause of
the problem is again lack of sufficient atomicity. What is needed in this case is some way
to deposit all the new subtask tuples and the result tuple into TS in one atomic step.
2.4 Implementing Stability and Atomicity
Given the problems identified above, the challenge is to develop reasonable approaches
to implementing stable TSs and atomic execution within Linda. For stable TSs, choices
range from using hardware devices that approximate the failure-free characteristics of
stable storage (e.g., disks) to replicating the values in the volatile memory of multiple
processors so that failure of (some number of) processors can be tolerated without losing
values. In situations where stable values must also be shared among multiple processors
as is the case here, replication is a more appropriate choice.
To realize a replicated TS, we use the replicated state machine approach (SMA)
introduced in Chapter 1. In this technique, an application is implemented as a state
machine that maintains state variables, which encode the application’s state. These
variables are modified by commands that the state machine exports; a client of the state
machine sends it a request to execute a command. To provide resilience to failures, the
state machine is replicated on multiple independent processors, and an ordered atomic
multicast, also mentioned in Chapter 1, is used to deliver commands to all replicas reliably
and in the same order.² For commands that require a reply, one or more state machine replicas will send a reply message to the client. If the commands are deterministic and executed atomically with respect to concurrent access, then the state variables of each replica will remain consistent. The SMA is the basis for a large number of fault-tolerant distributed systems [BSS91, MPS93a, Pow91].

² This ordering can be relaxed in some cases; see [Sch90].
    Figure 2.3: State Machine Approach
The SMA is illustrated in Figure 2.3. Here the M clients use atomic multicast to submit commands to the N replicas of the state machine. This ensures that the state machines receive the same sequence of command messages, even in the presence of failures and concurrency.
To use the SMA to provide tuple stability for Linda, then, each state machine replica
maintains an identical copy of TS. These copies, which we call tuple space replicas, will
be kept consistent with each other, because each replica receives the same sequence of
TS operations and processes them in the same, deterministic manner. Having multiple
identical copies of TS thus allows the failure of some of these copies to be tolerated while
still preserving the TS abstraction at higher levels.
Given the use of replication to realize stable TSs, the next step is to consider schemes
for implementing atomic execution of multiple tuple operations that use this TS. Such
a scheme must guarantee, in effect, that either all or none of the TS operations are
executed at either all or none of the functioning processors hosting copies of the tuple
space. Additionally, other processes must not be allowed concurrent access to TS while
an update is in progress.
A number of schemes would satisfy these requirements. For example, techniques
based on the two-phase commit protocol for implementing general database transactions
could be used [Gra78, Lam81]. While sufficient, these techniques are expensive, requiring
multiple rounds of message passing between the processors hosting replicas. At least part
of the reason for the heavyweight nature of the technique is that it supports atomic
execution of essentially arbitrary computations. While important in a database context,
such a facility is stronger than necessary in our situation where only simple sequences of tuple operations require atomicity. Accordingly, a simpler scheme is desired even if it
provides a less general solution.
We believe a good compromise is to exploit the characteristics of the replicated state
machine approach to implement atomicity. Recall that each command in this scheme is
executed atomically; that is, a command is considered a single unit that is either applied as
a whole or not at all, is applied at either all functioning processors or none (as guaranteed
by the atomic multicast), and is executed without interleaving by each replica. Given
this, a simple scheme for atomically executing multiple TS operations is to treat the entire
sequence as, in essence, a single state machine command. The operations are disseminated
together to all replicas in a single multicast message. This is then executed at each TS
replica by its TS manager, the process that operates on the TS replica, as dictated by the
serial order in which commands are processed by each TS manager. This technique has
the virtue of being simple to implement and requires fewer messages than the transactional approach,³ while still supporting the level of atomicity needed to realize fault tolerance in many Linda applications.

³ This comparison assumes that the transaction's data or logs are replicated. If not, then the transactional approach may require fewer messages (zero to FT-Linda's one). However, in this case the fault-tolerance guarantees of the transaction will be far weaker than FT-Linda's, because the failure of a single device or computer can halt the entire transactional system.
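A small Python sketch may help make this concrete; the TSReplica class, the multicast stub, and the command format are assumptions used purely for illustration. Each replica's manager applies an entire sequence of TS operations as a single command, so all replicas see the same command sequence and remain consistent:

    from typing import List

    class TSReplica:
        """One tuple space replica, managed by its TS manager (illustrative)."""
        def __init__(self):
            self.tuples: List[tuple] = []

        def apply_command(self, ops: List[tuple]) -> None:
            # An entire sequence of TS operations arrives as one command
            # (one multicast message) and is applied without interleaving.
            for op, tup in ops:
                if op == "out":
                    self.tuples.append(tup)
                elif op == "in":
                    self.tuples.remove(tup)   # assumes a matching tuple exists

    def multicast(replicas: List[TSReplica], ops: List[tuple]) -> None:
        """Stand-in for an ordered atomic multicast: every replica receives the
        same commands in the same order."""
        for r in replicas:
            r.apply_command(ops)

    replicas = [TSReplica() for _ in range(3)]
    # e.g., atomically withdraw a subtask tuple and deposit an in-progress record
    multicast(replicas, [("out", ("subtask", 7))])
    multicast(replicas, [("in", ("subtask", 7)), ("out", ("in_progress", "hostA", 7))])
    assert all(r.tuples == [("in_progress", "hostA", 7)] for r in replicas)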
This broad outline of an implementation strategy serves not only to describe alterna-
tives, but also to motivate the specific design of our extensions to Linda for achieving
fault tolerance. The extensions and implementation are perhaps more sophisticated than
implied by this discussion—for example, provisions for synchronization and a limited
form of more general atomic execution are also provided—yet the overall design philosophy follows the two precepts inherent in the above discussion. First, the extensions must
provide enough functionality to allow convenient programming of fault-tolerant applica-
tions in Linda. Second, the execution cost must be kept to a minimum. The trick has
been to balance the often conflicting demands of these two goals, while still providing
mechanisms that preserve the design philosophy and semantic integrity established by the
original designers of Linda.
2.5 Summary
This chapter describes Linda, a coordination language that offers both simplicity and
power. Its features are described, and problems with its semantics and with failures are
then discussed. Alternatives for providing fault tolerance to Linda are outlined, and our
design based on the replicated state machine approach is described.
Linda’s main abstraction is tuple space, an associative, unordered bag of tuples. Linda
provides primitives to deposit tuples into tuple space as well as to withdraw and inspect
tuples that match a specifiedtemplate. These primitives can be used to allow the processes
in a parallel program to communicate and synchronize in a simple yet powerful manner.
Linda’s semantics are lacking, however, even in the absence of failures. Its best effort
semantics means that the boolean primitives are not guaranteed to find a matching tuple
even though one exists. And Linda’s asynchronous outs mean that tuples generated by a
single process may appear in tuple space in a different order than specified by the process.
Linda programs also have problems in the presence of failures. The language lacks tuple stability,
meaning that tuples are not guaranteed to survive a processor failure. Linda’s single-op
atomicity is also inadequate for handling failures. With many common Linda paradigms,
a tuple is withdrawn and represented only in the volatile memory of a process. If this
process fails, the tuple will be irretrievably lost.
There are a number of ways in which stability and multi-op atomicity could be provided to Linda. Stability could be provided by using stable storage or by replicating the values
on multiple processors. For FT-Linda, we use the latter approach, the replicated state
machine approach. Multi-op atomicity could be provided by either adding transactions
to Linda or by allowing a (less general) sequence of TS operations to be executed by
replicated TS managers. We use the latter approach, which is also the replicated state
machine approach.
CHAPTER 3
FT-LINDA
FT-Linda is a variant of Linda designed to facilitate the construction of fault-tolerant
applications by including features for tuple stability, multi-op atomicity, and strong se-
mantics [BS91, BS94, SBT94]. The system model assumed by the language consists of a
collection of processors connected by a network with no physically shared memory. For
example, it could be a distributed system in which processors are connected by a local-area
network, or a multicomputer such as a hypercube with a faster and more sophisticated
interconnect. FT-Linda could be implemented on a multiprocessor, but the single points
of failure that multiprocessors have would make it largely pointless.
Processors are assumed to suffer only fail-silent failures, in which execution halts
without undergoing any incorrect state transitions or generating spurious messages. The
FT-Linda runtime system, in turn, converts such failures into fail-stop failures [SS83] by
providing failure notification in the form of a distinguished failure tuple that gets deposited
into TS. We also assume that processors remain failed for the duration of the computation
and are not reintegrated back into the system;¹ extensions to allow such reintegration are considered in Section 5.8.1.

¹ Note that the computation that was running on the failed processor can—and often will—be recovered using another physical processor.
To support the main additions to the language, FT-Linda has provisions for creating processes and for allocating unique system-wide process identifiers. A process is created using the create function, which takes as arguments the name of the function to be started as a process, the host machine on which to start it, a logical process ID (LPID), and initialization arguments. There must be a 1:1 correspondence between processes and LPIDs at any given time, but different processes may assume the same LPID at different times. A unique LPID is typically generated prior to invoking create by using the new_lpid routine. A process can determine its LPID using the my_lpid routine.
The remainder of this chapter is organized as follows. First, it discusses FT-Linda's provisions for stable tuple spaces and atomic execution of multiple tuple space operations. It then gives the semantics of FT-Linda's tuple spaces and discusses related work. Finally, it describes possible extensions to FT-Linda.
3.1 Stable Tuple Spaces
To address the problem of data stability, FT-Linda includes the ability to define a stable
tuple space. However, not wanting to mandate that every application use such a TS
given its inherent implementation overhead, this abstraction is included as part of more
encompassing provisions. Specifically, FT-Linda allows the programmer to create and
use an arbitrary number of TSs with varying attributes. The programmer specifies these
attributes when creating the TSs, and a TS handle is returned to facilitate access to that
TS. This TS handle is a first-class object that is subsequently passed as the first argument
to other TS primitives such as in and out, placed into tuples, etc.
FT-Linda currently supports two attributes for tuple spaces: resilience and scope.² The resilience attribute, either stable or volatile, specifies the behavior of the TS in the presence of failures. In particular, tuples in a stable TS will survive processor failures, while those in a volatile TS have no such guarantee. The number of processor failures that can be tolerated by a stable TS without loss or corruption of tuples depends on the number of copies maintained by the implementation, a parameter specified at system configuration time. Given N such copies, tuples will survive given no more than N − 1 failures, assuming no network partitions.³

² Other possible attributes are discussed in Section 3.5.1.
³ Section 5.8.3 discusses handling network partitions.
The scope attribute, either shared or private, indicates which processes may access a
given TS. A shared TS can be used by any process; such a TS is analogous to the single
TS in current versions of Linda. A private TS, on the other hand, may be used only by the
single logical process whose LPID is specified as an argument in the TS creation primitive
(described below). As already noted, a process can only have a single LPID at a time, and
only one process in the system at a time can have a given LPID.
Allowing access to private TSs based on the notion of a logical process allows the
work of a process that has failed to be taken over by another newly-created process. To
do this, the failure is first detected by waiting for the failure tuple associated with the
processor on which it was executing. At this point, a new process is created with the same
LPID. Once this is done, the new process can use any of the private TSs that were being
used by the failed process, assuming, of course, that they were also declared to be stable.
Such a scenario is demonstrated in Section 4.1.2.
A single stable shared TS is created when the program is started, and can be accessed using the handle TS_main. Other tuple spaces are created using the FT-Linda primitive ts_create. This function takes the resilience and scope attributes as arguments and returns a TS handle. A third argument, the LPID of the logical process that can access the TS, is required in the case of private TSs. To destroy a TS, the primitive ts_destroy is called with the appropriate handle as argument. Subsequent attempts to use the handle result in an exceptional condition.
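A rough Python model of these creation and destruction primitives might look as follows; the class, attribute names, and access check are assumptions for illustration and are not FT-Linda's actual interface.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class TSHandle:
        """Illustrative tuple space handle carrying its attributes."""
        resilience: str                      # "stable" or "volatile"
        scope: str                           # "shared" or "private"
        owner_lpid: Optional[int] = None     # required when scope == "private"
        tuples: List[tuple] = field(default_factory=list)
        destroyed: bool = False

    def ts_create(resilience: str, scope: str, lpid: Optional[int] = None) -> TSHandle:
        if scope == "private" and lpid is None:
            raise ValueError("a private TS needs the LPID of its owning logical process")
        return TSHandle(resilience, scope, lpid)

    def ts_destroy(ts: TSHandle) -> None:
        ts.destroyed = True                  # later use would raise an exceptional condition

    def check_access(ts: TSHandle, lpid: int) -> None:
        if ts.destroyed:
            raise RuntimeError("TS has been destroyed")
        if ts.scope == "private" and ts.owner_lpid != lpid:
            raise PermissionError("private TS may only be used by its owning logical process")

    # Usage, mirroring the text: a shared stable TS plus a private scratch TS.
    TS_main = ts_create("stable", "shared")
    TS_scratch = ts_create("volatile", "private", lpid=42)
    check_access(TS_scratch, 42)             # succeeds for the owning logical process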
As noted in Chapter 2, stability is implemented by replicating tuples on multiple
machines. As a result, a TS created with the stable attribute, whether shared or private, is
also called a replicated tuple space. Conversely, a TS created with attributes volatile and
private is called a local tuple space, because its tuples are only stored on the processor on
which the TS was created, or it is called a scratch tuple space, because it is often used to
hold intermediate results that will either be discarded or merged with a shared TS using
primitives described below.
Different types of TSs have different uses. For example, a stable and private TS can
be used by a process such as a server that must have some of its state survive failure so
that it can be reincarnated. Replication is necessary for this type of TS, even though it is
not shared, because it must be stable. An example of this use is given in Section 4.1.2. A
scratch TS, on the other hand, need not be replicated and thus can have very fast access
times. Additionally such a TS can be used in conjunction with the provisions for atomic
execution of TS operations introduced shortly to provide duplicate atomicity.
3.2 Features for Atomic Execution
Two features are provided in FT-Linda to support atomic execution: atomic guarded
statements and atomic tuple transfer primitives. Atomic guarded statements are used
to execute sequences of TS operations atomically, potentially after blocking; the atomic
tuple transfer primitives move and copy allow collections of tuples to be moved or copied
between tuple spaces atomically. Each is addressed in turn below.
3.2.1 Atomic Guarded Statement
An atomic guarded statement (AGS) provides all-or-nothing execution of multiple tuple
operations despite failures or concurrent access to TS by other processes.
Simple Case
The simplest case of the AGS is

    ⟨ guard ⇒ body ⟩

where the angle brackets are used to denote atomic execution. The guard can be any of in, inp, rd, rdp, or true, while the body is a series of in, rd, out, move, and copy
operations, or a null body denoted by skip. The process executing an AGS is blocked
until the guard either succeeds or fails, as defined below. If it succeeds, the body is then
executed in such a way that the guard and body are an atomic unit; if it fails, the body is
not executed. In either case, execution continues with the next statement after the AGS.
Informally, a guard succeeds if either a matching tuple is found or the value true is
returned. The specifics are as follows. A true guard succeeds immediately. A guard of
in or rd succeeds once there is a matching tuple in the named TS, which may happen
immediately, at some time in the future, or never. A guard of inp or rdp succeeds if there
is a matching tuple in TS when execution of the AGS begins. Conversely, a guard fails if
the guard is an inp or rdp and there is no matching tuple in TS when the AGS is executed.
A boolean operation used as a guard may be preceded by not, which inverts the success semantics for the guard in the expected way. Note that in this case, execution of an AGS may have an effect even though the guard fails and the body is not executed; for example, if the failing guard is "not inp(...)", a matching tuple still gets withdrawn from TS and
any formals assigned their corresponding values. However, the body will not be executed
because the guard failed.
An atomic guarded statement with boolean guards can also be used within an expres-
sion. The value of the statement in this case is taken to be true if the guard succeeds and
false otherwise. This facility can be used, for example, within the boolean of a loop or
conditional statement to control execution flow.
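To pin down these rules, the following Python sketch evaluates a single guard/body pair atomically; the data structures are assumptions, and blocking guards are simplified to a retry loop rather than a real wait.

    import threading

    ts_lock = threading.Lock()     # one lock stands in for atomic execution at the TS
    tuple_space = []               # the named TS, as a simple list of tuples

    def match(template):
        """Return the oldest tuple whose fields equal the template's actuals
        (None in the template acts as a formal that matches anything)."""
        for t in tuple_space:
            if len(t) == len(template) and all(a is None or a == b
                                               for a, b in zip(template, t)):
                return t
        return None

    def ags(guard_kind, template, body):
        """Evaluate <guard => body>: if the guard succeeds, run the body in the
        same critical section; return whether the guard succeeded."""
        while True:
            with ts_lock:
                found = None if guard_kind == "true" else match(template)
                if guard_kind in ("inp", "rdp") and found is None:
                    return False                   # boolean guard fails immediately
                if guard_kind == "true" or found is not None:
                    if guard_kind in ("in", "inp") and found is not None:
                        tuple_space.remove(found)  # withdrawing guards consume the tuple
                    for op in body:                # body ops execute without interleaving
                        op(tuple_space)
                    return True
            # blocking guard (in/rd) with no match yet: retry outside the lock

    tuple_space.append(("count", 0))
    ok = ags("in", ("count", None),
             body=[lambda ts: ts.append(("count", 1))])  # e.g., update the variable
    print(ok, tuple_space)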
Only one operation—the guard—is allowed to block in an AGS. Thus, if an in or rd
in the body does not find a matching tuple in TS, an exceptional condition is declared and
the program is aborted. The implementation strategy also leads to a few other restrictions
on what can be done in the body, most of which involve data flow between local and
replicated TSs. These restrictions are explained further in Section 5.5.1.
Finally, our implementation strategy dictates that Linda TS operations not appear outside of an AGS. Table 3.1 gives the FT-Linda equivalent of standard Linda TS operations; in it, other_op may be in, inp, rd, or rdp. It would be easy to implement a preprocessor to translate the standard Linda operations into these equivalents if desired. For convenience, we use the standard Linda notation for single TS operations below.

    Linda op          FT-Linda equivalent
    out(...)          ⟨ true ⇒ out(...) ⟩
    other_op(...)     ⟨ other_op(...) ⇒ skip ⟩

    Table 3.1: Linda Ops and their FT-Linda Equivalents
Using Atomic Guarded Statements
The AGS can be used to solve atomicity problems of the sort demonstrated earlier in
Chapter 2. For example, consider the lost tuple problem that occurs in the bag-of-tasks
paradigm when a failure interrupts execution after a subtask tuple has been withdrawn
but before the result tuple is deposited. Recall that the essence of the problem was that
there was a window of vulnerability where the subtask was not represented in some form
in TS. Specifically, it was not in TS in some form after the worker withdrew the subtask
tuple and before it deposited the corresponding result tuple. We remove this window of
vulnerability by maintaining a version of the subtask tuple in TS while the subtask is being
solved. This will ensure that the subtask is represented in TS in some form at all times,
and thus it can be recovered if the worker process fails.
To implement this solution in FT-Linda, then, an in_progress tuple is deposited into TS atomically when the subtask tuple is withdrawn, and then removed atomically when the result tuple is deposited. This in_progress tuple completely describes the subtask tuple. It also indicates the host on which the worker is executing, so if that host fails it can be ascertained which subtasks need to be recovered from their in_progress counterparts.

    # TS_main is {stable, shared}
    process worker()
        while true do
            ⟨ in(TS_main, "subtask", ?subtask_args) ⇒
                out(TS_main, "in_progress", my_hostid, subtask_args)
            ⟩
            calc(subtask_args, var result_args)
            ⟨ in(TS_main, "in_progress", my_hostid, subtask_args) ⇒
                out(TS_main, "result", result_args)
            ⟩
        end while
    end worker

    Figure 3.1: Lost Tuple Solution for (Static) Bag-of-Tasks Worker
The code for the static version of the bag-of-tasks worker demonstrating this technique is shown in Figure 3.1. Here, the worker deposits the in_progress tuple atomically with withdrawing the subtask tuple. It later withdraws this in_progress tuple atomically with depositing the result tuple. This scheme ensures that the subtask is represented in exactly one form in TS_main at all times, either as a subtask tuple, an in_progress tuple, or a result tuple.
To complete this example, we also now consider the problem of regenerating the lost subtask tuple from the in_progress tuple. This job is performed by a monitor process. In general, applications need a set of monitor processes for each window of vulnerability, a region of code where a failure would require some recovery, e.g., the regeneration of the subtask tuples from in_progress ones. One monitor process is created on each machine hosting a TS replica. This ensures that the failure guarantees for monitor processes are exactly as strong as those for stable TSs: there is at least one monitor process from a given set that has not failed if and only if there is at least one TS replica that has not failed. Having some TS replica hosts without a monitor process is possible but provides weaker failure guarantees. In particular, a stable TS might still be available in this case yet essentially inaccessible because all monitor processes had failed.
The monitor process for the bag-of-tasks worker is given in Figure 3.2. The monitor waits for a failure tuple indicating the failure of a host and then attempts to recover subtask tuples for all in_progress tuples associated with the failed host. Note that, even though all monitor processes execute this code, each subtask tuple will be recovered only once due to the atomic semantics of the AGS. The failure identifier passed to the monitor process upon initialization is used to match particular failure tuples with monitor processes. This identifier is generated by a call to an FT-Linda routine, which also registers it with the language runtime system. When a host failure occurs, one failure tuple is deposited into TS_main for each registered failure identifier.

    process monitor(failure_id)
        while true do
            in(TS_main, "failure", failure_id, ?host)
            # recover "subtask" tuples from the failed host
            while ⟨ inp(TS_main, "in_progress", host, ?subtask_args) ⇒
                      out(TS_main, "subtask", subtask_args)
                  ⟩ do
                noop
            end while
        end while
    end monitor

    Figure 3.2: Bag-of-Tasks Monitor Process
Note that to guarantee recovery from worker failures, the failure IDs of monitor
processes must be registered before any workers are created. Otherwise, a window of
vulnerability would exist between the creation of the worker and registration of the failure
ID. Should the worker fail in this case, no monitor process would receive notification of the failure, and any subtask the worker was executing when it failed would thus not be recovered.
Note also that each monitor process is guaranteed to get exactly one failure tuple
for each failure. This property is guaranteed by Consul’s membership service, which
generates one failure notification per failure event. This allows the monitor processes to
keep consistent information about failures, e.g. the crash counts for each host.
Further ways in which the atomic guarded statement can be used are demonstrated in
Chapter 4.
Disjunctive Case
The AGS has a disjunctive case that allows more than one guard/body pair, as shown in Figure 3.3.

    ⟨ guard_1 ⇒ body_1
      or guard_2 ⇒ body_2
      or ...
      or guard_n ⇒ body_n
    ⟩

    Figure 3.3: AGS Disjunction

A process executing this statement blocks until at least one guard succeeds, or
all guards fail. To simplify the semantics, and to fit into a programming language’s type
system, the guards in a given statement must all be the same type of operation, i.e., all
true, all blocking operations (in or rd), or all boolean operations (inp or rdp). In addition,
if the guards are boolean, either all or none of the guards may be negated. If the guards
are all true, then all guards succeed immediately; in this case, the first is chosen and the
corresponding body executed. If the guards are all blocking, then the set of guards that
would succeed if executed immediately—that is, those for which there is a matching tuple
in the named TS when the statement starts—is determined. If the size of that set is at least
one, then one is selected deterministically by the implementation, and the corresponding
guard and body executed. Otherwise, the process executing the AGS blocks until a guard
succeeds. If the guards are all boolean, then the set of guards that would succeed at the
time execution is commenced is again determined. If the size of the set is at least one,
then the selection and execution is done as before. If, however, the set is empty, the AGS
will immediately return false and no body will be executed.
An example using disjunction is given in Section 4.1.1.
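A compact Python sketch of this selection rule follows; the guard representation and the lowest-index tie-break are illustrative assumptions.

    def eligible_guards(guards, tuple_space):
        """Return the indices of guards that would succeed immediately:
        a 'true' guard always succeeds; in/rd/inp/rdp succeed if a match exists."""
        def matches(template):
            return any(len(t) == len(template) and
                       all(a is None or a == b for a, b in zip(template, t))
                       for t in tuple_space)
        return [i for i, (kind, template) in enumerate(guards)
                if kind == "true" or matches(template)]

    def select_branch(guards, tuple_space):
        """Deterministically pick one eligible branch (here: the lowest index).
        Returns the branch index, or None when boolean guards all fail
        (a blocking statement would instead wait until some guard succeeds)."""
        ready = eligible_guards(guards, tuple_space)
        return ready[0] if ready else None

    ts = [("result", 9)]
    guards = [("inp", ("subtask", None)),   # no matching tuple
              ("inp", ("result", None))]    # matches, so branch 1 is chosen
    print(select_branch(guards, ts))        # -> 1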
3.2.2 Atomic Tuple Transfer
FT-Linda provides primitives that allow tuples to be moved or copied atomically between TSs. The two primitives are of the form

    transfer_op(TS_from, TS_to [, template])

Here, transfer_op is either move or copy, TS_from is the source TS, and TS_to is the destination TS. The template is optional and consists of a logical name and zero or more arguments, i.e., exactly what would follow the TS handle in a regular FT-Linda TS operation. If the template is present, only matching tuples are moved or copied; otherwise, the operation is applied to all the tuples in the source TS. Also, because the template in a transfer command may match more than one tuple, any formal variables in template are used only for their type, i.e., they are not assigned a new value by the operation.
Although similar to a series of in (or rd) and out operations, these two primitives provide useful functionality, even independent of their atomic aspect. To see this, note that a

    move(TS_from, TS_to)

would move the same tuples as executing

    in(TS_from, ?t)
    out(TS_to, t)

for each tuple t in TS_from, assuming no other process accesses TS_from while these in-out pairs are in progress.⁴ Moreover, even if the tuples in TS_from are all of the same tuple signature, an ordered list of types, the above move is likely to be much more efficient than its Linda equivalent:

    while inp(TS_from, ?t) do
        out(TS_to, t)

as we shall see in Chapter 5. Of course, move and copy are also atomic with respect to both failures and concurrency, while the equivalent sequence of inps and outs would not be, even if combined in a series of AGSs:

    while ( ⟨ inp(TS_from, ?t) ⇒ out(TS_to, t) ⟩ ) do
        noop

In this case, other processes could observe the intermediate steps here. That is, they could have access to TS_to when the move operation being simulated was only partially finished, i.e., when some but not all of the tuples had been moved.
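The contrast can be sketched in Python as follows; the lock-based atomicity and the template convention are assumptions chosen only to illustrate the difference between an atomic move and a tuple-at-a-time loop.

    import threading

    ts_lock = threading.Lock()

    def matches(tup, template):
        """Template matching by equality; None plays the role of a formal."""
        return (template is None or
                (len(tup) == len(template) and
                 all(a is None or a == b for a, b in zip(template, tup))))

    def move(ts_from, ts_to, template=None):
        """Atomic move: no other process (thread) can observe a state in which
        only some of the matching tuples have been transferred."""
        with ts_lock:
            moved = [t for t in ts_from if matches(t, template)]
            for t in moved:
                ts_from.remove(t)
            ts_to.extend(moved)

    def move_one_at_a_time(ts_from, ts_to, template=None):
        """Equivalent in/out loop: between iterations, other processes could
        see a partially completed transfer."""
        while True:
            with ts_lock:
                found = next((t for t in ts_from if matches(t, template)), None)
                if found is None:
                    return
                ts_from.remove(found)
                ts_to.append(found)

    scratch, main = [("subtask", 1), ("result", 2)], []
    move(scratch, main)
    print(scratch, main)   # [] [('subtask', 1), ('result', 2)]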
As an example of how a move primitive might be used in practice, consider the dynamic bag-of-tasks application from Chapter 2. Recall that this particular paradigm suffered from both the lost tuple and the duplicate tuple problems. A way to solve these problems is shown in Figure 3.4. This is similar to the static case shown in Figure 3.1 except that here, TS_scratch is a scratch TS that the worker uses to prevent duplicate tuples. To do this, all new subtask tuples as well as the result tuple are first deposited into this TS. Then, the in_progress tuple is removed atomically with the moving of all the tuples from TS_scratch to TS_main. If the worker fails before executing the final AGS that performs this task, the subtask will be recovered as before, and another worker will compute the same result and generate the same new subtasks. In this case, any result and subtask tuples in TS_scratch will be lost, of course, which is desirable. However, if the worker fails after the final AGS, then the new subtask tuples are already in a stable TS. Finally, note that a monitor process similar to that for the static worker case, given in Figure 3.2, would be needed here as well. The complete source code for this example is in Appendix E.

⁴ This example is not legal Linda because it treats tuples as first-class objects. However, it serves to make the point.
    process worker()
        TS_scratch := ts_create(volatile, private, my_lpid())
        while true do
            ⟨ in(TS_main, "subtask", ?subtask_args) ⇒
                out(TS_main, "in_progress", my_hostid, subtask_args)
            ⟩
            calc(subtask_args, var res_args)
            for (all new subtasks created by this subtask)
                out(TS_scratch, "subtask", new_subtask_args)
            out(TS_scratch, "result", res_args)
            ⟨ in(TS_main, "in_progress", my_hostid, subtask_args) ⇒
                move(TS_scratch, TS_main)
            ⟩
        end while
    end worker

    Figure 3.4: Fault-Tolerant (Dynamic) Bag-of-Tasks Worker
3.3 Tuple Space Semantics
FT-Linda offers semantics that rectify the shortcomings discussed in Section 2.2. These
semantics are simple to provide, given that the SMA guarantees that each TS replica
receives the same sequence of TS operations, and by the way each replica processes these
operations.
First, inp and rdp in our scheme provide absolute guarantees as to whether there is a
matching tuple in TS, a property that we call strong inp/rdp semantics. That is, if inp or
rdp returns false, FT-Linda guarantees that there is no matching tuple in TS. Of all other
distributed Linda implementations of which we are aware, only PLinda [AS91, JS94]
and MOM [CD94] offer similar semantics. Strong inp/rdp semantics can be very useful
because they make a strong statement about the global state of the TS and hence of the
parallel application or system built using FT-Linda.
FT-Linda also provides oldest matching semantics, meaning that in, inp, rd, and rdp
always return the oldest matching tuple if one exists. These semantics are exploited in
the disjunctive version of an AGS as well to select the guard and body to be executed if
more than one guard succeeds. Oldest matching semantics can be very useful for some
applications, as shown in Section 4.1.2.
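A minimal Python illustration of oldest-matching selection, under the assumption that tuples are kept in arrival order in a plain list, is:

    tuple_space = []                       # tuples kept in the order they were deposited

    def out(*tup):
        tuple_space.append(tuple(tup))

    def rd_oldest(template):
        """Return the oldest tuple matching the template (None fields are formals)."""
        for t in tuple_space:              # scanning from the front yields the oldest match
            if len(t) == len(template) and all(a is None or a == b
                                               for a, b in zip(template, t)):
                return t
        return None                        # with strong inp/rdp semantics, None really
                                           # means no matching tuple exists in TS

    out("request", 1, "first")
    out("request", 2, "second")
    print(rd_oldest(("request", None, None)))   # -> ('request', 1, 'first')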
Finally, unlike standard versions of Linda, out operations in FT-Linda are not com-
pletely asynchronous. In particular, the order in which out operations in a process are
applied in their respective TSs is guaranteed to be identical to the order in which they are
initiated, which is not the case when out operations are asynchronous. This sequential
ordering property reduces the number of possible outcomes that can result from executing
a collection of out operations, thereby simplifying the programming process.
As an example of the differences in TS semantics between Linda and FT-Linda,
consider the Linda fragment from Section 2.2:
    process P1                       process P2
        out(A)                           in(B)
        out(B)                           if (inp(A)) then
                                             print("Must succeed!")
Recall that the problem here is that a programmer may assume that the inp must always
succeed, which is not true without both the strong inp/rdp semantics and the sequential
ordering properties. To implement the above code fragment in FT-Linda, we could group
each process’s operations in a single AGS or implement each op as its own AGS. For
example, the following implements the semantics likely intended by the programmer:
    process P1                       process P2
        ⟨ true ⇒                         ⟨ in(B) ⇒ skip ⟩
            out(A)                       if ( ⟨ inp(A) ⇒ skip ⟩ ) then
            out(B)                           print("Must succeed!")
        ⟩
The other grouping alternatives work in the same way; in all cases the inp will succeed.
3.4 Related Work
A number of other efforts have addressed the problem of providing support for fault-
tolerance in Linda. These include enhanced runtime systems, resilient processes and
data, and transactions. We discuss each in turn, as well as other research projects with
(non-fault-tolerant) features related to FT-Linda’s. We then conclude this section by
summarizing the unique language features of our extensions.
Enhanced Linda Runtime Systems
One class of fault-tolerant versions of Linda does not extend Linda per se, but rather
focuses on adding functionality to the implementation. [XL89, Xu88] give a design for
making the standard Linda TS stable. The design is based on replicating tuples on a subset
of the nodes, and then using locks and a general commit protocol to perform updates. This
replication technique takes advantage of Linda’s semantics so workers suffer little delay
when performing TS operations. Specifically, out and rd are unaffected by these locks; a
worker performing an out need not wait until the tuple is deposited into TS, and a worker
performing a rd need only wait until one of the replicas has responded with a matching
tuple.
This technique works as follows. An out stores a tuple at all TS replicas, and an in
removes one from all replicas; a rd can read from any replica. The out is performed in
the background, so the worker is not delayed. A rd broadcasts a request to all replicas;
the worker can proceed when the first reply, which contains all matches for the template
at that replica, arrives (assuming there are any matches). An in must acquire locks for
the signature it wishes to match from all replicas. It broadcasts its request to all replicas.
The replicas each send a reply message, indicating whether the lock was granted and, if
so, including a copy of all matching tuples. If the worker receives locks from all replicas,
it ascertains if there is a matching tuple in all the matching sets it received. If so, the
worker’s node chooses one and lets the worker proceed, while in the background it sends
a message to the other replicas informing them of the tuple selected. However, if not all
locks were acquired, or there was no matching tuple in the intersection of the tuples sent
by the replicas, the worker sends messages to all replicas releasing the locks, and then
starts over.
Processor failures and recoveries and network partitions are handled in [XL89, Xu88]
using a view change algorithm based on the virtual partitions protocol in [ASC85]. This
allows all workers that are in a majority partition to continue to use TS despite network
partitions.
[PTHR93] also implements a stable TS by replication, but uses centralized algorithms
to serialize tuple operations and achieve replica consistency for single TS operations. In
this scheme, processes attach themselves to the particular subspaces (portions of TS) that they reference. The degree of replication is dictated by the number of nodes sharing
a subspace; all nodes using a particular subspace have a local copy of it. The node with
the lowest node ID from among these TS replicas for a given subspace is the control node.
Requests to delete a tuple (i.e., in the implementation of in) are sent to this control node.
The other replicas are then each sent a message indicating which tuple to delete from their
copy of the subspace.
MTS [CKM92] addresses the issue of relaxing the consistency of the TS replicas
to improve performance. With MTS, there is a replica of TS on each node and three
different consistencies from which the programmer can choose, depending on the pattern
of usage for a given tuple’s signature. Weak consistency can be used if there are neither
simultaneous ins nor rds on that signature. To implement this, the in routine selects
the tuple it deletes, then sends an erase message to the other replicas instructing them
which tuple to delete from TS. Non-exclusive consistency can be used in the absence of
simultaneous ins on that signature. The node performing the in selects the tuple to be
deleted, sends a delete message to all other replicas to tell them which tuple to delete,
and then waits for replies from all the replicas. Finally, strict consistency may be used in
any situation. It uses a two-phase commit protocol similar to that described above from
[XL89, Xu88].
While all these schemes are undoubtedly useful, we note again that adding only this
type of functionality without extending the language has been shown to be inadequate for
realizing common fault-tolerance paradigms [Seg93]. Also, unlike our approach, most
scenarios in the above schemes require multiple messages to update the TS replicas.
Resilient Processes and Data
While the above schemes provided resilient data, another project [Kam90] aims to achieve
fault-tolerance by also providing resilient processes. It accomplishes this by checkpointing TS and process states and writing logs of TS operations. Processes unilaterally checkpoint
their state to stable storage, as well as the message logs recording what TS operations they have performed. If a process fails, this message log is replayed to reconstruct the process's state.
While the scheme is currently only a design that was never implemented, it appears to
have a significant message overhead. It also mandates that all processes be resilient. This
overhead is not necessary in many cases, such as the bag-of-tasks paradigm, as shown
in Chapter 3. In this paradigm, a given worker process does not have to be resilient, so
long as any subtask a failed worker was executing is recovered and executed by another
worker. Also, this scheme has no provisions for atomic execution of TS operations. The
user can construct them (e.g., with a tuple functioning as a binary semaphore), but this
seems less attractive than an explicitly atomic construct.
Transactions and Linda
Two projects have added transactions to Linda. PLinda allows the programmer to combine
Linda tuple space operations in a transaction, and provides for resilient processes that are
automatically restarted after failure [AS91, JS94]. The programmer can choose one of
two modes to ensure the resilience of TS. In the first mode, all updates a transaction
makes are logged to disk before the transaction is committed. In the second, the entire TS
is periodically checkpointed to disk; the system ensures there are no active transactions
during this checkpointing. PLinda handles failed or slow processors, processes, and
networks. This design is sufficient for fault tolerance—indeed, it is more general than
what FT-Linda provides.
PLinda has its limitations, however. Its TS and checkpoints are not replicated, so if the disk suffers a loss-of-data failure (such as a head crash), the program will fail. Also,
if the host the disk is attached to fails, then the program cannot progress until that host
recovers. Further, as discussed above, many applications do not need the overhead of a
resilient process. (This overhead seems to be much less than that in [Kam90, Kam91],
however, in part because PLinda provides a mechanism for a process to log its private
state.)
MOM provides a kind of lightweight transaction [CD94]. It extends in to return a
tuple identifier and out to include the identifier of its parent tuple (e.g., the subtask it was
generated by). It then provides a done(id_list) primitive that commits all in operations in id_list and all out operations whose parents are in id_list plus all I/O performed by the process during the transaction. MOM provides mechanisms to allow the user to construct
a limited form of resilient processes. Finally, it also provides oldest-matching semantics.
However, MOM also has limitations, partly due to its intended original domain of
long-running, fault-tolerant space systems. One limitation is that its primitives are only
designed to support the bag-of-tasks paradigm. Also, it assumes the user can associate
a timeout with a subtask tuple, so if the in operation does not find a match within this
time the transaction will automatically be aborted. Further, MOM also does not replicate
its checkpoint information, so it suffers from the same single point of failure that PLinda
does.
Other Features Related to FT-Linda
Other features similar to those provided in FT-Linda have also been proposed at various
times. [Gel85] briefly introduces composed statements, which provide a form of dis-
junction and conjunction. [Bjo92] includes an optimization that collapses in-out pairs
on the same tuple pattern; it requires restrictions similar to FT-Linda’s opcodes. [Lei89]
discusses the idea of multiple tuple spaces, and some of the properties that might be
supported in such a scheme. Support for disjunction has also been discussed in [Gel85,
Lei89] and in the context of the Linda Program Builder [AG91a, AG91b]. The latter
offers the abstraction of disjunction by mapping it onto ordinary Linda operations and
hiding the details from the user. None of these efforts consider fault-tolerance.
Summary
FT-Linda has many novel features that distinguish it from other efforts to provide fault-
tolerance to Linda. It is unique in that it is the only design that uses the state machine
approach to implement this fault tolerance. It is also the only Linda variant to provide
multi-op atomicity that is not transaction-based. Its provisions to allow the creation of
multiple TSs of different kinds are unique, as are its tuple transfer primitives. FT-Linda is
the only Linda version we are aware of to support disjunction, and its collection of strong
semantic features is unique.
3.5 Possible Extensions to FT-Linda
A number of features are attractive candidates for addition to FT-Linda: additional at-
tributes for tuple spaces, nested AGSs, AGS branch notification, tuple space clocks, tuple
space partitions, guard expressions, and allowing TS creation and destruction primitives in the
AGS.
3.5.1 Additional Tuple Space Attributes
Two additional attributes for tuple spaces might also be considered. The first is encryption.
With this scheme, attribute values can be unencrypted or encrypted. The latter would
imply that all data in the TS are encrypted, while with the former they would not be. This
attribute would, of course, be specified when the TS is created. The encrypting of actuals
will take place in a part of the FT-Linda runtime system that is in the user process’ address
space, for security reasons, and the encryption key will be stored there. This way, the key
never leaves the user’s address space.
Since tuple matching is based only on equality testing, not on other relational operators
(e.g., greater than) or ranges, Linda implementations typically implement matching by
doing a byte by byte comparison of the actual in the template and the corresponding value
in the tuple. As a result, this scheme will still work on encrypted tuple spaces, provided
that the actuals in both the out that generates the tuple and the in (or other operator) that tries to
match it are encrypted with the same key. Thus, the TS managers would not need to be
changed to implement this encryption of data.
Note that this encryption scheme can be used in conjunction with a private TS to
ensure that only one process can actually access the data. If this is not done, any process
could withdraw tuples from the encrypted TS. Also, it could possibly learn something
useful from the number and kind of tuples in that TS. Of course, the removal of those
tuples could also be harmful in and of itself.
A second possible attribute is to indicate write permissions. A read-write TS may
be modified at any time, exactly as TSs currently are in FT-Linda. However, read-only
TSs may not be written after they have been initialized. In such a scheme, the tuple
space would be seeded by its creator, which would then call an FT-Linda primitive to
announce that the tuple space’s initialization is complete. After this point, the TS may not
be modified, i.e., it may not be operated on by out, in, inp, move, or as a destination for
copy. Since the TS will no longer change, it could be replicated on each computer and
perhaps in each address space. Such a tuple space could be used to disseminate efficiently
global data that does not change yet may be accessed often, e.g. the two matrices to be
multiplied together.
3.5.2 Nested AGS
The multiple branches of a disjunctive AGS are the only form of conditional execution within an AGS. However, this disjunction only allows choosing which branch is to be executed; once this decision is made, every operation in that branch's body will be executed. An extension to FT-Linda that would allow another form
of conditional execution within the body is to allow nested AGSs. One possible
syntax is:
    ⟨ guard_1 ⇒
        ...
        ⟨ guard_2 ⇒ body_2 ⟩
        ...
    ⟩

Of course, body_2 could, in turn, have one of its elements be another AGS, and so forth.
Recall that to implement the AGS efficiently, we mandate that no TS operation in the
body may block. Thus, to allow nested AGSs, in and rd would not be allowed to block
in the guards of an AGS that is not at the outer level, because those guards are part of the
body of an enclosing AGS. Thus, in practice their boolean counterparts would most likely
be used.
3.5.3 Notification of AGS Branch Execution
One difficulty in using the disjunctive form of the AGS is that often the code following a
disjunctive AGS will need to know which branch was executed. This takes extra coding,
e.g., either by adding an extra formal variable somewhere in each branch, or by an extra
out in each branch to deposit a distinguished tuple into a scratch TS. The FT-Linda
implementation could directly indicate which branch was executed in a number of ways,
for example, with a per-process variable similar to Unix's errno variable or with a library
routine that returns this information.
3.5.4 Tuple Space Clocks
Another possible extension would be to provide common global time [Lam78]. This has been suggested in a Linda context in the form of the tuple space clock proposed in [Lei89]. Here,
the author notes the usefulness of such a clock that preserves potential causality for a
distributed program, then states
The interesting question is whether there is a semantics for weak clocks which
allows them to be useful in writing parallel programs, but still admits of an
efficient implementation.
Such a clock would be trivial to provide in FT-Linda. Recall that each replicated TS
manager receives the same sequence of AGSs containing the same sequence of operations.
A running count of this sequence of operations can thus serve as a clock. The value of
the clock could be placed in a distinguished tuple (that must not be withdrawn) or made
available in some other fashion to the FT-Linda program.
This clock would preserve potential causality, assuming that processes communicate only through TS. For example, suppose process A reads value time_A and then deposits a tuple into TS. If process B withdraws that tuple and then reads time_B, time_A must be less than time_B. Conversely, if we know process C dealt with event C at time_C, process D dealt with event D at time_D, and time_C ≥ time_D, then event C could not have affected event D.
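A Python sketch of such an operation-count clock follows; the replica structure and the way the clock is exposed are assumptions. Because every replica applies the same operation sequence, all replicas agree on the clock value:

    class TSManagerWithClock:
        """Replicated TS manager that counts the operations it has applied;
        since every replica applies the same sequence, the count is a
        consistent logical clock (illustrative sketch only)."""
        def __init__(self):
            self.tuples = []
            self.clock = 0          # running count of TS operations applied

        def apply(self, ops):
            for op, tup in ops:     # one AGS: a sequence of ops applied as a unit
                if op == "out":
                    self.tuples.append(tup)
                elif op == "in":
                    self.tuples.remove(tup)
                self.clock += 1

        def read_clock(self):
            return self.clock       # e.g., exposed via a distinguished clock tuple

    replicas = [TSManagerWithClock() for _ in range(3)]
    for r in replicas:              # ordered multicast: same AGSs, same order
        r.apply([("out", ("x", 1))])
        r.apply([("in", ("x", 1)), ("out", ("y", 2))])
    assert all(r.read_clock() == 3 for r in replicas)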
3.5.5 Tuple Space Partitions
There is one set of replicated TS managers in the current FT-Linda implementation (to be
described in Chapter 5) that implements all replicated TSs. However, in general, there
could be many different unrelated application programs using replicated TSs. It would
thus be desirable to allow unrelated applications to have different sets of replicated TS
managers, each managing its own set of TSs. This could significantly reduce the workload
for each of the replicated TS managers.
To accomplish this, a replicated TS would be created in the context of a given partition;
a partition could be well-known, dynamically allocated by the runtime system, or both.
Each partition would then be managed by a different set of replicated TS managers. An
AGS could only contain TSs from one partition, because to allow more would require
coordination between different sets of TS managers. Finally, TS_main would be created in its own partition.
3.5.6 Guard Expressions
An AGS’sguards are in effect a kind of scheduling expression that dictates which branch
will be chosen and when that branch will be executed. This expression can currently be
empty (true) or a Linda TS operation (in, inp, rd, or rdp). This can be extended further
to allow a boolean expression to also influence the selection of a guard. Such a boolean
expression would be called a guard expression. Example guards might appear as follows:
    expr_1 and in(...) ⇒
    expr_2 ⇒
    expr_3 and inp(...) ⇒
A guard whose guard expression is false would not be eligible for execution in TS; i.e., if
the guard has a TS operation, the TS manager would not even search for a matching tuple
for it.
This guard expression would make FT-Linda’s guards more similar to constructs in
other concurrent languages. The notion of a guarded communication statement was
introduced in [Dij75]; its guard contained both an optional expression and an optional
communication statement. The guarded expression has since been used in various forms
in parallel programming languages such as CSP [Hoa78], Ada [DoD83], SR [AO93], and
Orca [Bal90].
Recall that in a disjunctive AGS, to simplify the semantics and to fit into a programming language's type system, all guards must be the same: either all absent (true), blocking (in or rd), or boolean (inp or rdp). This same restriction would hold here; for example, because the three guards above are blocking, absent, and boolean, respectively, they could not be used in the same AGS. No other restrictions would be required to add guard expressions to the language; the guard expression simply narrows the list of eligible guards before the AGS request is sent to the TS managers.
3.5.7 TS Creation in an AGS
In some cases, it is desirable to move tuples to another TS to operate on them, and then
either return the tuples or a synthesized result in their place. With this technique, the
TS handle must be left as a placeholder to allow the recovery of those tuples in case the
process operating on them fails. For example, a process may use the following:
    safe_ts := ts_create(stable, shared)
    ⟨ true ⇒
        out(TS_main, "safe_ts", my_host, safe_ts)
        move(TS_main, safe_ts)
        copy(safe_ts, TS_scratch)
    ⟩
    # Operate somehow on tuples in TS_scratch
    # to produce result tuple
    ⟨ true ⇒
        in(TS_main, "safe_ts", my_host, safe_ts)
        out(TS_main, "result", result)
    ⟩
    ts_destroy(safe_ts)
The problem with this scenario is that if the host fails either right before the first AGS or right after the last AGS, then safe_ts will never be destroyed. The accumulation of such orphaned TSs could be a serious problem in a long-running application.
This can be avoided by allowing a TS to be created and destroyed inside an AGS. In the above scenario, the creation and the destruction of safe_ts would be inside the first and last AGS, respectively. Additionally, because variable assignment is not allowed in an AGS, the TS creation primitive would take the TS handle as an additional argument. With this extension, safe_ts could not be orphaned, because its creation and destruction would be atomic with the depositing and withdrawing of the placeholder TS handle, respectively.
This addition would be very simple to implement. Indeed, the commands to create and
destroy a replicated TS are already broadcast to the replicated TS managers in messages
generated by ts_create() and ts_destroy(), respectively. It would thus be trivial
to allow these operations to also be combined in an AGS with other TS operations.
3.6 Summary
This chapter describes FT-Linda, a version of Linda that allows Linda programs to tolerate the crash failures of some of the computers involved with the computation. FT-Linda supports the creation of multiple tuple spaces with different attributes. It also has provisions for atomic execution and features strong semantics. This chapter also surveys other versions of Linda designed to provide support for fault tolerance. Finally, it surveys possible extensions to FT-Linda.
FT-Linda supports two attributes for tuple spaces: resilience and scope. The resilience attribute specifies the behavior of the tuple space in the presence of failures; it can be either stable or volatile. The scope attribute, either shared or private, indicates which processes may access the tuple space. Both attributes must be specified when the tuple space is created.
FT-Linda has two provisions for atomic execution, the atomic guarded statement and
tuple transfer primitives. The atomic guarded statement allows a sequence of tuple space
operations to be performed in an all-or-nothing fashion despite concurrency and failures. It can be used to construct a static bag-of-tasks worker, the failure of which can be recovered from by a monitor process. The atomic guarded statement also has a disjunctive form that specifies multiple sequences of tuple space operations, zero or one of which will be executed atomically. FT-Linda's tuple transfer primitives allow tuples to be moved or
copied atomically between tuple spaces. This can be used, in conjunction with the atomic
guarded statement and monitor processes, to create a fault-tolerant dynamic bag-of-tasks
worker.
FT-Linda offers strong semantics that are useful for fault-tolerant parallel program-
ming. Its strong inp/rdp semantics provide absolute guarantees that the boolean primitives will find a matching tuple if one exists.
that the oldest tuple that matches the given template will be selected by that operation.
Finally, its sequential ordering property ensures that tuple space operations from a given
process will be applied in their respective tuple spaces in the order prescribed in that
process’s code.
Other projects have provided fault-tolerant support for Linda. One class provides resilient tuple spaces but does not extend the language itself; this is not sufficient to solve the distributed consensus problem. Another design provides both resilient processes and resilient tuple spaces. Two projects, PLinda and MOM, have added transactional support to Linda.
FT-Linda could be extended in a number of directions. These include additional
tuple space attributes, nested atomic guarded statements, tuple space clocks, tuple space
partitions, and guard expressions.
CHAPTER 4
PROGRAMMING WITH FT-LINDA
This chapter illustrates FT-Linda’s applicability to a wide range of problems. It
presents five examples from FT-Linda’s two primary domains: highly dependable systems
and parallel programming. For simplicity, we only concern ourselves in these examples
with the failure of worker or server processes; the final section in this chapter discusses
handling the failure of the main/master process.
4.1 Highly Dependable Systems
This section gives FT-Linda implementations of three system-level applications. First, it
gives a replicated server example. This is an example of programming replicated state
machines in FT-Linda. Next, it presents an FT-Linda recoverable server, an example
of the primary/backup approach [AD76, BMST92]. Since the server is not replicated,
there are no redundant server computations in the absence of failures. Finally, this section
presents an FT-Linda implementation of a transaction facility. This demonstrates the
utility of FT-Linda’s tuple transfer primitives, and the ability of the language to implement
abstractions more powerful than the atomic guarded statement (AGS). Transactions are,
of course, an example of the object-action paradigm introduced in Section 1.3.1 [Gra86].
4.1.1 Replicated Server
In this example, a process implements a service that is invoked by client processes by
issuing requests for that service. To provide availability of the service when failures
occur, the server is replicated on multiple machines in a distributed system. To maintain
consistency between servers, requests must be executed in the same order at every replica.
The key to implementing this approach in FT-Linda is ordering the requests in a failure-
resilient way. We accomplish this by using the Linda distributed variable paradigm in the
form of a sequence tuple for each service. The value of this tuple starts at zero and is
incremented each time a client makes a request, meaning that there is exactly one request
per sequence number for each service and that a sequence number uniquely identifies a
request. This strategy results in a total ordering of requests that also preserves causality
between clients, assuming that clients communicate only using TS. This sequence number
is also used to ensure that server replicas process each request exactly once.
An FT-Linda implementation of a generic replicated server follows. First, however,
for each server (not each server replica), the following is performed at initialization time
to create the sequence tuple:
    out(sequence, server_id, 0)

⟨ in(sequence, sid, ?sequence) ⇒
      out(sequence, sid, PLUS(sequence, 1))
      out(request, sid, sequence, service_name, service_id, args)
⟩
if ( a reply is needed for service )
    in(reply, sid, sequence, ?reply_args)

Figure 4.1: Replicated Server Client Request
The server_id uniquely identifies the (logical) server with which the sequence tuple is
associated.
Given this tuple, then, a client generates a request by executing the code shown in
Figure 4.1. Here, the client does three things: withdraws the sequence tuple, deposits a
new sequence tuple, and deposits its request; after this it withdraws the appropriate reply
tuple if necessary. These three actions must be performed atomically. To see this, consider
what would happen given failures at inopportune times. If the processor on which the
client is executing fails between the in and the out, the sequence tuple would be lost, and
all clients and server replicas would block forever. Similarly, if a failure occurs between
the two outs, there would be no request tuple corresponding with the given sequence
number, so the server replicas would block forever.
Two additional aspects of the client code are worth pointing out. First, note that
the client includes a service_name and service_id in the request tuple. This information
specifies which of the services provided by server sid is to be invoked. The redundancy is needed for
structuring the server, as will be seen below. Second, note the PLUS in the TS operation
depositing the updated sequence tuple. This opcode results in the value of the sequence
tuple that was withdrawn in the previous in being incremented when it is deposited back
into TS. These opcodes, which also include such common operations as MIN, MAX,
and MINUS, are intended to allow a limited form of computation within atomic guarded
statements. As already mentioned above, general computations—including the use of
expressions or user functions in arguments to TS operations—are not allowed due to
their implied implementation overhead. For example, among other things, it would mean
having to transfer the code for the computation to all processors hosting TS replicas so that
it could be executed at the appropriate time. Also, these general computations must be
executed, in essence, in a critical section—that is, while the runtime system is processing
the AGS in question—which could severely degrade overall performance.
The code for a generic server replica that retrieves and services request tuples is
given in Figure 4.2. The server waits for a request with the current sequence number
using the disjunctive version of the AGS, one branch (i.e., guard/body pair) for each
service offered by the server.

seq := 0
loop forever
    ⟨ rd(TS_main, request, my_sm_id, seq, service_1, ?servicenum, ?x) ⇒ skip
      or
          ...
      or
      rd(TS_main, request, my_sm_id, seq, service_n, ?servicenum, ?a, ?b, ?c) ⇒ skip
    ⟩
    case servicenum of
        # each service does out(reply, my_sm_id, seq, reply_args) if
        # the service returns an answer to the client
        1 : service_1(x)
        ...
        n : service_n(a, b, c)
    end case
    seq := seq + 1
end loop

Figure 4.2: Server Replica

When a request with the current sequence number arrives,
the server uses the assignment of the service_id in the tuple to the variable servicenum
to record which service was invoked. This tactic is necessary because normal variable
assignment to record the selection is not permitted within an atomic guarded statement.
Finally, after reading the request tuple, the server executes a procedure (not shown)
that implements the specified service. This procedure performs the computation, and, if
necessary, deposits a reply tuple.
An alternate way of ascertaining which branch was executed is to have each branch
deposit a distinguished tuple into a scratch TS in which one field indicates which branch
was executed. For example, branch 3 would execute:
    out(TS_scratch, branch, 3)
A third alternative is to extend FT-Linda to indicate this directly, as discussed above in
Section 3.5.3.
Note that in the above scheme, some tuples get deposited into TS but not withdrawn.
In particular, request tuples are not withdrawn, and neither are reply tuples from all but one
of the replicas. If leaving these tuples in TS is undesirable, a garbage collection scheme
could be used. To do this for request tuples, the sequence number of the last request
processed by each server replica would first need to be maintained. Since no request with
an earlier sequence number could still be in the midst of processing, such tuples could be
withdrawn. A similar scheme could be used to collect extra reply tuples as well.
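As a concrete illustration, the sketch below is ours rather than part of the dissertation's design; it assumes that each replica additionally maintains a (done, sid, replica, seq) tuple recording the last sequence number it has processed, so that a simple collector process can reclaim request tuples that every replica has finished with (tuple formats otherwise follow Figure 4.2):

gc_seq := 0
loop forever
    # find the smallest sequence number that every replica has processed
    min_seq := INFINITY
    for r := 1 to num_replicas do
        ⟨ rd(TS_main, done, sid, r, ?s) ⇒ skip ⟩
        min_seq := min(min_seq, s)
    end for
    # every replica is past these requests, so they can safely be withdrawn
    while ( gc_seq <= min_seq ) do
        ⟨ inp(TS_main, request, sid, gc_seq, ?service_name, ?service_id, ?args) ⇒ skip ⟩
        # leftover reply tuples for gc_seq could be collected here in the same way
        gc_seq := gc_seq + 1
    end while
    nap(some)     # avoid busy-looping when no replica has advanced
end loop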
The complete source code for this example is in Appendix B.
4.1.2 Recoverable Server
Another strategy for realizing a highly available service is to use a recoverable server. In
this strategy, only a single server process is used instead of the multiple processes as in the
previous section. This saves computational resources if no failure occurs, but also raises
the possibility that the server may cease to function should the processor on which it is
executing fail. To deal with this situation, the server is constructed to save key parts of its
state in stable storage so that it can be recovered on another processor after a failure. The
downside of this approach when compared to the replicated server approach is, of course,
the unavailability of the service during the time required to recover the server.
In our FT-Linda implementation of a recoverable server, a stable TS functions as a
stable storage in which values needed for recovery are stored. Monitor processes on
every processor wait for notification of the failure of the processor on which the server is
executing; should this occur, each attempts to create a new server. Distributed consensus
is used to select the one that actually succeeds.
An FT-Linda implementation of such a recoverable server and its clients follows.
For simplicity, we assume that the server only provides one service, and that the service
requires a reply; multiple services would be handled with disjunction as in the previous
example. We also assume that the following is executed upon initialization to create the
server:
server_lpid := new_lpid()
server_TS_handle := ts_create(stable, private, server_lpid)
out(TS_main, server_handle, server_id, server_TS_handle)
out(TS_main, server_registry, server_id, server_lpid, host)
create(server, host, server_lpid, server_id)
This allocates an LPID for the server, creates a private but stable tuple space for its use,
places the handle for this tuple space in the globally-accessible TS
T S main
, and creates
the server. The final out operation creates a registry tuple that records which processor is
executing the server. This tuple is used in failure handling, as described below.
process server(my_id)
    # Read in handle of private TS
    rd(TS_main, server_handle, my_id, ?TS_server)
    # Read in state if present in TS_server, otherwise initialize it there
    state := initial_state
    ⟨ not rdp(TS_server, state, my_id, ?state) ⇒
          out(TS_server, state, my_id, initial_state)
    ⟩
    loop forever
        ⟨ in(TS_main, request, my_id, ?client_lpid, ?args) ⇒
              out(TS_main, in_progress, my_id, client_lpid, args)
        ⟩
        # calculate service & its reply, change state, do output
        ...
        ⟨ in(TS_main, in_progress, my_id, client_lpid, args) ⇒
              out(TS_main, reply, my_id, reply_args)
              in(TS_server, state, my_id, ?old_state)
              out(TS_server, state, my_id, state)
        ⟩
    end loop
end server

Figure 4.3: Recoverable Server
The code used by client processors to request service is:
    out(TS_main, request, server_id, client_lpid, args)
    in(TS_main, reply, server_id, client_lpid, ?reply_args)
Note that no sequence tuple is needed for the client of a recoverable server, because
there is only one actual server process at any given time. Another function this tuple
performs in Section 4.1.1, beyond ordering requests for the server, is to ensure that the
client withdraws the reply tuple corresponding to its request. To ensure this without the
sequence tuple, the client includes its LPID in the request tuple, the server includes it in
the reply tuple, and then the client withdraws the reply tuple with its LPID in it.
The server itself is given in Figure 4.3. The server first reads its initial state and TS
handle from TS_main, and then enters an infinite loop that withdraws requests and leaves
an in_progress tuple as in previous examples. Finally, it performs the service, updates
the state tuple, and outputs a reply.
process monitor(failure_id, server_id)
    loop
        ⟨ in(TS_monitor, failure, failure_id, ?host) ⇒ skip ⟩
        # see if server server_id was running on failed host
        if ( ⟨ rdp(TS_main, server_registry, server_id, ?server_lpid, host) ⇒ skip ⟩ ) then
            # Regenerate request tuple if found
            ⟨ inp(TS_main, in_progress, server_id, ?client_lpid, ?args) ⇒
                  out(TS_main, request, server_id, client_lpid, args)
            ⟩
            # Attempt to start new incarnation of failed server
            if ( ⟨ inp(TS_main, server_registry, server_id, ?server_lpid, host) ⇒
                       out(TS_main, server_registry, server_id, server_lpid, my_host)
                 ⟩ ) then
                create(server, my_host, server_lpid, server_id)
            end if
        end if
    end loop
end monitor

Figure 4.4: Recoverable Server Monitor Process
When a processor fails, two actions must be taken. First, any in_progress tuple
associated with a failed server must be withdrawn and the corresponding request tuple
recovered. Second, the failed server itself must be recreated on a functioning processor.
To perform these actions, however, we need to know if the failed processor was in fact
executing a server. This information is determined using the registry tuples alluded to
above.
A monitor process strategy similar to the bag-of-tasks example is used to implement
these two actions. Here, it is structured to monitor a single server, although it could easily
be modified to handle the entire set of servers in a system. The code is shown in Figure 4.4.
The monitor handles the failure of its server by first dealing with a possible
in_progress
tuple; if present, one such monitor will succeed in regenerating the associated request
tuple. The next step is to attempt to create a new incarnation of the server, a task that
requires cooperation among the monitor processes on each machine to ensure that only
one is created. This is implemented by having the monitor processes synchronize using
the registry tuple in a form of distributed consensus to agree on which should start the new
incarnation. The selected process then creates the new server, while the others simply
continue.
The above scheme features one recoverable server process executing at any given
time. However, this scheme could easily be modified for performance reasons by having
multiple recoverable servers working on different requests in parallel. This would only
require slight modifications to the above scheme. For example, the monitor process could
not assume there was at most one
in_progress
tuple and thus would have to loop until it
found no more. Note that this scheme is different from the replicated server scheme in
Section 4.1.1. With this scheme, there are multiple primary/backup servers implementing
the same service, but they operate independently on different requests. In the replicated
server example, on the other hand, all servers execute each request submitted.
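For instance, the single conditional inp of Figure 4.4 would become a loop of the following form (a sketch only, using the tuple formats of Figures 4.3 and 4.4):

while (
    ⟨ inp(TS_main, in_progress, server_id, ?client_lpid, ?args) ⇒
          out(TS_main, request, server_id, client_lpid, args)
    ⟩
) do
    noop
end while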
The recovery code for the recoverable server is more complicated than for the replicated
server in the previous section. Fortunately, however, failures are usually infrequent relative
to client requests, so the additional cost of the recovery code will rarely be paid in practice.
Also, thanks to FT-Linda’s oldest-matching semantics, a sequence tuple does not have to
be maintained, as it does with the replicated server. These benefits accrue for each client
request and, for many applications, far outweigh the extra costs incurred for each failure.
Those applications that cannot tolerate the fail-over time of the recoverable server may
need to use the replicated server instead.
The complete source code for this example is in Appendix C.
4.1.3 General Transaction Facility
A transaction is a sequence of actions that must be performed in an all-or-nothing fashion
despite failures and concurrency. Although similar to an AGS, a transaction is more
general, because a transaction can feature arbitrary computations and, in the Linda context,
an indefinite number of TS operations. However, we can construct a transaction from a
sequence of AGSs. In this subsection, we give an FT-Linda implementation of a library of
procedures that provide transactions to user-level processes. The interface for this library
is given in Figure 4.5. For simplicity, we assume that variable identifiers are well-known
or ascertainable, and that variables are simply integers.
To initialize the system, init_transaction_library() is called once, followed by
create_var() for every variable to be involved in any transaction. After this point, usage
of transactions may begin. To perform a transaction, start_transaction() is called with a
list of the variables to be involved with the transaction; it returns a unique transaction
identifier (TID). After this, modify_var() is called each time a variable is to be modified,
followed by either commit() or abort(). Finally, we provide print_variables() to print out
an atomic snapshot of all variables. Note that the user of this transaction library has only
to be aware of this interface, not of the fact that it is implemented with FT-Linda.
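For example, a client of the library might be structured as follows; this sketch is ours, and the variable identifiers X and Y and the values used are purely illustrative:

init_transaction_library()
create_var(X, 0)                        # X and Y are well-known var_t identifiers
create_var(Y, 0)

var_list[1] := X
var_list[2] := Y
tid := start_transaction(var_list, 2)   # acquires locks on X and Y in a fixed order
modify_var(tid, X, 40)
modify_var(tid, Y, 60)
commit(tid)                             # or abort(tid) to discard both updates

print_variables("after the first transaction")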
To implement this transaction facility, we maintain one lock tuple and one variable
tuple for each variable created. Recall that in the fault-tolerant bag-of-tasks solution in
Section 3.2.1, a subtask being worked on was kept in an alternate form (an in_progress
tuple) in TS_main. We call such a tuple a placeholder or a placeholder tuple, and say
that the original tuple is in its placeholder form. In this transaction manager, then, to
implement both stability of and mutual exclusion on a variable’s tuples, we will convert
these tuples into their placeholder forms when their variable is being used in a transaction.
Also, during a transaction, two scratch TSs are kept, one with the starting values of all
variables in the transaction, and the other with the current values of all variables in the
transaction. Maintaining these two scratch TSs makes it simple to abort or commit all
changes to the variables with a single AGS consisting primarily of a few move operations.
Finally, because the transaction facility is implemented as a library of routines linked into
the user's address space, the monitor processes are the only processes created; there
are no server or other system processes of any kind. Thus, the only thing the monitor
processes have to do is to abort any transactions that were being executed by a user on a
host that failed.

# Types
type tid_t is int     # Transaction ID
type var_t is int     # Variable ID
type val_t is int     # Variable Value

# Transaction procedures
procedure init_transaction_library()
procedure create_var(var_t var_id, val_t init_val)
procedure destroy_var(var_t var_id)
procedure start_transaction(var_t var_id_list[], int num_vars) returns tid_t
procedure modify_var(tid_t tid, var_t var_id, val_t new_val)
procedure abort(tid_t tid)
procedure commit(tid_t tid)
procedure print_variables(string message)

Figure 4.5: Transaction Facility Interface
In the following figures we describe the FT-Linda implementation of the routines
given in Figure 4.5. For simplicity, these implementations do not concern themselves
with checking for user errors, e.g. testing if a variable used in a transaction has been
created previously.
Figure 4.6 gives the routines to initialize the transaction facility. (In this chapter, we
abbreviate an AGS with a true guard, ⟨ true ⇒ body ⟩, as simply ⟨ body ⟩, for the sake of
brevity and clarity.) The two things that init_transaction_facility() has to do are to create
a tuple storing the current TID and to create the monitor processes. To create a variable,
create_var() creates two tuples in TS_main, one for the variable and the other for the lock.
Conversely, to destroy a variable, the lock and variable tuples must be removed. Of course,
variables should not be destroyed if they may still be in use.

procedure init_transaction_facility()
    # Initialize the transaction IDs
    ⟨ out(TS_main, tid, 1) ⟩
    # Create the monitor processes
    for i := 1 to num_hosts() do
        lpid := new_lpid()
        fid := new_failure_id()
        create(monitor_transactions, i, lpid, fid)
    end for
end init_transaction_facility

procedure create_var(var_t var_id, val_t init_val)
    ⟨ out(TS_main, var, var_id, init_val)
      out(TS_main, lock, var_id)
    ⟩
end create_var

procedure destroy_var(var_t var_id)
    # Block until any pending transaction using var_id completes
    ⟨ in(TS_main, lock, var_id) ⇒
          in(TS_main, var, var_id, ?val_t)
    ⟩
end destroy_var

Figure 4.6: Transaction Facility Initialization and Finalization Procedures
Figure 4.7 gives the code to start a transaction. First, the scratch TSs for the original
and current values of the variables are created. Their TS handles are then stored in
TS_main for later retrieval by the internal routines get_orig() and get_cur(), respectively.
After this, mutual exclusion is acquired on all variables involved with the transaction; this
is performed in a linear order to ensure that deadlock cannot occur. To acquire mutual
exclusion on a variable, its lock tuple is withdrawn. Additionally, to facilitate recovery
from failures as well as the ability to either commit or abort the transaction later, the lock
and variable tuples are moved to their placeholder form in TS_main, a copy of the variable
tuple is placed into both scratch TSs, and a copy of the lock tuple is placed into orig_ts.
Finally, start_transaction() returns the TID for this transaction; the user must pass this to
modify_var(), commit(), and abort().
Figure 4.8 gives the code to modify a variable. It updates the value of the variable's
tuple in the transaction's scratch tuple space that stores the current values of the transaction's
variables.

procedure start_transaction(var_t var_id_list[], int num_vars) returns tid_t
    # Allocate the next transaction ID for this transaction
    ⟨ in(TS_main, tid, ?tid) ⇒
          out(TS_main, tid, PLUS(tid, 1))
    ⟩
    # Create scratch TSs for original and current values of
    # vars involved in this transaction
    cur_ts := ts_create(volatile, private, my_lpid)
    orig_ts := ts_create(volatile, private, my_lpid)
    # Deposit handles into TS for retrieval by get_cur() and get_orig()
    ⟨ out(TS_main, ts_cur, tid, cur_ts)
      out(TS_main, ts_orig, tid, orig_ts)
    ⟩
    # Acquire locks for vars in this transaction, in a linear order.
    # In doing so, move its lock tuple to a lock_inuse tuple,
    # do similarly for the var tuple, and add a copy of the var to cur_ts
    sort(var_id_list[])
    for i := 1 to num_vars do
        ⟨ in(TS_main, lock, var_id_list[i]) ⇒
              out(TS_main, lock_inuse, my_host, tid, var_id_list[i])
              out(orig_ts, lock, var_id_list[i])
              in(TS_main, var, var_id_list[i], ?val)
              out(TS_main, var_inuse, my_host, tid, var_id_list[i], val)
              out(orig_ts, var, var_id_list[i], val)
              out(cur_ts, var, var_id_list[i], val)
        ⟩
    end for
    return tid
end start_transaction

Figure 4.7: Transaction Initialization

procedure modify_var(tid_t tid, var_t var_id, val_t new_val)
    cur_ts := get_cur(tid)
    ⟨ in(cur_ts, var, var_id, ?val_t)
      out(cur_ts, var, var_id, new_val)
    ⟩
end modify_var

Figure 4.8: Modifying a Transaction Variable
Figure 4.9 gives the code to abort and commit a transaction. To abort a transaction, the
lock and variable tuples must be restored in TS_main and their placeholders discarded.
Also, the tuples storing the TS handles for the transaction’s scratch TSs are withdrawn.
The code to commit a transaction is identical to the code to abort, except that the variable
tuples are moved from the scratch TS holding the current values of the variables, rather
than the one holding their original values.
The code to print out an atomic snapshot of all variables is given in Figure 4.10. An
AGS copies all variables into a scratch TS, either in their normal or placeholder form.
From there, they can be withdrawn one at a time and their values printed out.
Finally, Figure 4.11 gives the monitor process for the transaction facility. Recall
that, because the facility has no system processes to recover, it only has to abort any transactions
that were being executed by a client on the failed host. Thus, the monitor process only
regenerates variable and lock tuples for any variables that were acquired by a transaction
on the failed host. These tuples are recovered from their placeholders already in TS_main.
The transaction facility given above requires a modest number of AGS requests. To
implement a transaction, the main cost is one replicated AGS (an AGS involving replicated
TSs) per variable in the transaction; this happens when the transaction is started. After
this, it only costs one local AGS (an AGS involving only local scratch TSs) to modify a
variable, and then one replicated AGS to either commit or abort the transaction. Also, it
only takes one replicated AGS and then one scratch AGS per variable to print out an atomic
snapshot of all variables. Finally, note that only one process may participate in a given
transaction in the above implementation. However, it would be very simple to extend this
example to allow multiple processes to cooperate in realizing a single transaction. The
only change would be to create the transaction's TSs (ts_cur and ts_orig) as stable
and shared rather than as volatile and private. Of course, this would make this transaction
facility more expensive. In particular, modify_var() would be much more expensive; it
would require a replicated AGS rather than a (much cheaper) local one.
The complete source code for this example is in Appendix D.
procedure abort(tid_t tid)
    cur_ts := get_cur(tid)
    orig_ts := get_orig(tid)
    # Put the lock and (old) var tuples involved with this transaction
    # back into TS_main, discard the scratch TS tuples,
    # and discard the lock and variable inuse placeholders
    ⟨ move(orig_ts, TS_main, lock, ?var_t)
      move(orig_ts, TS_main, var, ?var_t, ?val_t)
      in(TS_main, ts_cur, tid, ?ts_handle_t)
      in(TS_main, ts_orig, tid, ?ts_handle_t)
      move(TS_main, cur_ts, lock_inuse, my_host, tid, ?var_t)
      move(TS_main, cur_ts, var_inuse, my_host, tid, ?var_t, ?val_t)
    ⟩
    ts_destroy(cur_ts)
    ts_destroy(orig_ts)
end abort

procedure commit(tid_t tid)
    cur_ts := get_cur(tid)
    orig_ts := get_orig(tid)
    # Put the lock and (new) var tuples involved with this transaction
    # back into TS_main, discard the scratch TS tuples,
    # and discard the lock and variable inuse placeholders
    ⟨ move(orig_ts, TS_main, lock, ?var_t)
      move(cur_ts, TS_main, var, ?var_t, ?val_t)
      in(TS_main, ts_cur, tid, ?ts_handle_t)
      in(TS_main, ts_orig, tid, ?ts_handle_t)
      move(TS_main, cur_ts, lock_inuse, my_host, tid, ?var_t)
      move(TS_main, cur_ts, var_inuse, my_host, tid, ?var_t, ?val_t)
    ⟩
    ts_destroy(cur_ts)
    ts_destroy(orig_ts)
end commit

Figure 4.9: Transaction Abort and Commit
procedure print_variables(string message)
    # Copy an atomic snapshot of all variables, whether inuse or not
    scratch_ts := ts_create(volatile, private, my_lpid)
    ⟨ copy(TS_main, scratch_ts, var, ?var_t, ?val_t)
      copy(TS_main, scratch_ts, var_inuse, ?int, ?tid_t, ?var_t, ?val_t)
    ⟩
    print("Variables at", message)
    while ( ⟨ inp(scratch_ts, var, ?var, ?val) ⇒ skip ⟩ ) do
        print("  var", var, "value", val)
    end while
    while ( ⟨ inp(scratch_ts, var_inuse, ?host, ?tid, ?var, ?val) ⇒ skip ⟩ ) do
        print("  var", var, "value", val, "inuse with tid", tid, "on host", host)
    end while
    ts_destroy(scratch_ts)
end print_variables

Figure 4.10: Printing an Atomic Snapshot of all Variables
procedure monitor_transactions(failure_id)
    loop
        ⟨ in(TS_monitor, failure, failure_id, ?failed_host) ⇒ skip ⟩   # Wait for a failure
        # Regenerate all lock and variable tuples we find in-progress
        # for any transactions on the failed host.
        while (
            ⟨ inp(TS_main, lock_inuse, failed_host, ?tid_t, ?var) ⇒
                  out(TS_main, lock, var)
                  in(TS_main, var_inuse, failed_host, ?tid_t, var, ?val)
                  out(TS_main, var, var, val)
            ⟩
        ) do
            noop
        end while
    end loop
end monitor_transactions

Figure 4.11: Transaction Monitor Process
loop forever
    in(subtask, ?subtask_args)
    if ( small_enough(subtask_args) )
        out(result, result(subtask_args))
    else
        out(subtask, part1(subtask_args))
        out(subtask, part2(subtask_args))
    end if
end loop

Figure 4.12: Linda Divide and Conquer Worker
4.2 Parallel Applications
FT-Linda is applicable to a wide variety of parallel applications. We have already seen
one user-level FT-Linda application in the bag-of-tasks example given in Chapter 3. This
section presents two more. The first, the divide-and-conquer worker, is a generalization
of the bag-of-tasks worker. The second is an implementation of barriers with FT-
Linda; these barriers and related algorithms are applicable to a large number of scientific
problems.
4.2.1 Fault-Tolerant Divide and Conquer
The basic structure of divide and conquer is similar to the bag-of-tasks, where subtask
tuples representing work to be performed are retrieved by worker processes [Lei89].
The difference comes in the actions of the worker. Here, upon withdrawing a subtask
tuple, the worker first determines if the subtask is “small enough,” a notion that is, of
course, application dependent. If so, the task is performed and the result tuple deposited.
However, if the subtask is too large, the worker divides it into two new subtasks and
deposits representative subtask tuples into TS. Such a worker is depicted in Figure 4.12.
An FT-Linda solution that provides tolerance to processor crashes is given in Fig-
ure 4.13. Here, the worker leaves an
in progr ess
tuple when withdrawing a subtask, as
done with the bag-of-tasks. It then decides if the task is small enough. If it is, it calculates
the answer and atomically withdraws the
in progr ess
tuple while depositing the answer
tuple. If not, it divides the task into two subtasks and atomically deposits these new
subtask tuples while withdrawing the in progress tuple. A monitor process similar to the
one discussed in Section 3.2.1 would be used to recover lost subtask tuples upon failure.
An alternative strategy involving a scratch TS similar to that used in Section 3.2.2 could
also be employed. The subtask tuples are first deposited into the scratch TS, which is then
merged atomically with the shared TS upon withdrawal of the in_progress tuple. This
strategy would be especially appropriate if a variable number of subtasks are generated
depending on the specific characteristics of the subtask being divided.

loop forever
    ⟨ in(subtask, ?subtask_args) ⇒
          out(in_progress, subtask_args, my_hostid)
    ⟩
    if ( small_enough(subtask_args) )
        result_args := result(subtask_args)
        ⟨ in(in_progress, subtask_args, my_hostid) ⇒
              out(result, result_args)
        ⟩
    else
        subtask1_args := part1(subtask_args)
        subtask2_args := part2(subtask_args)
        ⟨ in(in_progress, subtask_args, my_hostid) ⇒
              out(subtask, subtask1_args)
              out(subtask, subtask2_args)
        ⟩
    end if
end loop

Figure 4.13: FT-Linda Divide and Conquer Worker
The complete source code for this example is in Appendix F.
4.2.2 Barriers
Many scientific problems can be solved with iterative algorithms that compute a better
approximation to the solution with each iteration, stopping when the solution has con-
verged. Examples of such problems include partial differential equations, region labelling,
parallel prefix computation, linked list operations, and grid computations [And91]. Such
algorithms typically involve an array, with the same computation being performed on
each iteration. Since the computations for a given portion of the array and for a given
iteration are independent of the other computations for that iteration, such algorithms
are ideal candidates for parallelization. However, the computation for a given iteration
depends on the result from the previous iteration. Thus, all processes must synchronize
after each iteration to ensure that they all are using the correct iteration’s values. This
synchronization point is called a barrier, because all processes must arrive at this point
before any may proceed past it.
While barriers are convenient for parallel programming, they would be even more
convenient if an application using barriers could tolerate the failure of some of the pro-
cessors involved with the computation. This would mean making the worker processes
resilient, so that if one failed, a reincarnation would be created to resume execution where
the failed worker stopped.
In this subsection, we develop a fault-tolerant barrier with FT-Linda. First, we briefly
discuss multiprocessor implementation of barriers, outlining three different implementa-
tion techniques: shared counters, coordinator processes, and tree-structured barriers. This
serves to survey the problem more concretely. We then discuss a simpler way to implement
barriers with Linda, taking advantage of its associative matching and blocking primitives.
A fault-tolerant FT-Linda version of this solution is then given. This is followed by an
explanation of how the techniques used in making this fault-tolerant barrier can be used
to make the other kinds of barriers fault-tolerant, as well as classes of algorithms similar
to those involving barrier synchronization.
Multiprocessor Implementations of Barriers
We now describe three techniques for implementing a barrier on a multiprocessor; this
material is summarized from [And91], where two other schemes are also given. The first
technique is to maintain a shared counter that records the number of processes that have
reached the barrier. Thus, when it arrives at the barrier, a worker process increments its
value and then busy waits (spins) until the counter’s value equals the number of workers.
While this technique is simple to implement, it suffers from severe memory contention; on
each iteration, each worker writes to the counter and then continually reads it. This does
not scale well, even with the assistance of multiprocessor cache coherence and atomic
machine instructions such as fetch and add.
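To make the shared-counter technique concrete, the following sketch (ours, written in the pseudocode style of this chapter rather than taken from [And91]) shows a sense-reversing variant, one standard way of making the counter safely reusable across iterations. Here count and sense are shared variables, my_sense is private to each worker, and fetch_and_add is assumed to be an atomic instruction that returns the counter's previous value:

# shared:  count := 0, sense := 0
# private: my_sense := 1

procedure barrier()
    old := fetch_and_add(count, 1)
    if ( old = N - 1 ) then
        count := 0                  # last arrival resets the counter ...
        sense := my_sense           # ... and releases the spinning workers
    else
        while ( sense != my_sense ) do
            skip                    # busy wait
        end while
    end if
    my_sense := 1 - my_sense        # flip the sense for the next barrier
end barrier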
The second technique is to use a coordinator process. With this scheme, when a
worker arrives at a barrier, it informs a distinguished coordinator process, and then waits
for permission from the coordinator to proceed. For each iteration, then, the coordinator
simply waits until all workers have arrived at the barrier, then gives them all permission
to proceed.
Using a coordinator process solves most of the memory contention problems associated
with the shared counter technique. However, it introduces two new problems that limit its
scalability. First, it uses an additional process. Since the size of many problems is a power
of two, and the number of processors in many multiprocessors is even (or even a power of
two), this means that the coordinator will often have to share a processor with a worker.
This will slow down all workers, because none may proceed until all have completed
a given iteration. Second, the number of computations the coordinator must perform
increases linearly with the number of workers. This also affects scalability, because most
or all workers are generally not doing productive work while the coordinator is performing
these computations, i.e. actively receiving replies and disseminating permission to pass.
A third technique is to use a tree-structured barrier. Recall that both the shared counter
and the coordinator process implementation had scalability and other problems. We can
overcome these problems by eliminating the coordinator process and disseminating its
logic among the worker processes. A tree-structured barrier organizes the workers in
a tree in the manner shown in Figure 4.14.

Figure 4.14: Tree-structured barrier (a binary tree of Worker processes; figure taken from [And91])
In this scheme, the workers signal their
completion up the tree. They then wait for permission to continue; this will be broadcast
by the root with some multiprocessor architectures and signalled back down the tree in
others. Note that with this barrier there are three different kinds of synchronization logic:
in the leaf nodes, which only pass up a signal; in the interior nodes, which both wait for
and then pass up signals, and in the root node, which only waits for a signal. Similarly,
these different nodes also have different roles regarding broadcasting or passing down
the permission to complete. This approach leads to logarithmic execution time and also
scales well, because workers are signalling and broadcasting up the tree in parallel.
A final implementation note is that on a multiprocessor, two versions of the array are
typically maintained: one with the current values of the array and the other in which the
workers store the next iteration’s results. At the barrier, the role of each array is switched.
Also, we note that different parallel iterative algorithms vary in the way in which they
read the last iteration’s global data during a given iteration: some barriers have to use
all elements of the array on every iteration, while others require access to only a small
portion. Thus, to avoid confusion, in the Linda examples in this subsection we add a
comment indicating where the data is initialized and where it is read in each iteration.
Finally, some algorithms also have a reduction step after the barrier and then a second barrier
after this. (The reduction step is used, for example, to detect termination.) In some cases
this reduction step can also be combined with the synchronization step in the first barrier
(and sometimes even also with the reading of the data). We also omit this reduction phase
for brevity and clarity. However, it is fairly straightforward to implement these in both
Linda and FT-Linda.
procedure init_barrier()
    out(barrier_count, 0)
    # ... initialize both copies of the global array ...
    # Create worker(id:1..N) with local Linda's
    # process creation mechanism (omitted)
end init_barrier

Figure 4.15: Linda Shared Counter Barrier Initialization
Linda Barriers
A Linda barrier could reasonably be implemented with a shared counter, a coordinator
process, or a tree. For example, because of Linda’s associativity and blocking operations
there would not be memory contention on a shared counter; all workers would simply
block until they were allowed to proceed. To demonstrate this, a Linda barrier using a
shared counter is given in Figures 4.15 and 4.16; it uses a single counter tuple to record
the number of workers that have reached the barrier. A Linda implementation with a
coordinator process is similarly easy to envision, although it would still suffer from the
same problems as the multiprocessor version, namely the existence and computational
delay imposed by the extra process. Finally, a Linda barrier with a tree is likewise simple
to realize.
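As an illustration (ours, not taken from the dissertation's appendices), the per-iteration logic for an interior worker in a binary tree barrier could look as follows, where left(id) and right(id) are assumed helper functions giving the worker's children in the tree:

# interior worker 'id', after finishing its portion of iteration iter:
in(arrived, iter, left(id))       # wait for both children to arrive
in(arrived, iter, right(id))
out(arrived, iter, id)            # then signal my own arrival up the tree
in(proceed, iter, id)             # wait for permission from my parent
out(proceed, iter, left(id))      # and pass the permission back down
out(proceed, iter, right(id))

A leaf worker omits the first two ins and the last two outs, while the root omits its own out(arrived, ...) and in(proceed, ...) and instead deposits proceed tuples for its children once both have arrived.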
While the shared counter and coordinator process techniques can thus be implemented
with Linda, we can take advantage of Linda’s associativity and blocking operations to
devise a solution with similar performance but one that serves as a cleaner basis for a
fault-tolerant barrier. In this scheme, each worker produces a tuple on each iteration that
indicates it is ready to proceed to the next iteration; a worker waits for tuples from all
workers. The pseudocode for this solution is given in Figure 4.17. The initialization
consists of depositing the initial values for the array in TS, as well as creating the worker
processes.
Fault-Tolerant Barriers
The solution in Figure 4.17 is simple to use as a base for a fault-tolerant version. Indeed,
looking at Figure 4.17, it may not be apparent that the failure of a worker would require
any recovery at all. It certainly has nothing like the windows of vulnerability associated
with a distributed variable or a bag-of-tasks subtask. However, in this barrier example,
the failure of a worker does indeed require recovery because it must be resilient.
To elaborate, recall that with the bag-of-tasks paradigm, a given worker process did
not have to be resilient; if it failed, any subtask that was in its placeholder form simply
had to be recovered and later re-executed by another worker. The failed worker process
process worker(id)
    # Initialize iter, current, next
    iter := 1
    current := iter
    next := 1 - current
    # ... read in some or all of global array for iteration current ...
    # Compute next iteration, unless answer has converged
    while not converged(a, iter, epsilon) do
        # ... update my portion of global array for next iteration from my_array ...
        compute(my_array, id)
        # Wait at the barrier: get count and see if I'm last
        in(barrier_count, ?count)
        if ( count < (N - 1) ) then
            # Not all workers at barrier yet
            out(barrier_count, count + 1)        # Update count
            in(barrier_proceed, iter + 1)        # Wait until last done
        else
            # I'm the last to arrive at the barrier
            out(barrier_count, 0)                # Reset count for next time
            for i := 1 to (N - 1) do
                out(barrier_proceed, iter + 1)   # Let other workers proceed
            end for
        end if
        # Update iter, current and next
        iter := iter + 1
        current := next
        next := 1 - current
        # ... read in some or all of global array for iteration current ...
    end while
end worker

Figure 4.16: Linda Shared Counter Barrier Worker
procedure init_barrier()
    # ... initialize the global array ...
    # Create worker(id:1..N) with local Linda's
    # process creation mechanism (omitted)
end init_barrier

process worker(id)
    iter := 1
    # ... read in some or all of global array for next iteration ...
    # Compute next iteration, unless answer has converged
    while not converged(a, iter, epsilon) do
        # ... update my portion of global array for next iteration from my_array ...
        compute(my_array, id)
        out(ready, iter + 1, id, my_array)
        # Wait at the barrier
        for i := 1 to N do
            rd(ready, iter + 1, i, ?a[i])
        end for
        # Garbage collection; nobody could need it past the barrier
        if ( iter > 1 )
            in(ready, iter - 1, id, ?row_t)
        end if
        iter := iter + 1
        # ... read in some or all of global array for next iteration ...
    end while
end worker

Figure 4.17: Linda Barrier
procedure init_barrier()
    iter := 1     # First iteration for workers
    # ... initialize the global array ...
    # Create the monitor processes
    for host := 1 to num_hosts() do
        lpid := new_lpid()
        fid := new_failure_id()
        create(monitor, host, lpid, fid)
    end for
    # Create worker(id:1..N) and its registry tuple
    for id := 1 to N do
        lpid := new_lpid()
        host := id % num_hosts()
        out(TS_main, registry, host, id, iter)
        create(worker, host, lpid, id)
    end for
end init_barrier

Figure 4.18: FT-Linda Barrier Initialization
did not have to be reincarnated as long as its subtask was recovered, because it was not
performing a computation that was assigned specifically to it. This is not the case with
the worker in Figure 4.17, however. Here, for each iteration a given worker must update
its portion of the global array for the next iteration and produce a ready tuple. Thus,
if a worker fails, it must be reincarnated, i.e. it must continue executing exactly where it
failed.
To accomplish this, we maintain a registry tuple similar to the registry tuple in the
recoverable server of Section 4.1.2. This is used to record on which host a given worker is
executing. This way, if the host fails, it can be ascertained which workers have failed and
thus need to be reincarnated. However, in the case of the barrier example, this registry
tuple also needs to record on which iteration the worker is currently working, so that if it
fails, its reincarnation can do the correct computation and then proceed.
The FT-Linda solution outlined above is given in Figures 4.18, 4.19, and 4.20.
Figure 4.18 gives the pseudocode to initialize a barrier. It creates the initial global data,
the monitor processes, the registry tuples, and the workers. Figure 4.19 gives the worker.
The main difference from the worker in Figure 4.17 is that when it deposits the ready
tuple it also atomically increments the value of its current iteration in the registry tuple. In
the event of a worker failure, this action ensures that a given iteration's ready tuple will
be deposited by a worker exactly once. Also, when it begins, the worker does not assume
it should start with iteration 1. Rather, it reads in its current iteration, the one it should
start with, from its registry tuple. Finally, Figure 4.20 gives the monitor process. It is
process worker(id)
    # Initialize iter
    rd(TS_main, registry, ?host_t, id, ?iter)     # if reincarnation, iter may be > 1
    # ... read in some or all of global array ...
    # Compute next iteration, unless answer has converged
    while not converged(a, iter, epsilon) do
        compute(my_array, id)
        # ... update my portion of global array for next iteration from my_array ...
        # Atomically deposit ready tuple for next iteration & update my registry
        ⟨ out(TS_main, ready, PLUS(iter, 1), id, my_array)
          in(TS_main, registry, ?host, id, iter)
          out(TS_main, registry, host, id, PLUS(iter, 1))
        ⟩
        # Barrier: wait until all workers are done with iteration iter
        for i := 1 to N do
            ⟨ rd(TS_main, ready, PLUS(iter, 1), i, ?a[i]) ⇒ skip ⟩
        end for
        if ( iter > 1 ) then
            # Garbage collection on previous iteration
            ⟨ in(TS_main, ready, MINUS(iter, 1), id, ?row_t) ⟩
        end if
        iter := iter + 1
        # ... read in some or all of global array for next iteration ...
    end while
end worker

Figure 4.19: FT-Linda Barrier Worker
process monitor(fid)
    loop
        ⟨ in(TS_main, failure, fid, ?failed_host) ⇒ skip ⟩
        # Try to reincarnate all failed workers found on this host
        while (
            ⟨ inp(TS_main, registry, failed_host, ?id, ?iter) ⇒
                  out(TS_main, registry, my_host, id, iter)
            ⟩
        ) do
            # Create a new worker on this host
            lpid := new_lpid()
            create(worker, my_host, lpid, id)
            nap(some)     # crude load balancing
        end while
    end loop
end monitor

Figure 4.20: FT-Linda Barrier Monitor
very similar to the monitor process for the recoverable server given in Figure 4.4, except
that it must also include the current iteration in the registry tuple. Unlike the recoverable
server example, however, this monitor process assigns a new LPID to a reincarnation of
a failed worker. It does not need to use the same LPID, while the recoverable server’s
reincarnation had to be created with the same LPID to be able to access its state stored in
a private (and stable) TS.
The same general techniques can also be used with the other kinds of barriers described
above. In all cases, when a worker or coordinator process fails, it must be reincarnated
so that it starts at the proper location. This is accomplished with the use of a registry
tuple. A process must also use a placeholder tuple if it needs to perform a more general
atomic action than a single AGS allows. For example, a coordinator process would leave
a placeholder tuple for each synchronization tuple it received from a worker, while a
process in the tree barrier solution would leave a placeholder when it withdraws the first
signal tuple from a child. These placeholder tuples would then be moved or withdrawn
from TS_main at the final AGS for a given iteration, which would also increment the
registry tuple.
The complete source code for this example is in Appendix G.
Using Fault-Tolerant Barriers for Systolic-Like Algorithms
Barriers are typically implemented on multiprocessors using physically shared arrays,
although above we have shown how they can be implemented using Linda’s TS, a shared
memory that has been implemented on a wide range of architectures. Other kinds of
parallel iterative algorithms are similar to barriers in that all workers work on the next
iteration using the current values, then wait for all to complete the iteration before proceeding.
However, parallel iterative algorithms such as systolic-like algorithms use message
passing, not shared memory, to synchronize. (We use the term "systolic-like" rather than
"systolic" because the workers in systolic algorithms run in strict lock step, generally
enforced by hardware, while the workers here can get up to one iteration out of phase with
each other.) Problems that can be solved with such
algorithms include matrix multiplication, network topology, parallel sorting, and region
labelling [And91]. Fortunately, the FT-Linda barrier solution given above already uses
Linda’s blocking operations to synchronize, so its basic ideas apply directly to systolic-like
and other data flow algorithms that use message passing.
Consider a systolic-like algorithm to multiply arrays A and B to obtain the product
array C; a (non-fault-tolerant) Linda example is given in [Bjo92]. In this scheme, each
worker is responsible for computing a given part of the result matrix C. Portions of A and
B are streamed through the various workers so that, at a given step, each worker has the
portions of A and B that are to be multiplied with each other. The worker accumulates its
portion of C, and when done, deposits a tuple with this result. Thus, a worker iteration
consists of the following steps:
1. Withdraw the next submatrices of A and B from the appropriate workers upstream.

2. Multiply them, accumulating the result in its portion of C (kept in TS).

3. Deposit the submatrices of A and B for consumption by the appropriate workers downstream.
Note that this is very similar to a barrier worker’s iteration in Figure 4.19. The worker
here uses two inputs rather than one, but this difference is obviously cosmetic. Indeed, the
only significant difference is that the systolic-like worker actually consumes the tuples.
Thus, the only structural change to use Figure 4.19 with systolic-like multiplication is
to generate an in_progress tuple when withdrawing the submatrices, then atomically do
Steps 2 and 3 in one AGS while also updating the registry tuple and withdrawing the
in_progress tuples.
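One possible shape for a single step of such a worker is sketched below. The sketch is ours: the tuple formats, the routing helpers down_a(id) and down_b(id) (computed outside the AGS), and the choice to keep the worker's portion of C in TS so that a reincarnation can resume from it are all assumptions rather than part of the dissertation's design:

# Step 1: withdraw the incoming submatrices, leaving placeholders
⟨ in(TS_main, subA, id, step, ?a_part) ⇒
      out(TS_main, in_progress_A, my_host, id, step, a_part) ⟩
⟨ in(TS_main, subB, id, step, ?b_part) ⇒
      out(TS_main, in_progress_B, my_host, id, step, b_part) ⟩

# Step 2: the multiplication itself is performed outside any AGS
rd(TS_main, C_part, id, ?c_part)
new_c := c_part + a_part * b_part
next_a := down_a(id)
next_b := down_b(id)

# Steps 2 and 3 take effect in one AGS, which also advances the registry
⟨ in(TS_main, in_progress_A, my_host, id, step, ?a_part) ⇒
      in(TS_main, in_progress_B, my_host, id, step, ?b_part)
      in(TS_main, C_part, id, ?c_part)
      out(TS_main, C_part, id, new_c)
      out(TS_main, subA, next_a, PLUS(step, 1), a_part)
      out(TS_main, subB, next_b, PLUS(step, 1), b_part)
      in(TS_main, registry, ?host, id, step)
      out(TS_main, registry, host, id, PLUS(step, 1))
⟩
step := step + 1

A monitor process like the one in Figure 4.20 would additionally have to restore any in_progress_A and in_progress_B placeholders left by a failed worker to their subA and subB forms before (or when) reincarnating it.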
Finally, consider an optimization to this scheme. The key observation is that, in this
systolic-like scheme that uses a message-passing paradigm, each worker produces tuples
that are sent to a specific worker. The scheme above distinguishes a submatrix intended for
a particular worker by placing its identifier in the tuple. However, as shown in Section 5.2,
this can lead to severe contention. The problem is that the submatrix tuples being sent to all
workers are on the same hash chain, even though a given worker can possibly match only
a small fraction of them when withdrawing tuples intended for it. Unfortunately, the TS
operations must still check all tuples on those hash chains. This contention can be greatly
reduced by creating a stable and shared TS for each worker, where all submatrix tuples
intended for that worker will be deposited. This way, the submatrix tuples are spread out
among many TSs, and only submatrix tuples to be withdrawn by a given worker will be on
the hash chain in that worker’s TS. This eliminates the contention described above. Since
TSs require on the order of a hundred bytes of memory, also as shown in Section 5.2, this
scheme scales to a large number of workers.
4.3 Handling Main Process Failures
We have only been concerned so far in this dissertation with handling the failure of worker
and monitor processes. We now consider how to handle the failure of a main process, the
initial process in a Linda program that is created when the program is started. While the
details are very application-dependent, the same general techniques used in handling the
failure of a worker process can be applied to handle failures of a main process.
Main processes generally go through three phases: initialization, synthesis, and fi-
nalization. The initialization phase is relatively short, and includes creating monitor and
worker processes, creating and seeding tuple spaces, etc. We assume that the entity
that started the program will monitor its progress until the initialization phase is over
and restart the program if the main process fails before initialization is complete. The
synthesis phase involves any necessary reductions on the workers' outputs. No synthesis
may be necessary; for example, workers performing matrix multiplication deposit their
results directly into TS. If synthesis is necessary, then a set of identical synthesis processes
(and their monitor processes) can be created to perform it, so the failure of some of these
processes can be tolerated. Of course, these processes must not have any windows of
vulnerability, just like worker processes must not. An example of synthesizing results
in a fault-tolerant manner is given below. Finally, in the finalization phase, an application
generally reports the results of any synthesis to the entity that executed the program. If
this can be done through TS, then this reporting step is not needed; it is often done in the
synthesis phase. Sometimes, however, this is not sufficient, so the results of the synthesis
have to be reported in some other fashion, e.g. writing to a disk or to a console. In either
case, as assume that the entity that started the program will periodically ensure that the
finalization process has not failed and restart it if it has.
The manner in which a synthesizing process is made fault-tolerant is very similar to
that for a worker process. The main concern is not to leave any windows of vulnerability
where data that may be needed later is represented only in volatile memory. Sometimes a
synthesis operation can be done in a single AGS. In other cases, however, multiple AGSs
and a placeholder tuple are required. As an example, consider the problem of finding
the minimum cost (or distance) from among a number of searches (or matches). Here,
the (bag-of-tasks) workers remove data tuples, calculate the cost associated with the data
in the tuple, and deposit a cost tuple. The synthesis consists of finding the minimum
cost from among the cost tuples. This can be accomplished in the manner shown in the
example in Figure 4.21. There is no reason to leave a placeholder tuple here, because
there is no window of vulnerability: this synthesis can be accomplished in a single AGS
by using the MIN opcode. And, because there are no placeholder tuples, there is no need
for monitor processes to recover from the failure of a synthesis process executing this
code. Finally, note that the in in the body can never fail, because AGSs are executed
atomically and the min_cost tuple is always present in TS outside any AGS.

# Record the lowest cost found among the cost result tuples.
# Initialization phase deposited tuple (min_cost, INFINITY).
while not done() do
    ⟨ in(TS_main, cost, ?cost) ⇒
          in(TS_main, min_cost, ?min_cost)
          out(TS_main, min_cost, MIN(cost, min_cost))
    ⟩
end while

Figure 4.21: Simple Result Synthesis
For some synthesizing processes, however, the simple, single-AGS scheme in Fig-
ure 4.21 will not be sufficient, because they cannot perform the synthesis in a single AGS.
For example, if the synthesis above had to not only record the cost, but also the identity
of the data that resulted in that cost, then a single AGS is not powerful enough. In the
example in Figure 4.21, the MIN opcode was sufficient to allow the recording of the
lowest cost. There is no way, however, to record a corresponding identifier along with its
cost in a single AGS. The way to do this is to use multiple AGSs and in_progress tuples,
as shown in Figure 4.22. Here, the cost tuple to be synthesized is withdrawn, as well
as the min_cost tuple to which it is compared; in both cases in_progress tuples are left
as placeholders. The min_cost tuple is then replaced atomically along with the removal
of the in_progress tuples. Of course, a set of monitor processes would be required to recover
the cost and min_cost tuples from their placeholders in the event of a failure.
Note that, in practice, a synthesizer process would likely be more optimized than the
one in Figure 4.22. For example, it would typically record the lowest cost it had processed
so far in a local variable. If the cost discovered in the first AGS was not less than this, then
most of the rest of the code in Figure 4.22 would not be executed; the cost_in_progress
tuple would then simply be removed.
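A sketch of that optimization (ours), using the tuple formats of Figure 4.22 and a local variable local_min that is initialized to INFINITY:

⟨ in(TS_main, cost, ?cost, ?id) ⇒
      out(TS_main, cost_in_progress, host, cost, id)
⟩
if ( cost >= local_min ) then
    # this cost cannot improve the global minimum; just discard the placeholder
    ⟨ in(TS_main, cost_in_progress, host, cost, id) ⟩
else
    local_min := cost
    # ... continue as in Figure 4.22: withdraw min_cost, compare, and update ...
end if

Because the global min_cost is never larger than local_min, a cost that is not below local_min cannot become the new minimum.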
# Record the lowest cost found among the cost result tuples,
# and the ID associated with that cost.
# Initialization phase deposited tuple (min_cost, INFINITY, NULL_ID).
while not done() do
    ⟨ in(TS_main, cost, ?cost, ?id) ⇒
          out(TS_main, cost_in_progress, host, cost, id)
    ⟩
    ⟨ in(TS_main, min_cost, ?min_cost, ?min_id) ⇒
          out(TS_main, min_cost_in_progress, host, min_cost, min_id)
    ⟩
    if ( cost < min_cost ) then
        # Update the min tuple with cost and id
        ⟨ in(TS_main, cost_in_progress, host, cost, id)
          in(TS_main, min_cost_in_progress, host, min_cost, min_id)
          out(TS_main, min_cost, cost, id)
        ⟩
    else
        # Restore the min tuple to its previous form
        ⟨ in(TS_main, cost_in_progress, host, cost, id)
          in(TS_main, min_cost_in_progress, host, min_cost, min_id)
          out(TS_main, min_cost, min_cost, min_id)
        ⟩
    end if
end while

Figure 4.22: Complex Result Synthesis

4.4 Summary

This chapter illustrates the flexibility and versatility of FT-Linda in two major domains,
highly dependable computing and parallel applications. All examples demonstrate the
usefulness of the AGS, in combination with monitor processes, to recover from a failure.

The replicated server example shows how FT-Linda can be used to construct a service
with no fail-over time. It also demonstrates how distributed variables (the sequence tuple)
can be used to order events in a Linda system. It further illustrates the usefulness of AGS
disjunction. This replicated server is an example of the state machine approach.
The recoverable server example explains how to implement a service with FT-Linda
featuring no redundant server computations. It shows the usefulness of a private and
stable TS to store key state for later recovery. Also, it displays how FT-Linda’s oldest
matching semantics can be used to avoid a client’s starvation. This recoverable server is
an example of the primary/backup technique.
The transaction facility demonstrates how multiple AGSs can be combined to provide
a higher level of atomicity than what a single AGS gives the programmer. It requires only
one AGS to either commit or abort a transaction, and thus demonstrates the usefulness of
the tuple transfer primitives. This transaction facility is an example of the object/action
paradigm.
The divide-and-conquer example shows how FT-Linda can be used in a more dynamic
environment, where the size of subtasks can vary. It also demonstrates the utility of the
tuple transfer primitives.
The barrier example illustrates how Linda’s TS can be used to avoid problems endemic
in multiprocessor implementations of barriers. It then shows how FT-Linda can implement
fault-tolerant barriers in a simple and efficient manner. This technique also extends to
systolic and other similar parallel iterative algorithms.
Finally, the recovery from main process failures is discussed. The techniques for
achieving this are much like those used for worker processes.
CHAPTER 5
IMPLEMENTATION AND PERFORMANCE
A prototype implementation of FT-Linda has been built. All components have been
separately tested, and are awaiting the completion of a newer version of the Consul
communication substrate. The precompiler consists of approximately 15,000 lines of C
code, of which approximately 3,000 were added to an existing Linda compiler to handle the
extra constructs for FT-Linda. The FT-Linda runtime system consists of approximately
10,000 lines of C, not including the Consul communication substrate.
This chapter is organized as follows. It first gives an overview of the implementation.
Next, it discusses the major data structures. After that, it describes FT-LCC, the FT-Linda
C compiler, and the processing of atomic guarded statement (AGS) request messages
by the TS managers. The chapter then discusses restrictions on the AGS, followed by
initial performance results and optimizations. Finally, it describes future extensions to the
implementation.
5.1 Overview
The implementation of FT-Linda consists of four major components. The first is FT-LCC,
a precompiler that translates a C program with FT-Linda constructs into C generated code.
The second is the FT-Linda library, which is linked with the object file comprising the
user code. This library manages the flow of control associated with processing FT-Linda
requests and contains a TS manager for TSs that are local to that process. The third is the
TS state machine, which is an x-kernel protocol that sits beneath the user processes on each
machine. This protocol contains the replicated TS managers, which are implemented using
the state machine approach. Finally, the fourth part is Consul, which acts as the interface
between user processes and the TS state machines, and the network. This collection of
x-kernel protocols implements the basic functionality of atomic multicast, consistent total
ordering, and membership. It also notifies the FT-Linda runtime of processor failures so
that failure tuples can be deposited into the TS specified by the user application. While
the implementation has been designed with Consul in mind, we note that any system that
provides similar functionality (e.g., Isis [BSS91]) could be used instead.
The runtime structure of a system consisting of N host processors is shown in Fig-
ure 5.1. At the top are the user processes, which consist of the generated code together
with the FT-Linda library. Then comes the TS state machine, Consul, and the interconnect
structure. The prototype is based on workstations connected by a local-area network, al-
though the overall design extends without change to many other parallel architectures.

[Figure 5.1: Runtime Structure. Each of hosts 1..N runs user processes (generated code
plus the FT-Linda library) above a TS State Machine, which in turn sits above Consul and
the network interconnect.]

The
edges between components represent the path taken by messages to and from the network.
Providing this message-passing functionality and the rest of the protocol infrastructure is
the role of the x-kernel.
As noted, the implementation is currently nearing completion. All four parts have
been implemented and tested, with final integration awaiting the completion of a port of
Consul to version 3.2 of the x-kernel. The final version will run on DEC 240, HP Snake,
and other workstations under Mach, Unix, and stand-alone with the x-kernel.
5.2 Major Data Structures
There are two major categories of data structures in FT-Linda. The first is the single
(complicated) data structure used by the user process to communicate a request to the TS
managers. The second is the collection of data structures required to implement a TS.
These two categories are discussed in turn.
The request data structure contains the information required to process requests such
as TS operations in the user’s address space, and to execute a state machine command
at the appropriate TS managers. The most complex version is, of course, the request
data structure associated with an AGS. In addition to general status information, this data
structure contains an array of branch data structures for the branches of the AGS. This
branch data structure contains an op data structure for the guard and an array of such data
structures for the body. An op data structure contains copies of the TS handle(s) involved
with the operation, space for a global timestamp, the operator (e.g., in), and its tuple index.
It also contains the length of the operation’s data and information for each parameter. The
information for each parameter includes its polarity (i.e., actual or formal), the offset in
the data area of its actual or formal value (if any), and opcode arguments.

[Figure 5.2: Tuple Hash Table. Each entry of ts.tuple[tuple_index] heads a chain of tuples
with the same tuple index, e.g. ("foo",10) and ("bar",20,50).]
The request data structure also contains space for the value of any formal and actual
parameters. Field r_formal[] is assigned values by a TS manager whenever an in, inp,
rd, or rdp operation with a formal variable is executed; the formal values are later copied
from r_formal[] into the user's variables by the GC. Field r_actual[] stores the
values of actuals, and is assigned its values by the GC.
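To make this nesting concrete, the following is a minimal sketch of the request, branch, and op structures. The field names follow the ones mentioned in this chapter, but the array bounds and the o_timestamp and o_offset fields are illustrative assumptions; the actual C definitions appear in Appendix H.

    /* Hypothetical sketch of the AGS request structures; the real
     * definitions are in Appendix H and differ in detail. */

    #define MAX_PARAMS   8            /* illustrative bounds only */
    #define MAX_BODY_OPS 8
    #define MAX_BRANCHES 4

    typedef struct { int ts_index; /* ... TS attributes ... */ } ts_handle_t;

    typedef struct {                  /* one TS operation (or tuple)        */
        int         o_optype;         /* OP_IN, OP_OUT, ...                 */
        ts_handle_t o_ts[2];          /* TS handle(s) for the operation     */
        long        o_timestamp;      /* space for a global timestamp       */
        int         o_hash;           /* tuple index modulo hash table size */
        int         o_param[MAX_PARAMS];  /* param_t kind of each parameter */
        int         o_polarity;       /* actual/formal bit per parameter    */
        long        o_offset[MAX_PARAMS]; /* offset of value in data area   */
    } op_t;

    typedef struct {                  /* one branch of the AGS              */
        int  b_guard_present;
        op_t b_guard;
        op_t b_body[MAX_BODY_OPS];
        int  b_body_size;
    } branch_t;

    typedef struct {                  /* one AGS request                    */
        int      r_kind;              /* e.g. REQ_AGS                       */
        int      r_num_branches;
        branch_t r_branch[MAX_BRANCHES];
        char     r_formal[256];       /* formal values set by TS managers   */
        char     r_actual[256];       /* actual values filled in by the GC  */
    } request_t;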
Once the invocation from the generated code to the FT-Linda library has been made,
the control logic in the library routes the request appropriately. For example, if it involves
only a local TS, it is dealt with by the TS management code within the library itself, while
if it involves a replicated TS, it is multicast to the TS state machine protocols using Consul.
In the latter case, this is the hand-off point between the user process and the control thread
that carries the message through the x-kernel protocol graph to the network. As such,
the user process is blocked until the request has been completed. When the request is
completed, the request data structure is returned to the user process, where the generated
code copies the values for any formals to the corresponding variables in the user address
space.
Tuple spaces are implemented in two places, the FT-Linda library for local TSs and
the TS state machine for replicated TSs. The algorithms and data structures used in both
places are essentially identical. In each, a table of TS data structures is maintained. When
a request to create a new TS is processed, a new table entry is allocated; the index of
this entry is known as the TS index. The TS handle returned as the functional value of
ts_create contains this index, as well as the attributes of the tuple space. A subsequent
request to destroy a TS frees the appropriate table entry and increments an associated
version number. This number is used to detect future TS operations on the destroyed TS.
A TS itself is represented as a hash table of tuples, as shown in Figure 5.2. The
entries in the hash table are op data structures. This op data structure is thus used for both
a TS operation in the AGS request data structure and for a tuple in TS. The difference
between the two usages is that, for efficiency’s sake, the values of all actuals in an AGS
are stored in one field, r_actual[]. However, the TS manager matching a tuple in TS
cannot access this field, because the AGS request will have been recycled. (It is highly
desirable to recycle the AGS data structure once it has finished execution, rather than
leave it allocated until the last tuple that has data in r_actual[] has been removed
from TS.)
[Figure 5.3: Blocked Hash Table. Each entry of ts.blocked[tuple_index] heads a chain of
references to blocked AGS branches, e.g. branch 0 waiting on ("A", 10) and branch 1
waiting on ("B", ?i, ?j).]
Thus, when an op data structure is allocated for a tuple, extra room at the end is allocated
for field o_actual[]. The values of the actuals are then copied from r_actual[] in
the AGS into o_actual[] in the tuple.
The index into the hash table for a given tuple is simply the tuple index assigned by
the precompiler modulo the hash table size. Linda implementations (including FT-Linda)
generally try to ensure the table size is larger than the number of unique signatures (and
hence different values of tuple indices), so a particular hash chain will contain only tuples
from one signature. This reduces unnecessary contention, and is easy to accomplish
in virtually all cases, because few Linda programs use more than a small number of
signatures.
Also associated with each TS is another hash table used to store blocked AGS requests.
That is, at any given time, this table contains requests with guards of in or rd for which
no matching tuple exists. An example of such a table is shown in Figure 5.3. Here, the
blocked AGS request has two in guards: one waiting for a tuple named A with a tuple
index of 1 and the other waiting for a tuple named B with a tuple index of 3. Note that the
request itself is stored indirectly in the table because, in the general case, such an AGS
may have multiple guards and thus may be waiting for tuples with different signatures.
The C definitions of major FT-Linda data structures are given in Appendix H. A TS
data structure should have a hash chain for both tuples and blocked AGSs for each possible
tuple signature. (This is required for efficiency, not correctness.) However, because most
programs only use a few different TS signatures, and a hash chain takes little space, a TS
takes on the order of a hundred bytes of space.
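In the same hypothetical style, a tuple space itself reduces to a pair of hash tables plus a version number. The sketch below is illustrative only, with the chain element types left opaque; the real definitions are again those in Appendix H.

    #define MAX_HASH 64                       /* illustrative table size       */

    struct tuple;                             /* chain of stored tuples        */
    struct blocked;                           /* chain of blocked AGS stubs    */

    typedef struct {
        int             version;              /* bumped when TS is destroyed   */
        struct tuple   *tuple[MAX_HASH];      /* ts.tuple[tuple_index % MAX_HASH]   */
        struct blocked *blocked[MAX_HASH];    /* ts.blocked[tuple_index % MAX_HASH] */
    } ts_t;

    /* The hash index used for both tuples and templates. */
    static int ts_hash(int tuple_index) { return tuple_index % MAX_HASH; }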
5.3 FT-LCC
The phases involved in building an FT-Linda program are as follows:
1. Run the C preprocessor (cpp) on the C FT-Linda code.
2. Run FT-LCC on this to generate C generated code (GC).
3. Compile the GC with a C preprocessor and compiler.
4. Link the object modules with the FT-Linda library to produce the executable pro-
gram.

[Figure 5.4: FT-LCC Structure. The original LCC code passes a token stream and type
information to the additional FT-Linda code handling the new constructs; for each TS
operation, that code passes the operation's signature back to LCC and receives its tuple
index; the output of the whole precompiler is the generated code (GC).]
The first step is necessary because FT-LCC must process the code with the macro substitu-
tion and file inclusion completed. Next, FT-LCC generates the GC, which is subsequently
compiled and linked with the FT-Linda library to produce the executable program. Note
that the GC must be preprocessed in step 3, not just compiled, because it uses macros
(especially constants) defined in auxiliary FT-Linda include files that the GC specifies.
The remainder of this section on FT-LCC is organized as follows. First, the internal
structure of FT-LCC is described. A discussion of why FT-LCC is more complex than
LCC is given, followed by a description of the different kinds of parameters FT-LCC must
parse. This section concludes with an example of GC for a simple AGS.
FT-LCC Internal Structure
The FT-Linda precompiler, FT-LCC, is a derivative of the LCC precompiler [Lei89,
LRW91]. The internal structure of FT-LCC is shown in Figure 5.4. The original LCC code
handles everything not involving an FT-Linda construct, with the bulk of the additional
code being dedicated to handling the AGS. This code receives from LCC a token stream
and type information for those tokens, as shown in Figure 5.4. This stream is generally
output unchanged until the opening ’<’ of an AGS is detected. At this time, the AGS-
handling code parses the AGS and generates the code to implement the AGS. This GC
marshalls the request data structure, passes it to the FT-Linda library, and then unmarshalls
the AGS after its execution is finished. This unmarshalling consists mainly of assigning
values to formal variables.
For each TS operation such as in that is included in the AGS, FT-LCC passes the
signature of the op (its ordered list of types) to the LCC code, which returns the
unique tuple index for that signature. This index is used in the GC to calculate the hash
table entry for the index by taking its residue modulo the hash table size. This hash index
is then stored in the AGS request structure and subsequently used by the TS managers for
matching purposes.
90
FT-LCC Compared to LCC
FT-LCC is more complex than LCC mainly because of the requirement that operations in
an AGS be executed atomically. The data in an AGS request structure is produced and
consumed in the following phases (which are elaborated further below):
1. When the AGS is executed by the user code, the GC marshalls the request data
structure with the information described above. The information placed into the
request is the structure of the given AGS (number of branches, information about
each op, etc.) and the values of actuals that are known. These include constants,
which are known at compile time, and variables that are not formals also found
earlier in the same branch. These values are placed into field ractual[] of the
request data structure. We call this phase marshall time.
2. The local and replicated TS Managers process the request. We call this phase TS
manager time. Here the values of actuals in r_actual[] are used, and the values
of formal variables are placed into r_formal[] from a matching tuple. If a formal
variable appears later in the branch as an actual variable, then its value is read from
r_formal[].
3. After the processing of the AGS request by the TS managers is finished, the FT-
Linda library returns control to the GC. The GC then assigns the formal variables
set by the AGS with the values from r_formal[]. We call this phase formal
assignment time or unmarshalling time.
As an example to help motivate why FT-LCC is more complicated than LCC, consider
the following Linda sequence for a distributed variable:
    in(count, ?count)
    out(count, count + 1)
In a Linda program, the in and out operations are separate invocations to the Linda
runtime system. Thus, even in a distributed implementation of Linda, the value count + 1
would be known at marshall time and thus disseminated appropriately along with the
other information needed to implement the out operation in TS. However, consider the
FT-Linda equivalent to the above fragment:
    < in(TS_main, count, ?count) =>
        out(TS_main, count, PLUS(count, 1)) >
Here, count is an actual variable in the out. However, its value is not known until TS
manager time, when the in in the AGS is processed at the TS state machines and its value
is placed into r_formal[]. FT-LCC thus must note whether an actual's value is known
by marshall time or whether its value is not known until TS manager time, because the
GC and TS managers must process these cases differently.

    Case   Tokens             param_t value    Polarity
     1     typename           P_TYPENAME          1
     2     ? typename         P_TYPENAME          0
     3     ? varname          P_FORMAL_VAR        0
     4     varname            P_VAL               1
     5     varname            P_FORMAL_VAL        1
     6     value              P_VAL               1
     7     OPCODE_i(args)     P_VAL               1
     8     OPCODE_i(args)     P_OPCODE_i          1

    Table 5.1: FT-Linda Parameter Parsing
FT-LCC Parsing Cases
FT-LCC can ascertain whether or not the value of an actual will be known at marshall
time by the actual’s syntax and by recording which variables have been used as formal
variables in the current branch. Recall that FT-Linda does not allow function calls or
expressions as such arguments; only constants, variables, and opcodes may appear. The
specific cases encountered when parsing an FT-Linda operation are given in Table 5.1.
FT-LCC has to track whether a variable has been used in a given branch in a number of
these cases, as described below. For each parameter, then, it tracks the kind of parameter
it is with values of type param_t, as well as the polarity, which indicates whether the
parameter is a formal (polarity 0) or an actual (polarity 1).
The specific cases in Table 5.1 are as follows:
1. A typename used as an actual.
2. A typename used as a formal.
3. A formal variable. This will be noted so that further usage of this variable in the
branch can be diagnosed properly; specifically, to distinguish between Cases 4 and 5
and between Cases 7 and 8. When a matching tuple is found for this operation, the
value from the corresponding field in the matching tuple is placed into r_formal[]
if it is an in, rd, inp, or rdp (i.e., not for out, move, or copy). This value is used
by the GC at formal assignment time to set the formal variable's value. Also, this
value in r_formal[] is used if the variable is used later in the branch as an actual
variable (i.e., Case 5).
4. A variable used as an actual that was not a formal previously in the branch. The
value of this actual is stored into r_actual[] by the GC at marshall time.
92
5. A variable used as an actual that was a formal previously in the branch. The value
for this actual is not known until TS manager time, when the TS operation in which
the variable was a formal is performed. Its value is fetched from r_formal[] and
used for the operation.

6. An actual whose value is specified by a literal value. This value is known at compile
time and is stored into r_actual[] by the GC when marshalling the request.
7. An opcode (e.g., PLUS(count, 1)) whose parameters' values are all either Case 4 or
Case 6, i.e., not Case 5. This means that the values of these parameters are all known at
marshall time. At this time, then, the GC will evaluate the opcode and place its
result into r_actual[], and indeed it will be treated just like an actual from Cases 4 or 6.
Each opcode has its own value of type param_t. Current opcode values are
P_OP_MIN, P_OP_MAX, P_OP_MINUS, and P_OP_PLUS, corresponding to opcodes
MIN, MAX, MINUS, and PLUS, respectively.
8. An opcode with at least one parameter of Case 5. The values of the parameters are
thus not all known when the GC is marshalling the AGS request. Thus, the opcode
will have to be evaluated by the TS manager that executes this operation.
Cases 4–8 may optionally be preceded with a C type cast, i.e., a type surrounded by
parentheses. Also, as denoted by the polarity column in Table 5.1, Cases 2 and 3 are
formal parameters while the others are actual parameters.
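As a rough illustration of this bookkeeping, the sketch below shows one way the kind of a non-opcode parameter could be chosen. The enumeration values mirror Table 5.1, but the function and argument names are hypothetical rather than FT-LCC's actual code.

    /* Hypothetical enumeration of the parameter kinds from Table 5.1. */
    typedef enum {
        P_TYPENAME, P_FORMAL_VAR, P_VAL, P_FORMAL_VAL,
        P_OP_MIN, P_OP_MAX, P_OP_MINUS, P_OP_PLUS
    } param_t;

    /* Sketch of the classification of a non-opcode parameter.
     * 'is_formal' is true when the token is preceded by '?', 'is_typename'
     * when it names a type, and 'was_formal_earlier' when the same variable
     * already appeared as a formal earlier in this branch. */
    param_t classify_param(int is_formal, int is_typename,
                           int is_varname, int was_formal_earlier)
    {
        if (is_typename)                       /* Cases 1 and 2 */
            return P_TYPENAME;
        if (is_formal && is_varname)           /* Case 3 */
            return P_FORMAL_VAR;
        if (is_varname && was_formal_earlier)  /* Case 5: value not known  */
            return P_FORMAL_VAL;               /* until TS manager time    */
        return P_VAL;                          /* Cases 4 and 6 */
    }

Cases 2 and 3 (the two kinds of formals) are assigned polarity 0; every other case is assigned polarity 1, matching the last column of Table 5.1.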
As an example, the essential portion of the GC produced for the following distributed
variable update:
    < in(TS_main, count, ?count) =>
        out(TS_main, count, PLUS(count, 1)) >
is given in Figures 5.5 and 5.6. Figure 5.5 gives the skeleton for the GC. It contains four
main phases:
1. Initialize fields that only have to be initialized once (this will be explained further
below in the section on optimizations). The innermost part of this section has been
elided from Figure 5.5, and is given in Figure 5.6.
2. Initialize fields that have to be initialized each time the AGS is executed. In this
case, only the address of count need be initialized here.
3. Pass the request data structure to the FT-Linda library, which will pass it to the TS
managers.
    {   /* start of AGS 1 statement starting at count.c:28 */
    #define AGS_NUM 1
        register int gc_i; static int gc_init_done = 0;
        /* other variables defined local to this AGS scope omitted for brevity */
        if (!gc_init_done) {
            /* Initialize the one-time fields */
            register op_t *gc_op;               /* current op within br */
            gc_init_done = 1;
            strncpy(gc_req->r_filename, "count.c", MAX_FILENAME);
            gc_req->r_filename[MAX_FILENAME] = '\0';
            gc_req->r_starting_line = 28;
            gc_req->r_kind = REQ_AGS;

            /* branch #0 pre-processing */
            /* .... See Figure 5.6 */
            /* end of branch 0 */

            /* request post-processing */
            gc_req->r_num_branches = 1;
            gc_req->r_next_offset = 8;          /* r_next_offset aligned 4 ==> 8 */
            rts_request_init(gc_req);
        } /* !gc_init_done stuff initialized once */

        /* These have to be filled in with each AGS call, not just once. */
        gc_formal_ptr->f_addr[0][0] = (void *) &(count);

        /* Pass the request to the FT-Linda library, which will pass it to the
           TS managers. */
        ftlinda_library((void *) gc_req, &gc_reply, AGS_NUM);

        /* Code to fill in formal variables from
           gc_formal_ptr->f_addr[branch][formalnum] omitted. */
    #undef AGS_NUM /* 1 */
    }   /* end of AGS 1 statement starting at count.c:28 */

    Figure 5.5: Outer AGS GC Fragment for count Update
    /* branch #0 pre-processing */
    gc_br = &(gc_req->r_branch[0]);

    /* guard for branch 0 preprocessing */
    gc_op = &(gc_br->b_guard);
    gc_br->b_guard_present = TRUE;
    gc_op->o_optype = OP_IN;
    memcpy(&(gc_op->o_ts[0]), &(TSmain), sizeof(ts_handle_t));  /* copy TS handle */

    /* Parameter 0, case 1 */
    gc_op->o_param[0] = P_TYPENAME;

    /* Parameter 1, case 3 */
    gc_op->o_param[1] = P_FORMAL_VAR;
    gc_op->o_idx[1] = 0;             /* count is formal 0 for this branch (#0) */
    gc_br->b_formal_offset[0] = 0;   /* formal #0 (count) is at r_formal[0..3] */

    /* guard post-processing */
    gc_op->o_polarity = 0xfffffffd; gc_op->o_arity = 2;
    gc_op->o_type = 0; gc_op->o_hash = 0;       /* ( 0 & (MAX_HASH-1)) */

    /* body[0] preprocessing */
    gc_op = &(gc_br->b_body[0]);
    gc_op->o_optype = OP_OUT;
    /* read in the 1 TS handle for out */
    memcpy(&(gc_op->o_ts[0]), &(TSmain), sizeof(ts_handle_t));

    /* Parameter 0, case 1 */
    gc_op->o_param[0] = P_TYPENAME;

    /* Parameter 1, case 8 */
    gc_op->o_param[1] = P_OP_PLUS;
    /* Some complicated code to describe PLUS(count,1) is omitted
       for clarity. It has to note that opcode parameter count is Case 5
       and thus not known until TS manager time,
       while opcode parameter '1' is known at marshall time. */

    /* body[0] post-processing */
    gc_op->o_polarity = 0xffffffff; gc_op->o_arity = 2;
    gc_op->o_type = 0; gc_op->o_hash = 0;       /* ( 0 & (MAX_HASH-1)) */

    /* branch #0 post-processing */
    gc_br->b_body_size = 1;
    /* end of branch 0 */

    Figure 5.6: Inner AGS GC Fragment for count Update
4. Fill in the formal variables in the code (in this example, only count) from field
r_formal[] in the request data structure.

Finally, Figure 5.6 contains the code to marshall the op data structures involved with this
AGS.

[Figure 5.7: AGS Request Message Flow. The ten numbered steps described in Sec-
tion 5.4.1 appear as arrows between the generated code, FT-Linda library, TS state
machine, and Consul on each of hosts 1..N.]
5.4 AGS Request Processing
The processing of an AGS request is the most fundamental and important function the
runtime system performs. We will discuss it in a general fashion first, then look at specific
examples of processing particular AGSs.
5.4.1 General Case
Atomic guarded statements are clearly the most complicated of the extensions that make
up FT-Linda. They include provisions for synchronization, guarantee atomicity, and allow
TS operations in which multiple TSs are accessed. To demonstrate how these provisions
impact the implementation, here we discuss the way in which requests generated by such
statements are handled within the implementation. This discussion also serves to highlight
how the components of the system interact at runtime.
The steps involved in processing a generic AG statement are illustrated in Figure 5.7.
These can be described as follows.
1. The generated code fills in the fields of the request data structure that describes the
AGS, and then invokes the FT-Linda library.
2. The code for managing local TS within the library executes as many of the TS ops
in the AGS as possible. If such an operation withdraws or reads a tuple, the values
for any formals in the operationare placed in the request data structure; this ensures
that later operations that access these formals have their values. Processing of this
AGS stops if a local TS operation is encountered that depends on data or tuples
from a replicated TS operation earlier in the statement; more on this below.
3. The AGS request is submitted to Consul’s ordered atomic multicast service.
4. Consul immediately multicasts the message. Lost messages are handled transpar-
ently to FT-Linda by the multicast service within Consul.
5. The message arrives at all other hosts.
6. Some time later Consul passes the message up to the TS state machine. The order
in which messages are passed up is guaranteed to be identical on all hosts.
7. Each TS state machine executes all TS operations in the AGS involving replicated
TSs. As in step 2, if such an operation withdraws or reads a tuple, the values for
any formals in the operation are placed into the request data structure. If the request
has blocking guards with no matching tuples, then it is stored in the blocked hash
table until a matching tuple becomes available.
8. The TS state machine on the host that originated the AGS returns its request data
structure to the FT-Linda library code. Note that this step and the remaining ones
are only executed on the processor from which the AGS originated, because the
replicated TSs are now up to date.
9. The library code managing local TSs executes any remaining TS operations in the
AGS request. The original invocation from the generated code to the FT-Linda
library then returns with the request data structure as the result.
10. The generated code copies values for formals in the AGS into the corresponding
variables in the user process. The process can now execute the next statement after
the AGS.
Thus, the processing of an AGS can be viewed as three distinct phases: processing of
local TS operations, then dissemination and processing of replicated TS operations, and
finally, any additional processing of local TS operations. This paradigm is responsible
for the restrictions on the way in which TS operations are used in the body of an AGS
that were mentioned in Section 3.2.1. For example, it would be possible to construct
an example in which the data flow between TS operations would dictate going between
local TSs and replicated TSs multiple times. Our experience is that such situations do not
occur in practice, and so such uses have been prohibited. These and other restrictions are
discussed further in Section 5.5.
Finally, we note that all the steps above may not be needed for certain AGS requests.
For example, if the AGS statement does not involve replicated TSs, then the request will
not be multicast to the TS state machines, and therefore steps 3-9 above will not be
executed. As another example, if the request consists solely of out operations, the user
process will not wait for a reply.
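The overall control flow in the library can be pictured with the following sketch. The function names are hypothetical and many details (blocking guards, disjunction, the write-only case) are elided, so this is an illustration of the three phases rather than the actual runtime code.

    /* Hypothetical sketch of the FT-Linda library's handling of one AGS. */
    typedef struct request request_t;                 /* AGS request            */
    int  local_ts_execute(request_t *, int first_op); /* hypothetical helpers   */
    int  request_uses_replicated_ts(request_t *);
    void consul_multicast(request_t *);
    int  wait_for_state_machine_reply(request_t *);

    void ftlinda_library_ags(request_t *req)
    {
        /* Phase 1: execute as many local-TS operations as possible,
         * recording any formals in the request's r_formal[] area. */
        int next_op = local_ts_execute(req, 0);

        if (request_uses_replicated_ts(req)) {
            /* Phase 2: disseminate the request with a single totally
             * ordered atomic multicast; every TS state machine executes
             * the replicated-TS operations in the same order.  The caller
             * blocks until the originating host's state machine returns
             * the request (the write-only optimization is elided here). */
            consul_multicast(req);
            next_op = wait_for_state_machine_reply(req);
        }

        /* Phase 3: execute any remaining local-TS operations, then return
         * so the GC can copy formals to the user's variables. */
        local_ts_execute(req, next_op);
    }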
5.4.2 Examples
To make the processing of AGS requests more concrete, this section examines some
specific examples in detail. In the following, let scratch_tsidx be the index of a local TS
TS_scratch, main_tsidx be the index of the replicated TS TS_main, and tidx be the tuple
index of the tuple or template of the operation in question. The tasks performed by the
generated code in each case are identical to the above, and so are omitted.
Local Case
First consider an example that involves only a local TS:
    < true =>
        out(TS_scratch, foo, i)
        out(TS_scratch, foo, j)
        in(TS_scratch, foo, ?k) >
The generated code passes the request to the FT-Linda library where local TSs are imple-
mented. To perform the out operations, this code creates a tuple using the values from
the operations. It then attaches the tuple to the appropriate hash chain in the tuple hash
table, i.e., TS[scratch_tsidx].tuple[tidx]. The in is then executed. In doing so, the oldest
matching tuple in TS will be withdrawn; in this case, it would be the tuple deposited by
the first out in the AGS, assuming no matching tuples existed prior to execution of this
statement. After all operations in the body have been executed, the request data structure
is returned to the generated code, where the value for k is copied into k's memory location
in the user process’s address space.
Single Replicated Operation
Consider now a lone TS operation on a replicated TS:
    < in(TS_main, foo, i, ?j) => skip >
After the request data structure is passed to the FT-Linda library code, it is immediately
multicast to all TS state machines. Upon receiving the message, each state machine first
98
checks for a matching tuple in the tuple hash table entry TS[main_tsidx].tuple[tidx]. If
. If
this entry is not empty, the first matching one on this list—that is, the oldest match—is
dequeued. If no such matching tuple is found, then the request is stored on the blocked
queue for the guard, TS[main_tsidx].blocked[tidx]. In either case, once a matching
tuple arrives, the in is executed, and the matching tuple used to fill in the request data
structure with the value of j. The state machine on the originating host then returns the
data structure back to the user process.
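A sketch of this match-or-block step as it might appear inside a TS state machine is shown below. All type and helper names are hypothetical; the point is only the oldest-match rule and the blocked-table fallback.

    /* Hypothetical sketch of handling a blocking guard at a TS state machine. */
    struct tuple;  struct request;  struct op;  struct ts;

    struct tuple *oldest_match(struct ts *, int tidx, struct op *guard);
    void          withdraw(struct ts *, int tidx, struct tuple *);
    void          copy_formals(struct request *, struct op *, struct tuple *);
    void          execute_body(struct ts *, struct request *);
    void          enqueue_blocked(struct ts *, int tidx,
                                  struct request *, struct op *guard);

    void handle_guard(struct ts *ts, struct request *req,
                      struct op *guard, int tidx)
    {
        /* Chains are scanned in arrival order, so the OLDEST matching tuple
         * is chosen; every replica therefore selects the same tuple. */
        struct tuple *t = oldest_match(ts, tidx, guard);

        if (t != NULL) {
            withdraw(ts, tidx, t);           /* for an in guard; rd leaves it */
            copy_formals(req, guard, t);     /* fill in r_formal[] values     */
            execute_body(ts, req);           /* body operations never block   */
        } else {
            enqueue_blocked(ts, tidx, req, guard);  /* wait for a matching out */
        }
    }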
Both Local and Replicated Tuple Spaces
Now consider a case involving both local and replicated TSs:
    < true =>
        in(TS_scratch, foo, ?i, 100)
        in(TS_main, bar, i, ?j)
        rd(TS_scratch, foo, j, ?k) >
The request data structure is first passed to the code implementing local TSs in the FT-
Linda library. As many operations as possible are then executed, which in this case is only
the first in; the subsequent local rd cannot be executed because it depends on the value
for j, which will not be present in the request data structure until later. In processing this
local in, the new value for i retrieved from the matching tuple is copied into the request
data structure. Of course, neither this in nor any other operation in the body may block.
Next, the request is passed to Consul, which transmits it by multicast to all machines
hosting copies of the TS. Some time later each TS state machine gets the request message.
At this point, each state machine removes the oldest tuple that matches the second in and
updates the value for j in the request. Note that, to find this match, the value used for i
is taken from the request data structure because its value was assigned earlier within the
AGS.
Following execution of replicated TS operations, the remaining local TS operation is
performed on the host on which the AGS originated. To do this, the request data structure
is first passed back to the user process. The rd operation is then executed; again, the
value of j used for matching is taken from the request data structure. The value of the
matching tuple is used to fill in the value for k before the request data structure is returned
to the generated code. There the new values of i, j, and k are copied from the request data
structure into their respective variables in the user process’s address space.
Move Operations
The move operation is treated as a series of in and out operations, as illustrated by the
following example:
    < true =>
        move(TS_scratch, TS_main)
        in(TS_main, foo, ?i) >
In this example, the generated code invokes the entry point in the FT-Linda library, which
in turn invokes the local TS code. There, all tuples in TS_scratch are removed, and the
move is replaced in the request data structure by an out(TS_main, t) for each such tuple t
in TS_scratch. When the TS state machines receive this request, they execute these out
operations and then the final in.
If the move had been a copy instead, the only difference would be that the tuples are
copied from TS_scratch rather than removed. Templates in such tuple transfer operations
are handled using the same matching mechanism as for normal TS operations.
AGS Disjunction
Consider an AGS involving disjunction, as in the following:
    < in(TS_main, ping, 1) => ...
      or
      ...
      in(TS_main, ping, n) => ... >
The actions taken here are similar to earlier examples until processing reaches the TS state
machines. When the state machines receive this request, they find the oldest matching
tuple for each guard. If no such tuple exists, a stub for each branch is enqueued on
TS_main's blocked hash table. If there are matching tuples, the oldest among them is
selected and the corresponding branch processed as described previously. Note that the
oldest matching semantics implemented by FT-Linda are important here because it is this
property that guarantees all state machines choose the same branch.
Unblocking Requests
An out operation may generate a tuple that matches the guards for one or more blocked
requests. Consider the following.
    < true =>
        out(TS_main, foo, 100, 200) >
First, the tuple generated by the out is placed at the end of the appropriate hash chain in
the TS data structure, i.e., the chain starting at TS[main_tsidx].tuple[tidx]. Then, the
TS state machines determine if there are matching guards stored on the analogous hash
chain in the blocked hash table, i.e., TS[main_tsidx].blocked[tidx]. If so, they take them
in chronological order and schedule any number of rd guards and up to one in guard to
be considered for execution, along with their bodies, after the current AGS is completed.
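This wake-up policy can be sketched as follows; the types and helpers are again hypothetical, and only the "any number of rd guards, at most one in guard" rule is being illustrated.

    /* Hypothetical sketch of the wake-up policy applied after an out. */
    struct tuple;
    struct op;

    typedef struct blocked {
        struct blocked *next;        /* chain kept in chronological order */
        struct op      *guard;       /* the blocked in or rd guard        */
    } blocked_t;

    int  guard_matches(struct op *guard, struct tuple *t);
    int  guard_is_rd(struct op *guard);
    void schedule_after_current_ags(blocked_t *b);

    /* Any number of rd guards may be woken, but at most one in guard,
     * since an in will withdraw the newly deposited tuple. */
    void wake_blocked(blocked_t *chain, struct tuple *new_tuple)
    {
        int in_scheduled = 0;
        for (blocked_t *b = chain; b != NULL; b = b->next) {
            if (!guard_matches(b->guard, new_tuple))
                continue;
            if (guard_is_rd(b->guard))
                schedule_after_current_ags(b);
            else if (!in_scheduled) {
                schedule_after_current_ags(b);
                in_scheduled = 1;
            }
        }
    }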
5.5 Rationale for AGS Restrictions
Now that FT-Linda’s language features, usage, and implementation have been examined,
we can better motivate the design decisions leading to the AGS restrictions mentioned
above. These restrictions involve dataflow, blocking in the body, expressions and function
calls as parameters, and conditional execution in the body.
5.5.1 Dataflow Restrictions
As noted in Section 5.4.1, the three-phase process (Steps 2, 7, and 9) of executing TS
operations in atomic guarded statements leads to restrictions based on data flow between
formals and actuals in the statement. Informally, any AGS that cannot be processed in
these three steps is not allowed. Note that these steps involve local, replicated, and local
TSs, respectively.
Following is an example of an AGS that does not meet these criteria and is therefore
disallowed:
    < true =>
        in(TS_scratch, foo, ?i, 100)
        in(TS_main, bar, i, ?j)
        rd(TS_scratch, foo, j, ?k)
        out(TS_main, bar, k, ?l) >
The data flow here, from local to replicated to local and back to replicated, violates the
three-phase processing. If this were permitted, efficiency would suffer, and handling
failures and concurrency would be more complicated. Efficiency would suffer because it
is not possible to process this AGS with only one multicast message to the replicated TS
managers. Handling failures is also more complicated: in this case, the host with TS_scratch
could fail between the in and rd involving TS_scratch. Since TS_scratch would thus no
longer be available, the replicated TS managers would have to be able to undo the effects
of the in on TS_main to cancel the AGS. Further, to implement this more general AGS
would add complexity to the replicated TS managers. They would have to take additional
measures to ensure that this AGS appears to be atomic to other processes, i.e., so that no
other AGS could access TS_main between the in and the out operations.
The exact dataflow restrictions depend on the particular operation, but
are based on whether the compiler and runtime system can somehow implement the AGS
in the three phases. For example, the following code from Figure 3.4 is permissible:
    < in(TS_main, in_progress, my_hostid, subtask_args) =>
        move(TS_scratch, TS_main) >
In this case, the move is converted in step 2 into a series of in operations that read from
TS_scratch and corresponding out operations that deposit into TS_main. Matching tuples
will therefore be removed from TS_scratch in step 2 and then added to TS_main in step 7
after the guard is executed.
5.5.2 Blocking Operations in the AGS Body
No operation in the body of an AGS is permitted to block, as discussed in Section 3.2.1.
With this restriction it is much simpler to implement the AGS’s atomicity in the presence
of both failures and concurrency. Once a guard has a matching tuple in TS, then both the
guard and the body can be executed without the need to process any other AGS. This is a
simple and efficient way to provide atomicity with respect to concurrency.
Allowing the body to block causes many of the same sorts of difficulties described
above with regard to the three-phase rule. Indeed, the code fragment given above that
violated this rule did block between the second in and the rd, as far as the replicated TS
managers are concerned. That is, they had to stop processing after the second in until the
rd had been performed in
T S scr atch
. The difficulties this caused are very similar to the
difficulties causes by allowing blocking in the body. In both cases, more messages are
required, and the TS managers have to do more work to be able to handle failures and
concurrency in the middle of an AGS.
5.5.3 Function Calls, Expressions, and Conditional Execution
FT-Linda does not allow a function call or an expression to be a parameter to a TS
operation, as mentioned in Chapter 3 and described further in Section 5.3. It also does not
allow any sort of conditional execution within an AGS, apart from which guard is chosen
in an AGS. The reasons for these restrictions will now be described in turn. A common
denominator in these restrictions is that their absence would make FT-Linda harder to
understand, program, and implement.
A parameter to an FT-Linda operation may not contain a function call or an expression,
as shown in Table 5.1 in Section 5.3; the only form of computation allowed in an AGS
is the opcode. Allowing a function call in an AGS would be allowing an arbitrary
computation inside a critical section. This would degrade performance, because each TS
manager could not process other AGSs while this computation was taking place. Also,
any such functions would have to have restrictions on them; for example, they would
have to be free of side-effects, and they could not reference pointers. This is required
to maintain reasonable semantics, because the functions would be executed on all the
machines hosting the TS replicas, rather than just on the user’s host. Similarly, any
expressions would have to have restrictions on them, because some expressions would
have to be evaluated in a distributed fashion with values obtained from the replicated TS
managers. While there may be a way to define a reasonable usage of expressions and
functions in FT-Linda parameters, we feel it would be difficult to explain cleanly. It would
also be difficult to rationalize why a slightly more general form should not be permitted.
Similarly, one could imagine permitting some form of loops or conditional statement
in an AGS. To be useful, however, many ways one could envision using these would
require some form of variable assignment. For example, variable assignment would be
used to store a return value from inp, if it were allowed in the body, something that
makes sense if one is to allow variable assignment. It thus would become quite difficult
to add such conditional statements in ways that could be cleanly described, efficiently
implemented, and yet did not beg for more functionality.
5.5.4 Restrictions in Similar Languages
Other researchers have found it necessary or at least desirable to impose restrictions
similar to those discussed above. As mentioned in Section 3.4, one optimization in the
Linda implementation described in [Bjo92] collapses an in and an out into one operation at
the TS manager. However, in order for the compiler to be able to apply this optimization,
the in and out have to use only simple calculations very similar to FT-Linda’s opcodes.
Of course, this is applicable to many common Linda usages, most notably a distributed
variable.
A second example is Orca, a language that is useful for many of the same kinds of
applications [BKT92, Bal90, TKB92, KMBT92]. The language is based on the shared-
object model and has been implemented on both multiprocessors and distributed systems.
An Orca object consists of private data and external operations that it exports. Operations
consist of a guard (a boolean expression) and a series of statements that may modify the
object’s private data. An operation is both atomic and serializable, as is FT-Linda’s AGS.
To achieve these semantics, Orca's designers have placed some restrictions similar to
those described above for FT-Linda. For example, only the guard may block. Also, the
guard must be free of side effects. These restrictions are crucial in allowing the reasonable
implementation of the atomicity properties of Orca’s operations.
5.5.5 Summary
The above restrictions allow the AGS to be implemented with reasonable semantics and
performance in the presence of failures and concurrency. And, as we have demonstrated
in the examples in Chapters 3 and 4, these restrictions do not appear to adversely affect
FT-Linda’s usage; they still allow FT-Linda to be useful for a variety of applications.
Finally, other researchers have also found it necessary to impose similar restrictions to
achieve similar goals.
5.6 Initial Performance Results
Some initial performance studies have been done on the FT-Linda implementation. As
noted, the runtime system has not yet been merged with Consul, so the measurements
capture only the cost of marshalling the AGS, performing its TS operations at the TSs
involved, and then unmarshalling the AGS. In the tested version, the control flow from the
library to the state machine was implemented by procedure call rather than the x-kernel.
As such, only one replica of the TS state machine was used.
Table 5.2 gives timing figures for a number of different machines. The first result
column is for an empty AGS, while the next gives the cost of incrementing a distributed
variable. Subsequent columns give the marginal cost of including different types of in
or out operations in the body. We note that the i386 figures are comparable to results
reported elsewhere [Bjo92]. This is encouraging for two reasons. First, FT-Linda is largely
unoptimized, while the work in [Bjo92] is based on a highly optimized implementation.
Second, we have augmented the functionality of Linda, not just reimplemented existing
functionality.
These figures can be used to derive at least a rough estimate of the total latency of an
AGS by adding the time required by Consul to disseminate and totally order the multicast
message before passing it up to the TS state machine. For three replicas executing on
Sun-3 workstations connected by a 10 Mb Ethernet, this dissemination and ordering time
has been measured as approximately 4.0 msec [MPS93a]. We expect this number to
improve once the port of Consul to a faster processor is completed.
We note that even these relatively low latency numbers overstate the cost involved
in some ways. A key property of our design is that TS operations from an AGS in one
user process on a given processor can be executed by the TS state machine while those
from other processes on the same processor are being disseminated by Consul. This
concurrent processing means that, although the latency reflects the cost to an individual
process, the overall computational throughput of the system is higher because other
processes can continue to make progress. In other words, the latency does not necessarily
represent wasted time, because the processor can be performing user-level computations or
disseminating other AGSs during this period. To our knowledge, this ability to process TS
operations concurrently within a distributed Linda implementation is unique to FT-Linda.
                          empty                       cost per body op
    Machine                AGS   in-out0   out0   out1   out2   in0   in1   in2   in3
    SparcStation 10          4       31      25     28     28     8     9    20    19
    HP Snake                 4       33      23     26     26    10    11    23    23
    SparcStation IPC        14      123      77     88     90    22    25    61    64
    i386 (Sequent)          30      300     147    176    184    91   120   270   264

    empty AGS   < true => skip >
    in-out0     < in(TS_main, "TEST", ?i) => out(TS_main, "TEST", PLUS(i,1)) >
    out0        out(TS_main, "TEST")
    out1        out(TS_main, "TEST", 1, 2, 3, 4, 5, 6, 7, 8)
    out2        out(TS_main, "TEST", a, b, c, d, e, f, g, h)
    in0         in(TS_main, "TEST")
    in1         in(TS_main, "TEST", ?int, ?int, ?int, ?int, ?int, ?int, ?int, ?int)
    in2         in(TS_main, "TEST", 1, 2, 3, 4, 5, 6, 7, 8)
    in3         in(TS_main, "TEST", ?a, ?b, ?c, ?d, ?e, ?f, ?g, ?h)

    Table 5.2: FT-Linda Operations on Various Architectures (µsec)

The above performance figures support the contention that the state machine approach
is a reasonable way to implement fault tolerance in Linda. Although a more thorough anal-
ysis is premature at this point, our speculation is that this approach will prove competitive
with transactions, a common way to achieve fault tolerance in dependable systems, and
checkpoints, the technique most widely used to achieve fault tolerance in scientific pro-
grams. None of these three fault-tolerance paradigms have been developed and deployed
fully enough with Linda to make quantitative comparisons about their strengths and weak-
nesses. Fortunately, active research is being performed in all three areas, so hopefully
in the near future we will be able to understand better the performance and tradeoffs of the
different approaches.
5.7 Optimizations
As already noted, the TS managers and GC have been only slightly optimized. However,
one optimization has provided great benefits, and two other planned optimizations would
also be beneficial.
The key observation for the first optimization is that most of the information in an
AGS request data structure does not change between different executions of the AGS.
Examples of items that do not change include the number of branches and the signature of
each operation. In fact, only two items can vary between different executions of the same
AGS by the same process. The first is the address of a stack variable used as a formal
parameter (addresses of formal variables are used in the GC at formal assignment time).
This corresponds to Case 3 in Table 5.1, if the variable in question is a stack variable
(an “automatic” variable in C and some other languages). The second is the value of a
variable used as an actual or as an opcode parameter, and then only if this variable was not
a formal earlier in the branch. This is Case 4 in Table 5.1. Thus, with this optimization,
the AGS request data structure is allocated and fully initialized in the generated code the
first time it is executed for a given process. When the same AGS is then executed later
only the values and addresses of the aforementioned variables are copied.
This optimization dramatically reduces the execution cost associated with the AGS
command. For example, the times in Table 5.2 for the SparcStation IPC were 1200 to 1500
µseconds higher (more than an order of magnitude) without this optimization.
And it is useful not only for timing tests but also for real-world applications, because
many AGS statements are in loops.
The second optimization is the network analogy of the first one. Since most of the
information in an AGS request data structure does not change between invocations, the
information that does not change only needs to be sent to its TS manager(s) once, and
stored for future use. This greatly reduces the size of the AGS request that has to be sent
over the network each time, and thus reduces the latency. It also eliminates the CPU time
to copy the unchanging information. As noted previously, this optimization has not yet
been implemented.
The third and final optimization is to make an AGS consisting solely of out, move, and
copy operations (a write-only AGS) not delay the user any more than necessary. Currently,
the user is blocked until the AGS has been executed by all pertinent TS managers. However,
a write-only AGS need only block the user until the GC has marshalled its request data
structure and submitted it to the FT-Linda library, because the user code receives no
information from the AGS's execution. Thus, it need not wait for this execution to occur.
5.8 Future Extensions
We hope to extend FT-Linda’s implementation to provide greater functionality and per-
formance in a number of ways. These are discussed in the following sections. First, the
reintegration of failed TS managers is discussed. Next, replicating TS managers on only
a subset of the hosts in an FT-Linda system is described. We conclude with a discussion
of how network partitions can be handled.
5.8.1 Reintegration of Failed Hosts
The major problem in reintegrating a failed processor upon recovery is restoring the
states of replicated TSs that were on that machine. Although there are several possible
strategies, a common one used in such situations is to obtain the data from another
functioning machine. To do this, however, requires not only copying the actual data, but
also ensuring that it is installed at exactly the correct time relative to the stream of tuple
operations coming over the network. That is, if the state of the TSs given to the recovering
processor P_i is a snapshot taken after AGS S_1 has been executed but before the next
AGS, S_2, then P_i must know to ignore S_1 but execute S_2 when they arrive.

[Figure 5.8: Non-Full Replication. A compute server (generated code and FT-Linda
library) forwards AGS requests to a replicated request handler on a tuple server, which
hosts the TS state machine and Consul; numbered arrows show the flow of a request from
the compute server to the tuple server and back.]
Fortunately, Consul’s membership service provides exactly the functionality required
[MPS93a]. When a processor P_i recovers, a restart message is multicast to the other
processors, which then execute a protocol to add P_i back into the group. The key property
enforced by the protocol is that all processors, including P_i, add P_i to the group at
exactly the same point in the total order of multicast messages. This point could easily
be passed up to the TS state machine and used as the point to take a snapshot of all
replicated TSs to pass to P_i. Note that, to reintegrate a TS state machine, the hash tables
for both tuples and blocked requests would need to be transferred. This general scheme
for reintegrating failed hosts could also be used to incorporate new tuple servers into the
system during execution.
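A sketch of how this might look at a TS state machine is given below. The function names, the snapshot type, and the recovering/join_point bookkeeping are all hypothetical, since the details depend on how Consul's membership notification is surfaced to the state machine.

    /* Hypothetical sketch of reintegration at a TS state machine. */
    typedef struct snapshot snapshot_t;   /* replicated TS contents          */
    typedef struct request  request_t;    /* AGS request (see Appendix H)    */

    snapshot_t *snapshot_replicated_tuple_spaces(void);
    void        send_snapshot(int joiner, long join_point, snapshot_t *snap);
    void        execute_replicated_ops(request_t *req);

    static int  recovering;               /* true on the rejoining host      */
    static long join_point;               /* position of the join in the     */
                                          /* total order of multicasts       */

    /* Every surviving replica snapshots its replicated TSs (both the tuple
     * and blocked hash tables) at the agreed join point and sends the
     * snapshot to the rejoining host. */
    void on_member_join(int joiner, long join_point_in_total_order)
    {
        snapshot_t *snap = snapshot_replicated_tuple_spaces();
        send_snapshot(joiner, join_point_in_total_order, snap);
    }

    /* The rejoining host ignores any AGS ordered before the join point,
     * since its effects are already reflected in the snapshot. */
    void on_ags_delivered(long order_position, request_t *req)
    {
        if (recovering && order_position <= join_point)
            return;
        execute_replicated_ops(req);
    }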
5.8.2 Non-Full Replication
The FT-Linda implementation currently keeps copies of all replicated TSs on all processors
involved in the computation. Using all processors is, however, unnecessary. We can
designate a small number to be tuple servers, and use only these processors to manage
replicated TSs. Each tuple server would maintain copies of all replicated TSs and would
either be well-known or ascertainable from a name server [CP89]. User processes would
execute on separate compute servers.
An organization along these lines would necessitate some changes in the way user
processes interact with the rest of the FT-Linda runtime system. For example, Figure 5.8
demonstrates the differences in the processing of an AGS. Rather than requests being
submitted to Consul directly from the FT-Linda library, a remote procedure call (RPC)
[Nel81, BN84] would be used to forward the request to a request handler process on a
tuple server. This handler immediately submits it to Consul’s multicast service as before.
Later, after the AGS has been processed by the TS state machines, the request handler
on the tuple server that originally received the request sends the request back to the user
process.
Failure of a compute server causes no additional problems beyond those present in
complete replication, but the same is not true for tuple servers. In particular, if such
a failure occurs, then any user process waiting for a reply from that processor could
potentially block forever. To solve this problem, the user process would time out on
its reply port and resubmit the request to another tuple server. To prevent the request
from being inadvertently processed multiple times, the user process would attach a unique
identifier to the request, and the local TS logic and the TS state machines only process
one request with a given identifier.
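The at-most-once filtering could be sketched as follows, assuming (hypothetically) that each client process stamps its requests with a per-client, monotonically increasing sequence number.

    /* Hypothetical sketch: a request resubmitted to a different tuple
     * server after a timeout is executed at most once, because each TS
     * manager remembers the newest sequence number processed per client. */
    #define MAX_CLIENTS 256                /* illustrative limit */

    static long last_seq[MAX_CLIENTS];     /* last sequence processed per client */

    int process_request_once(int client, long seq, void *req)
    {
        if (seq <= last_seq[client])
            return 0;                      /* duplicate resubmission: ignore */
        last_seq[client] = seq;
        /* ... normal AGS processing of req ... */
        (void) req;
        return 1;
    }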
Note that tuple servers in this scenario are very much analogous to the file servers
found in a typical workstation environment. Indeed, the way in which they would be
used would likely be similar as well, with a few tuple servers and many compute servers.
Of course, the tuple servers could have faster CPUs and more memory than the compute
servers, as file servers typically do when compared to the client machines they serve.
This relatively small degree of replication for stable TSs that non-full replication
allows should be sufficient for many fault-tolerant applications. It should also scale much
better than full replication.
5.8.3 Network Partitions
A network partition occurs when hosts on a network that can normally communicate with
each other cannot do so, typically due to the failure of a communication link or of a
network gateway. The typical way to handle such a situation is to allow only those hosts
that are in a majority partition to continue [ASC85]. This ensures that there will not be
multiple, divergent versions of the data, because at most one partition contains a majority
of the hosts. Hosts that are in the minority must wait until they are in the majority, update
their copy of the replicated data, and then proceed.
The current implementation of Consul does not handle network partitions. However,
it would be very easy to extend Consul to do so, as described in [MPS93a]. This change
would be completely transparent to the FT-Linda implementation, assuming the scheme for
reintegrating hosts discussed above had been implemented. Consul's membership protocol
on a given host already keeps track of which hosts it can currently communicate with.
Thus, it knows the size and membership of its partition. Additionally, this information
is kept consistent with all other members in its partition. It is thus simple to extend this
membership protocol to check whether or not its host is in the majority partition. If it is,
it proceeds as normal. If not, it simulates a crash and then reintegrates upon rejoining the
majority partition.
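The majority test itself is trivial; a minimal sketch (with a hypothetical function name) is:

    /* A host continues operating only if its current partition contains a
     * strict majority of the configured hosts; at most one partition can
     * satisfy this, so replicated TSs cannot diverge. */
    int in_majority_partition(int hosts_in_my_partition, int total_hosts)
    {
        return 2 * hosts_in_my_partition > total_hosts;
    }

A host for which this test fails would simulate a crash and, as described above, reintegrate once it rejoins the majority partition.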
5.9 Summary
This chapter describes the details of the implementation of FT-Linda. An overview is
given of the major components, followed by a description of the major data structures.
The FT-LCC precompiler is then presented, followed by the details regarding
the processing of an atomic guarded statement. The restrictions on the AGS are described
next. Initial performance results and optimizations are then given, followed by a discussion
of future planned extensions to the implementation.
The two major categories of data structures in FT-Linda’s implementation are for
request messages sent to the tuple space managers and for tuple spaces. The request
messages specify operations that the tuple space managers perform, including creating
and destroying tuple spaces and also the atomic guarded statement. These messages also
include space for the values of actuals and formals to be recorded. A tuple space is simply
two hash tables, one for tuples and the other for blocked atomic guarded statements.
The FT-LCC precompiler generates C code that includes generated code (GC) to
marshall an AGS request data structure and pass this request to the FT-Linda library. It
is more complex than its LCC (Linda) predecessor because it must deal with the atomic
guarded statement’s atomic properties, and some of the values involved in an atomic
guarded statement are not known until the request is processed at the replicated tuple
space managers.
The processing of an AGS takes place at a number of stages. It is marshalled by the
GC, then operated on by the local TS manager. After this, it is multicast to all replicated
TS managers, where it is executed at the same logical time. The AGS is then processed
further at the local TS manager, and then the AGS returns control to the next statement
in the program. The actual logic to process an atomic guarded statement at a given tuple
space manager is virtually identical for local and replicated tuple spaces.
The atomic guarded statement has restrictions involving dataflow, blocking in the
body, function calls, and expressions; it also allows no conditional execution inside the body.
The dataflow restrictions are necessary to allow efficient processing and multicasting of
the atomic guarded statement. An operation in the body of an atomic guarded statement is
not allowed to block for similar reasons. Function calls and expressions are not permitted
in an atomic guarded statement because they would allow an arbitrary computation inside
an atomic guarded statement. Also, the functions and expressions would have to have
limitations on them that would be hard to explain cleanly and to justify why more general
forms are not permitted. Finally, we note that another Linda implementation and a similar
language, Orca, place restrictions similar to many of those that FT-Linda mandates.
Initial performance results are given in this chapter. The costs in processing tuple
space operations are comparable to another Linda implementation. This is encouraging,
because FT-Linda is largely unoptimized and it also has augmented the functionality of
Linda.
The implementation has been optimized some, and future optimizations are possible.
The information in an atomic guarded statement that does not change is only marshalled
once, and a planned optimization will also only transmit this unchanging information over
the network once. Also, write-only atomic guarded statements can be executed in the
background while the user’s code proceeds.
There are three ways in which we plan to extend the FT-Linda implementation. Support
for reintegration of failed or new hosts can be provided directly by Consul's membership
service. Non-full replication can be achieved by having tuple servers on a subset of
the machines. Finally, Consul can be extended in proven ways to allow replicated TS
managers that can still communicate with the majority of their peers to continue operating
in the face of network partitions.
CHAPTER 6
CONCLUSIONS
6.1 Summary
In this dissertation we have addressed the problem of providing high-level language
support for fault-tolerant parallel programming. We have created a version of Linda,
which we call FT-Linda, to permit the construction of fault-tolerant parallel programs.
The distinguishing features of FT-Linda are its stable tuple spaces and atomic execution
of multiple tuple space operations.
We surveyed Linda in Chapter 2. Linda has semantic deficiencies even in the absence
of failures, including weak inp/rdp semantics and asynchronous outs. Linda applications
also have problems in the presence of failures, most notably because of Linda's lack of tuple space
stability and its single-operation atomicity. Common Linda paradigms such as the distributed
variable and bag-of-tasks paradigms also have problems in the presence of failures.
We concluded the chapter by outlining alternative ways of implementing tuple
space stability and multi-operation atomicity.
In Chapter 3 we presented FT-Linda. It provides mechanisms for creating multiple
tuple spaces with different attributes. These attributes are resilience, which specifies a
tuple space’s failure behavior, and scope, which designates which processes may access
a tuple space. FT-Linda has two provisions for atomic execution. The first is the
atomic guarded statement (AGS), which allows a sequence of tuple space operations to be
executed in an all-or-nothing fashion despite failures and concurrency. The AGS also has
a disjunctive form that allows the programmer to specify multiple sequences of tuple space
operations from which zero or one sequence is selected for atomic execution. The second
provision for atomic execution is FT-Linda’s tuple transfer primitives, which allow tuples
to be moved or copied atomically between tuple spaces. FT-Linda features improved
semantics: strong inp/rdp semantics, oldest matching semantics, and the sequential
ordering property. In this chapter we also surveyed other efforts to allow Linda programs to
tolerate failures. Finally, we discussed possible extensions to FT-Linda; these
include additional tuple space attributes, nested AGSs, notification of which sequence of
tuple space operations was executed (if a disjunctive AGS is used), tuple space clocks,
tuple space partitions, guard expressions, and allowing the creation and destruction of
tuple spaces to be included in AGSs.
In Chapter 4 we demonstrated FT-Linda’s usefulness in constructing highly depend-
able systems and parallel applications. We gave three examples of dependable systems
constructed with FT-Linda: a replicated server, a recoverable server, and a transaction
facility. The replicated server features a server with no fail-over time (there is no
recovery phase to delay a client of a failed server) and uses a distributed variable tuple to
maintain replica consistency. The recoverable server performs no redundant computations
in the absence of failures, and saves its private state in a stable tuple space for recovery
purposes. The transaction facility is a library of user-level procedures implemented with
FT-Linda. This shows how AGSs can be used to construct a higher level abstraction. In
particular, it demonstrates the power of FT-Linda’s tuple transfer primitives.
Chapter 4 also presented two examples of using FT-Linda to construct fault-tolerant
parallel applications: a divide-and-conquer worker and a fault-tolerant barrier. The divide-
and-conquer worker is a generalization of the bag-of-tasks example from Chapter 3. The
fault-tolerant barrier solution also applies to another class of algorithms, namely systolic-like
algorithms. We concluded Chapter 4 with a discussion of how to tolerate the failure of a
main process.
We discussed the implementation and performance of FT-Linda in Chapter 5. The
major data structures are for the AGS and the tuple spaces. The FT-LCC precompiler
is a derivative of the Linda LCC precompiler; its main differences from LCC are due to
the atomic nature in which multiple tuple space operations are combined in the AGS. An
AGS is executed by marshalling the AGS data structure, sending it to the various tuple
space managers for processing, and then unmarshalling the data structure (mainly copying
the values of formal variables into their respective program variables). The AGS has restrictions
like those found in related parallel languages. The cost of processing an FT-Linda
operation at the tuple space managers is competitive with that of other Linda implementations.
Optimizations to FT-Linda’s implementation include marshalling and transmitting only
once the unchanging information in an atomic guarded statement. Future extensions to
FT-Linda include reintegration of failed hosts, non-full replication, and tolerating some
network partitions.
6.2 Future Work
This work can be expanded in many different directions. These include completing
and extending FT-Linda’s implementation, porting it to other environments, extending
the language, investigating FT-Linda’s use as a back-end language, and considering a
real-time variant of FT-Linda.
One major area for future work is to complete and extend the implementation. The first item
is to complete the integration with Consul. After that, a port of FT-Linda to Isis would be
interesting. The extensions to the implementation discussed in Section 5.8 (reintegration
of failed hosts, allowing non-full replication, and tolerating network partitions) should
be completed. Finally, the implementation could be optimized further. One optimization
would be to transmit the unchanging information in an AGS request message over the
network only once, as discussed in Section 5.7. Another optimization to investigate is whether it is beneficial
to use a semantic-dependent ordering, rather than the more restrictive total ordering, when
delivering AGS requests to the replicated TS managers. For example, AGSs consisting
entirely of rds commute with one another, and thus a collection of such read-only AGSs could
be executed in different orders at different replicas. Unfortunately, a semantic-dependent
ordering can have higher latency than a total ordering if only a few of the operations
commute. It is an open question whether enough Linda applications would
contain a sufficient number of read-only AGSs to warrant using a semantic-dependent
ordering. However, note that this choice could be made by the user when starting the
FT-Linda program.
The FT-Linda language will continue to mature. One possibility in this area is to
accommodate many or all of the possible extensions listed in Section 3.5. Another is to
support more opcodes. A final possibility is to remove or lessen some of the restrictions,
most notably to allow expressions in an AGS.
FT-Linda also seems a good candidate as a back-end language for indirect usage. One
example of this was given in the general transaction facility in Section 4.1.3. Another
possibility is to build on Linda tools or other additions to Linda. The Linda Program
Builder (LPB) is a high-level Computer Aided Software Engineering (CASE) tool that is
used to make programming in Linda simpler. For example, it presents the programmer with
menu choices for common Linda paradigms such as the bag-of-tasks and the distributed
variable paradigms. In doing so, it ensures that the programmer supplies information about
the actual usage of each tuple signature. This allows a higher level of optimization than
static analysis of the code alone would permit. It would be instructive to explore FT-Linda's
usage as a back-end to the LPB; indeed, the programmer might not even need to know
about FT-Linda, only Linda. Another interesting Linda project is Piranha, which allows
idle workstations to be used transparently. FT-Linda's integration with Piranha would be
of great interest to many, especially if it were also integrated with a CASE tool such as the
LPB. Another back-end usage of FT-Linda is with finite state automata (FSAs), which
are very useful for specifying features in telecommunications systems [Zav93]. FT-Linda
could be used to implement FSAs. For example, events and states could be represented
by tuples, and a state change (the consumption and production of event and state tuples)
could be represented by a single AGS.
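As a rough sketch of this idea (the tuple names STATE and EVENT, the states S_IDLE and S_RUNNING, the events EV_START and EV_STOP, and the assumption of a single driver thread per machine are all hypothetical illustrations, not part of FT-Linda itself), one transition step of such a machine might look like the following, using the AGS notation of the appendices:

    /* Hypothetical FSA driver step for machine fsa_id.  We assume exactly
     * one driver thread per machine, so the STATE tuple read below cannot
     * be changed by another process before the AGS that follows executes;
     * the in() of the STATE tuple in an AGS body therefore cannot block. */
    int cur;

    < rd(TSmain, STATE, fsa_id, ?cur) => skip >      /* current state */

    if (cur == S_IDLE) {
        /* consume one pending start event and change state, atomically */
        < in(TSmain, EVENT, fsa_id, EV_START) =>
            in(TSmain, STATE, fsa_id, cur);
            out(TSmain, STATE, fsa_id, S_RUNNING);
        >
    } else if (cur == S_RUNNING) {
        < in(TSmain, EVENT, fsa_id, EV_STOP) =>
            in(TSmain, STATE, fsa_id, cur);
            out(TSmain, STATE, fsa_id, S_IDLE);
        >
    }

A full implementation would presumably use a disjunctive AGS listing every event applicable in the current state, so that whichever applicable event arrives first is the one consumed.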
Finally, there is a great need for more high-level language support for fault-tolerant,
real-time programming. If Linda's simple tuple space model could be used for this it
would be both edifying and useful. Real-time systems need predictability, which can be
achieved in part by ensuring that all timing constraints are met. There are a number of possible
ways that FT-Linda could be extended to facilitate this, a thorough discussion of which
is beyond the scope of this dissertation. However, any such extensions would have to
answer at least the following questions:

- Will hard real time, soft real time, or both be supported?

- Will the predictability be achieved by priorities, deadlines, or both?

- Are these priorities or deadlines to be associated with a TS, a TS operation, or an
entire AGS? Further, if they are associated with a TS or a TS operation, what is the
meaning of the deadline/priority in the face of an AGS that has multiple priorities
represented in the TSs involved with it?
Further, it would be interesting if facilities allowing the programmer to deal with a timing
failure could be provided in a fashion similar to how FT-Linda provides failure notifications
upon the failure of a host. Additionally, the state machine approach used to implement
FT-Linda seems well suited to providing real-time support in addition to fault tolerance.
Indeed, one such project is already underway to provide real-time communication support
for replicated state machines in the Corto project [AGMvR93], a real-time successor to
Isis. This could serve as an excellent basis for a real-time version of FT-Linda.
APPENDIX A
FT-LINDA IMPLEMENTATION NOTES
The C FT-Linda code in the appendices differs from the pseudocode given in previous
chapters in a number of ways.
First, the code to parse array subscripts in an AGS has not yet been implemented. This
is only encountered in Appendix G. The workaround used in the following appendices
is to use a non-subscripted variable in the AGS and to copy to and from it. This variable
can be an array, a structure, or anything that is not subscripted and for which the C sizeof()
facility gives the true size of the data in question (i.e., not pointers, either directly or
in structures).
Second, inp and rdp have not yet been implemented in expressions. When used this way,
these operations are boolean guards and the entire AGS is a boolean expression. The
workaround is to set a variable to a sentinel value that cannot occur in a tuple and then
use this variable as a formal in an AGS. The value of this variable can be compared to
the sentinel value after the AGS to ascertain whether or not the AGS guard and body were
executed.
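As a minimal sketch of this workaround (the tuple name FOO and the sentinel value -1 are purely illustrative; any value that cannot appear in that field of the tuple would do):

    int val = -1;                        /* sentinel: cannot occur in a FOO tuple */

    < inp(TSmain, FOO, ?val) => skip >   /* boolean guard used outside an expression */

    if (val != -1) {
        /* the inp matched, so the guard (and body) of the AGS was executed
           and val now holds the field withdrawn from the tuple */
    }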
Third, the implementation provides a hook, ftl_fail_host(), to simulate the
failure of a host.
Fourth, the logical name of a tuple is not a character string like "name" but rather a
void type, which is stylistically represented in uppercase (e.g., NAME). This void type is
declared with the newtype operator, which is the C-Linda analog of the C typedef
operator. For further information, see [Lei89].
APPENDIX B
FT-LINDA REPLICATED SERVER
/*
 * Replicated server example.  There is no monitor process since the failure
 * of a server does not need to be cleaned up after.
 */
#ttcontext replicated_server
#include <stdio.h>
#include "ftlinda.h"

#define THIS_SERVICE    1
#define NUM_CLIENTS     5
#define CMD1            1                       /* SQR */
#define CMD2            2                       /* SUM */
#define CLIENT_LOOPS    2
#define SQR_ANS(x)      ((x) * (x))
#define SUM_ANS(a,b,c)  ((a) + (b) + (c))

newtype void SERVER_TIME;
newtype void REQUEST;
newtype void REPLY;
newtype void SQR_CMD;                           /* CMD1 */
newtype void SUM_CMD;                           /* CMD2 */

void server(void);
void client(int);

LindaMain (argc, argv)
    int argc;
    char *argv[];
{
    int host, i, lpid, num_hosts = ftl_num_hosts();

    /* initialize the sequence tuple */
    < true =>
        out(TSmain, SERVER_TIME, THIS_SERVICE, (int) 0);
    >

    /* Create one server replica on each host */
    for (host=0; host < num_hosts; host++) {
        lpid = new_lpid();
        ftl_create_user_thread(server, "server", host, lpid, 0, 0, 0, 0);
    }

    /* create some clients */
    for (i=0; i < NUM_CLIENTS; i++) {
        lpid = new_lpid();
        host = i % num_hosts;
        ftl_create_user_thread(client, "client", host, lpid, i, 0, 0, 0);
    }

    /* The LindaMain thread goes away here, but the program won't be
       finished until all living clients are through */
}

/*
 * The client invokes both services CLIENT_LOOPS times.  It also
 * tests the answers it gets, something of course a real client
 * generally would not (and often could not) do.
 */
void
client(int client_id)
{
    int time, i, x, a, b, c, answer;

    printf("Client %d on host %d here\n", client_id, ftl_my_host() );

    for (i=0; i < CLIENT_LOOPS; i++) {

        /* invoke the first command */
        x = i + 10;
        < in(TSmain, SERVER_TIME, THIS_SERVICE, ?time) =>
            out(TSmain, SERVER_TIME, THIS_SERVICE, PLUS(time,1) );
            out(TSmain, REQUEST, THIS_SERVICE, time, SQR_CMD, CMD1, x);
        >

        /* wait for the first reply to this command */
        < in(TSmain, REPLY, THIS_SERVICE, time, ?answer) =>
            skip
        >
        if (answer != SQR_ANS(x))
            ftl_exit("Client got bad sqr answer.", 1);

        /* invoke the second command */
        a = i*100;  b = i*200;  c = i*300;
        < in(TSmain, SERVER_TIME, THIS_SERVICE, ?time) =>
            out(TSmain, SERVER_TIME, THIS_SERVICE, PLUS(time,1) );
            out(TSmain, REQUEST, THIS_SERVICE, time, SUM_CMD, CMD2, a, b, c);
        >

        /* wait for the first reply to this command */
        < in(TSmain, REPLY, THIS_SERVICE, time, ?answer) =>
            skip
        >
        if (answer != SUM_ANS(a, b, c))
            ftl_exit("Client got bad answer.", 1);
    }
    printf("Client %d on host %d done\n", client_id, ftl_my_host() );
}

/*
 * The server implements two different commands, both of which return
 * a simple answer.
 */
void
server()
{
    int time, x, a, b, c, answer, cmd;

    /* loop forever over all times */
    for (time=0; ; time++) {

        /* read the next request tuple */
        < rd(TSmain, REQUEST, THIS_SERVICE, time, SQR_CMD, ?cmd, ?x) =>
            skip
        or
          rd(TSmain, REQUEST, THIS_SERVICE, time, SUM_CMD, ?cmd, ?a, ?b, ?c) =>
            skip
        >

        /* compute the answer for the request */
        switch (cmd) {
        case CMD1:
            answer = SQR_ANS(x);
            break;
        case CMD2:
            answer = SUM_ANS(a,b,c);
            break;
        default:
            ftl_exit("Server error", 1);
            /* NOTREACHED */
        }

        /* send the reply */
        < true =>
            out(TSmain, REPLY, THIS_SERVICE, time, answer);
        >
    }
    /* NOTREACHED */
}
APPENDIX C
FT-LINDA RECOVERABLE SERVER EXAMPLE
/
Recoverable server example. This shows how to implement two commands,
instead of the one shown earlier in the paper.
/
#ttcontext recoverable server
#include
<
stdio.h
>
#include "ftlinda.h"
#include "assert.h"
/
Anything that can be an actual in a tuple had better be cast to a type
so the signatures are guaranteed to match
/
#define MY SERVICE (int) 1
#define NUM CLIENTS (int) 5
#define CMD1 (int) 1
/
SQR
/
#define CMD2 (int) 2
/
SUM
/
#define CLIENT LOOPS 2
#define SQR ANS(x) ((x)
(x))
#define SUM ANS(a,b,c) ((a) + (b) + (c))
#define INITIAL SERVER HOST (int) 1
#define ILLEGAL HOST
?
1
#define ILLEGAL SQR
?
1
#define INIT SUM (int) 0
#define INIT SQR (int) 0
newtype void REQUEST;
newtype void IN PROGRESS;
newtype void REPLY;
newtype void SERVER REGISTRY;
newtype void SERVER STATE;
newtype void SERVER HANDLE;
newtype void SQR CMD;
/
CMD1
/
newtype void SUM CMD;
/
CMD2
/
newtype void REINCARNATE;
void server(void);
void client(int);
void monitor(int);
121
122
LindaMain (argc, argv)
int argc;
char
argv [];
f
int f id, host, i, server lpid, lpid, num hosts = ftl num hosts();
ts handle t server handle;
/
For now we use a
f
Stable,Shared
g
TS for the server, instead
of a
f
Stable,Private
g
one as one would normally do, since
passing an LPID to the create primitive isn’t implemented yet.
/
server lpid = new lpid();
create TS(Stable, Shared, &server handle);
/
Initialize the TS handle and registry for the server.
We will not initialize the state, since if server handle were
Private, as our solution would normally have, we could not do this.
/
<
true =
>
out(TSmain, SERVER HANDLE, MY SERVICE, server handle);
out(TSmain, SERVER REGISTRY, MY SERVICE, server lpid, INITIAL SERVER HOST);
>
/
create a monitor process on each host
/
for (host=0; host
<
num hosts; host++)
f
lpid = new lpid();
fid = new failure id();
ftl create user thread(monitor, "monitor", host, lpid, f id, 0, 0, 0);
g
/
Create one server on host INITIAL SERVER HOST
/
assert(ftl num hosts()
INITIAL SERVER HOST);
lpid = new lpid();
ftl create user thread(server, "server", INITIAL SERVER HOST, lpid, 0, 0, 0, 0);
/
create some clients
/
for (i=0; i
<
NUM CLIENTS; i++)
f
lpid = new lpid();
host = i % num hosts;
ftl create user thread(client, "client", host, lpid, i, 0, 0, 0);
g
/
The LindaMain thread goes away here, but the program won’t be
finished until all living clients are through
/
g
123
/
The client invokes both services CLIENT LOOPS times. It also
tests the answers it gets, something of course a real client
generally would not (and often could not) do.
/
void
client (int client id)
f
int i, x, a, b, c, answer, my lpid = ftl my lpid();
printf("Client %d on host %d here\n", client id, ftl my host() );
for (i=1; i
CLIENT LOOPS; i++)
f
/
invoke the first command
/
x=i+10;
<
true =
>
out(TSmain, REQUEST, MY SERVICE, my lpid, SQR CMD, CMD1, x);
>
/
wait for the first reply to this command
/
<
in(TSmain, REPLY, MY SERVICE, my lpid, ?answer) =
>
skip
>
if (answer
6
=
SQR ANS(x))
ftl exit("Client got bad sqr answer.", 1);
/
invoke the second command
/
a=i
100; b=i
200; c=i
300;
<
true =
>
out(TSmain, REQUEST, MY SERVICE, my lpid, SUM CMD, CMD2, a, b, c);
>
/
wait for the first reply to this command
/
<
in(TSmain, REPLY, MY SERVICE, my lpid, ?answer) =
>
skip
>
if (answer
6
=
SUM ANS(a, b, c))
ftl exit("Client got bad answer.", 1);
g
printf("Client %d on host %d done\n", client id, ftl my host() );
g
124
/
The server implements two different commands, both of which return
a simple answer.
/
void
server()
f
int x, a, b, c, answer, cmd, best sqr, best sum, client lpid;
ts handle t my ts;
printf("Server here on host %d\n", ftl my host() );
/
read in server’s TS handle
/
<
rd(TSmain, SERVER HANDLE, MY SERVICE, ?my ts) =
>
skip
>
/
read in server’s state. Note that we would normally initialize
it like this:
if (
<
not rdp(my ts, SERVER STATE, ... ) =
>
out(my ts, SERVER STATE, ... initial values ...);
>
However, we cannot do this since the AGS has not yet been implemented
as part of an expression. Thus, we will simulate this.
/
best sqr = ILLEGAL SQR;
<
rdp(my ts, SERVER STATE, MY SERVICE, ?best sqr, ?best sum) =
>
skip
>
if (best sqr == ILLEGAL SQR)
f
/
The rdp did not find a state tuple, so we will create one.
Note that having these two AGSs would not work in general,
i.e. it won’t have the same semantics viz. failures and concurrency
as the AGS expression we would normally implement it with.
We know here that it works, however, since there can be only this
server executing MY SERVICE at once.
/
best sqr = INIT SQR;
best sum = INIT SUM;
<
true =
>
out(my ts, SERVER STATE, MY SERVICE, best sqr, best sum);
>
g
125
/
loop forever
/
for (;;)
f
/
read the next request tuple
/
<
in(TSmain, REQUEST, MY SERVICE, ?client lpid, SQR CMD, ?cmd, ?x) =
>
out(TSmain, IN PROGRESS, MY SERVICE, client lpid, SQR CMD, cmd, x);
orin(TSmain, REQUEST, MY SERVICE, ?client lpid, SUM CMD, ?cmd, ?a, ?b, ?c) =
>
out(TSmain, IN PROGRESS, MY SERVICE, client lpid, SUM CMD, cmd, a, b, c);
>
/
compute the answer for the request
/
switch(cmd)
f
case CMD1:
answer = SQR ANS(x);
best sqr = (answer
>
best sqr ? answer : best sqr);
break;
case CMD2:
answer = SUM ANS(a,b,c);
best sum = (answer
>
best sum ? answer : best sum);
break;
default:
ftl exit("Server error", 1);
g
/
send the reply and update state
/
switch(cmd)
f
case CMD1:
<
in(TSmain, IN PROGRESS, MY SERVICE, ?int, SQR CMD, cmd, ?int) =
>
out(TSmain, REPLY, MY SERVICE, client lpid, answer);
in(my ts, SERVER STATE, MY SERVICE, ?int, ?int);
out(my ts, SERVER STATE, MY SERVICE, best sqr, best sum);
>
break;
case CMD2:
<
in(TSmain, IN PROGRESS, MY SERVICE, ?int, SUM CMD, cmd, ?int, ?int, ?int) =
>
out(TSmain, REPLY, MY SERVICE, client lpid, answer);
in(my ts, SERVER STATE, MY SERVICE, ?int, ?int);
out(my ts, SERVER STATE, MY SERVICE, best sqr, best sum);
>
break;
default:
ftl exit("Server error", 1);
g
g
/
NOTREACHED
/
g
126
void
monitor(int failure id)
f
int failed host, host, server lpid, client lpid, my host = ftl my host(), reincarnate, x, a, b, c, cmd;
ts handle t scratch ts;
create TS(Volatile, Private, &scratch ts);
for (;;)
f
<
in(TSmain, FAILURE, failure id, ?failed host) =
>
skip
>
/
Wait for a failure
/
/
See if server MY SERVICE was executing on the failed host.
Again, we simulate an AGS in an expression here.
/
host = ILLEGAL HOST;
<
rdp(TSmain, SERVER REGISTRY, MY SERVICE, ?server lpid, ?host) =
>
skip
>
if (host == failed host)
f
/
Service MY SERVICE, which we are monitoring, has failed.
Regenerate any request tuples found for MY SERVICE. Note
that since there is only one server replica there can be
at most one IN PROGRESS tuple.
/
<
inp(TSmain, IN PROGRESS, MY SERVICE, ?client lpid, SQR CMD, ?cmd, ?x) =
>
out(TSmain, REQUEST, MY SERVICE, client lpid, SQR CMD, cmd, x);
or
inp(TSmain, IN PROGRESS, MY SERVICE, ?client lpid, SUM CMD, ?cmd, ?a, ?b, ?c) =
>
out(TSmain, REQUEST, MY SERVICE, client lpid, SUM CMD, cmd, a, b, c);
>
/
Attempt to start a new incarnation of the failed server. Again,
we simulate the effect of an AGS expression with the REINCARNATE tuple.
/
<
inp(TSmain, SERVER REGISTRY, MY SERVICE, server lpid, failed host) =
>
out(TSmain, SERVER REGISTRY, MY SERVICE, server lpid, my host);
out(scratch ts, REINCARNATE, (int) 1);
>
/
See if we did change the registry; in this case create a new
server on this host.
/
reincarnate = 0;
<
inp(scratch ts, REINCARNATE, ?reincarnate) =
>
skip
>
if (reincarnate)
ftl create user thread(server, "server", my host, server lpid, 0, 0, 0, 0);
g
g
g
APPENDIX D
FT-LINDA GENERAL TRANSACTION MANAGER EXAMPLE
D.1 Specification
/
FT-Linda specification for a general transaction manager. It does not
generally verify that variable IDs are correct or do other error checking.
/
newtype int val t;
/
values for variables used in a transaction
/
newtype int var t;
/
variable handles
/
newtype int tid t;
/
transaction IDs
/
#define ILLEGAL VAR
?
1
/
init transaction mgr() must be called exactly once before any transaction
routine below is used
/
void init transaction mgr(void);
/
create var creates a variable with ID var and an initial value of val.
/
void create var(var t var, val t val);
/
destroy variable var
/
void destroy var(var t var);
/
start transaction begins a transaction involving the num vars variables
in var list. It returns the transaction ID for this transaction.
/
tid t start transaction(var t var list[], int num vars);
/
modify var modifies var to have the value new val. This is assumed
to be called after a transaction was started with this variable.
/
void modify var(tid t tid, var t var, val t new val);
/
abort aborts transaction tid
/
void abort(tid);
/
commit commits transaction tid
/
void commit(tid);
/
print out the varaibles and their values, in one atomic snapshot
/
void print variables(char
msg);
127
128
D.2 Manager
#include
<
malloc.h
>
newtype void TIDS;
newtype void LOCK;
newtype void LOCK INUSE;
newtype void VAR;
newtype void VAR INUSE;
newtype void TS CUR;
newtype void TS ORIG;
/
For each transaction we keep two scratch TSs: cur ts keeps
the current values of the variables, and orig ts keeps the
original values of the variables plus VAR and LOCK tuples.
get cur and get orig are utility routines that fetch the
handles for these two TSs for a transaction.
/
static void get cur(tid t, ts handle t
);
static void get orig(tid t, ts handle t
);
static void monitor transactions(int failure id);
/
init transaction mgr() must be called exactly once before any transaction
routine below is used
/
void init transaction mgr()
f
int lpid, f id;
/
create one monitor process on each host
/
for (host=0; host
<
ftl num hosts(); host++)
f
lpid = new lpid();
fid = new failure id();
ftl create user thread(monitor transactions, "monitor_transactions",
host, lpid, f id, 0, 0, 0);
g
<
true =
>
out(TSmain, TIDS, (tid t) 1);
>
g
129
/
create var creates a variable with ID var and an initial value of val.
No validity check is done on val.
/
void create var(var t var, val t val)
f
if (var == (var t) ILLEGAL VAR)
ftl exit("create_var has a conflict with ILLEGAL_VAR", 1);
<
true =
>
out(TSmain, VAR, var, val);
out(TSmain, LOCK, var);
>
g
/
destroy variable var
/
void destroy var(var t var)
f
/
this blocks until any current transaction with var completes
/
<
in(TSmain, LOCK, var) =
>
in(TSmain, VAR, var, ?val t);
>
g
130
/
start transaction begins a transaction involving the num vars variables in var list.
It returns the transaction ID for this transaction or ILLEGAL TID if the transaction
could not be started (either because there were too many outstanding transactions
or because a variable in var list[]).
/
tid t
start transaction(var t var list[], int num vars)
f
tid t tid; var t
vars, var; val t val;
int i, my host = ftl my host();
static int var compare(var t
i, var t
j);
ts handle t cur ts, orig ts;
char buf[100];
<
in(TSmain, TIDS, ?tid) =
>
out(TSmain, TIDS, PLUS(tid,1) );
>
/
Create scratch TSs to keep a pristine copy of the variables involved
in this transactionas well as their current uncommittted values
/
create TS(Volatile, Private, &cur ts); create TS(Volatile, Private, &orig ts);
<
true =
>
out(TSmain, TS CUR, tid, cur ts); out(TSmain, TS ORIG, tid, orig ts);
>
/
Create a safe copy of var list and then sort it
/
vars = (var t
) calloc(num vars, sizeof(var t) );
assert(vars
6
=
(var t) NULL);
for (i=0; i
<
num vars; i++)
vars[i] = var list[i];
qsort(vars, num vars, sizeof(var t), var compare);
/
acquire all the locks for these variables in order
/
for (i=0; i
<
num vars; i++)
f
var = vars[i];
/
Move LOCK from TSmain to orig ts and leave LOCK INUSE in TSmain
(it is used only for recovery). Similarly for VAR; also add
a copy of VAR to cur ts.
/
<
in(TSmain, LOCK, var) =
>
out(TSmain, LOCK INUSE, my host, tid, var);
out(orig ts, LOCK, var);
in(TSmain, VAR, var, ?val);
out(TSmain, VAR INUSE, my host, tid, var, val);
out(orig ts, VAR, var, val);
out(cur ts, VAR, var, val);
>
g
cfree(vars);
return tid;
g
131
/
modify var modifies var to have the value new val. This is assumed
to be called after a transaction was started with this variable.
/
void modify var(tid t tid, var t var, val t new val)
f
ts handle t cur ts;
char buf[100];
get cur(tid, &cur ts);
sprintf(buf,"cur_ts for modify var V%d val %d tid %d", var, new val, tid);
<
in(cur ts, VAR, var, ?val t) =
>
out(cur ts, VAR, var, new val);
>
g
/
abort aborts transaction tid
/
void
abort(tid t tid)
f
ts handle t cur ts, orig ts;
int my host = ftl my host();
char buf[100];
/
Regenerate LOCK and VAR from TSmain, discard their INUSE
placeholders there, and remove the scratch TS handles from TS.
/
<
true =
>
move(orig ts, TSmain, LOCK, ?var t);
move(orig ts, TSmain, VAR, ?var t, ?val t);
in(TSmain, TS CUR, tid, ?ts handle t);
in(TSmain, TS ORIG, tid, ?ts handle t);
move(TSmain, cur ts, VAR INUSE, my host, tid, ?var t, ?val t);
move(TSmain, orig ts, LOCK INUSE, my host, tid, ?var t);
>
destroy TS(cur ts); destroy TS(orig ts);
g
132
/
commit commits transaction tid
/
void
commit(tid t tid)
f
ts handle t cur ts, orig ts;
int host = ftl my host();
char buf[100];
get cur(tid, &cur ts);
get orig(tid, &orig ts);
/
Restore the LOCKs from this transaction from orig ts, and move the
current values of all variables involved in this transaction from
cur ts to TSmain. Discard the VAR INUSE and LOCK INUSE placeholders,
and remove the scratch TS handles from TS.
/
<
true =
>
move(orig ts, TSmain, LOCK, ?var t);
move(cur ts, TSmain, VAR, ?var t, ?val t);
in(TSmain, TS CUR, tid, ?ts handle t);
in(TSmain, TS ORIG, tid, ?ts handle t);
move(TSmain, cur ts, VAR INUSE, host, tid, ?var t, ?val t);
move(TSmain, cur ts, LOCK INUSE, host, tid, ?var t);
>
destroy TS(cur ts);
destroy TS(orig ts);
printf("Aborted transaction %d for LPID %d\n", tid, ftl my lpid() );
g
133
static void
monitor transactions(int failure id)
f
int failed host;
val t val;
var t var;
for (;;)
f
/
wait for a failure
/
<
in(TSmain, FAILURE, failure id, ?failed host) =
>
skip
>
/
regenerate all LOCKs and VARs we find for any transactions
on failed host.
/
do
f
var = ILLEGAL VAR;
<
inp(TSmain, LOCK INUSE, failed host, ?tid t, ?var) =
>
out(TSmain, LOCK, var);
in(TSmain, VAR INUSE, failed host, ?tid t, var, ?val)
out(TSmain, VAR, var, val);
>
g
while (var
6
=
ILLEGAL VAR);
g
g
134
/
print variables will store the values into a buffer and then print, since
it could be scheduled out at each AGS. That would almost certainly make
the output interlaced with other output, which is not very useful.
/
void
print variables(char
msg)
f
ts handle t scratch ts;
val t val; tid t tid; var t var; int host;
char buf[1000], buf2[100];
/
Big enough ...
/
create TS(Volatile, Private, &scratch ts);
/
Grab an atomic snapshot of all variables, whether in use or not.
/
<
true =
>
copy(TSmain, scratch ts, VAR, ?var t, ?val t);
copy(TSmain, scratch ts, VAR INUSE, ?int, ?tid t, ?var t, ?val t);
>
sprintf(buf, "Variables at %s\n", msg);
/
Format out that snapshot, again simulating an AGS expression.
/
do
f
var = ILLEGAL VAR;
<
inp(scratch ts, VAR, ?var, ?val) =
>
skip
>
sprintf(buf2, "\tV%d=0x%x\n", var, val);
strcat(buf, buf2);
g
while (var
6
=
ILLEGAL VAR);
do
f
var = ILLEGAL VAR;
<
inp(scratch ts, VAR INUSE, ?host, ?tid, ?var, ?val) =
>
skip
>
sprintf(buf2,"\tV%d=0x%x\t(INUSE with tid %d on host %d)\n",
var, val, tid, host);
strcat(buf, buf2);
g
while (var
6
=
ILLEGAL VAR);
destroy TS(scratch ts);
printf(buf);
g
135
static int
var compare(var t
i, var t
j)
f
return(
i
?
j);
g
static void
get cur(tid t tid, ts handle t
handle)
f
ts handle t temp;
unsigned int old = rts debug value();
<
true =
>
rd(TSmain, TS CUR, tid, ?temp);
>
handle = temp;
g
static void
get orig(tid t tid, ts handle t
handle)
f
ts handle t temp;
<
true =
>
rd(TSmain, TS ORIG, tid, ?temp);
>
handle = temp;
g
136
D.3 Sample User
/
transaction user.c is a torture test for the transaction manager.
srandom() initializes things for random() outside of this file.
/
#ttcontext transaction user
#include
<
stdio.h
>
#include "ftlinda.h"
#include "assert.h"
#include "transaction_mgr.h"
#include "transaction_mgr.c"
/
We define symbols to use as handles for the variables, as well
as their initial values
/
#define A VAR 1
#define B VAR 2
#define C VAR 3
#define D VAR 4
#define E VAR 5
#define F VAR 6
#define G VAR 7
#define H VAR 8
#define A INIT 0x100
#define B INIT 0x200
#define C INIT 0x300
#define D INIT 0x400
#define E INIT 0x500
#define F INIT 0x600
#define G INIT 0x700
#define H INIT 0x800
#define CLIENT LOOPS 10
#define ABORT ROLL (random() % 6)
#define FAIL ROLL (random() % 30)
static void client1(), client2(), client3(), client4(), client5();
static void shuffle(var t[],int);
static void maybe fail(void);
137
LindaMain (argc, argv)
int argc;
char
argv [];
f
int lpid, num hosts = ftl num hosts();
printf("LindaMain here\n");
init transaction mgr();
/
Create the variables
/
create var(A VAR, A INIT);
create var(B VAR, B INIT);
create var(C VAR, C INIT);
create var(D VAR, D INIT);
create var(E VAR, E INIT);
create var(F VAR, F INIT);
create var(G VAR, G INIT);
create var(H VAR, H INIT);
/
Create the clients
/
lpid = new lpid();
ftl create user thread(client1, "client1", 1 % num hosts, lpid, 0, 0, 0, 0);
lpid = new lpid();
ftl create user thread(client1, "client2", 2 % num hosts, lpid, 0, 0, 0, 0);
lpid = new lpid();
ftl create user thread(client1, "client3", 3 % num hosts, lpid, 0, 0, 0, 0);
lpid = new lpid();
ftl create user thread(client1, "client4", 4 % num hosts, lpid, 0, 0, 0, 0);
lpid = new lpid();
ftl create user thread(client1, "client5", 5 % num hosts, lpid, 0, 0, 0, 0);
g
138
/
The clients test the transaction manager
/
/
client1
/
#define CLIENT 1
#define CLIENT NAME client1
#define VARS USED
f
A VAR, C VAR, G VAR
g
#include "transaction_client.c"
/
defines client1
/
#undef CLIENT
#undef CLIENT NAME
#undef VARS USED
/
client2
/
#define CLIENT 2
#define CLIENT NAME client2
#define VARS USED
f
B VAR, D VAR, F VAR, G VAR
g
#include "transaction_client.c"
/
defines client2
/
#undef CLIENT
#undef CLIENT NAME
#undef VARS USED
/
client3
/
#define CLIENT 3
#define CLIENT NAME client3
#define VARS USED
f
A VAR
g
#include "transaction_client.c"
/
defines client3
/
#undef CLIENT
#undef CLIENT NAME
#undef VARS USED
/
client4
/
#define CLIENT 4
#define CLIENT NAME client4
#define VARS USED
f
C VAR, G VAR
g
#include "transaction_client.c"
/
defines client4
/
#undef CLIENT
#undef CLIENT NAME
#undef CLIENT NAME
#undef VARS USED
/
client5
/
#define CLIENT 5
#define CLIENT NAME client5
#define VARS USED
f
A VAR, B VAR, C VAR, D VAR, E VAR, F VAR, G VAR
g
#include "transaction_client.c"
/
defines client5
/
#undef CLIENT
#undef CLIENT NAME
#undef VARS USED
139
/
Shuffle the variable array
/
static void
shuffle(var t vars[], int num vars)
f
int i, slot;
var t item;
for (i=num vars
?
1; i
>
0; i
??
)
f
/
swap vars[i] with vars[slot] for some slot in [0,i)
/
slot = random() % i;
item = vars[i];
vars[i] = vars[slot];
vars[slot] = item;
g
g
static void
maybe fail()
f
int host = ftl my host();
/
see again if we should fail our host
/
if ( (host
6
=
0) && (FAIL ROLL == 0) )
f
printf("Failing host %d\n", host);
ftl fail host(host);
g
g
140
D.4 User Template (transaction client.c)
/
This file holds the template for each client; they are all
instantiated with different macros for the function name,
variables used, etc.
/
void
CLIENT NAME ()
f
static var t vars used[] = VARS USED;
#define NUM VARS USED (sizeof(vars used) / sizeof(var t))
int i, j, my host = ftl my host(), abort it;
tid t tid;
char buf[200], buf2[50];
char name[20];
sprintf(name, "client%d", CLIENT);
sprintf(buf,"%s here on host %d with %d variables: ",
name, ftl my host(), NUM VARS USED);
for (i=0; i
<
NUM VARS USED; i++)
f
sprintf(buf2, "V%d ", vars used[i]);
strcat(buf, buf2);
g
printf("%s\n", buf);
for (i=1; i
CLIENT LOOPS; i++)
f
val t vals this time[NUM VARS USED];
/
Shuffle the order of the variables used
/
shuffle(vars used, NUM VARS USED);
tid = start transaction(vars used, NUM VARS USED);
maybe fail();
/
Generate some random values for these variables
/
for (j=0; j
<
NUM VARS USED; j++)
f
vals this time[j] = (int) (random() % 0x10000);
modify var(tid, vars used[j], vals this time[j]);
g
141
maybe fail();
/
See if we should abort
/
abort it = (ABORT ROLL == 0);
if (abort it)
abort(tid);
else
commit(tid);
sprintf(buf, "after %s %s changes: ",
name, (abort it ? "aborted" :"committed") );
for (j=0; j
<
NUM VARS USED; j++)
f
sprintf(buf2, "V%d=0x%x", vars used[j], vals this time[j]);
strcat(buf, buf2);
g
print variables(buf);
g
printf("%s on %d all done\n", name, my host);
#undef NUM VARS USED
g
142
APPENDIX E
FT-LINDA BAG-OF-TASKS EXAMPLE
/*
 * Bag of Tasks example program.
 */
#ttcontext bag_of_tasks
#include <stdio.h>
#include "ftlinda.h"

newtype void SUBTASK;
newtype void RESULT;
newtype void INPROGRESS;

#define ILLEGAL_VAL   -1            /* illegal value for subtask */
#define NUM_WORKERS   10
#define NUM_SUBTASKS  20

char *progname;

static void worker(int);
static void calc(int, int *);
static void get_input(int *, int *);
static void monitor();

LindaMain (argc, argv)
    int argc;
    char *argv[];
{
    int i, val, f_id, lpid, host, num_hosts = ftl_num_hosts();

    progname = argv[0];
    printf("%s here with %d hosts in [0..%d)\n", progname, num_hosts,
           num_hosts);

    /* Create a monitor thread on each host */
    for (host=0; host < num_hosts; host++) {
        lpid = new_lpid();
        f_id = new_failure_id();
        ftl_create_user_thread(monitor, "monitor", host, lpid, f_id, 0, 0, 0);
    }

    /* Create some workers */
    for (i=0; i < NUM_WORKERS; i++) {
        lpid = new_lpid();
        ftl_create_user_thread(worker, "worker", i % num_hosts, lpid, i, 0, 0, 0);
    }

    /* Create some subtasks */
    for (i=0; i < NUM_SUBTASKS; i++) {
        < true =>
            out(TSmain, SUBTASK, i, i);
        >
    }

    /* Wait for those subtasks to be completed. */
    for (i=0; i < NUM_SUBTASKS; i++) {
        < in(TSmain, RESULT, i, ?val) =>
            skip
        >
    }
    printf("%s done\n", progname);
    ftl_exit(NULL, 0);              /* exit normally */
}

static void
monitor(failure_id)
    int failure_id;
{
    int num, val, failed_host, id;

    while (1) {
        failed_host = -1;           /* sanity check */

        /* wait for a failure like a vulture */
        < in(TSmain, FAILURE, failure_id, ?failed_host) =>
            skip
        >

        /* try to regenerate any INPROGRESS tuple for a failed worker
           on the host that failed */
        do {
            val = ILLEGAL_VAL;      /* illegal subtask value */
            < inp(TSmain, INPROGRESS, failed_host, ?id, ?num, ?val) =>
                out(TSmain, SUBTASK, num, val);
            >
        } while (val != ILLEGAL_VAL);
    }
}

static void
worker(id)
    int id;
{
    int num, val, result, host = ftl_my_host();
    ts_handle_t TSscratch;

    create_TS(Volatile, Private, &TSscratch);
    printf("worker(%d) here on host %d\n", id, host);

    while (1) {
        /* the worker ID in the INPROGRESS tuple is not needed, but is
           useful for debugging */
        < in(TSmain, SUBTASK, ?num, ?val) =>
            out(TSmain, INPROGRESS, host, id, num, val);
        >

        calc(val, &result);

        < true =>
            out(TSscratch, RESULT, num, result);
        >

        < in(TSmain, INPROGRESS, host, id, num, val) =>
            move(TSscratch, TSmain);
        >
    }
}

static void
calc(val, result_ptr)
    int val;
    int *result_ptr;
{
    *result_ptr = 2 * val;          /* really simple ... */
}
APPENDIX F
FT-LINDA DIVIDE AND CONQUER EXAMPLE
/
Fault-tolerant divide and conquer worker. Here we sum up
the elements of a vector to demonstrate the technique.
/
#ttcontext divide
#include
<
stdio.h
>
#include "ftlinda.h"
#define MAX SIZE 256
/
biggest vector size
/
#define MAX ELEM 50
/
biggest element
/
#define MIN ELEM 10
/
smallest element
/
#define SIZE CUTOFF 16
#define SMALL ENOUGH(task) (task.size
SIZE CUTOFF ? 1 : 0)
#define ILLEGAL SIZE
?
1
#define WORKERS PER HOST 4
/
number of workers to create on each host
/
/
types
/
typedef struct
f
int size;
int elem[MAX SIZE];
g
vec;
newtype void SUBTASK;
newtype void RESULT;
newtype void INPROGRESS;
newtype int SUM T;
newtype int SIZE T;
/
function declarations
/
void worker();
void monitor();
void init(vec
);
void part1(vec
, vec
);
void part2(vec
, vec
);
SUM T sumvec(vec);
void switch some();
147
148
LindaMain (argc, argv)
int argc;
char
argv [];
f
int w, host, lpid, f id;
SUM T sum, total sum, correct sum;
SIZE T size, total size;
vec task;
/
create one monitor process on each host
/
for (host=0; host
<
ftl num hosts(); host++)
f
lpid = new lpid();
f id = new failure id();
ftl create user thread(monitor, "monitor", host, lpid, f id, 0, 0, 0);
g
/
create WORKERS PER HOST workers on each host
/
for (host=0; host
<
ftl num hosts(); host++)
f
for (w=0; w
<
WORKERS PER HOST; w++)
f
lpid = new lpid();
ftl create user thread(worker, "worker", host, lpid, 0, 0, 0, 0);
g
g
/
initalize the vector
/
init(&task);
correct sum = sumvec(task);
/
check answer later with this
/
/
deposit the task into TS
/
<
true =
>
out(TSmain, SUBTASK, task);
>
/
wait until the sums have come in from all subtasks
/
do
f
<
in(TSmain, RESULT, ?sum, ?size) =
>
skip
>
total sum += sum;
total size += size;
g
while(total size
<
MAX SIZE);
printf("The sum of the %d elements is %d\n", MAX SIZE, total sum);
if (total sum
6
=
correct sum)
ftl exit("incorrect sum", 1);
else
ftl exit(NULL, 0);
/
halt the worker threads
/
g
149
void
worker()
f
int r, host = ftl my host(), lpid = ftl my lpid();
vec task, task1, task2;
SUM T sum;
SIZE T size;
/
here we will put an extra LPID field in the INPROGRESS
tuple to ensure we withdraw our INPROGRESS tuple, not
another worker’s from this host.
/
for (;;)
f
<
in(TSmain, SUBTASK, ?task) =
>
out(TSmain, INPROGRESS, task, lpid, host);
>
if (SMALL ENOUGH(task))
f
sum = sumvec(task);
size = task.size;
<
in(TSmain, INPROGRESS, ?vec, lpid, host) =
>
out(TSmain, RESULT, sum, size);
>
g
else
f
part1(&task,&task1);
part2(&task,&task2);
<
in(TSmain, INPROGRESS, ?vec, lpid, host) =
>
out(TSmain, SUBTASK, task1);
out(TSmain, SUBTASK, task2);
>
g
g
g
150
void
monitor(int failure id)
f
int lpid=ftl my lpid(), failed host, my host=ftl my host();
vec task;
SUM T sum;
for (;;)
f
/
wait for a failure
/
<
in(TSmain, FAILURE, failure id, ?failed host) =
>
skip
>
/
Regenerate all subtasks that were inprogress on the failed
host. Note that since the AGS is not yet implemented in
expressions we have to test to see if the formal in the
inp was set. To do this, we set task.size to an illegal
value; if it is still this after the inp then we know it failed.
/
do
f
task.size = ILLEGAL SIZE;
<
inp(TSmain, INPROGRESS, ?task, ?lpid, failed host) =
>
out(TSmain, SUBTASK, task);
>
g
while (task.size
6
=
ILLEGAL SIZE);
g
g
/
Initialize the vector randomly
/
void
init(vec
task)
f
int i, count=0;
task
!
size = MAX SIZE;
/
seed with elements in [MIN ELEM,MAX ELEM)
/
for (i=0; i
<
MAX SIZE; i++)
task
!
elem[i] = MIN ELEM + (random() % (MAX ELEM
?
MIN ELEM) );
g
151
/
Fill the first half of t into t1
/
void
part1(vec
t, vec
t1)
f
int i, mid = t
!
size / 2;
t1
!
size = mid;
for (i=0; i
<
mid; i++)
t1
!
elem[i] = t
!
elem[i];
g
/
Fill the second half of t into t2
/
void
part2(vec
t, vec
t2)
f
int i, mid = t
!
size / 2;
t2
!
size = t
!
size
?
mid;
for (i=mid; i
<
t
!
size; i++)
t2
!
elem[i
?
mid] = t
!
elem[i];
g
/
Sum up the elements in task
/
SUM T
sumvec(vec task)
f
SUM T sum=0; int i;
for (i=0; i
<
task.size; i++)
sum += task.elem[i];
return sum;
g
152
APPENDIX G
FT-LINDA BARRIER EXAMPLE
/
Fault-tolerant barrier.
/
#ttcontext barrier
#include
<
stdio.h
>
#include "ftlinda.h"
#define NUM COLS 64
#define NUM ROWS 8
#define FAIL ROLL (random() % 8)
/
types
/
newtype int ROW T[NUM COLS];
newtype ROW T ARRAY T[NUM ROWS];
newtype void ARRAY;
newtype void REGISTRY;
newtype void WORKER DONE;
/
function declarations
/
void worker();
void monitor();
void init(ARRAY T);
int converged(ARRAY T, int);
int compute(ARRAY T, int);
static void maybe fail(int);
/
Note: since array subscripts have not yet been implememted
in the AGS parsing code, we have to maintain an extra variable
to use in TS operations and then copy to and from outside the AGS.
/
153
154
LindaMain (argc, argv)
int argc;
char
argv [];
f
int i, w, host, lpid, f id, iter=1;
ARRAY T a;
ROW T r;
/
initalize the array and place it in TS
/
init(a);
for (i=0; i
<
NUM ROWS; i++)
f
(void) memcpy(r, a[i], sizeof(r));
/
copy a[i] for use in AGS
/
<
true =
>
out(TSmain, ARRAY, iter, i, r);
>
g
/
create one monitor on each host
/
for (host=0; host
<
ftl num hosts(); host++)
f
lpid = new lpid();
fid = new failure id();
ftl create user thread(monitor, "monitor", host, lpid, f id, 0, 0, 0);
g
/
create NUM ROWS workers and their registry tuples
/
for (w=0; w
<
NUM ROWS; w++)
f
host = w % ftl num hosts();
/
must create registry tuple before worker!
/
<
true =
>
out(TSmain, REGISTRY, host, w, iter);
>
lpid = new lpid();
ftl create user thread(worker, "worker", host, lpid, w, 0, 0, 0);
g
/
wait until all workers are done
/
for (w=0; w
<
NUM ROWS; w++)
f
<
in(TSmain, WORKER DONE, ?w) =
>
skip
>
g
printf("Program %s is all done\n", argv[0]);
g
155
/
worker(id) updates a[id])
/
void
worker(int id)
f
ARRAY T a;
ROW T r;
int i, iter, host=ftl my host();
/
initialize iter and a
/
<
rd(TSmain, REGISTRY, host, id, ?iter) =
>
skip
>
for (i=0; i
<
NUM ROWS; i++)
f
<
rd(TSmain, ARRAY, iter, i, ?r) =
>
skip
>
/
read a[i] from TS into r
/
memcpy(a[i], r, sizeof(r));
/
copy into a[i] in memory
/
g
while ( !converged(a, iter) )
f
maybe fail(id);
compute(a, id);
/
update a[id] in local memory
/
memcpy(r, a[i], sizeof(r));
/
atomically deposit my row for next iteration & update my registry
/
<
true =
>
out(TSmain, ARRAY, PLUS(iter,1), id, r);
in(TSmain, REGISTRY, ?host, id, iter);
out(TSmain, REGISTRY, host, id, PLUS(iter,1));
>
/
Barrier: wait until all workers are done with iteration iter
/
for (i=0; i
<
NUM ROWS; i++)
f
<
rd(TSmain, ARRAY, PLUS(iter,1), i, ?r) =
>
skip
>
memcpy(a[i], r, sizeof(r));
g
/
Garbage collection on last iteration
/
if (iter
>
1)
f
<
true =
>
in(TSmain, ARRAY, MINUS(iter,1), id, ?ROW T);
>
g
iter++;
g
<
true =
>
out(TSmain, WORKER DONE, id);
>
g
156
void
monitor(int failure id)
f
#define ILLEGAL WORKER
?
1
int failed host, lpid, w, iter, my host=ftl my host();
for (;;)
f
/
wait for a failure
/
<
in(TSmain, FAILURE, failure id, ?failed host) =
>
skip
>
/
try to recreate all failed workers found on this host
/
do
f
w = ILLEGAL WORKER;
<
inp(TSmain, REGISTRY, failed host, ?w, ?iter) =
>
out(TSmain, REGISTRY, my host, w, iter);
>
if (w
6
=
ILLEGAL WORKER)
f
lpid = new lpid();
ftl create user thread(worker, "worker", my host, lpid, w,
0, 0, 0);
g
g
while (w
6
=
ILLEGAL WORKER);
g
#undef ILLEGAL WORKER
g
157
/
Initialize the array somehow
/
void
init(ARRAY T a)
f
int i,j;
for (i=0; i
<
NUM ROWS; i++)
for (j=0; j
<
NUM COLS; j++)
a[i][j] = 0x1000
i + j;
g
/
compute the next iteration of a[id]. Just a toy example computation ...
/
int
compute(ARRAY T a, int id)
f
int j;
int above, below;
for (j=0; j
<
NUM COLS; j++)
f
above = (id == 0 ? 0 : a[id
?
1][j]);
below = (id == (NUM ROWS
?
1) ? 0 : a[id+1][j]);
a[id][j] += (above + below);
g
g
/
Since this uses a toy example with no real meaning, we will simply
converge after a few iterations
/
int
converged(ARRAY T a, int iterations)
f
return (iterations
3 ? 0 : 1);
g
static void
maybe fail(int id)
f
int host = ftl my host();
/
see again if we should fail our host
/
if ( (host
6
=
0) && (FAIL ROLL == 0) )
f
ftl fail host(host);
g
g
158
APPENDIX H
MAJOR DATA STRUCTURES
/
Major data structures in the FT-Linda TS managers. Some of the minor data
structures and unimportant fields in the following data structures have been
ommitted for brevity and clarity. They have also been reordered for clarity.
/
/
request kind t tracks the kind of request request t deals with.
/
typedef enum
f
REQ AGS,
/
a normal
<
... =
>
...
>
command
/
REQ NEW LPID,
/
a request for a logical PID
/
REQ TS CREATE,
/
create a replicated TS
/
REQ TS DESTROY,
/
destroy a replicated TS
/
REQ FAIL NOTIFY,
/
notify TS replicas of a host failure
/
REQ NEW FAILURE ID
/
allocate a new failure ID
/
g
request kind t;
159
160
/
request t is what is passed from the FT-Linda application to the FT-Linda RTS.
/
typedef struct request t
f
request kind t r kind;
/
what kind of request
/
guard kind t r guards kind;
/
absent, blocking, boolean
/
resilience t r guards resilience;
/
resilience of guards
/
scope t r guards scope;
/
scope of guards
/
char r filename[MAX FILENAME+1];
/
filename of request
/
int r starting line;
/
line
<
...
>
started on
/
int r num branches;
/
number of branches
/
branch t r branch[MAX BRANCHES];
/
each guard =
>
body
/
int r branch chosen;
/
which branch did we do?
/
TS HANDLE T r ts handle;
/
which TS to destroy
/
UCHAR r guard return val;
/
return val for =
>
/
/
r id is used by non-AGS requests depending on what its r kind is:
REQ TS ID: TS id allocated
REQ LPID: LPID allocated
REQ CREATE: TS index of created TS
REQ DESTROY: TS index of destroyed TS
REQ FAIL NOTIFY: host that failed
REQ NEW FAILURE ID: failure ID allocated
/
int r id;
int r lpid;
/
LPID of request originator
/
int r rep seqn;
/
sequence number for id
/
int r next offset;
/
next offset into r actual[]
/
/
unresolved opcode args, i.e. one where at least one argument is
P FORMAL VAL so the GC couldn’t evaluate.
/
opcode args t r opcode args [MAX OPCODES][MAX OPCODE ARGS];
int r cookie;
/
magic cookie to try to detect
corruption of request
structure.
/
int r times called;
/
times the AGS has been
executed
/
int r host;
/
host the request sent from
/
double r pad1;
/
ensure following aligned
/
UCHAR r formal[FORMALSIZE];
/
area to store all the
formals for the
chosen branch.
/
double r pad2;
/
ensure following aligned
/
UCHAR r actual[MAX ACTUALS SIZE];
g
request t;
161
/
a branch is one guard =
>
body
/
typedef struct branch t
f
UCHAR b guard present;
/
is there a guard?
/
UCHAR b guard negated;
/
is guard negated with “not” ??
/
op t b guard;
/
guard of the branch
/
op t b body[MAX BODY];
/
body of the branch
/
int b body size;
/
no of ops in body
/
int b body next idx[MAX BODY];
/
ordering of body ops; 0, then
next idx[0], ... Need cause a
move must generate outs before
next op
/
int b formal offset[MAX FORMALS];
/
offset for each formal
into r formal[].
/
stub t
b stubptr;
/
RTS ptr to branch stub
/
g
branch t;
/
param t tracks what type of parameters each TS op have. The values of some parameters
(P FORMAL VAL below) will not have their values known until the request has been received
at each replica, since they are a reference to the value of a variable that was a formal
in an earlier op in the same
<
...
>
. These can occur either as
parameters or as arguments to an opcode parameter. For example, in
<
in(FOO, ?x) =
>
out(FEE, x, MAX(X,1) )
>
x is P FORMAL in the in(FOO...) but in out(FEE... is P FORMAL VAL
both as a parameter by itsself and as an argument to opcode MAX.
The order of the literals used here is important. Opcodes must come last,
and the first opcode must be P OP MIN. This is so the RTS can quickly test
whether or not a param is an opcode.
/
typedef enum
f
P TYPENAME,
/
Linda type, actual or formal
/
P FORMAL VAR,
/
?var
/
PVAL,
/
constant (later expr?)
/
PFORMAL VAL,
/
val of var that was a formal
earlier in the same branch
/
POP MIN, P OP MAX, P OP MINUS, P OP PLUS,
/
P OP xxx is opcode xxx
/
POP LOOKUP1, P OP LOOKUP2, P OP LOOKUP3
g
param t;
162
/
optype t denotes Linda primitives; op t stores needed info for them.
/
typedef enum
f
OP IN, OP INP, OP RD, OP RDP, OP MOVE, OP COPY, OP OUT
g
optype t;
/
guard kind t tells what kind the guards are (they all must be the same)
/
typedef enum
f
g absent, g blocking, g boolean
g
guard kind t;
/
we track the arguments for opcode calls. If none of the agruments
are P FORMAL VAL then they are all known while the request is being
filled in by the GC. In this case the opcode will be evaluated
and the param listed as P VAL. Thus, P OP... params only occur
when arg(s) are P FORMAL VAL. (MAY CHANGE FOR SIMJPLICITY)
/
typedef struct
f
int oa formal index;
/
index into request.bindings if param is FORMAL VAL
or -1 if arg val is in op arg value.
/
int oa op arg value;
/
LIMITATION: only ints for opcode args
for now. Could later make this a union.
/
g
opcode args t;
/
A stub t variable represents one branch of a request in the RTS. It is
enqueued on a queue based on the hash value of the branch’s guard.
/
typedef struct stub t
f
struct request t
st request;
/
request for the given stub
/
int st branch index;
/
branch # of corresponding branch for this stub
/
BOOL is blocked;
/
is this blocked? else on candidateQ
/
g
stub t;
/
ts t is a tuple space.
/
typedef struct ts
f
Q t ts blocked[MAX HASH];
/
stubs for blocked guards
/
Q t ts tuples[MAX HASH];
/
tuples in the TS
/
g
ts t;
163
/
op t is the data structure for both ops in an AGS and also tuples in TS. If the op is
in an AGS then the actuals’ data will be stored in the request’s r actual[] area, otherwise
the tuple’s actuals will be stored in op.o actual[]. When the TS managers create a tuple from
an out op they allocate an op with enough room at the end for o acutal[] to fit all the actual data.
/
typedef struct op t
f
Q t o links;
/
RTS links + key; MUST BE FIRST
/
TIME T o time;
/
RTS time stamp; MUST FOLLOW LINKS
/
TS HANDLE T o ts[2];
/
TS or TSs involved in this op
/
optype t o optype;
/
operator
/
param t o param[NARITY];
/
kind of parameter. ??
/
/
o idx[i] is used in different ways, as an index into another array.
The way is it used is a function of o param[i] :
case P FORMAL VAR:
case P FORMAL VAL:
Here o idx[i] tells which formal # that parameter i is
for the branch. This can be used as follows to find
where to store the formal (P FORMAL VAR) or where to
retrieve its value from (P FORMAL VAL):
formal idx = tuple.o idx[i]
offset = b formal offset[formal idx]
formal address = &(r formal[offset])
case P OP xxx:
Here o idx[i] tells which unresolved opcode # that parameter
i is for this request. (Unresolved opcodes are where at
least one of the arguments is P FORMAL VAL and thus the
GC can’t evaluate it and convert it to P VAL.) The info
for this opcode is stored in r opcode args[o idx[i]].
/
UCHAR o idx[NARITY];
/
Let start = o data start[i] and stop = o data stop[i]. Then parameter i’s data is in
locations [start..stop) of either tuple.o actual[] or request.r actual[], depending
on which case the parameter is.
/
UWORD o data start[NARITY];
UWORD o data stop[NARITY];
UWORD o arity;
/
number of params MAY GO AWAY
/
long o polarity;
/
actual/formal; assumes NARITY
32
/
int o linenum;
/
starting line of op
/
int o type;
/
tuple type, aka the tuple’s index.
/
int o hash;
/
hash value;f(type,param1)
/
double o pad1;
/
ensure o actual[] aligned
/
UCHAR o actual[1];
/
area for actual (P VAL) data if this op is a tuple.
/
g
op t;
164
165
REFERENCES
[ACG86] Sudhir Ahuja, Nicholas Carriero, and David Gelernter. Linda and friends.
IEEE Computer, 19(8):26–34, August 1986.
[AD76] P. A. Alsberg and J. D. Day. A principle for resilient sharing of distributed
resources. In Proceedings of the Second International Conference on
Software Engineering, pages 627–644, October 1976.
[AG91a] Shakil Ahmed and David Gelernter. A higher-level environment for par-
allel programming. Technical Report YALEDU/DCS/RR-877, Yale Uni-
versity Department of Computer Science, November 1991.
[AG91b] Shakil Ahmed and David Gelernter. Program builders as alternatives
to high-level languages. Technical Report YALEDU/DCS/RR-887, Yale
University Department of Computer Science, November 1991.
[AGMvR93] Carlos Almeida, Brad Glade, Keith Marzullo, and Robbert van Renesse.
High availability in a real-time system. ACM Operating Systems Review,
27(2):82–87, April 1993.
[Akl89] Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice
Hall, 1989.
[And91] Gregory R. Andrews. Concurrent Programming: Principles and Practice.
Benjamin/Cummings, Redwood City, California, 1991.
[AO93] Gregory R. Andrews and Ronald A. Olsson. The SR Programming Lan-
guage: Concurrency in Practice. Benjamin/Cummings, Redwood City,
California, 1993.
[AS91] Brian G. Anderson and Dennis Shasha. Persistent Linda: Linda + transactions
+ query processing. In J.-P. Banâtre and D. Le Métayer, editors,
Research Directions in High-Level Parallel Programming Languages, number
574 in LNCS, pages 93–109. Springer, 1991.
[ASC85] Amr El Abbadi, Dale Skeen, and Flaviu Cristian. An efficient, fault-
tolerant protocol for replicated data management. In Proceedings of the
4th ACM SIGACT/SIGMOD Conference on Principles of Database Systems,
1985.
166
[Bal90] Henri E. Bal. Programming Distributed Systems. Silicon Press, Summit,
New Jersey, 1990.
[BHJL86] Andrew P. Black, Norman Hutchinson, Eric Jul, and Henry M. Levy. Ob-
ject structure in the emerald system. In Proceedings of the First ACM
Conference on Object-Oriented Programming Systems, Languages and Ap-
plications, pages 78–86, Portland, Oregon, September 1986.
[BJ87] Kenneth P. Birman and Thomas A. Joseph. Reliable communication in the
presence of failures. ACM Transactions on Computer Systems, 5(1):47–
76, February 1987.
[Bjo92] Robert D. Bjornson. Linda on Distributed Memory Multiprocessors.
PhD thesis, Department of Computer Science, Yale University, Novem-
ber 1992.
[BKT92] Henri E. Bal, M. Frans Kaashoek, and Andrew S. Tanenbaum. Orca: A
language for parallel programming of distributed systems. IEEE Transac-
tions on Software Engineering, 18(3):190–205, March 1992.
[BLL94] Ralph M. Butler, Alan L. Leveton, and Ewing L. Lusk. p4-linda: A
portable implementation of linda. In Proceedings of the Second Inter-
national Symposium on High Performance Distributed Computing, pages
50–58, Spokane, Washington, July 1994.
[BMST92] Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg.
Primary-backup protocols: Lower bounds and optimal implementations.
In Proceedings of the Third IFIP Working Conference on DependableCom-
puting for Critical Applications, pages 187–198, Mondello, Italy, 1992.
[BN84] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure
calls. ACM Transactions on Computer Systems, 2(1):39–59, February
1984.
[BS91] David E. Bakken and Richard D. Schlichting. Tolerating failures in the
bag-of-tasks programming paradigm. In Proceedings of the Twenty-First
International Symposium on Fault-Tolerant Computing, pages 248–255,
June 1991.
[BS94] David E. Bakken and Richard D. Schlichting. Supporting fault-tolerant
parallel programming in Linda. IEEE Transactions on Parallel and Dis-
tributed Systems, 1994. To appear.
[BSS91] Kenneth Birman, André Schiper, and Pat Stephenson. Lightweight causal
and atomic group multicast. ACM Transactions on Computer Systems,
9(3):272–314, August 1991.
[CASD85] Flaviu Cristian, Houtan Aghili, Ray Strong, and Danny Dolev. Atomic
broadcast: From simple message diffusion to Byzantine agreement. In
Proceedings of the Fifteenth International Symposium on Fault-Tolerant
Computing, pages 200–206. IEEE Computer Society Press, June 1985.
[CBZ91] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementa-
tion and performance of Munin. In Proceedings of the Thirteenth ACM
Symposium on Operating Systems Principles, 1991.
[CD94] Scott R. Cannon and David Dunn. Adding fault-tolerant transaction pro-
cessing to Linda. Software—Practice and Experience, 24(5):449–466,
May 1994.
[CG86] Nicholas Carriero and David Gelernter. The S/Net’s Linda kernel. ACM
Transactions on Computer Systems, 4(2):110–129, May 1986.
[CG88] Nicholas Carriero and David Gelernter. Applications experience with
Linda. ACM SIGPLAN Notices (Proc. ACM SIGPLAN PPEALS),
23(9):173–187, September 1988.
[CG89] Nicholas Carriero and David Gelernter. Linda in context. Communica-
tions of the ACM, 32(4):444–458, April 1989.
[CG90] Nicholas Carriero and David Gelernter. How to Write Parallel Programs:
A First Course. MIT Press, 1990.
[CG93] P. Ciancarini and N. Guerrini. Linda meets Minix. ACM SIGOPS Oper-
ating Systems Review, 27(4):76–92, October 1993.
[CGKW93] Nicholas Carriero, David Gelernter, David Kaminsky, and Jeffery
Westbrook. Adaptive parallelism with Piranha. Technical Report
YALEU/DCS/RR-954, Yale University Department of Computer Science,
February 1993.
[CGM92] Nicholas Carriero, David Gelernter, and Timothy G. Mattson. Linda in
heterogeneous computing environments. In Proceedings of the Workshop
on Heterogeneous Processing. IEEE, March 1992.
[Cia93] Paolo Ciancarini. Distributed programming with logic tuple spaces.
Technical Report UBLCS-93-7, Laboratory for Computer Science, Uni-
versity of Bologna, April 1993.
[CKM92] Shigeru Chiba, Kazuhiko Kato, and Takashi Masuda. Exploiting a weak
consistency to implement distributed tuple space. In Proceedings of the
12th International Conference on Distributed Computing Systems, pages
416–423, June 1992.
[Com88] Douglas Comer. Internetworking with TCP/IP. Prentice-Hall, 1988.
[CP89] Douglas E. Comer and Larry L. Peterson. Understanding naming in dis-
tributed systems. Distributed Computing, 3(2):51–60, 1989.
[Cri91] Flaviu Cristian. Understanding fault-tolerant distributed systems. Com-
munications of the ACM, 34(2):57–78, February 1991.
[CS93] Leigh Cagan and Andrew H. Sherman. Linda unites network systems.
IEEE Spectrum, 30(12):31–35, December 1993.
[Dij75] Edsger W. Dijkstra. Guarded commands, nondeterminacy, and formal
derivation of programs. Communications of the ACM, 18(8):453–457,
August 1975.
[DoD83] U.S. Department of Defense. Reference Manual for the Ada Programming
Language. Washington D.C., 1983.
[For93a] Message Passing Interface Forum. Document for a standard message-
passing interface, October 1993. (available from netlib).
[For93b] The MPI Forum. MPI: A message passing interface. In Proceedings of
Supercomputing ’93, pages 878–883, Los Alamitos, California, November
1993. IEEE Computer Society Press.
[GC92] David Gelernter and Nicholas Carriero. Coordination languages and their
significance. Communications of the ACM, 35(2):97–107, February 1992.
[Gel85] David Gelernter. Generative communication in Linda. ACM Transac-
tions on Programming Languages and Systems, 7(1):80–112, January 1985.
[GK92] David Gelernter and David Kaminsky. Supercomputing out of recycled
garbage: Preliminary experience with Piranha. In Proceedings of the Sixth
ACM International Conference on Supercomputing, Washington, D.C., July
1992.
[GMS91] Hector Garcia-Molina and Annemarie Spauster. Ordered and reliable
multicast communication. ACM Transactions on Computer Systems,
9(3):242–271, August 1991.
[Gra78] Jim Gray. Notes on database operating systems. In Operating Systems:
An Advanced Course, Lecture Notes in Computer Science. Springer-Verlag,
Berlin, 1978.
[Gra86] James N. Gray. An approach to decentralized computer systems. IEEE
Transactions on Software Engineering, SE-12(6):684–692, June 1986.
[Has92] Willi Hasselbring. A formal Z specification of ProSet-Linda. Technical
Report 04–92, University of Essen, Department of Computer Science, 1992.
[Hoa78] C.A.R. Hoare. Communicating sequential processes. Communications
of the ACM, 21(8):666–677, August 1978.
[HP90] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann (Palo Alto, California), 1990.
[HP91] Norman C. Hutchinson and Larry L. Peterson. The x-kernel: An architec-
ture for implementing network protocols. IEEE Transactions on Software
Engineering, 17(1):64–76, January 1991.
[HW87] Maurice P. Herlihy and Jeannette M. Wing. Avalon: Language support
for reliable distributed systems. In Digest of Papers, The Seventeenth
International Symposium on Fault-Tolerant Computing, pages 89–94. IEEE
Computer Society, IEEE Computer Society Press, July 1987.
[Jac90] Jonathan Jacky. Inside risks: Risks in medical electronics. Communica-
tions of the ACM, 33(12):138, December 1990.
[Jal94] Pankaj Jalote. Fault Tolerance in Distributed Systems. Prentice Hall,
1994.
[Jel90] Robert Jellinghaus. Eiffel Linda: An object-oriented Linda dialect.
ACM SIGPLAN Notices, 25(12):70–84, December 1990.
[JS94] Karpjoo Jeong and Dennis Shasha. PLinda 2.0: A transac-
tional/checkpointing approach to fault tolerant Linda. In Proceedings
of the Thirteenth Symposium on Reliable Distributed Systems, Dana Point,
California, October 1994. To appear.
[Kam90] Srikanth Kambhatla. Recovery with limited replay: Fault-tolerant pro-
cesses in Linda. Technical Report CS/E 90-019, Department of Computer
Science, Oregon Graduate Institute, 1990.
[Kam91] Srikanth Kambhatla. Replication issues for a distributed and highly avail-
able Linda tuple space. Master’s thesis, Department of Computer Science,
Oregon Graduate Institute, 1991.
[Kam94] David Kaminsky. Adaptive Parallelism with Piranha. PhD thesis, De-
partment of Computer Science, Yale University, May 1994.
[KMBT92] M. Frans Kaashoek, Raymond Michiels, Henri E. Bal, and Andrew S.
Tanenbaum. Transparent fault-tolerance in parallel Orca programs. In
Proceedings of the Third Symposium on Experiences with Distributed and
Multiprocessor Systems, pages 297–311, Newport Beach, California, March
1992.
[KT87] Richard Koo and Sam Toueg. Checkpointing and rollback-recovery
for distributed systems. IEEE Transactions on Software Engineering,
13(1):23–31, January 1987.
[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed
system. Communications of the ACM, 21(7):558–565, July 1978.
[Lam81] Butler Lampson. Atomic transactions. In Distributed Systems—
Architecture and Implementation, pages 246–265. Springer-Verlag, Berlin,
1981.
[Lap91] Jean-Claude Laprie. Dependability: Basic Concepts and Terminol-
ogy, volume 4 of Dependable Computing and Fault-Tolerant Systems.
Springer-Verlag, 1991.
[Lei89] Jerrold Leichter. Shared Tuple Memories, Shared Memories, Buses and
LAN’s—Linda Implementation Across the Spectrum of Connectivity. PhD
thesis, Department of Computer Science, Yale University, July 1989.
[LRW91] LRW Systems. LRW™ LINDA-C for VAX Users Guide, 1991. Order
number VLN-UG-102.
[LS83] Barbara Liskov and Robert Scheifler. Guardians and actions: Linguistic
support for robust, distributed programs. ACM Transactions on Program-
ming Languages and Systems, 5(3):381–404, July 1983.
[LSP82] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine
generals problem. ACM Transactions on Programming Languages and
Systems, 4(3):382–401, July 1982.
[Mis92] Shivakant Mishra. Consul: A Communication Substrate for Fault-
Tolerant Distributed Programs. PhD thesis, Department of Computer
Science, The University of Arizona, February 1992.
[MPS93a] Shivakant Mishra, Larry L. Peterson, and Richard D. Schlichting. Consul:
A communication substrate for fault-tolerant distributed programs. Dis-
tributed Systems Engineering, 1:87–103, 1993.
[MPS93b] Shivakant Mishra, Larry L. Peterson, and Richard D. Schlichting. Expe-
rience with modularity in Consul. Software Practice and Experience,
23(10):1059–1075, October 1993.
[MS92] Shivakant Mishra and Richard D. Schlichting. Abstractions for construct-
ing dependable distributed systems. Technical Report 92-19, Department
of Computer Science, The University of Arizona, August 1992.
[Nel81] Bruce J. Nelson. Remote Procedure Call. PhD thesis, Computer Science
Department, Carnegie-Mellon University, 1981.
[Neu92] Peter G. Neumann. Inside risks: Avoiding weak links. Communications
of the ACM, 35(12):146, December 1992.
[NL91] Bill Nitzberg and Virginia Lo. Distributed shared memory: A survey of
issues and algorithms. Computer, 24(8):52–60, August 1991.
[PBS89] Larry L. Peterson, Nick C. Buchholz, and Richard D. Schlichting. Preserv-
ing and using context information in interprocess communication. ACM
Transactions on Computer Systems, 7(3):217–246, August 1989.
[Pow91] David Powell, editor. Delta-4: A Generic Architecture for Dependable
Distributed Computing. Springer-Verlag, 1991.
[PSB+88] David Powell, Douglass Seaton, Gottfried Bonn, Paulo Verissimo, and
F. Waeselynk. The Delta-4 approach to dependability in open distributed
computing systems. In Proceedings of the Eighteenth Symposium on
Fault-Tolerant Computing, Tokyo, June 1988.
[PTHR93] Lewis I. Patterson, Richard S. Turner, Robert M. Hyatt, and Kevin D.
Reilly. Construction of a fault-tolerant distributed tuple-space. In Pro-
ceedings of the 1993 Symposium on Applied Computing, pages 279–285.
ACM/SIGAPP, February 1993.
[RSB90] Parameswaran Ramanathan, Kang G. Shin, and Ricky W. Butler. Fault-
tolerant clock synchronization in distributed systems. IEEE Computer,
23(10):33–42, October 1990.
[SBA93] Benjamin R. Seyfarth, Jerry L. Bickham, and Mangaiarkarasi Arumughum.
Glenda installation and use. University of Southern Mississippi, Novem-
ber 1993.
[SBT94] Richard D. Schlichting, David E. Bakken, and Vicraj T. Thomas. Lan-
guage support for fault-tolerant parallel and distributed programming. In
Foundations of Ultradependable Computing. Kluwer Academic Publishers,
1994. To appear.
[SC91] Ellen H. Siegel and Eric C. Cooper. Implementing distributed Linda in
standard ML. Technical Report CMU-CS-91-151, School of Computer
Science, Carnegie Mellon University, 1991.
[Sch90] Fred Schneider. Implementing fault-tolerant services using the state ma-
chine approach. ACM Computing Surveys, 22(4):299–319, December
1990.
[Seg93] Edward Segall. Tuple Space Operations: Multiple-Key Search, On-line
Matching, and Wait-free Synchronization. PhD thesis, Department of
Electrical Engineering, Rutgers University, 1993.
[SS83] Richard D. Schlichting and Fred B. Schneider. Fail-stop processors: An
approach to designing fault-tolerant computing systems. ACM Transac-
tions on Computer Systems, 1(3):222–238, August 1983.
[Sun90] V. S. Sunderam. PVM: A framework for parallel distributed computing.
Concurrency: Practice and Experience, 2(4):315–339, December 1990.
[TKB92] Andrew S. Tanenbaum, M. Frans Kaashoek, and Henri E. Bal. Parallel
programming using shared objects and broadcasting. IEEE Computer,
25(8):10–19, August 1992.
[TM81] Andrew S. Tanenbaum and Sape J. Mullender. An overview of the
Amoeba distributed operating system. ACM Operating Systems Review,
15(3):51–64, July 1981.
[TS92] John Turek and Dennis Shasha. The many faces of consensus in distributed
systems. Computer, 25(6):8–17, June 1992.
[VM90] Paulo Verissimo and José Alves Marques. Reliable broadcast for fault-
tolerance on local computer networks. In Proceedings of the Ninth Sym-
posium on Reliable Distributed Systems, pages 54–63, Huntsville, AL,
October 1990.
[WL86] C. Thomas Wilkes and Richard J. LeBlanc. Rationale for the design of
Aeolus: a systems programming language for the action/object system.
In Proceedings of the 1986 IEEE International Conference on Computer
Languages, pages 107–122, October 1986.
[WL88] Robert Whiteside and Jerrold Leichter. Using Linda for supercomputing
on a local area network. In Proceedings of Supercomputing 88, 1988.
[XL89] Andrew Xu and Barbara Liskov. A design for a fault-tolerant, distributed
implementation of Linda. In Proceedings of the Nineteenth International
Symposium on Fault-Tolerant Computing, pages 199–206, June 1989.
[Xu88] Andrew Xu. A fault-tolerant network kernel for Linda. Master’s thesis,
MIT Laboratory for Computer Science, August 1988.
[Zav93] Pamela Zave. Feature interaction and formal specifications in telecommu-
nications. IEEE Computer, 26(8):20–29, August 1993.
[Zen90] Steven Ericsson Zenith. Linda coordination language; subsystem ker-
nel architecture (on transputers). Technical Report YALEU/DCS/RR-794,
Department of Computer Science, Yale University, May 1990.