Software Implemented Fault Tolerance: Technologies and Experience

Yennun Huang and Chandra Kintala
AT&T Bell Laboratories
Murray Hill, NJ 07974
Abstract
By software implemented fault tolerance, we mean a set of software facilities to detect and recover from faults that are not handled by the underlying hardware or operating system. We consider those faults that cause an application process to crash or hang; they include software faults as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. We define 4 levels of software fault tolerance based on availability and data consistency of an application in the presence of such faults. Watchd, libft and nDFS are reusable components that provide up to the 3rd level of software fault tolerance. They perform, respectively, automatic detection and restart of failed processes, periodic checkpointing and recovery of critical volatile data, and replication and synchronization of persistent data in an application software system. These modules have been ported to a number of UNIX* platforms and can be used by any application with minimal programming effort.

Some newer telecommunications products in AT&T have already enhanced their fault-tolerance capability using these three components. Experience with those products to date indicates that these modules provide efficient and economical means to increase the level of fault tolerance in a software product. The performance overhead due to these components depends on the level and varies from 0.1% to 14% based on the amount of critical data being checkpointed and replicated.

* UNIX is a registered trademark of UNIX System Laboratories, Inc.
1 Introduction
There are increasing demands to make the software in the applications we build today more tolerant to faults. From users' point of view, fault tolerance has two dimensions: availability and data consistency of the application. For example, users of telephone switching systems demand continuous availability, whereas bank teller machine customers demand the highest degree of data consistency [13]. Most other applications have lower degrees of requirements for fault tolerance in both dimensions, but the trend is to increase those degrees as the costs, performance, technologies and other engineering considerations permit; see Figure 1 below. In this paper, we discuss three cost-effective software technologies to raise the degree of fault tolerance in both dimensions of the application.

[Figure 1: Dimensions of Fault Tolerance -- availability and data consistency, with examples ranging from bank teller machines (high data consistency) to telephone switching systems (high availability).]
Following Cristian [7], we consider software applications that provide a "service" to clients. The applications in turn use the services provided by the underlying operating or database systems, which in turn use the computing and network communication services provided by the underlying hardware; see Figure 2.

[Figure 2: Layers of Fault Tolerance -- e.g., FT-DBMS at the database layer; duplex, TMR, etc. at the hardware layer.]

Tolerating faults in such applications involves detecting a failure, gathering knowledge about the failure and recovering from that failure. Traditionally, these fault tolerance actions are performed in the hardware, operating or database systems used in the underlying layers of the application software. Hardware fault tolerance is provided using Duplex, Triple-Module-Redundancy or other techniques [19]. Fault tolerance in the operating and database layers is often provided using replicated file systems [22], exception handling [23], disk shadowing [6], transaction-based checkpointing and recovery [18], and other system routines. These methods and technologies handle faults occurring in the underlying hardware, operating and database system layers only.
An increasing number of faults are, however, occurring in the application software layer, causing the application processes to crash or hang. A process is said to be crashed if the working process image is no longer present in the system. A process is said to be hung if the process image is alive and its entry is still present in the process table, but the process is not making any progress from a user's point of view. Such software failures arise from incorrect program designs or coding errors but, more often than not, they arise from transient and nondeterministic errors [9]; for example, unsatisfactory boundary value conditions, timing and race conditions in the underlying computing and communication services, performance failures, etc. [7]. Due to the complex and temporal nature of the interleaving of messages and computations in a distributed system, no amount of verification, validation and testing will eliminate all those software faults in an application and give complete confidence in the availability and data consistency of that application. So, those faults will occasionally manifest themselves and cause the application process to crash or hang.
It is possible to detect a crash and restart an application at a checkpointed state through operating system facilities, as in IBM's MVS [24]. In their paper on End-to-End Arguments [21], Saltzer et al. claim that such hardware and operating system based methods to detect and recover from software failures are necessarily incomplete. They show that fault tolerance cannot be complete without the knowledge and help from the endpoints of an application, i.e., the application software. We claim that such methods, i.e., services at a lower layer detecting and recovering from failures at a higher layer, are also inefficient. For example, file replication on a mirrored disk through a facility in the operating system will be less efficient than replicating only the "critical" files of the application in the application layer, since the operating system has no internal knowledge of that application. Similarly, generalized checkpointing schemes in an operating system checkpoint the entire in-memory data of an application, whereas application-assisted methods will checkpoint only the critical data [17, 3].
A common but misleading argument against embedding checkpointing, recovery and other fault tolerance schemes inside an application is that such schemes are not transparent, efficient or reliable because they are coded by application programmers; we claim that well-tested and efficient fault tolerance methods can be built as libraries or reusable software components executing in the application layer and that they provide as much transparency as the other methods do. All three components discussed in this paper have those properties, i.e., they are efficient, reliable and transparent.
The above observations and the End-to-End argument [21] lead to our notion of software fault tolerance, as defined below:

    Software fault tolerance is a set of software facilities to detect and recover from faults that cause an application process to crash or hang and that are not handled by the underlying hardware or operating system.

Observe that this includes software faults as described earlier as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. Thus, if the underlying hardware and operating system are not fault-tolerant in an application system due to performance/cost trade-offs or other engineering considerations, then that system can increase its availability more cost-effectively through software fault tolerance as described in this paper. It is also an easier migration path for making existing applications more fault-tolerant.
2 Model
For simplicity in the following discussions, we consider only client-server based applications running in a local or wide-area network of computers in a distributed system; the discussion also applies to other kinds of applications. Each application has a server process executing at the user level on top of a vendor-supplied hardware and operating system. To get services, clients send messages to the server; the server process acts on those messages one by one and, in each of those message-processing steps, updates its data. We sometimes call the server process the application. For fault tolerance purposes, the nodes in the distributed system will be viewed as being in a circular configuration such that each node will be a backup node for its left neighbor in that circular list. As shown in Figure 3, each application will be executing primarily on one of the nodes in the network, called the primary node for that application. Each executing application has process text (the compiled code), volatile data (variables, structures, pointers and all the bytes in the static and dynamic memory segments of the process image) and persistent data (the application files being referred to and updated by the executing process).

[Figure 3: Modified Primary Site Approach -- a primary node and a backup node, with application volatile data (checkpointed via libft) and persistent data replicated to the backup.]
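The circular backup arrangement described above can be made concrete with a few lines of code. The sketch below is illustrative only: the node list, its ordering and the helper name are our own assumptions, not part of watchd's or libft's interface.

    #include <stdio.h>

    /* Illustrative sketch: nodes are held in a fixed circular list, and each
     * node acts as the backup for its left neighbor, i.e., the backup of the
     * node at position i is the next node in the circle.  The host names and
     * the helper function are hypothetical. */
    static const char *nodes[] = { "hostA", "hostB", "hostC", "hostD" };
    #define NNODES (sizeof(nodes) / sizeof(nodes[0]))

    static const char *backup_of(size_t i)
    {
        return nodes[(i + 1) % NNODES];
    }

    int main(void)
    {
        for (size_t i = 0; i < NNODES; i++)
            printf("primary %s -> backup %s\n", nodes[i], backup_of(i));
        return 0;
    }

The same circular arrangement lets each node's watchd monitor just one neighboring watchd, as described in Section 3.1.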
We use a modified primary-site approach to software fault tolerance [1]. In the primary site approach, the service to be made fault tolerant is replicated at many nodes, one of which is designated as primary and the others as backups. All the requests for the service are sent to the primary site. The primary site periodically checkpoints its state on the backups. If the primary fails, one of the backups takes over as primary. This model for fault tolerance has been analyzed for frequency of checkpointing, degree of service replication and the effect on response time by Huang and Jalote [11, 12]. This model was slightly modified, as described below, to build the three technologies described in this paper. The tasks in our modification of the primary site approach are:
- a watchdog process running on the primary node watching for application crashes or hangs,
- a watchdog process running on the backup node watching for primary node crashes,
- periodically checkpointing the critical volatile data in the application,
- logging of client messages to the application,
- replicating the application's persistent data and making them available on the backup node,
- when the application on the primary node crashes or hangs, restarting the application, if possible, on the primary node, otherwise on the backup node,
- recovering the application to the last checkpointed state, reexecuting the message log and connecting the replicated files to the backup node if the application restarts on the backup.
Observe that these software fault tolerance tasks can be used in addition to other methods such as N-version programming [2] or recovery blocks [20] inside an application program. Observe also that the application process on the backup node will not be running until it is started by the watchd process; this is unlike the process-pair model [9], where the backup process will be passively running even during normal operations.

The degree to which the above software fault tolerance tasks are used in an application determines the availability and data consistency of that application. It is, therefore, useful to establish a classification of the different levels of software fault tolerance. We define the following 4 levels based on our experience in AT&T. Applications illustrating these levels are described in Section 4.
Level 0: No tolerance to faults in the application software:
In this level, when the executing application process dies or hangs, it has to be manually restarted from an initial internal state, i.e., the initial values of the volatile data. The application may leave its persistent data in an incorrect or inconsistent state due to the timing of the crash and may take a long time to restart due to elaborate initialization procedures.
Level 1: Automatic detection and restart:
When the application dies or hangs, the fault will be detected and the application will be restarted from an initial internal state on the same processor, if possible, or on a backup processor if available. In this level, the internal state of the application is not saved and, hence, the process will restart at the initial internal state. As stated above, restart along with reinitialization will be slow. The restarted internal state may not reflect all the messages that have been processed in the previous execution, and therefore may not be consistent with the persistent data. The difference between Levels 0 and 1 is that the detection and restart are automatic in Level 1, and therefore the application availability is higher in Level 1 than in Level 0.
Level 2: Level 1 plus periodic checkpointing, logging and recovery of internal state:
In addition to what is available in Level 1, the internal state of the application process is periodically checkpointed, i.e., the critical volatile data is saved, and the messages to the server are logged. After a failure is detected, the application is restarted at the most recent checkpointed internal state and the logged messages are reprocessed to bring the application to the state at which it crashed. The application availability and volatile data consistency are higher in Level 2 than those in Level 1.
Level 3: Level 2 plus persistent data recovery:
In addition to what is available in Level 2, the persistent data of the application is replicated on a backup disk connected to a backup node, and is kept consistent with the primary server throughout the normal operation of the application. In case of a fault and resulting recovery of the application on the backup node, the backup disk brings the application's persistent data as close as possible to the state at which the application crashed. The data consistency of the application in Level 3 is higher than that in Level 2.
Level 4: Continuous operation without any interruption:
This level of fault tolerance in software guarantees the highest degree of availability and data consistency. Often, this is provided, such as in switching systems, using replicated processing of the application on "hot" spare hardware. The state of a process need not be saved, but multicast messaging, voting and other mechanisms must be used to maintain consistency and concurrency control. Availability of the system during planned interruptions, such as those during upgrades, is made possible using dynamic loading or other operating system facilities. The technologies we describe in this paper do not provide this level of fault tolerance.
3 Technologies
Many applications perform some of these software fault tolerance features by coding them directly in their programs. We developed three reusable components - watchd, libft and nDFS - to embed those features in any application with minimal programming effort.
3.1 Watchd
Watchd is a watchdog daemon process that runs on a single machine or on a network of machines. It continually watches the life of a local application process by periodically sending a null signal to the process and checking the return value to detect whether that process is alive or dead. It detects whether that process is hung or not by using one of the following two methods. The first method sends a null message to the local application process using IPC (Inter-Process Communication) facilities on the local node and checks the return value. If it cannot make the connection, it waits for some time (specified by the application) and tries again. If it fails after the second attempt, it interprets this to mean that the process is hung. The second method asks the application process to send a heartbeat message to watchd periodically, and watchd periodically checks the heartbeat. If the heartbeat message from the application is not received by a specified time, watchd will assume that the application is hung. This implies that watchd cannot differentiate between hung processes and very slow processes.
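As a concrete illustration of the detection idea (not watchd's actual code), the sketch below uses the standard POSIX convention that kill(pid, 0) sends the null signal: no signal is delivered, but the return value and errno report whether the process still exists. The heartbeat check is a simplified stand-in using an assumed timestamp variable.

    #include <errno.h>
    #include <signal.h>
    #include <time.h>
    #include <sys/types.h>

    /* Crash detection: send the null signal (signal number 0).  Nothing is
     * actually delivered; the call only checks whether the process exists. */
    static int process_alive(pid_t pid)
    {
        if (kill(pid, 0) == 0)
            return 1;              /* process exists */
        return errno != ESRCH;     /* ESRCH: no such process */
    }

    /* Hang detection via heartbeats: last_heartbeat would be updated whenever
     * a heartbeat message arrives from the application (assumed, simplified). */
    static time_t last_heartbeat;

    static int looks_hung(time_t timeout_seconds)
    {
        return (time(NULL) - last_heartbeat) > timeout_seconds;
    }

A process that is merely slow would pass the first check but miss its heartbeat deadline, which is exactly the ambiguity noted above.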
When it detects that the application process has crashed or hung, watchd recovers that application at an initial internal state or at the last checkpointed state. The application is recovered on the primary node if that node has not crashed, otherwise on the backup node for the primary as specified in a configuration file. If libft is also used, watchd sets the restarted application to process all the logged messages from the log file generated by libft. Watchd also watches one neighboring watchd (left or right) in a circular fashion to detect node failures; this circular arrangement is similar to the adaptive distributed diagnosis algorithm [5]. When a node failure is detected, watchd can execute user-defined recovery commands and reconfigure the network. Observe that neighboring watchds cannot differentiate between node failures and link failures. In general, this is the problem of attaining common knowledge in the presence of communication failures, which is provably unsolvable [10].
Watchd also watches itself. A self-recovery mechanism is built into watchd in such a way that it can recover itself from an unexpected software failure. Watchd also facilitates restarting a failed process, restoring the saved values and reexecuting the logged events, and provides facilities for remote execution, remote copy, distributed election, and status report production.
3.2 Libft
Libft is a user-level library of C functions that can be used in application programs to specify and checkpoint critical data, recover the checkpointed data, log events, locate and reconnect a server, do exception handling, do N-version programming (NVP), and use recovery block techniques.
Libft provides a set of functions (e.g., critical()) to specify critical volatile data in an application. Those critical data items are allocated in a reserved region of the virtual memory and are periodically checkpointed. Values in critical data structures are saved using memory copy functions, thus avoiding traversal of application-dependent data structures. When an application does a checkpoint, its critical data will be saved on the primary and backup nodes. Unlike other checkpointing methods [17], the overhead in our checkpointing mechanism is minimized by saving only critical data and avoiding data-structure traversals. This idea of saving only critical data in an application is analogous to the Recovery Box concept in Sprite [3].
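The fragment below sketches how an application might register and checkpoint critical data. The function names critical() and checkpoint() come from the text, but the prototypes, the data structure and the checkpointing policy shown here are assumptions made purely for illustration.

    #include <stddef.h>

    /* Assumed prototypes, for illustration only; not libft's documented API. */
    extern void critical(void *addr, size_t size);
    extern int  checkpoint(void);

    /* Hypothetical critical volatile data: a table the server updates per message. */
    struct call_table {
        int  n_entries;
        long routes[1024];
    };
    static struct call_table table;

    int main(void)
    {
        long processed = 0;

        /* Register the address and size of the critical data so that it can be
         * saved with plain memory copies at checkpoint time. */
        critical(&table, sizeof(table));

        for (;;) {
            /* ... receive one client message and update 'table' ... */

            /* Save the registered critical data on the primary and backup
             * nodes every so often (the policy here is illustrative). */
            if (++processed % 100 == 0)
                checkpoint();
        }
    }

Because only the registered region is copied, the checkpoint cost grows with the amount of critical data rather than with the full process image.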
Libft provides functions (e.g., getsvrloc(), getsvrport(), ftconnect(), ftbind()) for clients to locate servers and reconnect to servers in a network environment. The exception handling, NVP and recovery block facilities are implemented using C macros and standard C library functions. These facilities can be used by any application without changing the underlying operating system or adding new C preprocessors.
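As a client-side illustration of the location functions named above, consider the sketch below. The function names come from the text, but the argument lists, return types and the service name are assumptions for illustration only.

    /* Hypothetical client sketch; the prototypes below are assumed and are
     * not libft's documented signatures. */
    extern int getsvrloc(const char *service, char *host, int hostlen);
    extern int getsvrport(const char *service);
    extern int ftconnect(const char *service, const char *host, int port);

    int connect_to_service(void)
    {
        char host[256];
        int  port;

        /* Ask where the server for this service is currently running; after a
         * failover this may name the backup node. */
        if (getsvrloc("routing", host, sizeof(host)) < 0)
            return -1;
        port = getsvrport("routing");

        /* Connect (or reconnect) to the server at its current location. */
        return ftconnect("routing", host, port);
    }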
Libft also provides ftread() and ftwrite() functions to automatically log messages. When the ftread() function is called by a process under normal conditions, the data will be read from a channel and automatically logged in a file. The logged data will then be duplicated and logged by the watchd daemon on a backup machine. The replication of logged data is necessary for a process to recover from a primary machine failure. When the ftread() function is called by a process which is recovering from a failure, the input data will be read from the logged file before any data can be read from a regular input channel. Similarly, the ftwrite() function logs output data before they are sent out. The output data is also duplicated and logged by the watchd daemon on a backup machine. The log files created by the ftread() and ftwrite() functions are truncated after a checkpoint() function is successfully executed. Using the functions checkpoint(), ftread() and ftwrite(), one can implement either a sender-based or a receiver-based logging and recovery scheme [14].
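To make the logging behavior concrete, the sketch below shows a server read-process-write loop whose I/O goes through the logging functions. The prototypes are assumptions modeled on read(2) and write(2), and the checkpoint interval is an arbitrary illustrative policy; the replay-on-recovery and truncate-on-checkpoint behavior is handled by the library, as described above.

    #include <stddef.h>
    #include <sys/types.h>

    /* Assumed, illustration-only prototypes modeled on read(2) and write(2). */
    extern ssize_t ftread(int fd, void *buf, size_t n);
    extern ssize_t ftwrite(int fd, const void *buf, size_t n);
    extern int checkpoint(void);

    #define CHECKPOINT_EVERY 100   /* hypothetical policy: every 100 messages */

    void serve(int client_fd)
    {
        char request[4096], reply[4096];
        ssize_t n;
        long processed = 0;

        /* During recovery, ftread() replays logged requests before reading new
         * ones from the channel; in normal operation it logs what it reads. */
        while ((n = ftread(client_fd, request, sizeof(request))) > 0) {
            size_t reply_len = 0;
            /* ... process the request, update critical data, build 'reply' ... */

            /* ftwrite() logs the reply before sending it out. */
            ftwrite(client_fd, reply, reply_len);

            /* A successful checkpoint lets the library truncate the logs. */
            if (++processed % CHECKPOINT_EVERY == 0)
                checkpoint();
        }
    }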
There is a slight possibility that some messages during the automatic restart procedure may get lost. If this is a concern to an application, an additional message synchronization mechanism can be built into the application to check and retransmit lost messages.
Speed and portability are primary concerns in implementing libft. The libft checkpoint mechanism is not fully transparent to programmers, as it is in the Condor system [16]. However, libft does not require a new language, a new preprocessor or complex declarations and computations to save data structures [9]. The sacrifice of transparency for speed has proven useful in persuading some projects to adopt libft. The installation of libft doesn't require any change to a UNIX-based operating system; it has been ported to several platforms.
Watchd and libft separate fault detection and volatile data recovery facilities from the application functions. They provide those facilities as reusable components which can be combined with any application to make it fault tolerant. Since the messages received at the server site are logged and only the server process is recovered in this scheme, the consistency problems that occur in recovering multiple processes [14] are not issues in this implementation.
3.3 nDFS
The multi-dimensional file system, nDFS [8], is based on 3DFS [15] and provides facilities for replication of critical data. It allows users to specify and replicate critical files on backup file systems in real time. The implementation of nDFS uses the dynamic shared library mechanism to intercept file system calls and propagate those calls to backup file systems. nDFS is built on top of UNIX file systems, and so its use requires no change in the underlying file system. Speed, robustness and replication transparency are the primary design goals of nDFS.
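The call-interception idea can be illustrated with a standard dynamic-linking technique. The sketch below is a generic example of interposing on write() from a preloaded shared library; it is not taken from nDFS itself, and the actual replication protocol is omitted.

    /* Minimal sketch of library interposition: a shared object that overrides
     * write(), then forwards to the real libc write() found via RTLD_NEXT.
     * A replicating file system could propagate the data to a backup file
     * system at the marked point. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    static ssize_t (*real_write)(int, const void *, size_t);

    ssize_t write(int fd, const void *buf, size_t count)
    {
        if (real_write == NULL)      /* look up the real write() once */
            real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");

        /* ... forward (fd, buf, count) to a backup file system here ... */

        return real_write(fd, buf, count);
    }

Such a library would be built as a shared object and loaded ahead of libc (for example via LD_PRELOAD or by linking the application against it), which is how calls can be intercepted without changing the kernel or the underlying file system.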
The implementation of nDFS uses watchd and libft for fault detection and fast recovery. A failure of the underlying replication mechanism (a software failure) or a crash of a backup file system is transparent to applications or users. A failed software component is detected and recovered immediately by watchd; a crashed backup file system, after it is repaired, can catch up with the primary file system without interrupting or slowing down applications using the primary file system.
4 Experience
Fault tolerance in some of the newer telecommunications network management products in AT&T has been enhanced using watchd, libft and nDFS. Experience with those products to date indicates that these technologies are indeed economical and effective means to increase the level of fault tolerance in application software. The performance overhead due to these components depends on the level of fault tolerance, the amount of critical volatile data being checkpointed, the frequency of checkpointing, and the amount of persistent data being replicated. It varies from 0.1% to 14%. We describe some of those products to illustrate the availability, flexibility and efficiency in providing software fault tolerance through these 3 components. To protect the proprietary information of those products, we use generic terms and titles in the descriptions.
Level 1: Failure detection and restart using watchd: Application C uses watchd to check the "liveness" of some service daemon processes in C at 10-second intervals. When any of those processes fails, i.e., crashes or hangs, watchd restarts that process at its initial state. It took 2 people 3 hours to embed and configure watchd for this level of fault tolerance in application C. One potential use of this kind of fault tolerance would be in general-purpose local area computing environments for stateless network services such as lpr, finger or inetd daemons. Providing higher levels of fault tolerance in those services would be unnecessary.
Level 2: Failure detection, checkpointing, restart and recovery using watchd and libft: Application N maintains a certain segment of the telephone call routing information on a Sun server; maintenance operators use workstations running N's client processes, which communicate with N's server process using sockets. The server process in N was crashing or hanging for unknown reasons. During such failures, the system administrators had to manually bring back the server process, but they could not do so immediately because of the UNIX delay in cleaning up the socket table. Moreover, the maintenance operators had to restart client interactions from an initial state. Replacing the server node with fault-tolerant hardware would have increased their capital and development costs by a factor of 4. Even then, not all of their problems would have been solved; for example, saving the client states of interactions. Using watchd and libft, system N is now able to tolerate such failures. Watchd also detects primary server failures and restarts the server on the backup node. Location transparency is obtained using getsvrloc() and getsvrport() calls in the client programs and ftbind() in the server program. Libft's checkpoint and recovery mechanisms are used to save and recover all critical data. Checkpointing and recovery overheads are below 2%. Installing and integrating the two components into the application took 2 people 3 days.
Level 3: Failure detection, checkpointing, replication, restart and recovery using watchd, libft and nDFS: Application D is a real-time telecommunication network element currently being developed. In addition to the previous requirements for fault tolerance, this product needed to get its persistent files on-line immediately after a failure recovery on a backup node. During normal operations on the primary server, nDFS replicates all the critical persistent files on a backup server with an expected overhead of less than 14%. When the primary server fails, watchd starts application D on the backup node and automatically connects it to the backup disk on which the persistent files were replicated.
4.1 Other Possible Uses
The three software components, watchd, libft and nDFS, can be used not only to increase the level of fault tolerance in an application, as described above, but also to aid in other operations unrelated to fault tolerance, as described below.
- On-line upgrading of software: One can install a new version of software for an application without interrupting the service provided by the older version. This can be done by first loading the new version on the backup node, simulating a fault on the primary and then letting watchd dynamically move the service location to the backup node. This method assumes that the two versions are compatible at the application-level client-server protocol.

- Overcoming persistent errors in software: Some errors in software are simply persistent, i.e., non-debuggable, due to the complexity and transient nature of the interactions and events in a distributed system [4]. Such errors sometimes do not reappear after the server process is restarted [9]; watchd, in those instances, can be used to bring the server process back up without clients noticing the failure and restart. After restart and restoration of the checkpointed state, message logs can be replayed in the order they originally arrived at the server or, if needed, in a different order [25]. Reordering the message logs sometimes eliminates transient errors due to "boundary" conditions.

- Using checkpoint states and message logs for debugging distributed applications: In libft, all the checkpointed states, i.e., the values in the critical data, and the message logs can optionally be saved in a journal file. This journal can be used to aid in analyzing failures in distributed applications.
5 Summary
We identified some of the dimensions of fault tolerance and defined a role, a taxonomy and tasks for software fault tolerance based on availability and data consistency requirements of an application. We then described three software components, watchd, libft, and nDFS, to perform these tasks. These three components are flexible, portable and reusable; they can be embedded in any UNIX-based application software to provide different levels of fault tolerance with minimal programming effort. Experience in using these three components in some telecommunication products has shown that these components indeed increase the level of fault tolerance with acceptable increases in performance overhead.
Acknowledgments
Many thanks to Lawrence Bernstein, who suggested defining levels for fault tolerance, provided leadership to transfer this technology rapidly and encouraged using these components in a wide range of AT&T products and services. The authors have benefited from discussions, contributions and comments from several colleagues, particularly Rao Arimilli, David Belanger, Marilyn Chiang, Glenn Fowler, Kent Fuchs, Pankaj Jalote, Robin Knight, David Korn, Herman Rao and Yi-Min Wang.
References
[1] P. A. Alsberg and J. D. Day, "A Principle for Resilient Sharing of Distributed Services," Proceedings of 2nd Intl. Conf. on Software Engineering, pp. 562-570, October 1976.

[2] A. Avizienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Trans. on Software Engineering, SE-11, No. 12, pp. 1491-1501, Dec. 1985.

[3] M. Baker and M. Sullivan, "The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment," Proceedings of Summer '92 USENIX, pp. 31-43, June 1992.

[4] L. Bernstein, "On Software Discipline and the War of 1812," ACM Software Engineering Notes, p. 18, Oct. 1992.

[5] R. Bianchini, Jr. and R. Buskens, "An Adaptive Distributed System-Level Diagnosis Algorithm and Its Implementation," Proceedings of 21st IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 222-229, July 1991.

[6] D. Bitton and J. Gray, "Disk Shadowing," Proceedings of 14th Conference on Very Large Data Bases, pp. 331-338, Sept. 1988.

[7] F. Cristian, "Understanding Fault-Tolerant Distributed Systems," Communications of the ACM, Vol. 34, No. 2, pp. 56-78, February 1991.

[8] G. S. Fowler, Y. Huang, D. G. Korn and H. Rao, "A User-Level Replicated File System," To be presented at the Summer 1993 USENIX Conference, June 1993.

[9] J. Gray and D. P. Siewiorek, "High-Availability Computer Systems," IEEE Computer, Vol. 24, No. 9, pp. 39-48, September 1991.

[10] J. Y. Halpern and Y. Moses, "Knowledge and Common Knowledge in a Distributed Environment," Journal of the ACM, Vol. 37, No. 3, pp. 549-587, July 1990.

[11] Y. Huang and P. Jalote, "Analytic Models for the Primary Site Approach to Fault-Tolerance," Acta Informatica, Vol. 26, pp. 543-557, 1989.

[12] Y. Huang and P. Jalote, "Effect of Fault Tolerance on Response Time - Analysis of the Primary Site Approach," IEEE Transactions on Computers, Vol. 41, No. 4, pp. 420-428, April 1992.

[13] M. Iwama, D. Stan and G. Zimmerman, Personal correspondence, July 1992.

[14] P. Jalote, "Fault Tolerant Processes," Distributed Computing, Vol. 3, pp. 187-195, 1989.

[15] D. G. Korn and E. Krell, "A New Dimension for the Unix File System," Software Practice and Experience, Vol. 20, Supplement 1, pp. S1/19-S1/34, June 1990.

[16] M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations," Proc. of the 8th Intl. Conf. on Distributed Computing Systems, IEEE Computer Society Press, June 1988.

[17] J. Long, W. K. Fuchs and J. A. Abraham, "Compiler-Assisted Static Checkpoint Insertion," Proceedings of 22nd IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 58-65, July 1992.

[18] A. Nangia and D. Finker, "Transaction-based Fault-Tolerant Computing in Distributed Systems," Proceedings of 1992 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, pp. 92-97, July 1992.

[19] D. K. Pradhan (ed.), Fault-Tolerant Computing: Theory and Techniques, Vols. 1 and 2, Prentice-Hall, 1986.

[20] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. on Software Engineering, SE-1, No. 2, pp. 220-232, June 1975.

[21] J. H. Saltzer, D. P. Reed and D. D. Clark, "End-To-End Arguments in System Design," ACM Transactions on Computer Systems, Vol. 2, No. 4, pp. 277-288, November 1984.

[22] M. Satyanarayanan, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Trans. on Computers, Vol. C-39, pp. 447-459, April 1990.

[23] S. K. Shrivastava (ed.), Reliable Computer Systems, Chapter 3, Springer-Verlag, 1985.

[24] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems Design and Implementation, Chapter 7, Digital Press, 1992.

[25] Y. M. Wang, Y. Huang and W. K. Fuchs, "Progressive Retry for Software Error Recovery in Distributed Systems," To be presented at this FTCS-23 Conference, June 1993.