Software Implemented Fault Tolerance: Technologies and Experience

Yennun Huang and Chandra Kintala
AT&T Bell Laboratories
Murray Hill, NJ 07974
Abstract
By software implemented fault tolerance, we mean a set of software facilities to detect and recover from faults that are not handled by the underlying hardware or operating system. We consider those faults that cause an application process to crash or hang; they include software faults as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. We define 4 levels of software fault tolerance based on availability and data consistency of an application in the presence of such faults. Watchd, libft and nDFS are reusable components that provide up to the 3rd level of software fault tolerance. They perform, respectively, automatic detection and restart of failed processes, periodic checkpointing and recovery of critical volatile data, and replication and synchronization of persistent data in an application software system. These modules have been ported to a number of UNIX* platforms and can be used by any application with minimal programming effort.

Some newer telecommunications products in AT&T have already enhanced their fault-tolerance capability using these three components. Experience with those products to date indicates that these modules provide efficient and economical means to increase the level of fault tolerance in a software product. The performance overhead due to these components depends on the level and varies from 0.1% to 14% based on the amount of critical data being checkpointed and replicated.

* UNIX is a registered trademark of UNIX System Laboratories, Inc.
1 Introduction
There are increasing demands to make the software in the applications we build today more tolerant to faults. From users' point of view, fault tolerance has two dimensions: availability and data consistency of the application. For example, users of telephone switching systems demand continuous availability, whereas bank teller machine customers demand the highest degree of data consistency [13]. Most other applications have lower degrees of requirements for fault tolerance in both dimensions, but the trend is to increase those degrees as the costs, performance, technologies and other engineering considerations permit; see Figure 1 below. In this paper, we discuss three cost-effective software technologies to raise the degree of fault tolerance in both dimensions of the application.

[Figure 1: Dimensions of Fault Tolerance -- availability and data consistency, with examples ranging from bank teller machines (high data consistency) to telephone switching systems (high availability).]
Following Cristian [7], we consider software applications that provide a "service" to clients. The applications in turn use the services provided by the underlying operating or database systems, which in turn use the computing and network communication services provided by the underlying hardware; see Figure 2.

[Figure 2: Layers of Fault Tolerance -- e.g., FT-DBMS at the database layer; duplex, TMR, etc. at the hardware layer.]

Tolerating faults in such applications involves detecting a failure, gathering knowledge about the failure and recovering from that failure. Traditionally, these fault tolerance actions are performed in the hardware, operating or database systems used in the underlying layers of the application software. Hardware fault tolerance is provided using Duplex, Triple-Module-Redundancy or other techniques [19]. Fault tolerance in the operating and database layers is often provided using replicated file systems [22], exception handling [23], disk shadowing [6], transaction-based checkpointing and recovery [18], and other system routines. These methods and technologies handle faults occurring in the underlying hardware, operating and database system layers only.
An increasing number of faults are, however, occurring in the application software layer, causing the application processes to crash or hang. A process is said to be crashed if the working process image is no longer present in the system. A process is said to be hung if the process image is alive and its entry is still present in the process table, but the process is not making any progress from a user's point of view. Such software failures arise from incorrect program designs or coding errors but, more often than not, they arise from transient and nondeterministic errors [9]; for example, unsatisfactory boundary value conditions, timing and race conditions in the underlying computing and communication services, performance failures, etc. [7]. Due to the complex and temporal nature of the interleaving of messages and computations in a distributed system, no amount of verification, validation and testing will eliminate all those software faults in an application and give complete confidence in the availability and data consistency of that application. So, those faults will occasionally manifest themselves and cause the application process to crash or hang.
It is possible to detect a crash and restart an application at a checkpointed state through operating system facilities, as in IBM's MVS [24]. In their paper on End-to-End Arguments [21], Saltzer et al. claim that such hardware and operating system based methods to detect and recover from software failures are necessarily incomplete. They show that fault tolerance cannot be complete without the knowledge and help from the endpoints of an application, i.e., the application software. We claim that such methods, i.e., services at a lower layer detecting and recovering from failures at a higher layer, are also inefficient. For example, file replication on a mirrored disk through a facility in the operating system will be less efficient than replicating only the "critical" files of the application in the application layer, since the operating system has no internal knowledge of that application. Similarly, generalized checkpointing schemes in an operating system checkpoint the entire in-memory data of an application, whereas application-assisted methods will checkpoint only the critical data [17, 3].
A common but misleading argument against embedding checkpointing, recovery and other fault tolerance schemes inside an application is that such schemes are not transparent, efficient or reliable because they are coded by application programmers; we claim that well-tested and efficient fault tolerance methods can be built as libraries or reusable software components executing in the application layer and that they provide as much transparency as the other methods do. All three components discussed in this paper have those properties, i.e., they are efficient, reliable and transparent.
The above observations and the End-to-End argument [21] lead to our notion of software fault tolerance, as defined below:

    Software fault tolerance is a set of software facilities to detect and recover from faults that cause an application process to crash or hang and that are not handled by the underlying hardware or operating system.

Observe that this includes software faults as described earlier as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. Thus, if the underlying hardware and operating system are not fault-tolerant in an application system due to performance/cost trade-offs or other engineering considerations, then that system can increase its availability more cost-effectively through software fault tolerance as described in this paper. It is also an easier migration path for making existing applications more fault-tolerant.
2 Model
For simplicity in the following discussions, we consider only client-server based applications running in a local or wide-area network of computers in a distributed system; the discussion also applies to other kinds of applications. Each application has a server process executing at the user level on top of a vendor-supplied hardware and operating system. To get services, clients send messages to the server; the server process acts on those messages one by one and, in each of those message-processing steps, updates its data. We sometimes call the server process the application. For fault tolerance purposes, the nodes in the distributed system will be viewed as being in a circular configuration such that each node will be a backup node for its left neighbor in that circular list. As shown in Figure 3, each application will be executing primarily on one of the nodes in the network, called the primary node for that application. Each executing application has process text (the compiled code), volatile data (variables, structures, pointers and all the bytes in the static and dynamic memory segments of the process image) and persistent data (the application files being referred to and updated by the executing process).

[Figure 3: Modified Primary Site Approach -- a primary node and a backup node, with application volatile data (checkpointed via libft) and persistent data replicated to the backup.]
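The circular backup arrangement described above can be made concrete with a few lines of code. The sketch below is illustrative only: the node list, its ordering and the helper name are our own assumptions, not part of watchd's or libft's interface.

    #include <stdio.h>

    /* Illustrative sketch: nodes are held in a fixed circular list, and each
     * node acts as the backup for its left neighbor, i.e., the backup of the
     * node at position i is the next node in the circle.  The host names and
     * the helper function are hypothetical. */
    static const char *nodes[] = { "hostA", "hostB", "hostC", "hostD" };
    #define NNODES (sizeof(nodes) / sizeof(nodes[0]))

    static const char *backup_of(size_t i)
    {
        return nodes[(i + 1) % NNODES];
    }

    int main(void)
    {
        for (size_t i = 0; i < NNODES; i++)
            printf("primary %s -> backup %s\n", nodes[i], backup_of(i));
        return 0;
    }

The same circular arrangement lets each node's watchd monitor just one neighboring watchd, as described in Section 3.1.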
We use a modified primary-site approach to software fault tolerance [1]. In the primary site approach, the service to be made fault tolerant is replicated at many nodes, one of which is designated as primary and the others as backups. All the requests for the service are sent to the primary site. The primary site periodically checkpoints its state on the backups. If the primary fails, one of the backups takes over as primary. This model for fault tolerance has been analyzed for frequency of checkpointing, degree of service replication and the effect on response time by Huang and Jalote [11, 12]. This model was slightly modified, as described below, to build the three technologies described in this paper. The tasks in our modification of the primary site approach are:
- a watchdog process running on the primary node watching for application crashes or hangs,
- a watchdog process running on the backup node watching for primary node crashes,
- periodically checkpointing the critical volatile data in the application,
- logging of client messages to the application,
- replicating the application's persistent data and making them available on the backup node,
- when the application on the primary node crashes or hangs, restarting the application, if possible, on the primary node, otherwise on the backup node,
- recovering the application to the last checkpointed state, reexecuting the message log and connecting the replicated files to the backup node if the application restarts on the backup.
Observe that these software fault tolerance tasks can be used in addition to other methods such as N-version programming [2] or recovery blocks [20] inside an application program. Observe also that the application process on the backup node will not be running until it is started by the watchd process; this is unlike the process-pair model [9], where the backup process will be passively running even during normal operations.

The degree to which the above software fault tolerance tasks are used in an application determines the availability and data consistency of that application. It is, therefore, useful to establish a classification of the different levels of software fault tolerance. We define the following 4 levels based on our experience in AT&T. Applications illustrating these levels are described in Section 4.
Level 0: No tolerance to faults in the application software:
In this level, when the executing application process dies or hangs, it has to be manually restarted from an initial internal state, i.e., the initial values of the volatile data. The application may leave its persistent data in an incorrect or inconsistent state due to the timing of the crash and may take a long time to restart due to elaborate initialization procedures.
Level 1: Automatic detection and restart:
When the application dies or hangs, the fault will be detected and the application will be restarted from an initial internal state on the same processor, if possible, or on a backup processor if available. In this level, the internal state of the application is not saved and, hence, the process will restart at the initial internal state. As stated above, restart along with reinitialization will be slow. The restarted internal state may not reflect all the messages that have been processed in the previous execution, and therefore may not be consistent with the persistent data. The difference between Levels 0 and 1 is that the detection and restart are automatic in Level 1, and therefore the application availability is higher in Level 1 than in Level 0.
Level 2: Level 1 plus periodic checkpointing, logging and recovery of internal state:
In addition to what is available in Level 1, the internal state of the application process is periodically checkpointed, i.e., the critical volatile data is saved, and the messages to the server are logged. After a failure is detected, the application is restarted at the most recent checkpointed internal state and the logged messages are reprocessed to bring the application to the state at which it crashed. The application availability and volatile data consistency are higher in Level 2 than those in Level 1.
Level 3: Level 2 plus persistent data recovery:
In addition to what is available in Level 2, the persistent data of the application is replicated on a backup disk connected to a backup node, and is kept consistent with the primary server throughout the normal operation of the application. In case of a fault and resulting recovery of the application on the backup node, the backup disk brings the application's persistent data as close as possible to the state at which the application crashed. The data consistency of the application in Level 3 is higher than that in Level 2.
Level 4: Continuous operation without any interruption:
This level of fault tolerance in software guarantees the highest degree of availability and data consistency. Often, this is provided, such as in switching systems, using replicated processing of the application on "hot" spare hardware. The state of a process need not be saved, but multicast messaging, voting and other mechanisms must be used to maintain consistency and concurrency control. Availability of the system during planned interruptions, such as those during upgrades, is made possible using dynamic loading or other operating system facilities. The technologies we describe in this paper do not provide this level of fault tolerance.
3 Technologies
Many applications perform some of these software fault tolerance features by coding them directly in their programs. We developed three reusable components - watchd, libft and nDFS - to embed those features in any application with minimal programming effort.
3.1 Watchd
Watchd is a watchdog daemon process that runs on a single machine or on a network of machines. It continually watches the life of a local application process by periodically sending a null signal to the process and checking the return value to detect whether that process is alive or dead. It detects whether that process is hung or not by using one of the following two methods. The first method sends a null message to the local application process using IPC (Inter-Process Communication) facilities on the local node and checks the return value. If it cannot make the connection, it waits for some time (specified by the application) and tries again. If it fails after the second attempt, it interprets this to mean that the process is hung. The second method asks the application process to send a heartbeat message to watchd periodically, and watchd periodically checks the heartbeat. If the heartbeat message from the application is not received by a specified time, watchd will assume that the application is hung. This implies that watchd cannot differentiate between hung processes and very slow processes.
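As a concrete illustration of the detection idea (not watchd's actual code), the sketch below uses the standard POSIX convention that kill(pid, 0) sends the null signal: no signal is delivered, but the return value and errno report whether the process still exists. The heartbeat check is a simplified stand-in using an assumed timestamp variable.

    #include <errno.h>
    #include <signal.h>
    #include <time.h>
    #include <sys/types.h>

    /* Crash detection: send the null signal (signal number 0).  Nothing is
     * actually delivered; the call only checks whether the process exists. */
    static int process_alive(pid_t pid)
    {
        if (kill(pid, 0) == 0)
            return 1;              /* process exists */
        return errno != ESRCH;     /* ESRCH: no such process */
    }

    /* Hang detection via heartbeats: last_heartbeat would be updated whenever
     * a heartbeat message arrives from the application (assumed, simplified). */
    static time_t last_heartbeat;

    static int looks_hung(time_t timeout_seconds)
    {
        return (time(NULL) - last_heartbeat) > timeout_seconds;
    }

A process that is merely slow would pass the first check but miss its heartbeat deadline, which is exactly the ambiguity noted above.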
When it detects that the application process has crashed or hung, watchd recovers that application at an initial internal state or at the last checkpointed state. The application is recovered on the primary node if that node has not crashed, otherwise on the backup node for the primary as specified in a configuration file. If libft is also used, watchd sets the restarted application to process all the logged messages from the log file generated by libft. Watchd also watches one neighboring watchd (left or right) in a circular fashion to detect node failures; this circular arrangement is similar to the adaptive distributed diagnosis algorithm [5]. When a node failure is detected, watchd can execute user-defined recovery commands and reconfigure the network. Observe that neighboring watchds cannot differentiate between node failures and link failures. In general, this is the problem of attaining common knowledge in the presence of communication failures, which is provably unsolvable [10].
Watchd also watches itself. A self-recovery mechanism is built into watchd in such a way that it can recover itself from an unexpected software failure. Watchd also facilitates restarting a failed process, restoring the saved values and reexecuting the logged events, and provides facilities for remote execution, remote copy, distributed election, and status report production.
3.2 Libft
Libft is a user-level library of C functions that can be used in application programs to specify and checkpoint critical data, recover the checkpointed data, log events, locate and reconnect a server, do exception handling, do N-version programming (NVP), and use recovery block techniques.
Libft provides a set of functions (e.g., critical()) to specify critical volatile data in an application. Those critical data items are allocated in a reserved region of the virtual memory and are periodically checkpointed. Values in critical data structures are saved using memory copy functions, thus avoiding traversal of application-dependent data structures. When an application does a checkpoint, its critical data will be saved on the primary and backup nodes. Unlike other checkpointing methods [17], the overhead in our checkpointing mechanism is minimized by saving only critical data and avoiding data-structure traversals. This idea of saving only critical data in an application is analogous to the Recovery Box concept in Sprite [3].
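The fragment below sketches how an application might register and checkpoint critical data. The function names critical() and checkpoint() come from the text, but the prototypes, the data structure and the checkpointing policy shown here are assumptions made purely for illustration.

    #include <stddef.h>

    /* Assumed prototypes, for illustration only; not libft's documented API. */
    extern void critical(void *addr, size_t size);
    extern int  checkpoint(void);

    /* Hypothetical critical volatile data: a table the server updates per message. */
    struct call_table {
        int  n_entries;
        long routes[1024];
    };
    static struct call_table table;

    int main(void)
    {
        long processed = 0;

        /* Register the address and size of the critical data so that it can be
         * saved with plain memory copies at checkpoint time. */
        critical(&table, sizeof(table));

        for (;;) {
            /* ... receive one client message and update 'table' ... */

            /* Save the registered critical data on the primary and backup
             * nodes every so often (the policy here is illustrative). */
            if (++processed % 100 == 0)
                checkpoint();
        }
    }

Because only the registered region is copied, the checkpoint cost grows with the amount of critical data rather than with the full process image.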
Libft provides functions (e.g., getsvrloc(), getsvrport(), ftconnect(), ftbind()) for clients to locate servers and reconnect to servers in a network environment. The exception handling, NVP and recovery block facilities are implemented using C macros and standard C library functions. These facilities can be used by any application without changing the underlying operating system or adding new C preprocessors.
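As a client-side illustration of the location functions named above, consider the sketch below. The function names come from the text, but the argument lists, return types and the service name are assumptions for illustration only.

    /* Hypothetical client sketch; the prototypes below are assumed and are
     * not libft's documented signatures. */
    extern int getsvrloc(const char *service, char *host, int hostlen);
    extern int getsvrport(const char *service);
    extern int ftconnect(const char *service, const char *host, int port);

    int connect_to_service(void)
    {
        char host[256];
        int  port;

        /* Ask where the server for this service is currently running; after a
         * failover this may name the backup node. */
        if (getsvrloc("routing", host, sizeof(host)) < 0)
            return -1;
        port = getsvrport("routing");

        /* Connect (or reconnect) to the server at its current location. */
        return ftconnect("routing", host, port);
    }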
Libft also provides ftread() and ftwrite() functions to automatically log messages. When the ftread() function is called by a process under normal conditions, the data will be read from a channel and automatically logged in a file. The logged data will then be duplicated and logged by the watchd daemon on a backup machine. The replication of logged data is necessary for a process to recover from a primary machine failure. When the ftread() function is called by a process which is recovering from a failure, the input data will be read from the logged file before any data can be read from a regular input channel. Similarly, the ftwrite() function logs output data before they are sent out. The output data is also duplicated and logged by the watchd daemon on a backup machine. The log files created by the ftread() and ftwrite() functions are truncated after a checkpoint() function is successfully executed. Using the functions checkpoint(), ftread() and ftwrite(), one can implement either a sender-based or a receiver-based logging and recovery scheme [14].
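To make the logging behavior concrete, the sketch below shows a server read-process-write loop whose I/O goes through the logging functions. The prototypes are assumptions modeled on read(2) and write(2), and the checkpoint interval is an arbitrary illustrative policy; the replay-on-recovery and truncate-on-checkpoint behavior is handled by the library, as described above.

    #include <stddef.h>
    #include <sys/types.h>

    /* Assumed, illustration-only prototypes modeled on read(2) and write(2). */
    extern ssize_t ftread(int fd, void *buf, size_t n);
    extern ssize_t ftwrite(int fd, const void *buf, size_t n);
    extern int checkpoint(void);

    #define CHECKPOINT_EVERY 100   /* hypothetical policy: every 100 messages */

    void serve(int client_fd)
    {
        char request[4096], reply[4096];
        ssize_t n;
        long processed = 0;

        /* During recovery, ftread() replays logged requests before reading new
         * ones from the channel; in normal operation it logs what it reads. */
        while ((n = ftread(client_fd, request, sizeof(request))) > 0) {
            size_t reply_len = 0;
            /* ... process the request, update critical data, build 'reply' ... */

            /* ftwrite() logs the reply before sending it out. */
            ftwrite(client_fd, reply, reply_len);

            /* A successful checkpoint lets the library truncate the logs. */
            if (++processed % CHECKPOINT_EVERY == 0)
                checkpoint();
        }
    }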
There is a slight possibility that some messages during the automatic restart procedure may get lost. If this is a concern to an application, an additional message synchronization mechanism can be built into the application to check and retransmit lost messages.
Speed and portability are primary concerns in implementing libft. The libft checkpoint mechanism is not fully transparent to programmers, as it is in the Condor system [16]. However, libft does not require a new language, a new preprocessor or complex declarations and computations to save data structures [9]. The sacrifice of transparency for speed has proven useful in persuading some projects to adopt libft. The installation of libft doesn't require any change to a UNIX-based operating system; it has been ported to several platforms.
Watchd and libft separate fault detection and volatile data recovery facilities from the application functions. They provide those facilities as reusable components which can be combined with any application to make it fault tolerant. Since the messages received at the server site are logged and only the server process is recovered in this scheme, the consistency problems that occur in recovering multiple processes [14] are not issues in this implementation.
3.3 nDFS
The multi-dimensional file system, nDFS [8], is based on 3DFS [15] and provides facilities for replication of critical data. It allows users to specify and replicate critical files on backup file systems in real time. The implementation of nDFS uses the dynamic shared library mechanism to intercept file system calls and propagate those calls to backup file systems. nDFS is built on top of UNIX file systems, and so its use requires no change in the underlying file system. Speed, robustness and replication transparency are the primary design goals of nDFS.
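The call-interception idea can be illustrated with a standard dynamic-linking technique. The sketch below is a generic example of interposing on write() from a preloaded shared library; it is not taken from nDFS itself, and the actual replication protocol is omitted.

    /* Minimal sketch of library interposition: a shared object that overrides
     * write(), then forwards to the real libc write() found via RTLD_NEXT.
     * A replicating file system could propagate the data to a backup file
     * system at the marked point. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    static ssize_t (*real_write)(int, const void *, size_t);

    ssize_t write(int fd, const void *buf, size_t count)
    {
        if (real_write == NULL)      /* look up the real write() once */
            real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");

        /* ... forward (fd, buf, count) to a backup file system here ... */

        return real_write(fd, buf, count);
    }

Such a library would be built as a shared object and loaded ahead of libc (for example via LD_PRELOAD or by linking the application against it), which is how calls can be intercepted without changing the kernel or the underlying file system.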
The implementation of nDFS uses watchd and libft for fault detection and fast recovery. A failure of the underlying replication mechanism (a software failure) or a crash of a backup file system is transparent to applications or users. A failed software component is detected and recovered immediately by watchd; a crashed backup file system, after it is repaired, can catch up with the primary file system without interrupting or slowing down applications using the primary file system.
4 Experience
Fault tolerance in some of the newer telecommunications network management products in AT&T has been enhanced using watchd, libft and nDFS. Experience with those products to date indicates that these technologies are indeed economical and effective means to increase the level of fault tolerance in application software. The performance overhead due to these components depends on the level of fault tolerance, the amount of critical volatile data being checkpointed, the frequency of checkpointing, and the amount of persistent data being replicated. It varies from 0.1% to 14%. We describe some of those products to illustrate the availability, flexibility and efficiency in providing software fault tolerance through these 3 components. To protect the proprietary information of those products, we use generic terms and titles in the descriptions.
Level 1: Failure detection and restart using watchd: Application C uses watchd to check the "liveness" of some service daemon processes in C at 10-second intervals. When any of those processes fails, i.e., crashes or hangs, watchd restarts that process at its initial state. It took 2 people 3 hours to embed and configure watchd for this level of fault tolerance in application C. One potential use of this kind of fault tolerance would be in general-purpose local area computing environments for stateless network services such as lpr, finger or inetd daemons. Providing higher levels of fault tolerance in those services would be unnecessary.
Level 2: Failure detection, checkpointing, restart and recovery using watchd and libft: Application N maintains a certain segment of the telephone call routing information on a Sun server; maintenance operators use workstations running N's client processes, which communicate with N's server process using sockets. The server process in N was crashing or hanging for unknown reasons. During such failures, the system administrators had to manually bring back the server process, but they could not do so immediately because of the UNIX delay in cleaning up the socket table. Moreover, the maintenance operators had to restart client interactions from an initial state. Replacing the server node with fault-tolerant hardware would have increased their capital and development costs by a factor of 4. Even then, not all of their problems would have been solved; for example, saving the client states of interactions. Using watchd and libft, system N is now able to tolerate such failures. Watchd also detects primary server failures and restarts the server on the backup node. Location transparency is obtained using getsvrloc() and getsvrport() calls in the client programs and ftbind() in the server program. Libft's checkpoint and recovery mechanisms are used to save and recover all critical data. Checkpointing and recovery overheads are below 2%. Installing and integrating the two components into the application took 2 people 3 days.
Level 3: Failure detection, checkpointing, replication, restart and recovery using watchd, libft and nDFS: Application D is a real-time telecommunication network element currently being developed. In addition to the previous requirements for fault tolerance, this product needed to get its persistent files on-line immediately after a failure recovery on a backup node. During normal operations on the primary server, nDFS replicates all the critical persistent files on a backup server with an expected overhead of less than 14%. When the primary server fails, watchd starts application D on the backup node and automatically connects it to the backup disk on which the persistent files were replicated.
4.1 Other Possible Uses
The three software components, watchd, libft and nDFS, can be used not only to increase the level of fault tolerance in an application, as described above, but also to aid in other operations unrelated to fault tolerance, as described below.
- On-line upgrading of software: One can install a new version of software for an application without interrupting the service provided by the older version. This can be done by first loading the new version on the backup node, simulating a fault on the primary and then letting watchd dynamically move the service location to the backup node. This method assumes that the two versions are compatible at the application-level client-server protocol.

- Overcoming persistent errors in software: Some errors in software are simply persistent, i.e., non-debuggable, due to the complexity and transient nature of the interactions and events in a distributed system [4]. Such errors sometimes do not reappear after the server process is restarted [9]; watchd, in those instances, can be used to bring the server process back up without clients noticing the failure and restart. After restart and restoration of the checkpointed state, message logs can be replayed in the order they originally arrived at the server or, if needed, in a different order [25]. Reordering the message logs sometimes eliminates transient errors due to "boundary" conditions.

- Using checkpoint states and message logs for debugging distributed applications: In libft, all the checkpointed states, i.e., the values in the critical data, and the message logs can optionally be saved in a journal file. This journal can be used to aid in analyzing failures in distributed applications.
5 Summary
We identified some of the dimensions of fault tolerance and defined a role, a taxonomy and tasks for software fault tolerance based on availability and data consistency requirements of an application. We then described three software components, watchd, libft, and nDFS, to perform these tasks. These three components are flexible, portable and reusable; they can be embedded in any UNIX-based application software to provide different levels of fault tolerance with minimal programming effort. Experience in using these three components in some telecommunication products has shown that these components indeed increase the level of fault tolerance with acceptable increases in performance overhead.
Acknowledgments
Many thanks to Lawrence Bernstein, who suggested defining levels for fault tolerance, provided leadership to transfer this technology rapidly and encouraged using these components in a wide range of AT&T products and services. The authors have benefited from discussions, contributions and comments from several colleagues, particularly Rao Arimilli, David Belanger, Marilyn Chiang, Glenn Fowler, Kent Fuchs, Pankaj Jalote, Robin Knight, David Korn, Herman Rao and Yi-Min Wang.
References
[1] P. A. Alsberg and J. D. Day, "A Principle for Resilient Sharing of Distributed Services," Proceedings of 2nd Intl. Conf. on Software Engineering, pp. 562-570, October 1976.

[2] A. Avizienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Trans. on Software Engineering, SE-11, No. 12, pp. 1491-1501, Dec. 1985.

[3] M. Baker and M. Sullivan, "The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment," Proceedings of Summer '92 USENIX, pp. 31-43, June 1992.

[4] L. Bernstein, "On Software Discipline and the War of 1812," ACM Software Engineering Notes, p. 18, Oct. 1992.

[5] R. Bianchini, Jr. and R. Buskens, "An Adaptive Distributed System-Level Diagnosis Algorithm and Its Implementation," Proceedings of 21st IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 222-229, July 1991.

[6] D. Bitton and J. Gray, "Disk Shadowing," Proceedings of 14th Conference on Very Large Data Bases, pp. 331-338, Sept. 1988.

[7] F. Cristian, "Understanding Fault-Tolerant Distributed Systems," Communications of the ACM, Vol. 34, No. 2, pp. 56-78, February 1991.

[8] G. S. Fowler, Y. Huang, D. G. Korn and H. Rao, "A User-Level Replicated File System," To be presented at the Summer 1993 USENIX Conference, June 1993.

[9] J. Gray and D. P. Siewiorek, "High-Availability Computer Systems," IEEE Computer, Vol. 24, No. 9, pp. 39-48, September 1991.

[10] J. Y. Halpern and Y. Moses, "Knowledge and Common Knowledge in a Distributed Environment," Journal of the ACM, Vol. 37, No. 3, pp. 549-587, July 1990.

[11] Y. Huang and P. Jalote, "Analytic Models for the Primary Site Approach to Fault-Tolerance," Acta Informatica, Vol. 26, pp. 543-557, 1989.

[12] Y. Huang and P. Jalote, "Effect of Fault Tolerance on Response Time - Analysis of the Primary Site Approach," IEEE Transactions on Computers, Vol. 41, No. 4, pp. 420-428, April 1992.

[13] M. Iwama, D. Stan and G. Zimmerman, Personal correspondence, July 1992.

[14] P. Jalote, "Fault Tolerant Processes," Distributed Computing, Vol. 3, pp. 187-195, 1989.

[15] D. G. Korn and E. Krell, "A New Dimension for the Unix File System," Software Practice and Experience, Vol. 20, Supplement 1, pp. S1/19-S1/34, June 1990.

[16] M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations," Proc. of the 8th Intl. Conf. on Distributed Computing Systems, IEEE Computer Society Press, June 1988.

[17] J. Long, W. K. Fuchs and J. A. Abraham, "Compiler-Assisted Static Checkpoint Insertion," Proceedings of 22nd IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 58-65, July 1992.

[18] A. Nangia and D. Finker, "Transaction-based Fault-Tolerant Computing in Distributed Systems," Proceedings of 1992 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, pp. 92-97, July 1992.

[19] D. K. Pradhan (ed.), Fault-Tolerant Computing: Theory and Techniques, Vols. 1 and 2, Prentice-Hall, 1986.

[20] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. on Software Engineering, SE-1, No. 2, pp. 220-232, June 1975.

[21] J. H. Saltzer, D. P. Reed and D. D. Clark, "End-To-End Arguments in System Design," ACM Transactions on Computer Systems, Vol. 2, No. 4, pp. 277-288, November 1984.

[22] M. Satyanarayanan, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Trans. on Computers, Vol. C-39, pp. 447-459, April 1990.

[23] S. K. Shrivastava (ed.), Reliable Computer Systems, Chapter 3, Springer-Verlag, 1985.

[24] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems Design and Implementation, Chapter 7, Digital Press, 1992.

[25] Y. M. Wang, Y. Huang and W. K. Fuchs, "Progressive Retry for Software Error Recovery in Distributed Systems," To be presented at this FTCS-23 Conference, June 1993.