Software Based Fault Tolerant Computing Using Redundancy
Goutam Kumar Saha
CA-2/4B, CPM Party Office Road, Baguiati,
Deshbandhu Nagar, Kolkata 700059, WB, India.
E-mail: gksaha@rediffmail.com
Abstract
This paper examines a software-based fault-tolerant computing approach that uses triplicate redundancy and recovery. The approach is not intended to tolerate software design bugs; it is intended to tolerate various environmental faults that occur during the execution of a computer-controlled system. Application data corruption caused by electrical transients is detected and recovered immediately, so that control of the system is maintained. The proposed approach is a low-cost tool for designing a robust industrial application system that can tolerate errors due to electrical surges and transients, and it does not rely on design diversification.
Keywords: Fault-tolerant robust computing, recovery, application system.
1. Introduction
Electrical noise, electrical transients (ETs), electrostatic discharge (ESD), and electromagnetic pulses (EMP) are examples of short-duration noise. Such short-duration noise often causes random data corruption in the primary memory of a computing machine. Many scientific applications that depend on reference information tables are thus forced to miss their goals because they operate on corrupted data tables and program code. While designing software for an on-line application, we often take it for granted that our program code and data banks are absolutely safe and correct. This is not always true, because high-speed processing units are often affected by short-duration noise, as discussed in (Anderson and Lee 1981), (Dimitri 1989), (Wicker 1995). Electromagnetic interference (EMI) is an unplanned, extraneous electrical signal that affects the performance of a computer system. It can cause memory errors and data file destruction during the run time of an application. Externally produced EMI or noise enters the computer through cabling or openings in the case; sometimes it enters by static discharge through the case of the disk drive. Thus, while designing software, the effects of noise should not be overlooked in a scientific application that uses look-up data during its execution.
2. The Software Application
The look-up table contains the angular field distribution for different regions of an aircraft. Algorithm 1 shows the basic logic for determining whether the phase and amplitude of the received signal in a particular direction match a record in a predefined user look-up table (LT) for the angular distribution concerned. Depending on the result of this match, some action or function (say, track) is performed.
Before initiating an action, therefore, the matching logic and its processing must be highly reliable and accurate. If a transient corrupts data or program code, the whole system can lock up, leading to a complete mission failure.
Algorithm 1.
/* A predefined user data look-up table [LT] contains records of information such as PHI, PHASE, and AMPLITUDE. This algorithm shows the basic processing logic for finding a true match between the PHI, PHASE, and AMPLITUDE of a received signal and a predefined record in LT. The variables FI, FASE, and AMPL denote PHI, PHASE, and AMPLITUDE respectively. */
Step 1. Read: received signal FI, FASE, AMPL
Step 2. If FI .EQ. FI_LT, then:
    /* Compare the input parameters with the stored parameters in the look-up table */
    If FASE .EQ. FASE_LT .AND. AMPL .EQ. AMPL_LT, then:
        TRACK  /* if matching, initiate tracking */
    Else:
        GOTO Step 1.  /* if not matching, read the inputs again */
    [End of If structure]
Else:
    GOTO Step 1.
[End of If structure]
{End of Algorithm 1, showing the basic processing logic}
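To make the control flow concrete, the following is a minimal C sketch of Algorithm 1. The type and helper names (lt_entry, LT, LT_SIZE, read_signal, track) are illustrative assumptions, not part of the original paper; field values are treated as quantized integers so that exact .EQ.-style comparison is well defined.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t phi;        /* PHI (FI), in quantized angular units */
    uint16_t phase;      /* PHASE (FASE) */
    uint16_t amplitude;  /* AMPLITUDE (AMPL) */
} lt_entry;

extern const lt_entry LT[];          /* predefined look-up table */
extern const size_t LT_SIZE;         /* number of records in LT */
extern lt_entry read_signal(void);   /* reads FI, FASE, AMPL */
extern void track(void);             /* the TRACK action */

void basic_logic(void)
{
    for (;;) {
        lt_entry s = read_signal();           /* Step 1 */
        for (size_t i = 0; i < LT_SIZE; i++)  /* Step 2 */
            if (s.phi == LT[i].phi &&
                s.phase == LT[i].phase &&
                s.amplitude == LT[i].amplitude) {
                track();                      /* full match: initiate tracking */
                break;
            }
        /* no match: loop back and read the next signal */
    }
}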
The basic processing logic above is made robust with the enhanced processing logic shown in Algorithm 2.
Algorithm 2.
/* This shows the enhanced processing logic for transient fault tolerance. The variables PREVFI, PREVFASE, and PREVAMPL store the earlier incoming signal's PHI, PHASE, and AMPLITUDE. The routine "VBYTFLT" is called from this enhanced logic to detect transient errors in the program and data code and to correct them, periodically, say after every NRUN executions of the application system. */
Step 1. Set NRUN = 0  /* counts the number of application runs or iterations */
Step 2. Read: received signal FI, FASE, AMPL
Step 3. Set PREVFI = FI, PREVFASE = FASE, PREVAMPL = AMPL
Step 4. Read: received signal FI, FASE, AMPL  /* immediate second reading */
Step 5. If FI <> PREVFI .OR. FASE <> PREVFASE .OR. AMPL <> PREVAMPL, then:
    Set NRUN = NRUN + 1
    If NRUN .EQ. 20, then:  /* compare with the upper-limit value of NRUN, i.e., say 20 */
        Call VBYTFLT  /* the fault detection and recovery routine is invoked every 20 runs */
        Set NRUN = 0
    [End of If structure]
    GOTO Step 2.
    /* If the two immediate readings are inconsistent, possibly because of a transient, read again; otherwise go ahead with the rest of the application */
Else:
    If FI <> FI_LT, then:
        GOTO Step 2.
    Else if FASE .EQ. FASE_LT .AND. AMPL .EQ. AMPL_LT, then:
        TRACK  /* initiate tracking */
    Else:
        GOTO Step 2.
    [End of If structure]
[End of If structure]
{End of the enhanced processing logic, i.e., Algorithm 2}
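A hedged C sketch of Algorithm 2 follows, reusing the declarations from the previous sketch. NRUN_UL and vbytflt() are assumed names; the counter is advanced once per run, following the algorithm's stated intent of invoking VBYTFLT after every NRUN executions.

#define NRUN_UL 20u   /* upper limit of NRUN; lower it under heavier transient threat */

extern void vbytflt(void);   /* Algorithm 3: detect and repair byte errors */

void enhanced_logic(void)
{
    unsigned nrun = 0;
    for (;;) {
        /* Periodic scrubbing: after every NRUN_UL runs, verify and repair
           the three stored images of the code and look-up table. */
        if (++nrun >= NRUN_UL) {
            vbytflt();
            nrun = 0;
        }

        lt_entry prev = read_signal();   /* first reading (Steps 2-3) */
        lt_entry cur  = read_signal();   /* immediate second reading (Step 4) */

        /* Input filter (Step 5): the two consecutive readings must agree in
           every field; any disagreement suggests a transient, so the inputs
           are discarded and read again. */
        if (cur.phi != prev.phi || cur.phase != prev.phase ||
            cur.amplitude != prev.amplitude)
            continue;

        /* Consistent input: match against the look-up table and track. */
        for (size_t i = 0; i < LT_SIZE; i++)
            if (cur.phi == LT[i].phi && cur.phase == LT[i].phase &&
                cur.amplitude == LT[i].amplitude) {
                track();
                break;
            }
    }
}

Reading each signal twice before acting is what makes the loop behave as an input filter: a transient that corrupts only one of the two readings is rejected rather than matched against the table.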
Algorithm 3.
The following steps show how the VBYTFLT algorithm works.
/* It verifies the corresponding three bytes at an offset, say k, of the three images (copies) of the application and its data. A byte error is detected by comparing the k-th bytes of the three images of the application and look-up table. If a byte error has occurred, the corrupted byte is repaired by overwriting it with the byte pattern in majority. Any disagreement among the corresponding bytes indicates potential transient bit errors. The starting addresses of the three images are known. */
Step 1. Initialize: the size of an image in bytes (known in advance) and the offset k = 0.
Step 2. Compare the k-th bytes of the three images to find a majority.
Step 3. If exactly one byte disagrees, then rewrite (repair) the odd byte with the majority byte found in Step 2, increment k by one, and go to Step 2.
    Else, if there is no disagreement, then increment k by one and go to Step 2 to check the next byte, until the end of the image.
    Else, if there is no agreement at all among the three bytes (a crash), then go to ERROR to reload the application.
    /* If no majority is found, the ERROR routine is called to restart or re-execute the application. */
    [End of If structure]
Step 4. Return to the enhanced tracking program (Algorithm 2) for reliable tracking.
{End of the VBYTFLT algorithm, i.e., Algorithm 3}
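A minimal C sketch of the VBYTFLT routine follows, assuming the three image start addresses and the image size are known; img1, img2, img3, IMAGE_SIZE, and error_restart are hypothetical names introduced for illustration.

#include <stddef.h>
#include <stdint.h>

extern uint8_t *img1, *img2, *img3;  /* starting addresses I1, I2, I3 */
extern const size_t IMAGE_SIZE;      /* N: size of one image in bytes */
extern void error_restart(void);     /* the ERROR routine: reload and restart */

void vbytflt(void)
{
    for (size_t k = 0; k < IMAGE_SIZE; k++) {
        uint8_t a = img1[k], b = img2[k], c = img3[k];

        if (a == b && b == c)
            continue;                 /* all three agree: no error at offset k */

        /* Two-out-of-three majority vote: overwrite the odd byte with the
           majority pattern; this repairs even a fully corrupted byte. */
        if (a == b)
            img3[k] = a;
        else if (a == c)
            img2[k] = a;
        else if (b == c)
            img1[k] = b;
        else
            error_restart();          /* no majority at all: likely a
                                         permanent fault; reload the application */
    }
}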
3. Discussion
Algorithm 1 describes the basic processing logic. The input parameters are compared with the records stored in the look-up table. If a matching record is found inside the LT, the target is not tracked, because the LT stores only the parameters of friendly aircraft. A mismatch, or a 'not found' result in the LT, indicates that the aircraft is a foe and therefore needs to be tracked, which is why the tracking function is initiated. However, the basic processing logic does not work properly in reality.
Algorithm 2 describes the steps of this application with an enhanced processing logic. This logic uses the input parameters of two immediate, successive readings in order to eliminate any ambiguity in the input data arising from potential transients; it thus also behaves like an input filter. The NRUN variable can be tuned to combat the effect of transients. In this algorithm, after every twenty runs of the application (say, NRUN's upper limit is 20), the error detection and recovery routine VBYTFLT is invoked. If the transient threat is more frequent, the upper limit of the NRUN variable may be reduced to a value of, say, 5. Thus, depending on the real environment, the time interval between two successive calls of the VBYTFLT routine during the execution of the application can be reduced (to, say, 1) or increased by changing the upper limit of the NRUN variable, as shown at Step 5 of Algorithm 2. The variables PREVFI, PREVFASE, and PREVAMPL store the previous inputs, whereas the variables FI, FASE, and AMPL store the most recent input parameters.
Algorithm 3 shows the steps involved in detecting and correcting errors in the look-up table LT as well as in the application code. Three images of the application code, along with the LT, are stored in the memory of the computing machine. Let the starting addresses of the three images be I1, I2, and I3 respectively. When the offset k is 0 (the initial value), the address I1,0 denotes the starting address of the first image I1 itself, because I1,0 has the value (I1 + 0), i.e., starting address plus offset. In general, if Im is the starting address of the m-th image, then the address of the byte at offset k, denoted Im,k, is given by equation (1).

Im,k = Im + k    (1)
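In C terms, equation (1) is ordinary pointer arithmetic; a minimal illustration (byte_at is a hypothetical helper, not from the paper):

#include <stddef.h>
#include <stdint.h>

/* Address of the byte at offset k within the image starting at Im. */
static inline uint8_t *byte_at(uint8_t *image_start, size_t k)
{
    return image_start + k;   /* Im,k = Im + k, equation (1) */
}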
If any one of the three corresponding bytes of the three images at an offset, say k, is corrupted, the VBYTFLT routine repairs the corrupted byte by overwriting the erroneous byte with the byte in majority. The affected byte is detected by comparing the three bytes at the same offset, as shown at Steps 2 and 3 of Algorithm 3.
The possibility that transients inadvertently alter two bytes at distant locations into the same corrupted value, thereby defeating Step 3 of Algorithm 3, is almost nil. In other words, the chance of a byte error remaining undetected is

1/(2^8) * 1/(2^8) = 2^-16    (2)

This method is capable of detecting even all-8-bit errors, i.e., even an entirely corrupted byte is detected.
Again, the method can repair all 8-bit errors. Suppose the byte In,k is corrupted to In,k*, while the byte contents Im,k and Io,k at offset k of the two images Im and Io remain the same. Then, by comparing the three corresponding bytes of the three images, we can detect that the byte In,k is corrupted (as shown at Step 3 of Algorithm 3). The corrupted byte is repaired by overwriting the wrong one with the majority one. This applies even to 8-bit errors within a byte.

If there is no error in the program and data code, the following equation is satisfied.

Im,k = In,k = Io,k    (3)
The chance that three corrupted bytes of the three images at the same offset satisfy equation (3) is negligibly small, because the effects of transients on memory and registers are very random and independent in nature:

1/(2^8) * 1/(2^8) * 1/(2^8) = 2^-24    (4)

In other words, the chance that three bytes at different locations, holding a particular value with the same bit pattern, are simultaneously altered (by the random effects of transients) to one common value so as to satisfy equation (5) is negligibly small.

Im,k* = In,k* = Io,k*    (5)
Again, the chance that a particular one-byte value stored at the same offset in the three images is altered into three different values (of different bit patterns) is 1/(2^24). Such a disastrous outcome indicates the possibility of memory hardware or permanent errors, and the ERROR routine is invoked for the necessary recovery.
In other words, the probability of invoking the error routine ERROR (as shown at Step 3) is negligibly small. The routine VBYTFLT checks for byte errors across the entire application program along with the reference data table (LT) by increasing the offset k from 0 to N-1 (where N is the size of an image in bytes). This is very effective for online error detection and recovery of the application throughout its life cycle. After the entire application has been checked and repaired, program control returns to the main application (as shown in Algorithm 2). Even a totally corrupted image can be repaired by the proposed technique, byte after byte.
The space redundancy of the proposed technique is about three. However, given the downward trend in hardware prices, this much space redundancy is easily affordable, and the slightly higher time redundancy is likewise affordable on a high-speed machine. The proposed technique is capable of detecting and repairing any number of soft errors (errors that are not reproducible) as well as permanent errors during the run time of the application. Fault detectability (FD) is inversely proportional to the upper limit of the variable NRUN:

FD ∝ 1/NRUN_UL    (6)

Thus, depending on the threat posed by transients, the upper limit value of NRUN can be changed to meet the requirement.
There are several conventional error detection and correction schemes, such as parity checks, Hamming codes, cyclic redundancy checks (CRC), and checksums. They are not free from limitations, as stated in Rhee (1985) and Saha (1999). A single parity check can detect only an odd number of errors; any even number of errors remains undetected, so it is inadequate for detecting an arbitrary number of errors. CRC is normally shift-register based: a shift-register circuit divides polynomials and finds the remainder, with modulo-2 adders and multiplier circuits. In CRC, when errors actually occur, the receiver fails to detect them whenever the remainder is zero. CRC also has high time redundancy, which is why it is normally implemented in hardware; a software-based CRC implementation is impractical in a real-time application. A Hamming code provides only single-error correction and double-error detection. In a typical checksum, n bytes are XORed and the result is stored in the (n+1)-th byte; if this byte itself is corrupted by transients, or if an even number of matching changes occurs, the errors remain undetected by the checksum.
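The following self-contained C example (not from the paper) demonstrates this checksum limitation: two compensating bit flips cancel in the XOR, so the stored checksum still verifies and the corruption goes unnoticed.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* XOR checksum over n bytes; in the scheme described above, the result
   would be stored as the (n+1)-th byte. */
static uint8_t xor_checksum(const uint8_t *buf, size_t n)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum ^= buf[i];
    return sum;
}

int main(void)
{
    uint8_t data[4] = {0x12, 0x34, 0x56, 0x78};
    uint8_t stored = xor_checksum(data, 4);

    data[0] ^= 0x01;   /* a transient flips bit 0 of byte 0 ... */
    data[2] ^= 0x01;   /* ... and bit 0 of byte 2 (an "even" change) */

    /* The two flips cancel out in the XOR, so the recomputed checksum
       still matches the stored one and the corruption is undetected. */
    if (xor_checksum(data, 4) == stored)
        printf("checksum unchanged: corruption undetected\n");
    return 0;
}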
Interested readers may refer to the works of (Huang and Abraham 1984), (Avizienis 1985), (Liestman 1986), (Saha 1995), (Saha 2000), (Saha 2003), and (Papageorgiou and Kokolakis 2004). The conventional methods have limitations and incur higher redundancy in both time and memory space. The proposed technique, by contrast, is promising enough to detect and correct multiple errors with an affordable redundancy in both time and memory space for designing a robust computer-controlled system.
4. Conclusion
The proposed software-based technique is a very cost-effective and economically important tool in comparison to traditional triple modular redundancy (TMR) or N-version programming techniques, which are based on design diversification. It can detect and repair errors; however, it does not address the design bugs of an application. During the life cycle of an application in which ultra-high reliability of the computed result is essential, the proposed technique will be very effective at the cost of an affordable redundancy in time and space, without any increase in the monetary budget. It can serve as a very effective tool for system engineers and is also easy to implement. It provides high fault detection coverage and reliability in the processing logic, and it is a useful, low-cost tool for achieving dependable computation and higher maintainability of a computer-controlled system, with an affordable memory overhead of about 3.2 and a time redundancy of about 3.25.
References
Dimitri, B. (1989), Data Networks, (PHI).
Anderson, T. and Lee, P.A. (1981), Fault Tolerance: Principles and Practice, (PHI).
Wicker, B.S. (1995), Error Control Systems for Digital Communication and Storage, (Prentice Hall, NJ, USA).
Saha, G.K. (1994), "Transient control by software for identification of friend and foe (IFF) application", Proc. Int'l Symp. EMC'94, Sendai, Japan, IEEE Press, 509-512.
Rhee, M.Y. (1985), Error-Correcting Coding Theory, McGraw-Hill Publishing Company, USA.
Saha, G.K. (1999), "Algorithm based EFT errors detection in matrix arrays", SAMS Journal, Gordon and Breach, 36, 117-135.
Huang, K. and Abraham, J. (1984), "Algorithm-based fault tolerance for matrix operations", IEEE Transactions on Computers, C-33(6), 518-528.
Avizienis, A. (1985), "The N-version approach to fault tolerant software", IEEE Transactions on Software Engineering, SE-11, 1491-1501.
Liestman, A.L., et al. (1986), "A fault tolerant scheduling problem", IEEE Transactions on Software Engineering, SE-12, 1089-1095.
Saha, G.K. (1995), "Designing an EMI immune software for a micro-processor based traffic control system", Proc. 11th IEEE Int'l Symp. EMC, Switzerland, 401-404.
Saha, G.K. (2000), "Transient fault tolerant processing in a RF application", SAMS Journal, Gordon and Breach, 38, 81-93.
Saha, G.K. (2003), "Transient fault tolerance analysis using algorithms", accepted paper, proof no. GSAM-31100, International Journal - Systems Analysis Modelling Simulation, Taylor & Francis, UK.
Papageorgiou, E. and Kokolakis, G. (2004), "A two-unit parallel system supported by (n-2) standbys with general and non-identical lifetimes", International Journal of Systems Science, 35, 1-12.