Software Based Fault Tolerant Computing Using Redundancy
Goutam Kumar Saha
CA-2/4B, CPM Party Office Road, Baguiati,
Deshbandhu Nagar, Kolkata 700059, WB, India.
E-mail: gksaha@rediffmail.com
Abstract
This paper examines a software-based fault-tolerant computing approach that uses triplicate redundancy and recovery. The approach is not intended to tolerate software design bugs; it is intended to tolerate various environmental faults that occur during the execution of a computer-controlled system. Application data corruption caused by electrical transients is detected and recovered immediately, so that control of the system is maintained. The proposed approach is a low-cost tool for designing a robust industrial application system that can tolerate errors due to electrical surges and transients, and it does not rely on design diversification.
Keywords: Fault-tolerant robust computing, recovery, application system.
1. Introduction
Electrical noise, electrical transients (ETs), electrostatic discharge (ESD), and electromagnetic pulses (EMP) are examples of short-duration noise. Such short-duration noise often causes random data corruption in the primary memory of a computing machine. Many scientific applications that depend on reference information tables are thus forced to miss their goals because they operate on corrupted data tables and program code. While designing software for an on-line application, we often take it for granted that our program code and data banks are absolutely safe and correct. This is not always true, because high-speed processing units are often affected by short-duration noise, as discussed in (Anderson and Lee 1981), (Dimitri 1989), (Wicker 1995). Electromagnetic interference (EMI) is an unplanned, extraneous electrical signal that affects the performance of a computer system. It can cause memory errors and data file destruction during the run time of an application. Externally produced EMI or noise enters the computer through cabling or openings in the case; sometimes it enters by static discharge through the case of the disk drive. Thus, while designing software, the effects of noise should not be overlooked in a scientific application that uses look-up data during its execution.
2. The Software Application
The look-up table contains the angular field distribution for different regions of an aircraft. Algorithm 1 shows the basic logic for determining whether the phase and amplitude of the received signal in a particular direction match a record in a predefined user look-up table (LT) for the angular distribution concerned. Depending on the result of this match, some action or function (say, track) is performed.
Before initiating an action, therefore, the matching logic and its processing must be highly reliable and accurate. If a transient corrupts data or program code, the whole system can lock up, leading to a complete mission failure.
Algorithm 1.
/* A predefined user data look-up table [LT] contains records of information such as PHI, PHASE, and AMPLITUDE. This algorithm shows the basic processing logic for finding a true match between the PHI, PHASE, and AMPLITUDE of a received signal and a predefined record in LT. The variables FI, FASE, and AMPL denote PHI, PHASE, and AMPLITUDE respectively. */
Step 1. Read: received signal FI, FASE, AMPL
Step 2. If FI .EQ. FI_LT, then:
    /* Compare the input parameters with the stored parameters in the look-up table */
    If FASE .EQ. FASE_LT .AND. AMPL .EQ. AMPL_LT, then:
        TRACK  /* if matching, initiate tracking */
    Else:
        GOTO Step 1.  /* if not matching, read the inputs again */
    [End of If structure]
Else:
    GOTO Step 1.
[End of If structure]
{End of Algorithm 1, showing the basic processing logic}
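To make the control flow concrete, the following is a minimal C sketch of Algorithm 1. The type and helper names (lt_entry, LT, LT_SIZE, read_signal, track) are illustrative assumptions, not part of the original paper; field values are treated as quantized integers so that exact .EQ.-style comparison is well defined.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t phi;        /* PHI (FI), in quantized angular units */
    uint16_t phase;      /* PHASE (FASE) */
    uint16_t amplitude;  /* AMPLITUDE (AMPL) */
} lt_entry;

extern const lt_entry LT[];          /* predefined look-up table */
extern const size_t LT_SIZE;         /* number of records in LT */
extern lt_entry read_signal(void);   /* reads FI, FASE, AMPL */
extern void track(void);             /* the TRACK action */

void basic_logic(void)
{
    for (;;) {
        lt_entry s = read_signal();           /* Step 1 */
        for (size_t i = 0; i < LT_SIZE; i++)  /* Step 2 */
            if (s.phi == LT[i].phi &&
                s.phase == LT[i].phase &&
                s.amplitude == LT[i].amplitude) {
                track();                      /* full match: initiate tracking */
                break;
            }
        /* no match: loop back and read the next signal */
    }
}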
The basic processing logic above is made robust with the enhanced processing logic shown in Algorithm 2.
Algorithm 2.
/* This shows the enhanced processing logic for transient fault tolerance. The variables PREVFI, PREVFASE, and PREVAMPL store the earlier incoming signal's PHI, PHASE, and AMPLITUDE. The routine "VBYTFLT" is called from this enhanced logic to detect transient errors in the program and data code and to correct them, periodically, say after every NRUN executions of the application system. */
Step 1. Set NRUN = 0  /* counts the number of application runs or iterations */
Step 2. Read: received signal FI, FASE, AMPL
Step 3. Set PREVFI = FI, PREVFASE = FASE, PREVAMPL = AMPL
Step 4. Read: received signal FI, FASE, AMPL  /* immediate second reading */
Step 5. If FI <> PREVFI .OR. FASE <> PREVFASE .OR. AMPL <> PREVAMPL, then:
    Set NRUN = NRUN + 1
    If NRUN .EQ. 20, then:  /* compare with the upper-limit value of NRUN, i.e., say 20 */
        Call VBYTFLT  /* the fault detection and recovery routine is invoked every 20 runs */
        Set NRUN = 0
    [End of If structure]
    GOTO Step 2.
    /* If the two immediate readings are inconsistent, possibly because of a transient, read again; otherwise go ahead with the rest of the application */
Else:
    If FI <> FI_LT, then:
        GOTO Step 2.
    Else if FASE .EQ. FASE_LT .AND. AMPL .EQ. AMPL_LT, then:
        TRACK  /* initiate tracking */
    Else:
        GOTO Step 2.
    [End of If structure]
[End of If structure]
{End of the enhanced processing logic, i.e., Algorithm 2}
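A hedged C sketch of Algorithm 2 follows, reusing the declarations from the previous sketch. NRUN_UL and vbytflt() are assumed names; the counter is advanced once per run, following the algorithm's stated intent of invoking VBYTFLT after every NRUN executions.

#define NRUN_UL 20u   /* upper limit of NRUN; lower it under heavier transient threat */

extern void vbytflt(void);   /* Algorithm 3: detect and repair byte errors */

void enhanced_logic(void)
{
    unsigned nrun = 0;
    for (;;) {
        /* Periodic scrubbing: after every NRUN_UL runs, verify and repair
           the three stored images of the code and look-up table. */
        if (++nrun >= NRUN_UL) {
            vbytflt();
            nrun = 0;
        }

        lt_entry prev = read_signal();   /* first reading (Steps 2-3) */
        lt_entry cur  = read_signal();   /* immediate second reading (Step 4) */

        /* Input filter (Step 5): the two consecutive readings must agree in
           every field; any disagreement suggests a transient, so the inputs
           are discarded and read again. */
        if (cur.phi != prev.phi || cur.phase != prev.phase ||
            cur.amplitude != prev.amplitude)
            continue;

        /* Consistent input: match against the look-up table and track. */
        for (size_t i = 0; i < LT_SIZE; i++)
            if (cur.phi == LT[i].phi && cur.phase == LT[i].phase &&
                cur.amplitude == LT[i].amplitude) {
                track();
                break;
            }
    }
}

Reading each signal twice before acting is what makes the loop behave as an input filter: a transient that corrupts only one of the two readings is rejected rather than matched against the table.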
Algorithm 3.
The following steps show how the VBYTFLT algorithm works.
/* It verifies the corresponding three bytes at an offset, say k, of the three images (copies) of the application and its data. A byte error is detected by comparing the k-th bytes of the three images of the application and look-up table. If a byte error has occurred, the corrupted byte is repaired by overwriting it with the byte pattern in majority. Any disagreement among the corresponding bytes indicates potential transient bit errors. The starting addresses of the three images are known. */
Step 1. Initialize: the size of an image in bytes (known in advance) and the offset k = 0.
Step 2. Compare the k-th bytes of the three images to find a majority.
Step 3. If exactly one byte disagrees, then rewrite (repair) the odd byte with the majority byte found in Step 2, increment k by one, and go to Step 2.
    Else, if there is no disagreement, then increment k by one and go to Step 2 to check the next byte, until the end of the image.
    Else, if there is no agreement at all among the three bytes (a crash), then go to ERROR to reload the application.
    /* If no majority is found, the ERROR routine is called to restart or re-execute the application. */
    [End of If structure]
Step 4. Return to the enhanced tracking program (Algorithm 2) for reliable tracking.
{End of the VBYTFLT algorithm, i.e., Algorithm 3}
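A minimal C sketch of the VBYTFLT routine follows, assuming the three image start addresses and the image size are known; img1, img2, img3, IMAGE_SIZE, and error_restart are hypothetical names introduced for illustration.

#include <stddef.h>
#include <stdint.h>

extern uint8_t *img1, *img2, *img3;  /* starting addresses I1, I2, I3 */
extern const size_t IMAGE_SIZE;      /* N: size of one image in bytes */
extern void error_restart(void);     /* the ERROR routine: reload and restart */

void vbytflt(void)
{
    for (size_t k = 0; k < IMAGE_SIZE; k++) {
        uint8_t a = img1[k], b = img2[k], c = img3[k];

        if (a == b && b == c)
            continue;                 /* all three agree: no error at offset k */

        /* Two-out-of-three majority vote: overwrite the odd byte with the
           majority pattern; this repairs even a fully corrupted byte. */
        if (a == b)
            img3[k] = a;
        else if (a == c)
            img2[k] = a;
        else if (b == c)
            img1[k] = b;
        else
            error_restart();          /* no majority at all: likely a
                                         permanent fault; reload the application */
    }
}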
3. Discussion
Algorithm 1 describes the basic processing logic. The input parameters are compared with the records stored in the look-up table. If a matching record is found inside the LT, the target is not tracked, because the LT stores only the parameters of friendly aircraft. A mismatch, or a 'not found' result in the LT, indicates that the aircraft is a foe and therefore needs to be tracked, which is why the tracking function is initiated. However, the basic processing logic does not work properly in reality.
Algorithm 2 describes the steps of this application with an enhanced processing logic. This logic uses the input parameters of two immediate, successive readings in order to eliminate any ambiguity in the input data arising from potential transients; it thus also behaves like an input filter. The NRUN variable can be tuned to combat the effect of transients. In this algorithm, after every twenty runs of the application (say, NRUN's upper limit is 20), the error detection and recovery routine VBYTFLT is invoked. If the transient threat is more frequent, the upper limit of the NRUN variable may be reduced to a value of, say, 5. Thus, depending on the real environment, the time interval between two successive calls of the VBYTFLT routine during the execution of the application can be reduced (to, say, 1) or increased by changing the upper limit of the NRUN variable, as shown at Step 5 of Algorithm 2. The variables PREVFI, PREVFASE, and PREVAMPL store the previous inputs, whereas the variables FI, FASE, and AMPL store the most recent input parameters.
Algorithm 3 shows the steps involved in detecting and correcting errors in the look-up table LT as well as in the application code. Three images of the application code, along with the LT, are stored in the memory of the computing machine. Let the starting addresses of the three images be I1, I2, and I3 respectively. When the offset k is 0 (the initial value), the address I1,0 denotes the starting address of the first image I1 itself, because I1,0 has the value (I1 + 0), i.e., starting address plus offset. In general, if Im is the starting address of the m-th image, then the address of the byte at offset k, denoted Im,k, is given by equation (1).

Im,k = Im + k    (1)
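In C terms, equation (1) is ordinary pointer arithmetic; a minimal illustration (byte_at is a hypothetical helper, not from the paper):

#include <stddef.h>
#include <stdint.h>

/* Address of the byte at offset k within the image starting at Im. */
static inline uint8_t *byte_at(uint8_t *image_start, size_t k)
{
    return image_start + k;   /* Im,k = Im + k, equation (1) */
}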
If any one of the three corresponding bytes of the three images at an offset, say k, is corrupted, the VBYTFLT routine repairs the corrupted byte by overwriting the erroneous byte with the byte in majority. The affected byte is detected by comparing the three bytes at the same offset, as shown at Steps 2 and 3 of Algorithm 3.
The possibility that transients inadvertently alter two bytes at distant locations into the same corrupted value, thereby defeating Step 3 of Algorithm 3, is almost nil. In other words, the chance of a byte error remaining undetected is

1/(2^8) * 1/(2^8) = 2^-16    (2)

This method is capable of detecting even all-8-bit errors, i.e., even an entirely corrupted byte is detected.
Again, the method can repair all 8-bit errors. Suppose the byte In,k is corrupted to In,k*, while the byte contents Im,k and Io,k at offset k of the two images Im and Io remain the same. Then, by comparing the three corresponding bytes of the three images, we can detect that the byte In,k is corrupted (as shown at Step 3 of Algorithm 3). The corrupted byte is repaired by overwriting the wrong one with the majority one. This applies even to 8-bit errors within a byte.

If there is no error in the program and data code, the following equation is satisfied.

Im,k = In,k = Io,k    (3)
The chance that three corrupted bytes of the three images at the same offset satisfy equation (3) is negligibly small, because the effects of transients on memory and registers are very random and independent in nature:

1/(2^8) * 1/(2^8) * 1/(2^8) = 2^-24    (4)

In other words, the chance that three bytes at different locations, holding a particular value with the same bit pattern, are simultaneously altered (by the random effects of transients) to one common value so as to satisfy equation (5) is negligibly small.

Im,k* = In,k* = Io,k*    (5)
Again, the chance that a particular one-byte value stored at the same offset in the three images is altered into three different values (of different bit patterns) is 1/(2^24). Such a disastrous outcome indicates the possibility of memory hardware or permanent errors, and the ERROR routine is invoked for the necessary recovery.
In other words, the probability of invoking the error routine ERROR (as shown at Step 3) is negligibly small. The routine VBYTFLT checks for byte errors across the entire application program along with the reference data table (LT) by increasing the offset k from 0 to N-1 (where N is the size of an image in bytes). This is very effective for online error detection and recovery of the application throughout its life cycle. After the entire application has been checked and repaired, program control returns to the main application (as shown in Algorithm 2). Even a totally corrupted image can be repaired by the proposed technique, byte after byte.
The space redundancy of the proposed technique is about three. However, given the downward trend in hardware prices, this much space redundancy is easily affordable, and the slightly higher time redundancy is likewise affordable on a high-speed machine. The proposed technique is capable of detecting and repairing any number of soft errors (errors that are not reproducible) as well as permanent errors during the run time of the application. Fault detectability (FD) is inversely proportional to the upper limit of the variable NRUN:

FD ∝ 1/NRUN_UL    (6)

Thus, depending on the threat posed by transients, the upper limit value of NRUN can be changed to meet the requirement.
There are several conventional error detection and correction schemes, such as parity checks, Hamming codes, cyclic redundancy checks (CRC), and checksums. They are not free from limitations, as stated in Rhee (1985) and Saha (1999). A single parity check can detect only an odd number of errors; any even number of errors remains undetected, so it is inadequate for detecting an arbitrary number of errors. CRC is normally shift-register based: a shift-register circuit divides polynomials and finds the remainder, with modulo-2 adders and multiplier circuits. In CRC, when errors actually occur, the receiver fails to detect them whenever the remainder is zero. CRC also has high time redundancy, which is why it is normally implemented in hardware; a software-based CRC implementation is impractical in a real-time application. A Hamming code provides only single-error correction and double-error detection. In a typical checksum, n bytes are XORed and the result is stored in the (n+1)-th byte; if this byte itself is corrupted by transients, or if an even number of matching changes occurs, the errors remain undetected by the checksum.
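The following self-contained C example (not from the paper) demonstrates this checksum limitation: two compensating bit flips cancel in the XOR, so the stored checksum still verifies and the corruption goes unnoticed.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* XOR checksum over n bytes; in the scheme described above, the result
   would be stored as the (n+1)-th byte. */
static uint8_t xor_checksum(const uint8_t *buf, size_t n)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum ^= buf[i];
    return sum;
}

int main(void)
{
    uint8_t data[4] = {0x12, 0x34, 0x56, 0x78};
    uint8_t stored = xor_checksum(data, 4);

    data[0] ^= 0x01;   /* a transient flips bit 0 of byte 0 ... */
    data[2] ^= 0x01;   /* ... and bit 0 of byte 2 (an "even" change) */

    /* The two flips cancel out in the XOR, so the recomputed checksum
       still matches the stored one and the corruption is undetected. */
    if (xor_checksum(data, 4) == stored)
        printf("checksum unchanged: corruption undetected\n");
    return 0;
}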
Interested readers may refer to the works of (Huang and Abraham 1984), (Avizienis 1985), (Liestman 1986), (Saha 1995), (Saha 2000), (Saha 2003), and (Papageorgiou and Kokolakis 2004). The conventional methods have limitations and incur higher redundancy in both time and memory space. The proposed technique, by contrast, is promising enough to detect and correct multiple errors with an affordable redundancy in both time and memory space for designing a robust computer-controlled system.
4. Conclusion
The proposed software-based technique is a very cost-effective and economically important tool in comparison to traditional triple modular redundancy (TMR) or N-version programming techniques, which are based on design diversification. It can detect and repair errors; however, it does not address the design bugs of an application. During the life cycle of an application in which ultra-high reliability of the computed result is essential, the proposed technique will be very effective at the cost of an affordable redundancy in time and space, without any increase in the monetary budget. It can serve as a very effective tool for system engineers and is also easy to implement. It provides high fault detection coverage and reliability in the processing logic, and it is a useful, low-cost tool for achieving dependable computation and higher maintainability of a computer-controlled system, with an affordable memory overhead of about 3.2 and a time redundancy of about 3.25.
References
Dimitri, B. (1989), Data Networks, (PHI).
Anderson, T. and Lee, P.A. (1981), Fault Tolerance: Principles and Practice, (PHI).
Wicker, B.S. (1995), Error Control Systems for Digital Communication and Storage, (Prentice Hall, NJ, USA).
Saha, G.K. (1994), "Transient control by software for identification of friend and foe (IFF) application", Proc. Int'l Symp. EMC'94, Sendai, Japan, IEEE Press, 509-512.
Rhee, M.Y. (1985), Error-Correcting Coding Theory, McGraw-Hill Publishing Company, USA.
Saha, G.K. (1999), "Algorithm based EFT errors detection in matrix arrays", SAMS Journal, Gordon and Breach, 36, 117-135.
Huang, K. and Abraham, J. (1984), "Algorithm-based fault tolerance for matrix operations", IEEE Transactions on Computers, C-33(6), 518-528.
Avizienis, A. (1985), "The N-version approach to fault tolerant software", IEEE Transactions on Software Engineering, SE-11, 1491-1501.
Liestman, A.L., et al. (1986), "A fault tolerant scheduling problem", IEEE Transactions on Software Engineering, SE-12, 1089-1095.
Saha, G.K. (1995), "Designing an EMI immune software for a micro-processor based traffic control system", Proc. 11th IEEE Int'l Symp. EMC, Switzerland, 401-404.
Saha, G.K. (2000), "Transient fault tolerant processing in a RF application", SAMS Journal, Gordon and Breach, 38, 81-93.
Saha, G.K. (2003), "Transient fault tolerance analysis using algorithms", accepted paper, proof no. GSAM-31100, International Journal - Systems Analysis Modelling Simulation, Taylor & Francis, UK.
Papageorgiou, E. and Kokolakis, G. (2004), "A two-unit parallel system supported by (n-2) standbys with general and non-identical lifetimes", International Journal of Systems Science, 35, 1-12.