
Software implemented fault tolerance through data error recovery

Abstract

This paper examines how effective a new software-implemented data error recovery scheme is, compared with conventional Error Correction Codes (ECC), during the execution of an application. The proposed algorithm is three times faster than conventional software-implemented ECC, and because it is simple, application designers can build it into their fault-tolerant applications at no extra hardware cost. The scheme detects and corrects data errors at execution time by relying on three-fold replication of the application data set.
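The three-fold replication idea can be illustrated with a minimal sketch. It assumes a bitwise two-out-of-three vote over integer words; the abstract does not say whether the paper votes bit by bit or word by word, and the function name `vote_and_repair` is hypothetical.

```python
from typing import List


def vote_and_repair(a: List[int], b: List[int], c: List[int]) -> List[int]:
    """Bitwise 2-of-3 vote over three replicas of the same data set.

    Assumption (not from the paper): each replica is a list of integer
    words. Every word in all three replicas is overwritten with the voted
    value, so a bit corrupted in only one replica is repaired in place.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("replicas must have equal length")
    for i in range(len(a)):
        x, y, z = a[i], b[i], c[i]
        # A bit is set in the result only if it is set in at least two replicas.
        voted = (x & y) | (y & z) | (x & z)
        a[i] = b[i] = c[i] = voted
    return a


# Usage: corrupt one replica, then recover the original data.
primary = [5, 7, 9]
copy1 = list(primary)
copy2 = list(primary)
copy1[1] ^= 0x4          # simulate a single bit flip in one replica
recovered = vote_and_repair(primary, copy1, copy2)
```

A vote of this kind cannot repair a bit position that is corrupted in two replicas at once; that limit follows from the two-out-of-three rule itself.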
... A set of carefully chosen software error detection techniques, including Assertions [7], Algorithm Based Fault Tolerance (ABFT) [8], Control Flow Checking [9], and procedure duplication [10], is suitable for achieving a high degree of safe behavior in ordinary computers by complementing the intrinsic Error Detection Mechanisms (EDMs) of the system (exceptions, memory protection, etc.). In [11] it has been shown that it is possible to achieve an adaptive infrastructure to support different levels of availability requirements in a network environment. In [12] it has been shown how to detect transient errors and how to recover from them using triple copies of a single-version enhanced application. ...
... • A wrong reset of a variable (e.g., GTB1) will result in a discrepancy between the variable's expected and observed values. This discrepancy may also be repaired by keeping three images of the application code and data [11]. Again, the sanity of the code of the decider or examiner routine, namely EXAMCODE, is also verified by another routine, namely EXAM-EXAMCODE. ...
Article
Based on the semantics of an application's processing logic, we identify the most critical and sensitive parts of the application, derive a set of conditions or assertions among the various diagnostic checkpoint variables, and enhance the processing logic so that it can detect operational or environmental faults at run time. This paper examines how a single-version algorithm can establish software-based fault tolerance by designing thoughtful execution-time checks into a computing application. The algorithm relies on assertions and diagnostic checkpoints derived from the application's semantics. This work is not intended to correct bit errors using conventional error correction codes. Errors are detected through checkpoints and through periodic execution of the application with known test data, verifying the observed result against the known result. Electrical transients or small particles hitting the circuit often cause random errors or faults in data and program flow. The manuscript describes an algorithm that allows the detection of, and recovery from, transient or operational failures in software on a specific problem, using just one version of a software program running on just one machine. This approach does not aim to tolerate software design bugs. It uses various run-time signatures, and validation of them, to detect faults.
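A minimal sketch of the two detection ideas named in this abstract, semantic checkpoint assertions and periodic execution with known test data, might look as follows. The application (summing bounded sensor readings), the names `checked_total`, `self_test` and `CheckpointError`, and the chosen bounds are all hypothetical illustrations, not the paper's algorithm.

```python
class CheckpointError(RuntimeError):
    """Raised when a diagnostic checkpoint assertion fails."""


def checked_total(readings, lo, hi):
    """Sum integer readings, validating semantic checkpoints on the way."""
    total = 0
    count = 0
    for r in readings:
        # Checkpoint 1: every input must lie in its physically valid range.
        if not lo <= r <= hi:
            raise CheckpointError(f"reading {r} outside [{lo}, {hi}]")
        total += r
        count += 1
    # Checkpoint 2: the loop must have visited every element exactly once.
    if count != len(readings):
        raise CheckpointError("loop count mismatch")
    # Checkpoint 3: the result must be feasible given the input bounds.
    if not count * lo <= total <= count * hi:
        raise CheckpointError("total outside feasible range")
    return total


# Known test data and its known result, used for the periodic self-test.
KNOWN_INPUT = [10, 20, 30]
KNOWN_TOTAL = 60


def self_test():
    """Run the application on known data and compare with the known result."""
    try:
        return checked_total(KNOWN_INPUT, 0, 100) == KNOWN_TOTAL
    except CheckpointError:
        return False
```

In a deployed system `self_test()` would be called periodically between real workloads; how often, and what recovery follows a failed check, are design decisions the abstract does not specify.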
... Other works [14,15,16,20] also discussed software hardening and the limitations of ECC and conventional software-based techniques through single-bit fault injection. Software cost analysis for the RB, NVP and SIFT approaches has been discussed in [21,22,23,24]. ...
... Like any conventional SIFT or triple modular redundancy (TMR) based fault tolerance scheme, this SVS approach cannot claim to be free from overhead in code and execution-time redundancy. The execution-time redundancy observed in SVS is, on average, 2.6 times [21] that of the basic application code without any software fix for fault tolerance. ...
Article
This paper describes how to design low-cost, reliable computing software for various application systems by incorporating a single-version fault-tolerant scheme along with run-time signature-based control-flow checking. Most ordinary systems lack a fault-tolerant software fix. The conventional fault-tolerant approaches, such as Recovery Block (RB) and N-Version Programming (NVP), are too costly to fit into an ordinary low-cost application system because both RB and NVP rely on multiple (at least three) versions of both software and computing machines. The proposed approach, however, needs only a single version (SV) of an enhanced application program that executes on one computing machine. Interrupted service, caused by an intermittent fault either in an application program or in hardware, is common during the service delivery period of an ordinary, cheaper application system. An executing application program often malfunctions or is interrupted because of memory bit errors. Error Correction Codes (ECC) used in memory (parity, Hamming codes, CRC, etc.) are not as effective for online correction of multiple-bit errors as they are for detection of a few bit errors. Moreover, software-implemented ECC carries a significant overhead in both time and code redundancy. In other words, built-in ECC in memory cannot recover from all bit errors; some it can only detect. As a result, if ECC detects an error, the application program must be restarted and re-executed afresh in various microprocessor-based application systems. ECC alone is therefore useful for designing a fail-stop kind of system, but it suffers from high time redundancy. Other software-implemented fault-tolerance schemes also tend toward the fail-stop kind. The proposed SV-based approach, in contrast, is capable of tolerating such errors without stopping the execution of an application.
This SV Scheme (SVS) aims to provide uninterrupted service at no extra monetary cost, but at the price of an acceptable increase in execution time and memory space. SVS is a non-fail-stop fault tolerance scheme that can be implemented in various computing systems without additional expenditure; as a result, a large part of society can gain reliable service from a low-cost, SV-based computing system.
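Run-time signature-based control-flow checking, mentioned at the start of this abstract, can be sketched in a few lines. The sketch assumes a single legal block sequence and a simple polynomial running signature; the abstract gives no signature function, so `FlowMonitor`, `expected_signature` and the multiplier 31 are illustrative choices only.

```python
MASK = 0xFFFFFFFF  # keep the signature in 32 bits


def expected_signature(path):
    """Signature of a legal block sequence, computed ahead of execution."""
    sig = 0
    for block_id in path:
        sig = (sig * 31 + block_id) & MASK
    return sig


class FlowMonitor:
    """Accumulates a run-time signature as basic blocks are entered."""

    def __init__(self, legal_path):
        self._expected = expected_signature(legal_path)
        self._sig = 0

    def enter(self, block_id):
        # Same update rule as expected_signature, applied at run time.
        self._sig = (self._sig * 31 + block_id) & MASK

    def check(self):
        """True if the executed block sequence matches the legal one."""
        return self._sig == self._expected


# Usage: the program calls enter() at the top of each block.
monitor = FlowMonitor([1, 2, 3])
for block in (1, 2, 3):
    monitor.enter(block)
flow_ok = monitor.check()
```

A mismatch at `check()` indicates that a block was skipped, repeated or executed out of order. Like any compact signature, this one can collide, so it detects most but not all illegal paths.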
... Again, the algorithm-based fault tolerance (ABFT) approach, which refers to a self-contained method for detecting, locating, and correcting errors with a software procedure, is also useful. The single-version software-based approaches include software-implemented control flow error checking [20], error masking [22,25], fault recovery [24], error detection [19,28] and correction [23], and so on, using the necessary replicated data or code, assertions, time or space redundancy, etc. Such techniques normally rely on enhanced single version programming (ESVP) schemes [25] that are based on a single robust design only. ...
Article
This article aims to discuss various issues of software fault avoidance. Software fault avoidance aims to produce fault free software through various approaches having the common objective of reducing the number of latent defects in software programs.
Article
A comparative study of various software-implemented fault detection approaches is briefly presented in tabular form.
Article
This paper examines a software-based fault-tolerant computing approach built on triplicate redundancy and recovery. The approach is not intended to tolerate software design bugs. It is intended to tolerate various environmental faults during the execution time of a computer-controlled system. Application data corruption due to electrical transients is detected and recovered immediately so that the system stays under control. The proposed approach is a low-cost tool for designing a robust industrial application system that can tolerate errors due to electrical surges and transients. It does not rely on design diversification.
Article
This paper reviews 802.16 2001's MAC layer QoS metric. It explains the importance of QoS and its parameter set; defines types of services supported by this standard; explores the main entity of the MAC layer used for transportation that is service flow ...
Article
This paper describes how to design a software-based fault-tolerant application using a microprocessor (MP) in order to tolerate burst errors in memory. This approach may be called a single-version scheme (SVS). The SVS relies on a single-version application program enhanced with self-checking code redundancy to tolerate memory burst errors that are difficult to correct during the run time of an application. Conventionally, other software-based approaches can detect only a few bit errors in memory, offering fail-stop fault tolerance against transient bit errors. Reed-Solomon codes are mainly effective against burst errors in offline settings, such as the coding of audio Compact Disks. The proposed online technique does not need multiple versions of software or multiple machines. It employs only two copies of the enhanced version of the application, running on one machine, for online error detection and tolerance. This is an effective low-cost online tool for hardening a microprocessor-based industrial computing system, or for on-chip DRAM applications, against burst errors in processor memory at an affordable code and time redundancy. The SVS aims to provide non-fail-stop fault tolerance against burst errors. The approach also supplements the Error Correcting Codes (ECC) in the memory system against both transient and permanent bit errors in memory.
Article
A real-time system must be reliable if a failure to meet its timing specifications might endanger human life, damage equipment, or waste expensive resources. A deadline mechanism has been previously proposed to provide fault tolerance in real-time software systems. The mechanism trades the accuracy of the results of a service for timing precision. Two independent algorithms are provided for each service subject to a deadline. The primary algorithm produces a good quality service, although its real-time reliability may not be assured. The alternate algorithm is reliable and produces an acceptable response. An algorithm to generate an optimal schedule for the deadline mechanism is introduced, and a simple and efficient implementation is discussed. The schedule ensures the timely completion of the alternate algorithm despite a failure to complete the primary algorithm within real time.
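The deadline mechanism described here can be illustrated with a small simulation in logical time units. It assumes the alternate is reserved as late as possible and the primary is abandoned at that point if unfinished; this is one common reading of the mechanism, not the optimal scheduling algorithm the abstract refers to, and `run_service` and `Outcome` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    source: str        # "primary" or "alternate"
    finish_time: int   # logical time at which the result is delivered


def run_service(release, deadline, primary_time, alternate_wcet):
    """Decide which algorithm delivers the result for one service request.

    All arguments are logical time units. alternate_wcet is the assumed
    worst-case execution time of the reliable alternate algorithm.
    """
    # Latest moment the alternate can start and still meet the deadline.
    latest_alt_start = deadline - alternate_wcet
    if latest_alt_start < release:
        raise ValueError("alternate cannot meet the deadline")
    # The primary may use the time before the alternate's reserved slot.
    if release + primary_time <= latest_alt_start:
        return Outcome("primary", release + primary_time)
    # Primary did not finish in time: abandon it and run the alternate.
    return Outcome("alternate", latest_alt_start + alternate_wcet)


# Usage: a fast primary wins; a slow primary falls back to the alternate.
fast = run_service(release=0, deadline=10, primary_time=4, alternate_wcet=3)
slow = run_service(release=0, deadline=10, primary_time=9, alternate_wcet=3)
```

Reserving the alternate's slot at the end of the window gives the primary the most time to finish while still guaranteeing an acceptable answer by the deadline.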
Article
This article describes a low-cost software technique for transient fault detection and fault tolerance in a processing system. Random errors caused by potential transients, such as Electrical Fast Transients (EFT), can be controlled by the proposed technique. Transient errors, if present, are detected, and the necessary recovery action can then be taken to attain higher system reliability and fault tolerance. For application design engineers, it is a far more cost-effective tool than traditional expensive hardware fixes or N-Version programming.
Book
A treatise on fault tolerant techniques to enhance the dependability (reliability, safety, security, ...) of a computing system, with a particular emphasis on the need for fault tolerance to all defects – including those of design.