Software implemented fault tolerance through data error recovery

September 2005
Ubiquity 2005(September)

Authors:

Centre for Development of Advanced Computing

This paper examines how a new software-implemented data error recovery scheme compares with conventional Error Correction Codes (ECC) during the execution time of an application. The proposed algorithm is three times faster than conventional software-implemented ECC, and its simplicity lets application program designers build it into their fault-tolerant applications at no extra hardware cost. The scheme detects and corrects data errors at execution time, relying on three-fold replication of the application data set as its basis for fault tolerance.
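The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of three-fold data replication with majority voting, not the paper's actual method. The names `majority_vote` and `TripleStore` are hypothetical and chosen for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words: each result bit is the value
    held by at least two of the three copies."""
    return (a & b) | (b & c) | (a & c)


class TripleStore:
    """Keeps three replicas of every data word (illustrative helper,
    not taken from the paper)."""

    def __init__(self, values):
        # Three independent lists so that corrupting one leaves the others intact.
        self.copies = [list(values) for _ in range(3)]

    def write(self, index: int, value: int) -> None:
        for copy in self.copies:
            copy[index] = value

    def read(self, index: int) -> int:
        a, b, c = (copy[index] for copy in self.copies)
        voted = majority_vote(a, b, c)
        if not (a == b == c):
            # A mismatch signals a data error: repair all replicas in place.
            self.write(index, voted)
        return voted
```

A read of a word whose replicas disagree returns the voted value and rewrites all three copies, so execution continues without a restart. A bit position is recovered correctly only while at most one of the three copies is wrong at that position.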

Content uploaded by Goutam Saha

Application semantic driven assertions toward fault tolerant computing

Article

Jun 2006

Goutam Saha

Based on the semantics of an application's processing logic, we identify the most critical and sensitive parts of the application, derive a set of conditions (assertions) among its diagnostic checkpoint variables, and enhance the processing logic so that it can detect various operational or environmental faults at run time. This paper examines how a single-version algorithm can establish software-based fault tolerance by designing thoughtful execution-time checks into a computing application. The algorithm relies on assertions derived from the semantics of the application. This work is not intended to correct bit errors using conventional error correction codes. Errors are detected through checkpoints and through periodic execution of the application with known test data, verifying the observed result against the known result. Electrical transients or small particles hitting the circuit often cause random errors or faults in data and program flow. The manuscript describes an algorithm that allows the detection of, and recovery from, transient or operational failures in software on a specific problem, using just one version of a software program running on just one machine. This approach does not aim to tolerate software design bugs. It uses various run-time signatures, and their validation, to detect faults.
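As a rough illustration of the two detection mechanisms this abstract names (assertions among checkpoint variables, and periodic runs on known test data), here is a small sketch. The application logic (`process`), the particular assertions, and the test vector are all assumptions made for the example; the paper's own assertions are not given here.

```python
class AssertionFailure(Exception):
    """Raised when a diagnostic checkpoint detects an inconsistent state."""


def process(readings: list[int]) -> float:
    """Hypothetical application logic: mean of integer sensor readings,
    with assertions derived from what a mean must satisfy."""
    if not readings:
        raise ValueError("no readings")
    total = 0
    count = 0
    for value in readings:
        total += value
        count += 1
    # Checkpoint 1: the loop counter must agree with the input length.
    if count != len(readings):
        raise AssertionFailure("loop counter corrupted")
    # Checkpoint 2: a sum of `count` values lies between count*min and count*max.
    if not (min(readings) * count <= total <= max(readings) * count):
        raise AssertionFailure("accumulator corrupted")
    return total / count


# Known test data and its known result, executed periodically as a self-test.
KNOWN_INPUT = [2, 4, 6, 8]
KNOWN_RESULT = 5.0


def self_test() -> bool:
    """Return True when the known input still yields the known result."""
    try:
        return process(KNOWN_INPUT) == KNOWN_RESULT
    except AssertionFailure:
        return False
```

Both checks use only one version of the program on one machine. They can flag a corrupted counter or accumulator, but they do not by themselves say how to recover.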

A single-version scheme of fault tolerant computing

Article

Apr 2006

Goutam Saha

This paper describes how to design low-cost reliable computing software for various application systems by incorporating a single-version fault-tolerant scheme along with run-time signature-based control-flow checking. Most ordinary systems lack a fault-tolerant software fix. The conventional fault-tolerant approaches, viz. Recovery Block (RB) and N-Version Programming (NVP), are too costly for an ordinary low-cost application system because both rely on multiple (at least three) versions of both software and computing machines. The proposed approach, however, needs only a single version (SV) of an enhanced application program that executes on one computing machine. Ordinary, cheaper application systems often suffer interrupted service during their service delivery period, caused by an intermittent fault either in the application program or in hardware. Execution of an application program often malfunctions or is interrupted because of memory bit errors. Error Correction Codes (ECC) used in memory (viz. parity, Hamming codes, CRC) are not as effective for online correction of multiple-bit errors as they are for detection of a few bit errors. Moreover, software-implemented ECC carries a significant overhead in both time and code redundancy. In other words, built-in ECC in memory cannot recover from all bit errors; some it can only detect. As a result, when ECC detects an error, the application program must be restarted and re-executed afresh in various microprocessor-based application systems. ECC alone is therefore useful for designing a fail-stop kind of system, but it suffers from high time redundancy. Other software-implemented fault-tolerance schemes are also of the fail-stop kind. The proposed SV-based approach, by contrast, can tolerate such errors without stopping the execution of an application.
This SV Scheme (SVS) aims to provide uninterrupted service at no extra monetary cost, in exchange for an acceptable increase in execution time and memory space. It is a non-fail-stop fault-tolerance scheme that can be implemented in various computing systems without additional spending, so a large part of society can gain reliable service from a low-cost, SV-based computing system.
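Run-time signature-based control-flow checking, mentioned in this abstract, can be pictured as follows. This is a generic sketch under assumed names (`LEGAL_SUCCESSORS`, `FlowChecker`, and the four block labels are hypothetical), not the SVS implementation: each program block announces its signature on entry, and a checker verifies that the transition from the previous block is a legal one.

```python
class ControlFlowError(Exception):
    """Raised when execution reaches a block through an illegal transition."""


# Legal successors for a hypothetical four-block program.
LEGAL_SUCCESSORS = {
    "START": {"READ"},
    "READ": {"COMPUTE"},
    "COMPUTE": {"WRITE"},
    "WRITE": set(),
}


class FlowChecker:
    """Tracks the signature of the block currently executing."""

    def __init__(self) -> None:
        self.current = "START"

    def enter(self, block: str) -> None:
        # Validate the run-time signature against the expected control flow.
        if block not in LEGAL_SUCCESSORS[self.current]:
            raise ControlFlowError(f"illegal jump {self.current} -> {block}")
        self.current = block


def run(checker: FlowChecker, order: list[str]) -> str:
    """Enter the blocks in the given order and return the last block reached."""
    for block in order:
        checker.enter(block)
    return checker.current
```

In a non-fail-stop design, the handler for `ControlFlowError` would re-execute from a safe point rather than halt; that recovery step is outside this sketch.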

Software fault avoidance issues

Article

Nov 2006

Goutam Saha

This article discusses various issues of software fault avoidance. Software fault avoidance aims to produce fault-free software through various approaches that share the objective of reducing the number of latent defects in software programs.

Software-Implemented Fault Detection Approaches

Article

May 2008

Goutam Saha

A comparative study of various software-implemented fault detection approaches is briefly described in tabular form.

A low-cost correction algorithm for transient data errors

Article

May 2006

Software based fault tolerance

Article

Jul 2006

Goutam Saha

Software Based Fault Tolerant Computing Using Redundancy

Article

Nov 2005

Goutam Saha

This paper examines a software-based fault-tolerant computing approach that uses triplicate redundancy and recovery. The approach is not intended to tolerate software design bugs; it is intended to tolerate various environmental faults during the execution time of a computer-controlled system. Application data corruption due to electrical transients is detected and recovered immediately so that the system stays under control. The proposed approach is a low-cost tool for designing a robust industrial application system that can tolerate errors due to electrical surges and transients. It does not rely on design diversification.

Software-Based Fault Tolerant Computing

Article

Nov 2005

Goutam Saha

This paper describes how to design a software-based fault-tolerant application using a microprocessor (MP) in order to tolerate burst errors in memory. This approach may be called a single-version scheme (SVS). The SVS relies on a single-version application program that is enhanced with self-checking code redundancy to tolerate memory burst errors, which are difficult to correct during the run time of an application. Conventionally, other software-based approaches can only detect a few bit errors in memory, offering a fail-stop kind of fault tolerance against transient bit errors. Reed-Solomon codes are mainly effective for burst errors in the offline coding of audio Compact Disks. The proposed online technique does not need multiple versions of software or multiple machines. It employs only two copies of the enhanced version of the application, running on one machine, for online error detection and tolerance. This is an effective low-cost online tool for hardening a microprocessor-based industrial computing system, or for on-chip DRAM applications, against burst errors in processor memory at an affordable cost in code and time redundancy. The SVS aims to provide a non-fail-stop kind of fault tolerance against burst errors. The approach also supplements the Error Correcting Codes (ECC) in a memory system against both transient and permanent bit errors in memory.

A fault-tolerant scheduling problem

Article

Nov 1986

A real-time system must be reliable if a failure to meet its timing specifications might endanger human life, damage equipment, or waste expensive resources. A deadline mechanism has been previously proposed to provide fault tolerance in real-time software systems. The mechanism trades the accuracy of the results of a service for timing precision. Two independent algorithms are provided for each service subject to a deadline. The primary algorithm produces a good quality service, although its real-time reliability may not be assured. The alternate algorithm is reliable and produces an acceptable response. An algorithm to generate an optimal schedule for the deadline mechanism is introduced, and a simple and efficient implementation is discussed. The schedule ensures the timely completion of the alternate algorithm despite a failure to complete the primary algorithm within real time.
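The deadline mechanism described in this abstract can be illustrated with a deterministic toy model. This is a sketch under simplifying assumptions (a single service, execution times known to the simulation, integer time units) and does not reproduce the paper's optimal scheduling algorithm; `Algorithm` and `serve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Algorithm:
    name: str
    exec_time: int  # time units this run needs (assumed known for the simulation)


def serve(primary: Algorithm, alternate: Algorithm, deadline: int) -> tuple[str, int]:
    """Return (name of the algorithm whose result is delivered, completion time).

    The alternate is reserved the last `alternate.exec_time` units before the
    deadline; the primary may use only the time before that reservation.
    """
    latest_alternate_start = deadline - alternate.exec_time
    if latest_alternate_start < 0:
        raise ValueError("alternate algorithm cannot finish before the deadline")
    if primary.exec_time <= latest_alternate_start:
        # Primary finished in time: deliver its better-quality result.
        return primary.name, primary.exec_time
    # Primary overran: abandon it and run the alternate, which ends at the deadline.
    return alternate.name, deadline
```

With a deadline of 10 and an alternate needing 3 units, a primary needing 4 units delivers its own result at time 4, while a primary needing 9 units is abandoned at time 7 and the alternate completes at time 10.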

Error control systems for digital communication and storage

Article

Jan 1994

S. B. Wicker

A software fix towards fault-tolerant computing

Article

May 2005

Goutam Saha

This article describes a low-cost software technique for transient fault detection and fault tolerance in a processing system. The random errors caused by potential transients, such as Electrical Fast Transients (EFT), can be controlled by the proposed technique. Transient errors, if present, are detected, and the necessary recovery action can then be taken to attain higher system reliability and fault tolerance. For application design engineers, it is a more cost-effective tool than traditional expensive hardware fixes or N-Version Programming.

Error-correcting coding theory

Article

Man Young Rhee

Transient software fault tolerance using single-version algorithm

Article

Aug 2005

Goutam Saha

Data networks (2. ed.).

Book

Jan 1992

Fault Tolerance: Principles and Practice

Book

Jan 1981

A treatise on fault tolerant techniques to enhance the dependability (reliability, safety, security, ...) of a computing system, with a particular emphasis on the need for fault tolerance to all defects – including those of design.
