Conference Paper

Two Fault Injection Techniques for Test of Fault Handling Mechanisms


Abstract

Two fault injection techniques for experimental validation of fault handling mechanisms in computer systems are investigated and compared. One technique is based on irradiation of ICs with heavy-ion radiation from a 252Cf source. The other technique uses voltage sags injected into the power supply rails of ICs. Both techniques have been used for fault injection experiments with the MC6809E microprocessor. Most errors generated by the 252Cf method were seen first on the address bus, while the power supply disturbances most frequently affected the control signals. An error classification shows that both methods generate many control flow errors, while pure data errors are infrequent. Results from a simulation experiment show that the low number of data errors in the 252Cf experiments can be explained by the fact that many errors in data registers are overwritten during normal program execution.
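
The masking effect described at the end of the abstract is easy to reproduce in a toy model. The sketch below is an illustration only, not the authors' simulation; the register names and the random access trace are invented. It classifies an injected register fault as masked when the register's next access is a write, i.e. the corrupted value is overwritten before it is ever read.

```python
# Hedged sketch of data-error masking (toy model, not the paper's simulator):
# a transient fault in a data register only becomes an error if the register
# is read before normal execution rewrites it.
import random

REGS = "ABCD"
# Hypothetical register access trace: (cycle, register, 'r'ead or 'w'rite)
trace = [(c, random.choice(REGS), random.choice("rw")) for c in range(1000)]

def classify(trace, reg, inject_cycle):
    """'activated' if reg is read before its next write, else 'masked'."""
    for cycle, r, op in trace:
        if cycle <= inject_cycle or r != reg:
            continue
        return "activated" if op == "r" else "masked"
    return "masked"                       # register never touched again

outcomes = [classify(trace, random.choice(REGS), random.randrange(1000))
            for _ in range(10_000)]
print("masked fraction:", outcomes.count("masked") / len(outcomes))
```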


... Many fault injection approaches can be applied to evaluate the tolerance of such an architecture: simulation-based approaches, for example simulation at the RTL (Register Transfer Level), which injects faults at the register-transfer level [138,139], and gate-level simulation [156]; other, hardware-based injections include pin-level injection [83][84][85], memory corruption [86], power supply disturbances [88], and laser injection [90]. In what follows we target fault injection based on simulations of high-level RTL descriptions, this fault injection process being nevertheless fully compatible with carrying out the experiments that validate our architectures. ...
... Fault injection is one of the main approaches for dependability evaluation. A large number of methods have been proposed, essentially based on physical validation of systems, including injection on circuit pins, memory corruption [86], heavy-ion injection, power supply disturbances [88], and laser fault injection [136,137]. None of these approaches can be used for a dependability evaluation before the circuit has actually been manufactured. ...
... - Power supply disturbance: based on adding a MOS transistor between the power supply and the Vcc pin of the circuit under analysis [88]. It injects faults by modifying the gate voltage of this transistor. ...
Thesis
Modern communication systems require ever higher throughputs in order to process constantly growing volumes of information. They must be flexible enough to handle multi-standard environments and scalable enough to adapt to future standards. For these systems, quality of service (QoS) must be guaranteed despite the evolution of microelectronic technologies, which increases the sensitivity of integrated circuits to external perturbations (particle strikes, loss of signal integrity, etc.). Fault tolerance is therefore becoming an important criterion for improving reliability and, consequently, quality of service. This thesis continues work carried out at the LICM laboratory on the architectural design of a high-throughput, low-cost, dependable transmission chain. It follows two main lines of research. The first concerns speed and flexibility, in particular the study and implementation of parallel-pipelined architectures dedicated to recursive convolutional encoders. The principle rests on optimizing the blocks that compute the remainder of the polynomial division, which is the critical operation of the encoding. This approach is generalized to recursive IIR filters. The main architectural characteristics sought are high flexibility and easy extensibility, while preserving functionality as well as a good balance between the amount of resources used (and hence silicon area) and the performance obtained (operating speed). The second line of research concerns the development of a design methodology for encoders that remain dependable in the presence of faults, thereby improving the tolerance of digital integrated circuits. The proposed approach consists of adding to the encoders extra blocks performing online hardware error detection, in order to obtain architectures that are dependable in the presence of faults. The proposed solutions achieve a good trade-off between complexity and operating frequency. To further improve throughput, we also propose parallel-pipelined versions of the dependable encoders. Several campaigns of single, double, and random fault injection were carried out on the encoders to evaluate the error detection rates. The study of dependable architectures was then extended to parallel-pipelined decoders for cyclic block codes. The chosen approach relies on a slight modification of the parallel-pipelined architectures developed.
... It has been issued as an Internal Report for early dissemination only. Its distribution outside the University of Coimbra prior to publication is limited to peer communications and specific requests. Figure 2: first error manifestation of internal processor faults at the pin level [10, 22, 23]. ... delay in the results is not very important. ...
... Previous fault injection works on the evaluation of error detection techniques [10, 18, 23, 15, 26] have shown that small changes in the experimental conditions might cause significant changes in the results. Aspects such as the type of faults, the workload, and the type of memory cycle affected by the faults have great impact on the results. ...
... It is worth saying that the injection of physical faults and the analysis of results in complex processors is not simple. Even the most recent works in this field [10, 23, 26] have used fairly simple processors of about the same complexity as the Z80. Another reason for the choice of the target system is that it is very difficult (or even impossible) to implement the hardware required by signature monitoring techniques (an interesting class of error detection techniques) outside many recent processors, because most of them use complex prefetch queues or pipelining. ...
Conference Paper
Full-text available
Traditionally, fail-silent computers are implemented by using massive redundancy (hardware or software). In this research we investigate whether it is possible to obtain a high degree of fail-silent behavior from a computer without hardware or software replication, using only simple behavior-based error detection techniques. It is assumed that if the errors caused by a fault are detected in time, it will be possible to stop the erroneous computer behavior, thus preventing the violation of the fail-silent model. The evaluation technique used in this research is physical fault injection at the pin level. Results obtained by injecting about 20,000 different faults into two different target systems have shown that: in a system without error detection, up to 46% of the faults caused a violation of the fail-silent model; in a computer with behavior-based error detection, the percentage of faults that caused a violation of the fail-silent model was reduced to values from 2.3% to 0.4%; and the results depend strongly on the target system, on the program under execution during the fault injection, and on the type of faults.
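
A minimal sketch of the outcome bookkeeping behind such a campaign (the function and category names are illustrative, not the authors' tooling): each faulty run is compared against a golden run and binned into the categories the abstract reports.

```python
# Hedged sketch: classify one fault injection run with respect to the
# fail-silent model. A violation means wrong output with no detection.
def classify_run(outputs, golden, detected, hung):
    if hung:
        return "system hang"
    if detected:
        return "error detected (fail-silent behavior preserved)"
    if outputs != golden:
        return "fail-silent violation"    # silent wrong result
    return "no effect"

runs = [
    ({"result": 42}, {"result": 42}, False, False),  # fault masked
    ({"result": 41}, {"result": 42}, False, False),  # silent data corruption
    ({"result": 41}, {"result": 42}, True,  False),  # caught in time
]
for outputs, golden, detected, hung in runs:
    print(classify_run(outputs, golden, detected, hung))
```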
... • Power supply disturbances are rarely used for fault injection because of low repeatability. They have been used mainly as a complement to other fault injection techniques in the assessment of error detection mechanisms for small microprocessors [72]. ...
... In space, soft errors are caused by cosmic rays, i.e., highly energetic heavy-ion particles. The sensitivity of integrated circuits to heavy-ion radiation can be exploited for assessing the efficiency of fault tolerance mechanisms [51] and [72]. In [9] the three techniques described above (pin-level fault injection, electromagnetic interferences, heavy-ion radiation) have been used for the validation of the MARS architecture. ...
Book
This chapter briefly presents basic concepts, measures, and approaches for dependability characterization and benchmarking, and examples of benchmarks based on dependability modeling and measurements. We put emphasis on commercial off-the-shelf (COTS) components and COTS-based systems. To illustrate the various concepts, techniques, and results of dependability benchmarking, we present two examples of benchmarks: one addressing the system and service level of a COTS-based fault tolerant system, and one dedicated to COTS software components. The first benchmark shows how dependability modeling can be used to benchmark alternative architectural solutions of instrumentation and control systems of nuclear power plants. The benchmarked measure corresponds to system availability. The second benchmark shows how controlled experiments can be used to benchmark operating systems, taking Windows and Linux as examples. The benchmark measures are robustness, reaction time, and restart time.
... Most of the techniques fall into three categories: 1) hardware-based fault injection, 2) software-based fault injection, and 3) simulation-based fault injection. Hardware-based fault injection has been accomplished by such methods as bombarding the hardware with heavy-ion radiation [1], injecting voltage sags on the power rails of the hardware [1], and corrupting logic values at the pin level [2]. Other examples of hardware fault injection may be found in [3] and [4]. ...
Article
Fault injection provides a method of assessing the dependability of a system under test. Traditionally fault injection is employed near the end of the design process, after hardware and software prototypes have been developed. In order to eliminate costly re-designs near the end of the design process, a methodology for performing fault injection throughout the design process is described in this paper. The Walker-Thomas model of design is used to abstract system design into multiple levels. The fault injection methodology is applied to the design of a watchdog monitor card in a distributed computer system. Simulation results illustrate the fault injection methodology at multiple levels.
... The HWIFI techniques use additional hardware to inject faults into the system. The most common method involves simulating the natural cause of transient faults by using heavy ion radiation or power interference [16]-[18]. Compared with SWIFI techniques or simulation-based FI techniques, HWIFI techniques are closer to the real situation of a transient fault occurrence, are faster in testing speed, and require almost no additional modification to the target system. ...
Article
Full-text available
With the development of integrated circuit design technology, soft errors have become an important threat to system reliability, and software-based fault-tolerant techniques are gradually attracting people’s attention. In many cases, researchers use fault injection techniques that are less observable and less controllable to verify system reliability, and at this point, analysing where soft errors occur requires considerable work. In this paper, we present EBSCN, an error backtracking method. The EBSCN method sorts functions by suspiciousness by analysing the erroneous output results, which will help researchers reduce the amount of work required for analysis. The EBSCN method includes a feature extraction method based on clustering and a feature analysis method based on a deep neural network. This paper introduces the principle of the two methods as well as methods to improve and extend them with program-related information. We discuss the effect of the scale of the output result and the severity of the error on the EBSCN method through experiments and verify the effect of the EBSCN method. The results showed that the proportion of the function in which the soft error actually occurs in the ranking of the top 25% of the suspiciousness sequence is no less than 82%, and the proportion ranked in the top 50% is no less than 97%.
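
EBSCN's clustering and deep-neural-network pipeline is not reproduced here. As a deliberately simpler stand-in for the same goal of ranking functions by suspiciousness, the sketch below applies the Ochiai formula from spectrum-based fault localization to invented coverage data; all names and numbers are illustrative.

```python
# Hedged stand-in (Ochiai spectrum-based ranking, not EBSCN itself):
# functions executed mostly in failing runs score as more suspicious.
from math import sqrt

# coverage[run] = functions executed; failed[run] = erroneous output observed?
coverage = {0: {"encode", "pack"}, 1: {"encode", "crc"}, 2: {"pack", "crc"}}
failed   = {0: True, 1: True, 2: False}

def ochiai(fn):
    ef = sum(1 for r, fns in coverage.items() if fn in fns and failed[r])
    nf = sum(1 for r, fns in coverage.items() if fn not in fns and failed[r])
    ep = sum(1 for r, fns in coverage.items() if fn in fns and not failed[r])
    denom = sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

for fn in sorted({f for fns in coverage.values() for f in fns},
                 key=ochiai, reverse=True):
    print(f"{fn}: {ochiai(fn):.2f}")          # 'encode' ranks most suspicious
```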
... In addition, an HWFI fault injection can be caused by generating disturbances on the supply voltage [138]. ...
Thesis
Full-text available
Electronic systems have become a matter of course in many areas of daily life, while growing ever more complex and powerful. This creates design challenges that are increasingly addressed with SystemC. SystemC, a C++ class library, allows hardware components to be modelled in a programming language, something previously possible only with dedicated hardware description languages. Simulating the system together with fault effects is an important step in demonstrating reliability. Fault simulation has essentially two fields of application: on the one hand, classical fault simulation, which judges test patterns with respect to their effectiveness for the production test; on the other hand, often called simulated fault injection, the demonstration of fault tolerance or of system behaviour in operation. This thesis is the first methodical investigation of various aspects and strategies for simulating faults in digital circuits described in SystemC. Advantages of using SystemC include a uniform description language in a seamless flow and the possibility of fast execution. A prerequisite for simulating hardware is a suitable modelling capability. Besides the register-transfer and gate levels, a new switch-level modelling is presented; its advantage is a higher resolution at an acceptable loss of speed, which is particularly useful when simulating faults. For fault modelling, further models were created alongside the classical stuck-at model that reflect defects and disturbances more realistically. Faults are injected into the hardware description using several strategies. Besides porting methods known from other hardware description languages, further approaches in SystemC are shown that exploit special properties of the C++ language and of the SystemC kernel. These techniques result in a modification of the SystemC library that leads to very efficient fault injection. When new design tools are introduced, the problem of porting existing know-how often arises. One alternative can be the coupling of existing and new simulation programs. Problems of such coupled simulations lie, for example, in the additional communication effort and synchronization, which often increase execution time considerably. The mixed-language co-simulation presented here, in the form of a thread implementation and a further modification of the SystemC library, proves very efficient compared with other co-simulations. A key challenge for simulation tasks is executing them in acceptable time. Several implementations of accelerated simulation are presented, using different levels of parallelization, and investigated with respect to their performance; a combination of methods yields a further improvement. This thesis shows that simulating hardware faults with SystemC is practicable and worthwhile, as underlined by simulations and evaluations on example circuits.
... Fault injectors were initially designed to inject faults at a very low level (e.g., circuit and gate level) and implemented in hardware. Some examples are pin-level instrumentation to modify the values at the processor pins, insertion of probes to alter the electric current values in selected hardware points, and external hardware to expose the target to heavy-ion radiation (see [3], [7], [8], [9], [10]). To address the issues implementation cost and the difficulty that physical reachability to the target that hardware implemented fault injectors may require, tools started to be implemented in software which then allowed for the use of more complex fault models such as software faults which became a very relevant type of faults. ...
Article
The development of multicore applications presents a challenge to developers, especially those unfamiliar with the complex tasks of managing and debugging issues related to multiple processors. These technologies have been introduced in large-scale systems such as PCs, and then in mobile and handheld devices, due to their particular characteristics. Recently, usage of these architectures is becoming more and more interesting for embedded and safety-critical systems, but there is still some way to go. This work presents the breakthroughs and challenges identified in an ongoing research project that intends to create an appropriate fault model in order to test (with fault injection) multicore applications for usage in safety-critical domains.
... They detected the most potent locations for incorporating additional fault-tolerant features. Karlsson et al. [27] combined radiation and power supply disturbances to examine the propagation of internal errors to the bus of an MC6809E. They injected transient faults into this microprocessor by exposing it to heavy ions and pulses into the power supply. ...
... To overcome the limitations of NMR we implement a mitigation technique that can mitigate or recover the erroneous modules. Testing of fault tolerance can be done by testing the FPGA hardware directly, using heavy-ion injection or power supply disturbances [6]. However, this method is relatively expensive, and it is difficult to obtain the expected environment. ...
Article
Full-text available
Radiation hazards can lead to system faults; fault tolerance is therefore required. A fault-tolerant system is designed to keep operations running despite degradation occurring in a specific module. Many fault tolerance techniques have been developed to handle this problem, in search of the most robust and efficient ones the technology allows. This paper presents Five Modular Redundancy (FMR) with a mitigation technique to recover the erroneous module. With the Dynamic Partial Reconfiguration technology available today, such a fault tolerance technique can be implemented successfully. The project showed that the robustness of the system is increased and that an erroneous module can be recovered immediately.
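
A minimal sketch of the voting core of such a scheme (plain Python; the recovery step is reduced to flagging which modules to reconfigure, which in the paper is done via dynamic partial reconfiguration of the FPGA):

```python
# Hedged sketch of five-modular redundancy: majority voting over five module
# outputs, plus identifying the disagreeing modules so they can be repaired.
from collections import Counter

def fmr_vote(outputs):
    """Return (voted value, indices of modules that need recovery)."""
    voted, _ = Counter(outputs).most_common(1)[0]
    suspects = [i for i, v in enumerate(outputs) if v != voted]
    return voted, suspects

value, to_repair = fmr_vote([7, 7, 3, 7, 7])
print(value, to_repair)   # 7 [2] -> module 2 is scheduled for recovery
```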
... Fault injection is one of the main approaches for dependability evaluation. A large number of methods have been proposed, essentially based on physical validation of systems, including injection on circuit pins, memory corruption [1], heavy-ion injection, power supply disturbances [2], and laser fault injection [3]. None of these approaches can be used for a dependability evaluation before the circuit has actually been manufactured. ...
Article
Complex digital integrated circuits are now used in almost every application domain. Technology evolution makes these circuits increasingly sensitive to transient faults (for example, bit flips). It is therefore becoming critical to analyse the impact of such faults in order to reduce the probability of critical failure modes. This article presents two complementary aspects: fault injection in high-level descriptions, and the analysis of the results obtained from injection campaigns carried out at different description levels.
... Several works [1] [3] demonstrated that with randomly selected fault lists the percentage of faults which do not produce errors ranges from 10 to 66%, depending on the system. The goal of the fault injection experiment is to exercise the system's fault processing capabilities. ...
Conference Paper
Full-text available
In this paper a new approach is presented to build the list of faults to be used by a fault injection environment; the list is built starting from a high-level description of the system. The approach aims especially at identifying malicious faults, i.e. faults having a critical impact on system reliability. To overcome the complexity problem inherent in low-level descriptions, high-level ones are exploited, and alternative graphs are applied to carry out the cause-effect analysis, to build a fault tree, and to carry out fault collapsing. The reduced high-level malicious fault list is converted so that it can be used together with the low-level description for the final fault injection.
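
The paper's alternative-graph analysis is not reproduced here; the sketch below illustrates only the underlying idea of fault collapsing, on a toy two-gate netlist with stuck-at faults: faults whose injected behaviour is identical on every input pattern are merged into a single list entry.

```python
# Hedged sketch of simulation-based fault collapsing on a toy netlist
# out = (a AND b) OR c. Node names and the circuit are invented.
from itertools import product

def circuit(a, b, c, fault=None):
    v = {"a": a, "b": b, "c": c}
    if fault and fault[0] in v:
        v[fault[0]] = fault[1]          # stuck-at on a primary input
    n1 = v["a"] & v["b"]
    if fault and fault[0] == "n1":
        n1 = fault[1]                   # stuck-at on the internal node
    return n1 | v["c"]

faults = [(node, val) for node in ("a", "b", "c", "n1") for val in (0, 1)]
sig = {f: tuple(circuit(a, b, c, f) for a, b, c in product((0, 1), repeat=3))
       for f in faults}
classes = {}
for f, s in sig.items():
    classes.setdefault(s, []).append(f)
for group in classes.values():
    print(group)      # e.g. a/0, b/0 and n1/0 collapse into one class
```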
... Most of the approaches proposed up to now apply once the system or circuit is available. Such approaches include pin-level fault injection, memory corruption [29] [20] [19], heavy-ion injection [30] [23], power supply disturbances [16], laser fault injection [33] or software fault injection [22] [21] [7] [5] [6]. More recently, several authors proposed to apply fault injection early in the design process. ...
Conference Paper
In this paper, a new methodology for the injection of single event upsets (SEUs) in memory elements is introduced. SEUs in memory elements can occur for many reasons (e.g. particle hits, radiation) and at any time. It is therefore important to examine the behaviour of circuits when an SEU occurs in them. Reconfigurable hardware (especially FPGAs) has been shown to be suitable for emulating the behaviour of a logic design and for realising fault injection. The proposed methodology for SEU injection exploits FPGAs and, contrary to the most common fault injection techniques, performs the injection directly in the reconfigurable hardware, taking advantage of the run-time reconfiguration capabilities of the device. In this case, no modification of the initial design description is needed to inject a fault, which avoids hardware overheads and specific synthesis, place, and route phases.
... Most of the approaches proposed up to now apply once the system or circuit is available. Such approaches include pin-level fault injection, memory corruption [29] [20] [19], heavy-ion injection [30] [23], power supply disturbances [16], laser fault injection [33] or software fault injection [22] [21] [7] [5] [6]. ...
Conference Paper
Full-text available
In this paper, approaches using run-time reconfiguration (RTR) for fault injection in programmable systems are introduced. In FPGA-based systems an important characteristic is the time to reconfigure the hardware. With novel FPGA families (e.g. Virtex, AT6000) it is possible to reconfigure the hardware partially at run-time. Important time savings can be achieved when taking advantage of this characteristic for fault injection, as only a small part of the device must be reconfigured.
Chapter
Increasing design costs are the main challenge facing the semiconductor community today. Assuring the correctness of the design contributes a major part of the problem. However, while diagnosis and correction of errors are more time-consuming than error detection, they have received far less attention, both in terms of research work and of industrial tools introduced. An additional, orthogonal threat to the pace of development is the rapidly growing rate of soft errors in the emerging nanometer technologies. According to roadmaps, soft errors in sequential logic are becoming a more severe issue than in memories, which are currently protected against them. The design community is not ready for this kind of challenge because existing soft error escape identification methods for sequential logic are inadequate. This chapter addresses the above-mentioned challenges by presenting a holistic diagnosis approach for design error location and malicious fault list generation for soft errors. First, a method for locating design errors at the source level of hardware description language code, using the design representation of high-level decision diagrams, is explained. Subsequently, this method is reduced to malicious fault list generation at the high level. A minimized fault list is generated to optimize the time spent on the fault injection run necessary for assessing a design's vulnerability to soft errors.
Chapter
An approach for assessing the impact of physical injection of transient faults on control flow behaviour is described and evaluated. The fault injection is based on two complementary methods using heavy-ion radiation and power supply disturbances. A total of 6,000 transient faults were injected into the target microprocessor, an MC6809E 8-bit CPU, running three different benchmark programs. In the evaluation, the control flow errors were distinguished from those that had no effect on the correct flow of control, viz. the control flow OKs. The errors that led to wrong results are separated from those that had no effect on the correct results. The errors that had no effect on either the correct flow or the correct results are specified. Three error detection techniques, namely two software-based techniques and one watchdog timer, were combined and used in the test in order to characterize the detected and undetected errors. It was found that more than 87% of all errors and 93% of the control flow errors could be detected.
Chapter
Chapter outline: Introduction; Is Silent Data Corruption a Real Threat?; Temperature and Voltage Operating Test; Electrostatic Discharge Operating Test; Conclusions; Acknowledgment; References.
Article
Technology downscaling increases the sensitivity of integrated circuits to perturbations (particle strikes, loss of signal integrity, etc.). The erroneous behaviour of a circuit can be unacceptable, and a dependability analysis at a high abstraction level makes it possible to select the most efficient protections and to limit the timing overhead induced by a possible rework. This PhD aims at developing a methodology and an environment that improve the dependability analysis of digital integrated circuits. The proposed approach uses a hardware prototype of an instrumented version of the design to be analyzed. The environment includes three levels of execution, including an embedded software level that speeds up the experiments while keeping great flexibility: the user can obtain the best trade-off between the complexity of the analysis and the duration of the experiments. We also propose new techniques for the instrumentation and for the injection control in order to improve the performance of the environment. A predictive evaluation of the performance informs the designer about the most influential parameters and about the analysis duration for a given design and a given implementation of the environment. Finally, the methodology is applied to the analysis of two significant systems, including a hardware/software system built around a SparcV8 processor.
Conference Paper
Due to voltage and structure shrinking, the influence of radiation on a circuit’s operation increases, resulting in future hardware designs exhibiting much higher rates of soft errors. Software developers have to cope with these effects to ensure functional safety. However, software-based hardware fault tolerance is a holistic property that is tricky to achieve in practice, potentially impaired by every single design decision. We present FAIL*, an open and versatile architecture-level fault-injection (FI) framework for the continuous assessment and quantification of fault tolerance in an iterative software development process. FAIL* supplies the developer with reusable and composable FI campaigns, advanced pre- and post-processing analyses to easily identify sensitive spots in the software, well-abstracted back-end implementations for several hardware and simulator platforms, and scalability of FI campaigns by providing massive parallelization. We describe FAIL*, its application to the development process of safety-critical software, and the lessons learned from a real-world example.
Chapter
Software faults are currently one of the major causes of computer-based system failures. No software development methodology so far has produced fault-free software, and fault injection is recognized as a technique for understanding the effects of faults in a given software product. However, the injection of software faults is not trivial and is still an open research problem. This chapter provides an overview of the injection of software faults and focuses on the injection of faults for Java-based software, highlighting the new challenges specific to this language.
Article
Fault injection involves the deliberate insertion of faults or errors into a computer system in order to determine its response. It has proven to be an effective method for measuring the parameters of analytical dependability models, validating existing fault-tolerant systems, synthesizing new fault-tolerant designs, and observing how systems behave in the presence of faults. Growing dependence on computers in life- and cost-critical applications makes it increasingly important to understand and utilize this technique. This paper motivates the use of fault injection and develops a taxonomy for interpreting fault-injection experiments. Background on how faults affect computer systems is provided. Results from several recent fault-injection studies are reviewed. Tools that facilitate the use of fault injection are examined, and areas for future research are discussed.
Article
This paper addresses the problem of processor faults in distributed memory parallel systems. It shows that transient faults injected at the processor pins of one node of a commercial parallel computer, without any particular fault-tolerant techniques, can cause erroneous application results for up to 43% of the injected faults (depending on the application). In addition to these very subtle faults, up to 19% of the injected faults (almost independent of the application) caused the system to hang. These results show that fault-tolerant techniques are absolutely required in parallel systems, not only to ensure the completion of long-running applications but, more importantly, to achieve confidence in the application results. The benefits of including some fairly simple behaviour-based error detection mechanisms in the system were evaluated together with Algorithm Based Fault Tolerance (ABFT) techniques. The inclusion of such mechanisms in parallel systems seems to be very important for detecting most of those subtle errors without greatly affecting the performance and the cost of these systems.
Article
The size and complexity of today's multiprocessor systems require the development of new techniques to measure their dependability. An effective technique for injecting faults in message-passing multiprocessor systems is presented. Interrupt messages are used to trigger fault injection routines in the targeted processors. Any fault that can be emulated by a modification of the memory content of processors can be injected, which includes faults that could occur within the processors, the memories, and even the communication network. The proposed technique allows control of the time and location of faults as well as of other characteristics. It has been used in a prototype multiprocessor system running real applications in order to compare the efficiency of various error detection and correction mechanisms.
Article
Full-text available
An important research topic deals with the investigation of whether a non-duplicated computer can be made fail-silent, since that behaviour is assumed a priori in many algorithms. However, previous research has shown that in systems using a simple behaviour-based error detection mechanism invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations could be higher than 10%. Since the study of these errors has shown that they were mostly caused by pure data errors, we evaluate the effectiveness of software techniques capable of checking the semantics of the data, such as assertions, to detect these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escape the simple hardware-based error detection techniques. In order to decouple the intrinsic limitations of the tests used from other factors that might affect their error detection capabilities, we evaluated a special class of software checks known for its high theoretical coverage: algorithm-based fault tolerance (ABFT). The analysis of the remaining errors showed that most of them remained undetected due to short-range control flow errors. When very simple software-based control flow checking was combined with the semantic tests, the target system, without any dedicated error detection hardware, behaved according to the fail-silent model for about 98% of all the faults injected.
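
The ABFT class of checks mentioned above is commonly illustrated with checksum matrix multiplication (the Huang-Abraham scheme). A minimal sketch, assuming NumPy and arithmetic exact enough for the tolerance used:

```python
# Hedged sketch of ABFT checksum matrix multiplication: augment the operands
# with row/column checksums; a later checksum mismatch exposes a silent error.
import numpy as np

def abft_matmul(A, B):
    """Multiply with checksum augmentation; returns the full checksum matrix."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # add column-checksum row
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # add row-checksum column
    return Ac @ Br

def verify(C):
    data = C[:-1, :-1]
    return (np.allclose(C[-1, :-1], data.sum(axis=0))
            and np.allclose(C[:-1, -1], data.sum(axis=1)))

C = abft_matmul(np.arange(9.).reshape(3, 3), np.eye(3))
print(verify(C))     # True: checksums consistent
C[1, 1] += 5.0       # emulate a silent data error in the result
print(verify(C))     # False: the checksum test exposes the corruption
```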
Conference Paper
Analyzing the behavior of ICs subjected to soft errors is now mandatory, even for applications running at sea level, to prevent malfunctions in critical applications such as automotive. This paper introduces a novel prototyping-based fault injection environment that makes it possible to perform several types of dependability analyses in a common optimized framework. The approach takes advantage of the speed of hardware and of the flexibility of software offered by embedded processors to achieve optimized trade-offs between experiment duration and processing complexity. The partitioning of tasks between hardware and embedded software is defined with respect to the type of circuit to analyze.
Conference Paper
Full-text available
This paper discusses the problems of pin-level fault injection for dependability validation and presents the architecture of a pin-level fault injector called RIFLE. This system can be adapted to a wide range of target systems and the faults are mainly injected in the processor pins. The injection of the faults is deterministic and can be reproduced if needed. Faults of different nature can be injected and the fault injector is able to detect whether the injected fault has produced an error or not without the requirement of feedback circuits. RIFLE can also detect specific circumstances in which the injected faults do not affect the target system. Sets of faults with specific impact on the target system can be generated. The paper also presents fault injection results showing the coverage and latency achieved with a set of simple behavior based error detection mechanisms. It is shown that up to 72.5% of the errors can be detected with fairly simple mechanisms. Furthermore, for over 90% of the faults the target system has behaved according to the fail-silent model, which suggests that a traditional computer equipped with simple error detection mechanisms is relatively close to a fail-silent computer.
Conference Paper
Full-text available
Ninja is a new framework that makes it easy to create robust scalable Internet services. We introduce a new programming model based on the natural parallelism of large-scale services, and show how to implement the model. The first key aspect of the model is intelligent connection management, which enables high availability, load balancing, graceful degradation and online evolution. The second key aspect is support for shared persistent state that is automatically partitioned for scalability and replicated for fault tolerance. We discuss two versions of shared state, a cluster-based hash table with transparent replication and novel features that reduce lock contention, and a cluster-based file system that provides local transactions and cluster-wide namespaces and replication. Using several applications we show that the framework enables the creation of scalable, highly available services with persistent data, with very little application code, as little as one-tenth the code size of comparable stand-alone applications.
Conference Paper
In this paper, a simulation-based fault injection technique is used to study various aspects of the error behavior of a 32-bit pipelined RISC, using a register-level model written in VHDL. Consequences of the fault location on the error behavior were studied specifically. Finally, the efficiency of two built-in error-detecting methods was evaluated.
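
A much-reduced analogue of such a simulation-based experiment (a toy accumulator model in Python rather than a VHDL RISC; the program and registers are invented): flip one register bit at one cycle and compare against a golden run. Even this toy already reproduces the dependence of the error behavior on fault location and injection time.

```python
# Hedged sketch of register-level simulated fault injection on a toy model.
def run(program, flip=None):                 # flip = (cycle, reg, bit)
    regs = {"acc": 0, "tmp": 0}
    for cycle, (op, reg, val) in enumerate(program):
        if op == "load":
            regs[reg] = val
        elif op == "add":
            regs["acc"] += regs[reg]
        if flip and flip[0] == cycle:
            _, r, bit = flip
            regs[r] ^= 1 << bit              # single-event upset in a register
    return regs["acc"]

program = [("load", "tmp", 5), ("add", "tmp", 0), ("add", "tmp", 0)]
golden = run(program)
for cycle in range(len(program)):
    for bit in range(4):
        faulty = run(program, flip=(cycle, "tmp", bit))
        print(f"cycle {cycle}, bit {bit}:",
              "effective" if faulty != golden else "masked")
```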
Article
A fault-tolerant computer system's dependability must be validated to ensure that its redundancy has been correctly implemented and the system will provide the desired level of reliable service. Fault injection, the deliberate insertion of faults into an operational system to determine its response, offers an effective solution to this problem. We survey several fault injection studies and discuss tools such as React (Reliable Architecture Characterization Tool) that facilitate its application.
Conference Paper
This paper presents a methodology and supporting set of tools for performing fault simulation throughout the design process for large systems. Most of the previous work in fault simulation has sought efficient methods for simulating faults at a single level design abstraction. This paper has developed a methodology for performing fault simulation of design models at the architectural, algorithmic, functional-block, and gate levels of design abstraction. As a result, fault simulation is supported throughout the design process from system definition through hardware/software implementation. Furthermore, since the fault simulation utilities are provided in an advanced design environment prototype tool, an iterative design/evaluation process is available for system designers at each stage of design refinement. The two key contributions of this paper are: a fault simulation methodology and supporting tools for performing fault simulation throughout the design process of large systems; and a methodology for performing fault simulation concurrently in hardware and software component designs and a proof-of-concept implementation
Conference Paper
The expanding size and complexity of dependable computing systems has increased their cost and at the same time complicated the process of estimating dependability attributes such as fault coverage and detection latency. One approach to estimating such parameters is to employ fault injection; however, algorithms are needed to generate a list of faults to inject. Unlike randomly selected faults, a fault list is needed which is guaranteed to cause either system failure or the activation of mechanisms which cover the injected fault. This research effort has developed an automated technique for selecting faults to use during fault injection experiments. The technique is general in nature and can be applied to any computing platform. The primary objective of this research effort was the development and implementation of the algorithms to generate a fault set which exercises the fault detection and fault processing aspects of the system. The end result is a completely automated method for evaluating complex dependable computing systems by estimating fault coverage and fault detection latency.
Conference Paper
This paper addresses the detection of permanent and transient faults in complex VLSI circuits, with a particular focus on faults leading to sequencing errors. Several Finite State Machine implementations using signature monitoring for control-flow checking are compared in terms of error detection latency, theoretical error coverage, experimental error coverage and area overheads. Advantages and drawbacks of each approach are presented
Conference Paper
Two software-based techniques for online detection of control flow errors were evaluated by fault injection. One technique, called block signature self-checking (BSSC), checks the control flow between program blocks. The other, called error capturing instructions (ECIs), inserts ECIs in the program area, the data area, and the unused area of the memory. To demonstrate these techniques, a program has been developed which modifies the executable code for the MC6809E 8-bit microprocessor. The error detection techniques were evaluated using two fault injection techniques: heavy-ion radiation from a californium-252 source and power supply disturbances. Combinations of the two error detection techniques were tested for three different workloads. A combination of BSSC, ECIs, and a watchdog timer was also evaluated.
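
A plain-Python analogue of the BSSC mechanism (the published technique patches MC6809E executable code; the block names and signatures here are invented): a run-time signature set at block entry is compared with the block's embedded reference signature at block exit, so control flow that slips into or out of a block is detected.

```python
# Hedged sketch of block signature self-checking (BSSC), not the MC6809E tool.
BLOCK_SIG = {"init": 0x3A, "loop": 0x5C, "done": 0x91}  # assigned offline

runtime_sig = None

def block_entry(block):
    global runtime_sig
    runtime_sig = BLOCK_SIG[block]          # signature set on block entry

def block_exit(block):
    if runtime_sig != BLOCK_SIG[block]:     # mismatch => control flow error
        raise RuntimeError(f"control flow error detected at exit of {block!r}")

block_entry("init")
block_exit("init")                          # correct flow: passes
block_entry("loop")
block_exit("done")                          # erroneous branch into 'done': raises
```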
Article
Full-text available
Fault injection is an important technique for the evaluation of design metrics such as reliability, safety, and fault coverage. Fault injection involves inserting faults into a system and monitoring the system to determine its behavior in response to the fault. Recently, designers are realizing the advantages of using simulation to perform fault injection on a model of the design, as opposed to performing the fault injection on the actual system. As designs become more complex, the need arises to perform fault injection at earlier stages in the design process, specifically at the behavioral level of abstraction where a functional description exists for the various components of the design, but the implementation details have not been developed. A technique has been developed that allows for the injection of faults into a Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) behavioral-level model. The corruption of a VHDL signal (the fault injection) is accomplished...
Article
Fault injection is an effective method for studying the effects of faults in computer systems and for validating fault-handling mechanisms. The approach presented involves injecting transient faults into integrated circuits by using heavy-ion radiation from a Californium-252 source. The proliferation of safety-critical and fault-tolerant systems using VLSI technology makes such attempts to inject faults at internal locations in VLSI circuits increasingly important.
Article
An approach for assessing the impact of physical injection of transient faults on processor execution is described and evaluated. The fault injection is based on two complementary methods using (1) heavy-ion radiation and (2) power supply disturbances. 12,000 transient faults were injected into the target microprocessor, a Motorola MC6809E 8-bit CPU, running 3 different workloads. In the evaluation, the control-flow errors were distinguished from those that had no effect on the correct flow of control. The errors that led to wrong results are separated from those that did not affect the correct results. The errors that affected neither the correct control flow nor the correct results are specified. Effects of errors on the registers and signals of the processor are characterized. Workload dependency of error rates is demonstrated. Three error-detection mechanisms (2 software-based mechanisms and 1 watchdog timer) were combined and used to characterize the detected and undetected errors. More than 87% of all errors and 93% of the control-flow errors could be detected. In a different test, the efficiency of an isolated watchdog timer was evaluated. The coverage of the isolated watchdog timer was only 62%. The results indicate that fault-injection methods, workloads, and programming languages all affect the control flow, coverage, latency, and error rates differently.
Article
The size and complexity of modern dependable computing systems has significantly compromised the ability to accurately measure system dependability attributes such as fault coverage and fault latency. Fault injection is one approach for the evaluation of dependability metrics. Unfortunately, fault injection techniques are difficult to apply because the size of the fault set is essentially infinite. Current techniques select faults randomly resulting in many fault injection experiments which do not yield any useful information. This research effort has developed a new deterministic, automated dependability evaluation technique using fault injection. The primary objective of this research effort was the development and implementation of algorithms which generate a fault set which fully exercises the fault detection and fault processing aspects of the system. The theory supporting the developed algorithms is presented first. Next, a conceptual overview of the developed algorithms is followed by the implementation details of the algorithms. The last section of this paper presents experimental results gathered via simulation-based fault injection of an Interlocking Control System (ICS). The end result is a deterministic, automated method for accurately evaluating complex dependable computing systems using fault injection
Article
Full-text available
This paper discusses the problems of pin-level fault injection for dependability validation and presents the architecture of a pin-level fault injector called RIFLE. This system can be adapted to a wide range of target systems and the faults are mainly injected in the processor pins. The injection of the faults is deterministic and can be reproduced if needed. Faults of different nature can be injected and the fault injector is able to detect whether the injected fault has produced an error or not without the requirement of feedback circuits. RIFLE can also detect specific circumstances in which the injected faults do not affect the target system. Sets of faults with specific impact on the target system can be generated. The paper also presents fault injection results showing the coverage and latency achieved with a set of simple behavior based error detection mechanisms. It is shown that up to 72.5% of the errors can be detected with fairly simple mechanisms. Furthermore, for over 90% of the faults the target system has behaved according to the fail-silent model, which suggests that a traditional computer equipped with simple error detection mechanisms is relatively close to a fail-silent computer.
Article
This paper describes a new methodology to inject transient and permanent faults in digital systems. For this purpose, the simulation-based fault injector VERIFY (VHDL-based Evaluation of Reliability by Injecting Faults efficientlY) has been developed, which allows fault injection at several abstraction levels of a digital system. The combined approach of injection and analysis of the results enables the system engineer to evaluate the reliability of the system as well as the coverage of the fault tolerance mechanisms applied to the system. The approach is applied to the DP32 processor, where faults are injected at pin level, by flipping bits in internal registers, and at gate level. The results of this comparison show that the time to recover from a fault and the total number of faults which lead to a recovery differ significantly according to the type of fault injection. Whereas in the first 2 s after fault injection using the bit-flip fault model more than 79% of all faults injected were still present in the system, only 4.5% remained in the processor when using a stuck-at fault model at gate level. Keywords: fault injection, VHDL, dependable systems, transient faults, experimental analysis, system validation, fault/error models.
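
The difference reported between the bit-flip and stuck-at experiments follows from the fault models themselves. A toy sketch of the contrast (this is not VERIFY, whose injection operates on VHDL models): a transient flip corrupts the state once, while a stuck-at fault re-asserts the faulty value on every cycle.

```python
# Hedged sketch contrasting a transient bit flip with a permanent stuck-at
# fault on a register that is reloaded from the datapath each cycle.
def run(cycles, fault=None):                 # fault = ("flip"|"stuck", cycle, bit)
    history = []
    for c in range(cycles):
        state = c & 0xFF                     # register reloaded every cycle
        if fault and fault[0] == "flip" and c == fault[1]:
            state ^= 1 << fault[2]           # transient: corrupts this cycle only
        if fault and fault[0] == "stuck" and c >= fault[1]:
            state &= ~(1 << fault[2])        # permanent: bit forced to 0 forever
        history.append(state)
    return history

golden = run(8)
flips = sum(a != b for a, b in zip(golden, run(8, ("flip", 2, 0))))
stuck = sum(a != b for a, b in zip(golden, run(8, ("stuck", 2, 0))))
print(flips, stuck)          # 1 vs 3: the stuck-at fault keeps differing
```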
Conference Paper
Full-text available
Fault injection is used to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors: (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate total loss of the primary service to occur in only 16% of the faults; (3) reveal errors termed potential hazards that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and modeling of availability
Article
Full-text available
The authors address the problem of validating the dependability of fault-tolerant computing systems, in particular, the validation of the fault-tolerance mechanisms. The proposed approach is based on the use of fault injection at the physical level on a hardware/software prototype of the system considered. The place of this approach in a validation-directed design process and with respect to related work on fault injection is clearly identified. The major requirements and problems related to the development and application of a validation methodology based on fault injection are presented and discussed. Emphasis is put on the definition, analysis, and use of the experimental dependability measures that can be obtained. The proposed methodology has been implemented through the realization of a general pin-level fault injection tool (MESSALINE), and its usefulness is demonstrated by the application of MESSALINE to the experimental validation of two systems: a subsystem of a centralized computerized interlocking system for railway control applications and a distributed system corresponding to the current implementation of the dependable communication system of the ESPRIT Delta-4 Project
Chapter
The use of heavy-ion radiation from Californium-252 (252Cf) as a fault injection method for experimental validation of dependable computing systems is described and discussed. Heavy ions from 252Cf can cause transient faults as they pass through depletion regions in integrated circuits. Irradiation of a circuit must be performed in a vacuum. A design of a portable miniature vacuum chamber which contains one integrated circuit and a 252Cf source is presented. The vacuum chamber can be plugged directly into an IC socket of the same type as the irradiated circuit uses. For an effective system validation, the fault injection method used must cause a variety of errors in the irradiated circuit. Results from irradiation of the MC6809E 8-bit microprocessor are presented. They show that errors occurred on all output pins of the microprocessor, and that both single and multiple bit errors were observed. Sensitivity to heavy-ion radiation from 252Cf for various IC technologies is discussed briefly.
Book
This book is the first unified treatment of the analysis and design methods for protection of principally electronic systems from the deleterious effects of nuclear and electromagnetic radiation. Coverage spans from a detailed description of the nuclear radiation sources to pertinent semiconductor physics, then to hardness assurance. This work combines the disciplines of solid state physics, semiconductor physics, circuit engineering, nuclear physics, together with electronics and electromagnetic theory into a book that can be used as a text with problems at the end of the majority of the chapters. Written by veterans in the field, the most significant feature of this book is its comprehensive treatment of the phenomena involved. This treatment includes the analysis and design of the effect of nuclear radiation on electronic systems from the experimental, theoretical, and engineering viewpoints. Unique pedagogical attempts are employed to make the material more understandable from the position of an enlightened engineering and scientific readership whose task is the design and analysis of radiation hardened electronic systems.
Conference Paper
Original fault and error insertion experiments determining the effects of transient faults in microprocessor controllers are described. The fault insertion experiments qualify the most probable errors (at the logical level) caused by transient faults resulting from electrical noise. A wide class of microprocessor circuits is considered, and measurements of static and dynamic noise margins are presented. The error insertion experiments allow evaluation of the sensitivity of the controller to various errors (fault tolerance, error coverage, and error latency). These experiments are carried out using a universal microprocessor emulator.
Conference Paper
Fault injection by heavy-ion radiation from Californium-252 could become a useful method for experimental verification and validation of error handling mechanisms used in computer systems. Heavy ions emitted from Cf-252 have the capacity to cause transient faults and soft errors in integrated circuits. In this paper, results of initial fault injection experiments using the MC6809E 8-bit microprocessor are presented. The purpose of the experiments was to investigate the variation of the error behavior seen on the external buses when the microprocessor chip was irradiated by a Cf-252 source. The variation of the error behavior is imperative for an effective evaluation of error handling mechanisms, e.g. those designed to detect errors caused by a microprocessor. The experiments showed that the errors seen on the external buses of the MC6809E were well spread, both in terms of location and number of bits affected.
Conference Paper
Several concurrent error detection schemes suitable for a watch-dog processor were evaluated by fault injection. Soft errors were induced into a MC6809E microprocessor by heavy-ion radiation from a Californium-252 source. Recordings of error behavior were used to characterize the errors as well as to determine coverage and latency for the various error detection schemes. The error recordings were used as input to programs that simulate the error detection schemes. The schemes evaluated detected up to 79% of all errors within 85 bus cycles. Fifty-eight percent of the errors caused execution to diverge permanently from the correct program. The best schemes detected 99% of these errors. Eighteen percent of the errors affected only data, and the coverage of these errors was at most 38%
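A minimal sketch of the tabulation step behind such figures (the latency values are invented, not the paper's recordings): replaying each recorded error through a simulated detection scheme yields a detection latency in bus cycles, or none when the scheme misses the error; coverage and worst-case latency follow directly.

```python
# Hedged sketch: per-error detection latencies in bus cycles, None = undetected.
latencies = [12, 85, 3, None, 40, None, 7, 1, None, 22]

detected = [l for l in latencies if l is not None]
coverage = len(detected) / len(latencies)
print(f"coverage: {coverage:.0%}")                        # 70%
print(f"worst-case latency: {max(detected)} bus cycles")  # 85
```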
Article
JPL and Aerospace have collected an extensive set of heavy ion single event upset (SEU) test data since their last joint publication in December, 1985. Trends in SEU susceptibility for state-of-the-art parts are presented.
Article
A Cf-252 irradiation facility for single event testing of microcircuits has been developed. Testing techniques have been refined to include the capability of determining LET thresholds as well as event cross-sections. The capabilities and limitations of Cf-252 in testing to provide parameters for calculation of SEU rate in the heavy ion environment of space are discussed.
Article
New test data from the Jet Propulsion Laboratory (JPL), The Aerospace Corporation, Rockwell International (Anaheim) and IRT have been combined with published data of JPL [1,2] and Aerospace [3] to form a nearly comprehensive body of single event upset (SEU) test data for heavy ion irradiations. This data has been arranged to exhibit the SEU susceptibility of devices by function, technology and manufacturer. Clear trends emerge which should be useful in predicting future device performance.
Article
Heavy ion induced single event upsets and latch-up in 4K CMOS RAMs and PROMs have been demonstrated using both the Harwell Variable Energy Cyclotron and a laboratory Californium-252 source. The latter provides a novel and convenient alternative which complements heavy ion accelerator techniques. A number of memories have been examined by both techniques, enabling appropriate cross sections to be measured.
Article
A system has been developed that tests microprocessors for single-event upset (SEU) at the specified clock speed and without adding wait or hold states. This system compiles a detailed record of SEU-induced errors and has been used to test the Sandia SA3000 microprocessor and prototypes of its commercial equivalent, the Harris H80C85 at the Lawrence Berkeley Laboratory 88-inch cyclotron facility. Using appropriate test programs and analyzing the resulting upset data, the authors have established the SEU cross section of the major functional elements of the hardened processors. With these cross sections and the estimated duty factors of a `typical' program, they computed the expected upset rate in a parallel, normally incident, beam as a function of linear energy transfer and measured the rate in several cyclotron beams. Good agreement between the measured and calculated rates was obtained
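The rate computation described here is, in outline, a sum over functional elements of flux times cross section times duty factor. A back-of-the-envelope sketch; the element names and all numbers below are invented for illustration only.

```python
# Hedged sketch of an SEU rate estimate from per-element cross sections.
flux = 1.0e2                     # particles / (cm^2 s) above the LET threshold
elements = {                     # element: (SEU cross section cm^2, duty factor)
    "register file": (2.0e-5, 0.80),
    "ALU latches":   (5.0e-6, 0.30),
    "decoder":       (1.0e-6, 1.00),
}
rate = sum(flux * sigma * duty for sigma, duty in elements.values())
print(f"expected upset rate: {rate:.2e} upsets/s")   # 1.85e-03
```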
Article
The results of several experiments conducted using the fault-injection-based automated testing (FIAT) system are presented. FIAT is capable of emulating a variety of distributed system architectures, and it provides the capabilities to monitor system behavior and inject faults for the purpose of experimental characterization and validation of a system's dependability. The experiments consist of exhaustively injecting three separate fault types into various locations, encompassing both the code and data portions of memory images, of two distinct applications executed with several different data values and sizes. Fault types are variations of memory bit faults. The results show that there are a limited number of system-level fault manifestations. These manifestations follow a normal distribution for each fault type. Error detection latencies are found to be normally distributed. The methodology can be used to predict the system-level fault responses during the system design stage.