Conference Paper

Two Fault Injection Techniques for Test of Fault Handling Mechanisms


Abstract

Two fault injection techniques for experimental validation of fault handling mechanisms in computer systems are investigated and compared. One technique is based on irradiation of ICs with heavy-ion radiation from a 252Cf source. The other technique uses voltage sags injected into the power supply rails of ICs. Both techniques have been used for fault injection experiments with the MC6809E microprocessor. Most errors generated by the 252Cf method were seen first on the address bus, while the power supply disturbances most frequently affected the control signals. An error classification shows that both methods generate many control flow errors, while pure data errors are infrequent. Results from a simulation experiment show that the low number of data errors in the 252Cf experiments can be explained by the fact that many errors in data registers are overwritten during normal program execution.
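
The masking effect described at the end of the abstract is easy to reproduce in a toy model. The sketch below is an illustration only, not the authors' simulation; the register names and the random access trace are invented. It classifies an injected register fault as masked when the register's next access is a write, i.e. the corrupted value is overwritten before it is ever read.

```python
# Hedged sketch of data-error masking (toy model, not the paper's simulator):
# a transient fault in a data register only becomes an error if the register
# is read before normal execution rewrites it.
import random

REGS = "ABCD"
# Hypothetical register access trace: (cycle, register, 'r'ead or 'w'rite)
trace = [(c, random.choice(REGS), random.choice("rw")) for c in range(1000)]

def classify(trace, reg, inject_cycle):
    """'activated' if reg is read before its next write, else 'masked'."""
    for cycle, r, op in trace:
        if cycle <= inject_cycle or r != reg:
            continue
        return "activated" if op == "r" else "masked"
    return "masked"                       # register never touched again

outcomes = [classify(trace, random.choice(REGS), random.randrange(1000))
            for _ in range(10_000)]
print("masked fraction:", outcomes.count("masked") / len(outcomes))
```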


... Many fault injection approaches can be applied to evaluate the tolerance of such an architecture: simulation-based approaches, for example simulation at the RTL (Register Transfer Level), which injects faults at the register-transfer level [138,139], and gate-level simulation [156]; other, hardware-based injections include pin-level injection [83][84][85], memory corruption [86], power supply disturbances [88], and laser injection [90]. In what follows we target fault injection based on simulations of high-level RTL descriptions, this fault injection process being nevertheless fully compatible with carrying out the experiments that validate our architectures. ...
... Fault injection is one of the main approaches for dependability evaluation. A large number of methods have been proposed, essentially based on physical validation of systems, including injection on circuit pins, memory corruption [86], heavy-ion injection, power supply disturbances [88], and laser fault injection [136,137]. None of these approaches can be used for a dependability evaluation before the circuit has actually been manufactured. ...
... - Power supply disturbance: based on adding a MOS transistor between the power supply and the Vcc pin of the circuit under analysis [88]. It injects faults by modifying the gate voltage of this transistor. ...
Thesis
Modern communication systems require ever higher throughputs in order to process constantly growing volumes of information. They must be flexible enough to handle multi-standard environments and scalable enough to adapt to future standards. For these systems, quality of service (QoS) must be guaranteed despite the evolution of microelectronic technologies, which increases the sensitivity of integrated circuits to external perturbations (particle strikes, loss of signal integrity, etc.). Fault tolerance is therefore becoming an important criterion for improving reliability and, consequently, quality of service. This thesis continues work carried out at the LICM laboratory on the architectural design of a high-throughput, low-cost, dependable transmission chain. It follows two main lines of research. The first concerns speed and flexibility, in particular the study and implementation of parallel-pipelined architectures dedicated to recursive convolutional encoders. The principle rests on optimizing the blocks that compute the remainder of the polynomial division, which is the critical operation of the encoding. This approach is generalized to recursive IIR filters. The main architectural characteristics sought are high flexibility and easy extensibility, while preserving functionality as well as a good balance between the amount of resources used (and hence silicon area) and the performance obtained (operating speed). The second line of research concerns the development of a design methodology for encoders that remain dependable in the presence of faults, thereby improving the tolerance of digital integrated circuits. The proposed approach consists of adding to the encoders extra blocks performing online hardware error detection, in order to obtain architectures that are dependable in the presence of faults. The proposed solutions achieve a good trade-off between complexity and operating frequency. To further improve throughput, we also propose parallel-pipelined versions of the dependable encoders. Several campaigns of single, double, and random fault injection were carried out on the encoders to evaluate the error detection rates. The study of dependable architectures was then extended to parallel-pipelined decoders for cyclic block codes. The chosen approach relies on a slight modification of the parallel-pipelined architectures developed.
... It has been issued as an Internal Report for early dissemination only. Its distribution outside the University of Coimbra prior to publication is limited to peer communications and specific requests. Figure 2: first error manifestation of internal processor faults at the pin level [10, 22, 23]. ... delay in the results is not very important. ...
... Previous fault injection works on the evaluation of error detection techniques [10, 18, 23, 15, 26] have shown that small changes in the experimental conditions might cause significant changes in the results. Aspects such as the type of faults, the workload, and the type of memory cycle affected by the faults have great impact on the results. ...
... It is worth saying that the injection of physical faults and the analysis of results in complex processors is not simple. Even the most recent works in this field [10, 23, 26] have used fairly simple processors of about the same complexity as the Z80. Another reason for the choice of the target system is that it is very difficult (or even impossible) to implement the hardware required by signature monitoring techniques (an interesting class of error detection techniques) outside many recent processors, because most of them use complex prefetch queues or pipelining. ...
Conference Paper
Full-text available
Traditionally, fail-silent computers are implemented by using massive redundancy (hardware or software). In this research we investigate whether it is possible to obtain a high degree of fail-silent behavior from a computer without hardware or software replication, using only simple behavior-based error detection techniques. It is assumed that if the errors caused by a fault are detected in time, it will be possible to stop the erroneous computer behavior, thus preventing the violation of the fail-silent model. The evaluation technique used in this research is physical fault injection at the pin level. Results obtained by injecting about 20,000 different faults into two different target systems have shown that: in a system without error detection, up to 46% of the faults caused a violation of the fail-silent model; in a computer with behavior-based error detection, the percentage of faults that caused a violation of the fail-silent model was reduced to values from 2.3% to 0.4%; and the results depend strongly on the target system, on the program under execution during the fault injection, and on the type of faults.
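
A minimal sketch of the outcome bookkeeping behind such a campaign (the function and category names are illustrative, not the authors' tooling): each faulty run is compared against a golden run and binned into the categories the abstract reports.

```python
# Hedged sketch: classify one fault injection run with respect to the
# fail-silent model. A violation means wrong output with no detection.
def classify_run(outputs, golden, detected, hung):
    if hung:
        return "system hang"
    if detected:
        return "error detected (fail-silent behavior preserved)"
    if outputs != golden:
        return "fail-silent violation"    # silent wrong result
    return "no effect"

runs = [
    ({"result": 42}, {"result": 42}, False, False),  # fault masked
    ({"result": 41}, {"result": 42}, False, False),  # silent data corruption
    ({"result": 41}, {"result": 42}, True,  False),  # caught in time
]
for outputs, golden, detected, hung in runs:
    print(classify_run(outputs, golden, detected, hung))
```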
... • Power supply disturbances are rarely used for fault injection because of low repeatability. They have been used mainly as a complement to other fault injection techniques in the assessment of error detection mechanisms for small microprocessors [72]. ...
... In space, soft errors are caused by cosmic rays, i.e., highly energetic heavy-ion particles. The sensitivity of integrated circuits to heavy-ion radiation can be exploited for assessing the efficiency of fault tolerance mechanisms [51] and [72]. In [9] the three techniques described above (pin-level fault injection, electromagnetic interferences, heavy-ion radiation) have been used for the validation of the MARS architecture. ...
Book
This chapter briefly presents basic concepts, measures, and approaches for dependability characterization and benchmarking, and examples of benchmarks based on dependability modeling and measurements. We put emphasis on commercial off-the-shelf (COTS) components and COTS-based systems. To illustrate the various concepts, techniques, and results of dependability benchmarking, we present two examples of benchmarks: one addressing the system and service level of a COTS-based fault tolerant system, and one dedicated to COTS software components. The first benchmark shows how dependability modeling can be used to benchmark alternative architectural solutions of instrumentation and control systems of nuclear power plants. The benchmarked measure corresponds to system availability. The second benchmark shows how controlled experiments can be used to benchmark operating systems, taking Windows and Linux as examples. The benchmark measures are robustness, reaction time, and restart time.
... Most of the techniques fall into three categories: 1) hardware-based fault injection, 2) software-based fault injection, and 3) simulation-based fault injection. Hardware-based fault injection has been accomplished by such methods as bombarding the hardware with heavy-ion radiation [1], injecting voltage sags on the power rails of the hardware [1], and corrupting logic values at the pin level [2]. Other examples of hardware fault injection may be found in [3] and [4]. ...
Article
Fault injection provides a method of assessing the dependability of a system under test. Traditionally fault injection is employed near the end of the design process, after hardware and software prototypes have been developed. In order to eliminate costly re-designs near the end of the design process, a methodology for performing fault injection throughout the design process is described in this paper. The Walker-Thomas model of design is used to abstract system design into multiple levels. The fault injection methodology is applied to the design of a watchdog monitor card in a distributed computer system. Simulation results illustrate the fault injection methodology at multiple levels.
... The HWIFI techniques use additional hardware to inject faults into the system. The most common method involves simulating the natural cause of transient faults by using heavy ion radiation or power interference [16]-[18]. Compared with SWIFI techniques or simulation-based FI techniques, HWIFI techniques are closer to the real situation of a transient fault occurrence, are faster in testing speed, and require almost no additional modification to the target system. ...
Article
Full-text available
With the development of integrated circuit design technology, soft errors have become an important threat to system reliability, and software-based fault-tolerant techniques are gradually attracting people’s attention. In many cases, researchers use fault injection techniques that are less observable and less controllable to verify system reliability, and at this point, analysing where soft errors occur requires considerable work. In this paper, we present EBSCN, an error backtracking method. The EBSCN method sorts functions by suspiciousness by analysing the erroneous output results, which will help researchers reduce the amount of work required for analysis. The EBSCN method includes a feature extraction method based on clustering and a feature analysis method based on a deep neural network. This paper introduces the principle of the two methods as well as methods to improve and extend them with program-related information. We discuss the effect of the scale of the output result and the severity of the error on the EBSCN method through experiments and verify the effect of the EBSCN method. The results showed that the proportion of the function in which the soft error actually occurs in the ranking of the top 25% of the suspiciousness sequence is no less than 82%, and the proportion ranked in the top 50% is no less than 97%.
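
EBSCN's clustering and deep-neural-network pipeline is not reproduced here. As a deliberately simpler stand-in for the same goal of ranking functions by suspiciousness, the sketch below applies the Ochiai formula from spectrum-based fault localization to invented coverage data; all names and numbers are illustrative.

```python
# Hedged stand-in (Ochiai spectrum-based ranking, not EBSCN itself):
# functions executed mostly in failing runs score as more suspicious.
from math import sqrt

# coverage[run] = functions executed; failed[run] = erroneous output observed?
coverage = {0: {"encode", "pack"}, 1: {"encode", "crc"}, 2: {"pack", "crc"}}
failed   = {0: True, 1: True, 2: False}

def ochiai(fn):
    ef = sum(1 for r, fns in coverage.items() if fn in fns and failed[r])
    nf = sum(1 for r, fns in coverage.items() if fn not in fns and failed[r])
    ep = sum(1 for r, fns in coverage.items() if fn in fns and not failed[r])
    denom = sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

for fn in sorted({f for fns in coverage.values() for f in fns},
                 key=ochiai, reverse=True):
    print(f"{fn}: {ochiai(fn):.2f}")          # 'encode' ranks most suspicious
```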
... In addition, an HWFI fault injection can be caused by generating disturbances on the supply voltage [138]. ...
Thesis
Full-text available
Electronic systems have become a matter of course in many areas of daily life, while growing ever more complex and powerful. This creates design challenges that are increasingly addressed with SystemC. SystemC, a C++ class library, allows hardware components to be modelled in a programming language, something previously possible only with dedicated hardware description languages. Simulating the system together with fault effects is an important step in demonstrating reliability. Fault simulation has essentially two fields of application: on the one hand, classical fault simulation, which judges test patterns with respect to their effectiveness for the production test; on the other hand, often called simulated fault injection, the demonstration of fault tolerance or of system behaviour in operation. This thesis is the first methodical investigation of various aspects and strategies for simulating faults in digital circuits described in SystemC. Advantages of using SystemC include a uniform description language in a seamless flow and the possibility of fast execution. A prerequisite for simulating hardware is a suitable modelling capability. Besides the register-transfer and gate levels, a new switch-level modelling is presented; its advantage is a higher resolution at an acceptable loss of speed, which is particularly useful when simulating faults. For fault modelling, further models were created alongside the classical stuck-at model that reflect defects and disturbances more realistically. Faults are injected into the hardware description using several strategies. Besides porting methods known from other hardware description languages, further approaches in SystemC are shown that exploit special properties of the C++ language and of the SystemC kernel. These techniques result in a modification of the SystemC library that leads to very efficient fault injection. When new design tools are introduced, the problem of porting existing know-how often arises. One alternative can be the coupling of existing and new simulation programs. Problems of such coupled simulations lie, for example, in the additional communication effort and synchronization, which often increase execution time considerably. The mixed-language co-simulation presented here, in the form of a thread implementation and a further modification of the SystemC library, proves very efficient compared with other co-simulations. A key challenge for simulation tasks is executing them in acceptable time. Several implementations of accelerated simulation are presented, using different levels of parallelization, and investigated with respect to their performance; a combination of methods yields a further improvement. This thesis shows that simulating hardware faults with SystemC is practicable and worthwhile, as underlined by simulations and evaluations on example circuits.
... Fault injectors were initially designed to inject faults at a very low level (e.g., circuit and gate level) and implemented in hardware. Some examples are pin-level instrumentation to modify the values at the processor pins, insertion of probes to alter the electric current values in selected hardware points, and external hardware to expose the target to heavy-ion radiation (see [3], [7], [8], [9], [10]). To address the issues implementation cost and the difficulty that physical reachability to the target that hardware implemented fault injectors may require, tools started to be implemented in software which then allowed for the use of more complex fault models such as software faults which became a very relevant type of faults. ...
Article
The development of multicore applications presents a challenge to developers, especially those unfamiliar with the complex tasks of managing and debugging issues related to multiple processors. These technologies have been introduced in large-scale systems such as PCs, and then in mobile and handheld devices, due to their particular characteristics. Recently, usage of these architectures is becoming more and more interesting for embedded and safety-critical systems, but there is still some way to go. This work presents the breakthroughs and challenges identified in an ongoing research project that intends to create an appropriate fault model in order to test (with fault injection) multicore applications for usage in safety-critical domains.
... They detected the most potent locations for incorporating additional fault-tolerant features. Karlsson et al. [27] combined radiation and power supply disturbances to examine the propagation of internal errors to the bus of an MC6809E. They injected transient faults into this microprocessor by exposing it to heavy ions and pulses into the power supply. ...
... To overcome the limitations of NMR we implement a mitigation technique that can mitigate or recover the erroneous modules. Testing of fault tolerance can be done by testing the FPGA hardware directly, using heavy-ion injection or power supply disturbances [6]. However, this method is relatively expensive, and it is difficult to obtain the expected environment. ...
Article
Full-text available
Radiation hazards can lead to system faults; fault tolerance is therefore required. A fault-tolerant system is designed to keep operations running despite degradation occurring in a specific module. Many fault tolerance techniques have been developed to handle this problem, in search of the most robust and efficient ones the technology allows. This paper presents Five Modular Redundancy (FMR) with a mitigation technique to recover the erroneous module. With the Dynamic Partial Reconfiguration technology available today, such a fault tolerance technique can be implemented successfully. The project showed that the robustness of the system is increased and that an erroneous module can be recovered immediately.
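
A minimal sketch of the voting core of such a scheme (plain Python; the recovery step is reduced to flagging which modules to reconfigure, which in the paper is done via dynamic partial reconfiguration of the FPGA):

```python
# Hedged sketch of five-modular redundancy: majority voting over five module
# outputs, plus identifying the disagreeing modules so they can be repaired.
from collections import Counter

def fmr_vote(outputs):
    """Return (voted value, indices of modules that need recovery)."""
    voted, _ = Counter(outputs).most_common(1)[0]
    suspects = [i for i, v in enumerate(outputs) if v != voted]
    return voted, suspects

value, to_repair = fmr_vote([7, 7, 3, 7, 7])
print(value, to_repair)   # 7 [2] -> module 2 is scheduled for recovery
```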
... Fault injection is one of the main approaches for dependability evaluation. A large number of methods have been proposed, essentially based on physical validation of systems, including injection on circuit pins, memory corruption [1], heavy-ion injection, power supply disturbances [2], and laser fault injection [3]. None of these approaches can be used for a dependability evaluation before the circuit has actually been manufactured. ...
Article
Complex digital integrated circuits are now used in almost every application domain. Technology evolution makes these circuits increasingly sensitive to transient faults (for example, bit flips). It is therefore becoming critical to analyse the impact of such faults in order to reduce the probability of critical failure modes. This article presents two complementary aspects: fault injection in high-level descriptions, and the analysis of the results obtained from injection campaigns carried out at different description levels.
... Several works [1] [3] demonstrated that with randomly selected fault lists the percentage of faults which do not produce errors ranges from 10 to 66%, depending on the system. The goal of the fault injection experiment is to exercise the system's fault processing capabilities. ...
Conference Paper
Full-text available
In this paper a new approach is presented to build the list of faults to be used by a fault injection environment; the list is built starting from a high-level description of the system. The approach aims especially at identifying malicious faults, i.e. faults having a critical impact on system reliability. To overcome the complexity problem inherent in low-level descriptions, high-level ones are exploited, and alternative graphs are applied to carry out the cause-effect analysis, to build a fault tree, and to carry out fault collapsing. The reduced high-level malicious fault list is converted so that it can be used together with the low-level description for the final fault injection.
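
The paper's alternative-graph analysis is not reproduced here; the sketch below illustrates only the underlying idea of fault collapsing, on a toy two-gate netlist with stuck-at faults: faults whose injected behaviour is identical on every input pattern are merged into a single list entry.

```python
# Hedged sketch of simulation-based fault collapsing on a toy netlist
# out = (a AND b) OR c. Node names and the circuit are invented.
from itertools import product

def circuit(a, b, c, fault=None):
    v = {"a": a, "b": b, "c": c}
    if fault and fault[0] in v:
        v[fault[0]] = fault[1]          # stuck-at on a primary input
    n1 = v["a"] & v["b"]
    if fault and fault[0] == "n1":
        n1 = fault[1]                   # stuck-at on the internal node
    return n1 | v["c"]

faults = [(node, val) for node in ("a", "b", "c", "n1") for val in (0, 1)]
sig = {f: tuple(circuit(a, b, c, f) for a, b, c in product((0, 1), repeat=3))
       for f in faults}
classes = {}
for f, s in sig.items():
    classes.setdefault(s, []).append(f)
for group in classes.values():
    print(group)      # e.g. a/0, b/0 and n1/0 collapse into one class
```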
... Most of the approaches proposed up to now apply once the system or circuit is available. Such approaches include pin-level fault injection, memory corruption [29] [20] [19], heavy-ion injection [30] [23], power supply disturbances [16], laser fault injection [33] or software fault injection [22] [21] [7] [5] [6]. More recently, several authors proposed to apply fault injection early in the design process. ...
Conference Paper
In this paper, a new methodology for the injection of single event upsets (SEUs) in memory elements is introduced. SEUs in memory elements can occur for many reasons (e.g. particle hits, radiation) and at any time. It is therefore important to examine the behaviour of circuits when an SEU occurs in them. Reconfigurable hardware (especially FPGAs) has been shown to be suitable for emulating the behaviour of a logic design and for realising fault injection. The proposed methodology for SEU injection exploits FPGAs and, contrary to the most common fault injection techniques, performs the injection directly in the reconfigurable hardware, taking advantage of the run-time reconfiguration capabilities of the device. In this case, no modification of the initial design description is needed to inject a fault, which avoids hardware overheads and specific synthesis, place, and route phases.
... Most of the approaches proposed up to now apply once the system or circuit is available. Such approaches include pin-level fault injection, memory corruption [29] [20] [19], heavy-ion injection [30] [23], power supply disturbances [16], laser fault injection [33] or software fault injection [22] [21] [7] [5] [6]. ...
Conference Paper
Full-text available
In this paper, approaches using run-time reconfiguration (RTR) for fault injection in programmable systems are introduced. In FPGA-based systems an important characteristic is the time to reconfigure the hardware. With novel FPGA families (e.g. Virtex, AT6000) it is possible to reconfigure the hardware partially at run-time. Important time savings can be achieved when taking advantage of this characteristic for fault injection, as only a small part of the device must be reconfigured.
Chapter
Increasing design costs are the main challenge facing the semiconductor community today. Assuring the correctness of the design contributes a major part of the problem. However, while diagnosis and correction of errors are more time-consuming than error detection, they have received far less attention, both in terms of research work and of industrial tools introduced. An additional, orthogonal threat to the pace of development is the rapidly growing rate of soft errors in the emerging nanometer technologies. According to roadmaps, soft errors in sequential logic are becoming a more severe issue than in memories, which are currently protected against them. The design community is not ready for this kind of challenge because existing soft error escape identification methods for sequential logic are inadequate. This chapter addresses the above-mentioned challenges by presenting a holistic diagnosis approach for design error location and malicious fault list generation for soft errors. First, a method for locating design errors at the source level of hardware description language code, using the design representation of high-level decision diagrams, is explained. Subsequently, this method is reduced to malicious fault list generation at the high level. A minimized fault list is generated to optimize the time spent on the fault injection run necessary for assessing a design's vulnerability to soft errors.
Chapter
An approach for assessing the impact of physical injection of transient faults on control flow behaviour is described and evaluated. The fault injection is based on two complementary methods using heavy-ion radiation and power supply disturbances. A total of 6,000 transient faults were injected into the target microprocessor, an MC6809E 8-bit CPU, running three different benchmark programs. In the evaluation, the control flow errors were distinguished from those that had no effect on the correct flow of control, viz. the control flow OKs. The errors that led to wrong results are separated from those that had no effect on the correct results. The errors that had no effect on either the correct flow or the correct results are specified. Three error detection techniques, namely two software-based techniques and one watchdog timer, were combined and used in the test in order to characterize the detected and undetected errors. It was found that more than 87% of all errors and 93% of the control flow errors could be detected.
Chapter
Chapter outline: Introduction; Is Silent Data Corruption a Real Threat?; Temperature and Voltage Operating Test; Electrostatic Discharge Operating Test; Conclusions; Acknowledgment; References.
Article
Technology downscaling increases the sensitivity of integrated circuits to perturbations (particle strikes, loss of signal integrity, etc.). The erroneous behaviour of a circuit can be unacceptable, and a dependability analysis at a high abstraction level makes it possible to select the most efficient protections and to limit the timing overhead induced by a possible rework. This PhD aims at developing a methodology and an environment that improve the dependability analysis of digital integrated circuits. The proposed approach uses a hardware prototype of an instrumented version of the design to be analyzed. The environment includes three levels of execution, including an embedded software level that speeds up the experiments while keeping great flexibility: the user can obtain the best trade-off between the complexity of the analysis and the duration of the experiments. We also propose new techniques for the instrumentation and for the injection control in order to improve the performance of the environment. A predictive evaluation of the performance informs the designer about the most influential parameters and about the analysis duration for a given design and a given implementation of the environment. Finally, the methodology is applied to the analysis of two significant systems, including a hardware/software system built around a SparcV8 processor.
Conference Paper
Due to voltage and structure shrinking, the influence of radiation on a circuit’s operation increases, resulting in future hardware designs exhibiting much higher rates of soft errors. Software developers have to cope with these effects to ensure functional safety. However, software-based hardware fault tolerance is a holistic property that is tricky to achieve in practice, potentially impaired by every single design decision. We present FAIL*, an open and versatile architecture-level fault-injection (FI) framework for the continuous assessment and quantification of fault tolerance in an iterative software development process. FAIL* supplies the developer with reusable and composable FI campaigns, advanced pre- and post-processing analyses to easily identify sensitive spots in the software, well-abstracted back-end implementations for several hardware and simulator platforms, and scalability of FI campaigns by providing massive parallelization. We describe FAIL*, its application to the development process of safety-critical software, and the lessons learned from a real-world example.
Chapter
Software faults are currently one of the major causes of computer-based system failures. No software development methodology so far has produced fault-free software, and fault injection is recognized as a technique for understanding the effects of faults in a given software product. However, the injection of software faults is not trivial and is still an open research problem. This chapter provides an overview of the injection of software faults and focuses on the injection of faults for Java-based software, highlighting the new challenges specific to this language.
Article
Fault injection involves the deliberate insertion of faults or errors into a computer system in order to determine its response. It has proven to be an effective method for measuring the parameters of analytical dependability models, validating existing fault-tolerant systems, synthesizing new fault-tolerant designs, and observing how systems behave in the presence of faults. Growing dependence on computers in life- and cost-critical applications makes it increasingly important to understand and utilize this technique. This paper motivates the use of fault injection and develops a taxonomy for interpreting fault-injection experiments. Background on how faults affect computer systems is provided. Results from several recent fault-injection studies are reviewed. Tools that facilitate the use of fault injection are examined, and areas for future research are discussed.
Article
This paper addresses the problem of processor faults in distributed memory parallel systems. It shows that transient faults injected at the processor pins of one node of a commercial parallel computer, without any particular fault-tolerant techniques, can cause erroneous application results for up to 43% of the injected faults (depending on the application). In addition to these very subtle faults, up to 19% of the injected faults (almost independent of the application) caused the system to hang. These results show that fault-tolerant techniques are absolutely required in parallel systems, not only to ensure the completion of long-running applications but, more importantly, to achieve confidence in the application results. The benefits of including some fairly simple behaviour-based error detection mechanisms in the system were evaluated together with Algorithm Based Fault Tolerance (ABFT) techniques. The inclusion of such mechanisms in parallel systems seems to be very important for detecting most of those subtle errors without greatly affecting the performance and the cost of these systems.
Article
The size and complexity of today's multiprocessor systems require the development of new techniques to measure their dependability. An effective technique for injecting faults in message-passing multiprocessor systems is presented. Interrupt messages are used to trigger fault injection routines in the targeted processors. Any fault that can be emulated by a modification of the memory content of processors can be injected, which includes faults that could occur within the processors, the memories, and even the communication network. The proposed technique allows control of the time and location of faults as well as of other characteristics. It has been used in a prototype multiprocessor system running real applications in order to compare the efficiency of various error detection and correction mechanisms.
Article
Full-text available
An important research topic deals with the investigation of whether a non-duplicated computer can be made fail-silent, since that behaviour is assumed a priori in many algorithms. However, previous research has shown that in systems using a simple behaviour-based error detection mechanism invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations could be higher than 10%. Since the study of these errors has shown that they were mostly caused by pure data errors, we evaluate the effectiveness of software techniques capable of checking the semantics of the data, such as assertions, to detect these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escape the simple hardware-based error detection techniques. In order to decouple the intrinsic limitations of the tests used from other factors that might affect their error detection capabilities, we evaluated a special class of software checks known for its high theoretical coverage: algorithm-based fault tolerance (ABFT). The analysis of the remaining errors showed that most of them remained undetected due to short-range control flow errors. When very simple software-based control flow checking was combined with the semantic tests, the target system, without any dedicated error detection hardware, behaved according to the fail-silent model for about 98% of all the faults injected.
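
The ABFT class of checks mentioned above is commonly illustrated with checksum matrix multiplication (the Huang-Abraham scheme). A minimal sketch, assuming NumPy and arithmetic exact enough for the tolerance used:

```python
# Hedged sketch of ABFT checksum matrix multiplication: augment the operands
# with row/column checksums; a later checksum mismatch exposes a silent error.
import numpy as np

def abft_matmul(A, B):
    """Multiply with checksum augmentation; returns the full checksum matrix."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # add column-checksum row
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # add row-checksum column
    return Ac @ Br

def verify(C):
    data = C[:-1, :-1]
    return (np.allclose(C[-1, :-1], data.sum(axis=0))
            and np.allclose(C[:-1, -1], data.sum(axis=1)))

C = abft_matmul(np.arange(9.).reshape(3, 3), np.eye(3))
print(verify(C))     # True: checksums consistent
C[1, 1] += 5.0       # emulate a silent data error in the result
print(verify(C))     # False: the checksum test exposes the corruption
```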
Conference Paper
Analyzing the behavior of ICs subjected to soft errors is now mandatory, even for applications running at sea level, to prevent malfunctions in critical applications such as automotive. This paper introduces a novel prototyping-based fault injection environment that makes it possible to perform several types of dependability analyses in a common optimized framework. The approach takes advantage of the speed of hardware and of the flexibility of software offered by embedded processors to achieve optimized trade-offs between experiment duration and processing complexity. The partitioning of tasks between hardware and embedded software is defined with respect to the type of circuit to analyze.
Conference Paper
Full-text available
This paper discusses the problems of pin-level fault injection for dependability validation and presents the architecture of a pin-level fault injector called RIFLE. This system can be adapted to a wide range of target systems and the faults are mainly injected in the processor pins. The injection of the faults is deterministic and can be reproduced if needed. Faults of different nature can be injected and the fault injector is able to detect whether the injected fault has produced an error or not without the requirement of feedback circuits. RIFLE can also detect specific circumstances in which the injected faults do not affect the target system. Sets of faults with specific impact on the target system can be generated. The paper also presents fault injection results showing the coverage and latency achieved with a set of simple behavior based error detection mechanisms. It is shown that up to 72.5% of the errors can be detected with fairly simple mechanisms. Furthermore, for over 90% of the faults the target system has behaved according to the fail-silent model, which suggests that a traditional computer equipped with simple error detection mechanisms is relatively close to a fail-silent computer.
Conference Paper
Full-text available
Ninja is a new framework that makes it easy to create robust scalable Internet services. We introduce a new programming model based on the natural parallelism of large-scale services, and show how to implement the model. The first key aspect of the model is intelligent connection management, which enables high availability, load balancing, graceful degradation and online evolution. The second key aspect is support for shared persistent state that is automatically partitioned for scalability and replicated for fault tolerance. We discuss two versions of shared state, a cluster-based hash table with transparent replication and novel features that reduce lock contention, and a cluster-based file system that provides local transactions and cluster-wide namespaces and replication. Using several applications we show that the framework enables the creation of scalable, highly available services with persistent data, with very little application code, as little as one-tenth the code size of comparable stand-alone applications.
Conference Paper
In this paper, a simulation-based fault injection technique is used to study various aspects of the error behavior of a 32-bit pipelined RISC, using a register-level model written in VHDL. Consequences of the fault location on the error behavior were studied specifically. Finally, the efficiency of two built-in error-detecting methods was evaluated.
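
A much-reduced analogue of such a simulation-based experiment (a toy accumulator model in Python rather than a VHDL RISC; the program and registers are invented): flip one register bit at one cycle and compare against a golden run. Even this toy already reproduces the dependence of the error behavior on fault location and injection time.

```python
# Hedged sketch of register-level simulated fault injection on a toy model.
def run(program, flip=None):                 # flip = (cycle, reg, bit)
    regs = {"acc": 0, "tmp": 0}
    for cycle, (op, reg, val) in enumerate(program):
        if op == "load":
            regs[reg] = val
        elif op == "add":
            regs["acc"] += regs[reg]
        if flip and flip[0] == cycle:
            _, r, bit = flip
            regs[r] ^= 1 << bit              # single-event upset in a register
    return regs["acc"]

program = [("load", "tmp", 5), ("add", "tmp", 0), ("add", "tmp", 0)]
golden = run(program)
for cycle in range(len(program)):
    for bit in range(4):
        faulty = run(program, flip=(cycle, "tmp", bit))
        print(f"cycle {cycle}, bit {bit}:",
              "effective" if faulty != golden else "masked")
```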
Article
A fault-tolerant computer system's dependability must be validated to ensure that its redundancy has been correctly implemented and the system will provide the desired level of reliable service. Fault injection, the deliberate insertion of faults into an operational system to determine its response, offers an effective solution to this problem. We survey several fault injection studies and discuss tools such as React (Reliable Architecture Characterization Tool) that facilitate its application.
Conference Paper
This paper presents a methodology and supporting set of tools for performing fault simulation throughout the design process for large systems. Most of the previous work in fault simulation has sought efficient methods for simulating faults at a single level design abstraction. This paper has developed a methodology for performing fault simulation of design models at the architectural, algorithmic, functional-block, and gate levels of design abstraction. As a result, fault simulation is supported throughout the design process from system definition through hardware/software implementation. Furthermore, since the fault simulation utilities are provided in an advanced design environment prototype tool, an iterative design/evaluation process is available for system designers at each stage of design refinement. The two key contributions of this paper are: a fault simulation methodology and supporting tools for performing fault simulation throughout the design process of large systems; and a methodology for performing fault simulation concurrently in hardware and software component designs and a proof-of-concept implementation
Conference Paper
The expanding size and complexity of dependable computing systems has increased their cost and at the same time complicated the process of estimating dependability attributes such as fault coverage and detection latency. One approach to estimating such parameters is to employ fault injection; however, algorithms are needed to generate a list of faults to inject. Unlike randomly selected faults, a fault list is needed which is guaranteed to cause either system failure or the activation of mechanisms which cover the injected fault. This research effort has developed an automated technique for selecting faults to use during fault injection experiments. The technique is general in nature and can be applied to any computing platform. The primary objective of this research effort was the development and implementation of the algorithms to generate a fault set which exercises the fault detection and fault processing aspects of the system. The end result is a completely automated method for evaluating complex dependable computing systems by estimating fault coverage and fault detection latency.
Conference Paper
This paper addresses the detection of permanent and transient faults in complex VLSI circuits, with a particular focus on faults leading to sequencing errors. Several Finite State Machine implementations using signature monitoring for control-flow checking are compared in terms of error detection latency, theoretical error coverage, experimental error coverage and area overheads. Advantages and drawbacks of each approach are presented
Conference Paper
Two software-based techniques for online detection of control flow errors were evaluated by fault injection. One technique, called block signature self-checking (BSSC), checks the control flow between program blocks. The other, called error capturing instructions (ECIs), inserts ECIs in the program area, the data area, and the unused area of the memory. To demonstrate these techniques, a program has been developed which modifies the executable code for the MC6809E 8-bit microprocessor. The error detection techniques were evaluated using two fault injection techniques: heavy-ion radiation from a californium-252 source and power supply disturbances. Combinations of the two error detection techniques were tested for three different workloads. A combination of BSSC, ECIs, and a watchdog timer was also evaluated.
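
A plain-Python analogue of the BSSC mechanism (the published technique patches MC6809E executable code; the block names and signatures here are invented): a run-time signature set at block entry is compared with the block's embedded reference signature at block exit, so control flow that slips into or out of a block is detected.

```python
# Hedged sketch of block signature self-checking (BSSC), not the MC6809E tool.
BLOCK_SIG = {"init": 0x3A, "loop": 0x5C, "done": 0x91}  # assigned offline

runtime_sig = None

def block_entry(block):
    global runtime_sig
    runtime_sig = BLOCK_SIG[block]          # signature set on block entry

def block_exit(block):
    if runtime_sig != BLOCK_SIG[block]:     # mismatch => control flow error
        raise RuntimeError(f"control flow error detected at exit of {block!r}")

block_entry("init")
block_exit("init")                          # correct flow: passes
block_entry("loop")
block_exit("done")                          # erroneous branch into 'done': raises
```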
Article
Full-text available
Fault injection is an important technique for the evaluation of design metrics such as reliability, safety, and fault coverage. Fault injection involves inserting faults into a system and monitoring the system to determine its behavior in response to the fault. Recently, designers are realizing the advantages of using simulation to perform fault injection on a model of the design, as opposed to performing the fault injection on the actual system. As designs become more complex, the need arises to perform fault injection at earlier stages in the design process, specifically at the behavioral level of abstraction where a functional description exists for the various components of the design, but the implementation details have not been developed. A technique has been developed that allows for the injection of faults into a Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) behavioral-level model. The corruption of a VHDL signal (the fault injection) is accomplished...
Article
Fault injection is an effective method for studying the effects of faults in computer systems and for validating fault-handling mechanisms. The approach presented involves injecting transient faults into integrated circuits by using heavy-ion radiation from a Californium-252 source. The proliferation of safety-critical and fault-tolerant systems using VLSI technology makes such attempts to inject faults at internal locations in VLSI circuits increasingly important.
Article
An approach for assessing the impact of physical injection of transient faults on processor execution is described and evaluated. The fault injection is based on two complementary methods using (1) heavy-ion radiation and (2) power supply disturbances. 12,000 transient faults were injected into the target microprocessor, a Motorola MC6809E 8-bit CPU, running 3 different workloads. In the evaluation, the control-flow errors were distinguished from those that had no effect on the correct flow of control. The errors that led to wrong results are separated from those that did not affect the correct results. The errors that affected neither the correct control flow nor the correct results are specified. Effects of errors on the registers and signals of the processor are characterized. Workload dependency of error rates is demonstrated. Three error-detection mechanisms (2 software-based mechanisms and 1 watchdog timer) were combined and used to characterize the detected and undetected errors. More than 87% of all errors and 93% of the control-flow errors could be detected. In a different test, the efficiency of an isolated watchdog timer was evaluated. The coverage of the isolated watchdog timer was only 62%. The results indicate that fault-injection methods, workloads, and programming languages all affect the control flow, coverage, latency, and error rates differently.
Article
The size and complexity of modern dependable computing systems has significantly compromised the ability to accurately measure system dependability attributes such as fault coverage and fault latency. Fault injection is one approach for the evaluation of dependability metrics. Unfortunately, fault injection techniques are difficult to apply because the size of the fault set is essentially infinite. Current techniques select faults randomly resulting in many fault injection experiments which do not yield any useful information. This research effort has developed a new deterministic, automated dependability evaluation technique using fault injection. The primary objective of this research effort was the development and implementation of algorithms which generate a fault set which fully exercises the fault detection and fault processing aspects of the system. The theory supporting the developed algorithms is presented first. Next, a conceptual overview of the developed algorithms is followed by the implementation details of the algorithms. The last section of this paper presents experimental results gathered via simulation-based fault injection of an Interlocking Control System (ICS). The end result is a deterministic, automated method for accurately evaluating complex dependable computing systems using fault injection
Article
Full-text available
This paper discusses the problems of pin-level fault injection for dependability validation and presents the architecture of a pin-level fault injector called RIFLE. This system can be adapted to a wide range of target systems and the faults are mainly injected in the processor pins. The injection of the faults is deterministic and can be reproduced if needed. Faults of different nature can be injected and the fault injector is able to detect whether the injected fault has produced an error or not without the requirement of feedback circuits. RIFLE can also detect specific circumstances in which the injected faults do not affect the target system. Sets of faults with specific impact on the target system can be generated. The paper also presents fault injection results showing the coverage and latency achieved with a set of simple behavior based error detection mechanisms. It is shown that up to 72.5% of the errors can be detected with fairly simple mechanisms. Furthermore, for over 90% of the faults the target system has behaved according to the fail-silent model, which suggests that a traditional computer equipped with simple error detection mechanisms is relatively close to a fail-silent computer.
Article
This paper describes a new methodology to inject transient and permanent faults in digital systems. For this purpose, the simulation-based fault injector VERIFY (VHDL-based Evaluation of Reliability by Injecting Faults efficientlY) has been developed, which allows fault injection at several abstraction levels of a digital system. The combined approach of injection and analysis of the results enables the system engineer to evaluate the reliability of the system as well as the coverage of the fault tolerance mechanisms applied to the system. The approach is applied to the DP32 processor, where faults are injected at pin level, by flipping bits in internal registers, and at gate level. The results of this comparison show that the time to recover from a fault and the total number of faults which lead to a recovery differ significantly according to the type of fault injection. Whereas in the first 2 s after fault injection using the bit-flip fault model more than 79% of all faults injected were still present in the system, only 4.5% remained in the processor when using a stuck-at fault model at gate level. Keywords: fault injection, VHDL, dependable systems, transient faults, experimental analysis, system validation, fault/error models.
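
The difference reported between the bit-flip and stuck-at experiments follows from the fault models themselves. A toy sketch of the contrast (this is not VERIFY, whose injection operates on VHDL models): a transient flip corrupts the state once, while a stuck-at fault re-asserts the faulty value on every cycle.

```python
# Hedged sketch contrasting a transient bit flip with a permanent stuck-at
# fault on a register that is reloaded from the datapath each cycle.
def run(cycles, fault=None):                 # fault = ("flip"|"stuck", cycle, bit)
    history = []
    for c in range(cycles):
        state = c & 0xFF                     # register reloaded every cycle
        if fault and fault[0] == "flip" and c == fault[1]:
            state ^= 1 << fault[2]           # transient: corrupts this cycle only
        if fault and fault[0] == "stuck" and c >= fault[1]:
            state &= ~(1 << fault[2])        # permanent: bit forced to 0 forever
        history.append(state)
    return history

golden = run(8)
flips = sum(a != b for a, b in zip(golden, run(8, ("flip", 2, 0))))
stuck = sum(a != b for a, b in zip(golden, run(8, ("stuck", 2, 0))))
print(flips, stuck)          # 1 vs 3: the stuck-at fault keeps differing
```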
Conference Paper
Full-text available
Fault injection is used to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors: (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate total loss of the primary service to occur in only 16% of the faults; (3) reveal errors termed potential hazards that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and modeling of availability
Article
Full-text available
The authors address the problem of validating the dependability of fault-tolerant computing systems, in particular, the validation of the fault-tolerance mechanisms. The proposed approach is based on the use of fault injection at the physical level on a hardware/software prototype of the system considered. The place of this approach in a validation-directed design process and with respect to related work on fault injection is clearly identified. The major requirements and problems related to the development and application of a validation methodology based on fault injection are presented and discussed. Emphasis is put on the definition, analysis, and use of the experimental dependability measures that can be obtained. The proposed methodology has been implemented through the realization of a general pin-level fault injection tool (MESSALINE), and its usefulness is demonstrated by the application of MESSALINE to the experimental validation of two systems: a subsystem of a centralized computerized interlocking system for railway control applications and a distributed system corresponding to the current implementation of the dependable communication system of the ESPRIT Delta-4 Project
Chapter
The use of heavy-ion radiation from Californium-252 (252Cf) as a fault injection method for experimental validation of dependable computing systems is described and discussed. Heavy ions from 252Cf can cause transient faults as they pass through depletion regions in integrated circuits. Irradiation of a circuit must be performed in a vacuum. A design of a portable miniature vacuum chamber which contains one integrated circuit and a 252Cf source is presented. The vacuum chamber can be plugged directly into an IC socket of the same type as the irradiated circuit uses. For an effective system validation, the fault injection method used must cause a variety of errors in the irradiated circuit. Results from irradiation of the MC6809E 8-bit microprocessor are presented. They show that errors occurred on all output pins of the microprocessor, and that both single and multiple bit errors were observed. Sensitivity to heavy-ion radiation from 252Cf for various IC technologies is discussed briefly.
Book
This book is the first unified treatment of the analysis and design methods for protection of principally electronic systems from the deleterious effects of nuclear and electromagnetic radiation. Coverage spans from a detailed description of the nuclear radiation sources to pertinent semiconductor physics, then to hardness assurance. This work combines the disciplines of solid state physics, semiconductor physics, circuit engineering, nuclear physics, together with electronics and electromagnetic theory into a book that can be used as a text with problems at the end of the majority of the chapters. Written by veterans in the field, the most significant feature of this book is its comprehensive treatment of the phenomena involved. This treatment includes the analysis and design of the effect of nuclear radiation on electronic systems from the experimental, theoretical, and engineering viewpoints. Unique pedagogical attempts are employed to make the material more understandable from the position of an enlightened engineering and scientific readership whose task is the design and analysis of radiation hardened electronic systems.
Conference Paper
Original fault and error insertion experiments determining the effects of transient faults in microprocessor controllers are described. The fault insertion experiments qualify the most probable errors (at the logical level) caused by transient faults resulting from electrical noise. A wide class of microprocessor circuits is considered, and measurements of static and dynamic noise margins are presented. The error insertion experiments allow evaluation of the sensitivity of the controller to various errors (fault tolerance, error coverage, and error latency). These experiments are carried out using a universal microprocessor emulator.
Conference Paper
Fault injection by heavy-ion radiation from Californium-252 could become a useful method for experimental verification and validation of error handling mechanisms used in computer systems. Heavy ions emitted from Cf-252 have the capacity to cause transient faults and soft errors in integrated circuits. In this paper, results of initial fault injection experiments using the MC6809E 8-bit microprocessor are presented. The purpose of the experiments was to investigate the variation of the error behavior seen on the external buses when the microprocessor chip was irradiated by a Cf-252 source. The variation of the error behavior is imperative for an effective evaluation of error handling mechanisms, e.g. those designed to detect errors caused by a microprocessor. The experiments showed that the errors seen on the external buses of the MC6809E were well spread, both in terms of location and number of bits affected.
Conference Paper
Several concurrent error detection schemes suitable for a watch-dog processor were evaluated by fault injection. Soft errors were induced into a MC6809E microprocessor by heavy-ion radiation from a Californium-252 source. Recordings of error behavior were used to characterize the errors as well as to determine coverage and latency for the various error detection schemes. The error recordings were used as input to programs that simulate the error detection schemes. The schemes evaluated detected up to 79% of all errors within 85 bus cycles. Fifty-eight percent of the errors caused execution to diverge permanently from the correct program. The best schemes detected 99% of these errors. Eighteen percent of the errors affected only data, and the coverage of these errors was at most 38%
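A minimal sketch of the tabulation step behind such figures (the latency values are invented, not the paper's recordings): replaying each recorded error through a simulated detection scheme yields a detection latency in bus cycles, or none when the scheme misses the error; coverage and worst-case latency follow directly.

```python
# Hedged sketch: per-error detection latencies in bus cycles, None = undetected.
latencies = [12, 85, 3, None, 40, None, 7, 1, None, 22]

detected = [l for l in latencies if l is not None]
coverage = len(detected) / len(latencies)
print(f"coverage: {coverage:.0%}")                        # 70%
print(f"worst-case latency: {max(detected)} bus cycles")  # 85
```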
Article
JPL and Aerospace have collected an extensive set of heavy ion single event upset (SEU) test data since their last joint publication in December, 1985. Trends in SEU susceptibility for state-of-the-art parts are presented.
Article
A Cf-252 irradiation facility for single event testing of microcircuits has been developed. Testing techniques have been refined to include the capability of determining LET thresholds as well as event cross-sections. The capabilities and limitations of Cf-252 in testing to provide parameters for calculation of SEU rate in the heavy ion environment of space are discussed.
Article
New test data from the Jet Propulsion Laboratory (JPL), The Aerospace Corporation, Rockwell International (Anaheim) and IRT have been combined with published data of JPL [1,2] and Aerospace [3] to form a nearly comprehensive body of single event upset (SEU) test data for heavy ion irradiations. This data has been arranged to exhibit the SEU susceptibility of devices by function, technology and manufacturer. Clear trends emerge which should be useful in predicting future device performance.
Article
Heavy ion induced single event upsets and latch-up in 4K CMOS RAMs and PROMs have been demonstrated using both the Harwell Variable Energy Cyclotron and a laboratory Californium-252 source. The latter provides a novel and convenient alternative which complements heavy ion accelerator techniques. A number of memories have been examined by both techniques, enabling appropriate cross sections to be measured.
Article
A system has been developed that tests microprocessors for single-event upset (SEU) at the specified clock speed and without adding wait or hold states. This system compiles a detailed record of SEU-induced errors and has been used to test the Sandia SA3000 microprocessor and prototypes of its commercial equivalent, the Harris H80C85 at the Lawrence Berkeley Laboratory 88-inch cyclotron facility. Using appropriate test programs and analyzing the resulting upset data, the authors have established the SEU cross section of the major functional elements of the hardened processors. With these cross sections and the estimated duty factors of a `typical' program, they computed the expected upset rate in a parallel, normally incident, beam as a function of linear energy transfer and measured the rate in several cyclotron beams. Good agreement between the measured and calculated rates was obtained
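The rate computation described here is, in outline, a sum over functional elements of flux times cross section times duty factor. A back-of-the-envelope sketch; the element names and all numbers below are invented for illustration only.

```python
# Hedged sketch of an SEU rate estimate from per-element cross sections.
flux = 1.0e2                     # particles / (cm^2 s) above the LET threshold
elements = {                     # element: (SEU cross section cm^2, duty factor)
    "register file": (2.0e-5, 0.80),
    "ALU latches":   (5.0e-6, 0.30),
    "decoder":       (1.0e-6, 1.00),
}
rate = sum(flux * sigma * duty for sigma, duty in elements.values())
print(f"expected upset rate: {rate:.2e} upsets/s")   # 1.85e-03
```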
Article
The results of several experiments conducted using the fault-injection-based automated testing (FIAT) system are presented. FIAT is capable of emulating a variety of distributed system architectures, and it provides the capabilities to monitor system behavior and inject faults for the purpose of experimental characterization and validation of a system's dependability. The experiments consist of exhaustively injecting three separate fault types into various locations, encompassing both the code and data portions of memory images, of two distinct applications executed with several different data values and sizes. Fault types are variations of memory bit faults. The results show that there are a limited number of system-level fault manifestations. These manifestations follow a normal distribution for each fault type. Error detection latencies are found to be normally distributed. The methodology can be used to predict the system-level fault responses during the system design stage.