Workflow of OVF computation

Source publication
Article
Full-text available
As process technology scales, electronic devices become more susceptible to soft errors induced by radiation. Silent data corruption (SDC) is considered the most severe outcome incurred by soft errors. The effects of faulty variables on producing SDC vary widely. Without profiling the vulnerability of variables, the derived detectors often incur low...

Context in source publication

Context 1
... the calculation of OVF is based on the eDDG. The whole process has three steps (see Fig. 2). ...
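
Neither OVF nor the eDDG is defined in this excerpt, so the following Python sketch only illustrates the general idea of a dependence-graph-based vulnerability score: a variable's score is taken to be the fraction of program outputs its value can reach, one plausible ingredient of such a metric. All names and data are illustrative.

    # Hypothetical vulnerability score over a dynamic dependence graph
    # (DDG): edges point from a definition to its uses, and a fault in
    # a node can only cause an SDC in the outputs it reaches.
    def reachable_outputs(ddg, node, outputs):
        """Return the set of output nodes reachable from `node` in `ddg`."""
        seen, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            stack.extend(ddg.get(n, ()))
        return seen & outputs

    def vulnerability_scores(ddg, outputs):
        """Share of outputs a fault in each variable can reach."""
        nodes = set(ddg) | {m for succs in ddg.values() for m in succs}
        return {n: len(reachable_outputs(ddg, n, outputs)) / len(outputs)
                for n in nodes}

    # Toy DDG: b feeds both outputs, so a fault in b is the most exposed.
    ddg = {"a": ["c"], "b": ["c", "d"], "c": ["out1"], "d": ["out2"]}
    print(vulnerability_scores(ddg, outputs={"out1", "out2"}))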

Citations

... Another reason for the discrepancy is the masking effect of logical instructions [15,35]. Assume that the result of a logical AND of two input registers is stored in the destination register. ...
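
The masking effect mentioned here is easy to quantify for AND: a single-bit fault in one source register propagates to the destination only where the other operand holds a 1. A minimal sketch:

    # Fraction of single-bit faults in one AND input that are masked by
    # the other operand: a flipped bit survives r1 & r2 only if the
    # other operand has a 1 in that position.
    def and_masking_rate(other_operand, width=32):
        masked = sum(1 for i in range(width) if not (other_operand >> i) & 1)
        return masked / width

    r2 = 0x0000FFFF
    print(and_masking_rate(r2))  # 0.5: faults in the upper 16 bits vanish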
Article
Full-text available
With aggressive technology scaling, soft errors have become a major threat in modern computing systems. Several techniques have been proposed in the literature and implemented in actual devices as countermeasures to this problem. However, their effectiveness in ensuring error-free computing cannot be ascertained without an accurate reliability estimation methodology. This can be achieved by using the vulnerability metric: the probability of system failure as a function of the time the program data are exposed to transient faults. In this work, we present the gemV-tool, a comprehensive toolset for estimating system vulnerability, based on the cycle-accurate gem5 simulator. The three main characteristics of the gemV-tool are: (i) fine-grained modeling: vulnerability modeling at a fine-grained granularity through the use of RTL abstraction; (ii) accurate modeling: accurate vulnerability calculation of speculatively executed instructions; and (iii) comprehensive modeling: vulnerability estimation of all the sequential elements in the out-of-order processor core. We validated our vulnerability models through extensive fault injection campaigns with <3% correlation error and 90% statistical confidence. Using the gemV-tool, we made the following observations: (i) the vulnerability of two microarchitectural configurations with similar performance can differ by 82%; (ii) the vulnerability of a processor can vary by more than 10×, depending on the implemented algorithm; and (iii) the vulnerability of each component in the processor varies significantly, depending on the ISA of the processor.
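
The vulnerability metric described above is commonly computed in bit-cycles: each datum contributes its width times the cycles during which a fault in it could still reach the program outcome. The sketch below shows that generic formulation; it is an illustration, not necessarily gemV-tool's exact model.

    # Generic bit-cycle vulnerability (a sketch, not gemV-tool's exact
    # model): sum the exposure windows of each datum, weighted by its
    # width, and normalize by the total bit-cycles of the structure.
    def vulnerability(windows, width, total_cycles, total_bits):
        """windows: (write_cycle, last_read_cycle) exposure intervals."""
        vulnerable_bit_cycles = sum((last - first) * width
                                    for first, last in windows)
        return vulnerable_bit_cycles / (total_cycles * total_bits)

    # One 32-bit register, live for cycles 10-50 and 70-80 out of 1000:
    print(vulnerability([(10, 50), (70, 80)], width=32,
                        total_cycles=1000, total_bits=32))  # 0.05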
... The detection probability of assertion detection (P_Mdet) can be calculated according to the concept of the error masking parameter [20], which refers to the error detection rate of the detection method. An example of P_Mdet calculation is shown here using the detection relationship of equality. ...
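
The paper's exact P_Mdet formula is not given in this excerpt, but the equality case can be illustrated by Monte Carlo: a fault in x is detected iff it breaks the asserted relation x == y. A hedged sketch:

    # Monte Carlo estimate of the detection probability of an equality
    # assertion: flip one random bit of x and count how often the
    # assertion x == y fires. (Illustration only; not the paper's
    # P_Mdet derivation.)
    import random

    def p_det_equality(x, y, width=32, trials=10_000):
        detected = 0
        for _ in range(trials):
            x_faulty = x ^ (1 << random.randrange(width))
            if x_faulty != y:          # the assertion catches the fault
                detected += 1
        return detected / trials

    # When x and y start equal, any single-bit flip breaks equality:
    print(p_det_equality(x=42, y=42))  # 1.0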
... When dor is 2 and dop is 0, the average values of reliability and overhead are 95.75% and 215.72%, respectively. S represents the average energy efficiency of each BB (see Eq. (20)). The higher S is, the better the trade-off effect is. ...
Article
Full-text available
The high-energy particles in the space environment can perturb integrated circuits, resulting in system errors or even failures, a phenomenon known as single event effects (SEE). To ensure the normal operation of space systems, it is first necessary to detect these errors. However, detection algorithms also bring additional overhead to the system and reduce its performance. Therefore, we aim to find a trade-off between reliability and performance. To this end, we propose a quantitative evaluation model for detection methods that evaluates the reliability gain of different detection methods under the same overhead. Our method allocates the optimal detection method to the corresponding code segment based on the quantitative results, thereby achieving a trade-off between reliability and performance. Experimental results show that the average energy efficiency of our trade-off method is 91.34%, which is 21.49% higher than that of the other methods.
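
A minimal sketch of the allocation idea in this abstract, assuming each code segment comes with candidate detection methods annotated with a reliability gain and an overhead (the method names and numbers below are made up): pick, per segment, the method with the best gain per unit overhead.

    # Per-segment allocation: choose the detection method with the best
    # reliability gain per unit overhead (illustrative data only).
    def allocate(segments):
        """segments: {name: [(method, reliability_gain, overhead), ...]}"""
        return {seg: max(cands, key=lambda m: m[1] / m[2])[0]
                for seg, cands in segments.items()}

    segments = {
        "BB1": [("duplication", 0.98, 1.9), ("assertion", 0.80, 1.1)],
        "BB2": [("duplication", 0.95, 2.1), ("assertion", 0.90, 1.2)],
    }
    print(allocate(segments))  # {'BB1': 'assertion', 'BB2': 'assertion'}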
Article
Full-text available
The recent trend in most processor manufacturing technologies has significantly increased the vulnerability of embedded systems operating in harsh environments against soft errors. These errors can cause Silent Data Corruptions (SDCs) that produce erroneous execution results silently, disturbing the system’s execution and potentially leading to severe financial, human or environmental disasters. The use of fault tolerance techniques that take into account the performance and constraints of safety-critical systems is therefore essential to improve system reliability efficiently. Given the significant overhead imposed by conventional techniques, e.g., performance loss, increased memory usage, and additional hardware costs, researchers have developed cost-effective software-based techniques for fault tolerance. However, as detection rates grow, these techniques can increase code size and execution time significantly, which creates a challenge. This paper proposes an automated framework for selective fault tolerance of SDCs in software running on different architectures. The framework comprises a sequence of several consecutive techniques executed automatically. It offers a software-based technique that operates at the microarchitecture level and evaluates the vulnerability of program instructions against SDC errors. The framework conducts vulnerability assessment at the binary code level using a non-intrusive, runtime fault injection mechanism. It can inject faults at different granularity levels to maximize fault activation, including fine-grained injection at specific instruction fields or encoding bits, and coarse-grained injection into the entire software system. The framework makes minor modifications to the software being tested, enabling it to run at near-native speed. When SDC vulnerable instructions are identified, the framework selectively protects them automatically using a compiler extension, achieving a more appropriate trade-off between SDC detection and overhead by avoiding overprotection. Our framework was evaluated by conducting a large number of fault injection-based experiments on real-world benchmark programs using the cycle-accurate Gem5 simulator. Leveraging the accurate vulnerability assessment results provided by our framework, the proposed selective technique reduces SDC errors by up to 99% by selectively protecting only 45% of the program’s static instructions, with a performance overhead ranging from 8% to 35%.
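
The selection step this abstract describes can be pictured as a simple ranking: order static instructions by the SDC rate observed in the fault-injection campaign and protect the top fraction (the abstract reports that protecting about 45% of static instructions removed up to 99% of SDC errors). A sketch with made-up rates:

    # Rank instructions by observed SDC rate and protect the top
    # `fraction` of them (data illustrative only).
    def select_for_protection(sdc_rates, fraction=0.45):
        ranked = sorted(sdc_rates, key=sdc_rates.get, reverse=True)
        return set(ranked[:max(1, int(len(ranked) * fraction))])

    rates = {"i1": 0.30, "i2": 0.02, "i3": 0.25, "i4": 0.00, "i5": 0.10}
    print(select_for_protection(rates))  # {'i1', 'i3'}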
Article
Full-text available
Radiation-induced soft errors, though rare, pose a significant threat to the reliability of systems. Assessing the intrinsic resilience of software to soft errors is therefore essential for building fault-tolerant systems cost-effectively. Analytical models, while fast, can be imprecise. In contrast, Fault Injection (FI) has been successfully applied as a mature method for reliability assessment. While high-level FI offers less accuracy, existing low-level techniques can enhance resilience assessment accuracy at the cost of desirable features such as fault coverage and low intrusiveness. Furthermore, these techniques are often driven by random FI campaigns, making it challenging to establish a clear correlation between application characteristics and resilience. This paper presents BiGResi, a versatile software-based framework for assessing software resilience. BiGResi overcomes the limitations of random, instruction type-agnostic FI techniques by evaluating resilience at a low-level granularity, considering instruction type and bit location. Furthermore, it targets the instruction set architecture (ISA), enhancing assessment accuracy by revealing architecturally visible faults. BiGResi employs a timing-based FI mechanism with negligible modifications to the target software, minimizing intrusiveness and ensuring near-native speed. BiGResi’s accuracy is empirically evaluated through many FI campaigns targeting different benchmarks with diverse characteristics. We observed that instruction types, ISA encoding bits, and bit location are key factors to consider when assessing software resilience. Finally, BiGResi’s effectiveness is demonstrated by selectively applying instruction protection, resulting in an average reduction of silent data corruptions (SDCs) by 73.80%, with a performance overhead of 15.46%. Furthermore, allowing a slightly higher overhead of 22% can improve the SDC detection rate by up to 93.83%.
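
BiGResi's actual mechanism is timing-based injection into native code, but the bookkeeping of instruction- and bit-aware FI can be mimicked on a toy interpreter: flip one bit of one instruction's result and compare the run against a golden output. Everything below is illustrative.

    # Toy instruction- and bit-aware fault injection: flip one bit of
    # one instruction's result and classify the run against the golden
    # output (SDC vs. masked).
    import operator

    def run(program, inject_at=None, bit=0):
        env = {}
        for idx, (dst, op, a, b) in enumerate(program):
            val = op(env.get(a, a), env.get(b, b))
            if idx == inject_at:
                val ^= 1 << bit        # single-bit upset in the result
            env[dst] = val
        return env["out"]

    prog = [("t", operator.add, 2, 3), ("out", operator.and_, "t", 1)]
    golden = run(prog)
    for idx in range(len(prog)):
        for bit in (0, 7):
            outcome = run(prog, inject_at=idx, bit=bit)
            print(idx, bit, "SDC" if outcome != golden else "masked")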
Article
Silent Data Corruption (SDC) is one of the most difficult data-flow errors to detect among those caused by single-event upsets in space radiation. To address the problem of multi-bit upsets causing program SDC, an instruction-level multi-bit SDC vulnerability prediction model based on one-class support vector machine classification is built from SDC vulnerability analysis, providing more accurate identification of vulnerable instructions. We then propose a multi-bit data-flow error detection method (SDCVA-OCSVM) that hardens the program with selective instruction redundancy, aiming to protect the data in the memory and registers used by the program. Comparative experiments verify that the method achieves a higher error detection rate with lower code-size and time overhead.
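
The abstract does not list the model's features, so the sketch below only shows the mechanics of one-class classification with scikit-learn's OneClassSVM: train on feature vectors of instructions known to be SDC-prone, then flag candidates that fall inside the learned region. The feature set (fan-out, live cycles, masking score) is an assumption for illustration.

    # One-class classification of SDC-vulnerable instructions; the
    # features and data are illustrative, not the paper's.
    import numpy as np
    from sklearn.svm import OneClassSVM

    # Rows: [fan_out, live_cycles, masking_score] of SDC-prone instructions.
    X_sdc = np.array([[4, 120, 0.10], [3, 90, 0.20], [5, 150, 0.05],
                      [4, 110, 0.15], [6, 140, 0.10]])
    clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_sdc)

    candidates = np.array([[5, 130, 0.10],   # resembles the SDC-prone set
                           [1, 10, 0.90]])   # short-lived, heavily masked
    print(clf.predict(candidates))  # +1 = predicted SDC-prone, -1 = not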
Article
Silent data corruptions (SDCs) have always been regarded as a serious effect of radiation-induced faults. Traditional solutions based on redundancy are very expensive in terms of chip area, energy consumption, and performance. Consequently, providing low-cost and efficient approaches to cope with SDCs has received researchers’ attention more than ever. On the other hand, identifying SDC-prone data and instructions in a program is a very challenging issue, as it requires time-consuming fault injection into different parts of a program. In this article, we present a cost-efficient approach to detecting and mitigating SDCs in the whole program in the presence of multibit faults, without a fault injection process. This approach uses a combination of machine learning and a metaheuristic algorithm that predicts the SDC event rate of each instruction. The evaluation results show that the proposed approach provides a high detection accuracy of 99% while incurring a low performance overhead of 58%.
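
The two-stage structure described here (a learned per-instruction SDC-rate predictor plus a metaheuristic) can be sketched with a stand-in model and random hill climbing over the protection set under an overhead budget; the features, weights, and costs below are all invented for illustration.

    # Stage 1: a stand-in for a trained SDC-rate predictor.
    import random

    def predicted_sdc(features):
        fan_out, live_cycles, masking = features
        return 0.05 * fan_out + 0.001 * live_cycles - 0.1 * masking

    # Stage 2: random hill climbing over which instructions to protect,
    # maximizing predicted SDC coverage under an overhead budget.
    def hill_climb(instrs, costs, budget, steps=2000):
        def score(sel):
            if sum(costs[i] for i in sel) > budget:
                return -1.0                  # infeasible selection
            return sum(predicted_sdc(instrs[i]) for i in sel)
        best = set()
        for _ in range(steps):
            cand = set(best)
            cand ^= {random.choice(list(instrs))}  # toggle one instruction
            if score(cand) > score(best):
                best = cand
        return best

    instrs = {"i1": (4, 120, 0.1), "i2": (1, 10, 0.9), "i3": (5, 150, 0.0)}
    costs = {"i1": 0.2, "i2": 0.05, "i3": 0.3}
    print(hill_climb(instrs, costs, budget=0.4))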