Appears in Proceedings of the 38th International Conference on Dependable Systems and Networks (DSN), June 2008.
Using Likely Program Invariants to Detect Hardware Errors
Swarup Kumar Sahoo, Man-Lap Li, Pradeep Ramachandran,
Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu
Abstract
In the near future, hardware is expected to become increas-
ingly vulnerable to faults due to continuously decreasing
feature size. Software-level symptoms have previously been
used to detect permanent hardware faults. However, they cannot detect a small fraction of faults, which may lead to Silent Data Corruptions (SDCs). In this paper, we present a system
that uses invariants to improve the coverage and latency of
existing detection techniques for permanent faults. The ba-
sic idea is to use training inputs to create likely invariants
based on value ranges of selected program variables and
then use them to identify faults at runtime. Likely invariants,
however, can have false positives which makes them chal-
lenging to use for permanent faults. We use our on-line di-
agnosis framework for detecting false positives at runtime
and limit the number of false positives to keep the associ-
ated overhead minimal. Experimental results using micro-
architecture level fault injections in full-system simulation
show 28.6% reduction in the number of undetected faults
and 74.2% reduction in the number of SDCs over existing
techniques, with reasonable overhead for checking code.
1. Introduction
As CMOS feature sizes continue to decrease, hardware re-
liability is emerging as a major bottleneck to reap the ben-
efits of increasing transistor density in microprocessor de-
sign. Chips in the field are expected to see increasing failure
rates due to permanent, intermittent, and transient faults, in-
cluding wear-out, design defects, soft errors, and others [2].
The traditional approach in microprocessor design of pre-
senting an illusion of a failure-free hardware device to soft-
ware will become prohibitively expensive for commodity
systems. Traditional solutions such as dual modular redun-
dancy for tolerating hardware errors incur very high over-
heads in performance, area, and power. Recent hardware solutions such as variations on redundant multithreading improve on this, but still incur significant overheads [27].

(This work is supported in part by the Gigascale Systems Research Center (funded under FCRP, an SRC program), the National Science Foundation under Grants NSF CCF 05-41383, CNS 07-20743, and NGS 04-06351, an OpenSPARC Center of Excellence at the University of Illinois at Urbana-Champaign supported by Sun Microsystems, and an equipment donation from AMD.)
Recently, researchers have investigated using software-
visible symptoms to detect hardware errors [5, 9, 19, 24, 25,
26, 30, 32]. While much of that work focuses on transient
or intermittent faults that last a few cycles (e.g., 4 cycles
or less), we have explored using these symptoms to detect
permanent faults in hardware [9].
Using software-level symptoms to detect permanent faults
in hardware has several benefits over traditional hardware-
level solutions. First, using software-level symptoms deals
with only those errors that actually affect software correct-
ness. The rest of the faults are safely ignored, potentially
reducing the incurred overhead due to detection and recov-
ery. Second, the reliability targets for the system under consideration dictate the overheads the system can tolerate to achieve those targets. Because software-level symptom detectors are highly customizable, they facilitate seamlessly exploring these trade-offs between reliability and overhead.
We proposed a system design called SWAT, a firmware-
level low-overhead reliability solution that could potentially
handle multiple sources of hardware failures [9] using soft-
ware symptoms such as fatal hardware traps, software hangs,
abnormal application execution, and high OS activity. Im-
plementing these detectors in a thin firmware layer would
present significantly lower hardware cost than using tradi-
tional circuit-level hardware detectors. These detectors help
identify over 95% of hardware faults in many structures. Ad-
ditionally, 86% of these detected faults can be recovered using hardware checkpointing schemes, while all of the detected faults are software recoverable [9].
Nevertheless, using these simple symptoms as detectors
results in an SDC rate of 0.8% for permanent hardware faults
in the current SWAT system, which may not be acceptable
for most systems. This motivates the use of more sophisti-
cated detectors to further reduce this SDC rate and increase
detection coverage. In addition, using more sophisticated de-
tectors has the potential of reducing the detection latency of
the detected faults, making more faults amenable to hard-
ware recovery. Recovery through hardware checkpointing techniques, which can handle detection latencies of up to 100K cycles [28], is more attractive than recovery through software checkpointing techniques, as it facilitates seamless recovery of both the application and the OS in the event of a fault with much lower overhead.
In this work, we extend the set of symptom-level detec-
tors in SWAT to include program-level invariants that are
derived from program properties observed during program
execution. We use “likely program invariants” which have
been shown to be a powerful approach in detecting software
bugs [4, 6]. We derive likely program invariants by monitoring the execution of a program for different inputs and identifying program properties that hold on all such executions. (With simple compiler support, this "training" phase can be performed transparently during debugging runs for any program, and could even be extended into production runs with more sophisticated tools [11].)
A major drawback with using likely invariants for error de-
tection is that they may lead to false positives: some of the in-
ferred program invariants may be violated for an input as the
program behavior on that input is different compared with
the training inputs used to extract invariants. Hence, likely
program invariants have been proposed and used primarily
for analysis purposes such as program evolution [4], pro-
gram understanding [7], and detecting and diagnosing soft-
ware bugs [4, 6, 12, 33, 13]. The only exceptions have been
for detecting transient hardware faults, where a false positive
can be identified quickly and cheaply [22, 24, 3].
In this paper, we propose and evaluate a hardware-
assisted methodology to use likely invariants for detect-
ing permanent (or intermittent) hardware errors safely. The
SWAT system has a hardware-assisted diagnosis framework
and we adapt it to detect false positives at runtime. We also
limit the number of false positives in a novel way to keep the
associated overhead due to false positive detection low. Us-
ing the principles discussed above, we designed the iSWAT
framework for invariant detection and enforcement, and we
implemented it as an extension of the SWAT system [9].
The contributions of this work are:
- We demonstrate a new hardware-supported strategy for using unsound program invariants to detect permanent hardware errors. We believe this is the first work to use unsound invariants for such errors.
- We show that likely invariants can be extracted efficiently in software for realistic programs, unlike previous work which used only toy benchmark programs [22]. Furthermore, because of our tolerance for false positives, we only need 12 inputs for extracting our invariants while others have used hundreds of inputs [4, 22].
- We provide a realistic and comprehensive evaluation with full-system simulation by injecting faults into different micro-architectural structures. Such faults present more realistic fault scenarios than the previously studied application-level fault injections.
- The most important outcome from our experiments is that our technique reduces SDCs by 74.2%: fewer than 0.2% of all fault injections are now SDCs.
In more detail, our experimental results show that the
number of undetected faults in iSWAT decreases by nearly
28.6% compared with the base SWAT system. The number of SDCs reduces from 31 to 8 (i.e., a 74.2% reduction). The number of detections that are hardware recoverable (with latency less than 100K instructions) improves slightly, by 2%. The mean overhead of the invariant checking code is low: 14% on an UltraSPARC-IIIi machine and only 5% on an AMD Athlon machine. Moreover, this work is just a first
step using one simple style of invariants. These results show
that using likely invariants is a promising way to improve
overall reliability, at a low cost.
The rest of this paper is organized as follows. Section 2
provides a brief overview of likely invariants. In Section 3,
we describe the iSWAT System in detail, explaining how
we exploit the diagnosis module to detect false positives
caused by the invariants. Section 4 discusses the evaluation
methodology, the results of which are discussed and ana-
lyzed in Section 5. Related work is discussed in Section 6.
Section 7 draws conclusions and implications from our ex-
perience with the iSWAT framework and discusses future
work.
2. Invariant-Based Error Detection
In this section, we provide some background on likely pro-
gram invariants and then discuss the particular type of likely
invariants we use to detect permanent faults.
2.1 Likely Program Invariants
A program invariant at a particular program point P is a property that is guaranteed to hold at P on all executions of the program. Static analysis is the most common method to extract such sound invariants. A combination of an offline invariant-extraction pass and static analysis or theorem-proving techniques has also been suggested to extract sound invariants [20]. However, current techniques are not scalable enough to generate sound invariants for real programs [20]. Also, they cannot identify algorithm-specific properties that are not explicit in the code (e.g., some inputs are always positive).
Likely Program Invariants are properties involving pro-
gram values that hold on many executions on all observed
inputs and are expected to hold on other inputs. However,
they are unsound invariants which may not hold on some in-
puts. Extracting likely program invariants is easier than ex-
tracting sound invariants as we do not need expensive static
analysis methods to prove program properties and can iden-
tify algorithm specific properties. The extraction can be done
either online or offline. In the online version, invariants are extracted and used during program execution in the production
runs. Online extraction can present unacceptable overheads
to program execution, and may in fact be infeasible without
hardware support. The offline version, on the other hand, ex-
tracts invariants in a separate pass during program testing or
debugging, and these generated invariants can be used later
during the production runs. During the testing phases of soft-
ware development, the extra overhead of invariants extrac-
tion can be tolerated. This makes offline invariant extraction
a powerful method, allowing the use of more complex in-
variant mining techniques than would be feasible with on-
line methods. With compiler support, this “training” phase
can be done transparently at development time.
We can broadly classify likely program invariants into
three categories. Value-based invariants specify properties
involving only program values, and can be used for a va-
riety of tasks including software bug detection, program understanding, and program refactoring [4, 11, 7, 12, 6].
Control-flow-based invariants specify properties of the con-
trol flow of the program, and have been used previously to
detect control-flow errors due to transient faults [30, 29, 5].
PC-based invariants specify program properties involving
program counter values, and have been proposed for detect-
ing memory errors in programs during debugging [33].
2.2 Range-Based Invariants
One of our main goals in exploring the use of invariants to
detect permanent faults is to improve the coverage of detec-
tion and reduce the number of SDCs. Since SDCs are typ-
ically caused by erroneous values written to output, we ex-
plore the use of value-based invariants to detect permanent
faults. The other two classes of invariants can detect control-
flow or memory errors, which generally result in anomalous
software behavior that can be detected by the other detectors
in SWAT. For example, erroneous control-flow typically re-
sults in a crash which can be caught by the FatalTrap symp-
tom in SWAT [9]. In contrast, we expect value-based invari-
ants to capture deviations of values that do not result in any
significant change of program behavior to cause an applica-
tion or OS crash, but may still result in incorrect output.
As a first step towards using likely program invariants
for permanent hardware faults, we use a particular form of
value-based invariants known as range-based invariants. A
range-based invariant on a program variable xwill be of
the form [MIN, MAX], where MIN and MAX are constants
inferred from offline training such that M I N xM AX
is true for all the training runs.
These range-based invariants are suitable for error detec-
tion for various reasons. These types of invariants can be eas-
ily and efficiently generated by monitoring program values.
They are also composable: the invariants can be generated for each training input separately and then combined to generate invariants for the complete training set. These invariants are also much easier to enforce within the checking code than other forms of invariants, as they are simple and involve a single data value. In ongoing work, we are exploring a broader class of invariants.
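As a concrete illustration of the composability mentioned above, the following sketch (in C, with purely illustrative names and data; it is not the paper's actual tool) merges the per-input [MIN, MAX] ranges of each monitored store site by taking the element-wise minimum and maximum:

#include <float.h>
#include <stdio.h>

#define NUM_INVARIANTS 3                 /* one per monitored store site (illustrative) */

typedef struct { double min, max; } Range;

/* Widen the combined range so that it also covers the range seen on one training input. */
static void merge_range(Range *combined, const Range *per_input) {
    if (per_input->min < combined->min) combined->min = per_input->min;
    if (per_input->max > combined->max) combined->max = per_input->max;
}

int main(void) {
    /* Ranges observed for the same three store sites on two training inputs. */
    Range input_a[NUM_INVARIANTS] = { {0, 100}, {-5, 5}, {1, 1} };
    Range input_b[NUM_INVARIANTS] = { {10, 250}, {-2, 8}, {0, 1} };

    for (int i = 0; i < NUM_INVARIANTS; i++) {
        Range combined = { DBL_MAX, -DBL_MAX };   /* empty range before merging */
        merge_range(&combined, &input_a[i]);
        merge_range(&combined, &input_b[i]);
        printf("invariant %d: [%g, %g]\n", i, combined.min, combined.max);
    }
    return 0;
}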
3. The iSWAT Detection Framework
We implement the above described range-based invariants
as an extension to the existing SWAT System [9] to build
the iSWAT system that uses likely program invariants as an
additional software-level symptom to detect hardware faults.
3.1 Overview of SWAT System
The SWAT system uses low-overhead software-level symp-
toms to detect the presence of an underlying hardware fault, and exploits a firmware-assisted diagnosis and re-
covery module to recover the system from multiple sources
of faults [9]. While this paper targets permanent hardware
faults, our methods, similar to SWAT, also extend readily to
detect transient faults.
SWAT assumes a multi-core architecture under a single
fault model where a fault-free core is always available. The
system also assumes support for a checkpoint/rollback/re-
play mechanism and a firmware layer that lies between the
processor and the OS to monitor and control such mecha-
nisms.
Detection: SWAT uses four low-cost symptom-based detec-
tion mechanisms that require little new hardware or software
support. These mechanisms look for anomalous software ex-
ecution as symptoms of possible hardware faults. We briefly
describe them below; the details can be found in [9].
1. FatalTrap: Fatal hardware traps are those traps (caused
by either the application or the OS) that do not occur
during fault-free execution. In Solaris, examples of fatal traps include the RED (Reset, Error, and Debug) state trap and the Data Access Exception trap.
2. Abort-App: These indicate instances of a segmentation
fault or illegal operation, when the OS terminates the
application with a signal. In such cases, the OS informs
the detection framework that the application performed
an illegal operation, leading to a detectable symptom.
3. Hangs: Application and OS hangs are other common
symptoms of hardware faults [19]. SWAT uses a low
hardware-overhead heuristic hang detector, based on (of-
fline) application profiling to detect hangs with high fi-
delity.
4. High OS activity: In normal executions, on a typical invocation of the OS, control returns to the application after a few tens of OS instructions, except in cases such as timer interrupts or I/O system calls. As a symptom of abnormal behavior, SWAT looks for abnormally long contiguous runs of OS instructions to indicate the presence of an underlying fault (a sketch of such a check follows this list).
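As a rough sketch of how the last symptom could be checked (the threshold and interface are illustrative assumptions, not SWAT's actual firmware interface):

#include <stdbool.h>
#include <stdint.h>

#define OS_INSTR_THRESHOLD 30000   /* illustrative cutoff for "abnormally high" contiguous OS instructions */

static uint64_t contiguous_os_instrs = 0;

/* Called once per retired instruction with its privilege mode.
   Returns true when the high-OS-activity symptom should be raised. */
bool high_os_activity_symptom(bool in_os_mode, bool in_exempt_handler) {
    if (!in_os_mode || in_exempt_handler) {   /* application code, timer interrupts, I/O system calls */
        contiguous_os_instrs = 0;
        return false;
    }
    return ++contiguous_os_instrs > OS_INSTR_THRESHOLD;
}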
Diagnosis: After fault detection, the diagnosis framework is invoked to determine whether the source of the fault is a software bug, a transient fault, or a permanent fault. The diagnosis component rolls back to the last checkpoint and replays the execution on the original core. If the symptom does not recur, it infers a transient fault and continues execution. However, if the symptom recurs, it re-executes on another core (assumed to be fault-free) to distinguish software bugs (in which case the symptom would most likely recur on the fault-free core) from permanent hardware faults (in which case the symptom would not occur). The replay may need to be performed multiple times on the two cores to distinguish non-deterministic software errors. In the case
of a permanent fault, the diagnosis module also does micro-
architecture level diagnosis to identify the faulty microarchi-
tectural structure [10].
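The diagnosis decision procedure can be summarized by the following sketch (a simplification of the flow described above; the helper functions are placeholders for firmware-controlled actions, and in practice the replays may be repeated to handle non-determinism):

typedef enum { TRANSIENT_FAULT, SOFTWARE_BUG, PERMANENT_FAULT } Diagnosis;

extern int replay_on_original_core(void);    /* nonzero if the symptom recurs after rollback */
extern int replay_on_spare_core(void);       /* the spare core is assumed to be fault-free */
extern void microarch_level_diagnosis(void); /* identify the faulty microarchitectural structure [10] */

Diagnosis diagnose_symptom(void) {
    if (!replay_on_original_core())
        return TRANSIENT_FAULT;     /* symptom gone after rollback on the same core */
    if (replay_on_spare_core())
        return SOFTWARE_BUG;        /* symptom recurs even on a fault-free core */
    microarch_level_diagnosis();    /* symptom recurs only on the original core */
    return PERMANENT_FAULT;
}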
Recovery and Repair: For recovery, SWAT assumes some
form of checkpoint/restart mechanism that periodically
checkpoints the state of the system. Depending upon the re-
quirements, appropriate hardware checkpointing or software
checkpointing mechanisms, or a combination of both, can
be used to recover the system after detection. SafetyNet [28]
and Revive [18, 23] show reasonably low overhead for hard-
ware checkpoint/replay. In the event of a permanent hard-
ware fault, the component that is diagnosed as faulty can be
reconfigured or disabled.
3.2 The iSWAT system
The iSWAT system extends the above described SWAT sys-
tem to include the violation of likely program invariants as
possible symptoms that indicate the presence of a hardware
fault. These invariants are derived from “training” runs of
the application and the invariant checking code is embed-
ded into the application. iSWAT exploits the diagnosis and
recovery module of the SWAT system to detect and disable
false-positive invariants at run-time.
3.2.1 Generating Invariants and Invariant Checks
The iSWAT system leverages support from the compiler for
two distinct components: invariant generation and invariant
insertion. Both of these use the LLVM compiler infrastruc-
ture [8].
Invariant Generation: We use compile-time instrumenta-
tion to monitor program values during training runs in or-
der to generate likely program invariants. We can monitor
many different types of values, including load, store, return
and intermediate result values. For this work, we decided to
monitor only the store values as checking values stored to
memory has the most potential to catch faults, as all nec-
essary computations eventually pass their results to stores.
Also, monitoring only the stores helps us keep the overhead
of detection low. We monitor stored values of all integer
types (both signed and unsigned) of size 2, 4, and 8 bytes
as well as single and double precision floating point types.
We do not monitor integer stores of size 1 byte (character
data types), as they represent only a small range of values
and hence may be ineffective for detecting faults.
We do the code generation for invariant generation in two
steps. In the first step, we use LLVM-1.9 with llvm-gcc-3.4
to generate the LLVM bytecode and run an instrumentation
pass to insert calls to monitor the store values. Then we gen-
erate a C program from the LLVM bytecode file through the
LLVM C back end. Finally, we generate SPARC native code
through the Sun cc compiler. We use the generated program
to create the invariants for each input separately, which we
then combine to form the final invariants in another offline
pass.
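A minimal sketch of the runtime routine that such an instrumentation pass could call before each monitored store is shown below (the table layout, size limit, and function names are assumptions for illustration and are not the paper's actual LLVM pass; integer and floating-point values are folded into a double here for brevity):

#include <float.h>
#include <stdio.h>

#define MAX_INVARIANTS 4096          /* illustrative bound on static store sites */

static double inv_min[MAX_INVARIANTS];
static double inv_max[MAX_INVARIANTS];

void monitor_init(void) {
    for (int i = 0; i < MAX_INVARIANTS; i++) {
        inv_min[i] = DBL_MAX;        /* start with an empty range */
        inv_max[i] = -DBL_MAX;
    }
}

/* Inserted before each monitored store; inv_id identifies the static store instruction. */
void monitor_store(int inv_id, double value) {
    if (value < inv_min[inv_id]) inv_min[inv_id] = value;
    if (value > inv_max[inv_id]) inv_max[inv_id] = value;
}

/* Dump the per-input ranges at program exit; an offline pass later merges them. */
void monitor_dump(const char *path) {
    FILE *f = fopen(path, "w");
    if (!f) return;
    for (int i = 0; i < MAX_INVARIANTS; i++)
        if (inv_min[i] <= inv_max[i])
            fprintf(f, "%d %g %g\n", i, inv_min[i], inv_max[i]);
    fclose(f);
}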
Invariant Insertion: The invariants generated by the offline
pass then need to be inserted into the code to check the val-
ues being stored. For this, we take the invariant ranges from
the generation phase and then insert calls to the invariant
checking code at the LLVM byte-code level through another
compile-time instrumentation pass. Then, as in the invariant generation phase, we generate a C program from the LLVM bytecode through the LLVM C back end and, finally, use the Sun cc compiler to generate native code.
3.2.2 Handling False-Positive Invariants
False positives are a major shortcoming of likely program invariants, as detecting them may incur high overheads in the presence of permanent faults. For transient hardware faults, relatively low-cost techniques such as a pipeline flush can help tolerate false positives [24]: if the violation occurs again after the pipeline flush, the invariant is updated, and this is done cheaply using hardware support. In contrast, for
a framework like iSWAT System that supports permanent
fault detection, more expensive rollback and replay on the
same and a fault-free core is needed to detect false positives.
Too many false positive detections may thus lead to exorbi-
tant overheads.
While training with many inputs can potentially make false positives rare, the ranges may become too broad, rendering the invariants ineffective for error detection. In our
framework, we propose to train with a set of inputs such that
the false positive rate is sufficiently low. In general, the num-
ber of false positives can be used to guide how many inputs
to use for training.
iSWAT leverages the existing diagnosis framework in the SWAT system [10] to detect the remaining false positives at run-time. In the event of an invariant violation, we perform the full rollback and replay on the same core and on a fault-free core. If the violation occurs even on the fault-free core, then it is a false-positive invariant (or a software bug). Since too many rollback/replays due to false positives can cause large overheads, we disable a static invariant once it results in a false positive during dynamic execution (online updating of invariants would add too much overhead for our software-only technique). In this way, if the total number of static invariants is I, the maximum number of rollbacks possible will also be I, limiting the overhead incurred due to false positives.
Currently, the disable operation is done inside the invariant checking code and does not need any extra hardware support. We maintain a table with one entry per invariant, indexed by the invariant id, to identify false-positive invariants. If an invariant is detected as a false positive, its table entry is marked in the invariant checking code, disabling that invariant in all later executions. Figure 1 shows a template of the actual invariant checking code.

if ((value < min) or (value > max)) {    // Invariant violated
  if (FalsePosArray[InvId] != true) {    // Not a false positive
    if (detectFalsePos(InvId) == true)   // Perform diagnosis
      FalsePosArray[InvId] = true;       // Disable the invariant
  }
}

Figure 1. Invariant checking code template
The overhead caused by false positives in an invariant-
based approach depends on the number of rollbacks, which
in turn depends on the number of static false positive invari-
ants. The false positive results, presented in Section 5, indi-
cate that the overhead for our set of applications is negligible
compared to their runtimes.
4. Methodology
4.1 Simulation Environment
For the fault injection experiments, we used a full-system simulation environment comprising the Virtutech Simics full-system simulator [31] with the Wisconsin GEMS timing models for the microarchitecture and the memory [14], as in [9]. These simulators provide cycle-accurate microarchitecture-level timing simulation of real workloads running on a real operating system (full Solaris-9 on the SPARC V9 ISA) on a modern out-of-order superscalar processor and memory hierarchy (Table 1).
We exploit the timing-first approach of the GEMS+Simics
infrastructure [15] to inject microarchitecture-level faults. In this approach, GEMS and Simics compare their full architectural states after each instruction, and in case of a mismatch the GEMS state is updated with the Simics state. We leveraged this checking mechanism for our fault injection. Faults are injected into GEMS's micro-architectural state and allowed to propagate. After a mismatch between Simics and GEMS, if the mismatch is found to be caused by the injected fault, we copy the faulty architectural state from GEMS to Simics to make sure Simics follows the same corrupted execution path as GEMS. Otherwise, the GEMS state is updated as usual. When we find that the architectural state is corrupted, we say that the fault has been activated.
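The per-instruction comparison described above can be sketched as follows (a simplification; the state type and helper functions are placeholders, not the actual GEMS/Simics interfaces):

#include <stdbool.h>

typedef struct ArchState ArchState;   /* opaque architectural register/memory state */

extern bool states_match(const ArchState *gems, const ArchState *simics);
extern bool mismatch_caused_by_injected_fault(const ArchState *gems, const ArchState *simics);
extern void copy_state(ArchState *dst, const ArchState *src);

static bool fault_activated = false;

/* Called after each retired instruction in the timing-first simulation. */
void check_after_instruction(ArchState *gems, ArchState *simics) {
    if (states_match(gems, simics))
        return;
    if (mismatch_caused_by_injected_fault(gems, simics)) {
        copy_state(simics, gems);      /* make Simics follow the corrupted execution path */
        fault_activated = true;        /* the fault has corrupted architectural state */
    } else {
        copy_state(gems, simics);      /* ordinary timing-first correction of GEMS */
    }
}

bool fault_was_activated(void) { return fault_activated; }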
4.2 Fault Model
In the current work, we focus on permanent or hard faults.
The well-established stuck-at-0 and stuck-at-1 fault models, as well as the dominant-0 and dominant-1 bridging fault models, are used for modeling permanent hardware faults in this paper. The stuck-at fault models model a fault in a single bit, while the bridging fault models model faults that affect adjacent bits. The dominant-0 bridging fault acts like a logical-AND operation between the adjacent bits that are marked faulty, whereas the dominant-1 bridging fault acts like a logical-OR operation.
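The following sketch shows how the four fault models can be applied to a latch value (a hypothetical helper for illustration, not the simulator's code; it assumes the faulty bit position is given and, for the bridging models, that the adjacent bit is the next higher bit):

#include <stdint.h>

typedef enum { STUCK_AT_0, STUCK_AT_1, DOMINANT_0, DOMINANT_1 } FaultModel;

/* Apply a fault model to 'value' at bit position 'bit' (0 = least significant).
   For the bridging models, the faulty bit is combined with its neighbor at bit+1,
   so 'bit' is assumed to be below the most significant bit position. */
uint64_t apply_fault(uint64_t value, int bit, FaultModel model) {
    uint64_t b  = (value >> bit) & 1u;
    uint64_t nb = (value >> (bit + 1)) & 1u;   /* adjacent bit for bridging faults */
    uint64_t faulty;

    switch (model) {
    case STUCK_AT_0: faulty = 0;      break;
    case STUCK_AT_1: faulty = 1;      break;
    case DOMINANT_0: faulty = b & nb; break;   /* logical AND with the adjacent bit */
    case DOMINANT_1: faulty = b | nb; break;   /* logical OR with the adjacent bit */
    default:         faulty = b;      break;
    }
    return (value & ~(1ULL << bit)) | (faulty << bit);
}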
Base Processor Parameters
Frequency 2.0GHz
Fetch/decode/execute/retire rate 4 per cycle
Functional units 2 Int add/mult, 1 Int div
2 Load, 2 Store, 1 Branch
2 FP add, 1 FP mult, 1 FP div/Sqrt
Integer FU latencies 1 add, 4 multiply, 24 divide
FP FU latencies 4 default, 7 multiply, 12 divide
Reorder buffer size 128
Register file size 256 integer, 256 FP
Unified Load-Store Queue Size 64 entries
Base Memory Hierarchy Parameters
Data L1/Instruction L1 16KB each
L1 hit latency 1 cycle
L2 (Unified) 1MB
L2 hit/miss latency 6/80 cycles
Table 1. Parameters of the simulated processor
Microarchitecture structure Fault location
Instruction decoder Input latch of one of the decoders
Integer ALU Output latch of one of the integer ALUs
Register bus Bus on the write port to the register file
Physical integer register file A physical reg in the integer register file
Reorder buffer (ROB) Source/dest reg num of instr in ROB entry
Register alias table (RAT) Logical-to-physical map of a logical register
Address gen unit (AGEN) Virtual address generated by the unit
FP ALU Output latch of one of the FP ALUs
Table 2. Microarchitectural structures in which faults are in-
jected.
The microarchitectural structures and locations where the
faults are injected are listed in Table 2. For each structure, a fault is injected at each of 40 random points in each application (after the initialization phase of each application is over). For each application injection point, we perform an injection for each of the 4 fault models (two stuck-at and two bridging faults). The injections are performed on a randomly chosen bit in the given structure. This gives a total of 800 fault injection simulation runs per microarchitectural structure (5 applications × 40 points per application × 4 fault models) and 6,400 total injections across all 8 structures.
After a fault is injected, we run the simulation for 10
million instructions. Note that the fault is maintained for the
rest of the 10M instruction window. For the small number of runs where an activated fault is not detected within this window, we use functional (full-system) simulation to run the application to completion (as detailed simulation is too slow to run to completion) in order to evaluate masking due to the application, and SDCs. The functional simulation does not inject any faults beyond the first 10M instructions, resulting in the fault acting like an intermittent fault that is active only in the 10 million instruction window. We believe that 10M instructions is long enough that the simulation reflects the behavior of permanent faults.
4.3 Fault Detection Techniques Used
We show the effectiveness of our invariant-based approach
by evaluating invariants in conjunction with the four low-
cost detection mechanisms built into the base SWAT system.
This is more realistic than studying only detections by invariants, as the other techniques have lower overhead (and need very little hardware/software support), and it shows the impact of the new technique in a realistic system.
4.4 Fault Metrics
When the fault causes a corruption in the architectural state
of the processor, we say it is activated. If the fault is never
activated, we say the fault is architecturally masked. An
activated fault which is undetected, but does not cause any
corruption in the output produced by the application is said
to be application masked.
We used five metrics to evaluate the impact of the new
detection technique:
1. Coverage: The percentage of non-masked faults that are detected in the 10M instruction window. We refer to the percentage of non-masked faults that remain undetected as the unknown-fraction (see the formulas after this list).
2. Latency: The total number of instructions retired from
the first architecture state corruption (of either OS or
application) until the fault is detected by one of the above
techniques.
3. SDCs: The number of undetected faults that silently corrupt the output of the application.
4. False positives: The total number of false positive invari-
ants.
5. Overhead: The overhead of the invariant checking code
as a percentage of original execution time, measured in
fault-free run.
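Restating the first metric explicitly (our formulation of the definitions above, where "masked" covers both architecturally masked and application-masked faults):

    Coverage = detected faults / (total injected faults − masked faults) × 100%
    unknown-fraction = 100% − Coverage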
4.5 Applications
For the experiments, we used five SPEC CPU 2000 benchmarks: four SPECint benchmarks (gzip, bzip2, mcf, parser) and one SPECfp benchmark (art). For most of the other SPEC benchmarks, we could not collect sufficient training inputs, while we could not compile and run the remainder in our simulator.
Previous work on invariants uses toy Siemens bench-
marks because many inputs are available for these bench-
marks [4, 22]. We use more realistic applications which
makes it much harder for us to obtain valid inputs for ex-
periments. Nevertheless, obtaining inputs will not be a prob-
lem in practice as developers test their programs on many
inputs during the testing phase. Invariant generation and in-
sertion can be easily done during testing through a compile-
time pass. The “test” and “train” input sets formed part of
our training set. Different techniques were used to generate
more inputs depending on the benchmarks. For three bench-
marks (gzip, bzip2, and parser), we collected random inputs from external sources. For mcf, a script was used to generate random inputs, while for art, different input options were used to generate invariants. Since the inputs were predominantly generated randomly, the inputs used for training were significantly different from the reference inputs, which we used for testing the false positives, coverage, latency, etc.

Figure 2. Variation of the false positive rate with different numbers of training inputs. The rate is <5% with 12 training inputs, motivating the use of 12 inputs for the rest of our experiments.
5. Experimental Results
In this section, we present our experimental results evaluating the effectiveness of using likely program invariants to detect permanent hardware faults. All the injected FPU faults were architecturally masked in all the applications except the one floating point benchmark (art). We have therefore excluded the FPU results from the reported results, except as otherwise noted.
We subject the same application binary (instrumented
with invariant detection) to faults under both the SWAT and
the iSWAT systems. We use the same binary in both cases to
obtain a valid coverage comparison between the two cases
as the behavior of faults (i.e., whether they are masked, or
detected, or become SDCs) depends on both static code lay-
out and dynamic instruction sequence. In the SWAT sys-
tem, since invariants are not monitored, the system ignores
the violation of any invariants and continues execution. The
iSWAT framework, on the other hand, invokes the diagnosis module on an invariant violation to determine whether it is a false positive. If a false positive is detected, execution simply continues (and the invariant is disabled in the code); otherwise, a fault is detected. (In a real system, iSWAT should check for false positives on every invariant violation by invoking the rollback/recovery in the diagnosis module. However, since we have the ref input available, we currently identify the false-positive invariants using an offline fault-free run, and during the faulty run the diagnosis module uses that information to detect false positives. In this way, we effectively mimic a real system.)
5.1 False Positives
We first evaluate the effect of training with different training
sets on the number of false positives.
We define false positive rate to be the fraction of false
positive invariants as a percentage of total number of static
invariants. Figure 2 shows the variation of the false positive rate for our five applications running on the ref input, as the number of training inputs is increased from 2 to 12.
As expected, false positive rate decreases as the number
of inputs increases. By 12 inputs, the rate of false positives is
less than 5% for all applications and 0% for three. This false
positive rate is sufficiently low for our purpose, motivating
us to use 12 training inputs for all of our experiments. In previous work using the Siemens benchmarks [4, 22], hundreds of inputs were used for training. We find that a much smaller number of training inputs suffices for permanent fault detection with our approach, as our techniques can tolerate more false positives.
The maximum number of static invariants across all applications was 231. Assuming each false positive detection has an overhead of 1M instructions (conservatively computed considering overheads due to checkpoint/replay and context migration), the maximum overhead of false positive detection on any input will be only 231M instructions, which is negligible compared to the application runtimes. In practice, the overhead will be even lower due to the low false positive rates.
Interestingly, Figure 2 shows that after just four inputs, fewer than 10% of the invariants are false positives for four of the applications. These results show that likely invariants
generated from many inputs will have sufficiently few false
positives to be usable for permanent fault detection.
5.2 Coverage
Here, we present the improvements in fault coverage achieved
by the iSWAT system (using 12 inputs for training in-
variants) over the SWAT system, evaluated using micro-
architecture-level fault injections.
Table 3 presents the improvements offered by iSWAT
over the baseline SWAT system to detect permanent hard-
ware faults. Each column shows the number of fault in-
jections that result in different outcomes (both the absolute
number and as a percentage of total number of fault injec-
tions) and the last column shows the unknown-fraction. The
first two columns represent faults that are masked by the
architecture (Arch-Mask), and the application (App-Mask).
The Unknown column represents the fraction of faults that
are not detected within 10M instructions in each of the sys-
tems. The rest of the columns represent faults that are de-
tected by each of the detection mechanisms in 10M in-
structions, using the detection methods described previously
(Section 3). We do not show Abort-App in the table, as it did not detect any faults.
Symptoms    App-Mask    Arch-Mask   Fatal-Trap-App  Fatal-Trap-OS  Hang-App   Hang-OS    INV         High-OS     Unknown     Unknown-fraction
SWAT (%)    293 (5.2%)  1090 (20%)  1252 (22%)      1421 (25%)     47 (0.8%)  16 (0.3%)  -           1305 (23%)  168 (3.0%)  (4.0%)
iSWAT (%)   288 (5.2%)  1090 (20%)  1187 (21%)      1357 (24%)     29 (0.5%)  15 (0.3%)  325 (5.8%)  1181 (21%)  120 (2.1%)  (2.8%)

Table 3. Improvement in coverage of iSWAT over SWAT for permanent faults. The percentages are computed using the total number of fault injections as the baseline. Invariants are effective in catching faults that escape the traditional detection techniques in SWAT, and sometimes catch the same faults earlier, resulting in a reduced 2.8% unknown-fraction compared to 4% for the SWAT system.

Three points can be observed from this table. First, invariant detection catches nearly 5.8% of the total fault injections. Second, invariant detection catches some faults that are not detected by the traditional symptoms and resulted in unknowns in SWAT, giving a 28.6% reduction in unknown cases, from 168 in the SWAT system to 120 in iSWAT. Third, the iSWAT invariants detect some faults (about 5% of total fault injections) that are caught by the other symptoms in SWAT, but at a lower latency; the number of detections by the other symptoms in iSWAT is therefore lower than in SWAT. This result leads to a small improvement in detection latency, as we show in Section 5.4. The overall coverage of the iSWAT system is 97.2%.

Microarchitecture structure      SWAT    iSWAT   Reduction
Instruction decoder              0.7%    0.6%    16.7%
Integer ALU                      7.8%    6.2%    20.5%
Register bus                     4.97%   2.6%    48.3%
Physical integer register file   12.8%   8.5%    33.7%
Reorder buffer (ROB)             0.9%    0.9%    0.0%
Register alias table (RAT)       2.0%    2.2%    -9.6%
Address gen unit (AGEN)          2.4%    1.3%    46.1%
Total                            4.0%    2.8%    28.7%

Table 4. Reduction in the Unknown category for each microarchitectural structure.

           Unknown   Seg fault   Other signals   No output   SDC
SWAT       168       102         7               28          31
iSWAT      120       85          3               24          8

Table 5. Breakdown of the Unknown category after the completion runs. The "No output" category includes OS hangs, application hangs, and OS crashes.
Detection using Invariants: In order to understand the ef-
fectiveness of these invariants to detect faults in different
micro-architectural structures, we categorize the unknowns
in the two systems in Table 4. For each structure injected
with faults, the table shows the corresponding percentage of
non-masked faults that result in unknowns in the SWAT, and
iSWAT system, along with the percentage reduction in the
unknowns. The “Total” row shows the aggregate numbers.
These results show that invariants are most effective for
detecting faults in the integer ALU, register databus, integer
register, and AGEN units. These correspond to faults that
affect store values, without significantly perturbing the con-
trol and data flow. Invariants are not effective for the decode,
ROB, and RAT units. Faults in these units perturb program
control flow, and do not directly affect values that the invari-
ants monitor. Faults in these structures are also very likely
to cause the invariant checking itself to be done incorrectly. Fortunately, there are very few remaining unknown cases in these units. Faults in the RAT show an increased unknown rate in the iSWAT system because some faults that are masked by the application in SWAT are detected by the invariants in
iSWAT. These are real hardware faults which affect program
values, but are masked at the application level.
Invariants detect all the unknown cases for FPU faults. Thus, the overall unknown-fraction decreases from 4.2% to 2.8% if we include the FPU. However, more floating
point applications are needed to draw any conclusions.
5.3 SDCs
A small fraction of faults still result in unknown outcomes in the iSWAT system (2.8% of the non-masked faults) after 10M instructions. After 10M instructions of detailed timing simulation, we ran the unknown cases (for all structures
but the FPU) to completion in functional simulation mode
to evaluate how many of the unknown cases result in SDCs.
In this mode, faults are not injected during execution due
to lack of micro-architectural details in the functional sim-
ulator. Also, in functional mode, invariant checks are not
enforced by the iSWAT system as we do not have the diag-
nosis framework support to detect false positives caused by
invariants. Hence, our reported SDC numbers are conserva-
tive estimates of realistic SDC numbers.
We refer to the cases which result in the same output as App-Mask and to the rest of the cases as unknown. Ta-
ble 5 shows the breakdown of the total number of unknown
cases according to the results after completion. The next two
columns show segmentation faults and application termina-
tions due to other signals. Executions that produce no output
due to an application hang, an OS hang or OS Crash (indi-
cated by timing out the execution after a long duration) fall
under the No output category. Finally, the cases that result in
undetected faults that corrupt application outputs are shown
under the SDC category.
Overall, the number of SDCs in the iSWAT system is significantly lower than in SWAT. The invariants reduce the SDCs by 74%, from 31 to 8. We consider the reduction in SDCs
as the most important contribution of the invariants. Though
a few SDCs remain, we believe that more sophisticated in-
variants can make the SDC cases negligible. The number of cases detected through the other categories also decreases by 27 in iSWAT; these correspond to faults detected by invariants before the application or OS could crash through a signal or hang.
Analysis of SDCs: To do an in-depth analysis of why invariants do not detect some of the SDC cases, we moved the invariant checks into the simulator. In this way, we can observe the monitored values and other information, which is not possible when the checks are in the application code.
To move the invariant checks into the simulator, we perform an instrumentation pass that stores the monitored invariant values to known memory locations. The simulator reads the invariant ranges from a file. When it finds a store to one of the known memory locations, it can determine the corresponding invariant from the memory address and perform the bounds checking.
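A sketch of the simulator-side checking logic just described (the structures, sizes, and hook name are illustrative, not Simics/GEMS interfaces; loading the table from the invariant file is omitted):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;      /* known memory location written by the instrumented store */
    double   min, max;  /* invariant range read from the file */
} SimInvariant;

#define NUM_SIM_INVARIANTS 231          /* illustrative; maximum static invariants in the paper */
static SimInvariant sim_inv[NUM_SIM_INVARIANTS];

/* Called by the simulator for every committed store. Returns true if an invariant
   bound to this address is violated by the stored value. */
bool check_store(uint64_t addr, double value) {
    for (int i = 0; i < NUM_SIM_INVARIANTS; i++) {
        if (sim_inv[i].addr == addr)
            return value < sim_inv[i].min || value > sim_inv[i].max;
    }
    return false;                       /* not a monitored location */
}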
Table 6 shows the key results when the checks are done in the simulator. Overall, there is a 35% reduction in unknown cases and a 47% reduction in SDCs. We observe a smaller reduction in SDCs compared to Table 5. Thus, the SDC results appear to be sensitive to static code layout and dynamic instruction sequence, as these determine the instructions at which faults are injected and how the faults affect the architectural state.

         Unknown   SDC
SWAT     164       32
iSWAT    106       17

Table 6. Results when invariant checking is done in the simulator.
We analyzed the remaining 17 SDC cases by running both
the correct runs and fault injection runs and comparing the
monitored values. We made some interesting observations:
- In three cases out of 17 (all in mcf), very few invariants are checked after the arch-state mismatch. In fact, in one of these cases, the faulty run has far fewer checks than the correct run, as the control flow diverges completely to a part of the application with far fewer checks. In the other two cases, the correct run also has very few invariant checks, just like the faulty run.
- In two cases (in mcf and gzip) there are no mismatches of the monitored values inside the 10M window. iSWAT cannot detect them, as there are no checks after the 10M window.
- In most cases, control flow does not diverge significantly: it diverges for a short period and then merges back.
- Irrespective of whether the fault injection is in higher or lower order bits, almost all the value mismatches in the SDC cases were in lower order bits. So, simple range-based or control-flow invariants are unlikely to be effective for these cases. In fact, most of the value mismatches were in the lowest 3 bits. We will need other types of invariants/detectors for detecting mismatches in lower order bits.
5.4 Latency
Table 7 shows latency results for the faults detected by the
SWAT and the iSWAT systems, binned into various cate-
gories from under 1k instructions to under 10M instructions.
In order to perform a fair comparison, the numbers are presented as a percentage of the total number of faults detected by iSWAT (i.e., the number of detections in the <10M case).
The number of faults detected at a latency of under 1k
instructions shows the largest increase of about 2% (the rest
of the numbers are cumulative). This shows that the latency
of detection of invariants is significantly lower than that of
the other symptoms. This increases the number of faults
that are amenable to simple hardware recovery. Although
the latency benefits offered by iSWAT are not substantial
so far, using more sophisticated invariants may improve the effectiveness of iSWAT to reduce the latency.

Latencies   <1k     <5k     <10k    <50k    <100k   <500k   <1M     <5M     <10M
SWAT        41.1%   47.0%   50.7%   78.7%   81.0%   87.0%   90.3%   95.7%   98.7%
iSWAT       43.1%   49.6%   53.4%   81.2%   83.3%   89.2%   92.7%   97.7%   100.0%

Table 7. Detection latencies for SWAT and iSWAT. The percentages are computed using the number of detections in iSWAT with <10M as the baseline. The invariants increase the faults amenable to hardware recovery by 2%.

Figure 3. Overhead of invariants on an UltraSPARC-IIIi (sparc) machine and an AMD Athlon machine (x86).
5.5 Overhead
We evaluate the overhead of using invariants by running
the binary (with invariant checking) on fault-free hardware, using two machines: a Sun UltraSPARC-IIIi 1.2GHz machine with a 1MB unified L2 and 2GB RAM, and an AMD Athlon(TM) dual-core MP 2100+ machine with a 256KB L2 and 1.5GB RAM. The Sun machine is referred to as the Sparc machine, and the AMD one as the x86 machine in this section.
Figure 3 shows the overhead of using invariant checking in the programs as a percentage over the baseline program, which has no invariant checking. The geometric mean of
the overheads is also shown for the two machines.
The Sparc machine exhibits a higher overhead when running the invariant-checking code than the x86 machine, with the average overheads being 14% and 5%, respectively. In particular,
the overhead for the application mcf is significantly higher
in the Sparc machine (26%) than the x86 machine (2%). The
high overhead of the Sparc machine is likely due to its in-
ability to hide the cache misses and branch mispredictions
induced by these extra invariants. The x86 machine is able
to hide these latencies better, resulting in lower overheads.
In spite of these differences, the overheads produced by these invariant checks are acceptable given the increased coverage they provide, motivating the use of the iSWAT system for increased resilience.
6. Related Work
There is a growing body of work on using software-visible
symptoms to detect hardware errors. A number of papers
propose the use of control path signatures to detect control-
flow errors [1, 5, 21, 29, 30]. Wang and Patel propose us-
ing branch mispredictions, cache misses, and exceptions as
symptoms of faults [32]. Most of this work focuses on tran-
sient faults or intermittent faults, and does not handle perma-
nent faults (the exceptions are discussed below). Permanent
faults are important because of the expected increase in phe-
nomena such as wear-out, insufficient burn-in, and design
defects [2]. In our previous work [9], we used simple soft-
ware symptoms to detect both permanent as well as transient
faults and we extend that system in this paper.
Dynamically detected program invariants (likely invari-
ants [4]), which are inherently unsound, have been studied
for a wide range of analysis tasks, including program evo-
lution [4], program understanding [7, 4], and detecting and
diagnosing software bugs [4, 6, 12, 33, 13]. The only work
we know of that uses likely invariants for online error de-
tection comprises three recent papers on transient hardware
fault detection [22, 24, 3]. Racunas et al. and Dimitrov et
al. extract the invariants using online hardware monitoring
whereas Pattabiraman et al. use ahead-of-time monitoring of
program runs (similar to our work). In all the cases, how-
ever, they can only use their invariants for transient errors
because they do not have any mechanism to distinguish false
positives from true hardware failures. (Racunas et al. and
Dimitrov et al. flush the pipeline, Pattabiraman et al. do not
suggest any concrete solution for false positives.) In contrast,
we can handle both transient and permanent faults.
Meixner et al. in the Argus project have proposed the
use of a program dataflow checker, combined with control
flow signature checking, functional unit checkers, a memory
checker and parity on all data transfer and storage units to
handle a wide range of faults [16, 17]. Their dataflow graph
and control flow signatures conceptually are invariants that
are encoded by the compiler in the binary and checked by
the hardware. Unfortunately, the technique does not work
with interrupts, I/O, etc. because these affect the control
flow. Some parts of the Argus solution may also incur in-
ordinate performance overhead. Coverage data is reported
only for a synthetic microbenchmark, thus the effectiveness
of the technique for real programs is not clear [16]. Finally, the estimated area overhead for Argus is 17% of the core; a fault in this added hardware could itself lead to false positives. In contrast, we look at far cheaper detection techniques, combin-
ing software-extracted invariants with several other software
symptoms that can be observed at near-zero cost.
7. Conclusion and Future Work
Previously existing methods for detecting hardware faults
using software-level symptoms, such as SWAT [9], are very
promising because of their high coverage and low cost. Nev-
ertheless, these systems need additional detectors for achiev-
ing reliability levels that would be acceptable for most sys-
tems. In this work, we proposed and evaluated the first de-
sign (we know of) that uses likely program invariants for
detecting permanent faults. We used simple range-based
invariants on single variable values, in conjunction with
low-overhead symptom-based detection techniques already
available in the SWAT System. Our results show that likely
invariants can reduce the fraction of undetected errors from
4% to 2.8%, when used in conjunction with other symptom-
based detection techniques. Further, they reduce SDCs by
47% to 74.2%, which is important for any hardware fault
tolerance solution. We further showed that by leveraging the
diagnosis framework in SWAT, we could keep the overhead
caused by false positives to acceptable levels.
These range-based invariants form a first step towards us-
ing invariants to detect hardware faults. We are now inves-
tigating more sophisticated invariant schemes to further im-
prove the effectiveness of the iSWAT system. We also want
to monitor other program values and to design a strategy
to select the most effective values for monitoring to reduce
overhead. We would also like to evaluate the approach on
more benchmarks and real applications.
Acknowledgments
We would like to thank Robert Bocchino for many discus-
sions and help in writing.
References
[1] E. Borin, C. Wang, Y. Wu, and G. Araujo. Dynamic binary
control-flow errors detection. SIGARCH Comput. Archit.
News, 33(5), 2005.
[2] S. Borkar. Designing Reliable Systems from Unreliable
Components: The Challenges of Transistor Variability and
Degradation. IEEE Micro, 25(6), 2005.
[3] M. Dimitrov and H. Zhou. Unified architectural support for
soft-error protection or software bug detection. In Proc. Int’l
Conf. on Parallel Architectures and Compilation Techniques
(PACT), 2007.
[4] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin.
Dynamically discovering likely program invariants to support
program evolution. IEEE Trans. Software Eng., 2001.
[5] O. Goloubeva et al. Soft-Error Detection Using Control Flow
Assertions. In Proc. of 18th IEEE Intl. Symp. on Defect and
Fault Tolerance in VLSI Systems, 2003.
[6] S. Hangal and M. S. Lam. Tracking down software bugs
using automatic anomaly detection. In Proceedings of the
International Conference on Software Engineering, 2002.
[7] Y. Kataoka, M. D. Ernst, W. G. Griswold, and D. Notkin.
Automated support for program refactoring using invariants.
In IEEE Int’l Conf. on Software Maintenance (ICSM), 2001.
[8] C. Lattner and V. Adve. LLVM: A Compilation Framework
for Lifelong Program Analysis and Transformation. In Proc.
Int’l Symp. on Code Generation and Optimization, 2004.
[9] M. Li, P. Ramachandran, S. Sahoo, S. Adve, V. Adve, and
Y. Zhou. Understanding the Propagation of Hard Errors to
Software and Implications for Resilient System Design. In
Proc. Intl. Conf. on Architectural Support for Programming
Languages and Operating Systems(ASPLOS), 2008.
[10] M. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve,
and Y. Zhou. Trace Based Diagnosis of Permanent Hardware
Faults. In International Conference on Dependable Systems
and Networks, 2008.
[11] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug
isolation via remote program sampling. In Proc. of Conf. on
Programming Language Design and Implementation, 2003.
[12] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan.
Scalable statistical bug isolation. In Proc. of Conf. on
Programming Language Design and Implementation, 2005.
[13] S. Lu, J. Tucek, F. Qin, and Y. Zhou. Avio: detecting atomicity
violations via access interleaving invariants. In Proc. Int’l
Conf. on Architectural Support for Programming Languages
and Operating Systems (ASPLOS), 2006.
[14] M. Martin et al. Multifacet’s General Execution-Driven
Multiprocessor Simulator (GEMS) Toolset. SIGARCH
Computer Architecture News, 33(4), 2005.
[15] C. J. Mauer, M. D. Hill, and D. A. Wood. Full-System Timing-
First Simulation. SIGMETRICS Performance Evaluation Rev.,
30(1), 2002.
[16] A. Meixner, M. E. Bauer, and D. J. Sorin. Argus: Low-
cost, comprehensive error detection in simple cores. In Proc.
ACM/IEEE Int’l Symposium on Microarchitecture, 2007.
[17] A. Meixner and D. J. Sorin. Error detection using dynamic
dataflow verification. In Proc. Int’l Conf. on Parallel
Architectures and Compilation Techniques (PACT), 2007.
[18] J. Nakano et al. ReViveI/O: Efficient Handling of I/O in
Highly-Available Rollback-Recovery Servers. In Int’l Symp.
on High Performance Computer Architecture (HPCA), 2006.
[19] N. Nakka et al. An Architectural Framework for Detecting
Process Hangs/Crashes. In European Dependable Computing
Conference (EDCC), 2005.
[20] J. W. Nimmer and M. D. Ernst. Automatic generation of
program specifications. In Proc. ACM SIGSOFT Int’l Symp.
on Software Testing and Analysis, 2002.
[21] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow
checking by software signatures. IEEE Trans. on Reliability,
51, March 2002.
[22] K. Pattabiraman, G. P. Saggesse, D. Chen, Z. Kalbarczyk,
and R. Iyer. Dynamic derivation of application-specific error
detectors and their hardware implementation. In Proc. of
European Dependable Computing Conference (EDCC), 2006.
[23] M. Prvulovic et al. ReVive: Cost-Effective Architectural
Support for Rollback Recovery in Shared-Memory Multipro-
cessors. In Int’l Symp. on Computer Architecture (ISCA),
2002.
[24] P. Racunas et al. Perturbation-based Fault Screening. In
International Symposium on High Performance Computer
Architecture (HPCA), 2007.
[25] V. Reddy et al. Assertion-Based Microarchitecture Design
for Improved Fault Tolerance. In International Conference on
Computer Design, 2006.
[26] G. A. Reis et al. Software-Controlled Fault Tolerance. ACM
Transactions on Architecture and Code Optimization, 2(4), 2005.
[27] E. Rotenberg. AR-SMT: A Microarchitectural Approach
to Fault Tolerance in Microprocessors. In International
Symposium on Fault-Tolerant Computing (FTCS), 1999.
[28] D. Sorin et al. SafetyNet: Improving the Availability of Shared
Memory Multiprocessors with Global Checkpoint/Recovery.
In Int’l Symp. on Computer Architecture (ISCA), 2002.
[29] R. Vemu and J. A. Abraham. CEDA: Control-flow Error De-
tection through Assertions. In Intl. On-Line Test Symposium,
2006.
[30] R. Venkatasubramanian et al. Low-Cost On-Line Fault
Detection Using Control Flow Assertions. In International
On-Line Test Symposium, 2003.
[31] Virtutech. Simics Full System Simulator. Website, 2006.
http://www.simics.net.
[32] N. Wang and S. Patel. ReStore: Symptom-Based Soft
Error Detection in Microprocessors. IEEE Transactions on
Dependable and Secure Computing, 3(3), July-Sept 2006.
[33] P. Zhou, W. Liu, F. Long, S. Lu, F. Qin, Y. Zhou, S. Midkiff,
and J. Torrellas. Accmon: Automatically detecting memory-
related bugs via program counter-based invariants. In Proc.
ACM/IEEE Int’l Symposium on Microarchitecture, 2004.