Appears in Proceedings of the 38th International Conference on Dependable Systems and Networks (DSN), June 2008.
Using Likely Program Invariants to Detect Hardware Errors
Swarup Kumar Sahoo, Man-Lap Li, Pradeep Ramachandran,
Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu
Abstract
In the near future, hardware is expected to become increas-
ingly vulnerable to faults due to continuously decreasing
feature size. Software-level symptoms have previously been
used to detect permanent hardware faults. However, they cannot detect a small fraction of faults, which may lead to Silent Data Corruptions (SDCs). In this paper, we present a system
that uses invariants to improve the coverage and latency of
existing detection techniques for permanent faults. The ba-
sic idea is to use training inputs to create likely invariants
based on value ranges of selected program variables and
then use them to identify faults at runtime. Likely invariants,
however, can have false positives which makes them chal-
lenging to use for permanent faults. We use our on-line di-
agnosis framework for detecting false positives at runtime
and limit the number of false positives to keep the associ-
ated overhead minimal. Experimental results using micro-
architecture level fault injections in full-system simulation
show 28.6% reduction in the number of undetected faults
and 74.2% reduction in the number of SDCs over existing
techniques, with reasonable overhead for checking code.
1. Introduction
As CMOS feature sizes continue to decrease, hardware re-
liability is emerging as a major bottleneck to reap the ben-
efits of increasing transistor density in microprocessor de-
sign. Chips in the field are expected to see increasing failure
rates due to permanent, intermittent, and transient faults, in-
cluding wear-out, design defects, soft errors, and others [2].
The traditional approach in microprocessor design of pre-
senting an illusion of a failure-free hardware device to soft-
ware will become prohibitively expensive for commodity
systems. Traditional solutions such as dual modular redun-
dancy for tolerating hardware errors incur very high over-
heads in performance, area, and power. Recent hardware solutions such as variations on redundant multithreading improve on this, but still incur significant overheads [27].

(This work is supported in part by the Gigascale Systems Research Center (funded under FCRP, an SRC program), the National Science Foundation under Grants NSF CCF 05-41383, CNS 07-20743, and NGS 04-06351, an OpenSPARC Center of Excellence at the University of Illinois at Urbana-Champaign supported by Sun Microsystems, and an equipment donation from AMD.)
Recently, researchers have investigated using software-
visible symptoms to detect hardware errors [5, 9, 19, 24, 25,
26, 30, 32]. While much of that work focuses on transient
or intermittent faults that last a few cycles (e.g., 4 cycles
or less), we have explored using these symptoms to detect
permanent faults in hardware [9].
Using software-level symptoms to detect permanent faults
in hardware has several benefits over traditional hardware-
level solutions. First, using software-level symptoms deals
with only those errors that actually affect software correct-
ness. The rest of the faults are safely ignored, potentially
reducing the incurred overhead due to detection and recov-
ery. Second, the reliability targets for the system under consideration dictate the overheads the system can tolerate to achieve those targets. Because software-level symptom detectors are highly customizable, they facilitate seamlessly exploring these trade-offs between reliability and overhead.
We proposed a system design called SWAT, a firmware-
level low-overhead reliability solution that could potentially
handle multiple sources of hardware failures [9] using soft-
ware symptoms such as fatal hardware traps, software hangs,
abnormal application execution, and high OS activity. Im-
plementing these detectors in a thin firmware layer would
present significantly lower hardware cost than using tradi-
tional circuit-level hardware detectors. These detectors help
identify over 95% of hardware faults in many structures. Ad-
ditionally, 86% of these detected faults can be recovered using hardware checkpointing schemes, while all of the detected faults are software recoverable [9].
Nevertheless, using these simple symptoms as detectors
results in an SDC rate of 0.8% for permanent hardware faults
in the current SWAT system, which may not be acceptable
for most systems. This motivates the use of more sophisti-
cated detectors to further reduce this SDC rate and increase
detection coverage. In addition, using more sophisticated de-
tectors has the potential of reducing the detection latency of
the detected faults, making more faults amenable to hard-
ware recovery. Recovery through hardware checkpointing techniques, which can handle detection latencies of up to 100K cycles [28], is more attractive than recovery through software checkpointing techniques, as it facilitates seamless recovery of both the application and the OS in the event of a fault with much lower overhead.
In this work, we extend the set of symptom-level detec-
tors in SWAT to include program-level invariants that are
derived from program properties observed during program
execution. We use “likely program invariants” which have
been shown to be a powerful approach in detecting software
bugs [4, 6]. We derive likely program invariants by monitoring the execution of a program for different inputs and identifying program properties that hold on all such executions. (With simple compiler support, this "training" phase can be performed transparently during debugging runs for any program, and could even be extended into production runs with more sophisticated tools [11].)
A major drawback with using likely invariants for error de-
tection is that they may lead to false positives: some of the in-
ferred program invariants may be violated for an input as the
program behavior on that input is different compared with
the training inputs used to extract invariants. Hence, likely
program invariants have been proposed and used primarily
for analysis purposes such as program evolution [4], pro-
gram understanding [7], and detecting and diagnosing soft-
ware bugs [4, 6, 12, 33, 13]. The only exceptions have been
for detecting transient hardware faults, where a false positive
can be identified quickly and cheaply [22, 24, 3].
In this paper, we propose and evaluate a hardware-
assisted methodology to use likely invariants for detect-
ing permanent (or intermittent) hardware errors safely. The
SWAT system has a hardware-assisted diagnosis framework
and we adapt it to detect false positives at runtime. We also
limit the number of false positives in a novel way to keep the
associated overhead due to false positive detection low. Us-
ing the principles discussed above, we designed the iSWAT
framework for invariant detection and enforcement, and we
implemented it as an extension of the SWAT system [9].
The contributions of this work are:
- We demonstrate a new hardware-supported strategy for using unsound program invariants to detect permanent hardware errors. We believe this is the first work to use unsound invariants for such errors.
- We show that likely invariants can be extracted efficiently in software for realistic programs, unlike previous work which used only toy benchmark programs [22]. Furthermore, because of our tolerance for false positives, we only need 12 inputs for extracting our invariants while others have used hundreds of inputs [4, 22].
- We provide a realistic and comprehensive evaluation with full-system simulation by injecting faults into different micro-architectural structures. Such faults present more realistic fault scenarios than the previously studied application-level fault injections.
- The most important outcome from our experiments is that our technique reduces SDCs by 74.2%: fewer than 0.2% of all fault injections are now SDCs.
In more detail, our experimental results show that the
number of undetected faults in iSWAT decreases by nearly
28.6% compared with the base SWAT system. The number of SDCs reduces from 31 to 8 (i.e., a 74.2% reduction). The number of detections that are hardware recoverable (with latency less than 100K instructions) improves slightly, by 2%. The mean overhead of the invariant checking code is low: 14% on an UltraSPARC-IIIi machine and only 5% on an AMD Athlon machine. Moreover, this work is just a first
step using one simple style of invariants. These results show
that using likely invariants is a promising way to improve
overall reliability, at a low cost.
The rest of this paper is organized as follows. Section 2
provides a brief overview of likely invariants. In Section 3,
we describe the iSWAT System in detail, explaining how
we exploit the diagnosis module to detect false positives
caused by the invariants. Section 4 discusses the evaluation
methodology, the results of which are discussed and ana-
lyzed in Section 5. Related work is discussed in Section 6.
Section 7 draws conclusions and implications from our ex-
perience with the iSWAT framework and discusses future
work.
2. Invariant-Based Error Detection
In this section, we provide some background on likely pro-
gram invariants and then discuss the particular type of likely
invariants we use to detect permanent faults.
2.1 Likely Program Invariants
A program invariant at a particular program point P is a property that is guaranteed to hold at P on all executions of the program. Static analysis is the most common method to extract such sound invariants. A combination of an offline invariant-extraction pass and static analysis or theorem-proving techniques has also been suggested to extract sound invariants [20]. However, current techniques are not scalable enough to generate sound invariants for real programs [20]. Also, they cannot identify algorithm-specific properties that are not explicit in the code (e.g., some inputs are always positive).
Likely Program Invariants are properties involving pro-
gram values that hold on many executions on all observed
inputs and are expected to hold on other inputs. However,
they are unsound invariants which may not hold on some in-
puts. Extracting likely program invariants is easier than ex-
tracting sound invariants as we do not need expensive static
analysis methods to prove program properties and can iden-
tify algorithm specific properties. The extraction can be done
either online or offline. In the online version, invariants are extracted and used during program execution in the production
runs. Online extraction can present unacceptable overheads
to program execution, and may in fact be infeasible without
hardware support. The offline version, on the other hand, ex-
tracts invariants in a separate pass during program testing or
debugging, and these generated invariants can be used later
during the production runs. During the testing phases of soft-
ware development, the extra overhead of invariants extrac-
tion can be tolerated. This makes offline invariant extraction
a powerful method, allowing the use of more complex in-
variant mining techniques than would be feasible with on-
line methods. With compiler support, this “training” phase
can be done transparently at development time.
We can broadly classify likely program invariants into
three categories. Value-based invariants specify properties
involving only program values, and can be used for a va-
riety of tasks including software bug detection, program understanding, and program refactoring [4, 11, 7, 12, 6].
Control-flow-based invariants specify properties of the con-
trol flow of the program, and have been used previously to
detect control-flow errors due to transient faults [30, 29, 5].
PC-based invariants specify program properties involving
program counter values, and have been proposed for detect-
ing memory errors in programs during debugging [33].
2.2 Range-Based Invariants
One of our main goals in exploring the use of invariants to
detect permanent faults is to improve the coverage of detec-
tion and reduce the number of SDCs. Since SDCs are typ-
ically caused by erroneous values written to output, we ex-
plore the use of value-based invariants to detect permanent
faults. The other two classes of invariants can detect control-
flow or memory errors, which generally result in anomalous
software behavior that can be detected by the other detectors
in SWAT. For example, erroneous control-flow typically re-
sults in a crash which can be caught by the FatalTrap symp-
tom in SWAT [9]. In contrast, we expect value-based invari-
ants to capture deviations of values that do not result in any
significant change of program behavior to cause an applica-
tion or OS crash, but may still result in incorrect output.
As a first step towards using likely program invariants
for permanent hardware faults, we use a particular form of
value-based invariants known as range-based invariants. A
range-based invariant on a program variable xwill be of
the form [MIN, MAX], where MIN and MAX are constants
inferred from offline training such that M I N xM AX
is true for all the training runs.
These range-based invariants are suitable for error detec-
tion for various reasons. These types of invariants can be eas-
ily and efficiently generated by monitoring program values.
They are also composable: the invariants can be generated for each training input separately and then combined to generate invariants for the complete training set. These invariants are also much easier to enforce within the checking code than other forms of invariants, as they are simple and involve a single data value. In ongoing work, we are exploring a broader class of invariants.
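As a concrete illustration of the composability mentioned above, the following sketch (in C, with purely illustrative names and data; it is not the paper's actual tool) merges the per-input [MIN, MAX] ranges of each monitored store site by taking the element-wise minimum and maximum:

#include <float.h>
#include <stdio.h>

#define NUM_INVARIANTS 3                 /* one per monitored store site (illustrative) */

typedef struct { double min, max; } Range;

/* Widen the combined range so that it also covers the range seen on one training input. */
static void merge_range(Range *combined, const Range *per_input) {
    if (per_input->min < combined->min) combined->min = per_input->min;
    if (per_input->max > combined->max) combined->max = per_input->max;
}

int main(void) {
    /* Ranges observed for the same three store sites on two training inputs. */
    Range input_a[NUM_INVARIANTS] = { {0, 100}, {-5, 5}, {1, 1} };
    Range input_b[NUM_INVARIANTS] = { {10, 250}, {-2, 8}, {0, 1} };

    for (int i = 0; i < NUM_INVARIANTS; i++) {
        Range combined = { DBL_MAX, -DBL_MAX };   /* empty range before merging */
        merge_range(&combined, &input_a[i]);
        merge_range(&combined, &input_b[i]);
        printf("invariant %d: [%g, %g]\n", i, combined.min, combined.max);
    }
    return 0;
}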
3. The iSWAT Detection Framework
We implement the above described range-based invariants
as an extension to the existing SWAT System [9] to build
the iSWAT system that uses likely program invariants as an
additional software-level symptom to detect hardware faults.
3.1 Overview of SWAT System
The SWAT system uses low-overhead software-level symp-
toms to detect the presence of an underlying hardware fault, and exploits a firmware-assisted diagnosis and re-
covery module to recover the system from multiple sources
of faults [9]. While this paper targets permanent hardware
faults, our methods, similar to SWAT, also extend readily to
detect transient faults.
SWAT assumes a multi-core architecture under a single
fault model where a fault-free core is always available. The
system also assumes support for a checkpoint/rollback/re-
play mechanism and a firmware layer that lies between the
processor and the OS to monitor and control such mecha-
nisms.
Detection: SWAT uses four low-cost symptom-based detec-
tion mechanisms that require little new hardware or software
support. These mechanisms look for anomalous software ex-
ecution as symptoms of possible hardware faults. We briefly
describe them below; the details can be found in [9].
1. FatalTrap: Fatal hardware traps are those traps (caused
by either the application or the OS) that do not occur
during fault-free execution. In Solaris, examples of fatal traps include the RED (Reset, Error, and Debug) state trap and the Data Access Exception trap.
2. Abort-App: These indicate instances of a segmentation
fault or illegal operation, when the OS terminates the
application with a signal. In such cases, the OS informs
the detection framework that the application performed
an illegal operation, leading to a detectable symptom.
3. Hangs: Application and OS hangs are other common
symptoms of hardware faults [19]. SWAT uses a low
hardware-overhead heuristic hang detector, based on (of-
fline) application profiling to detect hangs with high fi-
delity.
4. High OS activity: In normal executions, on a typical invocation of the OS, control returns to the application after a few tens of OS instructions, except in cases such as timer interrupts or I/O system calls. As a symptom of abnormal behavior, SWAT looks for abnormally long contiguous runs of OS instructions to indicate the presence of an underlying fault (a sketch of such a check follows this list).
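As a rough sketch of how the last symptom could be checked (the threshold and interface are illustrative assumptions, not SWAT's actual firmware interface):

#include <stdbool.h>
#include <stdint.h>

#define OS_INSTR_THRESHOLD 30000   /* illustrative cutoff for "abnormally high" contiguous OS instructions */

static uint64_t contiguous_os_instrs = 0;

/* Called once per retired instruction with its privilege mode.
   Returns true when the high-OS-activity symptom should be raised. */
bool high_os_activity_symptom(bool in_os_mode, bool in_exempt_handler) {
    if (!in_os_mode || in_exempt_handler) {   /* application code, timer interrupts, I/O system calls */
        contiguous_os_instrs = 0;
        return false;
    }
    return ++contiguous_os_instrs > OS_INSTR_THRESHOLD;
}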
Diagnosis: After fault detection, the diagnosis framework is invoked to determine whether the source of the fault is a software bug, a transient fault, or a permanent fault. The diagnosis component rolls back to the last checkpoint and replays the execution on the original core. If the symptom does not recur, it infers a transient fault and continues execution. However, if the symptom recurs, it re-executes on another core (assumed to be fault-free) to distinguish software bugs (in which case the symptom would most likely recur on the fault-free core) from permanent hardware faults (in which case the symptom would not occur). The replay may need to be performed multiple times on the two cores to distinguish non-deterministic software errors. In the case
of a permanent fault, the diagnosis module also does micro-
architecture level diagnosis to identify the faulty microarchi-
tectural structure [10].
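The diagnosis decision procedure can be summarized by the following sketch (a simplification of the flow described above; the helper functions are placeholders for firmware-controlled actions, and in practice the replays may be repeated to handle non-determinism):

typedef enum { TRANSIENT_FAULT, SOFTWARE_BUG, PERMANENT_FAULT } Diagnosis;

extern int replay_on_original_core(void);    /* nonzero if the symptom recurs after rollback */
extern int replay_on_spare_core(void);       /* the spare core is assumed to be fault-free */
extern void microarch_level_diagnosis(void); /* identify the faulty microarchitectural structure [10] */

Diagnosis diagnose_symptom(void) {
    if (!replay_on_original_core())
        return TRANSIENT_FAULT;     /* symptom gone after rollback on the same core */
    if (replay_on_spare_core())
        return SOFTWARE_BUG;        /* symptom recurs even on a fault-free core */
    microarch_level_diagnosis();    /* symptom recurs only on the original core */
    return PERMANENT_FAULT;
}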
Recovery and Repair: For recovery, SWAT assumes some
form of checkpoint/restart mechanism that periodically
checkpoints the state of the system. Depending upon the re-
quirements, appropriate hardware checkpointing or software
checkpointing mechanisms, or a combination of both, can
be used to recover the system after detection. SafetyNet [28]
and Revive [18, 23] show reasonably low overhead for hard-
ware checkpoint/replay. In the event of a permanent hard-
ware fault, the component that is diagnosed as faulty can be
reconfigured or disabled.
3.2 The iSWAT system
The iSWAT system extends the above described SWAT sys-
tem to include the violation of likely program invariants as
possible symptoms that indicate the presence of a hardware
fault. These invariants are derived from “training” runs of
the application and the invariant checking code is embed-
ded into the application. iSWAT exploits the diagnosis and
recovery module of the SWAT system to detect and disable
false-positive invariants at run-time.
3.2.1 Generating Invariants and Invariant Checks
The iSWAT system leverages support from the compiler for
two distinct components: invariant generation and invariant
insertion. Both of these use the LLVM compiler infrastruc-
ture [8].
Invariant Generation: We use compile-time instrumenta-
tion to monitor program values during training runs in or-
der to generate likely program invariants. We can monitor
many different types of values, including load, store, return
and intermediate result values. For this work, we decided to
monitor only the store values as checking values stored to
memory has the most potential to catch faults, as all nec-
essary computations eventually pass their results to stores.
Also, monitoring only the stores helps us keep the overhead
of detection low. We monitor stored values of all integer
types (both signed and unsigned) of size 2, 4, and 8 bytes
as well as single and double precision floating point types.
We do not monitor integer stores of size 1 byte (character
data types), as they represent only a small range of values
and hence may be ineffective for detecting faults.
We do the code generation for invariant generation in two
steps. In the first step, we use LLVM-1.9 with llvm-gcc-3.4
to generate the LLVM bytecode and run an instrumentation
pass to insert calls to monitor the store values. Then we gen-
erate a C program from the LLVM bytecode file through the
LLVM C back end. Finally, we generate SPARC native code
through the Sun cc compiler. We use the generated program
to create the invariants for each input separately, which we
then combine to form the final invariants in another offline
pass.
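A minimal sketch of the runtime routine that such an instrumentation pass could call before each monitored store is shown below (the table layout, size limit, and function names are assumptions for illustration and are not the paper's actual LLVM pass; integer and floating-point values are folded into a double here for brevity):

#include <float.h>
#include <stdio.h>

#define MAX_INVARIANTS 4096          /* illustrative bound on static store sites */

static double inv_min[MAX_INVARIANTS];
static double inv_max[MAX_INVARIANTS];

void monitor_init(void) {
    for (int i = 0; i < MAX_INVARIANTS; i++) {
        inv_min[i] = DBL_MAX;        /* start with an empty range */
        inv_max[i] = -DBL_MAX;
    }
}

/* Inserted before each monitored store; inv_id identifies the static store instruction. */
void monitor_store(int inv_id, double value) {
    if (value < inv_min[inv_id]) inv_min[inv_id] = value;
    if (value > inv_max[inv_id]) inv_max[inv_id] = value;
}

/* Dump the per-input ranges at program exit; an offline pass later merges them. */
void monitor_dump(const char *path) {
    FILE *f = fopen(path, "w");
    if (!f) return;
    for (int i = 0; i < MAX_INVARIANTS; i++)
        if (inv_min[i] <= inv_max[i])
            fprintf(f, "%d %g %g\n", i, inv_min[i], inv_max[i]);
    fclose(f);
}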
Invariant Insertion: The invariants generated by the offline
pass then need to be inserted into the code to check the val-
ues being stored. For this, we take the invariant ranges from
the generation phase and then insert calls to the invariant
checking code at the LLVM byte-code level through another
compile-time instrumentation pass. Then, as in the invariant generation phase, we generate a C program from the LLVM bytecode through the LLVM C back end and, finally, use the Sun cc compiler to generate native code.
3.2.2 Handling False-Positive Invariants
False positives are a major shortcoming of likely program invariants, as detecting them may incur high overheads in the presence of permanent faults. For transient hardware faults, relatively low-cost techniques such as a pipeline flush can help tolerate false positives [24]: if the violation occurs again after the pipeline flush, the invariant is updated, and this is done cheaply using hardware support. In contrast, for
a framework like iSWAT System that supports permanent
fault detection, more expensive rollback and replay on the
same and a fault-free core is needed to detect false positives.
Too many false positive detections may thus lead to exorbi-
tant overheads.
While training with many inputs can potentially make false positives rare, the ranges may become too broad, rendering the invariants ineffective for error detection. In our
framework, we propose to train with a set of inputs such that
the false positive rate is sufficiently low. In general, the num-
ber of false positives can be used to guide how many inputs
to use for training.
iSWAT leverages the existing diagnosis framework in the SWAT system [10] to detect the remaining false positives at run-time. In the event of an invariant violation, we perform the full rollback and replay on the same core and on a fault-free core. If the violation occurs even on the fault-free core, then it is a false-positive invariant (or a software bug). Since too many rollback/replays due to false positives can cause large overheads, we disable a static invariant once it results in a false positive during dynamic execution (online updating of invariants would add too much overhead for our software-only technique). In this way, if the total number of static invariants is I, the maximum number of rollbacks possible will also be I, limiting the overhead incurred due to false positives.
Currently, the disable operation is done inside the invariant checking code and does not need any extra hardware support. We maintain a table with one entry per invariant, indexed by the invariant id, to identify false-positive invariants. If an invariant is detected as a false positive, its table entry is marked in the invariant checking code, disabling that invariant in all later executions. Figure 1 shows a template of the actual invariant checking code.

if ((value < min) or (value > max)) {    // Invariant violated
  if (FalsePosArray[InvId] != true) {    // Not a false positive
    if (detectFalsePos(InvId) == true)   // Perform diagnosis
      FalsePosArray[InvId] = true;       // Disable the invariant
  }
}

Figure 1. Invariant checking code template
The overhead caused by false positives in an invariant-
based approach depends on the number of rollbacks, which
in turn depends on the number of static false positive invari-
ants. The false positive results, presented in Section 5, indi-
cate that the overhead for our set of applications is negligible
compared to their runtimes.
4. Methodology
4.1 Simulation Environment
For the fault injection experiments, we used a full-system simulation environment comprising the Virtutech Simics full-system simulator [31] with the Wisconsin GEMS timing models for the microarchitecture and the memory [14], as in [9]. These simulators provide cycle-accurate microarchitecture-level timing simulation of real workloads running on a real operating system (full Solaris-9 on the SPARC V9 ISA) on a modern out-of-order superscalar processor and memory hierarchy (Table 1).
We exploit the timing-first approach of the GEMS+Simics
infrastructure [15] to inject microarchitecture-level faults. In this approach, GEMS and Simics compare their full architectural states after each instruction, and in case of a mismatch the GEMS state is updated with the Simics state. We leveraged this checking mechanism for our fault injection. Faults are injected into GEMS's micro-architectural state and allowed to propagate. After a mismatch between Simics and GEMS, if the mismatch is found to be caused by the injected fault, we copy the faulty architectural state from GEMS to Simics to make sure Simics follows the same corrupted execution path as GEMS. Otherwise, the GEMS state is updated as usual. When we find that the architectural state is corrupted, we say that the fault has been activated.
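The per-instruction comparison described above can be sketched as follows (a simplification; the state type and helper functions are placeholders, not the actual GEMS/Simics interfaces):

#include <stdbool.h>

typedef struct ArchState ArchState;   /* opaque architectural register/memory state */

extern bool states_match(const ArchState *gems, const ArchState *simics);
extern bool mismatch_caused_by_injected_fault(const ArchState *gems, const ArchState *simics);
extern void copy_state(ArchState *dst, const ArchState *src);

static bool fault_activated = false;

/* Called after each retired instruction in the timing-first simulation. */
void check_after_instruction(ArchState *gems, ArchState *simics) {
    if (states_match(gems, simics))
        return;
    if (mismatch_caused_by_injected_fault(gems, simics)) {
        copy_state(simics, gems);      /* make Simics follow the corrupted execution path */
        fault_activated = true;        /* the fault has corrupted architectural state */
    } else {
        copy_state(gems, simics);      /* ordinary timing-first correction of GEMS */
    }
}

bool fault_was_activated(void) { return fault_activated; }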
4.2 Fault Model
In the current work, we focus on permanent or hard faults.
The well-established stuck-at-0 and stuck-at-1 fault models, as well as the dominant-0 and dominant-1 bridging fault models, are used for modeling permanent hardware faults in this paper. The stuck-at fault models model a fault in a single bit, while the bridging fault models model faults that affect adjacent bits. The dominant-0 bridging fault acts like a logical-AND operation between the adjacent bits that are marked faulty, whereas the dominant-1 bridging fault acts like a logical-OR operation.
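The following sketch shows how the four fault models can be applied to a latch value (a hypothetical helper for illustration, not the simulator's code; it assumes the faulty bit position is given and, for the bridging models, that the adjacent bit is the next higher bit):

#include <stdint.h>

typedef enum { STUCK_AT_0, STUCK_AT_1, DOMINANT_0, DOMINANT_1 } FaultModel;

/* Apply a fault model to 'value' at bit position 'bit' (0 = least significant).
   For the bridging models, the faulty bit is combined with its neighbor at bit+1,
   so 'bit' is assumed to be below the most significant bit position. */
uint64_t apply_fault(uint64_t value, int bit, FaultModel model) {
    uint64_t b  = (value >> bit) & 1u;
    uint64_t nb = (value >> (bit + 1)) & 1u;   /* adjacent bit for bridging faults */
    uint64_t faulty;

    switch (model) {
    case STUCK_AT_0: faulty = 0;      break;
    case STUCK_AT_1: faulty = 1;      break;
    case DOMINANT_0: faulty = b & nb; break;   /* logical AND with the adjacent bit */
    case DOMINANT_1: faulty = b | nb; break;   /* logical OR with the adjacent bit */
    default:         faulty = b;      break;
    }
    return (value & ~(1ULL << bit)) | (faulty << bit);
}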
Base Processor Parameters
Frequency 2.0GHz
Fetch/decode/execute/retire rate 4 per cycle
Functional units 2 Int add/mult, 1 Int div
2 Load, 2 Store, 1 Branch
2 FP add, 1 FP mult, 1 FP div/Sqrt
Integer FU latencies 1 add, 4 multiply, 24 divide
FP FU latencies 4 default, 7 multiply, 12 divide
Reorder buffer size 128
Register file size 256 integer, 256 FP
Unified Load-Store Queue Size 64 entries
Base Memory Hierarchy Parameters
Data L1/Instruction L1 16KB each
L1 hit latency 1 cycle
L2 (Unified) 1MB
L2 hit/miss latency 6/80 cycles
Table 1. Parameters of the simulated processor
Microarchitecture structure Fault location
Instruction decoder Input latch of one of the decoders
Integer ALU Output latch of one of the integer ALUs
Register bus Bus on the write port to the register file
Physical integer register file A physical reg in the integer register file
Reorder buffer (ROB) Source/dest reg num of instr in ROB entry
Register alias table (RAT) Logical-to-physical map of a logical register
Address gen unit (AGEN) Virtual address generated by the unit
FP ALU Output latch of one of the FP ALUs
Table 2. Microarchitectural structures in which faults are in-
jected.
The microarchitectural structures and locations where the
faults are injected are listed in Table 2. For each structure, a fault is injected at each of 40 random points in each application (after the initialization phase of each application is over). For each application injection point, we perform an injection for each of the 4 fault models (two stuck-at and two bridging faults). The injections are performed on a randomly chosen bit in the given structure. This gives a total of 800 fault injection simulation runs per microarchitectural structure (5 applications × 40 points per application × 4 fault models) and 6,400 total injections across all 8 structures.
After a fault is injected, we run the simulation for 10
million instructions. Note that the fault is maintained for the
rest of the 10M instruction window. For the small number of runs where an activated fault is not detected within this window, we use functional (full-system) simulation to run the application to completion (as detailed simulation is too slow to run to completion) in order to evaluate masking due to the application, and SDCs. The functional simulation does not inject any faults beyond the first 10M instructions, resulting in the fault acting like an intermittent fault that is active only in the 10 million instruction window. We believe that 10M instructions is long enough that the simulation reflects the behavior of permanent faults.
4.3 Fault Detection Techniques Used
We show the effectiveness of our invariant-based approach
by evaluating invariants in conjunction with the four low-
cost detection mechanisms built into the base SWAT system.
This is more realistic than studying only detections by invariants, as the other techniques have lower overhead (and need very little hardware/software support), and it shows the impact of the new technique in a realistic system.
4.4 Fault Metrics
When the fault causes a corruption in the architectural state
of the processor, we say it is activated. If the fault is never
activated, we say the fault is architecturally masked. An
activated fault which is undetected, but does not cause any
corruption in the output produced by the application is said
to be application masked.
We used five metrics to evaluate the impact of the new
detection technique:
1. Coverage: The percentage of non-masked faults that are detected in the 10M instruction window. We refer to the percentage of non-masked faults that remain undetected as the unknown-fraction (see the formulas after this list).
2. Latency: The total number of instructions retired from
the first architecture state corruption (of either OS or
application) until the fault is detected by one of the above
techniques.
3. SDCs: The number of undetected faults that silently corrupt the output of the application.
4. False positives: The total number of false positive invari-
ants.
5. Overhead: The overhead of the invariant checking code
as a percentage of original execution time, measured in
fault-free run.
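Restating the first metric explicitly (our formulation of the definitions above, where "masked" covers both architecturally masked and application-masked faults):

    Coverage = detected faults / (total injected faults − masked faults) × 100%
    unknown-fraction = 100% − Coverage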
4.5 Applications
For the experiments, we used five SPEC CPU 2000 benchmarks: four SPECint benchmarks (gzip, bzip2, mcf, parser) and one SPECfp benchmark (art). For most of the other SPEC benchmarks, we could not collect sufficient training inputs, while we could not compile and run the remainder in our simulator.
Previous work on invariants uses toy Siemens bench-
marks because many inputs are available for these bench-
marks [4, 22]. We use more realistic applications which
makes it much harder for us to obtain valid inputs for ex-
periments. Nevertheless, obtaining inputs will not be a prob-
lem in practice as developers test their programs on many
inputs during the testing phase. Invariant generation and in-
sertion can be easily done during testing through a compile-
time pass. The “test” and “train” input sets formed part of
our training set. Different techniques were used to generate
more inputs depending on the benchmarks. For three bench-
marks (gzip, bzip2, and parser), we collected random inputs from external sources. For mcf, a script was used to generate random inputs, while for art, different input options were used to generate invariants. Since the inputs were predominantly generated randomly, the inputs used for training were significantly different from the reference inputs, which we used for testing the false positives, coverage, latency, etc.

Figure 2. Variation of the false positive rate with different numbers of training inputs. The rate is <5% with 12 training inputs, motivating the use of 12 inputs for the rest of our experiments.
5. Experimental Results
In this section, we present our experimental results evaluating the effectiveness of using likely program invariants to detect permanent hardware faults. All the injected FPU faults were architecturally masked in all the applications except the one floating point benchmark (art). We have therefore excluded the FPU results from the reported results, except as otherwise noted.
We subject the same application binary (instrumented
with invariant detection) to faults under both the SWAT and
the iSWAT systems. We use the same binary in both cases to
obtain a valid coverage comparison between the two cases
as the behavior of faults (i.e., whether they are masked, or
detected, or become SDCs) depends on both static code lay-
out and dynamic instruction sequence. In the SWAT sys-
tem, since invariants are not monitored, the system ignores
the violation of any invariants and continues execution. The
iSWAT framework, on the other hand, invokes the diagnosis module on an invariant violation to determine whether it is a false positive. If a false positive is detected, execution simply continues (and the invariant is disabled in the code); otherwise, a fault is detected. (In a real system, iSWAT should check for false positives on every invariant violation by invoking the rollback/recovery in the diagnosis module. However, since we have the ref input available, we currently identify the false-positive invariants using an offline fault-free run, and during the faulty run the diagnosis module uses that information to detect false positives. In this way, we effectively mimic a real system.)
5.1 False Positives
We first evaluate the effect of training with different training
sets on the number of false positives.
We define false positive rate to be the fraction of false
positive invariants as a percentage of total number of static
invariants. Figure 2 shows the variation of the false positive rate for our five applications running on the ref input, as the number of training inputs is increased from 2 to 12.
As expected, false positive rate decreases as the number
of inputs increases. By 12 inputs, the rate of false positives is
less than 5% for all applications and 0% for three. This false
positive rate is sufficiently low for our purpose, motivating
us to use 12 training inputs for all of our experiments. In previous work using the Siemens benchmarks [4, 22], hundreds of inputs were used for training. We find that a much smaller number of training inputs suffices for permanent fault detection with our approach, as our techniques can tolerate more false positives.
The maximum number of static invariants across all applications was 231. Assuming each false positive detection has an overhead of 1M instructions (conservatively computed considering overheads due to checkpoint/replay and context migration), the maximum overhead of false positive detection on any input will be only 231M instructions, which is negligible compared to the application runtimes. In practice, the overhead will be even lower due to the low false positive rates.
Interestingly, Figure 2 shows that after just four inputs, fewer than 10% of the invariants are false positives for four of the applications. These results show that likely invariants
generated from many inputs will have sufficiently few false
positives to be usable for permanent fault detection.
5.2 Coverage
Here, we present the improvements in fault coverage achieved
by the iSWAT system (using 12 inputs for training in-
variants) over the SWAT system, evaluated using micro-
architecture-level fault injections.
Table 3 presents the improvements offered by iSWAT
over the baseline SWAT system to detect permanent hard-
ware faults. Each column shows the number of fault in-
jections that result in different outcomes (both the absolute
number and as a percentage of total number of fault injec-
tions) and the last column shows the unknown-fraction. The
first two columns represent faults that are masked by the
architecture (Arch-Mask), and the application (App-Mask).
The Unknown column represents the fraction of faults that
are not detected within 10M instructions in each of the sys-
tems. The rest of the columns represent faults that are de-
tected by each of the detection mechanisms in 10M in-
structions, using the detection methods described previously
(Section 3). We do not show Abort-App in the table, as it did not detect any faults.
Symptoms    App-Mask    Arch-Mask   Fatal-Trap-App  Fatal-Trap-OS  Hang-App   Hang-OS    INV         High-OS     Unknown     Unknown-fraction
SWAT (%)    293 (5.2%)  1090 (20%)  1252 (22%)      1421 (25%)     47 (0.8%)  16 (0.3%)  -           1305 (23%)  168 (3.0%)  (4.0%)
iSWAT (%)   288 (5.2%)  1090 (20%)  1187 (21%)      1357 (24%)     29 (0.5%)  15 (0.3%)  325 (5.8%)  1181 (21%)  120 (2.1%)  (2.8%)

Table 3. Improvement in coverage of iSWAT over SWAT for permanent faults. The percentages are computed using the total number of fault injections as the baseline. Invariants are effective in catching faults that escape the traditional detection techniques in SWAT, and sometimes catch the same faults earlier, resulting in a reduced 2.8% unknown-fraction compared to 4% for the SWAT system.

Three points can be observed from this table. First, invariant detection catches nearly 5.8% of the total fault injections. Second, invariant detection catches some faults that are not detected by the traditional symptoms and resulted in unknowns in SWAT, giving a 28.6% reduction in unknown cases, from 168 in the SWAT system to 120 in iSWAT. Third, the iSWAT invariants detect some faults (about 5% of total fault injections) that are caught by the other symptoms in SWAT, but at a lower latency; the number of detections by the other symptoms in iSWAT is therefore lower than in SWAT. This result leads to a small improvement in detection latency, as we show in Section 5.4. The overall coverage of the iSWAT system is 97.2%.

Microarchitecture structure      SWAT    iSWAT   Reduction
Instruction decoder              0.7%    0.6%    16.7%
Integer ALU                      7.8%    6.2%    20.5%
Register bus                     4.97%   2.6%    48.3%
Physical integer register file   12.8%   8.5%    33.7%
Reorder buffer (ROB)             0.9%    0.9%    0.0%
Register alias table (RAT)       2.0%    2.2%    -9.6%
Address gen unit (AGEN)          2.4%    1.3%    46.1%
Total                            4.0%    2.8%    28.7%

Table 4. Reduction in the Unknown category for each microarchitectural structure.

           Unknown   Seg fault   Other signals   No output   SDC
SWAT       168       102         7               28          31
iSWAT      120       85          3               24          8

Table 5. Breakdown of the Unknown category after the completion runs. The "No output" category includes OS hangs, application hangs, and OS crashes.
Detection using Invariants: In order to understand the ef-
fectiveness of these invariants to detect faults in different
micro-architectural structures, we categorize the unknowns
in the two systems in Table 4. For each structure injected
with faults, the table shows the corresponding percentage of
non-masked faults that result in unknowns in the SWAT, and
iSWAT system, along with the percentage reduction in the
unknowns. The “Total” row shows the aggregate numbers.
These results show that invariants are most effective for
detecting faults in the integer ALU, register databus, integer
register, and AGEN units. These correspond to faults that
affect store values, without significantly perturbing the con-
trol and data flow. Invariants are not effective for the decode,
ROB, and RAT units. Faults in these units perturb program
control flow, and do not directly affect values that the invari-
ants monitor. Faults in these structures are also very likely
to cause the invariant checking itself to be done incorrectly. Fortunately, there are very few remaining unknown cases in these units. Faults in the RAT show an increased unknown rate in the iSWAT system because some faults that are masked by the application in SWAT are detected by the invariants in
iSWAT. These are real hardware faults which affect program
values, but are masked at the application level.
Invariants detect all the unknown cases for FPU faults. Thus, the overall unknown-fraction decreases from 4.2% to 2.8% if we include the FPU. However, more floating
point applications are needed to draw any conclusions.
5.3 SDCs
A small fraction of faults still result in unknown outcomes in the iSWAT system (2.8% of the non-masked faults) after 10M instructions. After 10M instructions of detailed timing simulation, we ran the unknown cases (for all structures
but the FPU) to completion in functional simulation mode
to evaluate how many of the unknown cases result in SDCs.
In this mode, faults are not injected during execution due
to lack of micro-architectural details in the functional sim-
ulator. Also, in functional mode, invariant checks are not
enforced by the iSWAT system as we do not have the diag-
nosis framework support to detect false positives caused by
invariants. Hence, our reported SDC numbers are conserva-
tive estimates of realistic SDC numbers.
We refer to the cases which result in the same output as App-Mask and to the rest of the cases as unknown. Ta-
ble 5 shows the breakdown of the total number of unknown
cases according to the results after completion. The next two
columns show segmentation faults and application termina-
tions due to other signals. Executions that produce no output
due to an application hang, an OS hang or OS Crash (indi-
cated by timing out the execution after a long duration) fall
under the No output category. Finally, the cases that result in
undetected faults that corrupt application outputs are shown
under the SDC category.
Overall, the number of SDCs in the iSWAT system is significantly lower than in SWAT. The invariants reduce the SDCs by 74%, from 31 to 8. We consider the reduction in SDCs
as the most important contribution of the invariants. Though
a few SDCs remain, we believe that more sophisticated in-
variants can make the SDC cases negligible. The number of cases detected through the other categories also decreases by 27 in iSWAT; these correspond to faults detected by invariants before the application or OS could crash through a signal or hang.
Analysis of SDCs: To do an in-depth analysis of why invariants do not detect some of the SDC cases, we moved the invariant checks into the simulator. In this way, we can observe the monitored values and other information, which is not possible when the checks are in the application code.
To move the invariant checks into the simulator, we perform an instrumentation pass that stores the monitored invariant values to known memory locations. The simulator reads the invariant ranges from a file. When it finds a store to one of the known memory locations, it can determine the corresponding invariant from the memory address and perform the bounds checking.
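A sketch of the simulator-side checking logic just described (the structures, sizes, and hook name are illustrative, not Simics/GEMS interfaces; loading the table from the invariant file is omitted):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;      /* known memory location written by the instrumented store */
    double   min, max;  /* invariant range read from the file */
} SimInvariant;

#define NUM_SIM_INVARIANTS 231          /* illustrative; maximum static invariants in the paper */
static SimInvariant sim_inv[NUM_SIM_INVARIANTS];

/* Called by the simulator for every committed store. Returns true if an invariant
   bound to this address is violated by the stored value. */
bool check_store(uint64_t addr, double value) {
    for (int i = 0; i < NUM_SIM_INVARIANTS; i++) {
        if (sim_inv[i].addr == addr)
            return value < sim_inv[i].min || value > sim_inv[i].max;
    }
    return false;                       /* not a monitored location */
}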
Table 6 shows the key results when the checks are done in the simulator. Overall, there is a 35% reduction in unknown cases and a 47% reduction in SDCs. We observe a smaller reduction in SDCs compared to Table 5. Thus, the SDC results appear to be sensitive to static code layout and dynamic instruction sequence, as these determine the instructions at which faults are injected and how the faults affect the architectural state.

         Unknown   SDC
SWAT     164       32
iSWAT    106       17

Table 6. Results when invariant checking is done in the simulator.
We analyzed the remaining 17 SDC cases by running both
the correct runs and fault injection runs and comparing the
monitored values. We made some interesting observations:
- In three cases out of 17 (all in mcf), very few invariants are checked after the arch-state mismatch. In fact, in one of these cases, the faulty run has far fewer checks than the correct run, as the control flow diverges completely to a part of the application with far fewer checks. In the other two cases, the correct run also has very few invariant checks, just like the faulty run.
- In two cases (in mcf and gzip) there are no mismatches of the monitored values inside the 10M window. iSWAT cannot detect them, as there are no checks after the 10M window.
- In most cases, control flow does not diverge significantly: it diverges for a short period and then merges back.
- Irrespective of whether the fault injection is in higher or lower order bits, almost all the value mismatches in the SDC cases were in lower order bits. So, simple range-based or control-flow invariants are unlikely to be effective for these cases. In fact, most of the value mismatches were in the lowest 3 bits. We will need other types of invariants/detectors for detecting mismatches in lower order bits.
5.4 Latency
Table 7 shows latency results for the faults detected by the
SWAT and the iSWAT systems, binned into various cate-
gories from under 1k instructions to under 10M instructions.
In order to perform a fair comparison, the numbers are presented as a percentage of the total number of faults detected by iSWAT (i.e., the number of detections in the <10M case).
The number of faults detected at a latency of under 1k
instructions shows the largest increase of about 2% (the rest
of the numbers are cumulative). This shows that the latency
of detection of invariants is significantly lower than that of
the other symptoms. This increases the number of faults
that are amenable to simple hardware recovery. Although
the latency benefits offered by iSWAT are not substantial
so far, using more sophisticated invariants may improve the effectiveness of iSWAT to reduce the latency.

Latencies   <1k     <5k     <10k    <50k    <100k   <500k   <1M     <5M     <10M
SWAT        41.1%   47.0%   50.7%   78.7%   81.0%   87.0%   90.3%   95.7%   98.7%
iSWAT       43.1%   49.6%   53.4%   81.2%   83.3%   89.2%   92.7%   97.7%   100.0%

Table 7. Detection latencies for SWAT and iSWAT. The percentages are computed using the number of detections in iSWAT with <10M as the baseline. The invariants increase the faults amenable to hardware recovery by 2%.

Figure 3. Overhead of invariants on an UltraSPARC-IIIi (sparc) machine and an AMD Athlon machine (x86).
5.5 Overhead
We evaluate the overhead of using invariants by running
the binary (with invariant checking) on fault-free hardware, using two machines: a Sun UltraSPARC-IIIi 1.2GHz machine with a 1MB unified L2 and 2GB RAM, and an AMD Athlon(TM) dual-core MP 2100+ machine with a 256KB L2 and 1.5GB RAM. The Sun machine is referred to as the Sparc machine, and the AMD one as the x86 machine in this section.
Figure 3 shows the overhead of using invariant checking in the programs as a percentage over the baseline program, which has no invariant checking. The geometric mean of
the overheads is also shown for the two machines.
The Sparc machine exhibits a higher overhead when running the invariant-checking code than the x86 machine, with the average overheads being 14% and 5%, respectively. In particular,
the overhead for the application mcf is significantly higher
in the Sparc machine (26%) than the x86 machine (2%). The
high overhead of the Sparc machine is likely due to its in-
ability to hide the cache misses and branch mispredictions
induced by these extra invariants. The x86 machine is able
to hide these latencies better, resulting in lower overheads.
In spite of these differences, the overheads produced by these invariant checks are acceptable given the increased coverage they provide, motivating the use of the iSWAT system for increased resilience.
6. Related Work
There is a growing body of work on using software-visible
symptoms to detect hardware errors. A number of papers
propose the use of control path signatures to detect control-
flow errors [1, 5, 21, 29, 30]. Wang and Patel propose us-
ing branch mispredictions, cache misses, and exceptions as
symptoms of faults [32]. Most of this work focuses on tran-
sient faults or intermittent faults, and does not handle perma-
nent faults (the exceptions are discussed below). Permanent
faults are important because of the expected increase in phe-
nomena such as wear-out, insufficient burn-in, and design
defects [2]. In our previous work [9], we used simple soft-
ware symptoms to detect both permanent as well as transient
faults and we extend that system in this paper.
Dynamically detected program invariants (likely invari-
ants [4]), which are inherently unsound, have been studied
for a wide range of analysis tasks, including program evo-
lution [4], program understanding [7, 4], and detecting and
diagnosing software bugs [4, 6, 12, 33, 13]. The only work
we know of that uses likely invariants for online error de-
tection comprises three recent papers on transient hardware
fault detection [22, 24, 3]. Racunas et al. and Dimitrov et
al. extract the invariants using online hardware monitoring
whereas Pattabiraman et al. use ahead-of-time monitoring of
program runs (similar to our work). In all the cases, how-
ever, they can only use their invariants for transient errors
because they do not have any mechanism to distinguish false
positives from true hardware failures. (Racunas et al. and
Dimitrov et al. flush the pipeline, Pattabiraman et al. do not
suggest any concrete solution for false positives.) In contrast,
we can handle both transient and permanent faults.
Meixner et al. in the Argus project have proposed the
use of a program dataflow checker, combined with control
flow signature checking, functional unit checkers, a memory
checker and parity on all data transfer and storage units to
handle a wide range of faults [16, 17]. Their dataflow graph
and control flow signatures conceptually are invariants that
are encoded by the compiler in the binary and checked by
the hardware. Unfortunately, the technique does not work
with interrupts, I/O, etc. because these affect the control
flow. Some parts of the Argus solution may also incur in-
ordinate performance overhead. Coverage data is reported
only for a synthetic microbenchmark, thus the effectiveness
of the technique for real programs is not clear [16]. Finally, the estimated area overhead for Argus is 17% of the core; a fault in this added hardware could itself lead to false positives. In contrast, we look at far cheaper detection techniques, combin-
ing software-extracted invariants with several other software
symptoms that can be observed at near-zero cost.
7. Conclusion and Future Work
Previously existing methods for detecting hardware faults
using software-level symptoms, such as SWAT [9], are very
promising because of their high coverage and low cost. Nev-
ertheless, these systems need additional detectors for achiev-
ing reliability levels that would be acceptable for most sys-
tems. In this work, we proposed and evaluated the first de-
sign (we know of) that uses likely program invariants for
detecting permanent faults. We used simple range-based
invariants on single variable values, in conjunction with
low-overhead symptom-based detection techniques already
available in the SWAT System. Our results show that likely
invariants can reduce the fraction of undetected errors from
4% to 2.8%, when used in conjunction with other symptom-
based detection techniques. Further, they reduce SDCs by
47% to 74.2%, which is important for any hardware fault
tolerance solution. We further showed that by leveraging the
diagnosis framework in SWAT, we could keep the overhead
caused by false positives to acceptable levels.
These range-based invariants form a first step towards us-
ing invariants to detect hardware faults. We are now inves-
tigating more sophisticated invariant schemes to further im-
prove the effectiveness of the iSWAT system. We also want
to monitor other program values and to design a strategy
to select the most effective values for monitoring to reduce
overhead. We would also like to evaluate the approach on
more benchmarks and real applications.
Acknowledgments
We would like to thank Robert Bocchino for many discus-
sions and help in writing.
References
[1] E. Borin, C. Wang, Y. Wu, and G. Araujo. Dynamic binary
control-flow errors detection. SIGARCH Comput. Archit.
News, 33(5), 2005.
[2] S. Borkar. Designing Reliable Systems from Unreliable
Components: The Challenges of Transistor Variability and
Degradation. IEEE Micro, 25(6), 2005.
[3] M. Dimitrov and H. Zhou. Unified architectural support for
soft-error protection or software bug detection. In Proc. Int’l
Conf. on Parallel Architectures and Compilation Techniques
(PACT), 2007.
[4] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin.
Dynamically discovering likely program invariants to support
program evolution. IEEE Trans. Software Eng., 2001.
[5] O. Goloubeva et al. Soft-Error Detection Using Control Flow
Assertions. In Proc. of 18th IEEE Intl. Symp. on Defect and
Fault Tolerance in VLSI Systems, 2003.
[6] S. Hangal and M. S. Lam. Tracking down software bugs
using automatic anomaly detection. In Proceedings of the
International Conference on Software Engineering, 2002.
[7] Y. Kataoka, M. D. Ernst, W. G. Griswold, and D. Notkin.
Automated support for program refactoring using invariants.
In IEEE Int’l Conf. on Software Maintenance (ICSM), 2001.
[8] C. Lattner and V. Adve. LLVM: A Compilation Framework
for Lifelong Program Analysis and Transformation. In Proc.
Int’l Symp. on Code Generation and Optimization, 2004.
[9] M. Li, P. Ramachandran, S. Sahoo, S. Adve, V. Adve, and
Y. Zhou. Understanding the Propagation of Hard Errors to
Software and Implications for Resilient System Design. In
Proc. Intl. Conf. on Architectural Support for Programming
Languages and Operating Systems(ASPLOS), 2008.
[10] M. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve,
and Y. Zhou. Trace Based Diagnosis of Permanent Hardware
Faults. In International Conference on Dependable Systems
and Networks, 2008.
[11] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug
isolation via remote program sampling. In Proc. of Conf. on
Programming Language Design and Implementation, 2003.
[12] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan.
Scalable statistical bug isolation. In Proc. of Conf. on
Programming Language Design and Implementation, 2005.
[13] S. Lu, J. Tucek, F. Qin, and Y. Zhou. Avio: detecting atomicity
violations via access interleaving invariants. In Proc. Int’l
Conf. on Architectural Support for Programming Languages
and Operating Systems (ASPLOS), 2006.
[14] M. Martin et al. Multifacet’s General Execution-Driven
Multiprocessor Simulator (GEMS) Toolset. SIGARCH
Computer Architecture News, 33(4), 2005.
[15] C. J. Mauer, M. D. Hill, and D. A. Wood. Full-System Timing-
First Simulation. SIGMETRICS Performance Evaluation Rev.,
30(1), 2002.
[16] A. Meixner, M. E. Bauer, and D. J. Sorin. Argus: Low-
cost, comprehensive error detection in simple cores. In Proc.
ACM/IEEE Int’l Symposium on Microarchitecture, 2007.
[17] A. Meixner and D. J. Sorin. Error detection using dynamic
dataflow verification. In Proc. Int’l Conf. on Parallel
Architectures and Compilation Techniques (PACT), 2007.
[18] J. Nakano et al. ReViveI/O: Efficient Handling of I/O in
Highly-Available Rollback-Recovery Servers. In Int’l Symp.
on High Performance Computer Architecture (HPCA), 2006.
[19] N. Nakka et al. An Architectural Framework for Detecting
Process Hangs/Crashes. In European Dependable Computing
Conference (EDCC), 2005.
[20] J. W. Nimmer and M. D. Ernst. Automatic generation of
program specifications. In Proc. ACM SIGSOFT Int’l Symp.
on Software Testing and Analysis, 2002.
[21] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow
checking by software signatures. IEEE Trans. on Reliability,
51, March 2002.
[22] K. Pattabiraman, G. P. Saggesse, D. Chen, Z. Kalbarczyk,
and R. Iyer. Dynamic derivation of application-specific error
detectors and their hardware implementation. In Proc. of
European Dependable Computing Conference (EDCC), 2006.
[23] M. Prvulovic et al. ReVive: Cost-Effective Architectural
Support for Rollback Recovery in Shared-Memory Multipro-
cessors. In Int’l Symp. on Computer Architecture (ISCA),
2002.
[24] P. Racunas et al. Perturbation-based Fault Screening. In
International Symposium on High Performance Computer
Architecture (HPCA), 2007.
[25] V. Reddy et al. Assertion-Based Microarchitecture Design
for Improved Fault Tolerance. In International Conference on
Computer Design, 2006.
[26] G. A. Reis et al. Software-Controlled Fault Tolerance. ACM
Transactions on Architecture and Code Optimization, 2(4), 2005.
[27] E. Rotenberg. AR-SMT: A Microarchitectural Approach
to Fault Tolerance in Microprocessors. In International
Symposium on Fault-Tolerant Computing (FTCS), 1999.
[28] D. Sorin et al. SafetyNet: Improving the Availability of Shared
Memory Multiprocessors with Global Checkpoint/Recovery.
In Int’l Symp. on Computer Architecture (ISCA), 2002.
[29] R. Vemu and J. A. Abraham. CEDA: Control-flow Error De-
tection through Assertions. In Intl. On-Line Test Symposium,
2006.
[30] R. Venkatasubramanian et al. Low-Cost On-Line Fault
Detection Using Control Flow Assertions. In International
On-Line Test Symposium, 2003.
[31] Virtutech. Simics Full System Simulator. Website, 2006.
http://www.simics.net.
[32] N. Wang and S. Patel. ReStore: Symptom-Based Soft
Error Detection in Microprocessors. IEEE Transactions on
Dependable and Secure Computing, 3(3), July-Sept 2006.
[33] P. Zhou, W. Liu, F. Long, S. Lu, F. Qin, Y. Zhou, S. Midkiff,
and J. Torrellas. Accmon: Automatically detecting memory-
related bugs via program counter-based invariants. In Proc.
ACM/IEEE Int’l Symposium on Microarchitecture, 2004.