DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS Qing Xu Kevin Wang

Dec 14, 2015

Page 1

DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS

Qing Xu, Kevin Wang

Page 2

OUTLINE

- Background
- Motivation
- Key Ideas
- Introduction to CRAFT
- Summary and Discussion Points

Page 3

BACKGROUND

Smaller and Faster Transistors
- Lower threshold voltage
- Tighter noise margins
- Less reliable

Transient Faults
[Figure: an alpha particle strike flips a stored bit (0 to 1)]

Results
- Incorrect program execution

Recovery
- REDUNDANCY: Software Only, Hardware Only

int main() { cout << "Hello\n"; }

int main() { cout << "Hello\n"; }

Page 4

MOTIVATION AND GOAL

Software Only
- Inadequate coverage
- Slow

Hardware Only
- Large overhead/area
- High cost

Hybrid Solution
- Better reliability and performance
- Lower hardware area and cost

Page 5

KEY IDEA: COMPILER ASSISTED FAULT TOLERANCE (CRAFT)

Characteristics:

- Based on software technique

- Minimal hardware adaptations

- Takes advantage of both software and hardware solutions

Benefits:

- Nearly perfect reliability

- Low performance degradation

- Low hardware cost

[Figure labels: Software, Hardware]

Page 6

CRAFT: HYBRID OF EXISTING METHODS

Hardware Method
- Redundant Multithreading Technique (RMT)
- Error Correcting Codes (ECC)
Advantages
- Almost-perfect fault coverage
- Low performance cost

Software Method
- Software Implemented Fault Tolerance (SWIFT)
- Error Detection by Duplicating Instructions (EDDI)
Advantages
- High fault coverage
- Modest performance cost
- Zero hardware cost

Page 7

EXISTING METHOD: HARDWARE (RMT)

- RMT uses SMT resources to run loosely synchronized redundant threads

- Components not covered by redundant execution must employ alternative techniques, such as Error Correction Code (ECC)

[Figure: Redundant Multithreading (RMT), showing an Original Thread and a Checker Thread]
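The original-thread and checker-thread arrangement can be illustrated in software. The following is only a minimal sketch, not how RMT hardware works: it runs the same toy program twice in sequence and compares the results, and it does not model SMT resources or loose synchronization. The names `run`, `redundant_run`, and `fault_at` are hypothetical and chosen for illustration.

```python
def run(program, x, fault_at=None):
    """Run a list of one-argument steps on x.

    If fault_at is given, flip bit 0 of the value right after that step
    (an injected transient fault).
    """
    for i, step in enumerate(program):
        x = step(x)
        if i == fault_at:
            x ^= 1  # injected single bit flip
    return x


def redundant_run(program, x, fault_at=None):
    """Run an 'original' copy (possibly faulty) and a fault-free 'checker' copy."""
    original = run(program, x, fault_at)
    checker = run(program, x)
    if original != checker:
        raise RuntimeError("fault detected: original and checker disagree")
    return original


# Toy program: add 3, then double.
program = [lambda v: v + 3, lambda v: v * 2]
result = redundant_run(program, 5)
```

Calling `redundant_run(program, 5, fault_at=0)` raises instead of returning, because the flipped bit makes the two copies disagree.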

Page 8

EXISTING METHOD: SOFTWARE (SWIFT)

- A compiler-based transformation
- Store instruction is the synchronization point
- Assumes that Error Correction Code (ECC) guards correctness of the memory subsystem

ld r3 = [r4]
add r1 = r2, r3
st m[r1] = r2

(Original Code)

ld r3 = [r4]
mov r3' = r3
add r1 = r2, r3
add r1' = r2', r3'
br Fault, r1 != r1'
br Fault, r2 != r2'
br Fault, r3 != r3'
st m[r1] = r2

(SWIFT Code)
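To make the SWIFT code above concrete, here is a small simulation of that exact instruction sequence. It is a sketch under assumptions: registers are modeled as a dictionary, memory as a dictionary keyed by address, and the `flip` parameter (hypothetical, not part of SWIFT) injects one bit flip just before the checks so the branch-to-Fault path can be exercised.

```python
class FaultDetected(Exception):
    """Stands in for the 'Fault' branch target in the slide's code."""


def swift_sequence(mem, regs, flip=None):
    """Execute the slide's SWIFT code.

    regs must hold r2, r2' and r4 on entry.
    flip = (register_name, bit) flips one bit just before the checks.
    """
    r = dict(regs)
    r["r3"] = mem[r["r4"]]            # ld  r3  = [r4]
    r["r3'"] = r["r3"]                # mov r3' = r3
    r["r1"] = r["r2"] + r["r3"]       # add r1  = r2, r3
    r["r1'"] = r["r2'"] + r["r3'"]    # add r1' = r2', r3'
    if flip is not None:
        name, bit = flip
        r[name] ^= 1 << bit           # injected transient fault
    for name in ("r1", "r2", "r3"):   # br Fault, rN != rN'
        if r[name] != r[name + "'"]:
            raise FaultDetected(name)
    mem[r["r1"]] = r["r2"]            # st m[r1] = r2
    return mem


memory = {0x10: 4}
registers = {"r2": 6, "r2'": 6, "r4": 0x10}
swift_sequence(memory, registers)     # fault-free run performs the store
```

With `flip=("r1", 0)` the comparison of `r1` and `r1'` fails, `FaultDetected` is raised, and the store is never performed. A flip injected after the checks but before the store would go unnoticed in this model.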

Page 9

CRAFT: SUITE OF THREE DETECTION SYSTEMS

List of the Suite:

1. Checking Store Buffer (CSB)

2. Load Value Queue (LVQ)

3. CSB + LVQ

Preliminaries

- Assume Single Event Upset (SEU) fault model

- Architecturally Correct Execution (ACE)

- Detected Unrecoverable Error (DUE)

- Silent Data Corruption (SDC)

Page 10

SUITE 1: CHECKING STORE BUFFER (CSB)

Problem to Improve:
• SWIFT: Vulnerable to faults in the time interval between the validation and use of a register value

Solution:
• Add a Store Buffer to perform checks

[Figure: timeline from "Validated values" to "Use of validated values"; the interval in between is vulnerable to faults]

Page 11

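CSB: IMPLEMENTATION

Basic Idea: Commit a store when two copies of store data match

Method: Create CSB to keep track of all original and duplicated instructions

Compiler duplicates stores:

st [r1] = r2

becomes

st1 [r1] = r2
st2 [r1'] = r2'

CSB #       0     1     2     3
Address     --    --    0xFF  0xEE
Value       --    --    0x8   0x1
Validated   --    --    N     N

[Figure: Insn duplicate #1 (address 0xFF, value 0x8) matches its CSB entry, so the entry is marked validated (Y): "Store Value Checks Out! Send to MEM." Insn duplicate #2 (address 0xEE, value 0x2) does not match the buffered value 0x1, so the entry stays unvalidated (N): "Not match, not OK to go to MEM."]

Unvalidated entries stay in the table, so the table will fill up and cause a structural hazard.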
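The commit-on-match behavior described on this slide can be sketched as a small software model. This is an illustration under assumptions, not the hardware design: the class and method names (`CheckingStoreBuffer`, `st1`, `st2`) are hypothetical, the duplicate is compared only against the oldest unvalidated entry, and a full buffer is reported by raising an exception.

```python
class CheckingStoreBuffer:
    """Toy model of a CSB that holds stores until their duplicates arrive."""

    def __init__(self, size=4):
        self.size = size
        self.entries = []  # unvalidated (address, value) pairs, oldest first

    def st1(self, address, value):
        """Original store: allocate an unvalidated entry."""
        if len(self.entries) == self.size:
            raise RuntimeError("CSB full: structural hazard")
        self.entries.append((address, value))

    def st2(self, address, value, memory):
        """Duplicate store: commit to memory only if it matches the oldest entry.

        Returns True when the store was validated and sent to memory.
        """
        if self.entries and self.entries[0] == (address, value):
            memory[address] = value
            self.entries.pop(0)
            return True
        return False  # mismatch: entry stays buffered, nothing reaches memory


# Values taken from the slide's table.
csb = CheckingStoreBuffer(size=4)
memory = {}
csb.st1(0xFF, 0x8)
csb.st1(0xEE, 0x1)
first = csb.st2(0xFF, 0x8, memory)   # matches, committed
second = csb.st2(0xEE, 0x2, memory)  # value differs, not committed
```

In this model a mismatched entry is never released, so later stores eventually hit the full-buffer error, which mirrors the slide's remark about the table filling up.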

Page 12

CSB: ADVANTAGES / DISADVANTAGES

Advantages
- Checking is implemented at the hardware level
- No longer need validation code; reduces code size
- Store instructions are no longer synchronization points (as they are in SWIFT)
- Exploits more dynamic scheduling

Disadvantages
- Additional compiler requirement: the distance between duplicated instructions should not exceed the size of the CSB

Page 13

SUITE 2: LOAD VALUE QUEUE (LVQ)

Problem to Improve:
• SWIFT: Window of vulnerability between load instruction and value duplication.

Solution:
• Add a load value queue

[Figure: timeline from "Loading values" to "Copying values"; the interval in between is vulnerable to faults]

Page 14

LVQ : IMPLEMENTATION PROCEDURE

Threadmill: Branch to TEST1

Basic Idea: Duplicate load to enable redundant computation

Method: LVQ provides redundant load instruction execution

Compiler duplicates loads:

ld [r1] = r2

becomes

ld1 [r1] = r2
ld2 [r1'] = r2'

LVQ #       0     1     2     3
Address     --    --    --    --
Value       --    --    --    --

[Figure: the ld insn accesses address 0xAA and obtains value 0x2; the ld insn duplicate for address 0xAA is given the same value 0x2]
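A minimal software model of the load value queue may help here. It is a sketch under assumptions rather than the hardware design: the names `LoadValueQueue`, `ld1`, and `ld2` are hypothetical, only the original load touches memory, and the duplicate load is served from the queue in program order with an address comparison.

```python
from collections import deque


class LoadValueQueue:
    """Toy LVQ: the original load records its value, the duplicate reuses it."""

    def __init__(self):
        self.queue = deque()  # (address, value) recorded by original loads

    def ld1(self, address, memory):
        """Original load: read memory and record the address/value pair."""
        value = memory[address]
        self.queue.append((address, value))
        return value

    def ld2(self, address):
        """Duplicate load: take the value from the queue, not from memory."""
        recorded_address, value = self.queue.popleft()
        if recorded_address != address:
            raise RuntimeError("fault detected: load addresses differ")
        return value


# Values taken from the slide's figure.
memory = {0xAA: 0x2}
lvq = LoadValueQueue()
r2 = lvq.ld1(0xAA, memory)
memory[0xAA] = 0x7          # an intervening write does not affect the duplicate
r2_dup = lvq.ld2(0xAA)
```

Because the duplicate never goes to memory, both copies see the same value even if memory changes in between, and memory traffic is not doubled.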

Page 15

LVQ: ADVANTAGES / DISADVANTAGES

Advantages
- Reduces the window of vulnerability by issuing a duplicated load instruction
- Keeps memory traffic low by bypassing the load value

Disadvantages
- Extra hardware to enforce that loads and their duplicates access the same entry in the LVQ

Page 16

SUITE 3: CSB + LVQ

Adds both the CSB and the LVQ simultaneously to a software-only solution like SWIFT

Page 17

EXPERIMENTAL EVALUATION

Evaluation Method: Performance vs. Reliability

Inject randomly chosen faults into a detailed microarchitectural simulation

Each chosen bit flip is tracked until completion of the program

Analyze final result to determine:

- How much SDC is converted to DUE

- How much work (# of applications) the program completed before encountering an SDC
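The injection-and-classification loop described above can be sketched on a toy program. Everything here is an assumption made for illustration: the "program" is three hypothetical registers (one checked against a duplicate, one unchecked, one dead) rather than a microarchitectural simulation, and the outcome labels follow a simple rule (detected means DUE, undetected wrong output means SDC, otherwise benign).

```python
import random
from collections import Counter


def run_once(target=None, bit=0):
    """Toy program; optionally flip one bit of one register before it runs."""
    regs = {"a": 5, "a_dup": 5, "u": 7, "dead": 9}
    if target is not None:
        regs[target] ^= 1 << bit              # single event upset
    detected = regs["a"] != regs["a_dup"]     # redundancy check
    output = regs["a"] + regs["u"]            # 'dead' never reaches the output
    return detected, output


def classify(detected, output, golden):
    """Label one run by comparing it with the fault-free (golden) output."""
    if detected:
        return "DUE"
    return "SDC" if output != golden else "benign"


def campaign(trials, seed=0):
    """Inject one random bit flip per trial and count the outcomes."""
    rng = random.Random(seed)
    _, golden = run_once()
    counts = Counter()
    for _ in range(trials):
        target = rng.choice(["a", "a_dup", "u", "dead"])
        detected, output = run_once(target, rng.randrange(8))
        counts[classify(detected, output, golden)] += 1
    return counts


counts = campaign(1000)
```

In this toy, flips in the checked pair become DUE, flips in the unchecked register become SDC, and flips in the dead register are benign; protecting more state moves outcomes from SDC to DUE.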

Page 18

EXPERIMENTAL EVALUATION

Results: Measures # of applications the program completed before encountering an SDC

Implementation   Performance
CSB              Enables better performance as it eliminates scheduling constraints
LVQ              Impact varies by benchmark

Page 19

SUMMARY AND CONCLUSION

Hybrid technique can provide better reliability with relatively low cost

CRAFT, as compared to:

Software-only Technique
- Execution time reduction by 5%
- SDC to DUE conversion rate increase by 75%

Hardware-only Technique
- Significantly reduce area overhead
- Maintain comparable reliability

Page 20

DISCUSSION POINTS

CRAFT detects fault when CSB is clogged

Tradeoff between detection latency and more flexible scheduling?

Recovery method?

Evaluation in terms of coverage?