Resilient Theme Task # 5.5.3 GSRC Annual Symposium · SWAT_poster_draft_3.pptx Author: Pradeep Created Date: 9/15/2010 8:03:22 PM ...

GSRC Annual Symposium

September 28, 2010 through October 1, 2010

Detection Results

Low SDC rate for all apps

<0.5% of injections SDCs

Short detection latency

>90% in <100K instr

⇒ Low-cost symptom detection feasible for HW faults

Diagnosis Results

>95% successful diagnosis

Latency <10M ⇒ invisible

µarch-level diagnosis for repair

⇒ SWAT diagnoses faults in single and multi-core systems

Recovery Results

Pradeep Ramachandran, Siva Kumar Sastry Hari, Manlap Li, Swarup Sahoo, Robert Smolinski, Xin Fu, Lei Chen, Sarita Adve, Vikram Adve

Resilient Theme Task # 5.5.3

The Reliability Threat

Technology scaling ⇒ smaller devices vulnerable to failures

Increased in-the-field failures in commodity systems

Need low-cost detection, diagnosis, recovery, repair solutions

Traditional solutions ⇒ high area, performance, power overheads

SWAT: A Comprehensive Low Cost Solution

Fault Detection [ASPLOS '08, DSN '08]

Fault Recovery [submitted]

Key Findings

SWAT effective for permanent, transient faults in many apps

Detection: <0.5% SDC rate in SPEC, server, media apps

Low overheads during fault-free execution

Recovery: Majority of faults recoverable in <100K instructions

<5% perf, near-zero area impact from recovery operations

Diagnosis: >95% of detected faults successfully diagnosed

Faulty core identified without spare core

TMR/DMR only for diagnosis ⇒ does not impact fault-free exec

Fault Diagnosis [DSN '08, MICRO '09]

Transient errors, wear-out, design bugs, … and so on

Goal: Effective, quick detection with minimal fault-free impact

Use symptom detectors to monitor anomalous SW execution

Simple hardware detectors with low area overheads

Low-cost SW detectors to aid HW detectors

Goal: Low-cost fault recovery in the presence of I/O

HW checkpoint to restore system state

Low-cost recovery for proc + memory

Buffer external outputs in dedicated HW

First low-cost implementation w/ simple HW

Avoids commonly ignored output-commit problem

Leverage SW support for device reset, input replay
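The recovery idea above (checkpoint state, hold external outputs until they are safe to release) can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `BufferedRecovery` and its methods are hypothetical names, the buffer is modeled in software rather than dedicated HW, and an interval is simply assumed fault-free when `commit` is called. It is not the SWAT implementation.

```python
import copy


class BufferedRecovery:
    """Toy model (hypothetical): checkpointed state plus an output buffer."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)
        self._pending = []   # outputs produced since the last checkpoint
        self.released = []   # outputs already visible to the outside world

    def emit(self, output):
        # Hold the output back instead of sending it immediately.
        self._pending.append(output)

    def commit(self):
        # Interval assumed fault-free: release outputs, take a new checkpoint.
        self.released.extend(self._pending)
        self._pending.clear()
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Symptom detected: drop unreleased outputs and restore the state.
        self._pending.clear()
        self.state = copy.deepcopy(self._checkpoint)
```

Because outputs from a faulty interval never leave the buffer, a rollback cannot leave an erroneous output visible externally, which is the output-commit problem the poster refers to.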

Goal: Diagnose fault source without affecting fault-free exec

⇒ No spares for diagnosis

Diagnose faulty core even when symptom from fault-free core

Fatal Traps: Div by zero, RED state, etc.

Hangs: Simple HW hang detector

Kernel Panic: OS panics due to fault

High OS: High contiguous OS activity

App Abort: App abort due to fault
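To make the symptom-detector idea concrete, the sketch below scans a stream of retired-instruction events and reports the first symptom it sees. Everything here is an assumption for illustration: the `Event` record, the thresholds, and the "same PC repeated" hang heuristic are hypothetical and are not the detectors evaluated on the poster.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical thresholds, chosen for illustration only.
HANG_REPEAT_LIMIT = 1000   # same PC retired this many times in a row
HIGH_OS_LIMIT = 20_000     # contiguous instructions spent in OS mode


@dataclass
class Event:
    pc: int                    # program counter of a retired instruction
    in_os: bool                # True if executed in privileged (OS) mode
    fatal_trap: bool = False   # True if the instruction raised a fatal trap


def first_symptom(trace: Iterable[Event]) -> Optional[str]:
    """Return the name of the first symptom observed, or None."""
    repeat_pc, repeat_count = None, 0
    os_run = 0
    for ev in trace:
        if ev.fatal_trap:
            return "fatal_trap"
        # Crude hang heuristic: the same PC retiring over and over.
        if ev.pc == repeat_pc:
            repeat_count += 1
        else:
            repeat_pc, repeat_count = ev.pc, 1
        if repeat_count >= HANG_REPEAT_LIMIT:
            return "hang"
        # High-OS heuristic: a long unbroken run of OS-mode instructions.
        os_run = os_run + 1 if ev.in_os else 0
        if os_run >= HIGH_OS_LIMIT:
            return "high_os"
    return None
```

Each check needs only a counter or a single flag, which is consistent with the poster's point that such detectors can be simple and cheap.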

[Figure: Recovery outcomes as a percentage of injected faults (0% to 100%). Bars: Full, No-Device, and No-I/O configurations at 100K and 10M intervals, for permanent and transient faults. Legend: Potential SDC, DUE, Recovered, Masked. Data labels: 6.3%, 2.5%, 2.3%, 1.5%.]

Ongoing and Future Work

Ongoing: Prototyping SWAT on FPGA

Implement SWAT firmware in OpenSolaris

Demonstrate SWAT on multicore OpenSPARC FPGA

Leverage Univ. of Michigan CrashTest for fault injection

Understand when/why SWAT works

Evaluate SWAT for off-core faults, other fault models

Challenges

Multithreaded applications

Full-system deterministic replay

No known good core

Key Ideas

Isolated deterministic replay

Emulated TMR

[Diagram: cores A, B, C, D running threads TA, TB, TC, TD]
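The voting step behind an emulated-TMR style diagnosis can be sketched as follows: replay the same thread on several cores, compare the outcomes, and flag any core that disagrees with the majority. The function name `diagnose_by_vote` and the dictionary input are hypothetical, and this is a generic majority vote rather than the algorithm used in the cited work.

```python
from collections import Counter


def diagnose_by_vote(results):
    """Majority vote over replay outcomes (hypothetical helper).

    results maps a core name to the (hashable) outcome of replaying the
    same thread on that core. Returns the sorted list of cores that
    disagree with the majority, or None if no strict majority exists.
    """
    counts = Counter(results.values())
    outcome, votes = counts.most_common(1)[0]
    if votes <= len(results) // 2:
        return None  # no strict majority, so the vote is inconclusive
    return sorted(core for core, r in results.items() if r != outcome)
```

For example, `diagnose_by_vote({"A": 42, "B": 42, "C": 7})` singles out core `C` without needing a core that is known to be good in advance.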

[Figure: Percentage of detected faults (0% to 100%) that are Correctly Diagnosed vs. Undiagnosed, per faulty structure. Data labels, in the order extracted: Decoder 99, INT ALU 100, Reg Dbus 99, Int reg 87, ROB 100, RAT 78, AGEN 99, Average 95.9.]

[Figure: Client exec time with buffer / without buffer (axis ticks 1, 10, 100) vs. Chkpt Interval (in instructions): 10K, 100K, 1M, 2M, 5M, 10M. Series: apache, sshd, squid, mysql.]

Out-of-Bounds

HW/SW co-designed detector

Monitor legal limit of addresses

Low perf, area overhead

iSWAT

Compiler support to detect faults

Use likely invariants as detectors

Low false +ves, perf. impact
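One way to picture a likely invariant used as a detector is a value range learned from fault-free training runs and checked at run time. The sketch below assumes range invariants; the poster does not say which form of invariant is used, and the class `RangeInvariant` is a hypothetical name.

```python
class RangeInvariant:
    """Hypothetical range-based likely invariant for a single value."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def train(self, value):
        # Widen the observed range using a value seen in a fault-free run.
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def violated(self, value):
        # An untrained invariant never fires.
        if self.lo is None:
            return False
        return not (self.lo <= value <= self.hi)
```

Since the range only reflects the training inputs, a violation may be a false positive rather than a fault, which is why the false-positive rate matters for this kind of detector.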

[Figure: Outcomes as a percentage of total injections (0% to 100%) for SPEC, Server, and Media workloads under permanent and transient faults. Legend: Masked, Detected, App-Tolerated, SDC. Data labels: 0.1, 0.1, 0.2, 0.2, 0.3, 0.5.]

* Does not include iSWAT detectors

Low overheads @ 100K inst

<5% perf, <2KB area

Practical solution ⇒ delay <1M inst

High recovery at 100K interval

Low perf, area impact

⇒ SWAT effective for low-cost fault recovery
