Self-calibrating Online Wearout Detection Authors: Jason Blome Shuguang Feng

Self-calibrating Online Wearout Detection. Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke. MICRO-40, December 3, 2007.

Transcript

1 University of Michigan, Electrical Engineering and Computer Science

Self-calibrating Online Wearout Detection

Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke

MICRO-40

December 3, 2007

2 Motivation

“Designing Reliable Systems from Unreliable Components…”

- Shekhar Borkar (Intel)

[Srinivasan, DSN‘04] [Borkar, MICRO‘05]

More failures to come

Failures will be wearout induced

3 Current Approaches

Traditional: design margins, burn-in

Detection: based on replication of computation

• TMR (Tandem/HP NonStop servers)

• DIVA (Bower, MICRO’05)

Prediction: utilizes precise analytical models and/or sensors

• Canary circuits (Sentinel Silicon, RidgeTop)

• RAMP (Srinivasan, UIUC/IBM)

Slide callouts: Costly, Static, Impractical

4 Wearout Mechanisms

Many failure mechanisms have been shown to be progressive

• Hot carrier injection (HCI)

• Electromigration (EM)

• Oxide Breakdown (OBD)

• Negative Bias Temperature Instability (NBTI)

[Figure: MOSFET cross-section (G, S, D, B terminals, N+ regions, P-well, oxide) annotated with gate leakage currents Igs, Igd, Igb, Igcs, Igcd]

5 Objective

Propose a failure prediction technique that exploits the progressive nature of wearout

Monitor impact on path delays

Prediction

• Monitors evolution of wearout

• Proactive

• Enables failure avoidance/mitigation

• Continuous feedback

• False negatives and positives

Detection

• Identifies existing fault

• Reactive

• Enables failure recovery

• End-of-life feedback

• False negatives

6 Oxide Breakdown (OBD)

Accumulation of defects leads to a conductive path

Percolation Model [Stathis, JAP‘06]

[Figure: MOSFET cross-section (G, S, D, B terminals, N+ regions, P-well, oxide) showing defects in the gate oxide and the resulting leakage ΔIoxide]
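The percolation idea on this slide (defects accumulate until they form a conductive path through the oxide) can be illustrated with a small Monte Carlo sketch. This is only a toy model under stated assumptions, not the model from [Stathis, JAP‘06]: the oxide is a hypothetical 2D grid of cells, defects land uniformly at random, and breakdown is declared when 4-connected defective cells link the top row (gate side) to the bottom row (substrate side). The names `defects_to_breakdown` and `spans` are made up for this example.

```python
import random


def spans(defective, rows, cols):
    """Return True if defective cells connect the top row to the bottom row."""
    stack = [(0, c) for c in range(cols) if defective[0][c]]
    seen = set(stack)
    while stack:
        r, c = stack.pop()
        if r == rows - 1:
            return True
        # Assumption: defects conduct to their 4-connected neighbours only.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and defective[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                stack.append((nr, nc))
    return False


def defects_to_breakdown(rows, cols, rng):
    """Add random defects one at a time; return how many were needed
    before a conductive path spans the (hypothetical) oxide grid."""
    defective = [[False] * cols for _ in range(rows)]
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)  # random order in which cells become defective
    for count, (r, c) in enumerate(cells, start=1):
        defective[r][c] = True
        if spans(defective, rows, cols):
            return count
    return rows * cols  # only reached for degenerate grids


if __name__ == "__main__":
    rng = random.Random(0)
    trials = [defects_to_breakdown(4, 16, rng) for _ in range(1000)]
    print("mean defects to breakdown:", sum(trials) / len(trials))
```

Repeating the experiment many times gives a distribution of defect counts at breakdown, which is the kind of statistic a Monte Carlo wearout simulation would draw on.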

7 OBD HSPICE Model

Post-breakdown leakage modeling [Rodriguez, Stathis, Linder, IRPS ‘03]

$I_{gd} = K \cdot I_{gd}^{0}$

$I_{gs} = K \cdot I_{gs}^{0}$

$I_{gcs}$, $I_{gcd}$, and $I_{gb}$ remain unchanged

[Figure: MOSFET cross-section (G, S, D, B terminals, N+ regions, P-well) annotated with gate leakage currents Igs, Igd, Igb, Igcs, Igcd] [BSIM4.6.0, ‘06]

8 Characterization Testbench

90nm standard cell library

[Figure: characterization testbench; labels: t_circuit, t_cell, BUFX4, FO4 GATE, FO4 BUFX4, DC, Gate UUT]

9 Impact on Propagation Delay

10 Delay Profiling Unit (DPU)

[Figure: DPU schematic; labels: input signal, Latency Sampling, uArch Module; sampled bit values shown as 0s and 1s]

11 TRIX Analysis

Magnitude of divergence between TRIX_global and TRIX_local reflects amount of degradation

12 TRIX Analysis Details

Exponential Moving Average (EMA)

$\mathrm{EMA}(t) = \alpha \cdot (\mathrm{price}_t - \mathrm{EMA}_{t-1}) + \mathrm{EMA}_{t-1}$, where $\alpha$ is defined by the window size

Triple-smoothed Exponential Moving Average

$\mathrm{EMA1}(t) = \alpha \cdot (\mathrm{price}_t - \mathrm{EMA1}_{t-1}) + \mathrm{EMA1}_{t-1}$

$\mathrm{EMA2}(t) = \alpha \cdot (\mathrm{EMA1}_t - \mathrm{EMA2}_{t-1}) + \mathrm{EMA2}_{t-1}$

$\mathrm{EMA3}(t) = \alpha \cdot (\mathrm{EMA2}_t - \mathrm{EMA3}_{t-1}) + \mathrm{EMA3}_{t-1}$
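The triple-smoothed EMA recurrence on this slide can be sketched in a few lines of Python. The slide says only that α is defined by the window size; the mapping `alpha = 2 / (window + 1)` used here is a common convention and an assumption, as are the names `ema_step` and `Trix` and the choice to seed all three stages with the same initial value.

```python
def ema_step(prev, sample, alpha):
    """One EMA update: EMA(t) = alpha * (sample - EMA(t-1)) + EMA(t-1)."""
    return alpha * (sample - prev) + prev


class Trix:
    """Triple-smoothed EMA: each stage smooths the output of the previous one."""

    def __init__(self, window, initial):
        # Assumption: the usual window-to-alpha mapping; the slide does not
        # give the exact formula.
        self.alpha = 2.0 / (window + 1)
        self.ema1 = self.ema2 = self.ema3 = float(initial)

    def update(self, sample):
        self.ema1 = ema_step(self.ema1, sample, self.alpha)     # smooths the input
        self.ema2 = ema_step(self.ema2, self.ema1, self.alpha)  # smooths EMA1
        self.ema3 = ema_step(self.ema3, self.ema2, self.alpha)  # smooths EMA2
        return self.ema3


if __name__ == "__main__":
    trix = Trix(window=3, initial=0.0)
    print(trix.update(8.0))  # alpha = 0.5: EMA1 = 4, EMA2 = 2, EMA3 = 1
```

In the slides the input ("price" in the original financial formulation) is a sampled signal latency, so a small window follows recent latency closely while a large window changes slowly.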

13 Noisy Latency Profile

[Figure: percent nominal delay (%), axis 94 to 110, versus increasing age; series: Raw Latency Profile, TRIX Profile (local), TRIX Profile (global)]

14 DPU with TRIX Hardware

[Figure: DPU block diagram; labels: input signal, Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction; sampled bit values shown as 0s and 1s]

15 Wearout Detection Unit (WDU)

[Figure: WDU block diagram; labels: Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]
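A software sketch of the WDU data flow (latency samples feed a local and a global TRIX, and their divergence drives the prediction) might look like the following. The window sizes, the 2% threshold, the relative-divergence test, and the names `trix_series` and `first_wearout_flag` are all hypothetical choices for illustration; the slides do not specify them.

```python
def trix_series(samples, window):
    """Triple-smoothed EMA of a latency trace, seeded with the first sample."""
    alpha = 2.0 / (window + 1)  # assumed window-to-alpha mapping
    e1 = e2 = e3 = float(samples[0])
    out = []
    for s in samples:
        e1 += alpha * (s - e1)
        e2 += alpha * (e1 - e2)
        e3 += alpha * (e2 - e3)
        out.append(e3)
    return out


def first_wearout_flag(latencies, local_window=4, global_window=64,
                       threshold=0.02):
    """Return the index of the first sample where the local TRIX exceeds the
    global TRIX by more than `threshold` (relative), or None if it never does."""
    local = trix_series(latencies, local_window)
    glob = trix_series(latencies, global_window)
    for i, (l, g) in enumerate(zip(local, glob)):
        if (l - g) / g > threshold:
            return i
    return None


if __name__ == "__main__":
    # Synthetic trace: stable latency, then a slow upward drift.
    healthy = [100.0] * 100
    aging = [100.0 + 0.5 * k for k in range(1, 101)]
    print(first_wearout_flag(healthy))          # None: no divergence
    print(first_wearout_flag(healthy + aging))  # index inside the drifting region
```

Because the global average adapts slowly, the threshold is applied to a divergence rather than to an absolute latency; this is one plausible reading of the "self-calibrating" property in the talk title.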

16 Evaluation Framework

[Figure: evaluation flow; blocks: OR1200 Verilog; Synthesis and Place and Route; Timing, Power, and Temperature Simulations; MediaBench Suite; 90nm Library; Fully Synthesized, P&R, OR1200 Core; Monte Carlo Simulator; OBD Wearout Model; HSPICE Simulations; Gate-level Processor Simulator; Workload Simulator; Wearout Simulator]

17 WDU Accuracy

[Figure: bar chart of percentage (%), axis 0 to 120, by module (ALU, Register File, LSU, Next PC); series: Life Expended, Signals Flagged]

18 WDU Overhead

[Figure: percentage overhead (%), axis 0 to 50, versus # signals monitored (1, 2, 4, 8); series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]

19 WDU Overhead

[Figure: percentage overhead (%), axis 0 to 3, versus # signals monitored (1, 2, 4, 8); series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]

20 Long-term Vision

Introspective Reliability Management (IRM)

• Intelligent reliability management directed by on-chip sensor feedback

Prospective sensors

• Delay (WDU)

• Leakage/Vt

• Temperature

21 Introspective Reliability Management

[Figure: IRM architecture; labels: WDU sensors (five instances), Raw Sensor Data, Filtering and Analysis, Filtered Data Stream, Aggregate Analysis, Processed Data, Runtime Analysis, Virtualization Layer, Reliability Assessment, OS, Scheduled Jobs, IRM Policy, Job Assignment, Thread Migration, Reconfiguration, Power/CLK Gating, DVFS Configuration/Settings]

22 Conclusions

Many progressive wearout phenomena impact device-level performance.

It’s possible to characterize this impact and anticipate failures.

WDU performance

• Failure predicted within 20% of end of life (tunable)

• Area overhead < 3% (hybrid)

Low-level sensors can be used to enable intelligent reliability management.

23 Questions?