SAFER: Stuck-At-Fault Error Recovery for Memories Nak Hee Seong Dong Hyuk Woo Vijayalakshmi Srinivasan Jude A. Rivers Hsien-Hsin S. Lee

Dec 17, 2015


Agnes Watts
Page 1: SAFER: Stuck-At-Fault Error Recovery for Memories

Nak Hee Seong†, Dong Hyuk Woo†, Vijayalakshmi Srinivasan‡, Jude A. Rivers‡, Hsien-Hsin S. Lee†

Page 2: Emerging Memory Technologies
• Resistive memories
– Due to DRAM scaling challenge
• Phase Change Memory (PCM)
– Scalability, high density
– Limited write endurance (Avg. 10^8 writes)
  • Incurring stuck-at faults

Page 3: Cell Write Endurance
• Endurance variation
– No spatial correlation
– Increases with technology scaling
• Issues
– Unpredictable cell endurance
  • Read verification required for each write
– The weakest cell dictates memory lifetime!
– # of stuck-at faults gradually grows!
  • Multi-bit error recovery scheme is needed!

Page 4: Existing Error Correcting Methods
• (72,64) Hamming code
– For transient faults
– Single Error Correction Double Error Detection (SECDED)
– 12.5% overhead
• Error-Correcting Pointers (ECP) [Schechter, ISCA37]
– Dynamically replace failed cells with extra cells
– Storing multiple fail pointers for each data block
– Recover from 6 fails with 61-bit overhead (11.9%)
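The two overhead figures on this slide can be checked with a little arithmetic. The sketch below assumes a 512-bit data block for ECP (the block size used later in the evaluation) and an ECP layout of one 9-bit pointer plus one replacement cell per entry, plus a single "full" bit; the function names are hypothetical.

```python
from math import ceil, log2

def secded_overhead(data_bits: int = 64, code_bits: int = 72) -> float:
    """Extra bits of a (code_bits, data_bits) code, as a fraction of the data bits."""
    return (code_bits - data_bits) / data_bits

def ecp_meta_bits(n_entries: int, block_bits: int = 512) -> int:
    """Assumed ECP layout: per entry a pointer and a replacement cell, plus one full bit."""
    pointer_bits = ceil(log2(block_bits))  # 9 bits address any cell of a 512-bit block
    return n_entries * (pointer_bits + 1) + 1

print(f"{secded_overhead():.1%}")                         # 12.5%
print(ecp_meta_bits(6), f"{ecp_meta_bits(6) / 512:.1%}")  # 61 11.9%
```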

Page 5: SAFER: Stuck-At-Fault Error Recovery

Page 6: Concept of SAFER
• Exploit two properties of Stuck-At Faults
– Permanency
– Readability
• Multiple error correction
– Fault separation
– Low-cost Single Error Correction (SEC)
[Figure labels: Fault Separation; SEC, SEC]

Page 7: SAFER
1. Fault Separation
2. Single Error Correction

Page 8: Fault Separation
• Assuming 2 faults in an 8-bit block
– C(8,2) = 28 possible fault pairs
• How to separate these 2 faults (of all 28 pairs)?
[Figure: three partition patterns, Pattern #2, Pattern #1, and Pattern #0, each drawn over bit positions 7 6 5 4 3 2 1 0]

Page 9: Decision for Fault Separation
• Use bit pointers for fault separation

Data Block
Bit Pointer          7 6 5 4 3 2 1 0
bit 2 (Pattern #2)   1 1 1 1 0 0 0 0
bit 1 (Pattern #1)   1 1 0 0 1 1 0 0
bit 0 (Pattern #0)   1 0 1 0 1 0 1 0

Page 10: Decision for Fault Separation
• Find pattern candidates by XORing bit pointers

Data Block
Bit Pointer          7 6 5 4 3 2 1 0
bit 2 (Pattern #2)   1 1 1 1 0 0 0 0
bit 1 (Pattern #1)   1 1 0 0 1 1 0 0
bit 0 (Pattern #0)   1 0 1 0 1 0 1 0

Difference Vector (bit 2, bit 1, bit 0): 1 0 0

Page 11: Decision for Fault Separation
• Find pattern candidates by XORing bit pointers

Data Block
Bit Pointer          7 6 5 4 3 2 1 0
bit 2 (Pattern #2)   1 1 1 1 0 0 0 0
bit 1 (Pattern #1)   1 1 0 0 1 1 0 0
bit 0 (Pattern #0)   1 0 1 0 1 0 1 0

Difference Vector (bit 2, bit 1, bit 0): 0 1 1
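A minimal sketch of the XOR step, assuming pattern #i splits the block by bit i of the bit pointer. The fault positions used in the two printed examples (7 and 3, 5 and 6) are hypothetical choices that reproduce the difference vectors 1 0 0 and 0 1 1.

```python
from itertools import combinations

def difference_vector(p: int, q: int) -> int:
    """XOR of two bit pointers; each set bit marks a pattern that separates them."""
    return p ^ q

def candidate_patterns(p: int, q: int, pointer_bits: int = 3) -> list[int]:
    """Indices of the patterns (pointer bits) under which p and q land in different groups."""
    diff = difference_vector(p, q)
    return [i for i in range(pointer_bits) if (diff >> i) & 1]

# All C(8,2) = 28 fault pairs of an 8-bit block have at least one candidate,
# because two distinct pointers always differ in at least one bit.
pairs = list(combinations(range(8), 2))
assert len(pairs) == 28
assert all(candidate_patterns(p, q) for p, q in pairs)

print(candidate_patterns(7, 3))  # [2]     difference vector 1 0 0
print(candidate_patterns(5, 6))  # [0, 1]  difference vector 0 1 1
```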

Page 12: Extension to Multi-Group Partition
• Use two bits for 4 group partition

Data Block
Bit Pointer   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
bit 3          1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0
bit 2          1  1  1  1  0  0  0  0  1  1  1  1  0  0  0  0
bit 1          1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0
bit 0          1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0

Example bit pairs: (bit 3, bit 2), (bit 3, bit 1), (bit 3, bit 0)
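As an illustration of the two-bit selection, the sketch below builds the group number of each cell by concatenating the chosen pointer bits. The helper names `group_index` and `partition` are hypothetical.

```python
def group_index(pointer: int, fields: tuple[int, ...]) -> int:
    """Concatenate the selected pointer bits into a group number."""
    idx = 0
    for bit in fields:
        idx = (idx << 1) | ((pointer >> bit) & 1)
    return idx

def partition(block_bits: int, fields: tuple[int, ...]) -> dict[int, list[int]]:
    """Map each group number to the bit pointers it contains."""
    groups: dict[int, list[int]] = {}
    for pointer in range(block_bits):
        groups.setdefault(group_index(pointer, fields), []).append(pointer)
    return groups

# Selecting (bit 3, bit 2) yields four contiguous groups of four cells.
for gid, members in sorted(partition(16, (3, 2)).items()):
    print(gid, members)
```

Choosing a different pair, such as (bit 3, bit 0), keeps four groups of four cells but interleaves their members.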

Page 13: Dynamic Partition
• 4 group partition for a 16-bit data block
[Figure: a data block with bit pointers 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 and its partition metadata, shown in successive states:
  1st Partition Field = bit 2, 2nd Partition Field = bit 0, Fixed Partition Counter = 0
  1st Partition Field = bit 3, 2nd Partition Field = bit 0, Fixed Partition Counter = 1
  Difference vectors: 1000 ⊕ 0010 = 1010 and 0010 ⊕ 0000 = 0010, with the labels "bit 3", "1", "bit 1", "2"]
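One possible reading of the metadata update, written as a sketch rather than the authors' exact procedure: when two known faults share a group, a differing pointer bit is taken from their XOR, stored in the next unfixed partition field, and the fixed partition counter is incremented. Picking the highest differing bit and the function names are assumptions.

```python
from itertools import combinations

def group_of(pointer, fields):
    """Group identifier: the values of the selected pointer bits."""
    return tuple((pointer >> b) & 1 for b in fields)

def repartition(faults, fields, fixed):
    """Return (fields, fixed) that separate all faults, or None if no field is left."""
    fields = list(fields)
    while True:
        clash = next(((p, q) for p, q in combinations(faults, 2)
                      if group_of(p, fields) == group_of(q, fields)), None)
        if clash is None:
            return fields, fixed
        if fixed == len(fields):               # every partition field is already fixed
            return None
        diff = clash[0] ^ clash[1]             # difference vector of the colliding pair
        fields[fixed] = diff.bit_length() - 1  # assumed choice: highest differing bit
        fixed += 1

state = ([2, 0], 0)                            # initial fields and counter, as on the slide
for faults in ([8], [8, 2], [8, 2, 0]):
    state = repartition(faults, *state)
    print(faults, state)
```

With faults at pointers 8, 2, and 0, the fields move from (bit 2, bit 0) to (bit 3, bit 0) and then (bit 3, bit 1), which is consistent with the two difference vectors on the slide.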

Page 14: Dynamic Partition
• Objective
– Separate multiple stuck-at faults into different groups
• Additional meta data
– Assuming an n bit block and a k group partition
– $\log_2 k \times \lceil \log_2 \log_2 n \rceil + \lceil \log_2(\log_2 k + 1) \rceil$ bits
  • $\log_2 k$: # of partition fields
  • $\lceil \log_2 \log_2 n \rceil$: size of each partition field
  • $\lceil \log_2(\log_2 k + 1) \rceil$: size of fixed partition counter
• Example: n = 512, k = 32
– Required meta data: 23 bits/block
– 6 ≤ the number of separable stuck-at faults ≤ 32
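The 23-bit example can be reproduced from the three labeled terms. This sketch assumes n and k are powers of two and that field and counter sizes round up to whole bits; the function name is hypothetical.

```python
from math import ceil, log2

def partition_meta_bits(n: int, k: int) -> int:
    """Partition metadata for an n-bit block split into k groups (n, k powers of two)."""
    num_fields = int(log2(k))                  # number of partition fields
    field_size = ceil(log2(log2(n)))           # bits to name one pointer-bit position
    counter_size = ceil(log2(num_fields + 1))  # counter runs from 0 to num_fields
    return num_fields * field_size + counter_size

print(partition_meta_bits(512, 32))  # 23
```

Under this reading, adding the one flip bit per group from the data-inversion slides would give 23 + 32 = 55 meta bits for a 32-group partition.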

Page 15: SAFER
1. Fault Separation
2. Single Error Correction

Page 16: Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
[Figure: Write 1 0 1 0 → cells hold 1 0 1 0 → Verify reads 1 0 1 0]

Page 17: Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
[Figure: Write 0 1 0 1 to a group whose third cell is stuck at 0 → cells hold 0 1 0 1 → Verify reads 0 1 0 1]

Page 18: Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
[Figure: Write 1 0 1 0 to the same group → cells hold 1 0 0 0 → Verify detects the mismatch. Need to recover!!]

Page 19: Low-cost Single Error Correction
• Data Inversion as an SEC
– One additional bit per group (Flip Mark)
[Figure: data 1 0 1 0 → Inversion & Mark → 0 1 0 1 "F" → 2nd Write → cells hold 0 1 0 1 "F" → 2nd Verify → Recovered from Stuck-At Fault!! On a read, "F" triggers Inversion and returns 1 0 1 0]
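The write, verify, invert, and mark steps can be simulated for a single group. This is a sketch under the assumption that the flip mark itself is fault-free; the `Group` class and its methods are hypothetical names.

```python
class Group:
    """One group of cells with optional stuck-at faults (simulation only)."""

    def __init__(self, size, stuck=None):
        self.stuck = dict(stuck or {})  # position -> stuck value
        self.cells = [self.stuck.get(i, 0) for i in range(size)]
        self.flip = 0                   # flip mark, assumed fault-free

    def _program(self, bits):
        # A stuck cell keeps its stuck value regardless of what is written.
        self.cells = [self.stuck.get(i, b) for i, b in enumerate(bits)]

    def write(self, data):
        """Write and verify; retry with inverted data. Returns the number of writes."""
        data = list(data)
        self._program(data)
        self.flip = 0
        if self.cells == data:          # 1st verify
            return 1
        inverted = [1 - b for b in data]
        self._program(inverted)
        self.flip = 1
        if self.cells == inverted:      # 2nd verify
            return 2
        raise RuntimeError("conflicting stuck-at faults in one group")

    def read(self):
        """Undo the inversion when the flip mark is set."""
        return [c ^ self.flip for c in self.cells]

g = Group(4, stuck={2: 0})              # third cell stuck at 0, as in the slides
print(g.write([0, 1, 0, 1]), g.read())  # 1 [0, 1, 0, 1]
print(g.write([1, 0, 1, 0]), g.read())  # 2 [1, 0, 1, 0]
```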

Page 20: Design Issues

Page 21: SAFER Sequence for a Write
[Flowchart: Start → Read → Write (1st) → Verify → Error? If N: Success. If Y: Inversion Write (2nd) → Verify → Error? If N: Success. If Y: Fixed Partition Counter < MAX? If Y: Re-partition. If N: Failure]
• Drawbacks
– Accelerating wear-out
– Performance degradation
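A rough sketch of the flow that counts physical writes, to show where the extra wear and latency come from. It assumes that fault positions are known to the controller, that only groups failing verification are inverted, that re-partitioning leads to a retry, and that the highest differing pointer bit is chosen when re-partitioning; all names are hypothetical.

```python
from itertools import combinations

def group_of(pointer, fields):
    return tuple((pointer >> b) & 1 for b in fields)

def try_write(data, stuck, fields):
    """Write, verify, and if needed inversion-write. Returns (ok, number_of_writes)."""
    cells = [stuck.get(i, b) for i, b in enumerate(data)]            # Write (1st)
    bad = {group_of(i, fields) for i, b in enumerate(data) if cells[i] != b}
    if not bad:                                                      # Verify: no error
        return True, 1
    # Inversion Write (2nd): invert the groups that failed verification
    target = [b ^ (group_of(i, fields) in bad) for i, b in enumerate(data)]
    cells = [stuck.get(i, b) for i, b in enumerate(target)]
    return cells == target, 2

def safer_write(data, stuck, fields, fixed):
    """Returns (outcome, total_writes, fields)."""
    fields, writes = list(fields), 0
    while True:
        ok, n = try_write(data, stuck, fields)
        writes += n
        if ok:
            return "success", writes, fields
        if fixed == len(fields):                 # Fixed Partition Counter reached MAX
            return "failure", writes, fields
        # Re-partition: split one pair of faults that share a group
        p, q = next((p, q) for p, q in combinations(sorted(stuck), 2)
                    if group_of(p, fields) == group_of(q, fields))
        fields[fixed] = (p ^ q).bit_length() - 1
        fixed += 1

print(safer_write([1] * 16, {8: 0, 2: 1}, [2, 0], 0))  # ('success', 4, [3, 0])
```

In this example a single logical write costs four physical writes, which illustrates the wear-out and performance drawbacks listed on the slide.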

Page 22: Fail Information Cache
• Objective: avoid the 2nd writes
• Solution: early inversion decision
• Fail Info. Cache with 1K entries
– Keep track of recent data blocks with stuck-at faults
– Store fail positions and their stuck-at values
[Figure: cache organized as Bank #0 through Bank #15; each entry holds Valid, Tag, and Stuck Value (example tags tag_a through tag_e); the lookup address, formed from the Block Address and the Fail Pointer, is split into Tag, Index, and Bank Addr]
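A minimal sketch of the early inversion decision, assuming a direct-mapped cache keyed by the block address concatenated with a 9-bit fail pointer. The banked organization in the figure is not modeled, and the class and function names are hypothetical.

```python
class FailCache:
    """Direct-mapped cache of known stuck-at cells (sketch; layout is assumed)."""

    def __init__(self, entries=1024, ptr_bits=9):
        self.entries = entries
        self.ptr_bits = ptr_bits
        self.lines = [None] * entries  # each valid line holds (tag, stuck_value)

    def _split(self, block_addr, fail_ptr):
        key = (block_addr << self.ptr_bits) | fail_ptr
        return key // self.entries, key % self.entries  # (tag, index)

    def insert(self, block_addr, fail_ptr, stuck_value):
        tag, idx = self._split(block_addr, fail_ptr)
        self.lines[idx] = (tag, stuck_value)

    def lookup(self, block_addr, fail_ptr):
        tag, idx = self._split(block_addr, fail_ptr)
        line = self.lines[idx]
        return line[1] if line is not None and line[0] == tag else None

def needs_inversion(cache, block_addr, group_ptrs, data):
    """Invert before the first write if a cached stuck cell conflicts with the data."""
    for ptr in group_ptrs:
        stuck = cache.lookup(block_addr, ptr)
        if stuck is not None and stuck != data[ptr]:
            return True
    return False

cache = FailCache()
cache.insert(block_addr=5, fail_ptr=2, stuck_value=0)
print(needs_inversion(cache, 5, range(4), [1, 0, 1, 0]))  # True
print(needs_inversion(cache, 5, range(4), [0, 1, 0, 1]))  # False
```

On a hit that predicts a conflict, the controller can write the inverted data directly and skip the failing first write.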

Page 23: Evaluation

Page 24: Evaluation
• Monte Carlo simulations
– Data block size = 512 bits
– Perfect wear-leveling scheme (256-byte block)
– Cell write endurance: μ = 100M writes, σ = 10M writes
– IdealECC, ECP, SAFER, SAFER_FC
• Hardware overhead
[Figure: two bar charts of meta-bit size (y-axis, 0 to 64), one for IdealECC4, ECP4, SAFER8, SAFER8_FC and one for IdealECC6, ECP6, SAFER32, SAFER32_FC]
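To give a feel for this kind of Monte Carlo experiment, the sketch below samples cell endurances and reports when a block would fail if a scheme tolerated a given number of dead cells. It assumes normally distributed endurance (mean 100M, standard deviation 10M writes), a bit toggle rate of 0.5, and an idealized scheme that always corrects exactly that many faults; it does not model SAFER's partitioning or wear-leveling, and the function names are hypothetical.

```python
import random

def block_lifetime(tolerated_faults, rng, n_cells=512,
                   mean=100e6, sigma=10e6, toggle_rate=0.5):
    """Block writes until the (tolerated_faults + 1)-th cell of one block wears out."""
    endurance = sorted(rng.gauss(mean, sigma) for _ in range(n_cells))
    # Each block write is assumed to toggle a given cell with probability toggle_rate.
    return endurance[tolerated_faults] / toggle_rate

def average_lifetime(tolerated_faults, trials=200, seed=0):
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    total = sum(block_lifetime(tolerated_faults, rng) for _ in range(trials))
    return total / trials

for faults in (0, 1, 6):
    print(faults, f"{average_lifetime(faults) / 1e6:.1f}M block writes")
```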

Page 25: Relative Lifetime Improvement
• Cell write endurance: μ = 100M writes, σ = 10M writes
[Figure: bar chart of Relative Lifetime Improvement (y-axis, 0.0 to 1.4) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC; annotation: 14.8%]

Page 26: Conclusion
• Need to recover from multiple stuck-at faults
• SAFER
– Efficient recovery scheme
– Handles the growing stuck-at faults
  • Dynamic partition
  • Data inversion
– SAFER32_FC
  • 11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
  • 14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)

Page 27: Thank You All!! Questions?

Georgia Tech ECE MARS Lab
http://arch.ece.gatech.edu

Page 28: SRAM Fail Info. Cache Overhead
• Cell size in 2024
– SRAM = 140 F² @ 10nm, PCM = 6 F² @ 8nm
– 36.6X difference
• Compared with an 8 Gbit PCM chip

Number of Entries   Tag Size (bits)   Entry Size (bits)   Cache Size (bits)   Area Overhead
1K                  23                25                  25.6K               0.01%
2K                  22                24                  49.2K               0.02%
4K                  21                23                  94.2K               0.04%
8K                  20                22                  0.18M               0.08%
16K                 19                21                  0.33M               0.15%
32K                 18                20                  0.63M               0.28%
64K                 17                19                  1.19M               0.53%
128K                16                18                  2.25M               1.00%
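The table rows can be re-derived under a few assumptions that happen to match the listed values: 512-bit blocks, a lookup key made of the block address plus a 9-bit fail pointer, two extra bits per entry (valid and stuck value), a binary 8 Gbit chip, and an area ratio computed from the quoted cell sizes (about 36.5, close to the 36.6X on the slide). The function name is hypothetical.

```python
from math import log2

BLOCK_BITS = 512
CHIP_BITS = 8 * 2**30                      # assuming 8 Gbit means 8 * 2^30 bits
AREA_RATIO = (140 * 10**2) / (6 * 8**2)    # SRAM cell area / PCM cell area

def cache_row(entries: int):
    """Return (tag_bits, entry_bits, cache_bits, area_overhead_percent)."""
    block_addr_bits = int(log2(CHIP_BITS // BLOCK_BITS))  # 24
    fail_ptr_bits = int(log2(BLOCK_BITS))                 # 9
    tag_bits = block_addr_bits + fail_ptr_bits - int(log2(entries))
    entry_bits = tag_bits + 2                             # valid + stuck value
    cache_bits = entries * entry_bits
    overhead = cache_bits * AREA_RATIO / CHIP_BITS * 100
    return tag_bits, entry_bits, cache_bits, overhead

for k in (1, 2, 4, 8, 16, 32, 64, 128):
    tag, entry, size, pct = cache_row(k * 1024)
    print(f"{k}K", tag, entry, size, f"{pct:.2f}%")
```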

Page 29: Relative Lifetime Improvement
• Need a method measuring relative lifetime
– Independent of the cell write endurance parameters and T
• Definition
– Recovery scheme contribution for lifetime: L − F, weighted by T
[Figure: plot over Lifetime (Million Writes), x-axis 100 to 260, with F and L marked and the span between them labeled Lifetime Contribution. Cell Write Endurance Distribution: μ = 100M writes, σ = 10M writes. Bit Toggle Rate (T) = 0.5]

Page 30: Lifetime Contribution per Meta-bit
[Figure: bar chart of Lifetime Contribution per Meta-bit (y-axis, 0.000 to 0.025) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC]

Page 31: Average Number of Recovered Fails
[Figure: bar chart of Average Recovered Fails per 256B (y-axis, 0 to 30) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC]

Page 32: SAFER with Fail Cache
[Figure: Miss rate (y-axis, 0% to 80%) versus the number of maximum fails per 512 bits (x-axis, 1 to 15), one series per cache size: 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K]
[Figure: Relative Lifetime Improvement (y-axis, 0.0 to 1.4) versus the number of cache entries (x-axis: None, 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K) for SAFER2, SAFER4, SAFER8, SAFER16, SAFER32, and IdealECC8]

Page 33: Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
[Figure: Write 0 1 0 1 → cells hold 0 1 0 1 → Verify reads 0 1 0 1; Write 1 0 1 0 → cells hold 1 0 1 0 → Verify reads 1 0 1 0]

Page 34: Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
[Figure: group whose third cell is stuck at 0. Write 0 1 0 1 → Verify reads 0 1 0 1; Write 1 0 1 0 → cells hold 1 0 0 0 → Verify detects the mismatch. Need to recover!!]

Page 35: Low-cost Single Error Correction
• Data Inversion as an SEC
– One additional bit per group
[Figure: Write 0 1 0 1 → Verify reads 0 1 0 1. Data 1 0 1 0 → Inversion & Mark → 0 1 0 1 "F" → 2nd Write → cells hold 0 1 0 1 "F" → 2nd Verify → Recovered from Stuck-At Fault!! On a read, "F" triggers Inversion and returns 1 0 1 0]

Page 36: Evaluation
• Monte Carlo simulations
– Data block size = 512 bits
– Perfect wear-leveling scheme (256-byte block)
– Cell write endurance: μ = 100M writes, σ = 10M writes
– IdealECC, ECP, SAFER, SAFER_FC
[Figure: bar chart of Hardware Overhead (y-axis, 0% to 14%) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC; annotation: 11.9%]