SAFER: Stuck-At-Fault Error Recovery for Memories

Nak Hee Seong†
Dong Hyuk Woo†
Vijayalakshmi Srinivasan‡
Jude A. Rivers‡
Hsien-Hsin S. Lee†

Emerging Memory Technologies

• Resistive memories
  – Due to DRAM scaling challenge

• Phase Change Memory (PCM)
  – Scalability, high density
  – Limited write endurance (avg. 10^8 writes)
    • Incurring stuck-at faults

Cell Write Endurance

• Endurance variation
  – No spatial correlation
  – Increases with technology scaling

• Issues
  – Unpredictable cell endurance
    • Read verification required for each write
  – The weakest cell dictates memory lifetime!
  – # of stuck-at faults gradually grows!

• Multi-bit error recovery scheme is needed!

Existing Error Correcting Methods

• (72,64) Hamming code
  – For transient faults
  – Single Error Correction Double Error Detection (SECDED)
  – 12.5% overhead

• Error-Correcting Pointers (ECP) [Schechter, ISCA37]
  – Dynamically replace failed cells with extra cells
  – Storing multiple fail pointers for each data block
  – Recover from 6 fails with 61-bit overhead (11.9%)
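The two overhead figures above can be checked with a little arithmetic. The sketch below assumes the 512-bit data block used in the evaluation slides, and it assumes that ECP spends one pointer plus one replacement cell per correctable fail, plus a single "full" flag. That breakdown is not spelled out on the slide; it is used here only because it reproduces the quoted 61 bits. The function names are illustrative.

```python
from math import ceil, log2

def hamming_secded_overhead(data_bits=64, code_bits=72):
    """Extra bits of a (code_bits, data_bits) SECDED code, relative to the data."""
    return (code_bits - data_bits) / data_bits

def ecp_meta_bits(block_bits=512, fails=6):
    """Assumed ECP layout: per fail, a pointer and a replacement cell; one full flag."""
    pointer_bits = ceil(log2(block_bits))
    return fails * (pointer_bits + 1) + 1

print(f"{hamming_secded_overhead():.1%}")                # 12.5%
print(ecp_meta_bits(), f"{ecp_meta_bits() / 512:.1%}")   # 61 11.9%
```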

SAFER: Stuck-At-Fault Error Recovery

Concept of SAFER

• Exploit two properties of Stuck-At Faults
  – Permanency
  – Readability

• Multiple error correction
  – Fault separation
  – Low-cost Single Error Correction (SEC)

[Figure: Fault Separation]

SAFER:
1. Fault Separation
2. Single Error Correction

Fault Separation

• Assuming 2 faults in an 8-bit block
  – C(8,2) = 28 possible fault pairs

• How to separate these 2 faults (of all 28 pairs)?

[Figure: Pattern #0, Pattern #1 and Pattern #2, each drawn over bit positions 7 6 5 4 3 2 1 0]

Decision for Fault Separation

• Use bit pointers for fault separation

Data Block bit      7 6 5 4 3 2 1 0
Bit Pointer bit 2   1 1 1 1 0 0 0 0
Bit Pointer bit 1   1 1 0 0 1 1 0 0
Bit Pointer bit 0   1 0 1 0 1 0 1 0

[Figure: the three bit-pointer rows are shown next to Pattern #0, Pattern #1 and Pattern #2]

Decision for Fault Separation

• Find pattern candidates by XORing bit pointers

Data Block bit      7 6 5 4 3 2 1 0
Bit Pointer bit 2   1 1 1 1 0 0 0 0
Bit Pointer bit 1   1 1 0 0 1 1 0 0
Bit Pointer bit 0   1 0 1 0 1 0 1 0

Difference Vector: 1 0 0

Decision for Fault Separation

• Find pattern candidates by XORing bit pointers

Data Block bit      7 6 5 4 3 2 1 0
Bit Pointer bit 2   1 1 1 1 0 0 0 0
Bit Pointer bit 1   1 1 0 0 1 1 0 0
Bit Pointer bit 0   1 0 1 0 1 0 1 0

Difference Vector: 0 1 1
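The XOR step above can be sketched in a few lines. The sketch treats a bit pointer as the binary index of a cell in the 8-bit block, so every set bit of the difference vector is a pointer bit on which the two faulty cells disagree, and splitting the block on that bit puts them in different groups. The fault positions (7, 3) and (6, 5) are hypothetical; they were picked only because they reproduce the difference vectors 100 and 011.

```python
from itertools import combinations

POINTER_BITS = 3  # an 8-bit block needs 3-bit pointers

def difference_vector(pos_a, pos_b):
    """XOR of the two bit pointers."""
    return pos_a ^ pos_b

def candidate_bits(pos_a, pos_b):
    """Pointer bits that separate the two faults (set bits of the difference vector)."""
    diff = difference_vector(pos_a, pos_b)
    return [bit for bit in range(POINTER_BITS) if (diff >> bit) & 1]

print(format(difference_vector(7, 3), "03b"), candidate_bits(7, 3))  # 100 [2]
print(format(difference_vector(6, 5), "03b"), candidate_bits(6, 5))  # 011 [0, 1]

# Every one of the 28 fault pairs has at least one separating bit.
pairs = list(combinations(range(8), 2))
print(len(pairs), all(candidate_bits(a, b) for a, b in pairs))        # 28 True
```

Two distinct positions always differ in at least one pointer bit, which is why no pair is left without a candidate.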

Extension to Multi-Group Partition

• Use two bits for 4 group partition
  – Pairings shown: (bit 3, bit 2), (bit 3, bit 1), (bit 3, bit 0)

Data Block bit      15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Bit Pointer bit 3    1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0
Bit Pointer bit 2    1  1  1  1  0  0  0  0  1  1  1  1  0  0  0  0
Bit Pointer bit 1    1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0
Bit Pointer bit 0    1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0

[Figure: meta data stored beside the Data Block: 1st Partition Field, 2nd Partition Field and Fixed Partition Counter (shown with bit 3, bit 0 and a counter value of 1)]

Dynamic Partition

• 4 group partition for a 16-bit data block

Data Block bit: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

  – 1000 ⊕ 0010 = 1010
  – 0010 ⊕ 0000 = 0010

[Figure: 16-bit Data Block with its partition fields (bit 2, then bit 3 and bit 1)]
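A minimal sketch of how partition fields could be chosen as faults accumulate, using the three fault positions that appear as XOR operands above (1000, 0010, 0000). Several things here are assumptions rather than the slide's method: the order in which the faults arrive, starting with no partition field at all, and always taking the highest set bit of the difference vector. `group_of` and `add_fault` are illustrative names.

```python
def group_of(pos, fields):
    """Group index formed by the pointer bits named in `fields`."""
    group = 0
    for bit in fields:
        group = (group << 1) | ((pos >> bit) & 1)
    return group

def add_fault(faults, fields, new_pos, max_fields):
    """Record a fault; add a partition field if it shares a group with an old one.

    Returns False when a collision occurs and no partition field is left.
    """
    for old in faults:
        if group_of(old, fields) == group_of(new_pos, fields):
            if len(fields) == max_fields:
                return False
            diff = old ^ new_pos                   # difference vector
            fields.append(diff.bit_length() - 1)   # assumption: highest differing bit
            break
    faults.append(new_pos)
    return True

faults, fields = [], []
for pos in (0b1000, 0b0010, 0b0000):
    assert add_fault(faults, fields, pos, max_fields=2)  # 2 fields = 4 groups

print(fields)                                # [3, 1]
print([group_of(p, fields) for p in faults]) # [2, 1, 0]
```

Because two faults in the same group agree on every bit already used as a field, the bit taken from their difference vector is always a new one, and the faults that were already separated stay separated.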

Dynamic Partition

• Objective
  – Separate multiple stuck-at faults into different groups

• Additional meta data
  – Assuming an n bit block and a k group partition
  – $\log_2 k \cdot \lceil \log_2 \log_2 n \rceil + \lceil \log_2(\log_2 k + 1) \rceil$
    • $\log_2 k$: # of partition fields
    • $\lceil \log_2 \log_2 n \rceil$: size of each partition field
    • $\lceil \log_2(\log_2 k + 1) \rceil$: size of fixed partition counter

• Example: n = 512, k = 32
  – Required meta data: 5 × 4 + 3 = 23 bits/block
  – 6 ≤ the number of separable stuck-at faults ≤ 32
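The meta data count can be reproduced with integer arithmetic. The sketch below assumes that each logarithm is rounded up to a whole number of bits, which is what makes the n = 512, k = 32 example come out at 23. The helper names are illustrative.

```python
def ceil_log2(x):
    """Smallest b such that 2**b >= x, for x >= 1."""
    return (x - 1).bit_length()

def partition_meta_bits(n, k):
    """Meta data bits for an n-bit block with a k-group partition."""
    fields = ceil_log2(k)                  # number of partition fields
    field_size = ceil_log2(ceil_log2(n))   # each field names one bit-pointer bit
    counter = ceil_log2(fields + 1)        # fixed partition counter: 0..fields
    return fields * field_size + counter

print(partition_meta_bits(512, 32))  # 23
```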

SAFER:
1. Fault Separation
2. Single Error Correction

Low-cost Single Error Correction

• Stuck-At Fault Property: Readability

[Figure: Write followed by Verify on a 4-bit group]
  – No fault: writing 1 0 1 0 verifies as 1 0 1 0
  – One cell stuck at 0: writing 0 1 0 1 still verifies as 0 1 0 1
  – Same stuck cell: writing 1 0 1 0 verifies as 1 0 0 0. Need to recover!!

Low-cost Single Error Correction

• Data Inversion as an SEC
  – One additional bit per group (Flip Mark "F")

[Figure: the data 1 0 1 0 is inverted to 0 1 0 1 and marked "F"; the 2nd Write and 2nd Verify succeed, and on a read the stored 0 1 0 1 is inverted back to 1 0 1 0. Recovered from Stuck-At Fault!!]
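Data inversion with a flip mark can be sketched on the 4-bit group from the figure. The sketch models a stuck cell as one that returns its stuck value no matter what is written, and uses list index 2 for the cell that reads back 0 in the example. The dictionary model and the function names are illustrative, not part of the slides.

```python
def store(bits, stuck):
    """Faulty cells return their stuck value; healthy cells return what was written."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def write_group(data, stuck):
    """Return (stored_bits, flip_mark), or None if inversion cannot help."""
    if store(data, stuck) == data:           # 1st write verifies
        return data, 0
    inverted = [1 - b for b in data]
    if store(inverted, stuck) == inverted:   # 2nd write verifies
        return inverted, 1
    return None

def read_group(stored, flip_mark):
    """Undo the inversion when the flip mark is set."""
    return [b ^ flip_mark for b in stored]

stored, flip = write_group([1, 0, 1, 0], {2: 0})
print(stored, flip, read_group(stored, flip))  # [0, 1, 0, 1] 1 [1, 0, 1, 0]

# Two faults in one group can demand opposite decisions, so inversion alone fails.
print(write_group([1, 0, 1, 0], {0: 1, 2: 0}))  # None
```

The second call is the case fault separation is meant to prevent: with one fault per group, a single flip bit is always enough.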


Design Issues

SAFER Sequence for a Write

[Flowchart]
Start → Read → Write (1st) → Verify → Error?
  – N: Success
  – Y: Inversion → Write (2nd) → Verify → Error?
    • N: Success
    • Y: Fixed Partition Counter < MAX?
      – Y: Re-partition
      – N: Failure

• Drawbacks:
  – accelerating wear-out
  – performance degradation
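The write sequence can be sketched as a small simulation. Two simplifications are assumptions and not taken from the slides: the re-partition step is replaced by trying each choice of partition fields in turn, and the fixed partition counter is not modelled. A stuck cell is again a cell that returns its stuck value regardless of the write.

```python
from itertools import combinations

def group_of(pos, fields):
    """Group index formed by the pointer bits named in `fields`."""
    group = 0
    for bit in fields:
        group = (group << 1) | ((pos >> bit) & 1)
    return group

def store(bits, stuck):
    """Faulty cells return their stuck value; healthy cells return what was written."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def write_once(data, stuck, fields):
    """Write (1st), Verify, then Inversion, Write (2nd), Verify for one partition."""
    flips = [0] * (1 << len(fields))
    readback = store(data, stuck)
    if readback == data:
        return readback, flips                    # Success after the 1st write
    for pos, (want, got) in enumerate(zip(data, readback)):
        if want != got:
            flips[group_of(pos, fields)] = 1      # invert every failing group
    encoded = [b ^ flips[group_of(i, fields)] for i, b in enumerate(data)]
    readback = store(encoded, stuck)
    return (readback, flips) if readback == encoded else None

def safer_write(data, stuck, pointer_bits, n_fields):
    """Assumption: re-partition by trying each field choice until one verifies."""
    for fields in combinations(range(pointer_bits), n_fields):
        result = write_once(data, stuck, fields)
        if result is not None:
            return fields, result[0], result[1]
    return None                                   # Failure

def safer_read(stored, flips, fields):
    return [b ^ flips[group_of(i, fields)] for i, b in enumerate(stored)]

data = [0] * 16
data[8] = 1
stuck = {8: 0, 0: 0}   # hypothetical faults: cells 8 and 0 both stuck at 0
fields, stored, flips = safer_write(data, stuck, pointer_bits=4, n_fields=2)
print(fields, safer_read(stored, flips, fields) == data)  # (0, 3) True
```

Cells 8 and 0 share a group under the first two field choices, so both attempts fail their second verify; the third choice separates them and the inverted write succeeds. Each failed attempt costs extra writes, which is the wear-out and performance drawback noted above.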

Fail Information Cache

• Objective: avoid the 2nd writes
• Solution: early inversion decision
• Fail Info. Cache with 1K entries
  – Keep track of recent data blocks with stuck-at faults
  – Store fail positions and their stuck-at values

[Figure: banked cache (Bank #0, Bank #1, ..., Bank #15) selected by a Cache Index; each entry holds Valid, Tag and Stuck Value; the address fields are labelled Block Address, Fail Pointer, Tag, Index and Bank Addr]
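The early inversion decision can be sketched as a lookup before the first write. This is a simplified model under stated assumptions: the cache is a plain map keyed by (block address, fail position) with least-recently-recorded eviction, not the banked, tagged structure in the figure, and `FailInfoCache`, `early_flips` and `group_of_pos` are hypothetical names.

```python
from collections import OrderedDict

class FailInfoCache:
    """Remembers (block address, fail position) -> stuck-at value."""

    def __init__(self, entries=1024):
        self.capacity = entries
        self.table = OrderedDict()

    def record(self, block_addr, fail_pos, stuck_value):
        key = (block_addr, fail_pos)
        self.table[key] = stuck_value
        self.table.move_to_end(key)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # assumption: evict the oldest entry

    def lookup(self, block_addr, fail_pos):
        return self.table.get((block_addr, fail_pos))

def early_flips(cache, block_addr, data, group_of_pos, n_groups):
    """Choose per-group inversion before writing, from the known stuck values."""
    flips = [0] * n_groups
    for pos, bit in enumerate(data):
        stuck = cache.lookup(block_addr, pos)
        if stuck is not None and stuck != bit:
            flips[group_of_pos(pos)] = 1
    return flips

cache = FailInfoCache(entries=1024)
cache.record(block_addr=0x40, fail_pos=8, stuck_value=0)

data = [0] * 16
data[8] = 1
print(early_flips(cache, 0x40, data, lambda pos: pos >> 3, n_groups=2))  # [0, 1]
```

On a hit, the group holding the known stuck cell is inverted up front, so the first write already verifies and the second write is avoided. On a miss, the normal write sequence still applies.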


Evaluation

Evaluation

• Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance: μ = 100M writes, σ = 10M writes
  – IdealECC, ECP, SAFER, SAFER_FC

• Hardware overhead

[Figure: Meta-bit size (0 to 64) for IdealECC6, ECP6, SAFER32 and SAFER32_FC, and for IdealECC4, ECP4, SAFER8 and SAFER8_FC]

Relative Lifetime Improvement

• Cell write endurance:
  – μ = 100M writes, σ = 10M writes

[Figure: Relative Lifetime Improvement (0.0 to 1.4) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32 and SAFER2_FC to SAFER32_FC; 14.8% is annotated on the chart]
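A minimal Monte Carlo sketch of how block lifetime grows with the number of recoverable fails. It is not the authors' simulator. It assumes that the 100M and 10M figures are the mean and standard deviation of a normal distribution, that the bit toggle rate is 0.5 (so a cell with endurance E survives E / 0.5 block writes), and that a scheme always recovers a fixed number of dead cells, which is closer to the IdealECC baseline than to SAFER.

```python
import random

def block_lifetime(n_cells, tolerated_fails, rng,
                   mean=100e6, sigma=10e6, toggle_rate=0.5):
    """Block writes survived when `tolerated_fails` dead cells can be recovered."""
    endurance = sorted(rng.gauss(mean, sigma) for _ in range(n_cells))
    # The block dies when one more cell wears out than the scheme can tolerate.
    return endurance[tolerated_fails] / toggle_rate

def mean_lifetime(tolerated_fails, trials=200, n_cells=512, seed=1):
    """Average lifetime over `trials` sampled blocks (same seed, same samples)."""
    rng = random.Random(seed)
    total = sum(block_lifetime(n_cells, tolerated_fails, rng) for _ in range(trials))
    return total / trials

for fails in (0, 1, 6):
    print(fails, round(mean_lifetime(fails) / 1e6, 1), "M writes")
```

With no recovery the weakest cell dictates the lifetime; each additional recoverable fail moves the end of life to the next-weakest cell.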

Conclusion

• Need to recover from multiple stuck-at faults

• SAFER
  – Efficient recovery scheme
  – Handles the growing stuck-at faults
    • Dynamic partition
    • Data inversion
  – SAFER32_FC
    • 11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
    • 14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)

Thank You All!!
Questions?

Georgia Tech ECE MARS Lab
http://arch.ece.gatech.edu

SRAM Fail Info. Cache Overhead

• Cell size in 2024
  – SRAM = 140 F² @ 10nm, PCM = 6 F² @ 8nm
  – 36.6X difference

• Compared with an 8 Gbit PCM chip

Number of Entries   Tag Size (bits)   Entry Size (bits)   Cache Size (bits)   Area Overhead
1K                  23                25                  25.6K               0.01%
2K                  22                24                  49.2K               0.02%
4K                  21                23                  94.2K               0.04%
8K                  20                22                  0.18M               0.08%
16K                 19                21                  0.33M               0.15%
32K                 18                20                  0.63M               0.28%
64K                 17                19                  1.19M               0.53%
128K                16                18                  2.25M               1.00%
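The table rows can be recomputed from the entry count and tag size. The sketch assumes that each entry is the tag plus a valid bit and a stuck-value bit (which matches the two-bit gap between the Tag Size and Entry Size columns), that the area overhead is cache bits times the 36.6X cell-size ratio divided by the PCM capacity, and that 8 Gbit means 8 × 2^30 bits.

```python
CELL_AREA_RATIO = 36.6       # SRAM cell area relative to PCM, from the slide
PCM_CHIP_BITS = 8 * 2**30    # assumption: 8 Gbit taken as 8 * 2^30 bits

def cache_overhead(entries, tag_bits):
    """Return (entry bits, cache bits, area overhead in percent)."""
    entry_bits = tag_bits + 2   # assumption: tag + valid bit + stuck-value bit
    cache_bits = entries * entry_bits
    area_pct = 100 * cache_bits * CELL_AREA_RATIO / PCM_CHIP_BITS
    return entry_bits, cache_bits, area_pct

# 1K entries with a 23-bit tag up to 128K entries with a 16-bit tag.
for log_entries, tag_bits in zip(range(10, 18), range(23, 15, -1)):
    entries = 2**log_entries
    entry_bits, cache_bits, area_pct = cache_overhead(entries, tag_bits)
    print(f"{entries // 1024}K  {tag_bits}  {entry_bits}  {cache_bits}  {area_pct:.2f}%")
```

The results agree with the table to within rounding; the last digit can differ because 36.6 is itself a rounded ratio.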

Relative Lifetime Improvement

• Need a method measuring relative lifetime
  – independent from … and T

• Definition

[Figure: Cell Write Endurance Distribution (100M writes, 10M writes) plotted against Lifetime (Million Writes), 100 to 260, with Bit Toggle Rate (T) = 0.5; F and L are marked on the lifetime axis, and the recovery scheme's Lifetime Contribution is defined from L, F and T]

Lifetime Contribution per Meta-bit

[Figure: Lifetime Contribution per Meta-bit (0.000 to 0.025) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32 and SAFER2_FC to SAFER32_FC]

Average Number of Recovered Fails

[Figure: Average Recovered Fails per 256B (0 to 30) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32 and SAFER2_FC to SAFER32_FC]

SAFER with Fail Cache

[Figure: Miss rate (0% to 80%) against the number of maximum fails per 512 bits (1 to 15), one series per cache size: 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K]

[Figure: Relative Lifetime Improvement (0.0 to 1.4) against the number of cache entries (None, 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K) for SAFER2, SAFER4, SAFER8, SAFER16 and SAFER32, with IdealECC8 for reference]


Evaluation

• Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance: μ = 100M writes, σ = 10M writes
  – IdealECC, ECP, SAFER, SAFER_FC

[Figure: Hardware Overhead (0% to 14%) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32 and SAFER2_FC to SAFER32_FC; 11.9% is annotated on the chart]
