SAFER: Stuck-At-Fault Error Recovery for Memories
Nak Hee Seong†
Dong Hyuk Woo†
Vijayalakshmi Srinivasan‡
Jude A. Rivers‡
Hsien-Hsin S. Lee†
Emerging Memory Technologies
• Resistive memories
  – Due to DRAM scaling challenge
• Phase Change Memory (PCM)
  – Scalability, high density
  – Limited write endurance (avg. 10^8 writes)
    • Incurring stuck-at faults
Cell Write Endurance
• Endurance variation
  – No spatial correlation
  – Increases with technology scaling
• Issues
  – Unpredictable cell endurance
    • Read verification required for each write
  – The weakest cell dictates memory lifetime!
  – # of stuck-at faults gradually grows!
    • Multi-bit error recovery scheme is needed!
Existing Error Correcting Methods
• (72,64) Hamming code
  – For transient faults
  – Single Error Correction Double Error Detection (SECDED)
  – 12.5% overhead
• Error-Correcting Pointers (ECP) [Schechter, ISCA37]
  – Dynamically replace failed cells with extra cells
  – Storing multiple fail pointers for each data block
  – Recover from 6 fails with 61-bit overhead (11.9%)
SAFER: Stuck-At-Fault Error Recovery
Concept of SAFER
• Exploit two properties of Stuck-At Faults
  – Permanency
  – Readability
• Multiple error correction
  – Fault separation
  – Low-cost Single Error Correction (SEC)
[Figure: a "Fault Separation" stage and two "SEC" units]
SAFER:
1. Fault Separation
2. Single Error Correction
Fault Separation
• Assuming 2 faults in an 8-bit block
  – C(8,2) = 28 possible fault pairs
• How to separate these 2 faults (of all 28 pairs)?
[Figure: Patterns #0, #1, and #2, each partitioning bit positions 7 to 0 of the block into two groups]
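The question on this slide can be checked exhaustively. The sketch below assumes that each of the three patterns splits the 8-bit block into two groups according to one bit of a cell's 3-bit position, which is how the following slides describe the bit pointer. The names `patterns` and `separated_by` are illustrative only.

```python
from itertools import combinations

N_BITS = 8        # block size used on the slide
POINTER_BITS = 3  # log2(8) bits identify one cell

# patterns[bit][pos] is the group (0 or 1) that cell `pos` falls into
# when the block is partitioned by pointer bit `bit`.
patterns = {
    bit: [(pos >> bit) & 1 for pos in range(N_BITS)]
    for bit in range(POINTER_BITS)
}

def separated_by(a, b):
    """Pointer bits whose pattern puts cells a and b into different groups."""
    return [bit for bit, groups in patterns.items() if groups[a] != groups[b]]

pairs = list(combinations(range(N_BITS), 2))
unseparated = [pair for pair in pairs if not separated_by(*pair)]
print(len(pairs), "pairs,", len(unseparated), "left unseparated")
# prints "28 pairs, 0 left unseparated"
```

Every pair is split by at least one pattern because two different positions must differ in at least one bit of their binary index.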
Decision for Fault Separation
• Use bit pointers for fault separation

Data Block position   7 6 5 4 3 2 1 0
Bit Pointer bit 2     1 1 1 1 0 0 0 0
Bit Pointer bit 1     1 1 0 0 1 1 0 0
Bit Pointer bit 0     1 0 1 0 1 0 1 0

[Figure: Patterns #0, #1, and #2 drawn against the bit pointer bits]
Decision for Fault Separation
• Find pattern candidates by XORing bit pointers

Data Block position   7 6 5 4 3 2 1 0
Bit Pointer bit 2     1 1 1 1 0 0 0 0
Bit Pointer bit 1     1 1 0 0 1 1 0 0
Bit Pointer bit 0     1 0 1 0 1 0 1 0

Difference Vector (bit 2, bit 1, bit 0) = 1 0 0
[Figure: Patterns #0, #1, and #2 over the data block]
Decision for Fault Separation
• Find pattern candidates by XORing bit pointers

Data Block position   7 6 5 4 3 2 1 0
Bit Pointer bit 2     1 1 1 1 0 0 0 0
Bit Pointer bit 1     1 1 0 0 1 1 0 0
Bit Pointer bit 0     1 0 1 0 1 0 1 0

Difference Vector (bit 2, bit 1, bit 0) = 0 1 1
[Figure: Patterns #0, #1, and #2 over the data block]
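The XOR step can be sketched in a few lines. The fault positions used here (7 and 3, then 6 and 5) are hypothetical choices that reproduce the two difference vectors shown on the slides; the positions actually drawn on the slides are not recoverable from the text. `difference_vector` and `candidate_bits` are illustrative names.

```python
def difference_vector(ptr_a, ptr_b, width=3):
    """XOR of two bit pointers as a bit string, most significant bit first."""
    return format(ptr_a ^ ptr_b, f"0{width}b")

def candidate_bits(ptr_a, ptr_b, width=3):
    """Pointer bits (highest first) that place the two faults in different groups."""
    diff = ptr_a ^ ptr_b
    return [bit for bit in reversed(range(width)) if (diff >> bit) & 1]

print(difference_vector(7, 3), candidate_bits(7, 3))  # 100 [2]
print(difference_vector(6, 5), candidate_bits(6, 5))  # 011 [1, 0]
```

A 1 in the difference vector marks a pointer bit on which the two faults disagree, so partitioning by that bit places them in different groups.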
Extension to Multi-Group Partition
• Use two bits for 4 group partition

Data Block position   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Bit Pointer bit 3      1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0
Bit Pointer bit 2      1  1  1  1  0  0  0  0  1  1  1  1  0  0  0  0
Bit Pointer bit 1      1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0
Bit Pointer bit 0      1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0

Example partitions: (bit 3, bit 2), (bit 3, bit 1), (bit 3, bit 0)

Meta data stored with the Data Block:
  1st Partition Field = bit 3, 2nd Partition Field = bit 0, Fixed Partition Counter = 1
  1st Partition Field = bit 2, 2nd Partition Field = bit 0, Fixed Partition Counter = 0
Dynamic Partition
• 4 group partition for a 16-bit data block

Meta data: 1st Partition Field = bit 2, 2nd Partition Field = bit 0, Fixed Partition Counter = 0
Data Block position: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
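Updated meta data: 1st Partition Field = bit 3 (Fixed Partition Counter = 1), then 2nd Partition Field = bit 1 (Fixed Partition Counter = 2)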
1000 ⊕ 0010 = 1010
0010 ⊕ 0000 = 0010
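A minimal sketch of dynamic partitioning, using the three bit pointers that appear in the XOR lines above (1000, 0010, 0000) as fault positions. It assumes that a new partition field is added only when a new fault lands in a group that already holds a fault, and that the highest differing pointer bit is chosen; which candidate the actual design picks is not stated on the slide. `group_of` and `add_fault` are illustrative names.

```python
def group_of(position, fields):
    """Group index formed by concatenating the selected bit-pointer bits."""
    group = 0
    for bit in fields:
        group = (group << 1) | ((position >> bit) & 1)
    return group

def add_fault(faults, fields, new_fault, max_fields=2):
    """Return fields that keep every fault in its own group, or None if out of fields."""
    fields = list(fields)
    for old in faults:
        if old == new_fault:
            continue  # already known fault, nothing to separate
        if group_of(old, fields) == group_of(new_fault, fields):
            if len(fields) == max_fields:
                return None
            diff = old ^ new_fault                # difference vector
            fields.append(diff.bit_length() - 1)  # assumption: highest differing bit
    return fields

faults, fields = [], []
for position in (0b1000, 0b0010, 0b0000):
    fields = add_fault(faults, fields, position)
    faults.append(position)

print(fields)                                  # [3, 1]
print([group_of(p, fields) for p in faults])   # [2, 1, 0]
```

Adding a field only refines the existing groups, so faults that were already separated stay separated after re-partitioning.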
Dynamic Partition
• Objective
  – Separate multiple stuck-at faults into different groups
• Additional meta data
  – Assuming an n bit block and a k group partition
  – $\log_2 k \times \lceil \log_2 \log_2 n \rceil + \lceil \log_2(\log_2 k + 1) \rceil$ bits
    • $\log_2 k$: # of partition fields
    • $\lceil \log_2 \log_2 n \rceil$: size of each partition field
    • $\lceil \log_2(\log_2 k + 1) \rceil$: size of fixed partition counter
• Example: n = 512, k = 32
  – Required meta data: 23 bits/block
  – 6 ≤ the number of separable stuck-at faults ≤ 32
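The meta-data count for the n = 512, k = 32 example can be reproduced with integer arithmetic. This sketch assumes n and k are powers of two and reads the three terms as: number of partition fields, size of each partition field, and size of the fixed partition counter. `ceil_log2` and `partition_meta_bits` are illustrative names.

```python
def ceil_log2(x):
    """Smallest integer e with 2**e >= x, for x >= 1."""
    return (x - 1).bit_length()

def partition_meta_bits(n, k):
    """Partition meta data (in bits) for an n-bit block split into k groups."""
    fields = ceil_log2(k)                 # number of partition fields: log2 k
    field_size = ceil_log2(ceil_log2(n))  # each field names one of log2 n pointer bits
    counter_size = ceil_log2(fields + 1)  # counter takes the values 0 .. log2 k
    return fields * field_size + counter_size

print(partition_meta_bits(512, 32))  # 23
```

For n = 512 and k = 32 this gives 5 fields of 4 bits plus a 3-bit counter, matching the 23 bits/block on the slide.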
SAFER:
1. Fault Separation
2. Single Error Correction
Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
  – Write 1 0 1 0 to fault-free cells: Verify reads 1 0 1 0 (match)
  – Write 0 1 0 1 to cells whose third cell is stuck at 0: Verify reads 0 1 0 1 (match)
  – Write 1 0 1 0 to the same cells: Verify reads 1 0 0 0 (mismatch)
    • Need to recover!!
Low-cost Single Error Correction
• Data Inversion as an SEC
  – Data to write: 1 0 1 0
  – Inversion & Mark: the 2nd Write stores 0 1 0 1 with the flip mark "F" set
  – 2nd Verify reads 0 1 0 1 "F": Recovered from Stuck-At Fault!!
  – On a read, a set flip mark triggers Inversion of 0 1 0 1 back to 1 0 1 0
• Flip Mark: one additional bit per group
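The inversion trick can be modeled for a single group. The sketch below assumes a 4-bit group whose third cell is stuck at 0, as in the slide's example, and represents stuck cells as a dictionary from cell index to stuck value. `write_cells`, `write_with_inversion`, and `read_group` are illustrative names, not part of any real memory-controller API.

```python
def write_cells(bits, stuck):
    """Model a physical write: a stuck cell keeps its stuck value."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def write_with_inversion(data, stuck):
    """Return (stored cells, flip mark), or None if inversion alone cannot help."""
    cells = write_cells(data, stuck)      # 1st write
    if cells == data:                     # verify
        return cells, 0
    inverted = [b ^ 1 for b in data]
    cells = write_cells(inverted, stuck)  # 2nd write with inverted data
    if cells == inverted:                 # 2nd verify
        return cells, 1
    return None                           # faults in this group conflict

def read_group(cells, flip):
    """Undo the inversion when the flip mark is set."""
    return [b ^ flip for b in cells]

stuck = {2: 0}  # third cell stuck at 0
cells, flip = write_with_inversion([1, 0, 1, 0], stuck)
print(cells, flip, read_group(cells, flip))  # [0, 1, 0, 1] 1 [1, 0, 1, 0]
```

With a single stuck cell per group, either the data or its inverse always agrees with the stuck value, which is why one flip bit per group acts as a single error correction.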
Design Issues
SAFER Sequence for a Write
[Flowchart]
• Start → Read → Write (1st) → Verify
  – Error? N: Success
  – Error? Y: Inversion Write (2nd) → Verify
    • Error? N: Success
    • Error? Y: Fixed Partition Counter < MAX?
      – Y: Re-partition
      – N: Failure
• Drawbacks:
  – accelerating wear-out
  – performance degradation
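The flowchart can be approximated by the following simplified model. It is a sketch under several assumptions that the slide does not confirm: the initial Read step is omitted, the first write always uses cleared flip marks, the block is rewritten from scratch after each re-partition, and the highest differing pointer bit is chosen as the new partition field. All function names are illustrative.

```python
def group_of(position, fields):
    """Group index formed from the selected bit-pointer bits."""
    group = 0
    for bit in fields:
        group = (group << 1) | ((position >> bit) & 1)
    return group

def write_cells(bits, stuck):
    """Model a physical write: a stuck cell keeps its stuck value."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def mismatches(cells, expected):
    return [i for i, (c, e) in enumerate(zip(cells, expected)) if c != e]

def safer_write(data, stuck, fields, max_fields):
    """Return (cells, fields, flipped groups), or None when the write fails."""
    fields = list(fields)
    while True:
        cells = write_cells(data, stuck)           # Write (1st)
        bad = mismatches(cells, data)              # Verify
        if not bad:
            return cells, fields, set()
        flipped = {group_of(i, fields) for i in bad}
        target = [b ^ 1 if group_of(i, fields) in flipped else b
                  for i, b in enumerate(data)]
        cells = write_cells(target, stuck)         # Inversion Write (2nd)
        bad_again = mismatches(cells, target)      # Verify
        if not bad_again:
            return cells, fields, flipped
        if len(fields) >= max_fields:              # Fixed Partition Counter < MAX?
            return None                            # Failure
        b = bad_again[0]
        a = next(i for i in bad if group_of(i, fields) == group_of(b, fields))
        fields.append((a ^ b).bit_length() - 1)    # Re-partition

def safer_read(cells, fields, flipped):
    """Undo the inversion for every group whose flip mark is set."""
    return [c ^ 1 if group_of(i, fields) in flipped else c
            for i, c in enumerate(cells)]

data = [0] * 16
data[8] = 1
stuck = {8: 0, 2: 1, 0: 0}   # hypothetical faults in a 16-bit block
cells, fields, flipped = safer_write(data, stuck, [], max_fields=2)
print(fields, sorted(flipped), safer_read(cells, fields, flipped) == data)
# prints "[1, 3] [1, 2] True"
```

Each pass through the loop can cost two physical writes, which is the wear-out and performance drawback listed on the slide.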
Fail Information Cache
• Objective: avoid the 2nd writes
• Solution: early inversion decision
• Fail Info. Cache with 1K entries
  – Keep track of recent data blocks with stuck-at faults
  – Store fail positions and their stuck-at values
[Figure: cache organized as Bank #0 to Bank #15; each entry holds Valid, Tag, and Stuck Value (example tags tag_a to tag_e); the Block Address and Fail Pointer together are split into Tag, Index, and Bank Addr to form the Cache Index]
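The early inversion decision can be illustrated with a functional model. This is not the banked, tagged organization drawn on the slide: it assumes a small LRU-managed dictionary keyed by block address, and its capacity counts blocks rather than individual fault entries. `FailInfoCache` and `early_flip_groups` are hypothetical names.

```python
from collections import OrderedDict

class FailInfoCache:
    """Remembers recent blocks with stuck-at faults: address -> {position: stuck value}."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def record(self, addr, position, stuck_value):
        fails = self.blocks.pop(addr, {})
        fails[position] = stuck_value
        self.blocks[addr] = fails              # most recently used goes last
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used block

    def lookup(self, addr):
        fails = self.blocks.get(addr)
        if fails is not None:
            self.blocks.move_to_end(addr)
        return fails or {}

def early_flip_groups(data, fails, group_of):
    """Groups to invert before the first write, so that no 2nd write is needed."""
    return {group_of(pos) for pos, stuck in fails.items() if data[pos] != stuck}

cache = FailInfoCache(capacity=1)
cache.record(0x40, 2, 0)                       # cell 2 of block 0x40 is stuck at 0
flips = early_flip_groups([1, 0, 1, 0], cache.lookup(0x40), lambda pos: pos // 2)
print(flips)  # {1}
```

On a cache hit the controller already knows which groups disagree with their stuck cells, so it can write the inverted data on the first attempt.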
Evaluation
Evaluation
• Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance:
  – IdealECC, ECP, SAFER, SAFER_FC
• Hardware overhead
[Figure: meta-bit size (0 to 64 bits) for IdealECC6, ECP6, SAFER32, and SAFER32_FC, and for IdealECC4, ECP4, SAFER8, and SAFER8_FC]
Relative Lifetime Improvement
[Figure: bar chart of relative lifetime improvement (y-axis 0.0 to 1.4) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC; annotation: 14.8%]
• Cell write endurance: μ = 100M writes, σ = 10M writes
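A rough sketch of how a Monte Carlo lifetime estimate of this kind could be set up. It is not the authors' simulator. It assumes per-cell endurance is drawn from a normal distribution, reads the two endurance figures above as its mean (100M writes) and standard deviation (10M writes), uses a bit toggle rate of 0.5, and models only an idealized scheme that tolerates a fixed number of faulty cells per 512-bit block. All names are illustrative.

```python
import random

def block_lifetime(n_cells, tolerated_faults, mean, sigma, toggle_rate, rng):
    """Block writes survived by a block that tolerates `tolerated_faults` dead cells."""
    endurance = sorted(rng.gauss(mean, sigma) for _ in range(n_cells))
    fatal = endurance[tolerated_faults]  # the first fault beyond the tolerated ones
    return fatal / toggle_rate           # a cell is toggled on a fraction T of block writes

def mean_lifetime(tolerated_faults, trials=200, seed=0):
    """Average lifetime over `trials` randomly generated 512-bit blocks."""
    rng = random.Random(seed)
    total = sum(
        block_lifetime(512, tolerated_faults, 100e6, 10e6, 0.5, rng)
        for _ in range(trials)
    )
    return total / trials

baseline = mean_lifetime(0)
improved = mean_lifetime(6)
print(f"lifetime gain from tolerating 6 faults: {improved / baseline - 1:.1%}")
```

Using the same seed for both runs compares the schemes on identical sampled blocks, so the difference reflects only the number of tolerated faults.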
Conclusion
• Need to recover from multiple stuck-at faults
• SAFER
  – Efficient recovery scheme
  – Handles the growing stuck-at faults
    • Dynamic partition
    • Data inversion
  – SAFER32_FC
    • 11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
    • 14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)
Thank You All!!
Questions?
Georgia Tech ECE MARS Lab
http://arch.ece.gatech.edu
SRAM Fail Info. Cache Overhead
• Cell size in 2024
  – SRAM = 140 F² @ 10nm, PCM = 6 F² @ 8nm
  – 36.6X difference
• Compared with an 8 Gbit PCM chip

Number of Entries   Tag Size (bits)   Entry Size (bits)   Cache Size (bits)   Area Overhead
1K                  23                25                  25.6K               0.01%
2K                  22                24                  49.2K               0.02%
4K                  21                23                  94.2K               0.04%
8K                  20                22                  0.18M               0.08%
16K                 19                21                  0.33M               0.15%
32K                 18                20                  0.63M               0.28%
64K                 17                19                  1.19M               0.53%
128K                16                18                  2.25M               1.00%
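The table rows can be re-derived with a short calculation. This sketch rests on assumptions that are consistent with the table but not stated on the slide: "8 Gbit" is read as 8 × 2^30 bits, the cache is indexed by the block address (24 bits for 512-bit blocks) concatenated with the 9-bit fail pointer, and each entry holds the tag plus two extra bits (taken here to be the valid bit and the stuck value). `fail_cache_overhead` is an illustrative name.

```python
BLOCK_BITS = 512
PCM_BITS = 8 * 2**30    # assumption: "8 Gbit" means 8 * 2^30 bits
CELL_AREA_RATIO = 36.6  # SRAM cell area / PCM cell area, from the slide

def log2_exact(x):
    """log2 of a power of two, computed with integers only."""
    return x.bit_length() - 1

def fail_cache_overhead(entries):
    """Return (tag bits, entry bits, cache bits, area overhead in percent)."""
    block_addr_bits = log2_exact(PCM_BITS // BLOCK_BITS)  # 24
    fail_pointer_bits = log2_exact(BLOCK_BITS)            # 9
    tag_bits = block_addr_bits + fail_pointer_bits - log2_exact(entries)
    entry_bits = tag_bits + 2  # assumption: valid bit + stuck value
    cache_bits = entries * entry_bits
    area_pct = 100 * cache_bits * CELL_AREA_RATIO / PCM_BITS
    return tag_bits, entry_bits, cache_bits, area_pct

for entries in (1024, 16 * 1024, 128 * 1024):
    tag, entry, bits, pct = fail_cache_overhead(entries)
    print(f"{entries:>7} entries: tag={tag}, entry={entry}, cache={bits} bits, area={pct:.3f}%")
```

The tag and entry sizes reproduce the table exactly. The computed area overheads agree with the table to within rounding.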
Relative Lifetime Improvement
• Need a method measuring relative lifetime
  – independent from … and T
• Definition
[Figure: lifetime distribution with x-axis Lifetime (Million Writes) from 100 to 260; Cell Write Endurance Distribution: 100M writes, 10M writes; Bit Toggle Rate (T) = 0.5; the definition combines L, F, and T, with terms labeled "Recovery scheme contribution for lifetime" and "Lifetime Contribution"]
Lifetime Contribution per Meta-bit
[Figure: bar chart of lifetime contribution per meta-bit (y-axis 0.000 to 0.025) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC]
Average Number of Recovered Fails
[Figure: bar chart of average recovered fails per 256B (y-axis 0 to 30) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC]
SAFER with Fail Cache
[Figure: miss rate (0% to 80%) versus the number of maximum fails per 512 bits (1 to 15), for caches with 1K, 2K, 4K, 8K, 16K, 32K, 64K, and 128K entries]
[Figure: relative lifetime improvement (0.0 to 1.4) versus the number of cache entries (None, 1K to 128K), for SAFER2, SAFER4, SAFER8, SAFER16, and SAFER32, with IdealECC8 shown for reference]
Evaluation
• Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance:
  – IdealECC, ECP, SAFER, SAFER_FC
[Figure: bar chart of hardware overhead (y-axis 0% to 14%) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC; annotation: 11.9%]