Page 1
SSD Failures in Datacenters: What? When? And Why?
Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield,
Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, Kushagra Vaid
The 9th ACM Systems and Storage Conference (SYSTOR 2016)
Page 2
Why SSD Reliability?
SSDs’ popularity: 46.5% annual growth*
Datacenter decision support
Data reliability
Limited field data
*Source: IDC, Dec 2015
Large-scale field data
Page 4
SSD Failures
Flash failures
- Media wear-out
- Data retention
- Program disturb
- Erase disturb
FTL mechanisms
- Wear levelling
- Error detection
- Error correction
- Flash correct and refresh, etc.
Fail-stop failures
Page 10
SSD Reliability
[Figure: Annualized Failure Rate (%) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A); y-axis 0 to 1.2. Chart annotations: AFR=0.61, AFR=0.73. Class labels: Consumer, Enterprise.]
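As a rough illustration of the metric on this chart: annualized failure rate (AFR) is commonly computed as failures per device-year of operation, expressed as a percentage. The sketch below assumes that definition; the function name and the example counts are hypothetical and are not taken from the study.

```python
def annualized_failure_rate(failures: int, device_days: float) -> float:
    """Return AFR in percent: failures per device-year of observed operation."""
    if device_days <= 0:
        raise ValueError("device_days must be positive")
    device_years = device_days / 365.0
    return 100.0 * failures / device_years


# Hypothetical fleet (not study data): 10,000 devices observed for
# 365 days each, with 61 failures in that window.
afr = annualized_failure_rate(failures=61, device_days=10_000 * 365)
print(f"AFR = {afr:.2f}%")
```

Counting device-days rather than devices keeps the rate comparable across models that were deployed for different lengths of time.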
5 large datacenters
4 major workloads
6 different rack SKUs
Various factors in a production environment could affect SSD failure trends very differently from lab test conditions.
Can we understand SSD failures in the presence of these factors?
Page 16
Understanding SSD Failures – An analogy
SSD
Reactive
Proactive
Page 17
What are the symptoms?
Human analogy: fever, unexpected weight loss, low blood pressure
SSD: data errors, reallocated sectors, program and erase failures, SATA downshift
Page 18
SSD Failure Symptoms
Reallocated Sector Count
Program and Erase Fail Count
CRC and Uncorrectable Error Count
SATA Downshift Count
[Figure: AFR (%) for devices with vs. without each symptom; y-axis 0 to 3.5. Ratio annotations on the chart: 3.95X, 2.76X, 18X, 3.91X.]
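The "X" annotations compare AFR for devices that show a symptom against devices that do not. A minimal sketch of that comparison, assuming AFR is failures per device-year; the `Device` record, its fields, and the fleet below are hypothetical and do not reproduce the study's data.

```python
from dataclasses import dataclass


@dataclass
class Device:
    days_observed: float  # days in service during the observation window
    failed: bool          # True if the device failed in that window
    has_symptom: bool     # True if the symptom counter was ever non-zero


def afr_percent(devices):
    """AFR in percent: failures per device-year across the given devices."""
    device_years = sum(d.days_observed for d in devices) / 365.0
    failures = sum(1 for d in devices if d.failed)
    return 100.0 * failures / device_years


def symptom_afr_ratio(devices):
    """AFR of devices with the symptom divided by AFR of devices without it.

    Raises ZeroDivisionError if the symptom-free group has no failures.
    """
    with_symptom = [d for d in devices if d.has_symptom]
    without_symptom = [d for d in devices if not d.has_symptom]
    return afr_percent(with_symptom) / afr_percent(without_symptom)


# Hypothetical fleet, each device observed for one full year.
fleet = (
    [Device(365, True, True)] * 4 + [Device(365, False, True)] * 96      # 4% AFR
    + [Device(365, True, False)] * 1 + [Device(365, False, False)] * 99  # 1% AFR
)
ratio = symptom_afr_ratio(fleet)
print(f"{ratio:.2f}X")
```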
Page 19
Insufficiency of symptom-only diagnosis
[Figure: % of devices (y-axis 0 to 70) showing each symptom, failed vs. healthy devices; categories: Reallocations, Program and Erase Fail, Data Errors, SATA Downshift, Any.]
Symptoms seen in only 62% of failed devices
Page 20
What are the factors?
Human analogy: lifestyle, genetics, environmental agents
SSD: workload, design decisions, production environment
Page 21
Device-level correlating factors
Average write rate of a device
Average read rate of a device
Total read and/or write usage
Write amplification
Read-write ratio
[Figure: AFR (%) vs. average host writes per day; x-axis bins 10 to 50 and >50, y-axis 0 to 2.5.]
Increasing failure trend at higher write rates
More results in the paper
Page 22
Server-level correlating factors
SSD space utilization
Disk space utilization
Memory utilization
Processor utilization
[Figure: AFR (%) vs. average disk space utilization; x-axis 10 to 70, y-axis 0 to 1.2.]
Decreasing failure trend at high disk space usage
More results in the paper
Page 23
Datacenter factors
Rack SKU
Datacenter facility
[Figure: AFR (%) by rack SKU and SSD model; SSD models 1-D and 2-A in rack SKUs S1-3a and S1-3b; y-axis 0 to 0.6.]
Same model, different behavior
More results in the paper
Page 24
Understanding SSD Failures – An analogy
SSD
Symptoms, Factors
Multi-feature analysis
Page 25
Understanding SSD Failures – An analogy
SSD
Symptoms, Factors
Random forest based binary classification
Permutation feature ranking
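The slide names the two techniques without detail. The sketch below illustrates only the permutation ranking step: shuffle one feature column at a time and measure how much classification accuracy drops. The classifier is a stand-in threshold rule, not a random forest, and the features, data, and function names are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_features, repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / repeats)
    return importances


# Stand-in model (an assumption, NOT a random forest): predict failure when
# feature 0, a hypothetical data-error count, exceeds 1. Feature 1 is ignored.
def predict(row):
    return row[0] > 1


data_rng = random.Random(1)
rows = [(data_rng.randint(0, 5), data_rng.randint(0, 100)) for _ in range(200)]
labels = [r[0] > 1 for r in rows]  # synthetic labels driven by feature 0 only

scores = permutation_importance(predict, rows, labels, n_features=2)
ranking = sorted(range(2), key=lambda j: scores[j], reverse=True)
print(ranking, scores)
```

Because the stand-in model never reads feature 1, shuffling it cannot change any prediction, so its score is zero and feature 0 ranks first.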
Page 26
Understanding What?
- What are the important factors?
- What is their order of importance?
- What are the important combinations?
Page 27
Understanding What?
[Figure: Feature importance (x-axis 0 to 1) for the features, in chart order: DataErrors, ReallocSectors, TotalNANDWrites, HostWrites, TotalReads+Writes, AvgMemory, AvgSSDSpace, UsagePerDay, TotalReads, ReadsPerDay. Highlighted group: Symptoms.]
Page 28
Understanding What?
[Figure: feature-importance chart from Page 27; highlighted group: Device workload.]
Page 29
Understanding What?
[Figure: feature-importance chart from Page 27; highlighted group: Server workload.]
Page 30
Understanding What?
Frequent combinations (combinations of top 8 important features)
Condition | Class (H = healthy, F = failed)
Data Errors <= 1 & Reallocated Sectors <= 5 | H
Data Errors <= 1 & WAF <= 1 | H
Media Wear-out = 100 & WAF <= 1 | H
Avg. SSD space >= 10 | F
Highlighted group: Symptoms
Page 31
Understanding What?
[Table: frequent-combinations table from Page 30; highlighted group: Symptoms + Workload.]
Page 32
Understanding What?
[Table: frequent-combinations table from Page 30; highlighted group: Workload.]
Page 33
Understanding When?
- What is the duration between detection and failure?
- What signatures characterize SSD survivability?
Page 34
Understanding When?
[Figure: CDF of time to fail (months); x-axis 0 to 12, y-axis 0 to 1.]
50% of failures: time to fail > 4 months
Sufficient time to intervene
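A minimal sketch of how an empirical CDF and its median can be read off a list of time-to-fail values. It assumes time to fail is measured in months from symptom detection to failure, as the preceding slide's question suggests; the sample values are hypothetical, not study data.

```python
import statistics


def empirical_cdf(samples, x):
    """Fraction of samples less than or equal to x."""
    return sum(1 for s in samples if s <= x) / len(samples)


# Hypothetical time-to-fail values in months (not study data).
time_to_fail = [0.5, 1, 2, 3, 4.5, 5, 6, 8, 10, 12]

median = statistics.median(time_to_fail)
print(f"F(4) = {empirical_cdf(time_to_fail, 4):.1f}, median = {median} months")
```

A median above 4 months in this toy sample corresponds to the slide's reading that half of the failures leave more than 4 months to intervene.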
Page 35
Understanding When?
[Figure: CDF of time to fail (months) from Page 34; 50% of failures have time to fail > 4 months.]
Early failures (< 1 month): rules include symptoms and their thresholds
Late failures: rules contain only workload factors
Page 36
Understanding SSD Failures – An analogy
SSD
Symptoms, Factors
Observation based causal estimate
Probabilistic causal models and Pearl’s do-calculus
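The slide names Pearl's do-calculus without giving formulas. As one standard illustration of an observation-based causal estimate (an assumption about the kind of estimate involved, not the authors' stated derivation), the back-door adjustment expresses an interventional quantity using only observational distributions. Here $Z$ is a set of observed covariates assumed to satisfy the back-door criterion relative to $X$ and $Y$:

```latex
P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)
```

For example, $X$ could stand for a workload factor such as write rate and $Y$ for device failure; these symbol assignments are illustrative only.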
Page 37
Understanding Why?
- What factors impact SSD reliability?
- What is their magnitude of impact?
Page 38
Understanding Why?
SSD model and symptoms have direct impact
Workload impacts failures through media wear-out
Page 39
Concluding Remarks
• SSD failures in the field
• Factors -> Symptoms -> Failures
• Important symptoms: Data Errors and Reallocated Sectors
• Devices with high-intensity, rapidly progressing symptoms fail early
• Important factors: NAND Writes, Total Reads and Writes, etc.
• Direct impact: SSD model and symptoms
• Indirect impact: workload, through wear-out
• Future direction: prediction and control