Understanding the Robustness of SSDs under Power Fault
Joseph Tucek, Mark Lillibridge (HP Labs); Mai Zheng, Feng Qin (The Ohio State University)

Aug 14, 2018

Page 1:

Understanding the Robustness of SSDs under Power Fault

Joseph Tucek

Mark Lillibridge

HP Labs

Mai Zheng

Feng Qin

The Ohio State University

Page 3:

Solid-State Drives (SSDs)

- a “truly revolutionary and disruptive” technology

• Great performance

• Low power consumption

• *** behavior in adverse conditions ?

Page 5:

Power Faults

- a threat never gone

Jul. 2012: “POWER OUTAGE Hits London Data Center ...”

Jul. 2012: “... human error was responsible for a data center POWER OUTAGE ...”

Jun. 2012: “Amazon Data Center LOSES POWER During Storm …”

Aug. 2011: “Colocation provider Colo4 experienced a POWER OUTAGE …”

May 2010: “Car Crash Triggers Amazon POWER OUTAGE …”

Nov. 2010: “About 3,000 servers at Montreal web host iWeb experienced an OUTAGE …”

Jan. 2013: “A POWER OUTAGE at a key New Jersey data center ...”

Page 7:

Potential Failures

Page 8:

Simple Failures

• Bit Corruption

• Metadata Corruption

• Dead Device

[Figure: a record before and after the power fault. Labels: “record”, “before power fault”, “after power fault”, “mess all data”]

Page 9:

Simple Failures

• Shorn Writes

• Flying Writes

[Figure: disk blocks #1 and #2 holding “old” and “new” data, shown before and after the power fault]

Page 10:

Complex Failure: Unserializable Writes

[Figure: thread A issues writes A1, A2, A3, ordered by write completion time, to blocks 0 to 4. Initial block contents: 0 1 2 3 4. Two resulting states are shown, 0 A1 2 A3 4 and 0 A2 2 A3 4, one labeled “Serializable state” and the other “Unserializable state”]
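The idea behind an unserializable state can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' checker: it assumes a single thread whose writes complete in a known total order, and it assumes (the transcript does not say) that A1 and A2 both target block 1 while A3 targets block 3. The function name is_serializable is hypothetical. A state counts as serializable here if it equals the result of applying some prefix of the writes in completion order.

```python
def is_serializable(initial, writes, observed):
    """Return True if `observed` equals the state after some prefix of `writes`.

    `writes` is a list of (block_index, value) pairs in completion order.
    Simplification: one thread, total order (the real problem is a partial order).
    """
    state = list(initial)
    observed = list(observed)
    if state == observed:          # empty prefix: nothing was written yet
        return True
    for block, value in writes:
        state[block] = value       # apply the next write in completion order
        if state == observed:
            return True
    return False


# Assumed mapping of writes to blocks (not stated in the slides).
initial = ["0", "1", "2", "3", "4"]
writes = [(1, "A1"), (1, "A2"), (3, "A3")]

print(is_serializable(initial, writes, ["0", "A2", "2", "A3", "4"]))
print(is_serializable(initial, writes, ["0", "A1", "2", "A3", "4"]))
```

Under these assumptions, a state that contains A3 but still shows A1 in block 1 cannot come from any prefix, because A2 completed before A3.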

Page 11:

Testing Framework

Page 12:

Design

[Figure: components Scheduler, Workers, Checker, Switcher, and Control Circuit. Arrows labeled “write records ❸”, “read & check ❷”, and “power off/on”]

Page 13:

What to Write?

• all 0’s ?

[Figure: “write records ❸” and “read & check”, with example records 0000, 0000, 0000]

Page 14:

What to Write?

• all 0’s ?

• random numbers?

[Figure: “write records ❸” and “read & check”, with example records 4239, 0817, 3625]

Page 21:

Special Record Format

- allows detecting all 6 types of failures

fixed-sized header:

• checksum

• timestamp

• block#

• thread_id

• op_cnt

• seed

• … …

Callouts on the header:

• Bit corruption & Shorn writes

• Flying writes

• Unserializable writes

• regenerating records

• Metadata corruption & Dead device

Page 22:

Special Record Format

- allows detecting all 6 types of failures

Record layout: fixed-sized header (checksum, timestamp, block#, thread_id, op_cnt, seed, … …) followed by extendable padding

Padding content:

• all 0’s?

• random numbers?

• duplicates of header
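A record of this kind might be built and checked roughly as follows. This is a minimal sketch, not the authors' implementation: the field widths, little-endian layout, CRC32 checksum, 4096-byte record size, and the names make_record and check_record are all assumptions. The padding uses the “duplicates of header” option from the slide, so a partially written record shows up as header copies that disagree.

```python
import struct
import time
import zlib

# Assumed layout: timestamp (ns), block#, thread_id, op_cnt, seed.
BODY = struct.Struct("<QQIIQ")
CRC = struct.Struct("<I")
HEADER_SIZE = CRC.size + BODY.size  # 4 + 32 = 36 bytes


def make_record(block_no, thread_id, op_cnt, seed, size=4096, timestamp=None):
    """Build one record: a checksummed header repeated to fill `size` bytes."""
    if timestamp is None:
        timestamp = time.time_ns()
    body = BODY.pack(timestamp, block_no, thread_id, op_cnt, seed)
    header = CRC.pack(zlib.crc32(body)) + body
    copies = size // HEADER_SIZE + 1          # enough copies to cover `size`
    return (header * copies)[:size]


def check_record(record, expected_block):
    """Classify a record read back from block `expected_block`."""
    header = record[:HEADER_SIZE]
    (stored_crc,) = CRC.unpack_from(header)
    body = header[CRC.size:]
    copies = len(record) // HEADER_SIZE + 1
    intact = record == (header * copies)[:len(record)]
    if zlib.crc32(body) != stored_crc or not intact:
        return "bit corruption or shorn write"
    _timestamp, block_no, _thread_id, _op_cnt, _seed = BODY.unpack(body)
    if block_no != expected_block:
        # A well-formed record that landed in the wrong place.
        return "flying write"
    return "ok"
```

The timestamp, thread_id, and op_cnt fields are not interpreted by this check; in the slides they feed the unserializable-writes analysis, which needs records from many blocks at once.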

Page 23:

Advanced FTL: Compression

Page 24:

Randomization of Record Content

- avoid interference of compression

[Figure: a random mask is applied to the regular format to produce the random format]
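One way such a mask could work is sketched below. The slides do not say how the mask is generated or combined with the record, so the XOR operation, the seeded pseudo-random generator, and the names random_mask and apply_mask are assumptions for illustration. XOR is convenient because applying the same mask a second time restores the regular format.

```python
import random
import zlib


def random_mask(seed, length):
    """Derive a reproducible pseudo-random byte mask from `seed`."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(length))


def apply_mask(data, seed):
    """XOR `data` with the mask; calling it again with the same seed undoes it."""
    mask = random_mask(seed, len(data))
    return bytes(a ^ b for a, b in zip(data, mask))


regular = b"HEADER--" * 512            # highly repetitive, like a padded record
masked = apply_mask(regular, seed=1234)

# Compare how well each form compresses.
print(len(zlib.compress(regular)), len(zlib.compress(masked)))
```

The repetitive form shrinks to a small fraction of its size under zlib while the masked form does not, which is the property that keeps a compressing FTL from storing far less data than was written.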

Page 25:

Deriving Completion-time Partial Order

- a key step of unserializable writes detection

[Figure: timeline with events A1, A1’, A2 for thread A and B1, B1’, B2 for thread B]

A1: generating time of 1st record of A

A1’: completion time of writing 1st record of A

Page 29:

Deriving Completion-time Partial Order

- a key step of unserializable writes detection

[Figure: timeline with events A1, A1’, A2 for thread A and B1, B1’, B2 for thread B]

Intra-thread:

A1 -> A1’ -> A2

B1 -> B1’ -> B2

Inter-thread:

A1 -> B1 -> A2 -> B2

A1’ -> B1’ or B1’ -> A1’

Conservatively report no errors

Page 32:

Deriving Completion-time Partial Order

- a key step of unserializable writes detection

[Figure: timeline with events A1, A1’, A2 for thread A and B1, B1’, B2 for thread B]

Intra-thread:

A1 -> A1’ -> A2

B1 -> B1’ -> B2

Inter-thread:

A2 -> B1

A1’ -> B1’

More details in our paper & Golab et al. PODC’11
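The two cases can be reproduced with a small sketch. It assumes that unprimed names (A1, B1, ...) are record generating times taken from one shared clock, that primed names (A1’, B1’) are write completion times, and that each thread writes synchronously, so a write completes after its own record is generated and before the same thread generates its next record. The function names and the integer timestamps are hypothetical, and this is not the algorithm from the paper or from Golab et al.

```python
def completion_interval(gen_times, thread, i):
    """Bound the completion time of write i of `thread`.

    Lower bound: the record's own generating time.
    Upper bound: the generating time of the thread's next record
    (None if there is no next record).
    """
    times = gen_times[thread]
    upper = times[i + 1] if i + 1 < len(times) else None
    return times[i], upper


def completed_before(gen_times, x, y):
    """True only if write x provably completed before write y.

    x and y are (thread, index) pairs. Unknown order yields False,
    which corresponds to conservatively reporting no error.
    """
    _, x_upper = completion_interval(gen_times, *x)
    y_lower, _ = completion_interval(gen_times, *y)
    return x_upper is not None and x_upper <= y_lower


# Case 1: A1 -> B1 -> A2 -> B2, so A1' and B1' cannot be ordered.
interleaved = {"A": [1, 3], "B": [2, 4]}
print(completed_before(interleaved, ("A", 0), ("B", 0)),
      completed_before(interleaved, ("B", 0), ("A", 0)))

# Case 2: A2 -> B1, so A1' -> B1' follows.
separated = {"A": [1, 2], "B": [3, 4]}
print(completed_before(separated, ("A", 0), ("B", 0)))
```

Returning False for both directions in the first case matches the slide's “conservatively report no errors”: a pair of writes is only used as evidence when its completion order is certain.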

Page 34:

Power Fault Injection

[Figure: Host, SSD, and a “power off/on” switch]

Page 35:

Results

Page 36:

Experimental Environment

• Block Devices

- 15 SSDs and 2 hard drives

- SLC & MLC

- Manufactured in 2009 – 2012

- 4 have power-loss protection

- Low-end to high-end ($0.63/GB - $6.50/GB)

• Host System

- Debian 6.0 w/ 2.6.32 kernel

- LSI Logic SAS controller

- no filesystem on devices

- Synchronized & Direct I/O (O_SYNC | O_DIRECT)
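Opening a raw device with these flags could look like the sketch below. It is an illustration only: the helper names, the 4096-byte block size, and the use of an anonymous mmap to get an aligned buffer are assumptions, not details from the slides. O_DIRECT is a Linux flag, so the code falls back to 0 where Python does not define it.

```python
import mmap
import os

BLOCK_SIZE = 4096  # assumed record size; O_DIRECT needs aligned sizes and offsets

# os.O_DIRECT is only defined on platforms that support it (for example Linux).
RAW_FLAGS = os.O_RDWR | os.O_SYNC | getattr(os, "O_DIRECT", 0)


def open_raw(device_path):
    """Open a block device with no filesystem, bypassing the page cache."""
    return os.open(device_path, RAW_FLAGS)


def aligned_buffer(size=BLOCK_SIZE):
    """Anonymous mmap memory is page-aligned, as O_DIRECT transfers require."""
    return mmap.mmap(-1, size)


def write_block(fd, block_no, payload):
    """Write one block at its offset; returns the number of bytes written.

    With O_SYNC the call returns only after the device reports completion.
    `payload` length should be a multiple of the device's logical block size.
    """
    with aligned_buffer(len(payload)) as buf:
        buf.write(payload)
        return os.pwrite(fd, buf, block_no * len(payload))
```

A call such as open_raw("/dev/sdX") (a placeholder path) would normally need root privileges and overwrites whatever is on that device, so the sketch is best tried on a scratch disk.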

Page 37:

Summary of Observations

Failures                 # of SSDs
Bit Corruption           3
Metadata Corruption      1
Dead Device              1
Shorn Writes             3
Flying Writes            0
Unserializable Writes    8
None                     2

• 13 of 15 SSDs exhibit failure(s)

• 2 perfect SSDs

• 5 of 6 failures observed

Page 38:

Shorn Writes: Subpage Programming

[Figure: shorn-write pattern #1 and pattern #2 within a 4K-byte block, with “new” data marked at 512-byte granularity]

Page 40:

Serialization Errors: Avg. Numbers Per Fault

[Figure: bar chart. Y-axis: avg. # of serialization errors per fault, with ticks at 0 and from 0.125 to 1024 in powers of two. X-axis: SSD ID in order of increasing price ($/GB): 15, 7, 6, 10, 11, 12, 8, 9, 4, 13, 2, 5, 14. An “SLC” label marks part of the chart]

Page 41:

Serialization Errors: Patterns Over Time

[Figure: line chart for SSD#2, SSD#4, SSD#7, and SSD#8. Y-axis: # of serialization errors, with ticks at 1, 10, 100, 1000. X-axis: testing cycle #, from 1 to 100]

Page 42:

Metadata Corruption

• 1 SSD

• 8 injected power faults

• lost 31% (72 GB) of data

Dead Device

• 1 SSD

• 136 injected power faults

• can no longer be detected by host

Page 43:

Conclusion

• An effective methodology to expose bugs in block devices under power fault

• Important implications for storage design

• e.g. write-ahead logging vs. unserializable writes

Thank you!

Page 44:

Pristine Version of Our Paper can be Found at:

http://www.cse.ohio-state.edu/~zhengm/