Top Banner
1 Research on Efficient Erasure-Coding- Based Cluster Storage Systems Patrick P. C. Lee The Chinese University of Hong Kong NCIS’14 Joint work with Runhui Li, Jian Lin, Yuchong Hu
36

Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Mar 06, 2019

Download

Documents

phungkiet
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

1

Research on Efficient Erasure-Coding-

Based Cluster Storage Systems

Patrick P. C. Lee

The Chinese University of Hong Kong

NCIS’14

Joint work with Runhui Li, Jian Lin, Yuchong Hu

Page 2: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Motivation

Clustered storage systems are widely

deployed to provide scalable storage by striping

data across multiple nodes

• e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre, etc.

Failures are common

2

LAN

Page 3: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Failure Types

Temporary failures

• Nodes are temporarily inaccessible (no data loss)

• 90% of failures in practice are transient [Ford, OSDI’10]

• e.g., power loss, network connectivity loss, CPU

overloading, reboots, maintenance, upgrades

Permanent failures

• Data is permanently lost

• e.g., disk crashes, latent sector errors, silent data

corruptions, malicious attacks

3

Page 4: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Replication vs. Erasure Coding

Solution: Add redundancy:

• Replication

• Erasure coding

Enterprises (e.g., Google, Azure, Facebook)

move to erasure coding to save footprints due to

explosive data growth

• e.g., 3-way replication has 200% overhead; erasure

coding can reduce overhead to 33% [Huang, ATC’12]

4

Page 5: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Background: Erasure Coding

Divide file to k data chunks (each with multiple blocks)

Encode data chunks to additional n-k parity chunks

Distribute data/parity chunks to n nodes

Fault-tolerance: any k out of n nodes can recover file data

5

File encode divide

Nodes

(n, k) = (4, 2)

A B C D

A+C B+D

A+D B+C+D

A B

C D

A+C B+D

A+D B+C+D

A B

C D

Page 6: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Erasure Coding

Pros:

• Reduce storage space with high fault tolerance

Cons:

• Data chunk updates imply parity chunk updates

expensive updates

• In general, k chunks are needed to recover a single lost

chunk expensive recovery

Our talk: Can we improve recovery of erasure-

coding-based clustered storage systems, while

preserving storage efficiency?

6

Page 7: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Our Work

CORE [MSST’13, TC]

• Augments existing regenerating codes to support

both optimal single and concurrent failure recovery

Degraded-read scheduling [DSN’14]

• Improves MapReduce performance in failure mode

Designed, implemented, and experimented on

Hadoop Distributed File System

7

Page 8: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Recover a Failure

Q: Can we minimize recovery traffic?

8

Node 1

Node 2

Node 3

Node 4

repaired node

Conventional recovery: download data from any k nodes

Recovery Traffic = = M +

A B

C D

A+C B+D

A+D B+C+D

C D

A+C B+D

A B

File of

size M

A B C D

Page 9: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

9

Node 1

Node 2

Node 3

Node 4

Regenerating Codes [Dimakis et al.; ToIT’10]

A B

C D

A+C B+D

A+D B+C+D

C

A+C

A+B+C

A

B

+ + Recovery Traffic = = 0.75M

repaired node

File of

size M

A B C D

Repair in regenerating codes:

• Surviving nodes encode chunks (network coding)

• Download one encoded chunk from each node

Page 10: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Concurrent Node Failures

Regenerating codes only designed for single

failure recovery

• Optimal regenerating codes collect data from n-1

surviving nodes for single failure recovery

Correlated and co-occurring failures are possible

• In clustered storage [Schroeder, FAST’07; Ford, OSDI’10]

• In dispersed storage [Chun NSDI’06; Shah NSDI’06]

CORE augments regenerating codes for optimal

concurrent failure recovery

• Retains regenerating code construction

10

Page 11: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

CORE’s Idea

Consider a system with n nodes

Two functions for regenerating codes in single

failure recovery:

• Enc: storage node encodes data

• Rec: reconstruct lost data using encoded data from

n-1 surviving nodes

t-failure recovery (t > 1):

• Reconstruct each failed node as if other n-1 nodes

are surviving nodes

11

Page 12: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Example

12

CORE

S0,0 S0,1 S0,2

S1,0 S1,1 S1,2

S2,0 S2,1 S2,2 S3,0 S3,1 S3,2 S4,0 S4,1 S4,2

S5,0 S5,1 S5,2

Node 0

Node 1

Node 2 Node 3 Node 4

Node 5

Setting: n=6, k=3

Suppose now Nodes 0 and 1 fail

Recall that optimal regenerating codes collect data from n-

1 surviving nodes for single failure recovery

How does CORE work?

Page 13: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Example

s0,0, s0,1, s0,2 = Rec0(e1,0, e2,0, e3,0, e4,0, e5,0)

e0,1 = Enc0,1(s0,0, s0,1, s0,2)

= Enc0,1(Rec0(e1,0, e2,0, e3,0, e4,0, e5,0))

s1,0, s1,1, s1,2 = Rec1(e0,1, e2,1, e3,1, e4,1, e5,1)

e1,0 = Enc1,0(s1,0, s1,1, s1,2)

= Enc1,0(Rec1(e0,1, e2,1, e3,1, e4,1, e5,1))

CORE

S0,0 S0,1 S0,2

S1,0 S1,1 S1,2

S2,0 S2,1 S2,2 S3,0 S3,1 S3,2 S4,0 S4,1 S4,2

S5,0 S5,1 S5,2

e0,1

e2,1 e3,1 e4,1

e5,1 e1,0

e2,0 e3,0

e4,0

e5,0

13

Node 0

Node 1

Node 2 Node 3 Node 4

Node 5

Page 14: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Example

We have two equations

e0,1 = Enc0,1(Rec0(e1,0, e2,0, e3,0, e4,0, e5,0))

e1,0 = Enc1,0(Rec1(e0,1, e2,1, e3,1, e4,1, e5,1))

Trick: They form a linear system of equations

If the equations are linearly independent, we

can calculate e0,1 and e1,0

Then we obtain lost data by

s0,0, s0,1, s0,2 = Rec0(e1,0, e2,0, e3,0, e4,0, e5,0)

s1,0, s1,1, s1,2 = Rec1(e0,1, e2,1, e3,1, e4,1, e5,1)

14

Page 15: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Bad Failure Pattern

A system of equations may not have a unique

solution. We call this a bad failure pattern

Bad failure patterns count for less than ~1%

Our idea: reconstruct data by adding one

more node to bypass the bad failure pattern

• Suppose nodes 0,1 form a bad failure pattern and

nodes 0,1,2 form a good failure pattern.

Reconstruct lost data for nodes 0,1,2

• Still achieve bandwidth saving over conventional

15

Page 16: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Bandwidth Saving

Bandwidth Ratio: Ratio of CORE to conventional in

recovery bandwidth

0

0.5

1

1 2 3 4 5 6 7 8 9 10

Ban

dw

idth

Rati

o

Good Failure Pattern

(12,6) (16,8) (20,10)

0

0.5

1

2 3 4 5 6 7 8 9

Ban

dw

idth

Rati

o Bad Failure Pattern

(12,6) (16,8) (20,10)

Bandwidth saving of CORE is significant

• e.g., (n, k) = (20,10)

• Single failure: ~80%

• 2-4 concurrent failures: 36-64% 16

t t

Page 17: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Theorem

Theorem: CORE, which builds on regenerating

codes for single failure recovery, achieves the

lower bound of recovery bandwidth if we recover

a good failure pattern with t ≥ 1 failed nodes

• Over ~99% of failure patterns are good

17

Page 18: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

CORE Implementation

18

Namenode

Datanode Datanode

block

block

block

block

block

block

block

RaidNode 1. Identify corrupted blocks

2. Send recovered blocks

Encoder Encoder Encoder

CORE

Encoder/Decoder

Page 19: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

CORE Implementation

Parallelization

Erasure coding

on C++

• Executed

through JNI

19

Page 20: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Experiments

Testbed:

• 1 namenode, and up to 20 datanodes

• Quad core 3.1GHz CPU, 8GB RAM, 7200RPM SATA harddisk,

1Gbps Ethernet

Coding schemes:

• Reed-Solomon codes vs. CORE (interference alignment codes)

20

Namenode Datanode Datanode Datanode

Page 21: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Decoding Throughput

Evaluate computational

performance:

• Assume single failure (t=1)

• Surviving data for recovery

first loaded in memory

• Decoding throughput: ratio

of size of recovered data to

decoding time

CORE (regenerating

codes) achieves ≥500MB/s

at packet size 8KB

21

Page 22: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Recovery Throughput

CORE shows significantly higher throughput

• e.g., in (20, 10), for single failure, the gain is 3.45x;

for two failures, it’s 2.33x; for three failures, is 1.75x

0

10

20

30

40

50

60

70

(12, 6) (16, 8) (20, 10)

Reco

very

th

pt

(MB

/s)

CORE t=1

RS t=1

CORE t=2

RS t=2

CORE t=3

RS t=3

22

Page 23: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

MapReduce

Q: How does erasure-coded storage affect

data analytics?

Traditional MapReduce is designed with

replication storage in mind

To date, no explicit analysis of MapReduce on

erasure-coded storage

• Failures trigger degraded reads in erasure coding

23

Page 24: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

MapReduce

MapReduce idea:

• Map tasks process blocks and generate intermediate results

• Reduce tasks collect intermediate results and produce final output

Constraint: cross-rack network resource is scarce

24

WordCount

Example:

<A,2>

<B,2>

<C,2>

Map tasks Reduce tasks

<A,1> B C

A C

A B

Slave 0

Slave 1

Slave 2

Shuffle

Page 25: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

MapReduce on Erasure Coding

Show that default scheduling hurts MapReduce

performance on erasure-coded storage

Propose Degraded-First Scheduling for

MapReduce task-level scheduling

• Improves MapReduce performance on erasure-coded

storage in failure mode

25

Page 26: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Default Scheduling in MapReduce

Locality-first scheduling: the master gives the first

priority to assigning a local task to a slave

26

while a heartbeat from slave s arrives do

for job in jobQueue do

if job has a local task on s then

assign the local task

else if job has a remote task then

assign the remote task

else if job has a degraded task then

assign the degraded task

endif

endfor

endwhile

Processing a block stored

in another rack

Processing an unavailable

block in the system

Page 27: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Locality-First in Failure Mode

27

B0,0 B0,1 P0,0 P0,1

B1,0

B2,0

B3,0

P2,1

B4,0

P5,0

B1,1

P3,0

B4,1

P5,1

P1,0

B2,1

P3,1

P4,0

B5,0

P1,1

P2,0

B3,1

P4,1

B5,1

Core Switch

ToR Switch ToR Switch

S0 S1 S2 S3 S4

10 30 40

time(s)

slaves

S1

S2

S3

S4

Process B1,1

Process B5,1

Process B2,1

Process B5,0

Process B3,1

Process B5,1

Process B0,1

Process B4,0

Download P2,0

Download P0,0

Download P1,0

Map finishes

Process B0,0

Process B3,0

Process B2,0

Download P3,0

Process B1,0

Page 28: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Problems & Intuitions

Problems:

• Degraded tasks are launched simultaneously

• Start degraded reads together and compete for network

resources

• When local tasks are processed, network resource is

underutilized

Intuitions: ensure degraded tasks aren’t running

together to complete for network resources

• Finish running degraded tasks before all local tasks

• Keep degraded tasks separated

28

Page 29: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

29

25% saving

Core Switch

ToR Switch ToR Switch

B0,0

B1,0

B2,0

B3,0 B0,1

P2,1

B4,0

P5,0

B1,1

P3,0

B4,1

P5,1

P0,0

P1,0

B2,1

P3,1

P4,0

B5,0

P0,1

P1,1

P2,0

B3,1

P4,1

B5,1

S0 S1 S2 S3 S4

10 30

Map finishes

time(s)

S1

S2

S3

S4

slaves

Degraded-First in Failure Mode

Process B1,1

Process B5,1

Process B4,0

Download P0,0

Download P2,0

Process B5,0

Process B3,1

Process B5,1

Process B0,0 Process B0,1

Process B2,0

Download P1,0

Download P3,0

Process B1,0

Process B2,1

Process B3,0

Page 30: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Degraded-First Scheduling

30

while a heartbeat from slave s arrives do

if 𝒎

𝑴≥𝒎𝒅

𝑴𝒅 and job has a degraded task then

assign the degraded task

endif

assign other map slots as in locality-first scheduling

endwhile

Idea: Launch a degraded task only if the fraction of

degraded tasks being launched (𝒎𝒅 / 𝑴𝒅) is no more

than the fraction of all map tasks being launched (𝒎 / 𝑴)

• Control the pace of launching degraded tasks

• Keep degraded tasks separated in whole map phase

Page 31: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Example

31

Core Switch

ToR Switch ToR Switch

B0,0

B1,0

B2,0

B3,0 B0,1

P2,1

B4,0

P5,0

B1,1

P3,0

B4,1

P5,1

P0,0

P1,0

B2,1

P3,1

P4,0

B5,0

P0,1

P1,1

P2,0

B3,1

P4,1

B5,1

10 30 50

S1

S2

S3

Download P0,0 Process B0,0

Process B2,1

Process B1,1

(1)

(2)

(3)

Process B3,1

Download P1,0 Process B1,0

Process B0,1

Process B5,0

Process B3,0

Download P2,0 Process B2,0

Process B4,1

Process B4,0

Process B5,1

(4)

(5)

(6)

(7)

(8)

(9)

(10)

(11)

(12)

Degraded tasks are well separated

Page 32: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Properties

Gives higher priority to degraded tasks if

conditions are appropriate

• That’s why we call it “degraded-first” scheduling

Preserves MapReduce performance as in

locality-first scheduling in normal mode

Enhanced degraded first scheduling (EDF)

• Takes into account network topology when assigning

degraded tasks

32

Page 33: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Experiments

33

Prototype on HDFS-RAID

Hadoop cluster:

• Single master and 12 slaves

• Slaves grouped into three racks, four slaves each

• Both ToR and core switches have 1Gbps bandwidth

• 64MB per block

Workloads:

• Jobs: Grep, WordCount, LineCount

• Each job processes a 240-block file

Page 34: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Experiment Results

Single Job Gain of EDF over LF • 27.0% for wordcount

• 26.1% for grep

• 24.8% for linecount

Multiple Jobs Gain of EDF over LF • 16.6% for wordcount

• 28.4% for grep

• 22.6% for linecount

34

Page 35: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Other Projects on Erasure Coding

Mixed failures

• STAIR codes: a general, space-efficient erasure code for tolerating both device

failures and latent sector errors [FAST’14]

• I/O-efficient integrity checking against silent data corruptions [MSST’14]

Efficient updates

• CodFS: enhanced parity logging to reduce network and disk I/Os in erasure-

coded storage [FAST’14]

Efficient recovery

• NCCloud: reduce bandwidth for archival storage [FAST’12, INFOCOM’13, TC’14]

• I/O-efficient recovery schemes for erasure codes [MSST’12, DSN’12, TC’14, TPDS’14]

Modeling of SSD RAID

• Stochastic model to capture reliability changes as SSDs age [SRDS’13, TC]

Secure outsourced storage

• FMSR-DIP: remote data checking for regenerating codes [SRDS’12, TPDS’14]

• Convergent dispersal: unifying security, deduplication, and erasure coding [HotStorage’14]

35

Page 36: Research on Efficient Erasure-Coding- Based Cluster ...acs.ict.ac.cn/ncis2014/slides/pNCIS2014-Keynote-PCLee.pdf · Research on Efficient Erasure-Coding-Based Cluster Storage Systems

Conclusions

Provide insights into the use of erasure coding on

clustered storage systems

Two systems

• CORE: improve concurrent recovery performance

• Degraded-first scheduling: improve MapReduce performance

in erasure-coded storage in failure mode

Approach:

• Build prototypes, backed by extensive experiments and

theoretical analysis

• Open-source software

http://www.cse.cuhk.edu.hk/~pclee

36