Tachyon: memory-centric, fault-tolerant storage for cluster frameworks, presented by Viet-Trung Tran

Aug 15, 2015

Transcript
Page 1

Tachyon: memory-centric, fault-tolerant storage for cluster frameworks

presented by Viet-Trung Tran

Page 2

Memory is King

• RAM throughput increasing exponentially

• Disk throughput increasing slowly

Memory locality is key to interactive response time

Page 3

Memory as cache

• Improves reads
  • Cannot help much with writes
• Replication for fault tolerance
  • Network bandwidth and latency are much worse than those of memory
• Write throughput is limited by disk I/O
  • At least one copy is required on disk
• Inter-job data sharing cost dominates pipeline end-to-end latency
  • 34% of jobs have output as large as their input (Cloudera survey)

Page 4

Different jobs share data

[Figure: two Spark tasks, each with its own Spark memory block manager holding blocks 1 and 3, share data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process; sharing requires slow writes to disk.]

Page 5

Different frameworks share data

[Figure: a Spark task with its Spark memory block manager (blocks 1 and 3) and Hadoop MR on YARN share data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process; sharing requires slow writes to disk.]

Page 6

Tachyon: reliable data sharing at memory speed within and across frameworks/jobs

[Figure: Tachyon sits between compute frameworks (Spark, MapReduce, SparkSQL, H2O, GraphX, Impala, ...) and storage systems (HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, ...).]

Page 7

Challenges

How to achieve reliable data sharing without replication?

Page 8

Target workload properties

• Immutable data
• Deterministic jobs
• Locality-based scheduling
• All data vs. working set
• Program size vs. data size

Page 9

System architecture

Consists of two layers

• Lineage
  • Delivers high-throughput I/O
  • Captures the sequence of jobs/tasks that create an output
• Persistence
  • Asynchronous checkpoints

Facts

• One data copy in memory

• Recomputation for fault-tolerance

Page 10

Memory-Centric Storage Architecture

Page 11
Page 12

Master Node

• Similar to HDFS and GFS
  • Passive standby model
• BUT also contains a workflow manager
  • Tracks lineage information
  • Computes checkpoint order
  • Interacts with the cluster resource manager to allocate resources for recomputations

Page 13

Lineage

Page 14

More complex lineage

Page 15

Lineage metadata

• Binary program

• Configuration

• Input Files List

• Output Files List

• Dependency Type
  • Narrow (filter, map)
  • Wide (shuffle, join)
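
As an illustration of the fields listed above, here is a minimal Python sketch of a lineage record. The class and field names (`LineageRecord`, `DependencyType`, and the example paths) are hypothetical and chosen only to mirror the bullets; they are not Tachyon's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class DependencyType(Enum):
    NARROW = "narrow"  # e.g. filter, map
    WIDE = "wide"      # e.g. shuffle, join


@dataclass
class LineageRecord:
    """Hypothetical record describing how a set of output files was produced."""
    binary_program: str            # program that was run
    configuration: Dict[str, str]  # settings the program ran with
    input_files: List[str]         # files the job read
    output_files: List[str]        # files the job wrote
    dependency_type: DependencyType


# Example: a job that shuffles its input (wide dependency).
record = LineageRecord(
    binary_program="wordcount.jar",
    configuration={"reducers": "4"},
    input_files=["/in/a"],
    output_files=["/out/b"],
    dependency_type=DependencyType.WIDE,
)
```

With a record like this stored for every output, a lost file can in principle be regenerated by rerunning `binary_program` with the same configuration on the same inputs.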

Page 16

Fault-recovery by recomputations

• Challenge
  • Bounding the recomputation cost for a long-running storage system
• Asynchronous checkpointing
• Allocate resources for recomputations
  • Make sure recomputation tasks get enough resources
  • Do not impact system performance (task priorities)
• Assumptions
  • Input files are immutable
  • Job executions are deterministic
• Client-side caching to mitigate read hotspots
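
The idea of recovering a lost file by recomputation can be sketched in a few lines of Python. This is a toy model under the two assumptions stated above (immutable inputs, deterministic jobs); the `recover` function, the `lineage` mapping, and the in-memory `store` are all hypothetical names for illustration and ignore scheduling and resource allocation.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical lineage: output file -> (input files, job that rebuilds the output).
Lineage = Dict[str, Tuple[List[str], Callable[..., Any]]]


def recover(path: str, lineage: Lineage, store: Dict[str, Any],
            recomputed: List[str]) -> Any:
    """Return the data for `path`, recomputing it (and lost inputs) if needed."""
    if path in store:
        return store[path]  # still in memory or checkpointed
    if path not in lineage:
        raise KeyError(f"{path} is lost and has no lineage")
    inputs, job = lineage[path]
    # Recover every input first; this recursion is what can cascade.
    args = [recover(p, lineage, store, recomputed) for p in inputs]
    store[path] = job(*args)  # deterministic job gives the same output
    recomputed.append(path)
    return store[path]


# Example: only "raw" survived a failure; "sorted" and "top" were lost.
store = {"raw": [3, 1, 2]}
lineage = {
    "sorted": (["raw"], sorted),
    "top": (["sorted"], lambda xs: xs[-1]),
}
recomputed: List[str] = []
result = recover("top", lineage, store, recomputed)
```

In this example, requesting `top` first rebuilds `sorted` and then `top`. The longer the chain of lost files, the more work a single read triggers, which is why bounding recomputation cost matters.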

Page 17

Asynchronous checkpointing

• Goals
  • Bounded recomputation time
  • Checkpointing hot files
  • Avoid checkpointing temporary files
• Edge algorithm
  • Models the relationships of files with a DAG
    • Vertices are files
    • Edge from A to B if B is generated by a job that reads A
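
The DAG described above can be built directly from the jobs' input and output file lists. The following Python sketch assumes each job is given as an `(input_files, output_files)` pair; the function name `build_file_dag` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Job = Tuple[List[str], List[str]]  # (input files, output files)


def build_file_dag(jobs: Iterable[Job]) -> Dict[str, Set[str]]:
    """Map each file to the set of files generated by jobs that read it."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    for inputs, outputs in jobs:
        for out in outputs:
            edges[out]  # make sure every output is a vertex, even with no children
            for inp in inputs:
                edges[inp].add(out)  # edge inp -> out
    return dict(edges)


# Example: one job reads A and writes B; another reads B and writes C and D.
jobs = [(["A"], ["B"]), (["B"], ["C", "D"])]
dag = build_file_dag(jobs)
```

Here `dag` has an edge from A to B and edges from B to C and D; C and D have no outgoing edges, so they are the leaves of the DAG.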

Page 18

Edge algorithm

• Checkpoint leaves
• Checkpoint hot files
  • Most files are accessed fewer than 3 times (Yahoo survey of big data workloads)
  • Thus, files accessed more than twice get checkpointed
• Dealing with large datasets
  • 96% of active jobs have data sizes that fit in the cluster memory
  • Synchronously write datasets above a defined threshold to disk
  • Most checkpointed files in memory can be evicted to make room
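
A simplified Python sketch of the selection rule on this slide: pick the leaves of the file DAG plus any file accessed more than twice, skipping files that are already checkpointed. This is an illustration only, not the full Edge algorithm; the function name, the DAG representation (file to set of child files), and the example data are assumptions.

```python
from typing import Dict, Iterable, List, Set

HOT_ACCESS_THRESHOLD = 2  # from the slide: accessed more than twice means hot


def checkpoint_candidates(dag: Dict[str, Set[str]],
                          access_counts: Dict[str, int],
                          checkpointed: Iterable[str]) -> List[str]:
    """Return files to checkpoint next: DAG leaves and hot files."""
    leaves = {f for f, children in dag.items() if not children}
    hot = {f for f in dag if access_counts.get(f, 0) > HOT_ACCESS_THRESHOLD}
    return sorted((leaves | hot) - set(checkpointed))


# Example: C and D are leaves, A is hot, D is already on disk.
dag = {"A": {"B"}, "B": {"C", "D"}, "C": set(), "D": set()}
access_counts = {"A": 5, "B": 1, "C": 0}
candidates = checkpoint_candidates(dag, access_counts, checkpointed={"D"})
```

In this example B is skipped: it is an intermediate file that is neither a leaf nor hot, which matches the goal of not checkpointing temporary files.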

Page 19

Resource allocation

• Depends on the scheduling policy of the running cluster
• Requirements
  • Priority compatibility
  • Resource sharing
  • Avoid cascading recomputation
    • Best ordering of recomputations
• Most common policies
  • Priority based
  • Weighted fair sharing

Page 20

Priority based scheduler

Page 21

Fair sharing based scheduler

Page 22

Evaluation

• 110x faster than MemHDFS
• 4x faster in realistic jobs
• 3.8x faster in case of failure
• Recovers from master failure within 1 second
• Reduces replication-caused network traffic by up to 50%
• Recomputation impact is less than 1.6%