
Themis: An I/O-Efficient MapReduce (SoCC 2012)

Transcript
Page 1: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Themis: An I/O-Efficient MapReduce

Alex Rasmussen, Michael Conley, Rishi Kapoor, Vinh The Lam, George Porter, Amin Vahdat*

University of California, San Diego; *also Google, Inc.

Page 2: Themis: An I/O-Efficient MapReduce (SoCC 2012)

MapReduce is Everywhere

- First published in OSDI 2004
- De facto standard for large-scale bulk data processing
- Key benefit: a simple programming model that can handle large volumes of data

Page 3: Themis: An I/O-Efficient MapReduce (SoCC 2012)

I/O Path Efficiency

- MapReduce jobs are I/O bound, so disks are the bottleneck
- Existing implementations do a lot of disk I/O:
  - Materialize map output, then re-read it during shuffle
  - Multiple sort passes if intermediate data is large
  - Swapping in response to memory pressure
- Hadoop Sort 2009: <3 MBps per node (3% of a single disk's throughput!)

Page 4: Themis: An I/O-Efficient MapReduce (SoCC 2012)

How Low Can You Go?

- Aggarwal and Vitter: a minimum of two reads and two writes per record for out-of-core sort
- Systems that meet this lower bound have the "2-IO property"
- MapReduce has a sort in the middle, so the same principle applies

Page 5: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Themis

- Goal: build a MapReduce implementation with the 2-IO property
- Builds on TritonSort (NSDI '11), a 2-IO sort and world record holder in large-scale sorting
- Runs on a cluster of machines with many disks per machine and fast NICs
- Performs a wide range of I/O-bound MapReduce jobs at nearly TritonSort speeds

Page 6: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Outline

- Architecture Overview
- Memory Management
- Fault Tolerance
- Evaluation

Page 7: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Phase One: Map and Shuffle

[Diagram: cluster nodes connected by the network]

Pages 8-11: Themis: An I/O-Efficient MapReduce (SoCC 2012)

[Animation, built up across these slides: each node's Map task reads local input and emits tuples (example keys: 3, 39, 25, 15). Tuples are routed over the network by key range (1-10, 11-20, 21-30, 31-40) and appended to unsorted partitions on the receiving nodes.]
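The routing step the animation depicts can be sketched in a few lines. A minimal, illustrative example (Python for brevity; Themis itself is written in C++, and the names below are hypothetical):

    # Route each mapped tuple to the partition that owns its key range,
    # matching the animation's ranges (1-10, 11-20, 21-30, 31-40).
    import bisect

    boundaries = [10, 20, 30, 40]  # upper bound of each partition's key range

    def partition_for(key):
        """Index of the partition whose key range contains `key`."""
        return bisect.bisect_left(boundaries, key)

    for key in (3, 39, 25, 15):  # the example keys from the slides
        print(key, "-> partition", partition_for(key))
    # 3 -> partition 0, 39 -> partition 3, 25 -> partition 2, 15 -> partition 1

In phase one these writes are append-only: each tuple is sent to the node owning its partition and appended to an unsorted partition, so every record is written to disk once.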

Phase Two: Sort and Reduce

Pages 12-14: Themis: An I/O-Efficient MapReduce (SoCC 2012)

[Animation, built up across these slides: in phase two, each node reads one unsorted partition at a time (key ranges 1-10, 11-20, 21-30, 31-40) into memory, sorts it, runs Reduce over it, and writes the reduced output back to disk.]
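Phase two's per-partition work can be sketched similarly. A minimal, illustrative example (Python; assumes the partition fits in RAM, which Themis ensures via sampling, and uses hypothetical names):

    # Sort one in-memory partition by key, then reduce each key group.
    from itertools import groupby
    from operator import itemgetter

    def phase_two(partition, reduce_fn):
        partition.sort(key=itemgetter(0))  # one in-memory sort, no spills
        for key, group in groupby(partition, key=itemgetter(0)):
            yield key, reduce_fn(key, [v for _, v in group])

    # Example: a word-count-style reduce over one partition.
    part = [(3, 1), (15, 1), (3, 1)]
    print(list(phase_two(part, lambda k, vs: sum(vs))))  # [(3, 2), (15, 1)]

Because each partition is read once, sorted entirely in memory, and its reduced output written once, phase two contributes one read and one write per record, preserving the 2-IO bound.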

Pages 15-16: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Implementation Details: Phase One

[Diagram: phase one is a pipeline of stages: Reader -> Byte Stream Converter -> Mapper -> Sender -> (network) -> Receiver -> Byte Stream Converter -> Tuple Demux -> Chainer -> Coalescer -> Writer. Each box in the pipeline is a stage.]

Page 17: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Implementation Details

[Diagram: each stage's work is divided among multiple parallel workers (Worker 1 through Worker 4)]
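The stage-and-worker structure in these diagrams can be sketched as queues connecting pools of threads. A minimal, illustrative example (Python; the real system is C++, and the API below is hypothetical):

    # Each stage owns a pool of workers pulling from the stage's input
    # queue and pushing results to the next stage's queue.
    import queue
    import threading

    def run_stage(work_fn, in_q, out_q, num_workers=4):
        def worker():
            while True:
                item = in_q.get()
                if item is None:      # shutdown sentinel
                    in_q.put(None)    # propagate to sibling workers
                    return
                out_q.put(work_fn(item))
        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        return threads

    # Example: a mapper-like stage with 4 workers.
    src, dst = queue.Queue(), queue.Queue()
    workers = run_stage(lambda x: x * x, src, dst)
    for i in range(8):
        src.put(i)
    src.put(None)
    for t in workers:
        t.join()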

Page 18: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Challenges

- Can't spill to disk or swap under memory pressure: addressed by the Themis memory manager
- Partitions must fit in RAM: addressed by sampling (see paper)
- How does Themis handle failure? Job-level fault tolerance

Page 19: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Outline

- Architecture Overview
- Memory Management
- Fault Tolerance
- Evaluation

Page 20: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Example

- Why is this a problem?
- Example stage graph
- Sample of what goes wrong

[Diagram: stage graph A -> B -> C]

Page 21: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Memory Management: Goals

- If we exceed physical memory, we have to swap
  - OS approach: virtual memory, swapping
  - Hadoop approach: spill files
- Since this incurs more I/O, it is unacceptable

Page 22: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Example: Runaway Stage

[Diagram: stage graph A -> B -> C; one stage's buffering grows without bound ("Too Much Memory!")]

Page 23: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Example: Large Record

[Diagram: stage graph A -> B -> C; a single large record exceeds available memory ("Too Much Memory!")]

Page 24: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Requirements

- Main requirement: can't allocate more than the amount of physical memory
  - Swapping or spilling breaks the 2-IO property
- Provide flow control via back-pressure
  - Prevent any stage from monopolizing memory
  - Memory allocations can block indefinitely
- Support large records
- Maintain high memory utilization

Page 25: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Approach

- Application-level memory management
- Three memory management schemes:
  - Pools
  - Quotas
  - Constraints

Pages 26-27: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Pool-Based Memory Management

[Diagram, animated across these two slides: stages A -> B -> C with a buffer pool between each pair of adjacent stages (PoolAB, PoolBC); buffers circulate between adjacent stages through the pools]

Page 28: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Pool-Based Memory Management

- Prevents using more than what's in a pool
- Empty pools cause back-pressure
- Record size is limited to the size of a buffer
- Unless tuned, memory utilization might be low; tuning is hard (must set buffer and pool sizes)
- Used when receiving data from the network, where allocation must be fast to keep up with 10 Gbps links
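A minimal sketch of the pool scheme (Python, illustrative only; Themis itself is C++ and its pool implementation differs):

    # Fixed-size buffer pool: get() blocks when the pool is empty, which
    # is exactly how an empty pool exerts back-pressure on its producer.
    import queue

    class BufferPool:
        def __init__(self, num_buffers, buffer_size):
            self._pool = queue.Queue()
            for _ in range(num_buffers):          # pre-allocate everything
                self._pool.put(bytearray(buffer_size))

        def get(self):
            return self._pool.get()               # blocks if pool is empty

        def put(self, buf):
            self._pool.put(buf)                   # return buffer for reuse

    pool = BufferPool(num_buffers=4, buffer_size=64 * 1024)
    buf = pool.get()   # producer fills buf; the consumer returns it later
    pool.put(buf)

Note how the scheme's limits fall out of the structure: a record can never exceed buffer_size, and utilization depends entirely on choosing num_buffers and buffer_size well.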

Page 29: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Quota-Based Memory Management

[Diagram: stages A -> B -> C with a single quota between source A and sink C (QuotaAC = 1000), partially consumed by in-flight data (300 in use, 700 available)]

Page 30: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Quota-Based Memory Management

- Provides back-pressure by limiting memory between a source and a sink stage
- Supports large records (up to the size of the quota)
- High memory utilization if quotas are set well
- The size of the data between source and sink cannot change, otherwise the accounting leaks
- Used in the rest of phase one (map + shuffle)
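A minimal sketch of the quota scheme (Python, illustrative; the diagram's QuotaAC = 1000 would be MemoryQuota(1000)):

    # A quota between a source and a sink stage: the source blocks once
    # `quota` bytes are in flight; the sink releases them. If the data's
    # size changes between source and sink, this accounting leaks.
    import threading

    class MemoryQuota:
        def __init__(self, quota_bytes):
            self._available = quota_bytes
            self._cond = threading.Condition()

        def allocate(self, nbytes):          # called by the source stage
            with self._cond:
                while self._available < nbytes:   # back-pressure: block
                    self._cond.wait()
                self._available -= nbytes

        def release(self, nbytes):           # called by the sink stage
            with self._cond:
                self._available += nbytes
                self._cond.notify_all()

    quota = MemoryQuota(1000)
    quota.allocate(300)   # 300 bytes in flight, 700 still available
    quota.release(300)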

Page 31: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Constraint-Based Memory Management

- Single global memory quota
- Requests that would exceed the quota are enqueued and scheduled
- Dequeue order is based on policy
  - Current policy: a stage's distance to the network or disk write
  - Rationale: process a record completely before admitting new records

Page 32: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Constraint-Based Memory Management

- Globally limits memory usage
- Applies back-pressure dynamically
- Supports record sizes up to the size of memory
- Extremely high utilization
- Higher overhead (2-3x over quotas)
- Can deadlock for complicated graphs and allocation patterns
- Used in phase two (sort + reduce)
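A minimal sketch of the constraint scheme's scheduling policy (Python, illustrative; the wait/wake mechanics of a real allocator are elided):

    # One global quota; requests that don't fit wait in a priority queue
    # and are granted in policy order: smaller distance to the writer
    # stage goes first, so in-flight records finish before new ones enter.
    import heapq

    class ConstraintAllocator:
        def __init__(self, total_bytes):
            self._free = total_bytes
            self._waiting = []   # (distance_to_writer, seq, nbytes)
            self._seq = 0

        def request(self, nbytes, distance_to_writer):
            if self._free >= nbytes and not self._waiting:
                self._free -= nbytes
                return True                  # granted immediately
            self._seq += 1                   # seq breaks priority ties
            heapq.heappush(self._waiting,
                           (distance_to_writer, self._seq, nbytes))
            return False                     # caller must block until granted

        def release(self, nbytes):
            self._free += nbytes
            while self._waiting and self._waiting[0][2] <= self._free:
                _, _, granted = heapq.heappop(self._waiting)
                self._free -= granted        # wake that requester (elided)

One way the slide's deadlock caveat can arise is visible here: if every queued request is larger than anything the running stages will ever release, no dequeue can happen.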

Page 33: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Outline

- Architecture Overview
- Memory Management
- Fault Tolerance
- Evaluation

Page 34: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Fault Tolerance

- Cluster MTTF is determined by cluster size and node MTTF
  - Node MTTF ~ 4 months (Google, OSDI '10)

  Cluster                  Nodes     Cluster MTTF   Failures    Implication
  Google                   10,000+   ~2 minutes     Common      Job must survive faults
  Average Hadoop cluster   30-200    ~80 hours      Uncommon    OK to just re-run the job

Page 35: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Why are Smaller Clusters OK?

- Hardware trends let you do more with less:
  - Increased hard drive density (32 TB in 2U)
  - Faster bus speeds
  - 10 Gbps Ethernet at the end host
  - Larger core counts
- A small cluster can store and process petabytes
- But is it really OK to just restart on failure?

Page 36: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Analytical Modeling

- Modeling goal: when is the performance gain nullified by the cost of restarting on failure?
- Example, assuming a 2x improvement:
  - 30-minute jobs: ~4,300 nodes
  - 2.5-hour jobs: ~800 nodes
- Themis can sort 100 TB on 52 machines in ~2.5 hours, so petabyte scale is easily possible in these regions
- See the paper for additional discussion
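The break-even question can be framed with a standard restart-on-failure model. A rough, illustrative sketch (Python; this formulation and its constants are my own assumptions, not necessarily the paper's exact model):

    # Expected completion time of a job of length t on n nodes with node
    # MTTF m, if any failure forces a restart from scratch (exponential
    # failures): E[T] = (exp(n*t/m) - 1) * m / n.
    import math

    def expected_runtime(t_hours, n_nodes, mttf_hours):
        rate = n_nodes / mttf_hours              # cluster-wide failure rate
        return (math.exp(rate * t_hours) - 1) / rate

    def break_even_nodes(t_slow_hours, speedup=2.0, mttf_hours=4 * 30 * 24):
        """Largest cluster where a `speedup`x-faster restart-on-failure
        job still finishes before a fault-tolerant job running at 1x."""
        t_fast = t_slow_hours / speedup
        n = 1
        while expected_runtime(t_fast, n, mttf_hours) < t_slow_hours:
            n += 1
        return n - 1

    print(break_even_nodes(0.5))   # 30-minute job, 2x improvement
    print(break_even_nodes(2.5))   # 2.5-hour job, 2x improvement

The exact node counts depend on the model's assumptions; the slide's figures (~4,300 and ~800 nodes) come from the paper's own analysis.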

Page 37: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Outline

- Architecture Overview
- Memory Management
- Fault Tolerance
- Evaluation

Page 38: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Workload

- Sort: uniform and highly-skewed key distributions
- CloudBurst: short-read gene alignment (ported from the Hadoop implementation)
- PageRank: synthetic graphs and Wikipedia
- Word count
- n-gram count (5-grams)
- Session extraction from synthetic logs

Pages 39-41: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Performance

[Performance charts; the data is not preserved in this transcript]

Page 42: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Performance vs. Hadoop

  Application   Hadoop runtime (sec)   Themis runtime (sec)   Improvement
  Sort-500G     28,881                 1,789                  16.14x
  CloudBurst    2,878                  944                    3.05x

Page 43: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Summary

- Themis: a MapReduce implementation with the 2-IO property
- Avoids swapping and spilling by carefully managing memory
- Makes the case for job-level fault tolerance
- Executes a wide variety of workloads at extremely high speed

Page 44: Themis: An I/O-Efficient MapReduce (SoCC 2012)

Themis: Questions?

http://themis.sysnet.ucsd.edu/