
Systems for Data-Intensive Parallel Computing (Lecture by Mihai Budiu)


This course will cover fundamental principles and techniques for building large-scale data parallel batch processing systems, with examples based on the DryadLINQ software stack, developed at Microsoft Research. We will start by discussing LINQ (language-integrated query), a functional data-parallel language, highlighting features which make it particularly suitable for large-scale parallelization. We then discuss the software architecture of a compiler/runtime for running LINQ on large-scale clusters (thousands of machines), highlighting distributed storage, distributed reliable execution, and compilation.
Transcript
Page 1: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Systems for Data-Intensive Cluster Computing

Mihai Budiu

Microsoft Research, Silicon Valley

ALMADA Summer School

Moscow, Russia

August 2 & 6, 2013

Page 2: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

About Myself

2

• http://budiu.info/work
• Senior Researcher at Microsoft Research in Mountain View, California
• Ph.D. in Computer Science from Carnegie Mellon, 2004
• Worked on reconfigurable hardware, compilers, security, cloud computing, performance analysis, data visualization

Page 3: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Lessons to remember

Will use this symbol throughout the presentation to point out fundamental concepts and ideas.

3

This talk is not about specific software artifacts. It is a talk about principles, illustrated with some existing implementations.

Page 4: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

4

BIG DATA

Page 5: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

500 Years Ago

5

Tycho Brahe (1546-1601)

Johannes Kepler (1571-1630)

Page 6: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

The Laws of Planetary Motion

6

Tycho’s measurements Kepler’s laws

Page 7: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

The Large Hadron Collider

7

25 PB/year; WLCG (Worldwide LHC Computing Grid): 200K computing cores

Page 8: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Genetic Code

8

Page 9: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Astronomy

9

Page 10: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Weather

10

Page 11: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

The Webs

11

Internet

Facebook friends graph

Page 12: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Big Data

12

Page 13: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Big Computers

13

Page 14: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Presentation Structure

• Friday
  – Introduction
  – LINQ: a language for data manipulation; homework
  – A Software Stack for Data-Intensive Manipulation (Part 1)
    • Hardware, Management, Cluster, Storage, Execution

• Tuesday
  – Thinking in Parallel
  – A Software Stack for Data-Intensive Manipulation (Part 2)
    • Language, Application
  – Conclusions

14

Page 15: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

LINQ: LANGUAGE-INTEGRATED QUERY

15

Page 16: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

LINQ

• Language extensions to .Net

– (We will use C#)

• Strongly typed query language

– Higher-order query operators

– User-defined functions

– Arbitrary .Net data types

16

Page 17: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

LINQ Availability

• On Windows: Visual Studio 2008 and later, including VS Express (free)

– http://www.microsoft.com/visualstudio/eng/downloads#d-2012-editions

• On other platforms: open-source

– http://mono-project.com/Main_Page

• Java Streams, planned for Java 8, are similar

17

Page 18: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Collections

18

List<T>

Elements of type T; iterator (current element)

Generic type (parameterized)

Page 19: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

19

LINQ = .Net + Queries

Collection<T> collection;

bool IsLegal(Key k);

string Hash(Key k);

var results = collection.Where(c => IsLegal(c.key)).Select(c => new { hash = Hash(c.key), c.value });

Page 20: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Essential LINQ Summary

20

Where (filter)

Select (map)

GroupBy

OrderBy (sort)

Aggregate (fold)

Join

Input
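To make these concrete, here is a small stand-alone sketch (mine, not from the slides) that touches all six operators on a tiny collection; it assumes the usual using System and using System.Linq directives:

var data = Enumerable.Range(0, 10);                       // input: 0..9
var evens   = data.Where(d => d % 2 == 0);                // Where (filter): 0,2,4,6,8
var squares = evens.Select(d => d * d);                   // Select (map): 0,4,16,36,64
var groups  = squares.GroupBy(d => d % 3);                // GroupBy: 0 -> {0,36}, 1 -> {4,16,64}
var sorted  = squares.OrderBy(d => -d);                   // OrderBy (sort): 64,36,16,4,0
var total   = squares.Aggregate((a, b) => a + b);         // Aggregate (fold): 120
var matches = data.Join(squares, d => d, s => s, (d, s) => d);   // Join on equal values: 0,4
Console.WriteLine("{0}: {1}", total, string.Join(",", matches)); // prints "120: 0,4"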

Page 21: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Tutorial Scope

• We only discuss essential LINQ

• We do not discuss in detail:
  – most operators (> 100 of them, including variants)

– lazy evaluation

– yield

– Infinite streams

– IQueryable, IObservable

21

Page 22: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Running the Code

• Code can be downloaded at http://budiu.info/almada-code.zip

• Includes the code on all the following slides (file Slides.cs)

• Includes the homework exercises (details later)

• Includes testing data for the homework

• Contains a Visual Studio Solution file TeachingLinq.sln

• compileMono.bat can be used for Mono

22

Page 23: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Useful .Net Constructs

23

Page 24: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Tuples

Tuple<int, int> pair = new Tuple<int, int>(5, 3);

Tuple<int, string, double> triple = Tuple.Create(1, "hi", 3.14);

Console.WriteLine("{0} {1} {2}", pair, triple, pair.Item1);

24

Result: (5, 3) (1, hi, 3.14) 5

Page 25: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Anonymous Functions

• Func<I, O>

– Type of function with

• input type I

• output type O

• x => x + 1

– Lambda expression

– Function with argument x, which computes x+1

25

Page 26: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Using Anonymous Functions

Func<int, int> inc = x => x + 1;

Console.WriteLine(inc(10));

26

Result: 11

Page 27: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Functions as First-Order Values

static T Apply<T>(T argument, Func<T, T> operation)

{

return operation(argument);

}

Func<int, int> inc = x => x + 1;

Console.WriteLine(Apply(10, inc));

27

Result: 11

Page 28: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Multiple Arguments

Func<int, int, int> add = (x, y) => x + y;

Console.WriteLine(add(3,2));

28

Result: 5

Page 29: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Collections

29

Page 30: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

IEnumerable<T>

IEnumerable<int> data = Enumerable.Range(0, 10);

foreach (int n in data)

Console.WriteLine(n);

Program.Show(data);

30

Result: 0,1,2,3,4,5,6,7,8,9

Helper function I wrote to help you debug.

Page 31: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

IEnumerables everywhere

• List<T> : IEnumerable<T>

• T[] : IEnumerable<T>

• string : IEnumerable<char>

• Dictionary<K,V> : IEnumerable<KeyValuePair<K,V>>

• ICollection<T> : IEnumerable<T>

• Infinite streams

• User-defined

31
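As a small illustration (my own sketch, not from the slides; it assumes using System, System.Collections.Generic and System.Linq), the same operators work on all of these collection types:

int[] numbers = { 3, 1, 4, 1, 5 };
string text = "hello";
var ages = new Dictionary<string, int> { { "ann", 30 }, { "bob", 25 } };

Console.WriteLine(numbers.Where(n => n > 2).Count());                  // 3   (array as IEnumerable<int>)
Console.WriteLine(text.Count(c => c == 'l'));                          // 2   (string as IEnumerable<char>)
Console.WriteLine(ages.Select(kv => kv.Key).OrderBy(k => k).First());  // ann (dictionary as IEnumerable<KeyValuePair>)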

Page 32: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Printing a Collection

static void Print<T>(IEnumerable<T> data)

{

foreach (var v in data)

Console.Write("{0} ", v);

Console.WriteLine();

}

32

Page 33: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

LINQ Collection Operations

33

Page 34: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Select

IEnumerable<int> result = data.Select(d => d + 1);

Show(result);

Note: everyone else calls this function “Map”

34

Data: 0,1,2,3,4,5,6,7,8,9

Result: 1,2,3,4,5,6,7,8,9,10

Input and output are both collections.

Page 35: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Select can transform types

var result = data.Select(d => d > 5);

Show(result);

35

Result: False,False,False,False,False,False,True,True,True,True

Page 36: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Separation of Concerns

var result = data.Select(d => d>5);

36

Operator applies to any collection

User-defined function (UDF): transforms the elements

Page 37: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Chaining operators

37

var result = data.Select(d => d * d).Select(d => Math.Sin(d)).Select(d => d.ToString());

All these are collections.

Result: "0.00","0.84","-0.76","0.41","-0.29","-0.13","-0.99","-0.95","0.92","-0.63"

Page 38: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Generic Operators

IEnumerable<Dest> Select<Src, Dest>(this IEnumerable<Src>, Func<Src, Dest>)

38

Output type

Input type

Transformation function
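To make the shape of such a generic operator concrete, here is a hedged sketch of a home-made Select written as an extension method (MySelect is a hypothetical name, not the real Enumerable.Select, which is considerably more elaborate); it assumes using System and using System.Collections.Generic:

static class MyOperators
{
    // Same shape as the signature above: IEnumerable<Src> in, a Src -> Dest transformation,
    // IEnumerable<Dest> out. The 'this' modifier enables the fluent dot syntax.
    public static IEnumerable<Dest> MySelect<Src, Dest>(
        this IEnumerable<Src> source, Func<Src, Dest> transform)
    {
        foreach (var item in source)
            yield return transform(item);   // lazily yields one output element per input element
    }
}

// usage: Enumerable.Range(0, 3).MySelect(x => x * 10) yields 0, 10, 20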

Page 39: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Filtering with Where

var result = data.Where(d => d > 5);

Show(result);

39

Result: 6,7,8,9

Page 40: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Aggregations

var ct = data.Count();

var sum = data.Sum();

Console.WriteLine("{0} {1}", ct, sum);

40

Result: 10 45

Page 41: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

General Aggregations

var sum = data.Aggregate((a, b) => a + b);

Console.WriteLine("{0}", sum);

41

((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9

Result: 45

Page 42: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Aggregations with Seeds

var sum = data.Aggregate(
    "X",  // seed
    (a, b) => string.Format("({0} + {1})", a, b));

Console.WriteLine("{0}", sum);

42

Result: “((((((((((X + 0) + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)”

Page 43: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Values can be Collections

var lines = new string[]
    { "First string",
      "Second string" };

IEnumerable<int> counts = lines.Select(d => d.Count());

Show(counts);

43

(d as IEnumerable<char>).Count()

Result: 12,13

Page 44: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Flattening Collections with SelectMany

IEnumerable<string> words = lines.SelectMany(d => d.Split(' '));

IEnumerable<string[]> phrases = lines.Select(d => d.Split(' '));

44

Page 45: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

SelectMany

var lines = new string[]
    { "First string",
      "Second string" };

lines.SelectMany(d => d.Split(' '));
Result: { "First", "string", "Second", "string" }

lines.Select(d => d.Split(' '));
Result: { { "First", "string" }, { "Second", "string" } }

45

This is a collection of collections!

Page 46: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

GroupBy

IEnumerable<IGrouping<int, int>> groups = data.GroupBy(d => d % 3);

46

Key function

Result:
0 => {0,3,6,9}
1 => {1,4,7}
2 => {2,5,8}
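To see both the key and the members of each group, one can iterate the nested structure directly; a small sketch (mine, not from the slides), with data being the 0..9 collection from earlier:

foreach (var g in data.GroupBy(d => d % 3))
    Console.WriteLine("{0} => {1}", g.Key, string.Join(",", g));
// prints:
// 0 => 0,3,6,9
// 1 => 1,4,7
// 2 => 2,5,8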

Page 47: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

IGrouping

var result = data.GroupBy(d => d%3)

.Select(g => g.Key);

47

Result: 0,1,2

Page 48: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Groups are Nested Collections

IGrouping<TKey, TVals> : IEnumerable<TVals>

var result = data.GroupBy(d => d%3)

.Select(g => g.Count());

48

Result: 4,3,3

LINQ computation on each group

Page 49: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Sorting

var sorted = lines.SelectMany(l => l.Split(' '))
                  .OrderBy(l => l.Length);

Show(sorted);

49

Sorting key

Result: "First","string","Second","string"

Page 50: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Joins

var L = Enumerable.Range(0, 4);

var R = Enumerable.Range(0, 3);

var result = L.Join(R, l => l % 2, r => (r + 1) % 2, (l, r) => l + r);

50

result (cells are l + r where the left key equals the right key):

  L values:  0      1      2      3      4
  L keys:    0      1      0      1      0

  R=0 (key 1):        1+0=1        3+0=3
  R=1 (key 0):  0+1=1        2+1=3        4+1=5
  R=2 (key 1):        1+2=3        3+2=5
  R=3 (key 0):  0+3=3        2+3=5        4+3=7

Left key

Right key

Join function

Page 51: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Join Example

var common =
    words.Join(data, s => s.Length, d => d,
               (s, d) => new { length = d, str = s });

51

Left input: words; Right input: data

Result:
{ length = 5, str = First }
{ length = 6, str = string }
{ length = 6, str = Second }
{ length = 6, str = string }

Page 52: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Other Handy Operators

• collection.Max(): maximum element

• left.Concat(right): concatenation

• collection.Take(n): first n elements

52

Page 53: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Take

First few elements in a collection

var first5 = data

.Select(d => d * d)

.Take(5);

53

Result: 0,1,4,9,16

Page 54: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Homework

54

• http://budiu.info/almada-code.zip
• You only need the LINQ operations taught in this class
• Solutions should only use LINQ (no loops)
• Fill in the 12 functions in class Homework, file Homework.cs
• Testing data given in Program.cs; Main is there
• Calling ShowResults(homework, data) will run your homework on the data (in Program.cs)
• You can use the function Program.Show() to debug/display complex data structures

Page 55: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

– Slides.cs: code from all these slides

– Program.cs: main program, and sample testing data

– Tools.cs: code to display complex C# objects

– Homework.cs: skeleton for 12 homework problems

You must implement these 12 functions.

– (Solutions.cs: sample solutions for homework)

55

Page 56: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Exercises (1)

1. Count the number of words in the input collection (use the supplied SplitStringIntoWords function to split a line into words)

2. Count the number of unique (distinct) words in the input collection. “And” = “and” (ignore case).

3. Find the most frequent 10 words in the input collection, and their counts. Ignore case.

56
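As a hint for the shape of a solution to exercise 3, a hedged sketch using only the operators above (SplitStringIntoWords is the supplied helper; lines stands for the input collection, and the exact names in Homework.cs may differ):

var top10 = lines
    .SelectMany(l => SplitStringIntoWords(l))                  // all words in the input
    .GroupBy(w => w.ToLower())                                 // ignore case
    .Select(g => new { word = g.Key, count = g.Count() })      // (word, count) pairs
    .OrderByDescending(t => t.count)                           // most frequent first
    .Take(10);                                                 // keep the top 10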

Page 57: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Exercises (2)

4. From each line of text extract just the vowels (‘a’, ‘e’, ‘i’, ‘o’, ‘u’), ignoring letter case.

5. Given a list of words find the number of occurrences of each of them in the input collection (ignore case).

6. Generate the Cartesian product of two sets.

57
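Exercise 6 is essentially a one-line SelectMany; a hedged sketch (setA and setB are placeholder names for the two input sets):

var product = setA.SelectMany(a => setB.Select(b => Tuple.Create(a, b)));   // all (a, b) pairs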

Page 58: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Exercises (3)

7. Given a matrix of numbers, find the largest value on each row. (Matrix is a list of rows, each row is a list of values.)

8. A sparse matrix is represented as triples (colNumber,rowNumber,value). Compute the matrix transpose. (Indexes start at 0.)

9. Add two sparse matrices.

58

Page 59: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Exercises (4)

10. Multiply two sparse matrices.

11. Consider a directed graph (V, E). Node names are numbers. The graph is given by the set E, a list of edges Tuple<int, int>. Find all nodes reachable in exactly 5 steps starting from node 0.

59
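One way to attack exercise 11 without loops (a sketch under the stated representation, not the official solution in Solutions.cs): keep the frontier of nodes reachable in exactly k steps and advance it five times with Aggregate, following edges via Join (E is the list of Tuple<int, int> edges, Item1 = source, Item2 = destination):

var reachable5 = Enumerable.Range(0, 5).Aggregate(
    (IEnumerable<int>)new[] { 0 },                            // exactly 0 steps: just node 0
    (frontier, _) => frontier
        .Join(E, n => n, e => e.Item1, (n, e) => e.Item2)     // follow every out-edge once
        .Distinct());                                         // nodes reachable in exactly one more step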

Page 60: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Exercises (5)

12. Compute two iteration steps of the pagerank of a directed graph (V, E). Each node n ∈ V has a weight W(n), initially 1/|V|, and an out-degree D(n) (the number of out-edges of n). Each algorithm iteration changes the weights of all nodes n as follows:

W′(n) = Σ_{(m,n) ∈ E} W(m) / D(m)

60
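One possible LINQ phrasing of a single iteration (a sketch under assumptions: W is a collection of Tuple<int, double> pairs (node, weight), E is the edge list of Tuple<int, int>, and D(m) is recomputed from E; apply it twice for the exercise, and note that the homework's actual data shapes may differ):

// out-degree D(m) for every node m that has at least one out-edge
var degrees = E.GroupBy(e => e.Item1)
               .Select(g => new { node = g.Key, deg = g.Count() });

// contributions W(m)/D(m) flow along each edge (m, n) and are summed per target n
var newW = W.Join(degrees, w => w.Item1, d => d.node,
                  (w, d) => new { node = w.Item1, share = w.Item2 / d.deg })
            .Join(E, s => s.node, e => e.Item1,
                  (s, e) => new { target = e.Item2, share = s.share })
            .GroupBy(c => c.target)
            .Select(g => Tuple.Create(g.Key, g.Sum(c => c.share)));   // the new (node, weight) pairs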

Page 61: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

61

Crunching Big Data

Page 62: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Design Space

62

Throughput (batch)

Latency (interactive)

Internet

Datacenter

Data-parallel

Shared memory

Page 63: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Software Stack

63

Machine

Cluster

Storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Higher abstractions

Page 64: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

64

Layers compared: Application, Language, Data-Parallel Computation (Execution), Storage

• Parallel databases: SQL (language)
• Google: Sawzall, FlumeJava (languages: Sawzall, Java); Map-Reduce (execution); GFS, BigTable (storage)
• Hadoop: Pig, Hive (≈SQL); Hadoop (execution); HDFS (storage)
• Microsoft: DryadLINQ, Scope (languages: LINQ, SQL); Dryad (execution); Cosmos (storage)

Page 65: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

CLUSTER ARCHITECTURE

65

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Page 66: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Cluster Machines

• Commodity server-class systems

• Optimized for cost

• Remote management interface

• Local storage (multiple drives)

• Multi-core CPU

• Gigabit+ Ethernet

• Stock OS

66

Page 67: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Cluster network topology

Figure: machines in a rack connect to a top-of-rack switch; top-of-rack switches connect to a top-level switch, which connects to the next-level switch

Page 68: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

The secret of scalability

• Cheap hardware

• Smart software

• Berkeley Network of Workstations (’94-’98): http://now.cs.berkeley.edu

68

Page 69: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

DEPLOYMENT: AUTOPILOT

69

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Autopilot: Automatic Data Center Management, Michael Isard, in Operating Systems Review, vol. 41, no. 2, pp. 60-67, April 2007

Page 70: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Autopilot goal

• Handle routine tasks automatically

• Without operator intervention

70

Page 71: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Autopiloted System

71

Autopilot services

Application

Autopilot control

Page 72: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Recovery-Oriented Computing

72

• Everything will eventually fail

• Design for failure

• Crash-only software design

• http://roc.cs.berkeley.edu

Brown, A. and D. A. Patterson. Embracing Failure: A Case for Recovery-Oriented Computing (ROC). High Performance Transaction Processing Symposium, October 2001.

Page 73: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Autopilot Architecture

73

Device Manager

Provisioning Deployment Watchdog Repair

Operational data collection and visualization

• Discover new machines; netboot; self-test
• Install application binaries and configurations
• Monitor application health
• Fix broken machines

Page 74: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Distributed State

74

Device Manager

Provisioning Deployment Watchdog Repair

Centralized state: replicated, strongly consistent

Distributed state: replicated, weakly consistent

Page 75: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Problems of Distributed State

75

Figure: State and a Copy of the State, each receiving state updates and state queries

Consistency model = contract

Page 76: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Centralized Replicated Control

• Keep essential control state centralized

• Replicate the state for reliability

• Use the Paxos consensus protocol (ZooKeeper is the open-source alternative)

76

Paxos

Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133-169, May 1998.

Page 77: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Consistency Models

Strong consistency

• Expensive to provide

• Hard to build right

• Easy to understand

• Easy to program against

• Simple application design

Weak consistency

• Increases availability

• Many different models

• Easy to misuse

• Very hard to understand

• Conflict management in application

77

Page 78: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Autopilot abstraction

78

Machine Machine Machine Machine

Deployment

Self-healing machines

Page 79: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

CLUSTER SERVICES

79

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Page 80: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Cluster Machines

80

Name service, Scheduling

Remote execution and Storage (on each machine)

Page 81: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Cluster Services

• Name service: discover cluster machines

• Scheduling: allocate cluster machines

• Storage: file contents

• Remote execution: spawn new computations

81

Page 82: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Cluster services abstraction

82

Machine

Cluster services

Machine Machine Machine

Deployment

Reliable specialized machines

Page 83: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Layered Software Architecture

• Simple components

• Software-provided reliability

• Versioned APIs

• Design for live staged deployment

83

Page 84: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

DISTRIBUTED STORAGE

84

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Page 85: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Bandwidth hierarchy

85

Cache RAM Local disks

Local rack

Remoterack

Remote datacenter

Page 86: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Today’s disks

86

2TB

100MB/s

Application

Page 87: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Storage bandwidth

87

SAN: expensive; fast network needed; limited by network bandwidth

JBOD: cheap network; cheap machines; limited by disk bandwidth

Page 88: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Time to read 1TB (sequential)

• 1 TB / 100MB/s = 3 hours

• 1 TB / 10 Gbps = 40 minutes

• 1 TB / (100 MB/s/disk x 10000 disks) = 1s

• (1000 machines x 10 disks x 1TB/disk = 10PB)

88

Page 89: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Send the application to the data!

89

2TB

100MB/s

Application

Page 90: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Large-scale Distributed Storage

90

Figure: a storage metadata service tracks file F; its parts F[0] and F[1] are replicated across the storage machines

Page 91: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Parallel Application I/O

91

Storage StorageStorage

F[0] F[1]

Storage metadata service

File F

App App App

ctrl

lookup

Page 92: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Cluster Storage Abstraction

92

Set of reliable machines with a global filesystem

Machine

Cluster services

Cluster storage

Machine Machine Machine

Deployment

Page 93: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

DISTRIBUTED EXECUTION:DRYAD

93

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Dryad painting by Evelyn de Morgan

Page 94: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Dryad = Execution Layer

94

Job (application)

Dryad

Cluster

Pipeline

Shell

Machine

Page 95: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

2-D Piping

• Unix Pipes: 1-D

  grep | sed | sort | awk | perl

• Dryad: 2-D

  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  (exponents denote the number of instances of each stage)

95

Page 96: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Virtualized 2-D Pipelines

96

Page 97: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Virtualized 2-D Pipelines

97

Page 98: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Virtualized 2-D Pipelines

98

Page 99: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Virtualized 2-D Pipelines

99

Page 100: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Virtualized 2-D Pipelines

100

• 2D DAG
• multi-machine
• virtualized

Page 101: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Dryad Job Structure

101

Figure: a Dryad job is a dataflow graph; input files feed vertices (processes) running programs such as grep, sed, sort, awk, and perl; vertices are connected by channels, grouped into stages, and write output files

Page 102: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Channels

102

Figure: items flow over a channel between two vertices (X and M)

Finite streams of items:
• files
• TCP pipes
• memory FIFOs

Page 103: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Dryad System Architecture

103

Figure: control plane: the graph manager sends the job schedule to the cluster (name server NS, scheduler Sched, and remote execution services RE, which run the vertices V); data plane: vertices exchange data over files, TCP, FIFOs, and the network

Page 104: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Separate Data and Control Plane

104

• Different kinds of traffic
  • Data = bulk, pipelined
  • Control = interactive

• Different reliability needs

Page 105: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Centralized control

• Manager state is not replicated

• Entire manager state is held in RAM

• Simple implementation

• Vertices use leases: no runaway computations on manager crash

• Manager crash causes complete job crash

105

Page 106: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Staging

1. Build the graph manager (GM) code and vertex code
2. Send the .exe to the cluster
3. Start the manager
4. Query cluster resources (name server)
5. Generate the graph
6. Initialize vertices (remote execution service)
7. Serialize vertices
8. Monitor vertex execution

Page 107: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Scaling Factors

• Understand how fast things scale

– # machines << # vertices << # channels

– # control bytes << # data bytes

• Understand the algorithm cost

– O(# machines²) acceptable, but O(# edges²) not

• Every order-of-magnitude increase will reveal new bugs

107

Page 108: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Fault Tolerance

Page 109: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Danger of Fault Tolerance

• Fault tolerance can mask defects in other software layers

• Log fault repairs

• Review the logs periodically

109

Page 110: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Failures

• Fail-stop (crash) failures are easiest to handle

• Many other kinds of failures possible

– Very slow progress

– Byzantine (malicious) failures

– Network partitions

• Understand the failure model for your system

– probability of each kind of failure

– validate the failure model (measurements)

110

Page 111: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Dynamic Graph Rewriting

Figure: completed vertices X[0], X[1], X[3]; slow vertex X[2]; duplicate vertex X'[2]

Duplication Policy = f(running times, data volumes)

Page 112: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Dynamic Aggregation

112

Figure: a static plan (sources S feeding target T) is rewritten dynamically: aggregation vertices A are inserted per rack (#1, #2, #3) between the source vertices S and the target T

Page 113: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Separate policy and mechanism

• Implement a powerful and generic mechanism

• Leave policy to the application layer

• Trade-off in policy language: power vs. simplicity

113

Page 114: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Policy vs. Mechanism

Policy (application-level):
• Most complex in C++ code
• Invoked with upcalls
• Need good default implementations
• DryadLINQ provides a comprehensive set

Mechanism (built-in):
• Scheduling
• Graph rewriting
• Fault tolerance
• Statistics and reporting

114

Page 115: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Dryad Abstraction

115

Machine

Cluster services

Cluster storage

Machine Machine Machine

Deployment

Distributed Execution

Reliable machine running distributed jobs with "infinite" resources

Page 116: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Thinking in Parallel

116

Page 117: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Computing

117

Image from: http://www.dayvectors.net/g2012m02.html

1930s

Page 118: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Logistics

118

Legend: input data; output data; communication

Communication is expensive

Page 119: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Playing Cards

119

Clubs Diamonds Hearts Spades

suits

2,3,4,5,6,7,8,9,10,J,Q,K,A (our order)

Page 120: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Count the number of cards

120

Counting, or aggregation

Page 121: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Keep only the even cards

121

2,3,4,5,6,7,8,9,10,J,Q,K,A

Filter

Page 122: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Count the Number of Figures

122

Filter and Aggregate

Page 123: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Group cards by suit

123

GroupBy

Page 124: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Which cards are missing?

124

2,3,4,5,6,7,8,9,10,J,Q,K,A

GroupBy-Aggregate = Histogram

Page 125: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Number of cards of each suit

125

GroupBy-Aggregate

Page 126: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Sort the cards

126

2,3,4,5,6,7,8,9,10,J,Q,K,A

lexicographic order

Sort

Page 127: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Cards with same color & number

127

Join

Page 128: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

DRYADLINQ

128

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Page 129: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

DryadLINQ = Dryad + LINQ

129

Distributed computations

Computations on collections

Page 130: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Distributed Collections

130

Partition

Collection

.Net objects

Page 131: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Collection = Collection of Collections

131

Collection<T>

Collection<T>

Page 132: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

LINQ

132

Dryad

=> DryadLINQ

Page 133: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = collection.Where(c => IsLegal(c.key))
                        .Select(c => new { hash = Hash(c.key), c.value });

133

DryadLINQ = LINQ + Dryad

C#

collection

results

C# C# C#

Vertex code

Query plan (Dryad job)

Data

Page 134: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

DryadLINQ source code

• https://github.com/MicrosoftResearchSVC/Dryad

• Apache license

• Runs on Hadoop YARN

• Research prototype code quality

134

Page 135: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

DryadLINQ Abstraction

135

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

.Net/LINQ with “infinite” resources

Page 136: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Demo

136

Page 137: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Example: counting lines

var table = PartitionedTable.Get<LineRecord>(file);
int count = table.Count();

Parse, Count

Sum

Page 138: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Example: counting words

var table = PartitionedTable.Get<LineRecord>(file);
int count = table
    .SelectMany(l => l.line.Split(' '))
    .Count();

Parse, SelectMany, Count

Sum

Page 139: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Example: counting unique words

var table = PartitionedTable.Get<LineRecord>(file);
int count = table
    .SelectMany(l => l.line.Split(' '))
    .GroupBy(w => w)
    .Count();

GroupBy;Count

HashPartition

Page 140: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Example: word histogram

var table = PartitionedTable.Get<LineRecord>(file);
var result = table.SelectMany(l => l.line.Split(' '))
    .GroupBy(w => w)
    .Select(g => new { word = g.Key, count = g.Count() });

Plan stages: GroupBy; Count | HashPartition | GroupBy; Count

Page 141: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Example: high-frequency words

var table = PartitionedTable.Get<LineRecord>(file);
var result = table.SelectMany(l => l.line.Split(' '))
    .GroupBy(w => w)
    .Select(g => new { word = g.Key, count = g.Count() })
    .OrderByDescending(t => t.count)
    .Take(100);

Plan stages: Sort; Take | Mergesort; Take

Page 142: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Example: words by frequency

var table = PartitionedTable.Get<LineRecord>(file);
var result = table.SelectMany(l => l.line.Split(' '))
    .GroupBy(w => w)
    .Select(g => new { word = g.Key, count = g.Count() })
    .OrderByDescending(t => t.count);

Plan stages: Sample | Histogram | Broadcast | Range-partition | Sort

Page 143: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Example: Map-Reduce

public static IQueryable<S>

MapReduce<T,M,K,S>(

IQueryable<T> input,

Func<T, IEnumerable<M>> mapper,

Func<M,K> keySelector,

Func<IGrouping<K,M>,S> reducer)

{

var map = input.SelectMany(mapper);

var group = map.GroupBy(keySelector);

var result = group.Select(reducer);

return result;

}
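For instance, the word-histogram query from the earlier slides can be phrased through this helper; a hedged usage sketch (table is the partitioned input from the previous examples):

var histogram = MapReduce(
    table,                                              // input collection of lines
    l => l.line.Split(' '),                             // mapper: one line -> its words
    w => w,                                             // key selector: the word itself
    g => new { word = g.Key, count = g.Count() });      // reducer: one record per word group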

Page 144: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Probabilistic Index Maps

144

Images

features

Page 145: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Parallelizing on Cores

145

Query

DryadLINQ

PLINQ

Local query
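On a single machine the same kind of inner parallelism is available directly through PLINQ; a minimal sketch (my example, not from the slides), assuming using System and using System.Linq:

var data = Enumerable.Range(0, 1000000);
long sum = data.AsParallel()                 // fan the query out across the local cores
               .Select(d => (long)d * d)
               .Sum();                       // PLINQ merges the per-core partial sums
Console.WriteLine(sum);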

Page 146: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

More Tricks of the trade

• Asynchronous operations hide latency

• Management using distributed state machines

• Logging state transitions for debugging

• Compression trades off bandwidth for CPU

146

Page 147: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

BUILDING ON DRYADLINQ

147

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Page 148: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Job Visualization

148

Page 149: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Job Schedule

149

Page 150: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

CPU Utilization

150

Page 151: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

So, What Good is it For?

151

Page 152: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

152

Page 153: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Input Device

153

Page 154: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Data Streams

154

Audio

Video

Depth

Page 155: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

“You are the Controller”

155

Source: forums.steampowered.com

Page 156: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Projected IR pattern

156

Source: www.ros.org

Page 157: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Depth computation

157

Source: http://nuit-blanche.blogspot.com/2010/11/unsing-kinect-for-compressive-sensing.html

Page 158: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

The Body Tracking Problem

158

Page 159: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

XBox 360 Hardware

159

Source: http://www.pcper.com/article.php?aid=940&type=expert

• Triple Core PowerPC 970, 3.2GHz
  • Hyperthreaded, 2 threads/core
• 500 MHz ATI graphics card
  • DirectX 9.5

• 512 MB RAM

• 2005 performance envelope

• Must handle real-time vision AND a modern game

Page 160: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Tracking Pipeline

160

Pipeline: Sensor → Depth map → Background elimination / Player segmentation → Body Part Identification (using the Body Part Classifier) → Skeleton

Page 161: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

161

Page 162: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

From Depth Map to Body Parts

162

Depth map Body parts

Classifier

Xbox GPU

Page 163: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

What is the Shape of a Human?

body size, hair, body type, clothes, furniture, pets, FOV, angle

Page 164: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

How do You Recognize a Human?

164

“nose”

Page 165: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Learn from Examples

165

Classifier

Machine learning

Page 166: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Examples

166

Source: www.fanpop.com

Page 167: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Motion Capture (Mocap)

167

Page 168: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Rendered Avatar Models

168

Page 169: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Ground Truth Data

169

noise

Page 170: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Decision Trees

170

Figure: a decision tree; internal nodes compare features to be learned (<, >), and leaves give a probability (0-100%) for each body part (Head, Left hand, Chest, Right foot)

Page 171: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Accuracy

171

24 hours / 8-core machine

Page 172: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Learn from Many Examples

172

Decision Tree Classifier

Machine learning

Page 173: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Highly efficient parallelization

173

time

machine

Page 174: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

THE END

174

Page 175: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

The ‘60s

175

OS/360

Multics

ARPANET

PDP/8

Spacewars

Virtual memory

Time-sharing

(defun factorial (n)
  (if (<= n 1)
      1
      (* n (factorial (- n 1)))))

Page 176: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

What about the 2010’s?

176

Page 177: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Layers

177

Networking

Storage

Distributed Execution

Scheduling

Resource Management

Applications

Identity & Security

Caching and Synchronization

Programming Languages and APIs

Operating System

Page 178: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Pieces of the Global Computer

178

And many, many more…

Page 179: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

This lecture

179

Page 180: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

The cloud

• “The cloud” is evolving rapidly

• New frontiers are being conquered

• The face of computing will change forever

• There is still a lot to be done

180

Page 181: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Conclusion

181

=

Machine

Cluster services

Cluster storage

Distributed Execution

Language

Machine Machine Machine

Deployment

Applications

Page 182: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Bibliography (1)

182

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007.

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008.

Hunting for Problems with Artemis. Gabriela F. Creţu-Ciocârlie, Mihai Budiu, and Moises Goldszmidt. USENIX Workshop on the Analysis of System Logs (WASL), San Diego, CA, December 7, 2008.

DryadInc: Reusing Work in Large-Scale Computations. Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard. Workshop on Hot Topics in Cloud Computing (HotCloud), San Diego, CA, June 15, 2009.

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations. Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. ACM Symposium on Operating Systems Principles (SOSP), October 2009.

Quincy: Fair Scheduling for Distributed Computing Clusters. Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. ACM Symposium on Operating Systems Principles (SOSP), October 2009.

Page 183: Systems for Data-Intensive Parallel Computing  (Lecture by Mihai Budiu)

Bibliography (2)

Autopilot: Automatic Data Center Management. Michael Isard. Operating Systems Review, vol. 41, no. 2, pp. 60-67, April 2007.

Distributed Data-Parallel Computing Using a High-Level Programming Language. Michael Isard and Yuan Yu. International Conference on Management of Data (SIGMOD), July 2009.

SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. Very Large Databases Conference (VLDB), Auckland, New Zealand, August 23-28, 2008.

Incorporating Partitioning and Parallel Plans into the SCOPE Optimizer. Jingren Zhou, Per-Åke Larson, and Ronnie Chaiken. Proc. of the 2010 ICDE Conference (ICDE '10).

Nectar: Automatic Management of Data and Computation in Datacenters. Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, and Li Zhuang. Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), October 2010.

Optimus: A Dynamic Rewriting Framework for Execution Plans of Data-Parallel Computations. Qifa Ke, Michael Isard, and Yuan Yu. Proceedings of EuroSys 2013, April 2013.

183