From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ

Dec 19, 2015

Page 1: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

From LINQ to DryadLINQ

Michael Isard

Workshop on Data-Intensive Scientific Computing Using DryadLINQ

Page 2: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Overview

• From sequential code to parallel execution
• Dryad fundamentals
• Simple program example, plan for practicals

Page 3: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Distributed computation

• Single computer, shared memory
  – All objects always available for read and write
• Cluster of workstations
  – Each computer sees a subset of objects
  – Writes on one computer must be explicitly shared
• System automatically handles complexity
  – Needs some help

Page 4: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Data-parallel computation

• LINQ is high-level declarative specification
• Same action on entire collection of objects
• set.Select(x => f(x))
  – Compute f(x) on each x in set, independently
• set.GroupBy(x => key(x))
  – Group by unique keys, independently
• set.OrderBy(x => key(x))
  – Sort whole set (system chooses how)
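As a point of reference, the three operators above can be tried on an ordinary in-memory collection before any cluster is involved. This is a minimal sketch using only standard System.Linq; the functions F and Key are hypothetical stand-ins for the slide's f(x) and key(x).

```csharp
using System;
using System.Linq;

static class LinqOperators
{
    // Hypothetical stand-ins for the slide's f(x) and key(x).
    public static int F(int x) => x * x;
    public static int Key(int x) => x % 3;

    public static void Main()
    {
        int[] set = { 5, 3, 8, 1, 6 };

        // Select: compute F(x) for each x independently.
        Console.WriteLine(string.Join(" ", set.Select(x => F(x))));

        // GroupBy: collect the elements that share a key.
        foreach (var group in set.GroupBy(x => Key(x)))
            Console.WriteLine($"key {group.Key}: {string.Join(" ", group)}");

        // OrderBy: sort the whole collection by key.
        Console.WriteLine(string.Join(" ", set.OrderBy(x => Key(x))));
    }
}
```

Each call says what result is wanted and not how to loop over the data, which is what leaves a system free to choose an execution plan.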

Page 5: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Distributed cluster computing

• Dataset is stored on local disks of cluster

[Figure: the dataset set is split into eight partitions, set.0 to set.7, spread across the local disks of the cluster's computers.]

Page 7: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: the input set and the output set2, each drawn as a single collection.]

Page 8: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: set is stored as partitions set.0 to set.7 scattered across the cluster; set2 is produced as matching partitions set2.0 to set2.7.]

Page 9: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: eight independent copies of f. Copy i reads partition set.i and writes partition set2.i, for i = 0 to 7.]
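The partition-by-partition execution in the figure can be imitated on one machine. The sketch below is not DryadLINQ code: it assumes partitions are plain arrays, uses threads in place of cluster computers, and uses a hypothetical F for the slide's f.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class PartitionedSelect
{
    // Hypothetical stand-in for the slide's f(x).
    public static int F(int x) => x + 1;

    // Apply F to each partition separately. No partition reads another
    // partition's data, so the iterations may run in any order or in
    // parallel; each iteration writes only its own output slot.
    public static int[][] SelectByPartition(int[][] set)
    {
        var set2 = new int[set.Length][];
        Parallel.For(0, set.Length, i =>
        {
            set2[i] = set[i].Select(x => F(x)).ToArray();
        });
        return set2;
    }

    public static void Main()
    {
        // Eight partitions, as in the figure: set.0 to set.7.
        int[][] set = Enumerable.Range(0, 8)
                                .Select(i => new[] { 10 * i, 10 * i + 1 })
                                .ToArray();

        int[][] set2 = SelectByPartition(set);
        for (int i = 0; i < set2.Length; i++)
            Console.WriteLine($"set2.{i}: {string.Join(" ", set2[i])}");
    }
}
```

Concatenating the output partitions gives the same result as running Select over the whole collection at once, which is why no data has to move between computers for this operator.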

Page 11: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Distributed acyclic graph

• Computation reads and writes along edges
• Graph shows parallelism via independence
• Goals of DryadLINQ optimizer
  – Extract parallelism (find independent work)
  – Control data skew (balance work across nodes)
  – Limit cross-computer data transfer

Page 12: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Distributed grouping

var groups = set.GroupBy(x => x.key)

• set is a collection of records each with a key
• Don’t know what keys are present
  – Or in which partitions
• First, reorganize data
  – All records with same key on same computer
• Then can do final grouping in parallel
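The two stages can be sketched on one machine with ordinary collections. This is an illustration under stated assumptions, not the DryadLINQ implementation: records are reduced to their key (a single character), partitions are arrays, and the "hash" is simply the character code modulo the partition count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DistributedGrouping
{
    // Stage 1: send every record to the output partition chosen by a hash of
    // its key, so that equal keys always land in the same partition.
    public static List<char>[] HashPartition(char[][] set, int parts)
    {
        var output = Enumerable.Range(0, parts).Select(_ => new List<char>()).ToArray();
        foreach (char[] partition in set)
            foreach (char key in partition)
                output[key % parts].Add(key);   // toy hash: character code mod parts
        return output;
    }

    // Stage 2: group each output partition on its own. Partitions share no
    // keys after stage 1, so this step needs no further communication.
    public static ILookup<char, char>[] GroupLocally(List<char>[] partitions) =>
        partitions.Select(p => p.ToLookup(k => k)).ToArray();

    public static void Main()
    {
        // Four input partitions with keys scattered among them.
        char[][] set =
        {
            new[] { 'a', 'c' }, new[] { 'a', 'd' },
            new[] { 'd', 'b' }, new[] { 'b', 'a' },
        };

        var groups = GroupLocally(HashPartition(set, 2));
        for (int i = 0; i < groups.Length; i++)
            foreach (var g in groups[i])
                Console.WriteLine($"partition {i}, key {g.Key}: {g.Count()} record(s)");
    }
}
```

With this toy hash the keys a and c share one partition and b and d share the other. A real system would pick its own hash function and partition count.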

Page 14: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Distributed grouping

var groups = set.GroupBy(x => x.key)

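[Figure: the four partitions of set hold the keys (a, c), (a, d), (d, b) and (b, a). The stages are: hash partition by key, then group locally. Hash partitioning sends the a and c records to one computer and the b and d records to the other; grouping locally then yields the output partitions of groups: a a a c and b b d d.]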

Page 17: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

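Distributed sorting

var sorted = set.OrderBy(x => x.key)

[Figure: the four partitions of set hold the keys (100, 1), (1, 1), (2, 3) and (4, 1). The stages are: sample, compute histogram, range partition by key, sort locally. The histogram yields the key ranges [1,1] and [2,100]; range partitioning sends keys 1 1 1 1 to one computer and 100 2 3 4 to the other; sorting locally yields sorted = 1 1 1 1 2 3 4 100.]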
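The stages named in the figure can be sketched on one machine as follows. This is a toy model, not the DryadLINQ implementation: partitions are arrays, the "sample" takes every key because the input is tiny, and the boundary rule (equal-sized slices of the sorted sample) is an assumption chosen to reproduce the ranges in the figure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DistributedSort
{
    // Pick parts - 1 boundary keys that cut the sorted sample into equal
    // slices. Assumes the sample holds at least `parts` keys.
    public static int[] Boundaries(IEnumerable<int> sample, int parts)
    {
        int[] sorted = sample.OrderBy(k => k).ToArray();
        return Enumerable.Range(1, parts - 1)
                         .Select(i => sorted[i * sorted.Length / parts - 1])
                         .ToArray();
    }

    // Index of the first range whose upper boundary is >= key.
    public static int RangeOf(int key, int[] boundaries)
    {
        int i = 0;
        while (i < boundaries.Length && key > boundaries[i]) i++;
        return i;
    }

    public static int[] Sort(int[][] set, int parts)
    {
        // Sample and compute boundaries (toy: the sample is every key).
        int[] boundaries = Boundaries(set.SelectMany(p => p), parts);

        // Range partition by key: range i holds only keys below range i + 1.
        var ranges = Enumerable.Range(0, parts).Select(_ => new List<int>()).ToArray();
        foreach (int key in set.SelectMany(p => p))
            ranges[RangeOf(key, boundaries)].Add(key);

        // Sort locally; concatenating the ranges in order gives the full sort.
        return ranges.SelectMany(r => r.OrderBy(k => k)).ToArray();
    }

    public static void Main()
    {
        int[][] set = { new[] { 100, 1 }, new[] { 1, 1 }, new[] { 2, 3 }, new[] { 4, 1 } };
        Console.WriteLine(string.Join(" ", Sort(set, 2)));
    }
}
```

Sampling first is what keeps the ranges balanced: on this input the boundary falls at key 1, so each range receives four keys even though the key values are skewed.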

Page 20: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

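Additional optimizations

var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() })

[Figure: the four partitions of set hold the keys (a, b, b, a), (a, a, d, d), (b, d, b, d) and (a, b, b, a). The stages are: hash partition by key, group locally, count. Hash partitioning sends the six a records to one computer and the six b and four d records to the other; grouping and counting then yield histogram = (a,6), (b,6), (d,4).]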

Page 23: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

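var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() })

[Figure: the four partitions of set hold the keys (a, b, b, a), (a, a, d, d), (b, d, b, d) and (a, b, b, a). The stages are: group locally, hash partition by key, group locally, combine counts. Each input partition is first reduced to partial counts: (a,2 b,2), (a,2 d,2), (b,2 d,2) and (a,2 b,2). Hash partitioning then moves only these partial counts, a,2 a,2 a,2 to one computer and b,2 b,2 b,2 d,2 d,2 to the other. Combining counts yields histogram = (a,6), (b,6), (d,4).]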
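The benefit of counting before the hash partition can be checked with a small single-machine sketch. It assumes partitions are character arrays and treats "records shipped" as the number of items handed to the repartitioning step; both plans are written with standard System.Linq only and are not DryadLINQ code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Histogram
{
    // Plan 1: repartition every record, then group and count.
    public static Dictionary<char, int> Naive(char[][] set, out int shipped)
    {
        List<char> records = set.SelectMany(p => p).ToList();
        shipped = records.Count;
        return records.GroupBy(k => k).ToDictionary(g => g.Key, g => g.Count());
    }

    // Plan 2: count inside each input partition first, repartition only the
    // (key, partial count) pairs, then add up the partial counts per key.
    public static Dictionary<char, int> WithLocalCombine(char[][] set, out int shipped)
    {
        var partials = set
            .SelectMany(p => p.GroupBy(k => k).Select(g => (Key: g.Key, Count: g.Count())))
            .ToList();
        shipped = partials.Count;
        return partials.GroupBy(t => t.Key)
                       .ToDictionary(g => g.Key, g => g.Sum(t => t.Count));
    }

    public static void Main()
    {
        char[][] set = new[] { "abba", "aadd", "bdbd", "abba" }
            .Select(s => s.ToCharArray())
            .ToArray();

        var h1 = Naive(set, out int shipped1);
        var h2 = WithLocalCombine(set, out int shipped2);

        foreach (var key in h1.Keys.OrderBy(k => k))
            Console.WriteLine($"{key}: naive {h1[key]}, combined {h2[key]}");
        Console.WriteLine($"items shipped: naive {shipped1}, combined {shipped2}");
    }
}
```

On the figure's data both plans return the same histogram, while the second hands 8 partial counts to the repartitioning step instead of 16 records. The rewrite is valid here because counts can be added in any grouping.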

Page 24: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

What Dryad does

• Abstracts cluster resources
  – Set of computers, network topology, etc.
• Schedules DAG: chooses cluster computers
  – Fairly among competing jobs
  – So computation is close to data
• Recovers from transient failures
  – Rerun computations on machine or network fault
  – Speculate duplicates for slow computations

Page 25: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Resources are virtualized

• Each graph node is a process
  – Writes outputs to disk
  – Reads inputs from upstream nodes’ output files
• Graph generally larger than cluster
  – 1TB input, 250MB partition, 4000 parts
• Cluster is shared
  – Don’t size program for exact cluster
  – Use whatever share of resources is available
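The partition count on the slide follows from a single division, sketched below with decimal units assumed (1 TB = 10^12 bytes, 250 MB = 2.5 × 10^8 bytes). The cluster size of 240 computers and the one-process-per-computer rule are hypothetical, included only to show why the graph is generally larger than the cluster.

```csharp
using System;

static class GraphSize
{
    // Ceiling division for positive values.
    public static long CeilDiv(long a, long b) => (a + b - 1) / b;

    public static void Main()
    {
        long inputBytes = 1_000_000_000_000;   // 1 TB, decimal units assumed
        long partitionBytes = 250_000_000;     // 250 MB, decimal units assumed

        long parts = CeilDiv(inputBytes, partitionBytes);
        Console.WriteLine($"partitions: {parts}");

        // Hypothetical cluster: 240 computers, one process per computer at a time.
        long computers = 240;
        long waves = CeilDiv(parts, computers);
        Console.WriteLine($"rounds needed for one stage: {waves}");
    }
}
```

With these numbers a single stage has 4000 processes and needs 17 rounds on the hypothetical cluster, so the program cannot assume that every graph node runs at the same time.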

Page 26: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

What controls parallelism

• Initially based on partitioning of inputs

• After reorganization, system or user decides

Page 27: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

DryadLINQ-specific operators

• set = PartitionedTable.Get<T>(uri)
• set.ToPartitionedTable(uri)
• set.HashPartition(x => f(x), numberOfParts)
• set.AssumeHashPartition(x => f(x))
• [Associative] f(x) { … }
• RangePartition(…), Apply(…), Fork(…)
• [Decomposable], [Homomorphic], [Resource]
• Field mappings, Multiple partitioned tables, …

Page 28: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

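using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        // Location of the partitioned input table in TidyFS.
        public const string inputUri = @"tidyfs://datasets/Count/inputfile1.pt";

        static void Main(string[] args)
        {
            // Open the partitioned table of line records.
            PartitionedTable<LineRecord> table = PartitionedTable.Get<LineRecord>(inputUri);

            // Count the records in the table and print the result.
            Console.WriteLine("Lines: {0}", table.Count());

            // Wait for a key press before exiting.
            Console.ReadKey();
        }
    }
}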

Page 29: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

Form into groups

• 9 groups, one MSRI member per group
• Try to pick common interest for project later

Page 30: From LINQ to DryadLINQ Michael Isard Workshop on Data-Intensive Scientific Computing Using DryadLINQ.

sherwood-246 to sherwood-253, sherwood-255

d:\dryad\data\Workshop\DryadLINQ\samples
Count, Points, Robots

Cluster job browser
d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe

TidyFS (file system) browser
d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe