DryadLINQ
Distributed Computation
Smruti R. Sarangi
Department of Computer Science
Indian Institute of Technology
New Delhi, India
Distributed Batch Processing
Outline
1. Motivation
   - Basic Idea
   - Related Work
2. DryadLINQ Design
   - System Architecture
   - LINQ
   - Execution Plan Graph
   - Miscellaneous
3. Evaluation
   - Terasort
   - SkyServer
DryadLINQ

Current programming models for large-scale distributed programming:
- MapReduce
- MPI
- Microsoft Dryad

A DryadLINQ program:
- Uses LINQ expressions to specify side-effect-free transformations on data
- Is parallelized in portions by the Dryad system, which runs it on thousands of machines

Sorting a terabyte-scale data set takes 319 seconds on a 240-node system.
Basic Idea
LINQ Constructs
- LINQ (Language INtegrated Query) is a set of .NET constructs
- It provides support for imperative and declarative programming
- Programming languages supported: C#, F#, VB

Imperative programming: variables, loops, iterators, conditionals
Declarative programming: functors, type inference
Related Work
Parallel Databases
- Implement only declarative variants of SQL queries
- The query-oriented nature of SQL makes it hard to specify typical programming constructs

MapReduce
- Not very flexible
- Hard to perform operations such as sorting or database joins
- Lack of type support
Related Work - II
Domain-specific languages on top of MapReduce: Sawzall, Pig (Yahoo), Hive (Facebook)
- They are a combination of declarative constructs and iterative constructs
- However, they are not very flexible since their pattern is inherently based on SQL

How is DryadLINQ different?
- The computation is not dependent on the nature of underlying resources.
- Uses virtual execution plans
- Underlying computational resources can change dynamically (faults, outages, ...)
Overview of DryadLINQ
Structure of a Dryad job
- It is a directed acyclic graph (DAG)
- Each vertex is a program
- Each edge is a data channel that transmits a finite sequence of records at runtime
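The job structure above can be illustrated with a small Python sketch. This is not Dryad's API: `run_job`, the vertex names, and the in-memory lists standing in for data channels are all assumptions made for illustration. Each vertex is a function, each edge carries a finite list of records, and vertices run in a topological order of the DAG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_job(vertices, edges, sources):
    """Run a toy dataflow job (hypothetical helper, not part of Dryad).

    vertices: name -> function taking a list of input channels (lists of records)
    edges:    (producer, consumer) pairs; each edge is one data channel
    sources:  name -> input records for vertices that have no incoming edge
    """
    preds = {name: [] for name in vertices}
    for producer, consumer in edges:
        preds[consumer].append(producer)

    outputs = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        if preds[name]:
            channels = [outputs[p] for p in preds[name]]
        else:
            channels = [sources[name]]
        outputs[name] = vertices[name](channels)
    return outputs

# Two reader vertices feed one merge vertex.
vertices = {
    "read_a": lambda chans: list(chans[0]),
    "read_b": lambda chans: list(chans[0]),
    "merge": lambda chans: sorted(r for ch in chans for r in ch),
}
edges = [("read_a", "merge"), ("read_b", "merge")]
sources = {"read_a": [4, 1], "read_b": [3, 2]}

outputs = run_job(vertices, edges, sources)
print(outputs["merge"])  # [1, 2, 3, 4]
```

A real job manager would run independent vertices on different machines; the sketch runs them one after another only to keep the dependency order visible.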
Dryad System Architecture
It contains a centralized job manager whose role is:
- Instantiating a job's dataflow graph
- Scheduling processes
- Fault tolerance
- Job monitoring and management
- Transforming the job graph at runtime according to the user's instructions
DryadLINQ Execution Overview
1. The user runs a .NET application. It creates a DryadLINQ expression object that has deferred evaluation.
2. The application calls the method ToDryadTable. This method hands over the expression object to DryadLINQ.
3. DryadLINQ compiles the expression and makes an execution plan:
   1. Decomposition into sub-expressions
   2. Generation of code and data for Dryad nodes
   3. Generation of serialization and synchronization code
4. DryadLINQ invokes a custom Dryad job manager.
DryadLINQ Execution Overview - II
1. The job manager creates a job graph. It schedules and spawns the jobs.
2. Each node in the graph executes the program assigned to it.
3. When the program is done, it writes the data to the output table.
4. After the job manager terminates, DryadLINQ collates all the output and creates the DryadTable object.
5. Control returns to the user application.
   1. Dryad passes an iterator object to the table object.
   2. This can be passed to subsequent statements.
LINQ
- The base type is the interface IEnumerable<T>: an iterator over a set of objects of type T
- The programmer need not know the concrete data type behind an instance of IEnumerable<T>
- IQueryable<T> is a subtype of IEnumerable<T>
  - This is an unevaluated expression
  - It undergoes deferred evaluation
  - DryadLINQ creates a concrete class to implement the IQueryable expression at runtime
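Deferred evaluation can be sketched in a few lines of Python. The class below is a hypothetical stand-in for an unevaluated query object, not the .NET IQueryable<T> interface: `DeferredQuery`, `where`, `select`, and `plan` are names chosen for illustration. Operators are only recorded when the query is built, and they run when the query is finally iterated.

```python
class DeferredQuery:
    """Records operators; nothing runs until the query is iterated."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def where(self, predicate):
        # Return a new query with one more recorded operator.
        return DeferredQuery(self._source, self._ops + (("where", predicate),))

    def select(self, fn):
        return DeferredQuery(self._source, self._ops + (("select", fn),))

    def plan(self):
        # The recorded operators, available for inspection before execution.
        return [name for name, _ in self._ops]

    def __iter__(self):
        items = iter(self._source)
        for name, fn in self._ops:
            items = filter(fn, items) if name == "where" else map(fn, items)
        return items

calls = []

def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x

query = DeferredQuery(range(6)).where(lambda x: x % 2 == 0).select(square)
calls_before = list(calls)   # still empty: building the query ran nothing
result = list(query)         # iteration triggers evaluation
print(query.plan(), calls_before, result)  # ['where', 'select'] [] [0, 4, 16]
```

Because the whole expression is available before anything runs, a system like DryadLINQ can inspect and rewrite it into an execution plan instead of evaluating each operator eagerly.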
LINQ SQL Syntax Example
    // Join two tables: scoreTriples and staticRank
    var adjustedScoreTriples =
        from d in scoreTriples
        join r in staticRank on d.docID equals r.key
        select new QueryScoreDocIDTriple(d, r);

    var rankedQueries =
        from s in adjustedScoreTriples
        group s by s.query into g
        select TakeTopQueryResults(g);
LINQ OOP Syntax Example
    var adjustedScoreTriples =
        scoreTriples.Join(staticRank,
            d => d.docID, r => r.key,
            (d, r) => new QueryScoreDocIDTriple(d, r));

    var groupedQueries =
        adjustedScoreTriples.GroupBy(s => s.query);

    var rankedQueries =
        groupedQueries.Select(g => TakeTopQueryResults(g));
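To make the shape of this query concrete, here is a rough Python sketch of the same join, group, and select steps over small in-memory lists. The rows are invented, and two behaviours are assumptions made only for illustration: the joined record multiplies the score by the static rank (standing in for QueryScoreDocIDTriple), and the final step keeps the single best document per query (standing in for TakeTopQueryResults). The slides do not define either.

```python
from collections import defaultdict

# Hypothetical rows: (query, doc_id, score)
score_triples = [("dryad", 1, 9), ("dryad", 2, 4), ("linq", 1, 7)]
# Hypothetical static ranks: doc_id -> rank
static_rank = {1: 1, 2: 3}

# Join: match each triple with the rank of its document.
adjusted = [
    (query, doc_id, score * static_rank[doc_id])
    for (query, doc_id, score) in score_triples
    if doc_id in static_rank
]

# GroupBy: collect the adjusted rows of each query.
grouped = defaultdict(list)
for query, doc_id, score in adjusted:
    grouped[query].append((doc_id, score))

# Select: keep the highest-scoring (doc_id, score) pair per query.
ranked = {query: max(rows, key=lambda row: row[1]) for query, rows in grouped.items()}

print(ranked)  # {'dryad': (2, 12), 'linq': (1, 7)}
```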
DryadLINQ Constructs
- A DryadLINQ collection (defined by IEnumerable) is a distributed dataset. Partitioning strategies:
  - Hash Partitioning
  - Range Partitioning
  - Round-robin Partitioning
- The results of a DryadLINQ computation are represented by the object DryadTable<T>
  - Subtypes determine the actual storage interface.
  - Can include additional details such as metadata and schemas.
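The three partitioning strategies named above can be sketched as follows. These Python functions are illustrative only and are not the DryadLINQ operators; the function names, the use of CRC32 as the hash, and the boundary convention for range partitioning are all assumptions.

```python
import bisect
import zlib

def hash_partition(records, key, n):
    """Records with equal keys always land in the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        # CRC32 is stable across runs, unlike Python's salted hash() for strings.
        index = zlib.crc32(repr(key(rec)).encode()) % n
        parts[index].append(rec)
    return parts

def range_partition(records, key, boundaries):
    """Partition i holds keys between boundaries[i-1] (inclusive) and boundaries[i]."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect.bisect_right(boundaries, key(rec))].append(rec)
    return parts

def round_robin_partition(records, n):
    """Deal records out in turn, ignoring their contents."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

by_range = range_partition([5, 1, 9, 3, 7], key=lambda x: x, boundaries=[4, 8])
by_turn = round_robin_partition(["a", "b", "c", "d", "e"], 2)
by_hash = hash_partition(["x", "y", "x", "z", "x"], key=lambda s: s, n=3)

print(by_range)  # [[1, 3], [5, 7], [9]]
print(by_turn)   # [['a', 'c', 'e'], ['b', 'd']]
```

Hash partitioning keeps equal keys together, which suits joins and grouping. Range partitioning keeps key order across partitions, which suits sorting. Round-robin only balances partition sizes.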
DryadLINQ Methods
- All the methods need to be side-effect free
- Shared objects can be distributed in any way
- The functions to access a DryadTable are serializable
  - GetTable<T>
  - ToDryadTable<T>
- Custom partitioning operators
  - HashPartition<T,K>
  - RangePartition<T,K>
- Functional operators
  - Apply(f, dataset): applies function f to all the elements in a dataset
  - Fork(f, dataset): similar to Apply, but can produce multiple output datasets
- Dryad annotations: parallelization, storage policies
System Implementation
Execution Plan Graph (EPG)
- DryadLINQ converts the raw LINQ expressions to the nodes of the EPG
- The EPG is a DAG
- A part of the EPG can also be generated at runtime based on the values of iterative and conditional expressions
- DryadLINQ also needs to respect the metadata (node requirements and parallelization directives) while generating the EPG
- Needs to support the deferred evaluation of functions
Static Optimizations
- Pipelining: One process executes multiple operations in a pipelined fashion
- Redundancy Removal: Remove dead code and unnecessary partitioning
- Eager Aggregation: Intelligently reduce data movement by optimizing aggregation and repartitioning
- I/O Reduction: Use TCP pipes and in-memory channels to reduce persistence to files
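The effect of eager aggregation can be shown with a small counting example. This Python sketch is not DryadLINQ's optimizer; the function names and the cost measure (the number of items that cross the network) are assumptions chosen to show why aggregating before repartitioning reduces data movement.

```python
from collections import Counter

def count_without_eager_aggregation(partitions):
    """Ship every record to the aggregating node, then count there."""
    shipped = [rec for part in partitions for rec in part]
    return Counter(shipped), len(shipped)

def count_with_eager_aggregation(partitions):
    """Count locally on each partition, then ship only (key, count) pairs."""
    partials = [Counter(part) for part in partitions]
    shipped = sum(len(partial) for partial in partials)
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total, shipped

partitions = [["a", "b", "a", "a"], ["b", "b", "c", "b"]]
naive_total, naive_moved = count_without_eager_aggregation(partitions)
eager_total, eager_moved = count_with_eager_aggregation(partitions)

print(naive_moved, eager_moved)  # 8 4
```

Both variants compute the same totals. The saving depends on how many duplicate keys each partition holds, and the rewrite is only valid when the aggregation can be split into partial results that are merged later, as counting can.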
Dynamic Optimizations
Optimally Implementing OrderBy
- Deterministically sample the values.
- Plot a histogram, and compute the appropriate keys for range partitioning.
- A set of vertices now perform the range partitioning.
- A node now fetches the inputs, and then sorts them. These two actions can be pipelined.
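The OrderBy steps above can be sketched in Python on in-memory lists. This is a simplified illustration, not DryadLINQ's implementation: the function names, the every-k-th-record sampling rule, and the quantile choice for the range keys are assumptions, and the sketch expects every input partition to be non-empty.

```python
import bisect

def sample_keys(partition, step):
    # Deterministic sampling: take every step-th record, with no randomness.
    return partition[::step]

def choose_boundaries(samples, n_out):
    # Pick n_out - 1 evenly spaced sample keys as range-partitioning keys.
    ordered = sorted(samples)
    return [ordered[(i * len(ordered)) // n_out] for i in range(1, n_out)]

def distributed_sort(partitions, n_out, step=2):
    samples = [k for part in partitions for k in sample_keys(part, step)]
    boundaries = choose_boundaries(samples, n_out)

    # Range partitioning: route each key to the bucket that owns its range.
    buckets = [[] for _ in range(n_out)]
    for part in partitions:
        for k in part:
            buckets[bisect.bisect_right(boundaries, k)].append(k)

    # Each output node sorts only the records it fetched.
    return [sorted(bucket) for bucket in buckets]

partitions = [[9, 1, 7, 3], [8, 2, 6, 4], [5, 0, 11, 10]]
sorted_buckets = distributed_sort(partitions, n_out=3)
flattened = [k for bucket in sorted_buckets for k in bucket]
print(flattened)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Because the buckets cover disjoint, ordered key ranges, concatenating the locally sorted buckets gives a globally sorted result without a final merge. How evenly the buckets are filled depends on how well the samples represent the data.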
Execution Plan of OrderBy
[Figure: execution plan of OrderBy. Stages, top to bottom: deterministic sampling (two parallel vertices), computing the histogram and ranges, data fetch (two parallel vertices), sort (two parallel vertices).]
Code Generation
- The EPG is a virtual execution plan
- DryadLINQ dynamically generates code for each EPG node
  - DryadLINQ generates a .NET assembly snippet that corresponds to each LINQ subexpression.
  - It contains the serialization and I/O code for ferrying data.
  - The EPG node code is generated at the computer of the client (job submitter), because it may depend on the local context. Values in the local context are embedded in the function/expression. The expression undergoes partial evaluation later.
  - Uses .NET reflection to find the transitive closure of all .NET libraries. The EPG code and all the associated libraries are shipped to the cluster computer for remote execution.
Interacting with other Frameworks
PLINQ
- Runs a subexpression in a cluster node in parallel using multicore processors.
- Uses user-supplied annotations (mostly transparent to the user).
- Uses parallel iterators (similar to OpenMP)

SQL
- DryadLINQ nodes can directly access SQL databases.
- They can save internal datasets in SQL tables.
- Can ship some subexpressions to run directly as SQL procedures.
Debugging

Debugging massively parallel applications is very difficult.

Debugging
- Visual Studio .NET interface to debug the DryadLINQ program on a single computer
- DryadLINQ has a deterministic replay model
  - It is possible to replay the entire execution, event by event
  - It is also possible to replay any subexpression on a local machine and view the outputs

Performance Debugging
- Collect detailed profiling information.
Setup
- 240-computer cluster
- Each node contains two AMD Opteron processors
- 16 GB of main memory

Experiments
Terasort
- Sort a terabyte-size dataset.
- 3.87 GB stored per node.
Results: Terasort - I
- The number of computers was varied from 1 to 240
- The execution time was about 120 s for 1 machine, and quickly jumped to about 250 s
- Then it grew very slowly (sub-linearly) to about 320 s
Source [1]
Number of nodes vs Time
Computers   DryadLINQ time (s)
1           2666
5           580
10          328
20          176
40          113
SkyServer benchmark
- The number of machines was varied from 1 to 40
- The speedup increased from 1 to 19 sub-linearly for DryadLINQ
- The speedup increased from 1 to 24 sub-linearly for Dryad Two-pass
Source [1]
[1] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. OSDI 2008.