Post on 11-Jan-2016
Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
John Vert
Architect
Microsoft Corporation
SVR17
Moving Parts
> Windows HPC Server 2008 – cluster management, job scheduling
> Dryad – distributed execution engine, failure recovery, distribution, scalability across very large partitioned datasets
> LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model
> PLINQ – multi-core parallelism across LINQ queries
> DryadLINQ – brings LINQ's ease of programming to Dryad
Software Stack
[Figure: layered software stack, top to bottom]
> .NET Applications: Image Processing, Machine Learning, Graph Analysis, Data Mining, …
> DryadLINQ
> Dryad
> HPC Job Scheduler
> Windows HPC Server 2008 (shown on multiple cluster nodes)
Dryad
> Provides a general, flexible distributed execution layer
  > Dataflow graph as the computation model
    > Can be modified by runtime optimizations
  > Higher language layer supplies graph, vertex code, serialization code, hints for data locality
> Automatically handles distributed execution
  > Distributes code, routes data
  > Schedules processes on machines near data
  > Masks failures in cluster and network
A Dryad Job
Directed acyclic graph (DAG)
[Figure: inputs flow through processing vertices, connected by channels (file, fifo, pipe), to outputs]
2-D Piping
Unix Pipes: 1-D
grep | sed | sort | awk | perl
Dryad: 2-D
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
LINQ
Language Integrated Query
> Declarative extensions to C# and VB.NET for iterating over collections
  > In memory
  > Via data providers
  > SQL-like
> Broadly adoptable by developers
  > Easy to use
  > Reduces written code
  > Predictable results
  > Scalable experience
  > Deep tooling support
PLINQ
Parallel Language Integrated Query
Value Proposition:
> Enable LINQ developers to take advantage of parallel hardware with only a basic understanding of data parallelism
> Declarative data parallelism (focus on the "what", not the "how")
> Alternative to LINQ-to-Objects
  > Same set of query operators + some extras
  > Default is IEnumerable<T> based
> Preview in Parallel Extensions to .NET Framework 3.5 CTP
> Shipping in .NET Framework 4.0 Beta 2
DryadLINQ
LINQ to clusters
> Declarative programming style of LINQ for clusters
> Automatic parallelization
  > Parallel query plan exploits multi-node parallelism
  > PLINQ underneath exploits multi-core parallelism
> Integration with VS and .NET
  > Type safety, automatic serialization
> Query plan optimizations
  > Static optimization rules to optimize locality
  > Dynamic run-time optimizations
DryadLINQ: From LINQ to Dryad
[Figure: a LINQ query over logs (where, select) is turned into a query plan by automatic query plan generation; the plan then runs as a distributed query executed by Dryad]
var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);
A Simple LINQ Query
IEnumerable<BabyInfo> babies = ...;
var results =
    from baby in babies
    where baby.Name == queryName
       && baby.State == queryState
       && baby.Year >= yearStart
       && baby.Year <= yearEnd
    orderby baby.Year ascending
    select baby;
A Simple PLINQ Query
IEnumerable<BabyInfo> babies = ...;
var results =
    from baby in babies.AsParallel()
    where baby.Name == queryName
       && baby.State == queryState
       && baby.Year >= yearStart
       && baby.Year <= yearEnd
    orderby baby.Year ascending
    select baby;
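The two queries above differ only in the call to AsParallel(). A minimal sketch of both running side by side on in-memory data follows. The BabyInfo fields and the BabyQueries helper class are assumptions made for illustration; the slides do not show the BabyInfo definition.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type: the slides use BabyInfo but never define it.
public class BabyInfo
{
    public string Name;
    public string State;
    public int Year;
}

// Hypothetical helper class wrapping the two queries from the slides.
public static class BabyQueries
{
    // Sequential LINQ-to-Objects query.
    public static List<BabyInfo> RunSequential(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies
                      where baby.Name == queryName
                         && baby.State == queryState
                         && baby.Year >= yearStart
                         && baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }

    // PLINQ query: identical except for AsParallel() on the source.
    public static List<BabyInfo> RunParallel(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies.AsParallel()
                      where baby.Name == queryName
                         && baby.State == queryState
                         && baby.Year >= yearStart
                         && baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }
}
```

Because both queries end in an orderby clause, the parallel version returns the matching records in the same year order as the sequential one.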
A Simple DryadLINQ Query
PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>("BabyInfo.pt");
var results =
    from baby in babies
    where baby.Name == queryName
       && baby.State == queryState
       && baby.Year >= yearStart
       && baby.Year <= yearEnd
    orderby baby.Year ascending
    select baby;
PartitionedTable<T>
Core data structure for DryadLINQ
> Scale-out, partitioned container for .NET objects
> Derives from IQueryable<T>, IEnumerable<T>
> ToPartitionedTable() extension methods
> DryadLINQ operators consume and produce PartitionedTable<T>
> DryadLINQ generates code to serialize/deserialize your .NET objects
> Underlying storage can be partitioned file, partitioned SQL table, cluster filesystem
Partitioned File
File-based container for PartitionedTable<T> metadata
Metadata file contents:
XC\output\520a0fcf\Part
20
0,1855000,HPCMETAHN01
1,1630000,HPCA1CN13
2,1707500,HPCA1CN12
3,1828820,HPCA1CN22
4,1802140,HPCA1CN07
5,1741000,HPCA1CN08
6,1733980,HPCA1CN11
7,1762620,HPCA1CN06
8,1861300,HPCA1CN14
9,1807460,HPCA1CN17
10,1807560,HPCA1CN23
11,1768120,HPCA1CN20
12,1847220,HPCA1CN03
13,1729160,HPCA1CN16
14,1767500,HPCA1CN05
15,1781520,HPCA1CN04
16,1728480,HPCA1CN09
17,1802580,HPCA1CN18
18,1862380,HPCA1CN10
19,1762540,HPCA1CN21
Partition files referenced by the metadata:
\\HPCMETAHN01\XC\output\520a0fcf\Part.00000000
\\HPCA1CN13\XC\output\520a0fcf\Part.00000001
\\HPCA1CN12\XC\output\520a0fcf\Part.00000002
\\HPCA1CN22\XC\output\520a0fcf\Part.00000003
\\HPCA1CN07\XC\output\520a0fcf\Part.00000004
\\HPCA1CN08\XC\output\520a0fcf\Part.00000005
\\HPCA1CN11\XC\output\520a0fcf\Part.00000006
\\HPCA1CN06\XC\output\520a0fcf\Part.00000007
\\HPCA1CN14\XC\output\520a0fcf\Part.00000008
\\HPCA1CN17\XC\output\520a0fcf\Part.00000009
\\HPCA1CN23\XC\output\520a0fcf\Part.00000010
\\HPCA1CN20\XC\output\520a0fcf\Part.00000011
\\HPCA1CN03\XC\output\520a0fcf\Part.00000012
\\HPCA1CN16\XC\output\520a0fcf\Part.00000013
\\HPCA1CN05\XC\output\520a0fcf\Part.00000014
\\HPCA1CN04\XC\output\520a0fcf\Part.00000015
\\HPCA1CN09\XC\output\520a0fcf\Part.00000016
\\HPCA1CN18\XC\output\520a0fcf\Part.00000017
\\HPCA1CN10\XC\output\520a0fcf\Part.00000018
\\HPCA1CN21\XC\output\520a0fcf\Part.00000019
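The relationship between the metadata file and the partition paths can be sketched as a small parser. The layout assumed here (path prefix on the first line, partition count on the second, then one "index,number,machine" line per partition) is inferred from the example on the slide and is not a documented format; the class and method names are hypothetical, and the meaning of the middle number is not stated on the slide, so the sketch ignores it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DryadLINQ.
public static class PartitionedFileMetadata
{
    // Turns the metadata text into one UNC path per partition.
    // Assumed layout (inferred from the slide):
    //   line 1: path prefix, e.g. XC\output\520a0fcf\Part
    //   line 2: number of partitions
    //   then:   index,number,machine   (one line per partition)
    public static List<string> ResolvePartitionPaths(string metadataText)
    {
        string[] lines = metadataText.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        string prefix = lines[0].Trim();
        int count = int.Parse(lines[1].Trim());

        var paths = new List<string>();
        for (int i = 0; i < count; i++)
        {
            string[] fields = lines[2 + i].Split(',');
            int index = int.Parse(fields[0]);
            string machine = fields[2].Trim();

            // \\machine\prefix.NNNNNNNN, with the index zero-padded to 8 digits.
            paths.Add(string.Format(@"\\{0}\{1}.{2:D8}", machine, prefix, index));
        }
        return paths;
    }
}
```

Under these assumptions, the entry "1,1630000,HPCA1CN13" resolves to \\HPCA1CN13\XC\output\520a0fcf\Part.00000001, which matches the path list on the slide.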
A typical data-intensive query
var logs = PartitionedTable.Get<string>("weblogs.pt");

// Go through logs and keep only lines that are not comments.
// Parse each line into a new LogEntry object.
var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

// Go through logentries and keep only entries that are accesses by jvert.
var user =
    from access in logentries
    where access.user.EndsWith(@"\jvert")
    select access;

// Group jvert accesses according to what page they correspond to.
// For each page, count the occurrences.
var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("jvert", pages.Key, pages.Count());

// Keep only the .htm pages and sort them by how often jvert has accessed them.
var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;
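The same four-stage query can be tried locally with LINQ to Objects before running it on a cluster. The sketch below is an assumption-laden stand-in: the LogEntry and UserPageCount classes are not shown on the slides, and the "user page" line format parsed here is invented for illustration only.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the LogEntry class used on the slide.
// Assumed line format: "<user> <page>", separated by a single space.
public class LogEntry
{
    public string user;
    public string page;

    public LogEntry(string line)
    {
        string[] parts = line.Split(' ');
        user = parts[0];
        page = parts[1];
    }
}

// Hypothetical stand-in for UserPageCount.
public class UserPageCount
{
    public string user;
    public string page;
    public int count;

    public UserPageCount(string user, string page, int count)
    {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

public static class LogQuery
{
    // Runs the slide's query over an in-memory sequence of log lines.
    public static List<UserPageCount> HtmAccessesByJvert(IEnumerable<string> logs)
    {
        var logentries = from line in logs
                         where !line.StartsWith("#")
                         select new LogEntry(line);

        var user = from access in logentries
                   where access.user.EndsWith(@"\jvert")
                   select access;

        var accesses = from access in user
                       group access by access.page into pages
                       select new UserPageCount("jvert", pages.Key, pages.Count());

        var htmAccesses = from access in accesses
                          where access.page.EndsWith(".htm")
                          orderby access.count descending
                          select access;

        return htmAccesses.ToList();
    }
}
```

Only the data source changes between this local version and the DryadLINQ version; the query text is the same.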
Dryad Parallel DAG execution
[Figure: the weblog query shown earlier executing as a parallel DAG, with stages in order: logs, logentries, user, accesses, htmAccesses, output]
Query plan generation
> Separation of query from its execution context
  > Add all the loaded assemblies as resources
  > Eliminate references to local variables by partially evaluating all the expressions in the query
  > Distribute objects used by the query
  > Detect impure queries when possible
> Automatic code generation
  > Object serialization code for Dryad channels
  > Managed code for Dryad vertices
> Static query plan optimizations
  > Pipelining: composing multiple operators into one vertex
  > Minimize unnecessary data repartitions
  > Other standard DB optimizations
DryadLINQ query plan
Query 0 Output: file://\\hpcmetahn01\XC\output\b7e651a4-38b7-490c-8399-f63eaba7f29a.pt
DryadLinq0.dll was built successfully.
Input:
    [PartitionedTable: file://weblogs.pt]
Super__1:
    Where(line => !(line.StartsWith(_)))
    Select(line => new logdemo.LogEntry(line))
    Where(access => access.user.EndsWith(_))
    DryadGroupBy(access => access.page,(k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
    DryadHashPartition(e => e.Key,e => e.Key)
Super__12:
    DryadMerge()
    DryadGroupBy(e => e.Key,e => e.Value,(k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
    Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
XML representation
Generated by DryadLINQ and passed to Dryad
<Query>
  <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
  <ClusterName>hpcmetahn01</ClusterName>
  ...
  <!-- List of files to be shipped to the cluster -->
  <Resources>
    <Resource>wrappernativeinfo.dll</Resource>
    <Resource>DryadLinq0.dll</Resource>
    <Resource>System.Threading.dll</Resource>
    <Resource>logdemo.exe</Resource>
    <Resource>LinqToDryad.dll</Resource>
  </Resources>
  <!-- Vertex definitions -->
  <QueryPlan>
    <Vertex>
      <UniqueId>0</UniqueId>
      <Type>InputTable</Type>
      <Name>weblogs.pt</Name>
      ...
    </Vertex>
    <Vertex>
      <UniqueId>1</UniqueId>
      <Type>Super</Type>
      <Name>Super__1</Name>
      ...
      <Children>
        <Child>
          <UniqueId>0</UniqueId>
        </Child>
      </Children>
    </Vertex>
    ...
  </QueryPlan>
</Query>
DryadLINQ generated code
Compiled at runtime, assembly passed to Dryad to implement vertices
public sealed class DryadLinq__Vertex
{
    public static int Super__1(string args)
    {
        < . . . >
        DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
        var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
        var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
        var source__4 = DryadLinqVertex.DryadWhere(dreader__3,
            line => (!(line.StartsWith(@"#"))), true);
        var source__5 = DryadLinqVertex.DryadSelect(source__4,
            line => new logdemo.LogEntry(line), true);
        var source__6 = DryadLinqVertex.DryadWhere(source__5,
            access => access.user.EndsWith(@"\jvert"), true);
        var source__7 = DryadLinqVertex.DryadGroupBy(source__6,
            access => access.page,
            (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()),
            null, true, true, false);
        DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
        DryadLinqLog.Add("Vertex Super__1 completed at {0}",
            DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
        return 0;
    }

    public static int Super__12(string args)
    {
        < . . . >
    }
}
DryadLINQ query operators
> Almost all the useful LINQ operators
  > Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
> Operators introduced by DryadLINQ
  > HashPartition, RangePartition, Merge, Fork
> Dryad Apply
  > Operates on sequences rather than items
MapReduce in DryadLINQ
MapReduce(source,       // sequence of Ts
          mapper,       // T -> Ms
          keySelector,  // M -> K
          reducer)      // (K, Ms) -> Rs
{
    var map = source.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.SelectMany(reducer);
    return result;      // sequence of Rs
}
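The untyped outline above can be given concrete types. The following is a sketch over IEnumerable<T> with plain delegates, so that it runs locally with LINQ to Objects; it is not the DryadLINQ signature, which the slide does not show. The class name and type-parameter names are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local version of the MapReduce pattern from the slide.
public static class MapReduceExtensions
{
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        this IEnumerable<T> source,                      // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                  // T -> Ms
        Func<M, K> keySelector,                          // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer)   // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                   // sequence of Rs
    }
}

// Example use: word count over lines of text.
public static class WordCount
{
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        return lines
            .MapReduce(
                line => line.Split(' '),                 // map: line -> words
                word => word,                            // key: the word itself
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(p => p.Key, p => p.Value);
    }
}
```

The reducer receives each group as an IGrouping, which carries both the key and the mapped values, so the "(K, Ms) -> Rs" shape from the slide maps onto a single delegate argument.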
K-means in DryadLINQ
public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
{
    return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
}

public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers)
{
    return vectors.GroupBy(point => NearestCenter(point, centers))
                  .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
}

var vectors = PartitionedTable.Get<Vector>("vectors.pt");
IQueryable<Vector> centers = vectors.Take(100);
for (int i = 0; i < 10; i++)
{
    centers = Step(vectors, centers);
}
centers.ToPartitionedTable<Vector>("centers.pt");

public class Vector
{
    public double[] entries;
    [Associative]
    public static Vector operator +(Vector v1, Vector v2) { … }
    public static Vector operator -(Vector v1, Vector v2) { … }
    public double Norm2() { … }
}
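One k-means step can be exercised locally with LINQ to Objects. The sketch below fills in the Vector operators that the slide elides; those bodies, the division operator, and the KMeans class name are assumptions, and the DryadLINQ-specific [Associative] attribute is left out. The centers are held in a list so that grouping by the nearest center compares stable object references.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the slide's Vector class; operator bodies are assumed.
public class Vector
{
    public double[] entries;

    public Vector(params double[] entries) { this.entries = entries; }

    public static Vector operator +(Vector a, Vector b)
    {
        return new Vector(a.entries.Zip(b.entries, (x, y) => x + y).ToArray());
    }

    public static Vector operator -(Vector a, Vector b)
    {
        return new Vector(a.entries.Zip(b.entries, (x, y) => x - y).ToArray());
    }

    public static Vector operator /(Vector a, int n)
    {
        return new Vector(a.entries.Select(x => x / n).ToArray());
    }

    // Euclidean length.
    public double Norm2()
    {
        return Math.Sqrt(entries.Sum(x => x * x));
    }
}

public static class KMeans
{
    // Returns the center closest to v.
    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
    {
        return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
    }

    // One iteration: assign each point to its nearest center,
    // then replace each center by the mean of its assigned points.
    public static List<Vector> Step(IEnumerable<Vector> vectors, IList<Vector> centers)
    {
        return vectors.GroupBy(point => NearestCenter(point, centers))
                      .Select(group => group.Aggregate((x, y) => x + y) / group.Count())
                      .ToList();
    }
}
```

A center that attracts no points produces no group, so it drops out of the result; the slide's version behaves the same way.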
Putting it all together
It's LINQ all the way down
> Major League Baseball dataset
  > Pitch-by-pitch data for every MLB game since 2007
  > 47,909 pitch XML files (one for each pitcher appearance)
  > 6,127 player XML files (one for each player)
> Hash partition the input data files to distribute the work
> LINQ to XML to shred the data
> DryadLINQ to analyze dataset
Load the dataset and partition
Define Pitch and Player classes
void StagePitchData(string[] fileList, string PartitionedFile)
{
    // partition the list of filenames across
    // 20 nodes of the cluster
    var pitches = fileList.ToPartitionedTable("filelist")
                          .HashPartition((x) => (x), 20)
                          .SelectMany((f) => XElement.Load(f).Elements("atbat"))
                          .SelectMany((a) => a.Elements("pitch")
                              .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                                       (string)a.Attribute("batter"),
                                                       p)));
    pitches.ToPartitionedTable(PartitionedFile);
}

void StagePlayerData(string[] fileList, string PartitionedFile)
{
    var players = fileList.Select((p) => new Player(XElement.Load(p)));
    players.ToPartitionedTable(PartitionedFile);
}
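The LINQ to XML shredding step can be sketched on its own, without the cluster operators. The version below parses XML strings instead of loading files. The Pitch fields, the "start_speed" attribute name, and the PitchShredder class are assumptions; the slides do not show the Pitch class or the XML schema beyond the "atbat" and "pitch" elements and the "pitcher" and "batter" attributes.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical Pitch class; the slide only shows its constructor call.
public class Pitch
{
    public string Pitcher;
    public string Batter;
    public double StartSpeed;

    public Pitch(string pitcher, string batter, XElement p)
    {
        Pitcher = pitcher;
        Batter = batter;
        // "start_speed" is an assumed attribute name.
        StartSpeed = (double)p.Attribute("start_speed");
    }
}

public static class PitchShredder
{
    // Flattens documents -> atbat elements -> pitch elements -> Pitch objects.
    public static List<Pitch> Shred(IEnumerable<string> xmlDocuments)
    {
        return xmlDocuments
            .SelectMany(xml => XElement.Parse(xml).Elements("atbat"))
            .SelectMany(a => a.Elements("pitch")
                .Select(p => new Pitch((string)a.Attribute("pitcher"),
                                       (string)a.Attribute("batter"),
                                       p)))
            .ToList();
    }
}
```

The nested SelectMany keeps the enclosing atbat element in scope, which is how each Pitch picks up the pitcher and batter from its parent element.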
Analyze dataset with LINQ
IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
{
    return pitches.OrderByDescending((p) => p.StartSpeed)
                  .Take(count);
}
Supports LINQ Joins
IQueryable<string> FindFastestPitchers(IQueryable<Pitch> pitches,
                                       IQueryable<Player> players,
                                       int count)
{
    return pitches.OrderByDescending((p) => p.StartSpeed)
                  .Take(count)
                  .Join(players,
                        (o) => o.Pitcher,
                        (i) => i.Id,
                        (o, i) => i.FirstName + " " + i.LastName)
                  .Distinct();
}
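The join can be checked on a few in-memory records with LINQ to Objects. In this sketch the Pitch and Player classes are reduced to the fields the query touches; their shapes and the PitchAnalysis class name are assumptions, since the slides do not show the class definitions.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, reduced versions of the slide's Pitch and Player classes.
public class Pitch
{
    public string Pitcher;      // id of the pitcher who threw the pitch
    public double StartSpeed;
}

public class Player
{
    public string Id;
    public string FirstName;
    public string LastName;
}

public static class PitchAnalysis
{
    // Names of the pitchers behind the 'count' fastest pitches, without repeats.
    public static List<string> FindFastestPitchers(
        IEnumerable<Pitch> pitches, IEnumerable<Player> players, int count)
    {
        return pitches.OrderByDescending(p => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            o => o.Pitcher,
                            i => i.Id,
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct()
                      .ToList();
    }
}
```

Distinct() matters because one pitcher can own several of the fastest pitches; without it the same name would appear once per pitch.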
DryadLINQ on HPC Server
> DryadLINQ program runs on client workstation
  > Develop, debug, run locally
> When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server
> HPC Server allocates resources for the job and schedules the single task. This task is the Dryad Job Manager
> The JM then schedules additional tasks to execute the vertices of the DryadLINQ query
> When the job completes, the client program picks up the output result and continues.
Examples of DryadLINQ Applications
> Data mining
  > Analysis of service logs for network security
  > Analysis of Windows Watson/SQM data
  > Cluster monitoring and performance analysis
> Graph analysis
  > Accelerated Page-Rank computation
  > Road network shortest-path preprocessing
> Image processing
  > Image indexing
  > Decision tree training
  > Epitome computation
> Simulation
  > Light flow simulations for next-generation display research
  > Monte-Carlo simulations for mobile data
> eScience
  > Machine learning platform for health solutions
  > Astrophysics simulation
Ongoing Work
> Advanced query optimizations
  > Combination of static analysis and annotations
  > Sampling execution of the query plan
  > Dynamic query optimization
> Incremental computation
> Real-time event processing
> Global scheduling
  > Dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
> Scale-out partitioned storage
  > Pluggable storage providers
> DryadLINQ on Azure
> Better debugging, performance analysis, visualization, etc.
Additional Resources
> Dryad and DryadLINQ
  > http://connect.microsoft.com/DryadLINQ
  > DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.
> PLINQ
  > Available in Parallel Extensions to .NET Framework 3.5 CTP
  > Available in .NET Framework 4.0 Beta 2
  > http://msdn.microsoft.com/en-us/concurrency/default.aspx
  > http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
> Windows HPC Server 2008
  > http://www.microsoft.com/hpc
  > Download it, try it, we want your feedback!
Questions?
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.