Large Scale Data Processing with DryadLINQ
Dennis Fetterly, Microsoft Research, Silicon Valley
Workshop on Data-Intensive Scientific Computing Using DryadLINQ
Dec 20, 2015
Outline
• Brief introduction to TidyFS
• Preparing/loading data onto a cluster
• Desirable properties in a Dryad cluster
• Detailed description of several IR algorithms
TidyFS goals
• A simple distributed filesystem that provides the abstractions necessary for data-parallel computations
• High performance, reliable, scalable service
• Workload
  – High throughput, sequential IO, write once
  – Cluster machines working in parallel
  – Terasort
    • 240 machines reading at 240 MB/s = 56 GB/s
    • 240 machines writing at 160 MB/s = 37 GB/s
TidyFS Names
• Stream: a sequence of partitions
  – e.g. tidyfs://dryadlinqusers/fetterly/clueweb09-English
  – Can have leases for temp files or cleanup from crashes
• Partition:
  – Immutable
  – 64-bit identifier
  – Can be a member of multiple streams
  – Stored as an NTFS file on cluster machines
  – Multiple replicas of each partition can be stored

[Diagram: Stream-1 → Part 1, Part 2, Part 3, Part 4]
Preparation of Data
• Often substantially harder than it appears
• Issues:
  – Data format
  – Distribution of data
  – Network bandwidth
• Generating synthetic datasets is sometimes useful
Data Prep – Format
• Text records are simplest
  – Caveat: information that is not in the line
    • e.g. if a line number encodes information
• Binary records often require custom code to load onto the cluster
  – Serialization/deserialization code generated by DryadLINQ uses C# Reflection
Custom Deserialization Code

public class UrlDocIdScoreQuery
{
    public string queryId;
    public string url;
    public string docId;
    public string queryString;
    public double score;

    public static UrlDocIdScoreQuery Read(DryadBinaryReader reader)
    {
        UrlDocIdScoreQuery rec = new UrlDocIdScoreQuery();
        rec.queryId = ReadAnyString(reader);
        rec.queryString = ReadAnyString(reader);
        rec.url = ReadAnyString(reader);
        rec.docId = ReadAnyString(reader);
        rec.score = reader.ReadDouble();
        return rec;
    }

    public static string ReadAnyString(DryadBinaryReader dbr) { … }
}
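The same record layout can be sketched outside DryadLINQ. The Python below assumes a hypothetical wire format (each string as a 4-byte little-endian length followed by UTF-8 bytes, the score as an 8-byte double); the actual DryadBinaryReader encoding is not specified on this slide, so this only illustrates the field order used by Read above.

```python
import struct
from io import BytesIO

def read_string(buf):
    # Read a 4-byte little-endian length, then that many UTF-8 bytes.
    (n,) = struct.unpack("<i", buf.read(4))
    return buf.read(n).decode("utf-8")

def write_string(buf, s):
    # Mirror of read_string, used here only to build test input.
    data = s.encode("utf-8")
    buf.write(struct.pack("<i", len(data)))
    buf.write(data)

def read_record(buf):
    # Field order mirrors the C# Read method above:
    # queryId, queryString, url, docId, then an 8-byte double score.
    rec = {}
    rec["queryId"] = read_string(buf)
    rec["queryString"] = read_string(buf)
    rec["url"] = read_string(buf)
    rec["docId"] = read_string(buf)
    (rec["score"],) = struct.unpack("<d", buf.read(8))
    return rec
```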
Data Prep - Loading
• DryadLINQ job
  – Often needs a dummy input anchor
• Custom program
  – Write records to TidyFS partitions
• “SneakerNet” often a good option
Data Loading - DryadLINQ
• Need input “anchor” to run on cluster
  – Generate or use existing stream
• Sample:

IEnumerable<Entry> GenerateEntries(Random x, int numItems)
{
    for (int i = 0; i < numItems; i++)
    {
        // code to generate records
        yield return record;
    }
}
DryadLINQ Job
var streamname = "tidyfs://datasets/anchor";
var os = @"tidyfs://msri/teamname/data?compression=" + CompressionScheme.GZipFast;
var r = PartitionedTable.Get<int>(streamname)
         .Take(1)
         .SelectMany(x => Enumerable.Range(0, partitions))
         .HashPartition(x => x, partitions)
         .Select(x => new Random(x))
         .SelectMany(x => GenerateEntries(x, numItems))
         .ToPartitionedTable(os);
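Read as a local sketch, the job above fans a single anchor record out into one seed per partition and then generates that partition's records independently. A Python approximation of the same plan (partition count, item count, and the record generator are arbitrary stand-ins):

```python
import random

def generate_entries(rng, num_items):
    # Stand-in for the GenerateEntries iterator on the previous slide.
    for _ in range(num_items):
        yield rng.randint(0, 999)

def generate_dataset(partitions, num_items):
    # Mirrors the plan: one anchor record fans out to one seed per
    # partition (Take(1) + SelectMany(Range)); each partition gets its
    # own seeded Random (Select(x => new Random(x))) and generates
    # records independently (SelectMany(GenerateEntries)).
    out = []
    for seed in range(partitions):     # one "vertex" per partition
        rng = random.Random(seed)
        out.append(list(generate_entries(rng, num_items)))
    return out

parts = generate_dataset(4, 10)
```

Seeding each partition from its partition index keeps the generated dataset deterministic across runs, which is useful for repeatable benchmarks.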
Data Loading - Databases
• Bulk copy into files
  – Use queries to produce multiple files
• Perform queries within a DryadLINQ UDF

IEnumerable<Entry> PerformQuery(string queryArg)
{
    var results = "select * from …";
    foreach (var record in results)
    {
        yield return record;
    }
}
Building a cluster
• Overall goal: a high-throughput system
  – Not latency sensitive
• More slower computers often better than fewer faster computers
• Multiple cores better than high frequency
• Multiple disks – increase throughput
• Sufficient RAM
Networking a Cluster
• Network topology – medium to large clusters
  – Attempt to maximize cross-rack bandwidth
  – Two-tier topology
    • Rack switches and core switches
• Port aggregation
  – Bond multiple connections together
    • 1 GbE or 10 GbE
Cluster Software
• Runs on Windows HPC Server 2008
• Academic Release
  – For non-commercial use
• Commercial License
DryadLINQ IR Toolkit
• Library that uses DryadLINQ
• Source code for a number of IR algorithms
  – Text retrieval: BM25/BM25F
  – Link-based ranking: PageRank/SALSA-SETR
  – Text processing: shingle-based duplicate detection
• Designed to work well with the ClueWeb09 collection
  – Including preprocessing the data to load the cluster
• Available from http://research.microsoft.com/dryadlinqir/
ClueWeb09 Collection
• Collected/distributed by CMU
• 1 billion web pages crawled in Jan/Feb 2009
• 10 different languages
  – en, zh, es, ja, de, fr, ko, it, pt, ar
• 5 TB compressed; 25 TB uncompressed
• Available to the research community
• Dataset available for your projects
  – Web graph, 503m English web pages
Example: Term Frequencies
Count term frequencies in a set of documents:
var docs = new PartitionedTable<Doc>("tidyfs://dennis/docs");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToPartitionedTable("tidyfs://dennis/counts.txt");
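The same query, sketched in plain Python to show what the operators compute: SelectMany flattens documents into words, and GroupBy plus Count tallies each distinct word.

```python
from collections import Counter

def term_frequencies(docs):
    # SelectMany: flatten each document into its words.
    # GroupBy + Count: tally occurrences of each distinct word.
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

freqs = term_frequencies(["the cat sat", "the dog sat"])
```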
[Diagram: logical plan for term frequency — IN (metadata) → SM (doc => doc.words) → GB (word => word) → S (g => new …) → OUT (metadata)]
Execution Plan for Term Frequency

[Diagram: the logical plan (SM, GB, S) expands into a two-stage physical plan. Stage (1): SelectMany, Sort, GroupBy, Count, Distribute — pipelined within one vertex. Stage (2): Mergesort, GroupBy, Sum — pipelined within the consuming vertex.]
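The payoff of this plan is partial aggregation: each stage (1) vertex groups and counts its own partition locally, so only (word, partial-count) pairs cross the network before stage (2) merges and sums them. A Python sketch of the two stages:

```python
from collections import Counter

def local_count(partition_words):
    # Stage (1): per-partition GroupBy + Count, before any data movement.
    return Counter(partition_words)

def global_sum(partials):
    # Stage (2): merge the partial counts and sum per word.
    total = Counter()
    for c in partials:
        total.update(c)
    return total

partials = [local_count(["a", "b", "a"]), local_count(["b", "b", "c"])]
totals = global_sum(partials)
```

Note that the first partition ships two (word, count) records instead of three raw words; on skewed word distributions the savings are much larger.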
Execution Plan for Term Frequency

[Diagram: the same physical plan replicated across partitions — each input partition runs its own pipelined stage (1) vertex (SelectMany, Sort, GroupBy, Count, Distribute), and each output partition runs a stage (2) vertex (Mergesort, GroupBy, Sum).]
BM25 “Grep”

• For batch evaluation of queries, calculating BM25 is just a select operation

string queryTermDocFreqURLLocal = @"E:\TREC\query-doc-freqs.txt";
Dictionary<string, int> dfs = GetDocFreqs(queryTermDocFreqURLLocal);
PartitionedTable<InitialWordRecord> initialWords =
    PartitionedTable.Get<InitialWordRecord>(initialWordsURL);

var BM25s = from doc in initialWords
            select ComputeDocBM25(queries, doc, dfs);

BM25s.ToPartitionedTable("tidyfs://dennis/scoredDocs");
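ComputeDocBM25 is not shown on the slide; as an illustration only, here is a standard Okapi BM25 scorer in Python (k1 and b are the usual free parameters; dfs maps a term to its document frequency, as in the C# fragment above):

```python
import math

def bm25(query_terms, doc_terms, dfs, n_docs, avg_len, k1=1.2, b=0.75):
    # Standard Okapi BM25: sum over query terms of IDF(t) times a
    # saturating, length-normalized term-frequency component.
    score = 0.0
    dl = len(doc_terms)
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf == 0 or t not in dfs:
            continue
        idf = math.log((n_docs - dfs[t] + 0.5) / (dfs[t] + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score
```

Because dfs is a small side table, it can be broadcast to every vertex and the scoring itself stays an embarrassingly parallel select.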
PageRank

• Ranks web pages by propagating scores along the hyperlink structure
• Each iteration as an SQL query:
  1. Join edges with ranks
  2. Distribute rank on edges
  3. GroupBy edge destination
  4. Aggregate into ranks
  5. Repeat
One PageRank Step in DryadLINQ

// one step of pagerank: dispersing and re-accumulating rank
public static IQueryable<Rank> PRStep(IQueryable<Page> pages,
                                      IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}
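The join-disperse-reaccumulate step can be mirrored in a few lines of Python (dictionaries stand in for the pages and ranks tables; like the slide's version, there is no damping factor):

```python
from collections import defaultdict

def pr_step(pages, ranks):
    # pages: name -> list of outgoing links; ranks: name -> score.
    # Disperse: each page splits its rank evenly across its links
    # (the join + Disperse above); then re-accumulate by destination
    # (the group-by-sum above).
    updates = defaultdict(float)
    for name, links in pages.items():
        share = ranks[name] / len(links)
        for dest in links:
            updates[dest] += share
    return dict(updates)

pages = {1: [2, 3], 2: [3], 3: [1]}
ranks = {1: 1.0, 2: 1.0, 3: 1.0}
ranks = pr_step(pages, ranks)
```

Without damping, each step conserves total rank, which makes a handy sanity check when debugging.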
A Complete DryadLINQ Program
var pages = DryadLinq.GetTable<Page>("tidyfs://pages.txt");

// repeat the iterative computation several times
var ranks = pages.Select(page => new Rank(page.name, 1.0));
for (int iter = 0; iter < iterations; iter++)
{
    ranks = PRStep(pages, ranks);
}
ranks.ToDryadTable<Rank>("outputranks.txt");

public struct Page
{
    public UInt64 name;
    public Int64 degree;
    public UInt64[] links;

    public Page(UInt64 n, Int64 d, UInt64[] l)
    {
        name = n; degree = d; links = l;
    }

    public Rank[] Disperse(Rank rank)
    {
        Rank[] ranks = new Rank[links.Length];
        double score = rank.rank / this.degree;
        for (int i = 0; i < ranks.Length; i++)
        {
            ranks[i] = new Rank(this.links[i], score);
        }
        return ranks;
    }
}

public struct Rank
{
    public UInt64 name;
    public double rank;

    public Rank(UInt64 n, double r) { name = n; rank = r; }
}

public static IQueryable<Rank> PRStep(IQueryable<Page> pages,
                                      IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}
PageRank Optimizations
• Benchmark PageRank on a 954m-page graph
• Naïve approach – 10 iterations, ~3.5 hours, 1.2 TB
• Apply several optimizations
  – Change data distribution
  – Pre-group pages by host
  – Rename host groups with dense names
  – Cull out leaf nodes
  – Pre-aggregate ranks for each host
• Final version – 10 iterations, 11.5 min, 116 GB
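The host pre-aggregation optimization exploits the fact that most hyperlinks stay within a host: summing rank updates destined for the same page locally, before the shuffle, shrinks what crosses the network. A toy Python sketch (host_of is a hypothetical page-to-host map; keying by host lets all of a host's records be routed together):

```python
from collections import defaultdict

def preaggregate_by_host(updates, host_of):
    # updates: (page, delta) pairs produced on this machine.
    # Combining deltas locally means only one record per (host, page)
    # leaves the machine, instead of one record per link.
    combined = defaultdict(float)
    for page, delta in updates:
        combined[(host_of[page], page)] += delta
    return combined

combined = preaggregate_by_host([(1, 0.5), (1, 0.5), (2, 1.0)],
                                {1: "a", 2: "a"})
```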
Tactics for Improving Performance
• Loop unrolling
• Reduce data movement
  – Improve data locality
• Choose what to Group
Schedule for Today
• 9:30–10:00 Meet with team, finalize project
• 10:30–12:00 Work on projects, discuss approach with a speaker
Cluster Configuration
[Diagram: Head Node, TidyFS servers, and cluster machines running tasks and the TidyFS storage service]
How a Dryad job reads from TidyFS
[Diagram: sequence of calls between the Job Manager, the TidyFS service, and cluster machines]

1. Job Manager → TidyFS service: list partitions in stream; response: Part 1 on Machine 1, Part 2 on Machine 2
2. Job Manager schedules the vertex reading Part 1 on Machine 1 and the vertex reading Part 2 on Machine 2
3. Each vertex → TidyFS service: Get Read Path (Machine 1, Part 1) / (Machine 2, Part 2)
4. TidyFS service returns local paths, e.g. D:\tidyfs\0001.data and D:\tidyfs\0002.data
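The read protocol above can be modeled with a toy metadata service in Python (class and method names are illustrative, not the real TidyFS API):

```python
class TidyFSMetadata:
    # Toy stand-in for the TidyFS metadata service in the figure.
    def __init__(self):
        self.streams = {}    # stream name -> ordered partition ids
        self.location = {}   # partition id -> (machine, local path)

    def list_partitions(self, stream):
        # Step 1: the Job Manager asks where the stream's data lives.
        return self.streams[stream]

    def get_read_path(self, machine, part):
        # Steps 3-4: a vertex scheduled on `machine` resolves its
        # partition to a local NTFS path.
        m, path = self.location[part]
        assert m == machine, "vertex should run where its partition is stored"
        return path

svc = TidyFSMetadata()
svc.streams["str"] = [1, 2]
svc.location[1] = ("machine1", r"D:\tidyfs\0001.data")
svc.location[2] = ("machine2", r"D:\tidyfs\0002.data")
```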
How a Dryad job writes to TidyFS

[Diagram: sequence of calls between the Job Manager, the TidyFS service, and cluster machines]

1. Job Manager → TidyFS service: Create Str1, the output stream
2. Job Manager schedules Vertex 1 on Machine 1 and Vertex 2 on Machine 2
3. Each vertex creates a private single-partition stream: Str1_v1 containing Part 1, Str1_v2 containing Part 2
4. Each vertex → TidyFS service: GetWritePath (Machine 1, Part 1) / (Machine 2, Part 2); the service returns local paths, e.g. D:\tidyfs\0001.data and D:\tidyfs\0002.data
5. As each vertex finishes, it calls AddPartitionInfo(Part, Machine, Size, Fingerprint, …) and its partition is marked Completed
6. On job success: ConcatenateStreams(Str1, Str1_v1, Str1_v2), then Delete Streams Str1_v1, Str1_v2
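The write protocol can likewise be modeled as a toy session in Python (names are illustrative): each vertex writes a private single-partition stream, and only on job success are those streams concatenated into the output and the temporaries deleted, so a failed or duplicate vertex never corrupts the output.

```python
class TidyFSWriteSession:
    # Toy model of the write protocol in the figure.
    def __init__(self):
        self.streams = {}    # stream name -> ordered partition list

    def create_stream(self, name):
        self.streams[name] = []

    def add_partition(self, stream, part):
        # A vertex publishes its completed partition into its
        # private stream (AddPartitionInfo + Completed).
        self.streams[stream].append(part)

    def concatenate(self, dest, *srcs):
        # ConcatenateStreams: splice per-vertex streams into the output.
        for s in srcs:
            self.streams[dest].extend(self.streams[s])

    def delete(self, *names):
        # Delete the temporary per-vertex streams.
        for n in names:
            del self.streams[n]

fs = TidyFSWriteSession()
fs.create_stream("str1")
fs.create_stream("str1_v1")
fs.add_partition("str1_v1", "part1")
fs.create_stream("str1_v2")
fs.add_partition("str1_v2", "part2")
fs.concatenate("str1", "str1_v1", "str1_v2")
fs.delete("str1_v1", "str1_v2")
```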