Sampling Based Range Partition for Big Data Analytics + Some Extras. Milan Vojnović, Microsoft Research, Cambridge, United Kingdom. Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou. INQUEST Workshop, September 2012.
Sampling Based Range Partition for Big Data Analytics
+ Some Extras
Milan Vojnović
Microsoft Research
Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012
Big Data Analytics
• Our goal: innovation in algorithms for large-scale computations, to move the frontier of the computer science of big data
• Some figures of scale
– Petabytes/terabytes of online services data processed daily
– 200M tweets per day (Twitter)
– 1B pieces of content shared per day (Facebook)
– 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda
• Machine learning
• Optimization
• Database queries
• Distributed computing system
Outline
• Range Partition
– with Fei Xu and Jingren Zhou
• Count Tracking
– with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (def. only)
– with Charalampos Tsourakakis and Bozidar Radunovic
Range Partition
• Special interest: balanced range partition
[Figure: example of a range partition. Data items held at sites 1, 2, ..., k are assigned to ranges 1, 2, ..., m, for example 1-100, 101-250, ..., 950-1024.]
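Range Partition Requirements
• Given [two parameters; symbols not recoverable from the transcript] and desired relative partition sizes [symbols not recoverable from the transcript]
• [parameter]-accurate range partition: [condition not recoverable from the transcript] with probability at least [bound not recoverable from the transcript]
– Q_i = number of data items assigned to range i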
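To make the quantity Q_i concrete, here is a minimal Python sketch that counts how many items fall into each range. The function name `range_counts` and the convention that each range is closed at its upper boundary are assumptions made for illustration; the boundaries 100, 250 and 949 only echo the example ranges 1-100, 101-250 and 950-1024 from the earlier slide.

```python
from bisect import bisect_left

def range_counts(items, boundaries):
    """Return [Q_1, ..., Q_m], the number of items assigned to each range.

    `boundaries` is a sorted list of m - 1 upper boundaries (assumed
    convention): an item x goes to the first range whose upper boundary
    is >= x, and to the last range if it exceeds every boundary.
    """
    counts = [0] * (len(boundaries) + 1)
    for x in items:
        counts[bisect_left(boundaries, x)] += 1
    return counts

# Keys 1..1024 split into the ranges 1-100, 101-250, 251-949, 950-1024.
items = list(range(1, 1025))
counts = range_counts(items, [100, 250, 949])
relative_sizes = [q / len(items) for q in counts]
print(counts, relative_sizes)
```

Comparing `relative_sizes` with the desired relative partition sizes is one simple way to check how balanced a candidate set of boundaries is.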
Two Approaches
• Sampling based methods
– Take a sample of data items
– Compute partition boundaries using the sample
• Quantile summary methods
– At each node compute a local quantile summary
– Merge at the coordinator node
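The sampling based approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the talk: it draws a uniform sample without replacement from a single in-memory list, and the function name `sample_boundaries`, the `seed` parameter and the rounding rule for picking sample quantiles are all hypothetical choices.

```python
import random

def sample_boundaries(items, sample_size, rel_sizes, seed=0):
    """Estimate range boundaries from a uniform random sample.

    `rel_sizes` are the desired relative partition sizes (summing to 1).
    Returns len(rel_sizes) - 1 upper boundaries taken from the sorted sample.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(items, sample_size))
    boundaries = []
    cumulative = 0.0
    for p in rel_sizes[:-1]:
        cumulative += p
        # Index of the sample quantile at the cumulative target fraction.
        idx = int(round(cumulative * len(sample))) - 1
        idx = min(len(sample) - 1, max(0, idx))
        boundaries.append(sample[idx])
    return boundaries

# Four equally sized ranges over keys 1..10000, from a sample of 1000 items.
items = list(range(1, 10001))
print(sample_boundaries(items, 1000, [0.25, 0.25, 0.25, 0.25]))
```

The boundaries come from the sample alone, so their quality depends on the sample size, which is the trade-off the related work on required sample sizes addresses.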
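Related Work
• Sampling based estimation of histograms studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998)
– Required sample size: [formula not recoverable from the transcript]
• Communication cost to draw samples without replacement (Tirthapura and Woodruff, 2011): [two-case bound of the form "for ...: ..., otherwise: ...", not recoverable from the transcript]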
Related Work (cont’d)
• Quantile summaries based approach (Greenwald and Khanna, 2001)
– Communication cost = [formula not recoverable from the transcript]
• Pros
– Deterministic guarantee
• Cons
– It requires sorting of data items
– Largest frequency of an item must be at most [bound not recoverable from the transcript]
Problem
• Range partition the data in a single pass, with minimal communication between the coordinator and the sites
Sampling Based Method
[Figure: sites 1, 2, ..., k send samples to a coordinator]
• Collect samples and partition using the samples
• Pros
– simplicity, scalability
• Cons
– how many samples to take from each site?
– data size imbalance: the number of input records per machine may differ from one machine to another
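The question of how many samples to take from each site can be illustrated with a small Python sketch. It shows one natural option, allocating a total sample budget in proportion to each site's record count, and is not necessarily the answer given in the talk. The function name `proportional_allocation` is hypothetical, and the sketch assumes the site sizes are known up front and that the budget does not exceed the total number of records.

```python
def proportional_allocation(site_sizes, total_samples):
    """Split `total_samples` across sites in proportion to their sizes.

    Uses largest-remainder rounding so the allocations sum exactly to
    `total_samples`. Assumes total_samples <= sum(site_sizes).
    """
    n = sum(site_sizes)
    quotas = [total_samples * s / n for s in site_sizes]
    alloc = [int(q) for q in quotas]
    leftover = total_samples - sum(alloc)
    # Hand the remaining samples to the sites with the largest fractional parts.
    order = sorted(range(len(site_sizes)),
                   key=lambda i: quotas[i] - alloc[i],
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# Three imbalanced sites sharing a budget of 100 samples.
print(proportional_allocation([1000, 3000, 6000], 100))
```

With imbalanced sites, taking the same number of samples from every site would over-represent records from small sites in the pooled sample, which is why the allocation question matters.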
Data Sizes Imbalance
Dataset     Records   Bytes   Sites
DataSet-1   62M       150G    262
DataSet-2   37M       25G     80
DataSet-3   13M       0.26G   1
DataSet-4   7M        1.2T    301
DataSet-5   106M      7T      5652
Origins of Data Sizes Imbalance
• JOIN
– SELECT
  FROM A INNER JOIN B ON A.KEY==B.KEY
  ORDER BY COL
• Lookup Table
– If the record value of column X is in the lookup table, then return the row