A Fault-Tolerant Environment for Large-Scale Query Processing

Mehmet Can Kurt, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University

HiPC’12 Pune, India
Motivation
• the “big data” problem
– Walmart handles 1 million customer transactions every hour; estimated data volume is 2.5 petabytes
– Facebook handles more than 40 billion images
– LSST generates 6 petabytes every year
• massive parallelism is the key
Motivation
• Mean-Time To Failure (MTTF) decreases as systems scale up
• Typical first year for a new cluster*
– 1000 individual machine failures
– 1 PDU failure (~500-1000 machines suddenly disappear)
– 20 rack failures (40-80 machines disappear, 1-6 hours to get back)
* taken from Jeff Dean’s talk at Google I/O (http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx)
Our Work
• supporting fault-tolerant query processing and data analysis for massive scientific datasets
• focusing on two specific query types:
1. Range Queries on Spatial datasets
2. Aggregation Queries on Point datasets
• supported failure types: single-machine failures and rack failures
* rack: a number of machines connected to the same hardware (e.g., a network switch)
Our Work
• Primary Goals
1) high efficiency of execution when there are no failures (indexing if applicable, ensuring load balance)
2) handling failures efficiently up to a certain number of nodes (low-overhead fault tolerance through data replication)
3) a modest slowdown in processing times after recovering from a failure (preserving load balance)
Range Queries on Spatial Data
• nature of the task:
– each data object is a rectangle in 2D space
– each query is defined by a rectangle
– return the data rectangles that intersect the query rectangle
• computational model:
– master/worker model
– master serves as coordinator
– each worker is responsible for a portion of the data
[Figure: master/worker architecture. Queries arrive at the master; each of three workers holds a portion of the data.]
Range Queries on Spatial Data
• data organization:
– a chunk is the smallest data unit
– create chunks by grouping data objects together
– assign chunks to workers in round-robin fashion
[Figure: a 2D space cut into chunks 1-4, assigned alternately to two workers.]
* the actual number of chunks depends on the chunk size parameter.
Range Queries on Spatial Data
• ensuring load balance:
– enumerate and sort data objects according to the Hilbert Space-Filling Curve, then pack the sorted data objects into chunks
• spatial index support:
– a Hilbert R-Tree is deployed on the master node
– leaf nodes correspond to data chunks
– initial filtering at the master tells workers which chunks to examine
[Figure: eight objects o1-o8 ordered along a Hilbert curve; the sorted order o1, o3, o8, o6, o2, o7, o4, o5 is packed into chunks 1-4.]
Range Queries on Spatial Data
• Fault-Tolerance Support – Sub-chunk Replication:
– step 1: divide each data chunk into k sub-chunks
– step 2: distribute the sub-chunks in round-robin fashion
[Figure: sub-chunk replication with k = 2. Workers 1-4 own chunks 1-4; each chunk is split into two sub-chunks, which are distributed round-robin to the other workers.]
* for rack failures: same approach, but distribute sub-chunks to nodes in a different rack
Range Queries on Spatial Data
• Fault-Tolerance Support – Bookkeeping:
– add a sub-leaf level to the bottom of the Hilbert R-Tree
– the Hilbert R-Tree then serves both as a filtering structure and as a failure-management tool
Aggregation Queries on Point Data
• nature of the task:
– each data object is a point in 2D space
– each query is defined by a dimension (X or Y) and an aggregation function (SUM, AVG, …)
• computational model:
– master/worker model
– divide the space into M partitions
– no indexing support
– standard two-phase algorithm: local and global aggregation
[Figure: the space divided into M = 4 partitions over workers 1-4; a query’s partial result is computed in worker 2.]
Aggregation Queries on Point Data
• reducing communication volume
– the initial partitioning scheme has a direct impact
– assume insights about the data and query workload:
P(X) and P(Y) = probability of aggregation along the X- and Y-axis
|rx| and |ry| = range of the X and Y coordinates
• the expected communication volume Vcomm is defined in terms of these parameters
• Goal: choose a partitioning scheme (cv and ch) that minimizes Vcomm
Aggregation Queries on Point Data
• Fault-Tolerance Support – Sub-partition Replication:
– step 1: divide each partition evenly into M’ sub-partitions
– step 2: send each of the M’ sub-partitions to a different worker node
• Important questions:
1) how many sub-partitions (M’)?
2) how to divide a partition (cv’ and ch’)?
3) where to send each sub-partition? (random vs. rule-based)
[Figure: a partition divided with M’ = 4 (ch’ = 2, cv’ = 2). A better distribution reduces communication overhead; rule-based selection assigns sub-partitions to nodes that share the same coordinate range.]
Experiments
• local cluster
– each node has two quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM
• entire system implemented in C using the MPI library
• range queries:
– comparison with a chunk replication scheme
– 32 GB spatial data
– 1000 queries are run, and the aggregate time is reported
• aggregation queries:
– comparison with a partition replication scheme
– 24 GB point data
• 64 nodes used, unless noted otherwise
Experiments: Range Queries
[Figure panels: Optimal Chunk Size Selection; Scalability]
- execution times with no replication and no failures (chunk size = 10000)
Experiments: Range Queries
[Figure panels: Single-Machine Failure; Rack Failure]
- execution times under failure scenarios (64 workers in total)
- k is the number of sub-chunks per chunk
Experiments: Aggregation Queries
[Figure panels: Effect of Partitioning Scheme on Normal Execution; Single-Machine Failure]
- left: P(X) = P(Y) = 0.5, |rx| = |ry| = 10000; right: P(X) = P(Y) = 0.5, |rx| = |ry| = 100000
Conclusion
• a fault-tolerant environment that can process
– range queries on spatial data and aggregation queries on point data
– the proposed approaches can be extended to other types of queries and analysis tasks
• high efficiency under normal execution
• sub-chunk and sub-partition replication
– preserve load balance in the presence of failures, and hence
– outperform traditional replication schemes