Page 1
HadoopDB: An open source hybrid of MapReduce and DBMS technologies
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz
Yale University
http://hadoopdb.sourceforge.net
October 2, 2009
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for
Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi
Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.
Page 2
Outline: Introduction, Candidates, Differences, HadoopDB, Evaluation, Conclusion
Motivation
Major Trends
1 Data explosion: Automation of business processes, proliferation of digital devices. eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB.
2 Analysis over raw data
Bottom line
Analyzing massive structured data on 1000s of shared-nothing nodes.
Yale University, HadoopWorld 2009 HadoopDB 2/24
Page 4
Sales Record Example
Consider a large data set of sales log records, each consisting of sales information including:
1 a date of sale
2 a price
We would like to take the log records and generate a report showing the total sales for each year.
Question:
How do we generate this report efficiently and cheaply over massive data contained in a shared-nothing cluster of 1000s of machines?
Page 5
MapReduce (Hadoop)
MapReduce is a programming model which specifies:
A map function that processes a key/value pair to generate a set of intermediate key/value pairs,
A reduce function that merges all intermediate values associated with the same intermediate key.
Hadoop
is a MapReduce implementation for processing large data sets over 1000s of nodes.
Maps (and Reduces) run independently of each other over blocks of data distributed across a cluster.
Page 6
Sales Record Example using Hadoop
Query: Calculate total sales for each year.
We write a MapReduce program:
Map: Takes log records and extracts a key-value pair of year and sale price in dollars. Outputs the key-value pairs.
Shuffle: Hadoop automatically partitions the key-value pairs by year to the nodes executing the Reduce function.
Reduce: Simply sums up all the dollar values for a year.
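The three steps above can be sketched in plain Python. This is a single-process illustration of the programming model, not Hadoop code; the record layout (an ISO date string plus a price) and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sales log records: (date of sale, price in dollars).
SALES_LOG = [
    ("2007-03-14", 10.0),
    ("2007-11-02", 5.5),
    ("2008-01-20", 7.0),
    ("2008-06-30", 3.0),
    ("2009-02-11", 4.5),
]


def map_record(record):
    """Map: extract a (year, price) key-value pair from one log record."""
    date, price = record
    yield int(date[:4]), price


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_year(year, prices):
    """Reduce: sum all the dollar values for one year."""
    return year, sum(prices)


def total_sales_per_year(records):
    """Run Map, Shuffle, and Reduce in sequence over all records."""
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_year(year, prices) for year, prices in shuffle(pairs).items())


totals = total_sales_per_year(SALES_LOG)
```

In a real cluster the Map calls run on different nodes and the shuffle moves data over the network; here everything runs in one process to keep the data flow visible.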
Page 7
Relational Databases
If the data is stored in a relational database system, the sales record example can be expressed in SQL as:
SELECT YEAR(date) AS year, SUM(price)
FROM sales
GROUP BY year
The execution plan is:
projection(year,price) → hash aggregation(year,price).
Question:
How do we process this efficiently if the data is very large?
Page 8
Parallel Databases
Parallel Databases are like single-node databases except:
Data is partitioned across nodes
Individual relational operations can be executed in parallel
SELECT YEAR(date) AS year, SUM(price)
FROM sales
GROUP BY year
Execution plan for the query:
projection(year,price) → partial hash aggregation(year,price) → partitioning(year) → final aggregation(year,price).
Note that the execution plan resembles the map and reduce phases of Hadoop.
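The parallel plan can be mimicked in a few lines of Python to make the correspondence with MapReduce concrete. In this sketch each node is simply a list of rows, and the partitioning step routes a year to a node by year modulo the number of nodes; both choices, like the function names, are assumptions for illustration and do not describe how any specific parallel database is implemented.

```python
from collections import Counter


def partial_aggregation(rows):
    """Projection plus per-node partial hash aggregation: year -> local sum."""
    partial = Counter()
    for date, price in rows:
        partial[int(date[:4])] += price
    return partial


def partition_by_year(partials, num_nodes):
    """Partitioning: route each (year, partial sum) to the node owning that year."""
    inbox = [[] for _ in range(num_nodes)]
    for partial in partials:
        for year, subtotal in partial.items():
            inbox[year % num_nodes].append((year, subtotal))
    return inbox


def final_aggregation(pairs):
    """Final aggregation: combine the partial sums received for each year."""
    final = Counter()
    for year, subtotal in pairs:
        final[year] += subtotal
    return dict(final)


def parallel_total_sales(nodes):
    """Run the plan over all nodes and collect the per-node results."""
    partials = [partial_aggregation(rows) for rows in nodes]
    result = {}
    for pairs in partition_by_year(partials, len(nodes)):
        result.update(final_aggregation(pairs))
    return result


# Hypothetical sales data, partitioned across two nodes.
NODES = [
    [("2007-03-14", 10), ("2008-01-20", 7)],
    [("2007-11-02", 5), ("2008-06-30", 3), ("2009-02-11", 4)],
]

totals = parallel_total_sales(NODES)
```

The partial aggregation plays the role of Map (with a combiner), the partitioning step plays the role of the shuffle, and the final aggregation plays the role of Reduce.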
Page 9
Differences between Parallel Databases and Hadoop
Page 11
To summarize
Page 12
At Yale, we looked beyond the differences ...
Page 14
and we discovered ...
... that they complete each other
Image: http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg
Basic design idea
Multiple, independent, single-node databases coordinated by Hadoop.
Page 15
Hadoop Basics
Page 16
Architecture
Page 17
SQL-MR-SQL
SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
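One way to picture the basic design idea for this query: each Map task sends SQL to the single-node database holding its partition, and a Reduce step merges the per-node results. The sketch below uses in-memory SQLite databases as stand-ins for the node-local databases. The table layout, the use of strftime (SQLite has no YEAR() function), and all function names are assumptions for illustration; this is not HadoopDB's planner or connector code.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to each node-local database (the SQL inside the Map phase).
NODE_SQL = """
    SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS year, SUM(revenue)
    FROM sales
    GROUP BY year
"""


def make_node(rows):
    """Create a stand-in single-node database holding one partition of sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Map: run the pushed-down SQL locally and emit (year, partial sum) pairs."""
    return conn.execute(NODE_SQL).fetchall()


def reduce_task(pairs):
    """Reduce: add up the partial sums per year across all nodes."""
    totals = defaultdict(float)
    for year, partial_sum in pairs:
        totals[year] += partial_sum
    return dict(totals)


def run_query(nodes):
    """Coordinate the Map tasks over all nodes, then reduce their output."""
    pairs = [pair for conn in nodes for pair in map_task(conn)]
    return reduce_task(pairs)


# Hypothetical partitions of the sales table on two nodes.
NODES = [
    make_node([("2007-03-14", 10.0), ("2008-01-20", 7.0)]),
    make_node([("2007-11-02", 5.5), ("2008-06-30", 3.0)]),
]

totals = run_query(NODES)
```

Most of the aggregation work happens inside the databases; the MapReduce layer only moves and combines the small per-node results.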
Page 18
Evaluating HadoopDB
Compare HadoopDB to
1 Hadoop
2 Parallel databases (Vertica, DBMS-X)
Features:
1 Performance: We expected HadoopDB to approach the performance of parallel databases
2 Scalability: We expected HadoopDB to scale as well as Hadoop
We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of 10, 50, and 100 nodes.
Page 19
Load
[Figure: load time on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop. Panels: random unstructured data (535MB/node), in seconds; structured data (20GB/node), in thousands of seconds.]
Page 20
Performance: Grep Task
[Figure: Grep task running time in seconds (y-axis 0 to 70) on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop.]
SELECT * FROM grep WHERE field LIKE '%xyz%';
1 Full table scan, highly selective filter
2 Random data, no room for indexing
3 Hadoop overhead outweighs query processing time in single-node databases
Page 21
Performance: Join Task
[Figure: Join task running time in seconds (y-axis 0 to 2000) on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop. Bar labels recovered from the chart: 20.6, 34.7, 67.7; 28.0, 29.4, 31.9; 126.4, 224.2, 300.5.]
SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
FROM rankings, uservisits
WHERE pageURL = destURL
AND visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY sourceIP
ORDER BY SUM(adRevenue) DESC LIMIT 1;
1 No full table scan due to clustered indexing
2 Hash partitioning and efficient join algorithm
Page 22
Performance: Bottom Line
1 Unstructured data: HadoopDB’s performance matches Hadoop
2 Structured data: HadoopDB’s performance is close to parallel databases
Page 23
Scalability: Setup
1 Simple aggregation task - full table scan
2 Data replicated across 10 nodes
3 Fault-tolerance: Kill a node halfway
4 Fluctuation-tolerance: Slow down a node for the entire experiment
Page 24
Scalability: Results
[Figure: percentage slowdown (y-axis 0% to 200%) in the fault-tolerance and fluctuation-tolerance experiments for Vertica, HadoopDB, and Hadoop.]
1 HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks or blocks
2 Parallel databases restart the entire query on node failure or wait for the slowest node
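The effect of a slow node under the two strategies can be illustrated with a toy model. The sketch below compares a static, equal split of the work (the query waits for the slowest node) with block-level runtime scheduling (whichever node is free takes the next block). The cluster size, node speeds, and block count are made-up parameters, the model ignores failures and speculative execution, and its numbers are not the benchmark results.

```python
import heapq


def static_time(total_work, speeds):
    """Equal share per node, assigned up front; finish when the slowest is done."""
    share = total_work / len(speeds)
    return max(share / speed for speed in speeds)


def dynamic_time(total_work, speeds, num_blocks):
    """Split work into blocks; the node that frees up first takes the next block."""
    block = total_work / num_blocks
    # Heap of (time at which the node becomes free, node index).
    free_at = [(0.0, node) for node in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_blocks):
        now, node = heapq.heappop(free_at)
        done = now + block / speeds[node]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish


# Hypothetical cluster: nine normal nodes and one node running at half speed.
SPEEDS = [1.0] * 9 + [0.5]
HEALTHY = static_time(100, [1.0] * 10)

static_slowdown = static_time(100, SPEEDS) / HEALTHY - 1
dynamic_slowdown = dynamic_time(100, SPEEDS, 100) / HEALTHY - 1
```

In this toy setting the static split takes twice as long as on a healthy cluster, while block-level scheduling finishes in roughly 10% more time, because the slow node simply ends up processing fewer blocks.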
Page 25
To summarize
HadoopDB ...
1 is a hybrid of DBMS and MapReduce
2 scales better than commercial parallel databases
3 is as fault-tolerant as Hadoop
4 approaches the performance of parallel databases
5 is free and open-source
http://hadoopdb.sourceforge.net
Page 26
Future work
Engineering work:
1 Full SQL support in SMS
2 Data compression
3 Integration with other open source databases
4 Full automation of the loading and replication process
5 Out-of-the-box deployment
6 We’re hiring!
Research work:
Incremental loading and on-the-fly repartitioning
Dynamically adjusting fault-tolerance levels based on failure rate
Page 27
Thank You ...
We welcome all thoughts on how to raise HadoopDB ...
Image: http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg