Where to Deploy Hadoop: Bare-metal or Cloud?
Michael Wendt, Sewook Wee
Data Insights R&D Group
Nov 28, 2014
Copyright © 2013 Accenture All rights reserved.
Big Data: Bare-metal vs. Cloud
Deployment options, from bare-metal to cloud:
On-premise full custom
Hadoop Appliance
Hadoop Hosting
Hadoop-as-a-Service
Key considerations:
Price-Performance Ratio
Data Privacy
Data Gravity
Data Enrichment
Productivity of Developers & Data Scientists
Servers designed by Daniel Campos from The Noun Project
Price-Performance Ratio Views
Bare-metal (on-premise full custom): "Cloud? Virtualized? Slow!"
Cloud (Hadoop-as-a-Service): "Who cares! I’m cheap, just throw more in!"
Hadoop Deployment Comparison Study
Bare-metal (on-premise full custom) vs. Cloud (Hadoop-as-a-Service)
Price-Performance Ratio: Accenture Data Platform Benchmark + TCO analysis
Hadoop Deployment Comparison Study: TCO Analysis
TCO of Bare-metal Hadoop Cluster
On-premise full custom: 24 server nodes and 50 TB of HDFS capacity* (small-scale initial production deployment)
Cost item                              Amount
Server hardware                        $3,000.00
Data center facility and electricity   $2,914.58
Technical support                      $6,656.00
Staff for operation                    $9,274.46
Total                                  $21,845.04
TCO of Hadoop-as-a-Service
Used the bare-metal TCO as the budget, then calculated the number of affordable instances
Cost item             Amount
Hadoop service        $15,318.28
Storage services      $2,063.00
Technical support     $1,372.27
Staff for operation   $3,091.49
Total                 $21,845.04
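The budgeting step can be checked with a few lines of Python. This is a minimal sketch, assuming the $15,318.28 figure is the Hadoop service line (it reappears on the affordable-instances slide) and that the other three amounts are the remaining Hadoop-as-a-Service costs. The function and variable names are illustrative, not from the study.

```python
from decimal import Decimal

# Figures quoted on the two TCO slides.
BARE_METAL_TCO = Decimal("21845.04")
# The three Hadoop-as-a-Service amounts other than the Hadoop service itself.
OTHER_HAAS_COSTS = [Decimal("2063.00"), Decimal("1372.27"), Decimal("3091.49")]


def hadoop_service_budget(total_budget, other_costs):
    """Return what is left for Hadoop service instances after the other costs."""
    return total_budget - sum(other_costs, Decimal("0"))


budget = hadoop_service_budget(BARE_METAL_TCO, OTHER_HAAS_COSTS)
print(budget)  # 15318.28
```

Decimal is used instead of float so the cents add up exactly.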
TCO of Hadoop-as-a-Service – Instances
Hadoop service: 14 instance types x 3 pricing models = 42 combinations
Selected 3 representative instance types: m1.xlarge, m2.4xlarge, cc2.8xlarge
TCO of Hadoop-as-a-Service – Affordable Instances
Hadoop service budget: $15,318.28
Assumptions: 50% cluster utilization; 1/3 of budget allocated for Spot instances
Instance type   On-demand instances (ODI)   Reserved instances (RI)   Reserved + Spot instances (RI + SI)
m1.xlarge       68                          112                       192
m2.4xlarge      20                          41                        77
cc2.8xlarge     13                          28                        53
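To make the effect of the pricing model concrete, the sketch below compares the RI and RI + SI cluster sizes with the ODI size for each instance type, using the counts on the slide. The slides do not give prices or the exact sizing formula, so `affordable_instances` is only a hypothetical rule (pay for the hours the cluster is in use), and the price and period passed to it are made up.

```python
import math

# Affordable instance counts from the slide: (ODI, RI, RI + SI).
AFFORDABLE = {
    "m1.xlarge": (68, 112, 192),
    "m2.4xlarge": (20, 41, 77),
    "cc2.8xlarge": (13, 28, 53),
}


def scale_vs_on_demand(counts):
    """Return how many times larger the RI and RI + SI clusters are than ODI."""
    odi, ri, ri_si = counts
    return ri / odi, ri_si / odi


def affordable_instances(budget, hourly_price, utilization, hours_in_period):
    """Hypothetical sizing rule: pay only for the hours the cluster is in use."""
    return math.floor(budget / (hourly_price * utilization * hours_in_period))


for name, counts in AFFORDABLE.items():
    ri_x, ri_si_x = scale_vs_on_demand(counts)
    print(f"{name}: RI {ri_x:.2f}x, RI + SI {ri_si_x:.2f}x the ODI cluster size")

# Made-up price and period, only to show the shape of the calculation.
example = affordable_instances(15318.28, hourly_price=0.60, utilization=0.5,
                               hours_in_period=730)
print(example)
```

Real reserved and spot pricing involves upfront fees and fluctuating bids, which this rule ignores.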
Hadoop Deployment Comparison Study: Accenture Data Platform Benchmark
Accenture Data Platform Benchmark
Suite of real-world Hadoop MapReduce applications
Common use cases categorized and selected from client experience, internal roadmap, and public literature
Open-source libraries & public datasets
Use case                         Workload
Log management                   Sessionization
Customer preference prediction   Recommendation engine
Text analytics                   Document clustering
Accenture Data Platform Benchmark: Sessionization
A session is a sequence of related interactions, useful to analyze as a group
Log data -> Bucketing -> Sorting -> Slicing -> Sessions
Dataset: ~150 billion log entries (~24 TB); 1 million users; 1.1 billion sessions
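The bucketing, sorting, and slicing steps can be sketched on a single machine as follows. This is not the benchmark's MapReduce code. The record layout `(user_id, timestamp, action)` and the 30-minute inactivity threshold are assumptions chosen for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # hypothetical inactivity threshold


def sessionize(log_entries, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, action) entries into per-user sessions."""
    # Bucketing: collect each user's entries together.
    buckets = defaultdict(list)
    for user_id, timestamp, action in log_entries:
        buckets[user_id].append((timestamp, action))

    sessions = []
    for user_id, entries in buckets.items():
        # Sorting: order one user's entries by time.
        entries.sort(key=lambda e: e[0])
        # Slicing: start a new session whenever the gap is exceeded.
        current = [entries[0]]
        for prev, entry in zip(entries, entries[1:]):
            if entry[0] - prev[0] > gap:
                sessions.append((user_id, current))
                current = []
            current.append(entry)
        sessions.append((user_id, current))
    return sessions


log = [
    ("u1", 0, "home"),
    ("u2", 10, "home"),
    ("u1", 60, "search"),
    ("u1", 4000, "home"),
]
for user_id, session in sessionize(log):
    print(user_id, [action for _, action in session])
```

In a MapReduce setting the bucketing and sorting would typically fall to the shuffle phase, with slicing done in the reducer.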
Accenture Data Platform Benchmark: Recommendation Engine
Used item-based collaborative filtering algorithm; Mahout example library used as foundation
Ratings data: Who rated what item?
Co-occurrence matrix: How many people rated the pair of items?
Recommendation: Given the way the person rated these items, he/she is likely to be interested in these other items.
Dataset: generated 300 million ratings; population of 3 million; 50,000 items
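The three stages (ratings, co-occurrence matrix, recommendation) can be illustrated with a small in-memory sketch. It follows the general co-occurrence idea behind item-based collaborative filtering and does not use Mahout. The scoring rule (co-occurrence count times the user's rating) and the toy ratings are assumptions for illustration.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(ratings):
    """Count, for each ordered item pair, how many users rated both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in ratings.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(user, ratings, counts, top_n=2):
    """Score unseen items by co-occurrence weighted by the user's ratings."""
    scores = defaultdict(float)
    for item, rating in ratings[user].items():
        for other, n in counts[item].items():
            if other not in ratings[user]:
                scores[other] += n * rating
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Toy ratings data: who rated what item, and how.
ratings = {
    "alice": {"A": 5.0, "B": 3.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0},
    "carol": {"B": 4.0, "C": 3.0, "D": 4.0},
}
counts = cooccurrence(ratings)
print(recommend("alice", ratings, counts))  # ['C', 'D']
```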
Accenture Data Platform Benchmark: Document Clustering
Groups similar documents
Application components used in many areas (e.g., search engines, e-commerce site optimization)
Corpus of crawled web pages -> Filtered and tokenized documents -> Term dictionary -> TF vectors -> TF-IDF vectors -> K-means -> Clustered documents
Dataset: CommonCrawl, 10 TB corpus*; ~31,000 ARC files or ~300 million HTML pages
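The pipeline from tokenized documents to clusters can be sketched in plain Python. This is a toy, single-machine illustration of TF-IDF weighting followed by k-means, not the benchmark's implementation. The IDF formula `log(N / df)`, Euclidean distance, and seeding the centroids with the first k documents are simplifying assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared dictionary."""
    dictionary = sorted({term for doc in docs for term in doc})
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            [tf[t] * math.log(len(docs) / doc_freq[t]) for t in dictionary]
        )
    return dictionary, vectors


def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "hadoop cluster node".split(),
    "cheap flight ticket".split(),
    "hadoop node storage".split(),
    "flight ticket deal".split(),
]
dictionary, vectors = tfidf_vectors(docs)
print(kmeans(vectors, k=2))  # [0, 1, 0, 1]
```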
Hadoop Deployment Comparison Study: Experiment Setup/Results
Experiment Setup: Price-Performance Ratio Comparison
1 bare-metal Hadoop cluster vs. 9 Amazon EMR clusters
Fixed budget for cluster size
Manual and automated tuning
Measure execution time of benchmark
Experiment Setup: Starfish Automated Performance Tuning Tool
Starfish (now Unravel) is an automated performance tuning tool for MapReduce jobs
For the experiment we ran each benchmark twice using Starfish: Profile phase, then Optimize phase
Measure execution time of optimize phase
Speedometer designed by Filippo Camedda from The Noun Project
Experiment Results: Starfish Automated Performance Tuning Tool
Manually tuned Sessionization workload: 2+ weeks of manual tuning, ½ to 1 day per iteration
Starfish tuned Recommendation Engine workload w/ 11 cascaded MapReduce jobs: 8x improvement in one tuning cycle
Achieve performance increases with less cost using Starfish
Experiment Results: Sessionization
Execution time in minutes by Amazon EMR configuration (number of instances in parentheses)
Bare-metal: 533
Instance type   ODI           RI             RI+SI
cc2.8xlarge     408.07 (13)   229.25 (28)    125.82 (53)
m2.4xlarge      381.55 (20)   204.10 (41)    166.82 (77)
m1.xlarge       250.13 (68)   172.23 (112)   114.35 (192)
Experiment Results: Recommendation Engine
Execution time in minutes by Amazon EMR configuration (number of instances in parentheses)
Bare-metal: 21.59
Instance type   ODI          RI            RI+SI
cc2.8xlarge     23.33 (13)   21.97 (28)    18.48 (53)
m2.4xlarge      20.13 (20)   19.97 (41)    16.92 (77)
m1.xlarge       14.28 (68)   16.30 (112)   15.08 (192)
Experiment Results: Document Clustering
Execution time in minutes by Amazon EMR configuration (number of instances in parentheses)
Bare-metal: 1186.37
Instance type   ODI            RI             RI+SI
cc2.8xlarge     1661.03 (13)   1157.37 (28)   784.82 (53)
m2.4xlarge      1649.98 (20)   1112.68 (41)   629.98 (77)
m1.xlarge       914.35 (68)    779.98 (112)   742.38 (192)
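Because every cluster was sized to the same budget, a shorter execution time translates directly into a better price-performance ratio. The sketch below takes the execution times from the three results slides and reports, per workload, how many of the nine Amazon EMR clusters beat the bare-metal cluster and the best speedup observed. The helper name `summarize` is illustrative.

```python
# Execution times in minutes, copied from the three results slides:
# workload -> (bare-metal time, the nine Amazon EMR times).
RESULTS = {
    "Sessionization": (533.0, [408.07, 229.25, 125.82, 381.55, 204.10,
                               166.82, 250.13, 172.23, 114.35]),
    "Recommendation Engine": (21.59, [23.33, 21.97, 18.48, 20.13, 19.97,
                                      16.92, 14.28, 16.30, 15.08]),
    "Document Clustering": (1186.37, [1661.03, 1157.37, 784.82, 1649.98,
                                      1112.68, 629.98, 914.35, 779.98, 742.38]),
}


def summarize(bare_metal, emr_times):
    """Return (number of EMR clusters faster than bare-metal, best speedup)."""
    faster = sum(1 for t in emr_times if t < bare_metal)
    best_speedup = bare_metal / min(emr_times)
    return faster, best_speedup


for workload, (bare_metal, emr_times) in RESULTS.items():
    faster, best = summarize(bare_metal, emr_times)
    print(f"{workload}: {faster} of {len(emr_times)} EMR clusters faster, "
          f"best speedup {best:.2f}x")
```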
Key Takeaways
Hadoop-as-a-Service offers a better price-performance ratio
Cloud expands the performance tuning opportunities
Automated performance tuning tools are a necessity
Acknowledgement
More details
Contact us for the full white paper: Hadoop Deployment Comparison Study
Michael Wendt
R&D Developer
Data Insights R&D
Accenture Technology Labs
(408) 817-2190
Scott Kurth
Group Lead
Data Insights R&D
Accenture Technology Labs
(408) 817-2775