Hadoop - Where did it come from and what's next?
Eric Baldeschwieler
Dec 02, 2014
Who is Eric14?
• Big data veteran (since 1996)
• Twitter handle: @jeric14
• Previously
  – CTO/CEO of Hortonworks
  – Yahoo - VP Hadoop Engineering
  – Yahoo & Inktomi – Web Search
• Grew up in Pasadena
What is Hadoop?
What is Apache Hadoop?
• Scalable
  – Efficiently store and process petabytes of data
  – Grows linearly by adding commodity computers
• Reliable
  – Self healing as hardware fails or is added
• Flexible
  – Store all types of data in many formats
  – Security, Multi-tenancy
• Economical
  – Commodity hardware
  – Open source software
THE open source big data platform
YARN – Computation Layer
• Many programming models
  – MapReduce, SQL, Streaming, ML…
• Multi-user, with queues, priorities, etc.
HDFS – Hadoop Distributed File System
• Data replicated on 3 computers
• Automatically replaces lost data / computers
• Very high bandwidth, not IOPS optimized
Hadoop hardware
• 10 to 4500 node clusters
  – 1-4 “master nodes”
  – Interchangeable workers
• Typical node
  – 1-2 U
  – 4-12 * 2-4TB SATA
  – 64GB RAM
  – 2 * 4-8 core, ~2GHz
  – 10Gb NIC
  – Single power supply
  – JBOD, not RAID, …
• Switches
  – 10 Gb to the node
  – 20-40 Gb to the core
  – Layer 2 or 3, simple
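As a rough sanity check on these hardware numbers, the sketch below estimates raw and usable HDFS capacity for a cluster of identical worker nodes. It is a back-of-the-envelope calculation, not a sizing tool: the function name `cluster_capacity_tb` and the example inputs are assumptions picked from the ranges on this slide, and the default replication factor of 3 comes from the HDFS description earlier in the deck. It ignores space for the OS, temporary data and compression.

```python
def cluster_capacity_tb(nodes, disks_per_node, tb_per_disk, replication=3):
    """Return (raw_tb, usable_tb) for a cluster of identical worker nodes.

    Hypothetical helper for illustration only: usable capacity is simply
    raw capacity divided by the number of replicas HDFS keeps of each block.
    """
    raw_tb = nodes * disks_per_node * tb_per_disk
    usable_tb = raw_tb / replication
    return raw_tb, usable_tb


# Example inputs are assumptions: 100 workers, 12 disks each, 4 TB per disk.
raw, usable = cluster_capacity_tb(nodes=100, disks_per_node=12, tb_per_disk=4)
print(f"raw: {raw} TB, usable with 3 replicas: {usable:.0f} TB")
```

With these assumed inputs, 4,800 TB of raw disk yields about 1,600 TB of replicated storage.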
Hadoop’s cost advantage
(From Richard McDougall, VMware, Hadoop Summit, 2012 talk)
• SAN Storage: $2 - $10/Gigabyte
  – $1M gets: 0.5 Petabytes, 1,000,000 IOPS, 1 Gbyte/sec
• NAS Filers: $1 - $5/Gigabyte
  – $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec
• Local Storage: $0.05/Gigabyte
  – $1M gets: 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
“And you get racks of free computers when you buy storage!” - Eric14
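The "$1M gets" capacity figures follow directly from the price per gigabyte. The sketch below redoes that arithmetic; the function name `petabytes_for_budget` is an assumption for illustration, decimal units (1 PB = 1,000,000 GB) are assumed, and the SAN and NAS prices used are the low ends of the quoted ranges, which is what the slide's capacity figures appear to correspond to.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def petabytes_for_budget(budget_usd, usd_per_gb):
    """Return how many petabytes a budget buys at a given price per gigabyte."""
    return budget_usd / usd_per_gb / GB_PER_PB


budget = 1_000_000
# Prices per GB taken from the slide; SAN and NAS use the low end of the range.
prices = {"SAN (low end)": 2.00, "NAS (low end)": 1.00, "Local storage": 0.05}
for name, price in prices.items():
    print(f"{name}: {petabytes_for_budget(budget, price):.1f} PB")
```

At the high end of the SAN and NAS ranges the same budget buys proportionally less, so the gap to local storage is even wider than the headline figures suggest.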
Where did Hadoop come from?
Early History
• 1995 – 2005
  – Yahoo! search team builds 4+ generations of systems to crawl & index the world wide web. 20 Billion pages!
• 2003 – 2004
  – Google publishes Google File System (2003) & MapReduce (2004) papers
• 2005
  – Yahoo! staffs Juggernaut, open source DFS & MapReduce
    • Compete / Differentiate via Open Source contribution!
    • Attract scientists – Become known center of big data excellence
    • Avoid building proprietary systems that will be obsolesced
    • Gain leverage of wider community building one infrastructure
  – Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!
• 2006
  – Juggernaut & Nutch join forces - Hadoop is born!
    • Nutch prototype used to seed new Apache Hadoop project
    • Yahoo! commits to scaling Hadoop, staffs Hadoop Team
Early Hadoop
[Diagram: HDFS and MapReduce layered on Physical Hardware]
Hadoop at Yahoo!
Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/
CASE STUDY: YAHOO SEARCH ASSIST™
© Yahoo 2011
                   Before Hadoop   After Hadoop
Time               26 days         20 minutes
Language           C++             Python
Development Time   2-3 weeks       2-3 days
• Database for Search Assist™ is built using Apache Hadoop
• Several years of log data
• 20 steps of MapReduce
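To make "steps of MapReduce" concrete, here is a minimal single-process sketch of one such step: counting how often each query appears in a log. It is not Yahoo's Search Assist pipeline and does not use the Hadoop API. The log format (one query per line), the function names and the sample data are all assumptions for illustration. Hadoop runs the same map, shuffle and reduce phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map: emit (query, 1) for every non-empty line of a hypothetical query log."""
    for line in log_lines:
        query = line.strip().lower()
        if query:
            yield query, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each query."""
    return {key: sum(values) for key, values in groups.items()}


def count_queries(log_lines):
    """Run one map -> shuffle -> reduce step over the log lines."""
    return reduce_phase(shuffle_phase(map_phase(log_lines)))


# Made-up sample data; real input would be years of log files stored in HDFS.
sample_log = ["hadoop", "Hadoop ", "yarn", "", "hdfs", "yarn", "hadoop"]
print(count_queries(sample_log))
```

A multi-step job chains steps like this one, with each step reading the output of the previous one.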
Hadoop beyond Yahoo!
Apache Hadoop
• …, early adopters (2006 – present)
  – Scale and productize Hadoop
• Other Internet Companies (2008 – present)
  – Add tools / frameworks, enhance Hadoop
  – …
• Service Providers (2010 – present)
  – Provide training, support, hosting
  – …Cloudera, MapR, Pivotal, IBM, Teradata, Microsoft, Google, RackSpace, Qubole, Altiscale
• Mass Adoption
Hadoop has seen off many competitors
• Every year I used to see 2-3 “Hadoop killers.” Hadoop kept growing and displacing them
  – Yahoo had 2 other internal competitors
  – Microsoft, LexisNexis, Alibaba, Baidu all had internal efforts
  – Various cloud technologies, HPC technologies
  – Various MPP DBs
• Various criticisms of Hadoop
  – Performance – Hadoop is too slow, it's in Java…
  – There is nothing here not in DBs for decades
  – It's not ACID, highly available, secure enough, …
Why has Hadoop triumphed?
• Deep investment from Yahoo
  – ~300 person years, web search veteran team
  – 1000s of users & 100s of use cases
  – Solved some of the world's biggest problems
• Community open source
  – Many additional contributors, now an entire industry
  – Apache Foundation provides continuity, clean IP
• The right economics
  – Open source, really works on commodity hardware
  – Yahoo has one Sys Admin per 8000 computers!
• Simple & reliable at huge scale
  – Assumes failure, detects it and works around it
  – Does not require expensive & complex highly available hardware
• Java!
  – Good tooling, garbage collection…
  – Made it easy to get early versions & new contributions working
  – Made it easy to build community – most common programming language
CASE STUDY: YAHOO! WEBMAP
• What is a WebMap?
  – Gigantic table of information about every web site, page and link Yahoo! knows about
  – Directed graph of the web
  – Various aggregated views (sites, domains, etc.)
  – Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.
• Why was it ported to Hadoop?
  – Custom C++ MapReduce solution was not scaling
  – Leverage scalability, load balancing and resilience of Hadoop infrastructure
  – Focus on application vs. infrastructure
CASE STUDY: WEBMAP PROJECT RESULTS
• 33% time savings over previous system on the same cluster (on Hadoop 0.18 or so)
• The map of the web is Big
  – Over 1000 computers in cluster
  – 100,000+ maps, ~10,000 reduces
  – ~70 hours runtime
  – ~300 TB shuffling
  – ~200 TB compressed output
• Moving data to Hadoop increased number of groups who could use the data
Hadoop Today
Ecosystem of products & services
• Data Processing: MapReduce, Pig, Spark, Cascading, …
• SQL: Hive, Impala, Spark, …
• Streaming: Storm, Samza, Spark, …
• Services: Slider, Twill, HBase, Sqoop, …
YARN
HDFS, Kafka, Hive Meta + HCat
Physical Hardware or Cloud Infrastructure
Hadoop use cases
• Low cost storage
• Data warehouse optimization
  – ETL, archival, science/discovery, replacement
• Horizontals
  – Web/App logs & Marketing
  – Business Intelligence, Analytics, ML
  – Security, Internet of things / machine logs
• Datalake (more on this in a minute)
• Verticals
  – Banking, finance, healthcare, government / IC
  – Petroleum / seismic, utilities, retail
  – Online: advertising, marketing, social, gaming
  – Science: Bio/genomics, seismic
  – …
CASE STUDY: YAHOO! HOMEPAGE
• SCIENCE HADOOP CLUSTER
  – Input: user behavior
  – Machine learning to build ever better categorization models
  – Output: categorization models (weekly)
• PRODUCTION HADOOP CLUSTER
  – Input: user behavior
  – Identify user interests using categorization models
  – Output: serving maps of users to interests (every 5 minutes)
• SERVING SYSTEMS
  – Build customized home pages with latest data (thousands / second)
  – Result: engaged users
Big data application model
• Interactive layer
  – Web & App Servers (ApacheD, Tomcat…)
  – Serving Store (Cassandra, MySQL, Riak…)
• Streaming layer
  – Message Bus (Kafka, Flume, Scribe…)
  – Streaming Engine (Storm, Spark, Samza…)
• Batch layer (Hadoop)
  – YARN (MapReduce, Pig, Hive, Spark…)
  – HDFS
How do you get Hadoop?
• Learning - Desktop VMs & cloud sandboxes
• Cloud Services
  – Amazon EMR, Microsoft HDInsight, Qubole…
• Private hosted cluster providers
  – Rackspace, Altiscale…
• Hadoop distributions
  – Hortonworks, Cloudera, …
  – On dedicated hardware, virtualized or cloud hosted
• Enterprise Vendors
  – IBM, Pivotal, Teradata, HP, SAP, Oracle, …
• DIY – Hadoop self supported
  – Apache Software Foundation
  – BigTop
Hadoop is still hard
• Are you ready for DIY supercomputing?
  – Designing & managing hardware, OS, software, network
  – Hadoop talent is scarce & expensive
• Many vendors with competing solutions
  – Distros, Clouds, SaaS, Enterprise Vendors, SIs…
• Solutions are best practices, not products
  – Ultimately you end up writing new software to solve your problems
So why deal with all this?
• You have hit a wall
  – You know you need a big data solution because your traditional solution is failing
    • Solution not technically feasible with traditional tools
    • Cost becomes prohibitive
• You are building a data business
  – You have lots of data and need a data innovation platform
  – You want technology that can grow with your business
• There are lots of success stories
  – Folks saving 10s of millions with Hadoop
  – Successful big data businesses with Hadoop at their core
Bringing Hadoop into your Org
• Start with small projects
  – Example quick wins:
    • Finding patients with “forgotten” chronic conditions
    • Predicting daily website peak load based on historic data
    • Moving archived documents & images into HBase
    • Reducing classic ETL costs
    • Running an existing tool in parallel on many records (documents, gene sequences, images…)
• Hardware
  – Public cloud can be cost effective
  – Otherwise 4-10 node clusters can do a lot, repurposing old gear is often effective for pilots
Build on your success
• After a few projects, capacity planning is more than guesswork
• Successes build organizational competence and confidence
• Grow incrementally
  – Add another project to the same cluster if possible
  – Each project that adds data adds value to your cluster
• Not unusual to see…
  – An enterprise team start with 5 nodes
  – Running on 10-20 a year later
  – Jump to 300 two years in
The Future
Prediction #1 – Things will get easier
• Huge ecosystem of Hadoop contributors
  – Major DIY Hadoop shops
  – Hadoop distributions
  – Cloud and hosting providers
  – Established enterprise players
  – Dozens of new startups
  – Researchers and hobbyists
• They are all investing in improving Hadoop
But, fragmentation!?
• The Hadoop market is clearly fragmented
  – E.g. Impala vs. Stinger vs. Spark vs. Hawq
  – All of the vendors push different collections of software
  – Almost everyone is pushing some proprietary modifications
  – This is confusing and costly for ISVs and users
• There is no obvious process by which things will converge
• What is this going to do to the ecosystem?
  – Is Hadoop going to lose a decade, like Unix?
Remember the Lost Unix Decade?
Thanks: http://www.unix.org/what_is_unix/flavors_of_unix.html
But what happened in that decade?
• Unix went from a niche OS to the OS
  – The client server & DB revolutions took Unix into enterprise
  – The .com revolution happened on Unix
• We built tools to deal with the fragmentation
• Competing vendors
  – Built compelling features to differentiate
    • and copied each other like mad
    • and worked to make it easy for people to switch to them
  – Evangelized Unix
• The world adopted Unix because
  – The new roughly standard API was valuable
  – Solutions to real problems were built and sold
Fragmentation is part of the process
• Looking at Unix, I think fragmentation was an inevitable and very productive part of the process
  – Life would have been simpler if a central planning committee could have just delivered the best possible Unix on day one
  – But a messy, evolutionary process drove success
• SQL databases & Web browsers followed a similar pattern
• Conclusions
  – Fragmentation is the result of an aggressively growing ecosystem
  – We should expect to see a lot more Hadoop innovation
  – A lot of the action is going to be in Hadoop applications
    • Vendors want to deliver simple, repeatable customer successes
    • Programming per customer is not in their economic interest
Prediction #2 – More Hadoop
• The Data Lake/Hub pattern is compelling for many enterprises
• New centralized data repository
  – Land and archive raw data from across the enterprise
  – Support data processing, cleaning, ETL, Reporting
  – Support data science and visualization
• Saves money
• Supports data centric innovation
DataLake – Integrating all your data
• Online: User-facing systems
  – Transactional: MySQL, Postgres, Oracle, …
  – NoSQL (Scaleout): Cassandra, MongoDB, CouchDB, Riak, …, ElasticSearch, …
• SQL Analytics: Business-facing systems
  – Warehouse: Teradata, IBM, Oracle, Redshift…
  – NewSQL: Vertica, SAP HANA, SqlServer (MDX…), Greenplum, Asterdata
• New Data Sources: web logs, sensors, email, multi-media, science, genetics, medical …
• Data flows: tables, logs, …; aggregates, reports, ETLed & cleaned data
• Data lake workloads: ETL, archival, data science, data production, ad hoc query, reporting
Science tools directly on data lake
Datalakes happen
• Time and again we see organizations move to this model
• Network effects
  – The more data you have in one place, the more uses you can find in combinations of data
• Yahoo built the first Datalake
  – With every new project we added new data
  – Each additional new project was easier & required less new data
• This can be done incrementally!
Prediction #3 – Cool new stuff
• Kafka – The Hadoop messaging bus
• YARN – Just starting!! Slider & services coming
• Spark – Data science, machine learning
• Faster via caching – Tachyon and LLAP
• Lots of new products, too many to list
  – Data science – OxData, DataBricks, Adatao…
  – …
-@jeric14
Thanks!
Questions?
Except where otherwise noted, this work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. CC Eric Baldeschwieler 2014