Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)

1

Hadoop - Where did it come from and what's next?

Eric Baldeschwieler

2

Who is Eric14?

• Big data veteran (since 1996)

• Twitter handle: @jeric14

• Previously

• CTO/CEO of Hortonworks

• Yahoo - VP Hadoop Engineering

• Yahoo & Inktomi – Web Search

• Grew up in Pasadena

3

What is Hadoop?

4

What is Apache Hadoop?

• Scalable
  – Efficiently store and process petabytes of data
  – Grows linearly by adding commodity computers
• Reliable
  – Self healing as hardware fails or is added
• Flexible
  – Store all types of data in many formats
  – Security, Multi-tenancy
• Economical
  – Commodity hardware
  – Open source software

THE open source big data platform

YARN – Computation Layer
• Many programming models
  • MapReduce, SQL, Streaming, ML…
• Multi-users, with queues, priorities, etc…

HDFS – Hadoop Distributed File System
• Data replicated on 3 computers
• Automatically replaces lost data / computers
• Very high bandwidth, not IOPS optimized
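The replication and self-healing behavior described for HDFS can be illustrated with a toy model. This is only a sketch under stated assumptions: the node names, the random placement, and the `place_blocks` and `fail_node` helpers are hypothetical and are not HDFS's actual placement policy or API. Only the replica count of 3 comes from the slide.

```python
import random

REPLICATION = 3  # replica count stated on the slide


def place_blocks(block_ids, nodes, rng):
    """Assign each block to REPLICATION distinct nodes chosen at random."""
    return {b: set(rng.sample(nodes, REPLICATION)) for b in block_ids}


def fail_node(placement, live_nodes, dead, rng):
    """Remove a dead node, then re-replicate every block that lost a copy."""
    live = [n for n in live_nodes if n != dead]
    for holders in placement.values():
        holders.discard(dead)
        # Copy the block to new nodes until it is fully replicated again.
        while len(holders) < REPLICATION:
            candidates = [n for n in live if n not in holders]
            holders.add(rng.choice(candidates))
    return live


rng = random.Random(0)                      # fixed seed for repeatability
nodes = [f"node{i}" for i in range(6)]      # hypothetical worker names
placement = place_blocks(range(10), nodes, rng)
nodes = fail_node(placement, nodes, "node2", rng)

for block, holders in placement.items():
    print(block, sorted(holders))
```

After the simulated failure every block is back to three copies, none of them on the failed node, which is the "assume failure and work around it" idea in miniature.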

5

Hadoop hardware

• 10 to 4500 node clusters
  – 1-4 “master nodes”
  – Interchangeable workers
• Typical node
  – 1-2 U
  – 4-12 * 2-4TB SATA
  – 64GB RAM
  – 2 * 4-8 core, ~2GHz
  – 10Gb NIC
  – Single power supply
  – JBOD, not RAID, …
• Switches
  – 10 Gb to the node
  – 20-40 Gb to the core
  – Layer 2 or 3, simple
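To get a feel for what these node specifications mean in storage terms, here is a rough sizing sketch. It assumes 3-way replication (the HDFS figure given earlier in the talk) and ignores space reserved for the operating system, temporary data, and compression; the function name is illustrative only.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3):
    """Raw disk capacity divided by the replication factor, in terabytes."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replication


# Smallest and largest configurations implied by the ranges on the slide.
small = usable_capacity_tb(nodes=10, disks_per_node=4, disk_tb=2)
large = usable_capacity_tb(nodes=4500, disks_per_node=12, disk_tb=4)

print(f"smallest cluster: {small:.1f} TB usable")
print(f"largest cluster:  {large / 1000:.0f} PB usable")
```

Under these assumptions the range spans from a few tens of terabytes to tens of petabytes of usable space.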

6

Hadoop’s cost advantage
(From Richard McDougall, VMware, Hadoop Summit, 2012 talk)

              SAN Storage          NAS Filers          Local Storage
Cost          $2 - $10/Gigabyte    $1 - $5/Gigabyte    $0.05/Gigabyte
$1M gets      0.5 Petabytes        1 Petabyte          20 Petabytes
IOPS          1,000,000            400,000             10,000,000
Bandwidth     1 Gbyte/sec          2 Gbyte/sec         800 Gbytes/sec

“And you get racks of free computers when you buy storage!”
- Eric14
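The capacity figures on this slide follow directly from the per-gigabyte prices. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1,000,000 GB) and the low end of each quoted price range for SAN and NAS; the function and variable names are illustrative:

```python
GB_PER_PB = 1_000_000  # decimal units, assumed


def petabytes_for_budget(budget_usd, usd_per_gb):
    """How many petabytes a budget buys at a given price per gigabyte."""
    return budget_usd / usd_per_gb / GB_PER_PB


budget = 1_000_000
prices = {"SAN": 2.00, "NAS": 1.00, "Local": 0.05}  # USD per gigabyte
capacity = {name: petabytes_for_budget(budget, p) for name, p in prices.items()}

for name, pb in capacity.items():
    print(f"{name}: {pb:.1f} PB per $1M")
```

The 40x gap between SAN and local disk at these prices is the core of the cost argument.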

7

Where did Hadoop come from?

8

Early History

• 1995 – 2005
  – Yahoo! search team builds 4+ generations of systems to crawl & index the world wide web. 20 Billion pages!
• 2004
  – Google publishes Google File System & MapReduce papers
• 2005
  – Yahoo! staffs Juggernaut, open source DFS & MapReduce
    • Compete / Differentiate via Open Source contribution!
    • Attract scientists – Become known center of big data excellence
    • Avoid building proprietary systems that will be obsolesced
    • Gain leverage of wider community building one infrastructure
  – Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!
• 2006
  – Juggernaut & Nutch join forces - Hadoop is born!
    • Nutch prototype used to seed new Apache Hadoop project
    • Yahoo! commits to scaling Hadoop, staffs Hadoop Team

9

Early Hadoop

HDFS

MapReduce

Physical Hardware

10

Hadoop at Yahoo!

Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/

twice the engagement

CASE STUDY: YAHOO SEARCH ASSIST™

11

© Yahoo 2011

                    Before Hadoop    After Hadoop
Time                26 days          20 minutes
Language            C++              Python
Development Time    2-3 weeks        2-3 days

• Database for Search Assist™ is built using Apache Hadoop
• Several years of log-data
• 20-steps of MapReduce
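To make "a step of MapReduce in Python" concrete, here is a single-machine sketch of one map, shuffle, and reduce pass that counts how often each query appears in a log. It is not the Search Assist pipeline: the tab-separated log format, the sample lines, and all function names are assumptions for illustration, and a real job would run distributed across a cluster rather than in one process.

```python
from collections import defaultdict


def mapper(log_line):
    """Emit (query, 1) for one log line of the assumed form 'user<TAB>query'."""
    _user, query = log_line.rstrip("\n").split("\t", 1)
    yield query.lower(), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum the counts for one query."""
    return key, sum(values)


def run_step(lines):
    """Run one map -> shuffle -> reduce pass over the input lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical sample log lines.
logs = ["u1\thadoop tutorial", "u2\tHadoop tutorial", "u3\thadoop summit"]
counts = run_step(logs)
print(counts)
```

A multi-step job chains passes like this one, with the output of each step becoming the input of the next.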

12

Hadoop beyond Yahoo!

Apache Hadoop

• Early adopters (2006 – present): Scale and productize Hadoop
• Other Internet Companies (2008 – present): Add tools / frameworks, enhance Hadoop
  – …
• Service Providers (2010 – present): Provide training, support, hosting
  – … Cloudera, MapR, Pivotal, IBM, Teradata, Microsoft, Google, RackSpace, Qubole, Altiscale

Mass Adoption

13

Hadoop has seen off many competitors

• Every year I used to see 2-3 “Hadoop killers.” Hadoop kept growing and displacing them
  – Yahoo had 2 other internal competitors
  – Microsoft, LexisNexis, Alibaba, Baidu all had internal efforts
  – Various cloud technologies, HPC technologies
  – Various MPP DBs
• Various criticisms of Hadoop
  – Performance – Hadoop is too slow, it’s in Java…
  – There is nothing here not in DBs for decades
  – It’s not ACID, highly available, secure enough, …

14

Why has Hadoop triumphed?

• Deep investment from Yahoo
  – ~300 person years, web search veteran team
  – 1000s of users & 100s of use cases
  – Solved some of the world’s biggest problems
• Community open source
  – Many additional contributors, now an entire industry
  – Apache Foundation provides continuity, clean IP
• The right economics
  – Open source, really works on commodity hardware
  – Yahoo has one Sys Admin per 8000 computers!
• Simple & reliable at huge scale
  – Assumes failure, detects it and works around it
  – Does not require expensive & complex highly available hardware
• Java!
  – Good tooling, garbage collection…
  – Made it easy to get early versions & new contributions working
  – Made it easy to build community – most common programming language


CASE STUDY: YAHOO! WEBMAP

15

© Yahoo 2011

• What is a WebMap?
  – Gigantic table of information about every web site, page and link Yahoo! knows about
  – Directed graph of the web
  – Various aggregated views (sites, domains, etc.)
  – Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.
• Why was it ported to Hadoop?
  – Custom C++ MapReduce solution was not scaling
  – Leverage scalability, load balancing and resilience of Hadoop infrastructure
  – Focus on application vs. infrastructure


CASE STUDY: WEBMAP PROJECT RESULTS

16

© Yahoo 2011

• 33% time savings over previous system on the same cluster (on Hadoop 0.18 or so)
• The map of the web is Big
  – Over 1000 computers in cluster
  – 100,000+ maps, ~10,000 reduces
  – ~70 hours runtime
  – ~300 TB shuffling
  – ~200 TB compressed output
• Moving data to Hadoop increased number of groups who could use the data

17

Hadoop Today

18

Hadoop Today

[Diagram: the Hadoop stack, top to bottom]

Ecosystem of products & services

• Data Processing: MapReduce, Pig, Spark, Cascading, …
• SQL: Hive, Impala, Spark, …
• Streaming: Storm, Samza, Spark, …
• Services: Slider, Twill, HBase, Sqoop, …

Hive Meta + HCat

YARN

HDFS, Kafka

Physical Hardware or Cloud Infrastructure

19

Hadoop use cases

• Low cost storage
• Data warehouse optimization
  – ETL, archival, science/discovery, replacement
• Horizontals
  – Web/App logs & Marketing
  – Business Intelligence, Analytics, ML
  – Security, Internet of things / machine logs
  – Datalake (more on this in a minute)
• Verticals
  – Banking, finance, healthcare, government / IC
  – Petroleum / seismic, utilities, retail
  – Online: advertising, marketing, social, gaming
  – Science: Bio/genomics, seismic
  – …

CASE STUDY: YAHOO! HOMEPAGE

20

[Diagram: homepage personalization pipeline]

• User behavior feeds the science and production Hadoop clusters
• Science Hadoop cluster
  – Machine learning to build ever better categorization models
  – Categorization models (weekly)
• Production Hadoop cluster
  – Identify user interests using categorization models
  – Serving maps, users - interests (every 5 minutes)
• Serving systems
  – Build customized home pages with latest data (thousands / second)
• Engaged users

© Yahoo 2011

21

Big data application model

• Interactive layer
  – Web & App Servers (ApacheD, Tomcat…)
  – Serving Store (Cassandra, MySQL, Riak…)
• Streaming layer
  – Message Bus (Kafka, Flume, Scribe…)
  – Streaming Engine (Storm, Spark, Samza…)
• Batch layer (Hadoop)
  – YARN (MapReduce, Pig, Hive, Spark…)
  – HDFS

22

How do you get Hadoop?

• Learning - Desktop VMs & cloud sandboxes
• Cloud Services
  – Amazon EMR, Microsoft HDInsight, Qubole…
• Private hosted cluster providers
  – Rackspace, Altiscale…
• Hadoop distributions
  – Hortonworks, Cloudera, …
  – On dedicated hardware, virtualized or cloud hosted
• Enterprise Vendors
  – IBM, Pivotal, Teradata, HP, SAP, Oracle, …
• DIY – Hadoop self supported
  – Apache Software Foundation
  – BigTop

23

Hadoop is still hard

• Are you ready for DIY supercomputing?
  – Designing & managing hardware, OS, software, net
  – Hadoop talent is scarce & expensive
• Many vendors with competing solutions
  – Distros, Clouds, SaaS, Enterprise Vendors, SIs…
• Solutions are best practices, not products
  – Ultimately you end up writing new software to solve your problems

24

So why deal with all this?

• You have hit a wall
  – You know you need a big data solution because your traditional solution is failing
    • Solution not technically feasible with traditional tools
    • Cost becomes prohibitive
• You are building a data business
  – You have lots of data and need a data innovation platform
  – You want technology that can grow with your business
• There are lots of success stories
  – Folks saving 10s of millions with Hadoop
  – Successful big data businesses with Hadoop at their core

25

Bringing Hadoop into your Org

• Start with small projects
  – Example quick wins:
    • Finding patients with “forgotten” chronic conditions
    • Predicting daily website peak load based on historic data
    • Moving archived documents & images into HBase
    • Reducing classic ETL costs
    • Running an existing tool in parallel on many records (documents, gene sequences, images…)
• Hardware
  – Public cloud can be cost effective
  – Otherwise 4-10 node clusters can do a lot, repurposing old gear is often effective for pilots

26

Build on your success

• After a few projects, capacity planning is more than guesswork
• Successes build organizational competence and confidence
• Grow incrementally
  – Add another project to the same cluster if possible
  – Each project that adds data adds value to your cluster
• Not unusual to see…
  – An enterprise team start with 5 nodes
  – Running on 10-20 a year later
  – Jumps to 300 two years in

27

The Future

28

Prediction #1 – Things will get easier

• Huge ecosystem of Hadoop contributors
  – Major DIY Hadoop shops
  – Hadoop distributions
  – Cloud and hosting providers
  – Established enterprise players
  – Dozens of new startups
  – Researchers and hobbyists
• They are all investing in improving Hadoop

29

But, fragmentation!?

• The Hadoop market is clearly fragmented
  – E.g. Impala vs. Stinger vs. Spark vs. HAWQ
  – All of the vendors push different collections of software
  – Almost everyone is pushing some proprietary modifications
  – This is confusing and costly for ISVs and users
• There is no obvious process by which things will converge
• What is this going to do to the ecosystem?
  – Is Hadoop going to lose a decade, like Unix?

30

Remember the Lost Unix Decade?

Thanks: http://www.unix.org/what_is_unix/flavors_of_unix.html

31

But what happened in that decade?

• Unix went from a niche OS to the OS
  – The client server & DB revolutions took Unix into enterprise
  – The .com revolution happened on Unix
• We built tools to deal with the fragmentation
• Competing vendors
  – Built compelling features to differentiate
    • and copied each other like mad
    • and worked to make it easy for people to switch to them
  – Evangelized Unix
• The world adopted Unix because
  – The new roughly standard API was valuable
  – Solutions to real problems were built and sold

32

Fragmentation is part of the process

• Looking at Unix, I think fragmentation was an inevitable and very productive part of the process
  – Life would have been simpler if a central planning committee could have just delivered the best possible Unix on day one
  – But a messy, evolutionary process drove success
• SQL databases & Web browsers followed a similar pattern
• Conclusions
  – Fragmentation is the result of an aggressively growing ecosystem
  – We should expect to see a lot more Hadoop innovation
  – A lot of the action is going to be in Hadoop applications
    • Vendors want to deliver simple, repeatable customer successes
    • Programming per customer is not in their economic interest

33

Prediction #2 – More Hadoop

• The Data Lake/Hub pattern is compelling for many enterprises
• New centralized data repository
  – Land and archive raw data from across the enterprise
  – Support data processing, cleaning, ETL, Reporting
  – Support data science and visualization
• Saves money
• Supports data centric innovation

34

DataLake – Integrating all your data

[Diagram: data flows into and out of the data lake]

• Online: User-facing systems
  – NoSQL (Scaleout): Cassandra, MongoDB, CouchDB, Riak, …, ElasticSearch, …
  – Transactional: MySQL, Postgres, Oracle, …
• SQL Analytics: Business-facing systems
  – Warehouse: Teradata, IBM, Oracle, Redshift…
  – NewSQL: Vertica, SAP HANA, SqlServer (MDX…), Greenplum, Asterdata
• Into the data lake
  – Tables, logs, …
  – New Data Sources: web logs, sensors, email, multi-media, science, genetics, medical …
• Out of the data lake
  – Aggregates, reports, ETLed & cleaned data
• On the data lake
  – ETL, Archival
  – Data Science, Data production
  – Ad hoc query, Reporting

35

Science tools directly on data lake

36

Datalakes happen

• Time and again we see organizations move to this model

• Network effects
  – The more data you have in one place, the more uses you can find in combinations of data
• Yahoo built the first Datalake
  – With every new project we added new data
  – Each additional new project was easier & required less new data
• This can be done incrementally!

37

Prediction #3 – Cool new stuff

• Kafka – The Hadoop messaging bus
• YARN – Just starting!! Slider & services coming
• Spark – Data science, machine learning
• Faster via caching – Tachyon and LLAP
• Lots of new products, too many to list
  – Data science – OxData, DataBricks, Adatao…
  – …

38

-@jeric14

Thanks!

Questions?

39

Except where otherwise noted, this work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. CC Eric Baldeschwieler 2014

