Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

Aug 20, 2015

Uwe Seiler
Page 1: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13

Introduction to the Hadoop Ecosystem

uweseiler

Page 2: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 About me

• Big Data Nerd
• Travelpirate
• Photography Enthusiast
• Hadoop Trainer
• MongoDB Author

Page 3: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 About us

is a bunch of…

Big Data Nerds Agile Ninjas Continuous Delivery Gurus

Enterprise Java Specialists Performance Geeks

Join us!

Page 4: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Agenda

• What is Big Data & Hadoop?

• Core Hadoop

• The Hadoop Ecosystem

• Use Cases

• What’s next? Hadoop 2.0!

Page 5: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Agenda

• What is Big Data & Hadoop?

• Core Hadoop

• The Hadoop Ecosystem

• Use Cases

• What’s next? Hadoop 2.0!

Page 6: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

Big Data

Page 7: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

Big Data is like teenage sex: everybody talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…

Page 8: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)

Page 19: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 My favorite definition

Page 20: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 The classic definition

The 3 V’s of Big Data
• Volume
• Velocity
• Variety

Page 21: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13

«Big Data» != Hadoop

Page 22: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

NoSQL

Page 23: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Classification of NoSQL

• Key-Value Stores
• Column Stores
• Graph Databases
• Document Stores

Page 24: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

Horizontal Scaling

Page 25: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Vertical Scaling

[Figure, built up over pages 25 to 27: a single machine with RAM, CPU, and Storage]

Page 28: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Horizontal Scaling

[Figure, built up over pages 28 to 30: a growing number of machines, each with its own RAM, CPU, and Storage]

Page 31: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Why Hadoop?

Traditional data stores are expensive to scale and, by design, difficult to distribute

Scale out is the way to go!

Page 32: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 How to scale data?

[Figure: "Data" is read (r) by several workers in parallel; each worker writes (w) its share of the "Result"]
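The worker picture can be sketched in plain Java without any Hadoop involved. This is only an illustration under assumptions: the class name `ParallelSum`, the choice of summing integers as the "work", and the slicing scheme are all invented here and are not part of the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: the "data" is an int array, the "result" is its sum.
final class ParallelSum {
    // Splits data into one slice per worker, sums the slices in parallel,
    // then merges the partial sums into a single result.
    static long sum(final int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int sliceSize = (data.length + workers - 1) / workers; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(w * sliceSize, data.length);
                final int to = Math.min(from + sliceSize, data.length);
                partials.add(pool.submit(() -> {
                    long sliceSum = 0;
                    for (int i = from; i < to; i++) {
                        sliceSum += data[i];
                    }
                    return sliceSum;
                }));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until that worker is done
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4, 5, 6, 7}, 3)); // prints 28
    }
}
```

Even this toy has to deal with slicing, waiting for workers, and failures on a single machine, which hints at why doing the same across many machines needs a framework.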

Page 33: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 But…

Parallel processing is complicated!

Page 34: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 But…

Data storage is not trivial!

Page 35: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 What is Hadoop?

Distributed Storage and Computation Framework

Page 36: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 What is Hadoop?

Hadoop != Database

Page 37: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 What is Hadoop?

“Swiss army knife of the 21st century”

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop

Page 38: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 The Hadoop App Store

HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra

Chukwa

Intel

Sync

Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop

Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC

IBM Talend TeraData Pivotal Informat Microsoft. Pentaho Jasper

Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat

Page 39: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 The Hadoop App Store

Functionality: less → more

Apache Hadoop
• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN

Hadoop Distributions (Apache Hadoop +)
• Test & Packaging
• Installation
• Monitoring
• Business Support

Big Data Suites (Hadoop Distributions +)
• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors

Page 40: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Agenda

• What is Big Data & Hadoop?

• Core Hadoop

• The Hadoop Ecosystem

• Use Cases

• What’s next? Hadoop 2.0!

Page 41: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Data Storage

OK, first things first!

I want to store all of my <<Big Data>>

Page 42: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Data Storage

Page 43: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop Distributed File System

• Distributed file system for redundant storage

• Designed to reliably store data on commodity hardware

• Built to expect hardware failures

Page 44: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop Distributed File System

Intended for
• large files
• batch inserts

Page 45: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 HDFS Architecture

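[Figure: HDFS architecture]
• NameNode (Master): keeps the Block Map and the Journal Log
• Secondary NameNode (Helper): periodic merges
• DataNodes (Slaves), spread across Rack 1 and Rack 2
• Client: writes a File, which is split into blocks (#1, #2); copies of block #1 are stored on several DataNodes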
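The block map idea can be illustrated with a small in-memory model. This is a toy sketch, not the HDFS API: `ToyNameNode`, its methods, the DataNode names, and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a NameNode's block map; not the HDFS API.
final class ToyNameNode {
    private final List<String> dataNodes;
    private final int replication;
    // Block map: block id -> DataNodes that hold a replica of that block
    private final Map<String, List<String>> blockMap = new LinkedHashMap<>();
    private int nextNode = 0;

    ToyNameNode(List<String> dataNodes, int replication) {
        if (replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("replication must be between 1 and the number of DataNodes");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.replication = replication;
    }

    // Splits a file of fileSize bytes into blocks of at most blockSize bytes and
    // places each block on `replication` distinct DataNodes (simple round-robin).
    List<String> addFile(String fileName, long fileSize, long blockSize) {
        long blockCount = (fileSize + blockSize - 1) / blockSize; // ceiling division
        List<String> blockIds = new ArrayList<>();
        for (long b = 1; b <= blockCount; b++) {
            String blockId = fileName + "#" + b;
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((nextNode + r) % dataNodes.size()));
            }
            nextNode = (nextNode + 1) % dataNodes.size();
            blockMap.put(blockId, replicas);
            blockIds.add(blockId);
        }
        return blockIds;
    }

    // Answers a client's question: which DataNodes hold this block?
    List<String> locate(String blockId) {
        return blockMap.get(blockId);
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode(java.util.Arrays.asList("dn1", "dn2", "dn3", "dn4"), 3);
        for (String blockId : nameNode.addFile("file", 200L * 1024 * 1024, 128L * 1024 * 1024)) {
            System.out.println(blockId + " -> " + nameNode.locate(blockId));
        }
    }
}
```

Losing one DataNode in this model still leaves two replicas of every block, which is the point of the redundant storage described on the previous slides.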

Page 46: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 HDFS

Let’s have a look…

Page 47: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Data Processing

Data stored, check!

Now I want to create insights from my data!

Page 48: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Data Processing

Page 49: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 MapReduce

• Programming model for distributed computations at a massive scale

• Execution framework for organizing and performing such computations

• Data locality is king

Page 50: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Typical large-data problem

• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output

Page 51: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 MapReduce Flow

[Figure: input key-value pairs (labels not legible in this transcript) are fed to four Map tasks]

Map output:     a 1, b 2 | c 3, c 6 | a 3, c 2 | b 7, c 8
Combine:        a 1, b 2 | c 9 | a 3, c 2 | b 7, c 8
Partition, then Shuffle and Sort
Reduce input:   a: 1, 3 | b: 2, 7 | c: 2, 8, 9
Reduce output:  a 4 | b 9 | c 19
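The numbers on this slide can be replayed with a small in-memory simulation. This is a sketch under assumptions: `FlowDemo` and its methods are invented for illustration, everything runs in one JVM, partitioning is left out, and the two map outputs that are unreadable in the transcript are taken to be `a 1` and `b 2`, which is what the reducer totals imply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory replay of the combine / shuffle / reduce steps.
final class FlowDemo {
    // A single intermediate key-value pair emitted by a mapper
    static final class Pair {
        final String key;
        final int value;
        Pair(String key, int value) { this.key = key; this.value = value; }
    }

    // Combine: pre-aggregate the values per key within ONE mapper's output
    static Map<String, Integer> combine(List<Pair> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Pair pair : mapOutput) {
            combined.merge(pair.key, pair.value, Integer::sum);
        }
        return combined;
    }

    // Shuffle and sort: collect the values of all mappers per key, keys sorted
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            for (Map.Entry<String, Integer> entry : output.entrySet()) {
                grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    static Map<String, Integer> run(List<List<Pair>> mapOutputs) {
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Pair> mapOutput : mapOutputs) {
            combined.add(combine(mapOutput));
        }
        return reduce(shuffle(combined));
    }

    // The four mapper outputs as read from the slide (a 1 and b 2 are reconstructed)
    static List<List<Pair>> slideExample() {
        return Arrays.asList(
            Arrays.asList(new Pair("a", 1), new Pair("b", 2)),
            Arrays.asList(new Pair("c", 3), new Pair("c", 6)),
            Arrays.asList(new Pair("a", 3), new Pair("c", 2)),
            Arrays.asList(new Pair("b", 7), new Pair("c", 8)));
    }

    public static void main(String[] args) {
        System.out.println(run(slideExample())); // prints {a=4, b=9, c=19}
    }
}
```

The combiner only changes the second mapper's output (c 3 and c 6 become c 9); the final totals are the same with or without it, it just reduces what has to be shuffled.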

Page 52: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Combined Hadoop Architecture

[Figure: combined Hadoop architecture]
• Client: submits a File (HDFS) and a Job (MapReduce)
• Master: NameNode and JobTracker
• Helper: Secondary NameNode
• Three Slaves, each running a TaskTracker (with a Task) and a DataNode (with a Block)

Page 53: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Word Count Mapper in Java

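import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Reused for every emitted pair instead of allocating new objects per token
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input line; value holds the text of the line
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one); // emit (word, 1)
        }
    }
}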

Page 54: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Word Count Reducer in Java

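import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // Called once per word with all counts that were emitted for it
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum)); // emit (word, total)
    }
}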
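To try the word count logic without a cluster, the same two steps can be mimicked in plain Java. This is a sketch under assumptions: `LocalWordCount` is an invented stand-in that uses no Hadoop classes, runs in a single JVM, and only mirrors the tokenising and summing of the mapper and reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the word count job; no Hadoop classes involved.
final class LocalWordCount {
    // Map step: same tokenisation as the mapper, one emitted word per token
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    // Reduce step: same summing as the reducer
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Glue that the framework normally provides: group the emitted ones by word
    static Map<String, Integer> count(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : map(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("to be or not to be"))); // prints {be=2, not=1, or=1, to=2}
    }
}
```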

Page 55: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Map/Reduce

Let’s have a look…

Page 56: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Agenda

• What is Big Data & Hadoop?

• Core Hadoop

• The Hadoop Ecosystem

• Use Cases

• What’s next? Hadoop 2.0!

Page 57: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Scripting for Hadoop

Java for MapReduce? I dunno, dude…

I’m more of a scripting guy…

Page 58: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Scripting for Hadoop

Page 59: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Apache Pig

• High-level data flow language

• Made of two components:
  – Data processing language Pig Latin
  – Compiler to translate Pig Latin to MapReduce

Page 60: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pig in the Hadoop ecosystem

• HDFS: Hadoop Distributed File System
• MapReduce: Distributed Programming Framework
• HCatalog: Metadata Management
• Pig: Scripting

Page 61: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pig Latin

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);

pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);

filteredUsers = FILTER users BY age >= 18 and age <=50;

joinResult = JOIN filteredUsers BY name, pages by user;

grouped = GROUP joinResult BY url;

summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks;

sorted = ORDER summed BY clicks desc;

top10 = LIMIT sorted 10;

STORE top10 INTO 'top10sites';
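For comparison, the same pipeline can be written against in-memory collections in plain Java. This is a rough sketch under assumptions: `TopSites`, the sample users, and the sample URLs are invented; user names are assumed to be unique; ties are broken by URL, which the Pig script does not specify; and nothing here is distributed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-JVM equivalent of the Pig Latin script above.
final class TopSites {
    // ageByUser plays the role of users.txt, pageViews ({user, url}) of pages.txt
    static List<Map.Entry<String, Long>> top(Map<String, Integer> ageByUser,
                                             List<String[]> pageViews, int limit) {
        Map<String, Long> clicksByUrl = new HashMap<>();
        for (String[] view : pageViews) {
            Integer age = ageByUser.get(view[0]);            // JOIN pages BY user, users BY name
            if (age != null && age >= 18 && age <= 50) {     // FILTER users BY age
                clicksByUrl.merge(view[1], 1L, Long::sum);   // GROUP BY url + COUNT
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(clicksByUrl.entrySet());
        Comparator<Map.Entry<String, Long>> byClicksDesc =
                Map.Entry.<String, Long>comparingByValue().reversed();
        // ORDER BY clicks DESC; equal click counts are ordered by URL (an added assumption)
        sorted.sort(byClicksDesc.thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size()))); // LIMIT
    }

    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 30);
        ages.put("bob", 17);
        ages.put("carol", 45);
        List<String[]> views = Arrays.asList(
                new String[] {"alice", "a.example"},
                new String[] {"carol", "a.example"},
                new String[] {"bob", "a.example"},
                new String[] {"alice", "b.example"});
        for (Map.Entry<String, Long> site : top(ages, views, 10)) {
            System.out.println(site.getKey() + " " + site.getValue());
        }
    }
}
```

This version fits on a page only because everything is held in memory on one machine; the Pig script stays the same size when the inputs no longer fit.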

Page 62: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pig Execution Plan

Page 63: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Try that with Java…

Page 64: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pig

Let’s have a look…

Page 65: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 SQL for Hadoop

OK, Pig seems quite useful…

But I’m more of a SQL person…

Page 66: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 SQL for Hadoop

Page 67: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Apache Hive

• Data Warehousing Layer on top of Hadoop

• Allows analysis and queries using a SQL-like language

Page 68: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hive in the Hadoop ecosystem

• HDFS: Hadoop Distributed File System
• MapReduce: Distributed Programming Framework
• HCatalog: Metadata Management
• Pig: Scripting
• Hive: Query

Page 69: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hive Architecture

[Figure: Hive architecture]
• Clients: Thrift applications (Hive Thrift Driver), JDBC applications (Hive JDBC Driver), ODBC applications (Hive ODBC Driver), and the Hive Shell
• Hive: Hive Server, Hive Engine, Meta-store
• Underneath: MapReduce and HDFS

Page 70: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hive Example

CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);

LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;

Page 71: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hive

Let’s have a look…

Page 72: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 But wait, there’s still more!

More components of the Hadoop Ecosystem

Page 73: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13

• HDFS: Data storage
• MapReduce: Data processing
• HCatalog: Metadata management
• Pig: Scripting
• Hive: SQL-like queries
• HBase: NoSQL database
• Mahout: Machine learning
• ZooKeeper: Cluster coordination
• Sqoop: Import & export of relational data
• Flume: Import & export of data flows
• Ambari: Cluster installation & management
• Oozie: Workflow automation

Page 74: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Agenda

• What is Big Data & Hadoop?

• Core Hadoop

• The Hadoop Ecosystem

• Use Cases

• What’s next? Hadoop 2.0!

Page 75: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Classical enterprise platform

• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …)
• Applications: Business Intelligence, Business Applications, Custom Applications
• Operation: Manage & Monitor
• Dev Tools: Build & Test

Page 76: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Big Data Platform

• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Applications: Business Intelligence, Business Applications, Custom Applications
• Operation: Manage & Monitor
• Dev Tools: Build & Test

Page 77: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pattern #1: Refine data

• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Applications: Business Intelligence, Business Applications, Custom Applications

1. Capture all data
2. Process the data
3. Exchange using traditional systems
4. Process & visualize with traditional applications

Page 78: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pattern #2: Explore data

• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Applications: Business Intelligence, Business Applications, Custom Applications

1. Capture all data
2. Process the data
3. Explore the data using applications with support for Hadoop

Page 79: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pattern #3: Enrich data

• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Applications: Business Applications, Custom Applications

1. Capture all data
2. Process the data
3. Directly ingest the data

Page 80: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Bringing it all together…

One example…

Page 81: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Digital Advertising

• 6 billion ad deliveries per day

• Reports (and bills) for the advertising companies needed

• Own C++ solution did not scale

• Adding functions was a nightmare

Page 82: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 AdServing Architecture

[Figure: AdServing architecture. Components shown:]
• Two Ad Servers (FFM, AMS), each with local files in a binary log format and a TCP interface
• Two custom Flume sources and a Flume HDFS sink
• Campaign Database, synchronised as campaign data
• Hadoop Cluster with Pig and Hive, holding temporary data
• NAS with aggregated data
• Report Engine and Direct Download
• Job Scheduler, started from a Config UI via a Job Config XML

Page 83: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 What’s next?

Hadoop 2.0 aka YARN

Page 84: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 1.0

Built for web-scale batch apps

[Figure: several separate single-app batch systems, each on top of HDFS]

Page 85: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 MapReduce is good for…

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive datasets

• Analyzing an entire large dataset

Page 86: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 MapReduce is OK for…

• Iterative jobs (e.g., graph algorithms)
  – Each iteration must read/write data to disk
  – I/O and latency cost of an iteration is high

Page 87: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 MapReduce is not good for…

• Jobs that need shared state/coordination
  – Tasks are shared-nothing
  – Shared state requires a scalable state store

• Low-latency jobs

• Jobs on small datasets

• Finding individual records

Page 88: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 MapReduce limitations

• Scalability
  – Maximum cluster size: ~4,500 nodes
  – Maximum concurrent tasks: 40,000
  – Coarse synchronization in JobTracker
• Availability
  – Failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms and services
  – Iterative applications implemented using MapReduce are 10x slower

Page 89: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0: Next-gen platform

Hadoop 1.0: single use system (batch apps)
• MapReduce: cluster resource mgmt. + data processing
• HDFS: redundant, reliable storage

Hadoop 2.0: multi-purpose platform (batch, interactive, streaming, …)
• MapReduce and others: data processing
• YARN: cluster resource management
• HDFS 2.0: redundant, reliable storage

Page 90: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Taking Hadoop beyond batch

Applications run natively in Hadoop

• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …

All of them sit on YARN (cluster resource management) and HDFS 2.0 (redundant, reliable storage).

Store all data in one place. Interact with data in multiple ways.

Page 91: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 A brief history of Hadoop 2.0

• Originally conceived & architected by the team at Yahoo!
  – Arun Murthy created the original JIRA in 2008 and is now the YARN release manager
• The team at Hortonworks has been working on YARN for 4 years
  – 90% of code from Hortonworks & Yahoo!
• Hadoop 2.0 based architecture running at scale at Yahoo!
  – Deployed on 35,000 nodes for 6+ months

Page 92: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Projects

• YARN

• HDFS Federation aka HDFS 2.0

• Stinger & Tez aka Hive 2.0

Page 93: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Projects

• YARN

• HDFS Federation aka HDFS 2.0

• Stinger & Tez aka Hive 2.0

Page 94: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 YARN: Architecture

Split up the two major functions of the JobTracker: cluster resource management & application life-cycle management

[Figure: a ResourceManager with its Scheduler above a grid of NodeManagers. Application Master AM 1 runs with Containers 1.1 and 1.2; AM 2 runs with Containers 2.1, 2.2, and 2.3.]

Page 95: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 YARN: Architecture

• Resource Manager
  – Global resource scheduler
  – Hierarchical queues
• Node Manager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• Application Master
  – Per-application
  – Manages application scheduling and task execution
  – e.g. MapReduce Application Master
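How a global scheduler hands out containers on per-machine agents can be sketched with a toy model. This is an illustration under assumptions, not the YARN API: `ToyResourceManager`, its methods, the memory-only resource model, and the first-fit policy are invented here, and queues are left out entirely.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of container allocation; not the YARN API.
final class ToyResourceManager {
    // NodeManager name -> memory (MB) that is still free on that machine
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();
    private final Map<String, String> nodeOfContainer = new HashMap<>();
    private final Map<String, Integer> sizeOfContainer = new HashMap<>();

    void registerNodeManager(String node, int memoryMb) {
        freeMemoryMb.put(node, memoryMb);
    }

    // First-fit: returns the NodeManager that hosts the container,
    // or null when no machine has enough free memory.
    String allocate(String containerId, int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMemoryMb.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb);
                nodeOfContainer.put(containerId, node.getKey());
                sizeOfContainer.put(containerId, memoryMb);
                return node.getKey();
            }
        }
        return null;
    }

    // Called when a container finishes: its memory becomes available again
    void release(String containerId) {
        String node = nodeOfContainer.remove(containerId);
        if (node != null) {
            freeMemoryMb.merge(node, sizeOfContainer.remove(containerId), Integer::sum);
        }
    }

    int free(String node) {
        return freeMemoryMb.get(node);
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        rm.registerNodeManager("nm1", 4096);
        rm.registerNodeManager("nm2", 2048);
        System.out.println(rm.allocate("AM 1", 1024));          // prints nm1
        System.out.println(rm.allocate("Container 1.1", 3072)); // prints nm1
        System.out.println(rm.allocate("Container 1.2", 2048)); // prints nm2
    }
}
```

In this model the Application Master is just the first container of an application; asking for further containers and deciding what runs in them would be its job, not the ResourceManager's.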

Page 96: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 YARN: Architecture

[Figure: a ResourceManager with its Scheduler above a grid of NodeManagers that host containers of several applications side by side:]
• MapReduce 1: map 1.1, map 1.2, reduce 1.1
• MapReduce 2: map 2.1, map 2.2, map 2.3, reduce 2.1, reduce 2.2
• Tez: vertex 1, vertex 2, vertex 3, vertex 4
• HOYA: HBase Master, Region server 1, Region server 2, Region server 3
• Storm: nimbus 1, nimbus 2

Page 97: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Projects

• YARN

• HDFS Federation aka HDFS 2.0

• Stinger & Tez aka Hive 2.0

Page 98: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 HDFS Federation

• Removes tight coupling of Block Storage and Namespace

• Scalability & Isolation

• High Availability

• Increased performance

Details: https://issues.apache.org/jira/browse/HDFS-1052

Page 99: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 HDFS Federation: Architecture

• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace
• DataNodes can store blocks managed by any NameNode

[Figure: HDFS Federation]
• NameNode 1: Namespace 1 (logs, finance), Block Management 1 (blocks 1 to 4)
• NameNode 2: Namespace 2 (insights, reports), Block Management 2 (blocks 5 to 8)
• DataNode 1, DataNode 2, DataNode 3, DataNode 4
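Since the NameNodes do not talk to each other, something on the client side has to decide which NameNode owns a given path. A minimal sketch of such a lookup follows; `ToyMountTable`, the path prefixes, and the NameNode names are assumptions for illustration and not Hadoop's own client-side mount table implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side lookup: path prefix -> responsible NameNode.
final class ToyMountTable {
    private final Map<String, String> nameNodeByPrefix = new LinkedHashMap<>();

    void mount(String pathPrefix, String nameNode) {
        nameNodeByPrefix.put(pathPrefix, nameNode);
    }

    // Returns the NameNode of the longest matching prefix, or null if no slice matches.
    String resolve(String path) {
        String bestPrefix = null;
        for (String prefix : nameNodeByPrefix.keySet()) {
            // Match whole path components only: /logs matches /logs/x but not /logsarchive
            boolean matches = path.equals(prefix) || path.startsWith(prefix + "/");
            if (matches && (bestPrefix == null || prefix.length() > bestPrefix.length())) {
                bestPrefix = prefix;
            }
        }
        return bestPrefix == null ? null : nameNodeByPrefix.get(bestPrefix);
    }

    public static void main(String[] args) {
        ToyMountTable table = new ToyMountTable();
        table.mount("/logs", "NameNode 1");
        table.mount("/finance", "NameNode 1");
        table.mount("/insights", "NameNode 2");
        table.mount("/reports", "NameNode 2");
        System.out.println(table.resolve("/logs/2013/11/07.log")); // prints NameNode 1
        System.out.println(table.resolve("/reports/q3"));          // prints NameNode 2
    }
}
```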

Page 100: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 HDFS: Quorum based storage

[Figure: an Active NameNode and a Standby NameNode, each with its own Block Map and Edits File, three Journal Nodes, and five DataNodes]

• Only the active NameNode writes edits
• The state is shared on a quorum of journal nodes
• The standby simultaneously reads and applies the edits
• DataNodes report to both NameNodes but listen only to the orders from the active one
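The quorum idea can be shown with a toy journal: an edit counts as written only when a majority of journal nodes stored it, and a reader that merges the logs of any majority is then certain to see every such edit. This is a sketch under assumptions: `ToyQuorumJournal` and its methods are invented, journal nodes are in-memory maps, and recovery, fencing, and the real wire protocol are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical toy model of quorum-based edit storage; not the HDFS implementation.
final class ToyQuorumJournal {
    // One log per journal node: transaction id -> edit
    private final List<TreeMap<Long, String>> journalNodes = new ArrayList<>();
    private final boolean[] reachable;
    private long nextTxId = 1;

    ToyQuorumJournal(int nodeCount) {
        reachable = new boolean[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            journalNodes.add(new TreeMap<>());
            reachable[i] = true;
        }
    }

    void setReachable(int node, boolean up) {
        reachable[node] = up;
    }

    private boolean hasQuorum() {
        int up = 0;
        for (boolean r : reachable) {
            if (r) up++;
        }
        return up > journalNodes.size() / 2; // strict majority
    }

    // Active NameNode: the edit is accepted only if a majority can store it
    boolean write(String edit) {
        if (!hasQuorum()) {
            return false;
        }
        for (int i = 0; i < journalNodes.size(); i++) {
            if (reachable[i]) {
                journalNodes.get(i).put(nextTxId, edit);
            }
        }
        nextTxId++;
        return true;
    }

    // Standby NameNode: merges the logs of a majority, ordered by transaction id
    List<String> read() {
        if (!hasQuorum()) {
            throw new IllegalStateException("no quorum of journal nodes reachable");
        }
        TreeMap<Long, String> merged = new TreeMap<>();
        for (int i = 0; i < journalNodes.size(); i++) {
            if (reachable[i]) {
                merged.putAll(journalNodes.get(i));
            }
        }
        return new ArrayList<>(merged.values());
    }

    public static void main(String[] args) {
        ToyQuorumJournal journal = new ToyQuorumJournal(3);
        journal.write("mkdir /logs");
        journal.setReachable(2, false);          // one journal node fails
        journal.write("create /logs/a");         // still accepted: 2 of 3
        System.out.println(journal.read());      // prints [mkdir /logs, create /logs/a]
    }
}
```

With three journal nodes, one can fail without blocking the active NameNode; with two down, writes are refused rather than risking a state the standby cannot reconstruct.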

Page 101: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Projects

• YARN

• HDFS Federation aka HDFS 2.0

• Stinger & Tez aka Hive 2.0

Page 102: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hive: Current Focus Area

Latency     Category          Typical workloads
0-5s        Real-Time         Online systems, R-T analytics, CEP
5s – 1m     Interactive       Parameterized reports, drilldown, visualization, exploration
1m – 1h     Non-Interactive   Data preparation, incremental batch processing, dashboards / scorecards
1h+         Batch             Operational batch processing, enterprise reports, data mining

Axis label: Data Size
Highlighted region: Current Hive Sweet Spot

Page 103: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Stinger: Extending the sweet spot

Latency     Category          Typical workloads
0-5s        Real-Time         Online systems, R-T analytics, CEP
5s – 1m     Interactive       Parameterized reports, drilldown, visualization, exploration
1m – 1h     Non-Interactive   Data preparation, incremental batch processing, dashboards / scorecards
1h+         Batch             Operational batch processing, enterprise reports, data mining

Axis label: Data Size
Highlighted region: Future Hive Expansion

Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (eliminates M/R latency)

Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases

Page 104: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Stinger Initiative at a glance

Page 105: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Tez: The Execution Engine

• Low level data-processing execution engine

• Serves as the base for MapReduce, Hive, Pig, etc.

• Enables pipelining of jobs

• Removes task and job launch times

• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline

• Does not write intermediate output to HDFS
  – Much lighter disk and network usage

• Built on YARN

Page 106: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Pig/Hive MR vs. Pig/Hive Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

• Pig/Hive - MR: Job 1, Job 2, and Job 3, separated by two I/O synchronization barriers
• Pig/Hive - Tez: a single job

Page 107: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Tez Service

• MapReduce query startup is expensive:
  – Job-launch & task-launch latencies are fatal for short queries (on the order of 5s to 30s)
• Solution:
  – Tez Service (= preallocated Application Master)
    • Removes job-launch overhead (Application Master)
    • Removes task-launch overhead (pre-warmed containers)
  – Hive/Pig
    • Submit query plan to Tez Service
  – Native Hadoop service, not ad-hoc

Page 108: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Tez: Low latency

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Step                     Existing Hive   Hive/Tez   Tez & Tez Service
Parse Query              0.5s            0.5s       0.5s
Create Plan              0.5s            0.5s       0.5s
Launch Map-Reduce        20s             20s        n/a
Submit to Tez Service    n/a             n/a        0.5s
Process Map-Reduce       10s             2s         2s
Total                    31s             23s        3.5s

* No exact numbers, for illustration only

Page 109: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Stinger: Summary

* Real numbers, but handle with care!

Page 110: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez

Page 111: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez

Page 112: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 MapReduce 2.0

• Basically a porting to the YARN architecture

• MapReduce becomes a user-land library

• No need to rewrite MapReduce jobs

• Increased scalability & availability

• Better cluster utilization

Page 113: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez

Page 114: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 HOYA: HBase on YARN

• Create on-demand HBase clusters

• Configure different HBase instances differently

• Better isolation

• Create (transient) HBase clusters from MapReduce jobs

• Elasticity of clusters for analytic / batch workload processing

• Better cluster resources utilization

Page 115: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez

Page 116: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Twitter Storm

• Stream-processing

• Real-time processing

• Developed as standalone application
  – https://github.com/nathanmarz/storm
• Ported to YARN
  – https://github.com/yahoo/storm-yarn

Page 117: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Storm: Conceptual view

[Figure: two Spouts feed a network of Bolts; Tuples flow along the connections]

• Spout: source of streams
• Bolt: consumer of streams, processing of tuples, possibly emits new tuples
• Tuple: list of name-value pairs
• Stream: unbounded sequence of tuples
• Topology: network of Spouts & Bolts as the nodes and streams as the edges
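These concepts can be tried out with a toy, single-threaded topology in plain Java. This is a sketch under assumptions and does not use the Storm API: `ToyTopology`, its `Bolt` interface, and both bolts are invented here, tuples are modelled as maps, and the spout is a finite list rather than an unbounded stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of spouts, bolts, and tuples; not the Storm API.
final class ToyTopology {
    // Bolt: consumes one tuple and possibly emits new tuples
    interface Bolt {
        List<Map<String, Object>> process(Map<String, Object> input);
    }

    // Tuple: a list of name-value pairs, here with a single field
    static Map<String, Object> tuple(String name, Object value) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put(name, value);
        return fields;
    }

    // Emits one "word" tuple per word of an incoming "sentence" tuple
    static final class SplitBolt implements Bolt {
        public List<Map<String, Object>> process(Map<String, Object> input) {
            List<Map<String, Object>> emitted = new ArrayList<>();
            for (String word : ((String) input.get("sentence")).split("\\s+")) {
                if (!word.isEmpty()) {
                    emitted.add(tuple("word", word));
                }
            }
            return emitted;
        }
    }

    // Keeps a running count per word and emits nothing further
    static final class CountBolt implements Bolt {
        final Map<String, Integer> counts = new TreeMap<>();

        public List<Map<String, Object>> process(Map<String, Object> input) {
            counts.merge((String) input.get("word"), 1, Integer::sum);
            return Collections.emptyList();
        }
    }

    // Topology: spout -> SplitBolt -> CountBolt
    static Map<String, Integer> run(List<String> spout) {
        Bolt split = new SplitBolt();
        CountBolt count = new CountBolt();
        for (String sentence : spout) {
            for (Map<String, Object> wordTuple : split.process(tuple("sentence", sentence))) {
                count.process(wordTuple);
            }
        }
        return count.counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the cow jumped", "over the moon")));
        // prints {cow=1, jumped=1, moon=1, over=1, the=2}
    }
}
```

Unlike a MapReduce job, the counts are up to date after every tuple, which is the essence of stream processing; the distribution of bolts across machines is what this toy leaves out.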

Page 118: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez

Page 119: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Spark

• High-speed in-memory analytics over Hadoop and Hive

• Separate MapReduce-like engine
  – Speedup of up to 100x
  – On-disk queries 5-10x faster
• Compatible with Hadoop’s Storage API
• Available as standalone application
  – https://github.com/mesos/spark
• Experimental support for YARN since 0.6
  – http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html

Page 120: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Data Sharing in Spark

Page 121: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez

Page 122: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Apache Giraph

• Giraph is a framework for processing semi-structured graph data on a massive scale.

• Giraph is loosely based upon Google's Pregel

• Giraph performs iterative calculations on top of an existing Hadoop cluster.

• Available on GitHub
  – https://github.com/apache/giraph

Page 123: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hadoop 2.0 Summary

1. Scale

2. New programming models & Services

3. Improved cluster utilization

4. Agility

5. Beyond Java

Page 124: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Getting started…

One more thing…

Page 125: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Hortonworks Sandbox

http://hortonworks.com/products/hortonworks-sandbox

Page 126: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 Books about Hadoop

1. Hadoop: The Definitive Guide, Tom White, 3rd ed., O’Reilly, 2012
2. Hadoop in Action, Chuck Lam, Manning, 2011
3. Programming Pig, Alan Gates, O’Reilly, 2011
4. Hadoop Operations, Eric Sammer, O’Reilly, 2012

Page 127: Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)

07.11.13 The end…or the beginning?