Page 1: A Crash Course in Apache Hadoop - Blanco · Why Hadoop? Benefits of the Hadoop Architecture Consolidates Data Integrates with many existing platforms Scalable and Affordable Real-Time

A Crash Course in Apache Hadoop


Event Outline

1. What is Hadoop

2. Current data challenges

3. Hadoop Solutions

4. Architecture

5. Workshop


Who & When

● Originated from Google papers (Google File System, MapReduce)
● Originally developed at Yahoo!
  ○ Doug Cutting, Michael Cafarella
● Project officially began around 2005.
● Named after a toy elephant


Why

● More ways to collect data
● Too much data
  ○ CERN Laboratory
  ○ Google/Yahoo/Facebook

● Forget about processing when you can’t even store it


Analogy

● Imagine you needed to transport 2,000,000 kg of raw material
● How would you do it? (Let’s assume that horsepower is proportional to the mass that each vehicle can carry)
  ○ Ferrari 458 - $243,000, ~560 horsepower
  ○ Bugatti Veyron - $2,310,688, ~1200 horsepower
  ○ Brand new Ford F-150 - $30,000, ~325 horsepower
  ○ Dodge Caravan - $20,000, ~280 horsepower
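The arithmetic behind the analogy can be worked through in a few lines. The sketch below uses only the prices and horsepower figures quoted on the slide and keeps the slide's assumption that horsepower stands in for carrying capacity. The function names are illustrative, not part of any Hadoop tooling.

```python
# Prices (USD) and horsepower as quoted on the slide.
# Assumption from the slide: horsepower is proportional to carrying capacity.
vehicles = {
    "Ferrari 458": (243_000, 560),
    "Bugatti Veyron": (2_310_688, 1200),
    "Ford F-150": (30_000, 325),
    "Dodge Caravan": (20_000, 280),
}


def cost_per_horsepower(price, horsepower):
    """Dollars paid for each unit of capacity."""
    return price / horsepower


def rank_by_value(fleet):
    """Return vehicle names ordered from cheapest to most expensive capacity."""
    return sorted(fleet, key=lambda name: cost_per_horsepower(*fleet[name]))


ranking = rank_by_value(vehicles)
for name in ranking:
    price, hp = vehicles[name]
    print(f"{name}: ${cost_per_horsepower(price, hp):,.2f} per horsepower")
```

Under these numbers the inexpensive vehicles deliver capacity at a far lower price per unit than the supercars. That appears to be the point of the analogy: many affordable machines can be a better buy than one very powerful one.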


What

● Provide a way to reliably access and process large volumes of data

● Designed to scale across many, many machines.


The ASF

● Apache Open Source
  ○ OpenOffice
  ○ HTTP Server
  ○ Subversion
  ○ Tomcat Webserver
  ○ Commons
  ○ Maven
  ○ Hadoop
● Anyone can view the source code!
  ○ Build/edit/modify on your own machine


Why data?

Opportunities and analytic insights for businesses


Why Hadoop?

● Benefits of the Hadoop Architecture
  ○ Consolidates Data
  ○ Integrates with many existing platforms
  ○ Scalable and Affordable
  ○ Real-Time Insights


The Hadoop Ecosystem


Hadoop Architecture


Notable Hadoop Projects

● Apache Kafka → Data Streaming
● Apache HBase → Big Data Management
● Apache Hive → Read and Query from HDFS
● Apache ZooKeeper → HA management
● Apache Spark → Processing Engine
● Apache Ambari → Cluster management


At the Core of Hadoop

● Hadoop Distributed File System (HDFS)
● Hadoop MapReduce (Processing Engine)
● Hadoop Common (Core Hadoop Libraries)
● Hadoop YARN (Yet Another Resource Negotiator)
  ○ CPU/Storage/Memory management (parallel jobs)


Cluster Architecture

[Diagram labels: NameNode (HDFS); ResourceManager (YARN); Ambari, Hive, Zeppelin, Knox, etc.; Worker Node, each running a NodeManager and a DataNode]


HDFS Architecture

● Fault-tolerant distributed storage
  ○ Split file into logical blocks
  ○ Store multiple copies of each block

[Figure: a File is split into File Blocks 1 to 4, and the Cluster stores three copies of each block.]
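The two ideas on this slide, splitting a file into blocks and storing several copies of each block, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how HDFS is implemented. The function names, the tiny block size, and the round-robin placement are all hypothetical, and real HDFS placement also takes rack topology into account.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy chosen for simplicity; it only guarantees that
    no node holds two copies of the same block.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


data = b"0100010011111001010010011100"   # 28 bytes standing in for a file
blocks = split_into_blocks(data, block_size=8)
nodes = ["node-1", "node-2", "node-3", "node-4"]
placement = place_replicas(len(blocks), nodes)

for block_id, holders in placement.items():
    print(f"block {block_id + 1}: {blocks[block_id]!r} -> {holders}")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held. That is the fault tolerance the slide refers to.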


HDFS - NameNode and Heartbeats

● DataNodes communicate with the NameNode through heartbeats
  ○ The NameNode keeps track of all the data nodes
  ○ Which data is stored and where

[Diagram: DataNodes 1 to 4 report to the NameNode ("Hey NameNode, I’m Here!"). The NameNode then asks: "DataNode 2, can you replicate that 123 block to DataNode 3?"]
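The exchange in the diagram can be mimicked with a small bookkeeping class. This is a minimal sketch under stated assumptions, not the HDFS protocol. The class name `NameNodeSketch`, its methods, the timeout, and the replication factor of 2 are all hypothetical, and block reporting is folded into the heartbeat for brevity. Timestamps are passed in explicitly so the example is deterministic.

```python
class NameNodeSketch:
    """Toy NameNode: tracks which DataNodes are alive and where blocks live."""

    def __init__(self, timeout_seconds, replication):
        self.timeout = timeout_seconds
        self.replication = replication
        self.last_seen = {}         # node name -> time of last heartbeat
        self.block_locations = {}   # block id -> set of node names

    def heartbeat(self, node, now, blocks):
        """Record "I'm here" from a DataNode, plus the blocks it reports."""
        self.last_seen[node] = now
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(node)

    def live_nodes(self, now):
        """Nodes whose last heartbeat is recent enough."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def replication_requests(self, now):
        """Return (source, block, target) requests for under-replicated blocks."""
        live = self.live_nodes(now)
        requests = []
        for block, holders in sorted(self.block_locations.items()):
            live_holders = sorted(holders & live)
            candidates = sorted(live - holders)
            missing = self.replication - len(live_holders)
            if live_holders and missing > 0:
                for target in candidates[:missing]:
                    requests.append((live_holders[0], block, target))
        return requests


namenode = NameNodeSketch(timeout_seconds=5, replication=2)

# t=0: every DataNode checks in; DataNodes 1 and 2 hold block "123".
namenode.heartbeat("DataNode 1", now=0, blocks=["123"])
namenode.heartbeat("DataNode 2", now=0, blocks=["123"])
namenode.heartbeat("DataNode 3", now=0, blocks=[])
namenode.heartbeat("DataNode 4", now=0, blocks=[])

# t=10: DataNode 1 has gone silent; the others check in again.
namenode.heartbeat("DataNode 2", now=10, blocks=["123"])
namenode.heartbeat("DataNode 3", now=10, blocks=[])
namenode.heartbeat("DataNode 4", now=10, blocks=[])

requests = namenode.replication_requests(now=10)
for source, block, target in requests:
    print(f"{source}, can you replicate that {block} block to {target}?")
```

In this scenario the silent DataNode 1 is treated as lost, so block 123 has only one live copy and the sketch asks DataNode 2 to copy it to DataNode 3, as in the diagram.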


MapReduce in Hadoop

● Shuffle and Sort
  ○ Break a problem into sub-problems
● Batch Processing
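A word count is the usual way to show how a problem breaks into sub-problems under MapReduce. The sketch below runs the three stages (map, shuffle and sort, reduce) in a single Python process. It does not use the Hadoop API, and the function names are illustrative only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the whole batch: map every line, group by word, then reduce."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "big cluster"])
print(counts)
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a framework spread those calls across many machines. The job also runs over the whole input in one pass, which is why the slide lists it as batch processing.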


What do iOS 4 and Windows 3.1 have in common?


Multi-Use vs. Batch


Workshop

