© 2013 KMS Technology

An Introduction of Apache Hadoop

Jan 26, 2015


Technology

KMS Technology

This is the slide deck that Mr. Minh Tran, Software Architect at KMS Technology, presented at the "Java-Trends and Career Opportunities" seminar of the Information Technology Center of HCMC University of Science.
Transcript
Page 1: An Introduction of Apache Hadoop


Page 2: An Introduction of Apache Hadoop

AN INTRODUCTION OF APACHE HADOOP

Page 3: An Introduction of Apache Hadoop

WHO AM I?

Minh Tran

KMS Technology

Current: Software Architect at KMS Technology

Past: Technical at Yahoo!

Senior Engineer at MobiVi, Sciant, ELCA

Admin at JavaVietnam

Page 4: An Introduction of Apache Hadoop

OBJECTIVES

• Understand what Apache Hadoop is

• Understand problems Hadoop aims to solve

• Explore Hadoop architecture and its ecosystem

Page 5: An Introduction of Apache Hadoop

AGENDA

• Hadoop Overview

• Hadoop Architecture at a glance

• Hadoop Ecosystem

• A demo of using Hadoop

Page 6: An Introduction of Apache Hadoop

AGENDA – HADOOP OVERVIEW

• Big Data & Challenges

• What is Hadoop?

• Hadoop Benefits

• Which problem can Hadoop solve?

• Hadoop Installation

Page 7: An Introduction of Apache Hadoop

WHY DO WE HAVE SO MUCH DATA?

• Every single day

– Twitter processes 340 million messages

– Facebook stores 2.7 billion comments and “Likes”

– Google processes about 24 petabytes of data

• And every single minute

– More than 200 million e-mails are sent

– Foursquare processes more than 2,000 check-ins

Page 8: An Introduction of Apache Hadoop

WHERE DOES DATA COME FROM?

• Science: medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.

• Legacy: sales data, customer behavior, product databases, accounting data, etc.

• System data: log files, network messages, web analytics, intrusion detection, spam filters

• (Not all of this maps cleanly to the relational model)

Page 9: An Introduction of Apache Hadoop

DATA ANALYSIS CHALLENGE

• Huge volumes of data

• Mixed sources result in many different formats

– XML

– CSV

– EDI

– Log files

– Objects

– SQL

– Text

– JSON

– Binary

– etc.

Page 10: An Introduction of Apache Hadoop

WHAT IS HADOOP?

• Scalable data storage and processing

– Open source Apache project

– Harnesses the power of commodity servers

– Distributed and fault-tolerant

• “Core” Hadoop consists of two main parts

– HDFS (storage)

– MapReduce (processing)

Page 11: An Introduction of Apache Hadoop

WHO USES HADOOP?

Page 12: An Introduction of Apache Hadoop

BENEFITS OF ANALYZING WITH HADOOP

• Enables analysis that was previously impossible or impractical

• Analysis conducted at lower cost

• Analysis conducted in less time

• Greater flexibility

• Linear scalability

Page 13: An Introduction of Apache Hadoop

WHICH PROBLEM CAN HADOOP SOLVE?

• Nature of the data

– Complex & multiple data sources

– Lots of it

• Nature of the analysis

– Batch processing

– Parallel execution

– Spread data over a cluster of servers and take the computation to the data

• Common Hadoop Problems:

– Customer churn analysis

– Recommendation engine

– PoS transaction analysis

– Threat analysis

– Search quality

– Data “sandbox”

Page 14: An Introduction of Apache Hadoop

HADOOP INSTALLATION

1. Set up a Linux machine, e.g. Ubuntu

2. Install the latest JDK

3. Install the Hadoop package, available for download at http://hadoop.apache.org/

Page 15: An Introduction of Apache Hadoop

AGENDA

• Hadoop Overview

• Hadoop Architecture at a glance

• Hadoop Ecosystem

• A demo of using Hadoop

Page 16: An Introduction of Apache Hadoop

AGENDA – HADOOP ARCHITECTURE AT A GLANCE

• Hadoop Distributed File System

• How MapReduce works

Page 17: An Introduction of Apache Hadoop

COLLOCATED STORAGE AND PROCESSING

• Because 10,000 hard disks are better than one

• Solution: store and process data on the same nodes

– Data locality: “Bring the computation to the data”

– Reduces I/O and boosts performance
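
As a rough illustration of data locality, the sketch below assigns each task to a node that already holds a replica of its input block, and falls back to a remote node only when no local slot is free. The block placement, the `assign_tasks` helper and the one-slot-per-node limit are assumptions made for this example; Hadoop's real scheduler is considerably more involved.

```python
def assign_tasks(block_locations, slots_per_node):
    """Greedy, locality-first task assignment (illustrative only).

    block_locations: dict mapping block id -> list of nodes holding a replica
    slots_per_node:  dict mapping node -> number of free task slots
    Returns a dict mapping block id -> (node, is_local).
    """
    free = dict(slots_per_node)
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        # Prefer a node that already stores the block (no network transfer).
        local = [n for n in replicas if free.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Otherwise fall back to any node with a free slot (remote read).
            remote = [n for n in sorted(free) if free[n] > 0]
            if not remote:
                raise RuntimeError("no free task slots left")
            node, is_local = remote[0], False
        free[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical placement: three blocks replicated over three nodes.
blocks = {
    "A": ["node2", "node3"],
    "B": ["node1", "node3"],
    "C": ["node1", "node2", "node3"],
}
slots = {"node1": 1, "node2": 1, "node3": 1}

plan = assign_tasks(blocks, slots)
local_count = sum(1 for _, is_local in plan.values() if is_local)
print(plan)
print(f"{local_count} of {len(plan)} tasks read their block locally")
```

With this placement every task can run next to its data, so no block has to cross the network before it is processed.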

Page 18: An Introduction of Apache Hadoop

HARD DISK LATENCY

• Disk seeks are expensive

• Solution: Read lots of data at once to amortize the cost
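
The amortization argument can be checked with simple arithmetic. The sketch below assumes one seek per read, a 10 ms seek time and a 100 MB/s transfer rate; these figures are illustrative assumptions, not measurements from the deck.

```python
def seek_overhead(read_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of total read time spent seeking, with one seek per read.

    The default seek time and transfer rate are assumed, round numbers.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = read_mb / transfer_mb_per_s
    return seek_s / (seek_s + transfer_s)


for size_mb in (0.004, 1, 64):  # roughly 4 KB, 1 MB and 64 MB per seek
    share = seek_overhead(size_mb)
    print(f"{size_mb:>6} MB per seek: {share:.1%} of the time is spent seeking")
```

Under these assumptions a 1 MB read spends half its time seeking, while a 64 MB read spends only about 1.5% of its time on the seek.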

Page 19: An Introduction of Apache Hadoop

HDFS BLOCKS

• When a file is added to HDFS, it’s split into blocks

• This is a similar concept to native file systems

– HDFS uses a much larger block size (64 MB by default in Hadoop 1.x) for performance
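
To make the splitting concrete, here is a minimal sketch of how a file's byte range could be cut into fixed-size blocks. The `split_into_blocks` helper and the 200 MB file size are hypothetical; the 64 MB block size is the one quoted above.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The final block only covers the bytes that remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


file_size = 200 * 1024 * 1024  # hypothetical 200 MB file
layout = split_into_blocks(file_size)
for i, (offset, length) in enumerate(layout):
    print(f"block {i}: offset={offset}, length={length // (1024 * 1024)} MB")
```

A 200 MB file yields three full 64 MB blocks and one 8 MB block; the last block is not padded to the full block size.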

Page 20: An Introduction of Apache Hadoop

HDFS High Level Architecture

[Diagram] The client application talks to HDFS through the Hadoop file system client.

NameNode: block locations for /tmp/file1.txt
– Block A: DataNode 3, DataNode 2
– Block B: DataNode 1, DataNode 3
– Block C: DataNode 1, DataNode 2, DataNode 3

DataNodes: blocks stored
– DataNode 1: C, D, B
– DataNode 2: A, C, D
– DataNode 3: B, A, C
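
The division of labor in this architecture can be sketched as a toy model: the NameNode keeps only metadata (which DataNodes hold which block), while the DataNodes hold the bytes. The class names mirror the diagram, but the methods, the `read_file` helper and the block contents are assumptions for illustration, not the HDFS API.

```python
class NameNode:
    """Toy metadata service: knows which DataNodes hold each block of a file."""

    def __init__(self):
        self.files = {}            # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def add_file(self, path, placement):
        self.files[path] = list(placement)
        for block_id, nodes in placement.items():
            self.block_locations[block_id] = list(nodes)

    def locate(self, path):
        return [(b, self.block_locations[b]) for b in self.files[path]]


class DataNode:
    """Toy storage node: holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


def read_file(namenode, datanodes, path):
    """Client-side read: ask the NameNode where blocks live, then fetch them."""
    chunks = []
    for block_id, replicas in namenode.locate(path):
        for name in replicas:
            node = datanodes.get(name)  # None if the node is down
            if node is not None and block_id in node.blocks:
                chunks.append(node.blocks[block_id])
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)


# Block placement taken from the diagram; the contents are made up.
contents = {"A": b"cat sat ", "B": b"mat cat ", "C": b"sat dog"}
placement = {
    "A": ["DataNode 3", "DataNode 2"],
    "B": ["DataNode 1", "DataNode 3"],
    "C": ["DataNode 1", "DataNode 2", "DataNode 3"],
}
datanodes = {n: DataNode(n) for n in ("DataNode 1", "DataNode 2", "DataNode 3")}
for block_id, replicas in placement.items():
    for name in replicas:
        datanodes[name].blocks[block_id] = contents[block_id]

namenode = NameNode()
namenode.add_file("/tmp/file1.txt", placement)
print(read_file(namenode, datanodes, "/tmp/file1.txt"))

# Simulate a failed DataNode: the read still succeeds from other replicas.
del datanodes["DataNode 1"]
print(read_file(namenode, datanodes, "/tmp/file1.txt"))
```

Because every block has a replica on at least two nodes, losing a single DataNode does not make the file unreadable in this model.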

Page 21: An Introduction of Apache Hadoop

HOW DOES MAPREDUCE WORK?

Page 22: An Introduction of Apache Hadoop

ANOTHER EXAMPLE: BUILDING AN INVERTED INDEX

• Input: a number of text files

• Output: a list of tuples, where each tuple is a word and a list of files that contain the word

Input filenames and contents

doc1.txt
– cat sat mat

doc2.txt
– cat sat dog

Mappers: intermediate output

– cat, doc1.txt
– sat, doc1.txt
– mat, doc1.txt
– cat, doc2.txt
– sat, doc2.txt
– dog, doc2.txt

Reducers: output filenames and contents

part-r-00000
– cat: doc1.txt, doc2.txt

part-r-00001
– sat: doc1.txt, doc2.txt
– dog: doc2.txt

part-r-00002
– mat: doc1.txt
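
The same data flow can be simulated in a single process. This is a minimal sketch of the map, shuffle and reduce steps, not Hadoop's Java API: the function names are hypothetical, and the result is merged into one dictionary instead of being partitioned across several `part-r-NNNNN` files.

```python
from collections import defaultdict


def map_phase(filename, contents):
    """Mapper: emit a (word, filename) pair for every word in the file."""
    for word in contents.split():
        yield word, filename


def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, filenames):
    """Reducer: collapse the filenames for one word into a sorted, unique list."""
    return word, sorted(set(filenames))


def build_inverted_index(documents):
    intermediate = []
    for filename, contents in documents.items():
        intermediate.extend(map_phase(filename, contents))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(w, fs) for w, fs in sorted(grouped.items()))


documents = {"doc1.txt": "cat sat mat", "doc2.txt": "cat sat dog"}
index = build_inverted_index(documents)
for word, files in index.items():
    print(f"{word}: {', '.join(files)}")
```

On a real cluster the mappers run in parallel on the nodes holding the input blocks, and each reducer writes its own output file; which words land in which file depends on the partitioner.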

Page 23: An Introduction of Apache Hadoop

AGENDA

• Hadoop Overview

• Hadoop Architecture at a glance

• Hadoop Ecosystem

• A demo of using Hadoop

Page 24: An Introduction of Apache Hadoop

HADOOP ECOSYSTEM

Page 25: An Introduction of Apache Hadoop

AGENDA

• Hadoop Overview

• Hadoop Architecture at a glance

• Hadoop Ecosystem

• A demo of using Hadoop

Page 26: An Introduction of Apache Hadoop

REFERENCES

• Hadoop in Practice – Alex Holmes

• Hadoop Real World Solutions Cookbook – Jonathan R. Owens, Jon Lentz, Brian Femiano

• Hadoop in Action – Chuck Lam

• Hadoop: The Definitive Guide – Tom White

• MapReduce Design Patterns – Donald Miner, Adam Shook

• An Introduction to Hadoop – Mark Fei

• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

• http://www.crobak.org/2011/12/getting-started-with-apache-hadoop-0-23-0/

Page 27: An Introduction of Apache Hadoop


THANK YOU