Hadoop MapReduce

SEMINAR ON Android App Development

Trained by: Hewlett-Packard Education Services, Mumbai

Presented to: Mr. R.K. Banyal By: Mr. Hukum Chand Saini Urvashi Kataria

About HPES:

• American global IT company headquartered in Palo Alto, California, US.
• Provider of products, software, technologies, solutions and services to individuals as well as small and medium-sized businesses.
• Major operations include HP Software, HP Financial Services and Corporate Investments.
• Provides practical training in fields like Big Data, Android App Development, Embedded Systems, etc.

About Birthday Bash:

An Android application that lets you enjoy your own birthday as well as those of your dear ones.

Save the dates, get reminded of them, capture moments on the day itself, get greeted by the app, and celebrate!

The home screen:

Calculating age and further:

Saving name for specified date:

Happy Birthday!

Presentation on:

Hadoop MapReduce

(Map + Reduce)

Why MapReduce?

• Large-scale data processing was difficult!
  o Managing hundreds or thousands of processors
  o Managing parallelization and distribution
  o Reliable execution with easy data access
• MapReduce provides all of these, easily!

What is Hadoop MapReduce?

Hadoop Cluster HDFS (Physical) Storage

MapReduce Objects

How Map and Reduce Work Together

Hadoop MapReduce: A Closer Look

[Figure: MapReduce data flow, drawn identically for Node 1 and Node 2. Files loaded from local HDFS store -> InputFormat -> Splits -> RecordReaders (RR) -> Input (K, V) pairs -> Map -> Intermediate (K, V) pairs -> Partitioner -> Shuffling process (intermediate (K, V) pairs exchanged by all nodes) -> Sort -> Reduce -> Final (K, V) pairs -> OutputFormat -> Writeback to local HDFS store.]
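The data flow in the figure above can be imitated in a few lines of Python. This is only a single-process sketch, not Hadoop code: the two "nodes" are plain lists, and the names `record_reader`, `map_fn`, `partition`, `reduce_fn` and `run`, the `city,temperature` input format and the CRC32-based partitioner are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_NODES = 2  # hypothetical cluster: two nodes, one reducer on each


def record_reader(split):
    """RecordReader: turn a raw split into input (K, V) pairs (line number, line)."""
    for line_no, line in enumerate(split.splitlines()):
        yield line_no, line


def map_fn(key, value):
    """Map: parse an assumed 'city,temperature' line into an intermediate pair."""
    city, temp = value.split(",")
    yield city, int(temp)


def partition(key):
    """Partitioner: pick the node that reduces this key (same key, same node)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def reduce_fn(key, values):
    """Reduce: keep the maximum temperature seen for each city."""
    yield key, max(values)


def run(splits_per_node):
    # inbox[n] collects the intermediate pairs that are shuffled to node n.
    inbox = [defaultdict(list) for _ in range(NUM_NODES)]
    # Map phase: every node maps its own local splits.
    for splits in splits_per_node:
        for split in splits:
            for k, v in record_reader(split):
                for ik, iv in map_fn(k, v):
                    inbox[partition(ik)][ik].append(iv)  # shuffle step
    # Sort and reduce phase, performed separately on each node.
    outputs = []
    for node_inbox in inbox:
        node_out = []
        for key in sorted(node_inbox):
            node_out.extend(reduce_fn(key, node_inbox[key]))
        outputs.append(node_out)
    return outputs


node_outputs = run([
    ["Mumbai,31\nDelhi,38", "Mumbai,34"],  # splits stored on node 1
    ["Delhi,41\nPune,29"],                 # splits stored on node 2
])
final = dict(pair for out in node_outputs for pair in out)
print(final)
```

Each key lands on exactly one node, so the per-node outputs can simply be concatenated; which node receives which city depends on the hash and is not significant.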

Algorithm

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

map(key=url, val=contents):
    for each word w in contents:
        emit(w, "1")

reduce(key=word, values=uniq_counts):
    // Sum all "1"s in values list
    emit result "(word, sum)"
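The pseudocode above translates almost line for line into Python. The sketch below is a local, in-memory approximation rather than a Hadoop job: `run_job` is a hypothetical driver that stands in for the framework's shuffle and sort, and the sample documents are made up.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name; value: text of document."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word; values: an iterable of counts."""
    result = 0
    for v in values:
        result += v
    yield key, result


def run_job(documents):
    # Group intermediate values by key (stand-in for shuffle and sort).
    grouped = defaultdict(list)
    for name, text in documents.items():
        for k, v in map_fn(name, text):
            grouped[k].append(v)
    # Call reduce once per distinct key, in sorted key order.
    output = {}
    for k in sorted(grouped):
        for out_key, out_value in reduce_fn(k, grouped[k]):
            output[out_key] = out_value
    return output


counts = run_job({
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog the end",
})
print(counts)
```

Here "the" occurs once in `doc1` and twice in `doc2`, so its three 1s are summed into a count of 3.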

The very famous: Word Count Example

Ways to MapReduce

Libraries Languages

Note: Java is most common, but other languages can be used
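One non-Java route is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, with the reducer receiving its input sorted by key. The sketch below shows that line-oriented style in Python. The function names and the in-memory demo are assumptions for illustration; in a real streaming job each function would sit in its own script and be fed from `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key.

    Relies on the input already being sorted by key, as the framework
    guarantees for reducer input.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the framework: map, sort by key, then reduce.
mapped = sorted(mapper(["to be or not", "to be"]))
reduced = list(reducer(mapped))
print(reduced)
```

Because the reducer only compares adjacent keys, it never holds more than one key's running total in memory, which is why the sort step matters.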

Common Data Sources for MapReduce Jobs

Service Providers

• Open Source
  o Apache
• Commercial
  o Cloudera
  o Hortonworks
  o MapR
  o Amazon Elastic MapReduce (AWS)
  o Microsoft HDInsight (Beta)

Advancements: MRV1 & MRV2

MRV2 (MapReduce Version 2)

• Splits the existing JobTracker's roles:
  o Resource management
  o Job lifecycle management
• MapReduce 2.0 provides many benefits over the existing MapReduce framework:
  o Better scalability through distributed job lifecycle management
  o Support for multiple Hadoop MapReduce API versions in a single cluster

Better MapReduce - Optimizations

Advantages of MapReduce

• Distributed data and computation.
• Tasks are independent. Entire nodes can fail and restart.
• Linear scaling in the ideal case. It is designed to run on cheap commodity hardware.
• Simple programming model. The end-user programmer only writes the map and reduce tasks.

Disadvantages / Cases where MR isn't a suitable choice:

• Real-time processing.
• It is not always easy to express every problem as a MapReduce program.
• When your intermediate processes need to talk to each other.
• When your processing requires a lot of data to be shuffled over the network.
• When you need to handle streaming data. MR is best suited to batch processing huge amounts of data that you already have.

Limitations of MapReduce

RDBMS vs. Hadoop

                      Traditional RDBMS          Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
Access                Interactive and Batch      Batch – NOT Interactive
Updates               Read / Write many times    Write once, Read many times
Structure             Static Schema              Dynamic Schema
Integrity             High (ACID)                Low
Scaling               Nonlinear                  Linear
Query Response Time   Can be near immediate      Has latency (due to batch processing)


Thank You!!
