Page 1: Hadoop

A Study of Hadoop in Map-Reduce

Poumita Das, Shubharthi Dasgupta, Priyanka Das

Page 2: Hadoop

What is Big Data?

Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information.

Page 3: Hadoop

The 3 V’s

Big data is commonly characterized by three V’s: Volume, Velocity, and Variety.

Page 4: Hadoop

Why DFS?

Data at this scale cannot be stored or processed on a single machine, so a distributed file system (DFS) spreads it across many commodity machines.

Page 5: Hadoop

An introduction to Map-Reduce

Map-Reduce programs are designed to process large volumes of data in a parallel fashion. There are 3 steps:

• Map

• Shuffle

• Reduce
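These three steps can be illustrated without a cluster. The following is a minimal word-count sketch in plain Python that simulates the Map, Shuffle, and Reduce phases locally; the function names are illustrative, not part of the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: sort and group pairs by key, so each reduce call
    # sees all values emitted for one word
    return groupby(sorted(pairs), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(count for _, count in group)
            for word, group in grouped}

docs = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 3
```

On a real cluster the map and reduce calls run on many machines at once and the framework performs the shuffle over the network; the data flow, however, is the same.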

Page 6: Hadoop

Map-Reduce continued: Map → Shuffle → Reduce

Page 7: Hadoop

What is Hadoop?

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

Page 8: Hadoop

Hadoop core components

• Namenode

• Datanode

• Client

• User

• Job tracker

• Task tracker

Page 9: Hadoop

Namenode

The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. In a cluster there may exist hundreds or even thousands of DataNodes.

The Secondary NameNode reads the metadata from RAM and writes it to secondary storage. However, it is NOT a substitute for the NameNode.

Page 10: Hadoop

Datanode

On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.

Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.

Page 11: Hadoop

HDFS client

User applications access the filesystem using the HDFS client. A client mainly performs 3 operations:

• Creating a new file

• File read

• File write
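As a toy model of this division of labor (every class and method name below is invented for illustration; none of this is the real HDFS client API), the read path can be sketched as: the client asks the NameNode where each block lives, then fetches the bytes from a DataNode directly.

```python
# Toy model: the NameNode only maps blocks to DataNodes;
# block data is served by the DataNodes themselves.

class NameNode:
    def __init__(self):
        self.block_locations = {}   # block_id -> list of DataNode names

    def add_block(self, block_id, datanodes):
        self.block_locations[block_id] = datanodes

    def get_locations(self, block_id):
        return self.block_locations[block_id]

class DataNode:
    def __init__(self):
        self.blocks = {}            # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

def client_read(namenode, datanodes, block_ids):
    # Ask the NameNode for each block's replicas, then read the
    # bytes from the first replica; the NameNode never serves data.
    out = b""
    for bid in block_ids:
        replica = namenode.get_locations(bid)[0]
        out += datanodes[replica].read(bid)
    return out

nn = NameNode()
dns = {"dn1": DataNode(), "dn2": DataNode()}
dns["dn1"].store("blk_1", b"hello ")
dns["dn2"].store("blk_2", b"hdfs")
nn.add_block("blk_1", ["dn1"])
nn.add_block("blk_2", ["dn2"])
print(client_read(nn, dns, ["blk_1", "blk_2"]))  # b'hello hdfs'
```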

Page 12: Hadoop

Creating a new file

Page 13: Hadoop

File read

HDFS implements a single-

writer, multiple-reader model.

That is reading is a parallel

operation in Hadoop

Page 14: Hadoop

File write

An HDFS file consists of blocks. When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block.
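As a quick illustration, assuming the common HDFS default block size of 128 MB (the slide does not state a block size, so this figure is an assumption), the number of blocks the NameNode must allocate for a file follows from simple ceiling division:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default

def blocks_needed(file_size_bytes):
    # Each HDFS file is split into fixed-size blocks;
    # the last block may be only partially filled.
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 GB file is split into 8 blocks of 128 MB
print(blocks_needed(1024 * 1024 * 1024))  # 8
```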

Page 15: Hadoop

Job tracker and task tracker

The JobTracker schedules Map-Reduce jobs and assigns their tasks, while a TaskTracker on each worker node executes individual map and reduce tasks and reports progress back.

Page 16: Hadoop

Hadoop ecosystem

• Pig

• Hive

• Mahout

Page 17: Hadoop

A Sample Program
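The code from this slide is not preserved in the transcript. As a stand-in, here is a sketch of what an anagram-grouping program (the theme of the next slide) could look like in plain Python, simulating the Map and Reduce phases locally rather than using the real Hadoop API; all names are illustrative:

```python
from collections import defaultdict

def mapper(word):
    # Map: key every word by its letters in sorted order;
    # anagrams share the same sorted-letter signature.
    return ("".join(sorted(word)), word)

def anagram_groups(words):
    # Shuffle: collect words under their signature.
    groups = defaultdict(set)
    for key, word in map(mapper, words):
        groups[key].add(word)
    # Reduce: keep only signatures shared by more than one word.
    return [sorted(g) for g in groups.values() if len(g) > 1]

words = ["listen", "silent", "enlist", "hadoop", "google"]
print(anagram_groups(words))  # [['enlist', 'listen', 'silent']]
```

In a real Hadoop job the mapper would emit (signature, word) pairs, the framework's shuffle would group them by signature, and the reducer would output each group of anagrams.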

Page 18: Hadoop

The Output

Page 19: Hadoop

Why Anagrams?

• Started out as a simple relaxation game: finding anagrams in sentences

• Games and puzzles like Scrabble

• Ciphers, such as permutation and transposition ciphers

Page 20: Hadoop

Future scope

Keeping in mind the vast applications of Hadoop, we have certain graph-searching techniques in mind that would be much easier to solve with the help of the Map-Reduce engine.

Page 21: Hadoop

References

• Introduction to Hadoop: Welcome to Apache, https://hadoop.apache.org/

• Cloudera Documentation: Usage, http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_usage.html

• Edureka: Anatomy of a Map-Reduce Job, http://www.edureka.co/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/

• Stackoverflow: Explain Map-Reduce Simply, http://stackoverflow.com/questions/28982/please-explain-mapreduce-simply

Page 22: Hadoop

Thank you