By: Akhil Arora & Shrey Mehrotra
Jul 12, 2015
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.
Domains with Large Datasets:
• Meteorology
• Complex physics simulations
• Biological and environmental research
• Internet search
Key challenges
• Capture & Store
• Search
• Sharing & Transfer
• Analysis
Dec 2004 : Google GFS paper published
July 2005 : Nutch uses MapReduce
Feb 2006 : Becomes Lucene subproject
Apr 2007 : Yahoo! on 1000-node cluster
Jan 2008 : An Apache Top Level Project
April 2009 : Won the minute sort by sorting 500 GB in 59 seconds (on 1400 nodes)
April 2009 : 100 terabyte sort in 173 minutes (on 3400 nodes)
Advertising – Improve effectiveness of advertising and promotions
Financial Services – Mitigate risk while creating opportunity
Government – Decrease budget pressures by offloading expensive SQL workloads
Healthcare – Deliver better care and streamline operations
Manufacturing – Increase production, reduce costs, and improve quality
Oil & Gas – Maximize yields and reduce risk in the supply chain
Retail – Boost sales in-store and online
Telecoms – Telcos and cable companies use Hortonworks for service, security, and sales
Projects Powered by Hadoop
The Apache™ Hadoop® project develops open-source software for reliable,
scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
The library itself is designed to detect and handle failures at the application layer,
thus delivering a highly available service on top of a cluster of machines.
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system
Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
HDFS Layer – Stores files across storage nodes in a Hadoop cluster
Consists of:
• Namenode & Datanodes
Map-Reduce Engine – Processes vast amounts of data in parallel on large clusters in a reliable & fault-tolerant manner
Consists of:
• Job Tracker & Task Trackers
NameNode
Maps a block to the Datanodes
Controls read/write access to files
Manages Replication Engine for Blocks
DataNode
Responsible for serving read and write
requests (block creation, deletion, and
replication)
JobTracker
Accepts Map-Reduce tasks from the Users
Assigns tasks to the Task Trackers &
monitors their status
TaskTracker
Runs Map-Reduce tasks
Sends heart-beat to Job Tracker
Retrieves Job resources from HDFS
Hadoop Daemons
Limitations of Hadoop 1.x (motivating YARN):
• Scalability
• Batch processing only
• Reliability & availability
• Partitioning of resources
• Coupling with MapReduce only
Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
YARN
• Yet Another Resource Negotiator
• A framework for cluster resource management
• Efficient task schedulers
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system
HDFS Layer – Stores files across storage nodes in a Hadoop cluster
Consists of:
• Namenode & Datanodes
YARN Engine – Processes vast amounts of data in parallel on large clusters in a reliable & fault-tolerant manner
Consists of:
• Resource Manager & Node Managers
NameNode
Maps a block to the Datanodes
Controls read/write access to files
Manages Replication Engine for Blocks
DataNode
Responsible for serving read and write
requests (block creation, deletion, and
replication)
ResourceManager
Accepts MapReduce or other application jobs from users
Assigns tasks to the NodeManagers & monitors their status
NodeManager
Runs application tasks
Sends heart-beat to the ResourceManager
Retrieves application resources from HDFS
Hadoop Daemons
HDFS Design Goals
Hardware Failure - Detection of faults and quick, automatic recovery
Streaming Data Access - High throughput of data access (Batch Processing)
Large Data Sets - Gigabytes to terabytes in size.
Simple Coherency Model - Write-once-read-many access model for files
Moving computation is cheaper than moving data
HDFS Architecture
[Figure: a Namenode managing three Datanodes (Datanode_1, Datanode_2, Datanode_3), which store HDFS blocks (Block 1 to Block 4).]
Storage & Replication of Blocks in HDFS
[Figure: a file is divided into blocks (Block 1 to Block 4), which are distributed and replicated across the Datanodes.]
NameNode and DataNodes : Java Processes responsible for HDFS operations
Data Replication : Blocks of a file are replicated for fault tolerance
Replica Placement : A rack-aware replica placement policy
Replica Selection : Minimize global bandwidth consumption and read latency
File System Namespace : Hierarchical file organization
Safemode : File system consistency check
Blocks – Minimum amount of data that can be read or written; 128 MB by default
• Minimizes the cost of seeks
• A file can be larger than any single disk in the network
• Simplifies the storage subsystem – same size & eliminates metadata concerns
• Provides fault tolerance and availability
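As a concrete illustration of the block abstraction above, the following sketch (illustrative Python, not Hadoop code) shows how a file size maps to 128 MB blocks; note that, unlike a regular filesystem, HDFS does not pad the final block to the full block size:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size_bytes):
    """Return the sizes of the HDFS blocks a file of the given size occupies.

    Every block is full-sized except possibly the last one, which holds
    only the remaining bytes.
    """
    if file_size_bytes == 0:
        return []
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    last = file_size_bytes - (num_blocks - 1) * BLOCK_SIZE
    return [BLOCK_SIZE] * (num_blocks - 1) + [last]

# A hypothetical 300 MB file occupies two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
```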
Rack Awareness
Gets maximum performance out of Hadoop
Resolves a slave's DNS name (or IP address) to a rack ID
Interface: DNSToSwitchMapping
Rack topology example: /rack1 & /rack2
Replica Placement
Critical to HDFS reliability and performance
Improve data reliability, availability, and network bandwidth utilization
Distance b/w Nodes
Replica Placement (contd.)
Default strategy:
a) First replica is placed on the same node as the client.
b) Second replica is placed on a different rack from the first (off-rack), chosen at random.
c) Third replica is placed on the same rack as the second, but on a different node chosen at random.
d) Further replicas are placed on random nodes in the cluster.
Replica Selection - HDFS tries to satisfy a read request from a replica that is closest to the reader.
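The default placement strategy can be sketched as follows (an illustrative Python model with made-up rack and node names, not the actual HDFS BlockPlacementPolicy code; it assumes every rack has at least two nodes):

```python
import random

def place_replicas(client_node, topology, num_replicas=3):
    """Pick replica locations following HDFS's default placement strategy.

    topology: dict mapping rack id -> list of node names (hypothetical).
    Returns a list of (node, rack) pairs.
    """
    # Rack of the client (assume the client runs on a cluster node).
    client_rack = next(r for r, nodes in topology.items() if client_node in nodes)

    chosen = [(client_node, client_rack)]            # (a) same node as the client
    other_racks = [r for r in topology if r != client_rack]
    remote_rack = random.choice(other_racks)         # (b) a different rack, at random
    second = random.choice(topology[remote_rack])
    chosen.append((second, remote_rack))
    # (c) same rack as the second replica, but a different node
    third = random.choice([n for n in topology[remote_rack] if n != second])
    chosen.append((third, remote_rack))
    while len(chosen) < num_replicas:                # (d) further replicas at random
        rack = random.choice(list(topology))
        node = random.choice(topology[rack])
        if (node, rack) not in chosen:
            chosen.append((node, rack))
    return chosen[:num_replicas]

topology = {"/rack1": ["node1", "node2"], "/rack2": ["node3", "node4"]}
placements = place_replicas("node1", topology)
```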
FileSystem Image and Edit Logs
fsimage file is a persistent checkpoint of the filesystem metadata
When a client performs a write operation, it is first recorded in the edit log.
The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified
Secondary NameNode is used to produce checkpoints of the primary’s in-memory filesystem metadata
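The interplay of fsimage, the edit log, and the in-memory metadata follows a classic write-ahead-log pattern, sketched below (a toy Python model, not NameNode code):

```python
class TinyNamenode:
    """Toy model of NameNode metadata handling: every mutation is appended
    to the edit log *before* the in-memory state is updated, and a
    checkpoint folds the log into a fresh fsimage."""

    def __init__(self):
        self.fsimage = {}    # persistent checkpoint (here: a dict snapshot)
        self.edit_log = []   # journal of operations since the last checkpoint
        self.memory = {}     # in-memory filesystem metadata

    def write(self, path, meta):
        self.edit_log.append(("write", path, meta))  # 1. record in the edit log first
        self.memory[path] = meta                     # 2. then update in-memory state

    def checkpoint(self):
        # What the Secondary NameNode does: merge fsimage + edit log
        # into a new fsimage and truncate the log.
        self.fsimage = dict(self.memory)
        self.edit_log.clear()

    def restart(self):
        # On start-up, load fsimage into memory and replay the edits.
        self.memory = dict(self.fsimage)
        for op, path, meta in self.edit_log:
            if op == "write":
                self.memory[path] = meta
```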
FileSystem Image Structure
<FS_IMAGE>
<IMAGE_VERSION>-47</IMAGE_VERSION>
<NAMESPACE_ID>415263518</NAMESPACE_ID>
<GENERATION_STAMP>1000</GENERATION_STAMP>
<GENERATION_STAMP_V2>6953</GENERATION_STAMP_V2>
<GENERATION_STAMP_V1_LIMIT>0</GENERATION_STAMP_V1_LIMIT>
<LAST_ALLOCATED_BLOCK_ID>1073747777</LAST_ALLOCATED_BLOCK_ID>
<TRANSACTION_ID>62957</TRANSACTION_ID>
<LAST_INODE_ID>24606</LAST_INODE_ID>
<SNAPSHOT_COUNTER>0</SNAPSHOT_COUNTER>
<NUM_SNAPSHOTS_TOTAL>0</NUM_SNAPSHOTS_TOTAL>
<IS_COMPRESSED>false</IS_COMPRESSED>
<INODES NUM_INODES="1076">
<INODE>
<INODE_PATH>/</INODE_PATH>
<INODE_ID>16385</INODE_ID>
<REPLICATION>0</REPLICATION>
<MODIFICATION_TIME>2014-10-20 16:35</MODIFICATION_TIME>
<ACCESS_TIME>1970-01-01 05:30</ACCESS_TIME>
<BLOCK_SIZE>0</BLOCK_SIZE>
<BLOCKS NUM_BLOCKS="-1"></BLOCKS>
<NS_QUOTA>9223372036854775807</NS_QUOTA>
<DS_QUOTA>-1</DS_QUOTA>
<IS_SNAPSHOTTABLE_DIR>true</IS_SNAPSHOTTABLE_DIR>
<PERMISSIONS>
<USER_NAME>hduser</USER_NAME>
<GROUP_NAME>supergroup</GROUP_NAME>
<PERMISSION_STRING>rwxrwxrwx</PERMISSION_STRING>
</PERMISSIONS>
</INODE>
<SNAPSHOTS NUM_SNAPSHOTS="0">
<SNAPSHOT_QUOTA>0</SNAPSHOT_QUOTA>
</SNAPSHOTS>
<INODE>
<INODE_PATH>/data_in/stock1gbdata</INODE_PATH>
<INODE_ID>24568</INODE_ID>
<REPLICATION>3</REPLICATION>
<MODIFICATION_TIME>2014-10-28 15:58</MODIFICATION_TIME>
<ACCESS_TIME>2014-10-28 15:58</ACCESS_TIME>
<BLOCK_SIZE>134217728</BLOCK_SIZE>
<BLOCKS NUM_BLOCKS="81">
<BLOCK>
<BLOCK_ID>1073747677</BLOCK_ID>
<NUM_BYTES>134217670</NUM_BYTES>
<GENERATION_STAMP>6853</GENERATION_STAMP>
</BLOCK>
<BLOCK>
<BLOCK_ID>1073747678</BLOCK_ID>
<NUM_BYTES>134217646</NUM_BYTES>
<GENERATION_STAMP>6854</GENERATION_STAMP>
</BLOCK>
</BLOCKS>
</INODE>
</INODES>
<INODES_UNDER_CONSTRUCTION NUM_INODES_UNDER_CONSTRUCTION="0"></INODES_UNDER_CONSTRUCTION>
<CURRENT_DELEGATION_KEY_ID>0</CURRENT_DELEGATION_KEY_ID>
<DELEGATION_KEYS NUM_DELEGATION_KEYS="0"></DELEGATION_KEYS>
<DELEGATION_TOKEN_SEQUENCE_NUMBER>0</DELEGATION_TOKEN_SEQUENCE_NUMBER>
<DELEGATION_TOKENS NUM_DELEGATION_TOKENS="0"></DELEGATION_TOKENS>
</FS_IMAGE>
Safe Mode – On start-up, the NameNode loads its image file (fsimage) into memory and applies the edits from the edit log (edits).
It then performs the checkpointing process itself, without recourse to the Secondary NameNode.
Namenode is running in safe mode (offers only a read-only view to clients)
The locations of blocks in the system are not persisted by the NameNode - this information resides with the DataNodes, in the form of a list of the blocks it is storing.
Safe mode is needed to give the DataNodes time to check in to the NameNode with their block lists
Safe mode is exited when the minimal replication condition is reached, plus an extension time of 30 seconds.
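The exit condition can be sketched as a simple check (illustrative Python; the 99.9% figure corresponds to the default value of dfs.namenode.safemode.threshold-pct):

```python
def safe_to_leave(blocks_reported, total_blocks, threshold_pct=0.999):
    """Return True once the fraction of blocks that have reached their
    minimal replication meets the configured threshold. In the real
    NameNode, a 30-second extension then elapses before safe mode exits.
    """
    if total_blocks == 0:
        return True  # an empty namespace is trivially consistent
    return blocks_reported / total_blocks >= threshold_pct
```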
Administration
HDFS Trash
HDFS Quotas
Safe Mode
FS Shell
dfsadmin Command
HDFS Trash – Recycle Bin
When a file is deleted by a user, it is not immediately removed from HDFS. Instead, HDFS moves it to a file in the /trash directory.
A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace.
Undelete a file: the user navigates to the /trash directory and retrieves the file using the mv command.
File : core-site.xml
Property : fs.trash.interval
Description : Number of minutes after which the checkpoint gets deleted.
File : core-site.xml
Property : fs.trash.checkpoint.interval
Description : Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval.
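Putting the two properties together, a core-site.xml fragment enabling trash might look like this (the 1440/60 values are illustrative choices, not defaults):

```xml
<configuration>
  <!-- Keep deleted files in trash for 24 hours (1440 minutes). -->
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
  <!-- Create trash checkpoints every 60 minutes
       (must be smaller than or equal to fs.trash.interval). -->
  <property>
    <name>fs.trash.checkpoint.interval</name>
    <value>60</value>
  </property>
</configuration>
```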
HDFS Quotas
Name Quota – a hard limit on the number of file and directory names in the tree rooted at that directory.
Space Quota - a hard limit on the number of bytes used by files in the tree rooted at that directory.
Reporting Quotas – the count command of the HDFS shell reports quota values and the current counts of names and bytes in use. With the -q option, it also reports the name quota set for each directory, the remaining name quota, the space quota set, and the remaining space quota.
hadoop fs -count -q <directory>...
dfsadmin -setQuota <N> <directory>... Set the name quota to N for each directory.
dfsadmin -clrQuota <directory>... Remove any name quota for each directory.
dfsadmin -setSpaceQuota <N> <directory>... Set the space quota to N bytes for each directory.
dfsadmin -clrSpaceQuota <directory>... Remove any space quota for each directory.
dfsadmin Command – bin/hadoop dfsadmin [Generic Options] [Command Options]
-safemode enter /
leave / get / wait
Safe mode maintenance command. Safe mode can also be entered manually, but then it can only be turned off manually as well.
-report Reports basic filesystem information and statistics.
-refreshNodes Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned.
-metasave filename Save the Namenode's primary data structures to filename in the directory specified by the hadoop.log.dir property. filename is overwritten if it exists. filename will contain one line for each of the following:
1. Datanodes heart-beating with the Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
FS Shell – Some Basic Commands
cat
hadoop fs -cat URI [URI …]
Copies source paths to stdout.
chgrp
hadoop fs -chgrp [-R] GROUP URI [URI …]
Change group association of files. With -R, make the change recursively through the directory structure.
chmod
hadoop fs -chmod -R 777 hdfs://nn1.example.com/file1
Change the permissions of files. With -R, make the change recursively through the directory structure.
copyFromLocal / put
hadoop fs -copyFromLocal <localsrc> URI
Copy single src, or multiple srcs, from the local file system to the destination filesystem.
copyToLocal / get
hadoop fs -copyToLocal URI <localdst>
Copy files to the local file system.
FS Shell – Commands Continued…
expunge
hadoop fs -expunge
Empty the Trash.
mkdir
hadoop fs -mkdir <paths>
Takes path URIs as arguments and creates directories.
rmr
hadoop fs -rmr /user/hadoop/dir
Recursive version of delete.
touchz
hadoop fs -touchz pathname
Create a file of zero length.
du
hadoop fs -du URI [URI …]
Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
Modes
Local Standalone
Pseudo Distributed
Fully Distributed
Local Standalone (Non-distributed)
• Hadoop runs entirely in a single Java process on a single system (no separate daemons)
• Useful for debugging
Pseudo Distributed
• All daemons run on a single node
• Each Hadoop daemon runs in a separate Java process
Fully Distributed
• Master-Slave architecture
• One machine is designated as the NameNode and another as the JobTracker (both can reside on the same machine)
• The rest of the machines in the cluster act as both Datanode and TaskTracker
Hadoop Installation Steps:
(i) Create dedicated user & group
(ii) Establish authentication among nodes
(iii) Create Hadoop folder
(iv) Hadoop configuration
(v) Remote-copy Hadoop folder to slave nodes
(vi) Start Hadoop cluster
(vii) Testing Hadoop
(viii) Run simple WordCount program