By: Akhil Arora & Shrey Mehrotra
Jul 12, 2015
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.
Domains with Large Datasets:
• Meteorology
• Complex physics simulations
• Biological and environmental research
• Internet search
Key challenges
• Capture & Store
• Search
• Sharing & Transfer
• Analysis
Dec 2004 : Google GFS paper published
July 2005 : Nutch uses MapReduce
Feb 2006 : Becomes Lucene subproject
Apr 2007 : Yahoo! on 1000-node cluster
Jan 2008 : An Apache Top Level Project
April 2009 : Won the minute sort by sorting 500 GB in 59 seconds (on 1400 nodes)
April 2009 : 100 terabyte sort in 173 minutes (on 3400 nodes)
Advertising – Improve effectiveness of advertising and promotions
Financial Services – Mitigate risk while creating opportunity
Government – Decrease budget pressures by offloading expensive SQL workloads
Healthcare – Deliver better care and streamline operations
Manufacturing – Increase production, reduce costs, and improve quality
Oil & Gas – Maximize yields and reduce risk in the supply chain
Retail – Boost sales in-store and online
Telecoms – Telcos and cable companies use Hortonworks for service, security, and sales
Projects Powered by Hadoop
The Apache™ Hadoop® project develops open-source software for reliable,
scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
The library itself is designed to detect and handle failures at the application layer,
thus delivering a highly available service on top of a cluster of machines.
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system
Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
HDFS Layer – Stores files across storage nodes in a Hadoop cluster
Consists of:
• Namenode & Datanodes
Map-Reduce Engine – Processes vast amounts of data in parallel on large clusters in a reliable & fault-tolerant manner
Consists of:
• Job Tracker & Task Trackers
NameNode
Maps a block to the Datanodes
Controls read/write access to files
Manages Replication Engine for Blocks
DataNode
Responsible for serving read and write
requests (block creation, deletion, and
replication)
JobTracker
Accepts Map-Reduce tasks from the Users
Assigns tasks to the Task Trackers &
monitors their status
TaskTracker
Runs Map-Reduce tasks
Sends heart-beat to Job Tracker
Retrieves Job resources from HDFS
Hadoop Daemons
Limitations of Hadoop 1.x (motivating YARN):
• Scalability
• Batch processing only
• Reliability & availability
• Partitioning of resources
• Coupling with MapReduce only
Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
YARN
• Yet Another Resource Negotiator
• A framework for cluster resource management
• Efficient task schedulers
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system
HDFS Layer – Stores files across storage nodes in a Hadoop cluster
Consists of:
• Namenode & Datanodes
YARN Engine – Processes vast amounts of data in parallel on large clusters in a reliable & fault-tolerant manner
Consists of:
• Resource Manager & Node Managers
NameNode
Maps a block to the Datanodes
Controls read/write access to files
Manages Replication Engine for Blocks
DataNode
Responsible for serving read and write
requests (block creation, deletion, and
replication)
ResourceManager
Accepts MapReduce or other application jobs from users
Assigns tasks to the NodeManagers & monitors their status
NodeManager
Runs application tasks
Sends heart-beat to the ResourceManager
Retrieves application resources from HDFS
Hadoop Daemons
HDFS Design Goals
Hardware Failure - Detection of faults and quick, automatic recovery
Streaming Data Access - High throughput of data access (Batch Processing)
Large Data Sets - Gigabytes to terabytes in size.
Simple Coherency Model - Write-once-read-many access model for files
Moving computation is cheaper than moving data
HDFS Architecture
[Figure: a Namenode managing three Datanodes (Datanode_1, Datanode_2, Datanode_3), which store HDFS blocks (Block 1 to Block 4).]
Storage & Replication of Blocks in HDFS
[Figure: a file is divided into blocks (Block 1 to Block 4), which are distributed and replicated across the Datanodes.]
NameNode and DataNodes : Java Processes responsible for HDFS operations
Data Replication : Blocks of a file are replicated for fault tolerance
Replica Placement : A rack-aware replica placement policy
Replica Selection : Minimize global bandwidth consumption and read latency
File System Namespace : Hierarchical file organization
Safemode : File system consistency check
Blocks – Minimum amount of data that can be read or written; 128 MB by default
• Minimizes the cost of seeks
• A file can be larger than any single disk in the network
• Simplifies the storage subsystem – same size & eliminates metadata concerns
• Provides fault tolerance and availability
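As a concrete illustration of the block abstraction above, the following sketch (illustrative Python, not Hadoop code) shows how a file size maps to 128 MB blocks; note that, unlike a regular filesystem, HDFS does not pad the final block to the full block size:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size_bytes):
    """Return the sizes of the HDFS blocks a file of the given size occupies.

    Every block is full-sized except possibly the last one, which holds
    only the remaining bytes.
    """
    if file_size_bytes == 0:
        return []
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    last = file_size_bytes - (num_blocks - 1) * BLOCK_SIZE
    return [BLOCK_SIZE] * (num_blocks - 1) + [last]

# A hypothetical 300 MB file occupies two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
```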
Rack Awareness
Gets maximum performance out of Hadoop
Resolves a slave's DNS name (or IP address) to a rack ID
Interface: DNSToSwitchMapping
Rack topology example: /rack1 & /rack2
Replica Placement
Critical to HDFS reliability and performance
Improve data reliability, availability, and network bandwidth utilization
Distance b/w Nodes
Replica Placement (contd.)
Default strategy:
a) First replica is placed on the same node as the client.
b) Second replica is placed on a different rack from the first (off-rack), chosen at random.
c) Third replica is placed on the same rack as the second, but on a different node chosen at random.
d) Further replicas are placed on random nodes in the cluster.
Replica Selection - HDFS tries to satisfy a read request from a replica that is closest to the reader.
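The default placement strategy can be sketched as follows (an illustrative Python model with made-up rack and node names, not the actual HDFS BlockPlacementPolicy code; it assumes every rack has at least two nodes):

```python
import random

def place_replicas(client_node, topology, num_replicas=3):
    """Pick replica locations following HDFS's default placement strategy.

    topology: dict mapping rack id -> list of node names (hypothetical).
    Returns a list of (node, rack) pairs.
    """
    # Rack of the client (assume the client runs on a cluster node).
    client_rack = next(r for r, nodes in topology.items() if client_node in nodes)

    chosen = [(client_node, client_rack)]            # (a) same node as the client
    other_racks = [r for r in topology if r != client_rack]
    remote_rack = random.choice(other_racks)         # (b) a different rack, at random
    second = random.choice(topology[remote_rack])
    chosen.append((second, remote_rack))
    # (c) same rack as the second replica, but a different node
    third = random.choice([n for n in topology[remote_rack] if n != second])
    chosen.append((third, remote_rack))
    while len(chosen) < num_replicas:                # (d) further replicas at random
        rack = random.choice(list(topology))
        node = random.choice(topology[rack])
        if (node, rack) not in chosen:
            chosen.append((node, rack))
    return chosen[:num_replicas]

topology = {"/rack1": ["node1", "node2"], "/rack2": ["node3", "node4"]}
placements = place_replicas("node1", topology)
```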
FileSystem Image and Edit Logs
fsimage file is a persistent checkpoint of the filesystem metadata
When a client performs a write operation, it is first recorded in the edit log.
The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified
Secondary NameNode is used to produce checkpoints of the primary’s in-memory filesystem metadata
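The interplay of fsimage, the edit log, and the in-memory metadata follows a classic write-ahead-log pattern, sketched below (a toy Python model, not NameNode code):

```python
class TinyNamenode:
    """Toy model of NameNode metadata handling: every mutation is appended
    to the edit log *before* the in-memory state is updated, and a
    checkpoint folds the log into a fresh fsimage."""

    def __init__(self):
        self.fsimage = {}    # persistent checkpoint (here: a dict snapshot)
        self.edit_log = []   # journal of operations since the last checkpoint
        self.memory = {}     # in-memory filesystem metadata

    def write(self, path, meta):
        self.edit_log.append(("write", path, meta))  # 1. record in the edit log first
        self.memory[path] = meta                     # 2. then update in-memory state

    def checkpoint(self):
        # What the Secondary NameNode does: merge fsimage + edit log
        # into a new fsimage and truncate the log.
        self.fsimage = dict(self.memory)
        self.edit_log.clear()

    def restart(self):
        # On start-up, load fsimage into memory and replay the edits.
        self.memory = dict(self.fsimage)
        for op, path, meta in self.edit_log:
            if op == "write":
                self.memory[path] = meta
```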
FileSystem Image Structure
<FS_IMAGE>
<IMAGE_VERSION>-47</IMAGE_VERSION>
<NAMESPACE_ID>415263518</NAMESPACE_ID>
<GENERATION_STAMP>1000</GENERATION_STAMP>
<GENERATION_STAMP_V2>6953</GENERATION_STAMP_V2>
<GENERATION_STAMP_V1_LIMIT>0</GENERATION_STAMP_V1_LIMIT>
<LAST_ALLOCATED_BLOCK_ID>1073747777</LAST_ALLOCATED_BLOCK_ID>
<TRANSACTION_ID>62957</TRANSACTION_ID>
<LAST_INODE_ID>24606</LAST_INODE_ID>
<SNAPSHOT_COUNTER>0</SNAPSHOT_COUNTER>
<NUM_SNAPSHOTS_TOTAL>0</NUM_SNAPSHOTS_TOTAL>
<IS_COMPRESSED>false</IS_COMPRESSED>
<INODES NUM_INODES="1076">
<INODE>
<INODE_PATH>/</INODE_PATH>
<INODE_ID>16385</INODE_ID>
<REPLICATION>0</REPLICATION>
<MODIFICATION_TIME>2014-10-20 16:35</MODIFICATION_TIME>
<ACCESS_TIME>1970-01-01 05:30</ACCESS_TIME>
<BLOCK_SIZE>0</BLOCK_SIZE>
<BLOCKS NUM_BLOCKS="-1"></BLOCKS>
<NS_QUOTA>9223372036854775807</NS_QUOTA>
<DS_QUOTA>-1</DS_QUOTA>
<IS_SNAPSHOTTABLE_DIR>true</IS_SNAPSHOTTABLE_DIR>
<PERMISSIONS>
<USER_NAME>hduser</USER_NAME>
<GROUP_NAME>supergroup</GROUP_NAME>
<PERMISSION_STRING>rwxrwxrwx</PERMISSION_STRING>
</PERMISSIONS>
</INODE>
<SNAPSHOTS NUM_SNAPSHOTS="0">
<SNAPSHOT_QUOTA>0</SNAPSHOT_QUOTA>
</SNAPSHOTS>
<INODE>
<INODE_PATH>/data_in/stock1gbdata</INODE_PATH>
<INODE_ID>24568</INODE_ID>
<REPLICATION>3</REPLICATION>
<MODIFICATION_TIME>2014-10-28 15:58</MODIFICATION_TIME>
<ACCESS_TIME>2014-10-28 15:58</ACCESS_TIME>
<BLOCK_SIZE>134217728</BLOCK_SIZE>
<BLOCKS NUM_BLOCKS="81">
<BLOCK>
<BLOCK_ID>1073747677</BLOCK_ID>
<NUM_BYTES>134217670</NUM_BYTES>
<GENERATION_STAMP>6853</GENERATION_STAMP>
</BLOCK>
<BLOCK>
<BLOCK_ID>1073747678</BLOCK_ID>
<NUM_BYTES>134217646</NUM_BYTES>
<GENERATION_STAMP>6854</GENERATION_STAMP>
</BLOCK>
</BLOCKS>
</INODE>
</INODES>
<INODES_UNDER_CONSTRUCTION NUM_INODES_UNDER_CONSTRUCTION="0"></INODES_UNDER_CONSTRUCTION>
<CURRENT_DELEGATION_KEY_ID>0</CURRENT_DELEGATION_KEY_ID>
<DELEGATION_KEYS NUM_DELEGATION_KEYS="0"></DELEGATION_KEYS>
<DELEGATION_TOKEN_SEQUENCE_NUMBER>0</DELEGATION_TOKEN_SEQUENCE_NUMBER>
<DELEGATION_TOKENS NUM_DELEGATION_TOKENS="0"></DELEGATION_TOKENS>
</FS_IMAGE>
Safe Mode – On start-up, the NameNode loads its image file (fsimage) into memory and applies the edits from the edit log (edits).
It then performs the checkpointing process itself, without recourse to the Secondary NameNode.
Namenode is running in safe mode (offers only a read-only view to clients)
The locations of blocks in the system are not persisted by the NameNode - this information resides with the DataNodes, in the form of a list of the blocks it is storing.
Safe mode is needed to give the DataNodes time to check in to the NameNode with their block lists
Safe mode is exited when the minimal replication condition is reached, plus an extension time of 30 seconds.
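The exit condition can be sketched as a simple check (illustrative Python; the 99.9% figure corresponds to the default value of dfs.namenode.safemode.threshold-pct):

```python
def safe_to_leave(blocks_reported, total_blocks, threshold_pct=0.999):
    """Return True once the fraction of blocks that have reached their
    minimal replication meets the configured threshold. In the real
    NameNode, a 30-second extension then elapses before safe mode exits.
    """
    if total_blocks == 0:
        return True  # an empty namespace is trivially consistent
    return blocks_reported / total_blocks >= threshold_pct
```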
Administration
HDFS Trash
HDFS Quotas
Safe Mode
FS Shell
dfsadmin Command
HDFS Trash – Recycle Bin
When a file is deleted by a user, it is not immediately removed from HDFS. Instead, HDFS moves it to a file in the /trash directory.
A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace.
Undelete a file: the user navigates to the /trash directory and retrieves the file using the mv command.
File : core-site.xml
Property : fs.trash.interval
Description : Number of minutes after which the checkpoint gets deleted.
File : core-site.xml
Property : fs.trash.checkpoint.interval
Description : Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval.
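Putting the two properties together, a core-site.xml fragment enabling trash might look like this (the 1440/60 values are illustrative choices, not defaults):

```xml
<configuration>
  <!-- Keep deleted files in trash for 24 hours (1440 minutes). -->
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
  <!-- Create trash checkpoints every 60 minutes
       (must be smaller than or equal to fs.trash.interval). -->
  <property>
    <name>fs.trash.checkpoint.interval</name>
    <value>60</value>
  </property>
</configuration>
```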
HDFS Quotas
Name Quota – a hard limit on the number of file and directory names in the tree rooted at that directory.
Space Quota - a hard limit on the number of bytes used by files in the tree rooted at that directory.
Reporting Quotas – the count command of the HDFS shell reports quota values and the current counts of names and bytes in use. With the -q option, it also reports the name quota set for each directory, the remaining name quota, the space quota set, and the remaining space quota.
hadoop fs -count -q <directory>...
dfsadmin -setQuota <N> <directory>... Set the name quota to N for each directory.
dfsadmin -clrQuota <directory>... Remove any name quota for each directory.
dfsadmin -setSpaceQuota <N> <directory>... Set the space quota to N bytes for each directory.
dfsadmin -clrSpaceQuota <directory>... Remove any space quota for each directory.
dfsadmin Command – bin/hadoop dfsadmin [Generic Options] [Command Options]
-safemode enter /
leave / get / wait
Safe mode maintenance command. Safe mode can also be entered manually, but then it can only be turned off manually as well.
-report Reports basic filesystem information and statistics.
-refreshNodes Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned.
-metasave filename Save the Namenode's primary data structures to filename in the directory specified by the hadoop.log.dir property. filename is overwritten if it exists. filename will contain one line for each of the following:
1. Datanodes heart-beating with the Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
FS Shell – Some Basic Commands
cat
hadoop fs -cat URI [URI …]
Copies source paths to stdout.
chgrp
hadoop fs -chgrp [-R] GROUP URI [URI …]
Change group association of files. With -R, make the change recursively through the directory structure.
chmod
hadoop fs -chmod -R 777 hdfs://nn1.example.com/file1
Change the permissions of files. With -R, make the change recursively through the directory structure.
copyFromLocal / put
hadoop fs -copyFromLocal <localsrc> URI
Copy single src, or multiple srcs, from the local file system to the destination filesystem.
copyToLocal / get
hadoop fs -copyToLocal URI <localdst>
Copy files to the local file system.
FS Shell – Commands Continued…
expunge
hadoop fs -expunge
Empty the Trash.
mkdir
hadoop fs -mkdir <paths>
Takes path URIs as arguments and creates directories.
rmr
hadoop fs -rmr /user/hadoop/dir
Recursive version of delete.
touchz
hadoop fs -touchz pathname
Create a file of zero length.
du
hadoop fs -du URI [URI …]
Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
Modes
Local Standalone
Pseudo Distributed
Fully Distributed
Local Standalone (Non-distributed)
• Hadoop runs entirely in a single Java process on a single system (no separate daemons)
• Useful for debugging
Pseudo Distributed
• All daemons run on a single node
• Each Hadoop daemon runs in a separate Java process
Fully Distributed
• Master-Slave architecture
• One machine is designated as the NameNode and another as the JobTracker (both can reside on the same machine)
• The rest of the machines in the cluster act as both Datanode and TaskTracker
Hadoop Installation Steps:
(i) Create dedicated user & group
(ii) Establish authentication among nodes
(iii) Create Hadoop folder
(iv) Hadoop configuration
(v) Remote-copy Hadoop folder to slave nodes
(vi) Start Hadoop cluster
(vii) Testing Hadoop
(viii) Run simple WordCount program