POORNIMA INSTITUTE OF ENGINEERING & TECHNOLOGY, JAIPUR
(DEPARTMENT OF COMPUTER ENGINEERING)
Big Data Hadoop
Presented By: Ashok Royal
Guided By: Dr. E.S. Pilli
Jun 30, 2015
©Ashok Royal
Topics
1. Organization Profile
2. Schedule
3. Training Description
4. Project Description
5. Learning
6. Snapshots
7. Future Scope
8. Conclusion
9. References
9 September 2014
Organization Profile
Name - Malviya National Institute of Technology, Jaipur.
MNIT, Jaipur is one of the 30 National Institutes of Technology in India.
MNIT was established in 1963, inspired by Pt. Madan Mohan Malviya.
The institute's director is I. K. Bhat and the chairman of the Board of Governors is Dr. K. K. Aggarwal.
Organization Contacts
Address: Jawaharlal Nehru Marg, Jhalana, Malviya Nagar, Jaipur, Rajasthan 302017
Phone: 0141 271 3201
Email: [email protected]
Website: www.mnit.ac.in
Schedule
Our training at MNIT was broadly divided into three phases:
Case study of Hadoop and related papers (first 30 days).
Hadoop cluster setup (first 30 days):
Single node
Multi node
Implementation of Near Duplicate Detection using Hadoop MapReduce (last 15 days).
Training Coordinator
Name - Dr. E. S. Pilli
Assistant Professor at MNIT, Jaipur. He holds a Ph.D. (CSE) from IIT Roorkee. He is a very supportive person and is currently guiding many M.Tech and Ph.D. students in Cloud Computing, Big Data and Botnets.
Email: [email protected]
What is Big Data?
Lots of data (terabytes or petabytes).
Big Data is a term for collections of data sets so large and complex that they become difficult to process using traditional data processing applications.
Big Data refers to large data sets that are challenging to store, search, share, visualize and analyze.
Various Forms of Data
Data comes mainly in three forms:
Structured data
Unstructured data
Semi-structured data
Why is Data so BIG?
20 terabytes of photos uploaded to Facebook each month.
330 terabytes of data produced by the Large Hadron Collider each week.
530 terabytes for all the videos on YouTube.
1 petabyte of data processed by Google's servers every 72 minutes.
Data growth
What is Hadoop?
It is an open-source software library written in Java.
The Hadoop software library is a framework that allows for the distributed processing of large data sets (Big Data) across clusters of computers using simple programming models.
Modules of Hadoop
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Hadoop Common
It provides access to the file systems supported by Hadoop.
The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop.
The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.
HDFS
Hadoop uses HDFS, a distributed file system based on GFS (the Google File System), as its shared filesystem.
The HDFS architecture divides files into large chunks (blocks) distributed across data servers.
It has a NameNode and DataNodes.
Main Components of HDFS
NameNode:
Master of the system.
Maintains and manages the blocks that are present on the DataNodes.
DataNodes:
Slaves, deployed on each machine, that provide the actual storage.
Responsible for serving read and write requests from the clients.
Main Components of HDFS
Secondary NameNode:
Used as a checkpoint.
Connects to the NameNode periodically (every hour by default).
Keeps a backup of the NameNode metadata.
The saved metadata can be used to rebuild a failed NameNode.
MapReduce
The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster.
A MapReduce computation has two phases:
a Map phase, and
a Reduce phase.
HDFS and MapReduce Layers
Hadoop Server Roles
JobTrackerMapReduce job
submitted by client computer
Master node
TaskTracker
Slave node
Task instance
TaskTracker
Slave node
Task instance
TaskTracker
Slave node
Task instance
9 September 2014
©Ashok Royal
Hadoop Architecture
Hadoop Streaming
It allows users to create and run MapReduce jobs with any executable or script as the mapper and/or reducer.
HDFS is designed to process large files on commodity clusters at high speed.
Write-once, read-many approach: after the data is placed, we tend to read it, not modify it.
The time to read the whole data set matters more than the latency of any single read.
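As a sketch of the Streaming contract described above: the mapper and reducer are ordinary programs that read lines from stdin and write tab-separated key/value lines to stdout, so any scripting language works. The job below (max temperature per station) and the record format are illustrative assumptions, not part of the project; the framework's shuffle phase is emulated here with a plain `sorted`.

```python
from itertools import groupby

def map_lines(lines):
    # mapper: turn CSV records like "jaipur,41" into "station<TAB>temp"
    for line in lines:
        station, temp = line.strip().split(",")
        yield f"{station}\t{temp}"

def reduce_lines(sorted_lines):
    # reducer: max temperature per station; input arrives sorted by key,
    # which the Hadoop shuffle phase guarantees on a real cluster
    pairs = (l.strip().split("\t") for l in sorted_lines)
    for station, grp in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{station}\t{max(int(t) for _, t in grp)}"

if __name__ == "__main__":
    # local dry run, emulating: cat input | mapper | sort | reducer
    data = ["jaipur,41", "jaipur,45", "ajmer,39"]
    for out in reduce_lines(sorted(map_lines(data))):
        print(out)  # → ajmer→39, jaipur→45
```

On a real cluster each half would run as its own script, submitted through the Streaming jar with mapper and reducer options.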
Hadoop Workflow
1. Load data into the cluster (HDFS writes).
2. Analyze the data (MapReduce).
3. Store results in the cluster (HDFS writes).
4. Read the results from the cluster (HDFS reads).
Word count Example
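The word-count example shown on the slide can be reproduced as a minimal in-memory simulation of the three stages: map emits (word, 1) pairs, shuffle groups values by key (done by the framework in real Hadoop), and reduce sums each group.

```python
from collections import defaultdict

def wordcount(lines):
    # map phase: each line -> (word, 1) pairs
    mapped = [(word, 1) for line in lines for word in line.split()]
    # shuffle phase: group values by key
    groups = defaultdict(list)
    for word, one in mapped:
        groups[word].append(one)
    # reduce phase: sum the values for each key
    return {word: sum(ones) for word, ones in groups.items()}

print(wordcount(["hadoop big data", "big data big"]))
# → {'hadoop': 1, 'big': 3, 'data': 2}
```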
Prominent Users of Hadoop
Amazon – 100 nodes
Facebook – two clusters of 8000 and 3000 nodes
Adobe – 80 nodes
eBay – 532 nodes
Yahoo – a cluster of about 4500 nodes
IIIT Hyderabad – 30 nodes
IBM, Microsoft and many more companies also use Hadoop.
Near duplicates = pairs of objects with high similarity.
Similarity is measured quantitatively by a similarity function.
Given a collection of records, the similarity join problem is to find all pairs of records <x, y> such that sim(x, y) >= t.
Tokenize:
Each record is a set of tokens from a finite universe.
Suppose each record is a single text document:
x = "yes as soon as possible"
y = "as soon as possible please"
x = {A, B, C, D, E}
y = {B, C, D, E, F}
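The tokenize-then-join idea above can be sketched as a brute-force similarity join over all pairs. Note one assumption: plain Python sets drop the repeated "as", so the score for the slide's two sentences comes out as 3/5 = 0.6 rather than the letter-token 4/6; a production system would also prune candidate pairs instead of comparing all O(n²) of them.

```python
from itertools import combinations

def tokenize(text):
    # each record becomes a set of tokens from a finite universe
    return set(text.lower().split())

def jaccard(x, y):
    return len(x & y) / len(x | y)

def similarity_join(docs, sim, t):
    """Return all pairs (a, b) of document ids with sim >= t.
    Brute force for illustration; real systems prefilter candidates."""
    records = {name: tokenize(body) for name, body in docs.items()}
    return [(a, b) for a, b in combinations(sorted(records), 2)
            if sim(records[a], records[b]) >= t]

docs = {"x": "yes as soon as possible", "y": "as soon as possible please"}
print(similarity_join(docs, jaccard, t=0.6))  # → [('x', 'y')]
```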
Project Description
Project Name - Near Duplicate Detection
Comparative analysis of the millions of documents on the web to find similar documents based on a predefined threshold value.
Near duplicate detection is essentially used in web crawls and many other data mining tasks.
Near duplicates are not "exact duplicates", but files with minute differences.
Application in Search Engine
Application in Search Engine (cont.)
Web documents with a similarity score greater than a predefined threshold are considered near duplicates.
These near-duplicate pages are not added to the search engine's repository.
This reduces the storage cost of search engines.
It improves the quality of the search index.
Similarity Function
For the calculation of similarity between two documents we used the Jaccard function.
Jaccard similarity function:
J(x, y) = |x ∩ y| / |x ∪ y| >= t
Example:
x = {A, B, C, D, E}, y = {B, C, D, E, F}
J(x, y) = 4/6 = 0.67
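The Jaccard function on the slide is a one-liner over Python sets; running it on the example token sets reproduces the 4/6 ≈ 0.67 score.

```python
def jaccard(x, y):
    """|x ∩ y| / |x ∪ y| — 1.0 for identical sets, 0.0 for disjoint sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(round(jaccard(x, y), 2))  # → 0.67
```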
Steps to Detect Near Duplicates
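One common MapReduce-style formulation of the detection steps (an illustrative assumption, not necessarily the exact pipeline in the slide's figure): the map phase emits (token, doc_id) pairs, the shuffle groups documents sharing a token into candidate pairs, and the reduce/verify phase keeps only pairs whose Jaccard score meets the threshold.

```python
from collections import defaultdict
from itertools import combinations

def near_duplicates(docs, t=0.6):
    tokens = {d: set(text.lower().split()) for d, text in docs.items()}
    # map + shuffle: build an inverted index, token -> set of doc ids
    index = defaultdict(set)
    for d, toks in tokens.items():
        for tok in toks:
            index[tok].add(d)
    # candidate generation: documents sharing at least one token
    candidates = set()
    for ds in index.values():
        candidates.update(combinations(sorted(ds), 2))
    # verify: keep candidates whose Jaccard score reaches the threshold
    return sorted((a, b) for a, b in candidates
                  if len(tokens[a] & tokens[b]) / len(tokens[a] | tokens[b]) >= t)

docs = {"d1": "yes as soon as possible",
        "d2": "as soon as possible please",
        "d3": "completely different text"}
print(near_duplicates(docs))  # → [('d1', 'd2')]
```

The inverted index prunes the candidate space: d3 shares no token with d1 or d2, so that pair is never scored.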
Snapshots - HDFS
Snapshot - MapReduce Processing
Conclusion
Training in Big Data helped us understand the current trends in the IT industry and how technology is becoming more fruitful for human development.
Big Data is the future. A lot of research is currently going on in this field. As data is increasing at an ever faster rate, there is a huge need for tools and technologies that can handle it.
Conclusion
Hadoop is the most prominent emerging framework, used by big firms such as Facebook, Microsoft, IBM, Yahoo, Amazon and many others.
Our experience at MNIT was absolutely awesome, as it gave us the platform and support for our tasks and case study.
Big Data and Big Data solutions are among the burning issues in the present IT industry, so working on them will surely make us more valuable in that field.
The proposed method solves the difficulties of information retrieval from the web.
The approach detects near-duplicate web pages efficiently based on the keywords extracted from the web pages.
It reduces the memory space needed for web repositories.
Near duplicate detection increases search engine quality.
References
J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM, 2003.
Qinsheng Du, Wei Liu, Guolin Li and Yonglin Tang. Near Duplicate Detection Using Map-Reduce. In 2012 2nd International Conference on Computer Science and Network Technology.
References
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near-Duplicate Detection.
Apache Hadoop. http://hadoop.apache.org.
hadoop-beginners.blogspot.com.
Namenode and Datanode
The NameNode executes file system namespace
operations like opening, closing, and renaming files
and directories. It also determines the mapping of
blocks to DataNodes.
The DataNodes are responsible for serving read and
write requests from the file system’s clients. The
DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode.
MapReduce Paradigm
Map phase: Once divided, datasets are assigned to task trackers to perform the Map phase. A functional operation is performed over the data, emitting the mapped key and value pairs as the output of the Map phase (i.e. data processing).
Reduce phase: The master node then collects the answers to all the subproblems and combines them to form the output, the answer to the problem it was originally trying to solve (i.e. data collection and digesting).
Any Queries?
Thank you