SCHEDULE
MODULE 1. Introduction to Big Data and Hadoop
MODULE 2. HDFS Internals, Hadoop Configuration, and Data Loading
MODULE 3. Introduction to MapReduce
MODULE 4. Advanced MapReduce Concepts
MODULE 5. Introduction to Pig
MODULE 6. Advanced Pig and Introduction to Hive
MODULE 7. Advanced Hive
MODULE 8. Extending Hive and Introduction to HBase
MODULE 9. Advanced HBase and Oozie
MODULE 10. Project Set-up Discussion

MODULE 1. Introduction to Big Data and Hadoop
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. More data may lead to more accurate analyses; more accurate analyses may lead to more confident decision making; and better decisions can mean greater operational efficiency, cost reductions and reduced risk.
The hopeful vision is that organizations will be able to take data from any source, harness the relevant data and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision making.
Big data analytics is often associated with cloud computing because analyzing large data sets in real time requires a platform like Hadoop to store the data across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
Hadoop makes it possible to run applications on systems with thousands of nodes handling thousands of terabytes of data. Its distributed file system facilitates rapid data transfer between nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of overall system failure, even if a significant number of nodes become inoperative.
Companies that need to process large and varied data sets frequently look to Apache Hadoop as a potential tool, because it offers the ability to process, store and manage huge amounts of both structured and unstructured data. The open source Hadoop framework is built on top of a distributed file system and a cluster architecture that enable it to transfer data rapidly and keep operating even if one or more compute nodes fail. But Hadoop is not a cure-all for every big data application need. And while big-name Internet companies like Yahoo, Facebook, Twitter, eBay and Google are prominent users of the technology, Hadoop projects are new undertakings for many other types of organizations.
Big data also comes from scientific domains such as genomics and astronomy.
WHAT IS BIG DATA?
Huge amounts of data (terabytes or petabytes)
Big in volume
Data in constant motion
Velocity: real-time capture and real-time analytics
Volume: petabytes per day or per week
Variety: unstructured data, web logs, audio, video, images, structured data
Such data cannot be stored affordably on a single physical machine
Storage is distributed across multiple systems
The file system itself is therefore distributed
Big data in industry: 1. Financial services 22%  2. Technology 16%  3. Telecommunications 14%  4. Retail 9%  5. Government 7%
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool; rather, it involves many areas of business and technology.
The data involved is of three types:
Structured data: relational data.
Semi-structured data: XML data.
Unstructured data: Word, PDF, text, media, logs.
Benefits of Big Data
Using information from social media, such as the preferences and product perceptions of their consumers, product companies and retail organizations can plan their production.
Using data on the previous medical history of patients, hospitals can provide better and quicker service.
Big Data Technologies
Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision making, resulting in greater operational efficiency, cost reductions, and reduced risk for the business.
There are various technologies in the market from different vendors, including Amazon, IBM, Microsoft, etc., to handle big data.
CHALLENGES ASSOCIATED WITH BIG DATA
Storage
Capture
Sharing
Visualization
Curation
Storage: Some vendors use increased memory and powerful parallel processing to crunch large volumes of data extremely quickly. Another method is putting data in-memory but using a grid computing approach, where many machines are used to solve a problem. Both approaches allow organizations to explore huge data volumes and gain business insights in near real time.
Capture: Even if you can capture and analyze data quickly and put it in the proper context for the audience that will be consuming the information, the data loses its value for decision making if it is not accurate or timely. This is a challenge with any data analysis, but given the volumes of information involved in big data projects, it becomes even more pronounced.
Sharing:
REAL TIME PROCESSING
Real-time data processing involves a continual input, processing and output of data.
Data processing time is typically very small (fractions of a second).
Examples:
Complex event processing (CEP) platforms, which combine data from multiple sources to detect patterns and attempt to identify either opportunities or threats.
Operational intelligence (OI) platforms, which use real-time data processing and CEP to gain insight into operations by running query analysis against live feeds and event data.
OI is near-real-time analytics over operational data and provides visibility over many data sources. The goal is to obtain near-real-time insights using continuous analytics so that the organization can take immediate action.
BATCH PROCESSING
Executing a series of non-interactive jobs all at one time.
Batch jobs can be stored up during working hours and then executed during the evening or whenever the computer is idle.
Batch processing is an efficient and preferred way of processing high volumes of data.
Data processing programs are run over a group of transactions collected over an agreed business time period.
Data is collected, entered and processed, and then the batch results are produced for every batch window.
Batch processing requires separate programs for input, processing and output.
Examples
An example of batch processing is the way that credit card companies process billing. The customer does not receive a bill for each separate credit card purchase but one monthly bill for all of that month's purchases. The bill is created through batch processing, where all of the data are collected and held until the bill is processed as a batch at the end of the billing cycle. Other examples include financial reporting and forecasting.
[Diagram: operational data, social data, historic data and service data are extracted, transformed and loaded through a big data batch processing pipeline that feeds BI and big data analysis]
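To make the batch-window idea concrete, here is a minimal, self-contained Java sketch (not from the course material, and it needs Java 16+ for the record syntax). It collects a day's purchases and produces one bill per customer, the way a billing batch job would at the end of a cycle; the Purchase record and the sample values are illustrative assumptions.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BillingBatch {
    // Hypothetical transaction record: customer id and purchase amount.
    record Purchase(String customerId, double amount) {}

    public static void main(String[] args) {
        // Transactions collected and held during the batch window.
        List<Purchase> window = List.of(
                new Purchase("C1", 25.00),
                new Purchase("C2", 10.50),
                new Purchase("C1", 99.99));

        // Processed together at the end of the cycle: one bill per customer.
        Map<String, Double> bills = window.stream()
                .collect(Collectors.groupingBy(Purchase::customerId,
                        Collectors.summingDouble(Purchase::amount)));

        bills.forEach((customer, total) ->
                System.out.printf("Monthly bill for %s: %.2f%n", customer, total));
    }
}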
HOW HADOOP CAME INTO EXISTENCE
Hadoop and big data have become almost synonymous, but they are two different things. Hadoop is a parallel programming model implemented on a cluster of low-cost commodity processors, and it is intended to support data-intensive distributed applications. That is what Hadoop is all about.
Due to the advent of new technologies, devices, and communication means such as social networking sites, the amount of data produced by mankind is growing rapidly every year.
Traditionally, an enterprise would have a single computer to store and process big data. For storage, programmers would rely on their choice of database vendor, such as Oracle or IBM. In this approach, the user interacts with the application, which in turn handles data storage and analysis.
This approach works fine for applications that process less voluminous data that can be accommodated by standard database servers, or up to the limit of the processor that is handling the data. But when it comes to dealing with huge amounts of scalable data, processing everything through a single database becomes a bottleneck.
Google solved this problem with an algorithm called MapReduce. This algorithm divides a task into small parts, assigns them to many computers, and collects the results from them, which, when integrated, form the result dataset. Using the solution described by Google, Doug Cutting and his team developed an open source project called Hadoop.
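As a rough illustration of this divide-assign-collect idea (plain Java, not Hadoop code, with an assumed word-counting task), the sketch below splits the input into chunks, counts words in each chunk independently, and then merges the partial counts into one result:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DivideAndCombine {
    // "Map" step: count words in one small part of the input.
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The task is divided into small parts (in Hadoop these would go to many machines).
        List<String> chunks = List.of("big data and hadoop", "hadoop stores big data");

        // "Reduce" step: integrate the partial results into the final dataset.
        Map<String, Integer> total = new HashMap<>();
        for (String chunk : chunks) {
            countChunk(chunk).forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        System.out.println(total);
    }
}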
WHAT IS HADOOP?
Hadoop is an Apache open source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop supports any type of data.
Hadoop is best suited to batch processing of big data.
Hadoop = storage + compute grid.
As mentioned above, Hadoop provides a place to store data in the form of a distributed file system.
Concepts that come under Hadoop include HDFS, MapReduce, Pig, Hive, Sqoop, Flume and HBase.
Core components of Hadoop: HDFS and MapReduce.
HADOOP KEY CHARACTERISTICS
Characteristics: reliable, scalable, economical, flexible.
HADOOP KEY DIFFERENTIATORS
Differentiating factors: robust, accessible, simple, scalable.
Hadoop is a system for large-scale data processing. It has two main components:
HDFS: distributed across nodes; natively redundant; the NameNode tracks block locations; self-healing, high-bandwidth clustered storage.
MapReduce: splits a task across processors, runs it near the data and assembles the results; the JobTracker manages the TaskTrackers.
MODULE 2. Hadoop Distributed File System
The most common file system used by Hadoop is the Hadoop Distributed File System (HDFS). It is designed to run on large clusters (thousands of computers) of small commodity machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where the master consists of a single Name Node that manages the file system metadata, and one or more slave Data Nodes that store the actual data.
A file in the HDFS namespace is split into several blocks, and those blocks are stored in a set of Data Nodes. The Name Node determines the mapping of blocks to the Data Nodes. The Data Nodes take care of read and write operations on the file system. They also take care of block creation, deletion and replication based on instructions given by the Name Node.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system. The file system is the storage-side component.
HDFS COMPONENTS
Name Node: http://localhost:50070/dfshealth.jsp
On the storage side it acts as the master of the system; HDFS has only one Name Node.
It maintains, manages and administers the data blocks present on the Data Nodes.
The default data block size is 64 MB (it can be changed).
The Name Node determines the mapping of blocks to the Data Nodes.
Read/write operations need to be fast, so seek time should be small. Increasing the block size reduces the share of time spent seeking relative to streaming the data, so the read/write operations become faster overall.
The Name Node keeps track of the overall file directory.
If the Name Node fails, a backup of the Name Node is essential, so the Secondary Name Node acts as that backup; we have to configure a Secondary Name Node.
Secondary Name Node:
The Secondary Name Node pulls the metadata from the Name Node once every hour.
If the Name Node fails in the middle of an hour, the changes made since the last copy cannot be traced back.
It takes a copy of the metadata every hour and keeps it safe.
[Diagram: metadata is copied from the Name Node to the Secondary Name Node]
HDFS ARCHITECTURE
[Diagram: clients send metadata operations to the Name Node and read/write data directly from the Data Nodes; block operations and replication take place between Data Nodes across Rack 1 and Rack 2]
RACK AWARENESS
[Diagram: blocks of File 1, File 2 and File 3 are replicated across Rack 1, Rack 2 and Rack 3]
HDFS FILE WRITE OPERATION
[Diagram: 1. Open  2. Create  3. Shows location  4. Write data  5. Ack packet - the user goes through the Distributed File System client, the Name Node supplies the block locations, and the data is written to the Data Nodes]
HDFS FILE READ OPERATION
[Diagram: 1. Open  2. Get block location  3. Read from the Data Nodes  4. Read  5. Close - the user goes through the Distributed File System client and an FS Data Input Stream, with block locations supplied by the Name Node]
MAP REDUCE FRAMEWORK
It takes the processing to the data.
It allows data to be processed in parallel.
The Reducer is responsible for processing one or more values which share a common key.
[Diagram: input key-value pairs (persistent data) -> Mapper: Map, Map, Map -> transient data -> Reducer: Reduce, Reduce, Reduce -> output key-value pairs (persistent data)]
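To show how these pieces look in the Hadoop Java API, here is a minimal word-count sketch using the org.apache.hadoop.mapreduce classes. It is an illustrative example rather than part of the course material, and it assumes the input and output paths are passed on the command line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in an input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums all the counts that share the same word (key).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");   // Job.getInstance(conf, ...) in newer releases
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The Mapper produces the transient intermediate key-value pairs shown in the diagram above, and the Reducer writes the persistent output.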