UNDERSTANDING/PROCESSING
Ashish Singh, TE IT, Roll No. 4311
Big Data???
● How BIG is big data???
● 1 million rows?
● 10 million rows?
● 100 million rows?
● 500 million rows?
Let's find out how “BIG” it is.
Size of BIGDATA
● Year 2000 ---> 800k Petabytes of data
● By 2020 ----> 35 Zettabytes of data
● Twitter 7 terabytes in a day
● Facebook 10 terabytes in a day
Isn't that a massive amount of data???
How did it all start?
● In 1999, Google started indexing the giant web (the Internet)!!!
● Traditional enterprise architecture was not sufficient for this need.
● Google was a startup then and had NO MONEY to afford the “very expensive” enterprise architecture.
● So Google had to find a way out.
● And they did!!!
Network Architectures
Enterprise Architecture
&
Cluster Architecture
Enterprise architecture
● Servers (expensive)
● SAN (Storage Area Network)
● Storage (SSD & SATA)
● Overall, it becomes expensive.
Enterprise architecture
● SAN connects servers to storage drives.
● Beauty: decoupling between servers and storage.
● Storage & servers can be removed or expanded independently of each other.
● MAINTENANCE IS VERY COSTLY... :(
Cluster architecture
● A node is a set of cores + main memory + hard disks, like we have in our own systems.
● Stack of nodes = rack
● Group of racks = cluster
● Nodes run on Linux.
● Nodes and racks are linked by a high-speed connection.
Cluster architecture
● Locality: processing happens in the same node, or at least in the same rack, where the data is present.
● Massive parallelization.
● The network is no longer a bottleneck.
● Cluster architecture has some disadvantages.
Cluster architecture
● Must detect and respond to failures (commodity hardware is not reliable).
● Data replication problem (3x replication).
● Data must be evenly distributed for scale.
Hadoop
Created by Doug Cutting.
Originally built to support distribution for the Nutch search engine.
Inspired by Google's MapReduce paper (2004) and the BigTable paper.
Okay, so what exactly is “Hadoop”?
● Named after the stuffed elephant of creator Doug Cutting's son.
● Inspired by the Google papers on MapReduce and BigTable.
● Works on cluster architecture (using commodity hardware), not enterprise architecture.
Components of Hadoop
● HDFS (Hadoop Distributed File System)
● MapReduce
Need for Hadoop?
● Massive amounts of data that were not possible to handle with traditional enterprise architecture.
● To enable applications to get the most out of cluster architecture.
● Data is distributed across the cluster (using HDFS).
● Data locality (using MapReduce).
● Using these two techniques, it was possible to handle big data.
HDFS
● A filesystem that splits, scatters, and replicates data across the various nodes of a cluster.
● A file consists of equal-size file blocks.
● The file block size is a multiple of the disk storage block size.
● File block size: 64 MB (the classic default).
● Disk storage block: typically 512 bytes.
● The file block is the unit used to store data across the various nodes.
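As a rough illustration of the numbers above (assuming the classic 64 MB default block size), splitting a file into HDFS-style blocks is simple arithmetic:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(file_size_bytes):
    """Return the sizes of the blocks a file is split into.

    Every block is BLOCK_SIZE bytes except possibly the last one,
    which holds the remainder of the file.
    """
    if file_size_bytes == 0:
        return []
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    sizes = [BLOCK_SIZE] * (n_blocks - 1)
    sizes.append(file_size_bytes - BLOCK_SIZE * (n_blocks - 1))
    return sizes

# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
```

Each of these blocks can then be placed on (and replicated to) different nodes.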
HDFS Architecture
● Since nodes fail, these file blocks must be replicated (usually 3x).
● Hadoop treats all nodes as datanodes, but designates one node as the namenode.
● The namenode keeps track of where all the copies and replicas of a particular file block are.
HDFS
● The namenode is responsible for mapping addresses to the file blocks.
● Upon failure, the namenode identifies the failed node.
● The namenode retrieves the surviving copies of the file blocks.
● It replicates them onto new nodes.
HDFS Architecture
● It then updates the addresses of the new nodes in its mapping.
● Any application that wants to read a file block must first approach the namenode, and can then reach the file block directly.
● The namenode is the single point of failure in Hadoop.
● Secondary namenodes serve as a backup for the namenode.
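The failure-handling steps above can be sketched with a plain dictionary standing in for the namenode's block map (a toy model, not the real HDFS implementation; the node and block names are made up):

```python
# Toy namenode: maps each block to the set of nodes holding a replica.
REPLICATION = 3

block_map = {
    "blk_1": {"node_a", "node_b", "node_c"},
    "blk_2": {"node_b", "node_c", "node_d"},
}
all_nodes = {"node_a", "node_b", "node_c", "node_d", "node_e"}

def handle_node_failure(failed_node):
    """Re-replicate every block that lost a replica on the failed node."""
    all_nodes.discard(failed_node)
    for block, holders in block_map.items():
        if failed_node in holders:
            holders.discard(failed_node)
            # Copy the block from a surviving replica onto a fresh node.
            candidates = all_nodes - holders
            while len(holders) < REPLICATION and candidates:
                holders.add(candidates.pop())

handle_node_failure("node_b")
# Every block is back at 3 replicas, none of them on node_b.
```

A real namenode also updates its address mapping and serves it to reading applications; here the dictionary itself is that mapping.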
HDFS
● Basically designed for large files.
● Block-oriented.
● Linux-style commands, e.g. ls, cp, mv, rm.
● Fault-tolerant on node failure.
● Self-healing from time to time.
● Scalable just by adding nodes.
MapReduce
● Mapper Phase
● Reduce Phase
Mapper Phase
● All nodes do the same computation.
● Computation is done on the data in the node, or near it.
● This reduces data flow and thus reduces cost.
● Uses data locality.
● File blocks are the same size, so computation takes roughly the same time on each, which helps the mapper phase scale.
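A mapper takes its local block of input and emits intermediate (key, value) pairs. A minimal sketch in plain Python (not the Hadoop Java API), using the word-counting example that comes up later:

```python
def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield (word, 1)

# Every node runs this same mapper over the lines in its own block.
pairs = list(mapper("the quick brown fox the"))
# pairs == [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1), ("the", 1)]
```

Note that the mapper does no aggregation; summing the 1s for each word is the reducer's job.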
MapReduce
● Comprises three classes:
● Mapper class
● Reducer class
● Driver class
● TaskTracker / JobTracker
Reducer
● The reducer phase starts only after the mapper phase is done.
● Takes (k, v) pairs and emits (k, v) pairs.
MapReduce structure
Hadoop data flow
Example: counting words
Here the reducer combines the shuffled <key, value> pairs according to their keys on the reducer node, and finally adds up all the values corresponding to a particular key. The output is again a set of <key, value> pairs.
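The whole word-counting flow can be simulated in a few lines of plain Python (a sketch of the idea, not the Hadoop API): map each line to (word, 1) pairs, shuffle the pairs by key, then reduce each group by summing its values.

```python
from collections import defaultdict

lines = ["deer bear river", "car car river", "deer car bear"]

# Map: every line becomes a list of (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key, as if routing them to reducer nodes.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
# counts == {"deer": 2, "bear": 2, "river": 2, "car": 3}
```

In real Hadoop the three stages run on different nodes, with the shuffle moving data over the network; here they are just three sequential steps.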
When should we choose Hadoop?
● Data is too huge
● Processes are independent
● Online analytical processing
● Better scalability
● Parallelism
● Unstructured data
Real-World Use Cases
● Clickstream analysis
● Sentiment analysis
● Recommendation engines
● Ad targeting
● Search quality
Q / A
● Support Wikipedia (the free encyclopedia).
● Use open-source software.
● Thank you.