UNDERSTANDING/PROCESSING
Ashish Singh, TE IT, Roll No. 4311
Big Data???
● How BIG is big data???
● 1 million rows?
● 10 million rows?
● 100 million rows?
● 500 million rows?
Let's find out how “BIG” it is.
Size of BIGDATA
● Year 2000 ---> 800k Petabytes of data
● By 2020 ----> 35 Zettabytes of data
● Twitter 7 terabytes in a day
● Facebook 10 terabytes in a day
Isn't that a massive amount of data???
How did it all start?
● In 1999, Google started indexing the giant web (the Internet)!!!
● Traditional enterprise architecture was not sufficient for this need.
● Google was a startup then and had NO MONEY to afford the “very expensive” enterprise architecture.
● So Google had to find a way out.
● And they did!!!
Network Architectures
Enterprise Architecture
&
Cluster Architecture
Enterprise architecture
● Servers (expensive)
● SAN (Storage Area Network)
● Storage (SSD & SATA)
● Overall, it becomes expensive.
Enterprise architecture
● SAN connects servers to storage drives.
● Beauty: decoupling between servers and storage.
● Storage & servers can be removed or expanded independently of each other.
● MAINTENANCE IS VERY COSTLY... :(
Cluster architecture
● A node is a set of cores + main memory + hard disks, like we have in our own systems.
● Stack of nodes = rack
● Group of racks = cluster
● Nodes run on Linux.
● Nodes and racks are linked by a high-speed connection.
Cluster architecture
● Locality: processing happens in the same node, or at least in the same rack, where the data is present.
● Massive parallelization.
● The network is no longer a bottleneck.
● Cluster architecture has some disadvantages.
Cluster architecture
● Must detect and respond to failures (commodity hardware is not reliable).
● Data replication problem (3x replication).
● Data must be evenly distributed for scale.
Hadoop
Created by Doug Cutting.
Originally built to support distribution for the Nutch search engine.
Inspired by Google's MapReduce paper (2004) and the BigTable paper.
Okay, so what exactly is “Hadoop”?
● Named after the stuffed elephant of creator Doug Cutting's son.
● Inspired by the Google papers on MapReduce and BigTable.
● Works on cluster architecture (using commodity hardware), not enterprise architecture.
Components of Hadoop
● HDFS (Hadoop Distributed File System)
● MapReduce
Need for Hadoop?
● Massive amounts of data that were not possible to handle with traditional enterprise architecture.
● To enable applications to get the most out of cluster architecture.
● Data is distributed across the cluster (using HDFS).
● Data locality (using MapReduce).
● Using these two techniques, it was possible to handle big data.
HDFS
● A filesystem that splits, scatters, and replicates data across the various nodes of a cluster.
● A file consists of equal-size file blocks.
● The file block size is a multiple of the disk storage block size.
● File block size: 64 MB (the classic default).
● Disk storage block: typically 512 bytes.
● The file block is the unit used to store data across the various nodes.
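As a rough illustration of the numbers above (assuming the classic 64 MB default block size), splitting a file into HDFS-style blocks is simple arithmetic:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(file_size_bytes):
    """Return the sizes of the blocks a file is split into.

    Every block is BLOCK_SIZE bytes except possibly the last one,
    which holds the remainder of the file.
    """
    if file_size_bytes == 0:
        return []
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    sizes = [BLOCK_SIZE] * (n_blocks - 1)
    sizes.append(file_size_bytes - BLOCK_SIZE * (n_blocks - 1))
    return sizes

# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
```

Each of these blocks can then be placed on (and replicated to) different nodes.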
HDFS Architecture
● Since nodes fail, these file blocks must be replicated (usually 3x).
● Hadoop treats all nodes as datanodes, but designates one node as the namenode.
● The namenode keeps track of where all the copies and replicas of a particular file block are.
HDFS
● The namenode is responsible for mapping addresses to the file blocks.
● Upon failure, the namenode identifies the failed node.
● The namenode retrieves the surviving copies of the file blocks.
● It replicates them onto new nodes.
HDFS Architecture
● It then updates the addresses of the new nodes in its mapping.
● Any application that wants to read a file block must first approach the namenode, and can then reach the file block directly.
● The namenode is the single point of failure in Hadoop.
● Secondary namenodes serve as a backup for the namenode.
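The failure-handling steps above can be sketched with a plain dictionary standing in for the namenode's block map (a toy model, not the real HDFS implementation; the node and block names are made up):

```python
# Toy namenode: maps each block to the set of nodes holding a replica.
REPLICATION = 3

block_map = {
    "blk_1": {"node_a", "node_b", "node_c"},
    "blk_2": {"node_b", "node_c", "node_d"},
}
all_nodes = {"node_a", "node_b", "node_c", "node_d", "node_e"}

def handle_node_failure(failed_node):
    """Re-replicate every block that lost a replica on the failed node."""
    all_nodes.discard(failed_node)
    for block, holders in block_map.items():
        if failed_node in holders:
            holders.discard(failed_node)
            # Copy the block from a surviving replica onto a fresh node.
            candidates = all_nodes - holders
            while len(holders) < REPLICATION and candidates:
                holders.add(candidates.pop())

handle_node_failure("node_b")
# Every block is back at 3 replicas, none of them on node_b.
```

A real namenode also updates its address mapping and serves it to reading applications; here the dictionary itself is that mapping.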
HDFS
● Basically designed for large files.
● Block-oriented.
● Linux-style commands, e.g. ls, cp, mv, rm.
● Fault-tolerant on node failure.
● Self-healing from time to time.
● Scalable just by adding nodes.
MapReduce
● Mapper Phase
● Reduce Phase
Mapper Phase
● All nodes do the same computation.
● Computation is done on the data in the node, or near it.
● This reduces data flow and thus reduces cost.
● Uses data locality.
● File blocks are the same size, so computation takes roughly the same time on each, which helps the mapper phase scale.
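A mapper takes its local block of input and emits intermediate (key, value) pairs. A minimal sketch in plain Python (not the Hadoop Java API), using the word-counting example that comes up later:

```python
def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield (word, 1)

# Every node runs this same mapper over the lines in its own block.
pairs = list(mapper("the quick brown fox the"))
# pairs == [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1), ("the", 1)]
```

Note that the mapper does no aggregation; summing the 1s for each word is the reducer's job.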
MapReduce
● Comprises three classes:
● Mapper class
● Reducer class
● Driver class
● TaskTracker / JobTracker
Reducer
● The reducer phase starts only after the mapper phase is done.
● Takes (k, v) pairs and emits (k, v) pairs.
MapReduce structure
Hadoop data flow
Example: counting words
Here the reducer combines the shuffled <key, value> pairs according to their keys on the reducer node, and finally adds up all the values corresponding to a particular key. The output is again a set of <key, value> pairs.
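The whole word-counting flow can be simulated in a few lines of plain Python (a sketch of the idea, not the Hadoop API): map each line to (word, 1) pairs, shuffle the pairs by key, then reduce each group by summing its values.

```python
from collections import defaultdict

lines = ["deer bear river", "car car river", "deer car bear"]

# Map: every line becomes a list of (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key, as if routing them to reducer nodes.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
# counts == {"deer": 2, "bear": 2, "river": 2, "car": 3}
```

In real Hadoop the three stages run on different nodes, with the shuffle moving data over the network; here they are just three sequential steps.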
When should we choose Hadoop?
● Data is too huge
● Processes are independent
● Online analytical processing
● Better scalability
● Parallelism
● Unstructured data
Real-World Use Cases
● Clickstream analysis
● Sentiment analysis
● Recommendation engines
● Ad targeting
● Search quality
Q / A
● Support Wikipedia (the free encyclopedia).
● Use open-source software.
● Thank you.