Demystifying Big Data
Ashish Singh
Jan 29, 2018

Transcript

Page 1: Demystifying Big Data

UNDERSTANDING / PROCESSING

Ashish Singh, TE IT

Roll No. 4311

Page 2: Bigdata???

● How BIG is bigdata???

● 1 million rows?

● 10 million rows?

● 100 million rows?

● 500 million rows?

Let's find out how “BIG” it is.

Page 3: Size of BIGDATA

● Year 2000 ---> 800,000 petabytes of data

● By 2020 ---> 35 zettabytes of data

● Twitter: ~7 terabytes of data in a day

● Facebook: ~10 terabytes of data in a day

Isn't that a massive amount of data???

Page 4: How did it all start?

● In 1999, GOOGLE started indexing the giant web (the Internet)!!!

● Traditional enterprise architecture was not sufficient for this need.

● Google was a startup then and had NO MONEY to afford the “very expensive enterprise architecture”.

● So Google had to find a way out.

● And they did!!!

Page 5: Network Architectures

Enterprise Architecture

&

Cluster Architecture

Page 6: Enterprise architecture

● Servers (expensive)

● SAN (Storage Area Network)

● Storage (SSD & SATA)

● Overall, it becomes expensive.

Page 7: Enterprise architecture

● SAN connects servers to storage drives.

● Beauty: decoupling between servers and storage.

● Storage & servers can be removed or expanded independently of each other.

● But MAINTENANCE IS VERY COSTLY... :(

Page 8: Cluster architecture

● A node is a set of cores + main memory + hard disks, like we have in our own systems.

● Stack of nodes = rack

● Group of racks = cluster

● Nodes run on LINUX

● High-speed connections between nodes

Page 9: Cluster architecture

● Locality: processing happens in the same node, or at least the same rack, where the data is present.

● Massive parallelization.

● The network is no longer a bottleneck.

● Cluster architecture has some disadvantages, though.

Page 10: Cluster architecture

● Must detect and respond to failures (commodity hardware is not reliable).

● Data replication problem (3x replication).

● Even distribution of data for scale.

Page 11: Hadoop

Created by Doug Cutting.

Originally built to support distribution for the Nutch search engine.

Inspired by Google's MapReduce paper (2004) and the Bigtable paper.

Page 12: Okay, so what exactly is “Hadoop”?

● Named after a stuffed elephant belonging to the son of creator Doug Cutting.

● Inspired by Google's papers on MapReduce and Bigtable.

● Works on cluster architecture (using commodity hardware), not enterprise architecture.

Page 13: Components of Hadoop

● HDFS (Hadoop Distributed File System)

● MapReduce

Page 14: Need for Hadoop?

● Massive amounts of data were not possible to handle with traditional enterprise architecture.

● To enable applications to get the most out of cluster architecture.

● Data is distributed across the cluster (using HDFS).

● Data locality (using MapReduce).

● Using these two techniques, it became possible to handle big data.

Page 15: HDFS

● A filesystem that splits, scatters, and replicates data across the various nodes of a cluster.

● A file consists of equal-size file blocks.

● The file block size is a multiple of the underlying storage block size.

● File block size: 64 MB (the classic default).

● Storage block: 512 KB.

● The file block is the unit used to store data across the various nodes (see the sketch below).
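
To make the block arithmetic concrete, here is a minimal Java sketch; the 64 MB figure is the block size from the slide, and the 200 MB file size is just an assumed example:

    public class BlockCount {
        public static void main(String[] args) {
            final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB file blocks, as on the slide
            long fileSize = 200L * 1024 * 1024;        // assumed example: a 200 MB file

            long fullBlocks = fileSize / BLOCK_SIZE;   // 3 full 64 MB blocks
            long remainder  = fileSize % BLOCK_SIZE;   // 8 MB left over
            long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

            // Prints: 200 MB file -> 4 blocks (last block holds 8 MB)
            System.out.println("200 MB file -> " + totalBlocks + " blocks"
                    + " (last block holds " + remainder / (1024 * 1024) + " MB)");
        }
    }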

Page 16: HDFS Architecture

● Since nodes fail, these file blocks must be replicated, usually 3x.

● Hadoop treats all nodes as DataNodes, but designates one node as the NameNode.

● The NameNode keeps track of where all the copies and replicas of a particular file block are.

Page 17: HDFS

● The NameNode is responsible for mapping addresses to the file blocks.

● Upon failure, the NameNode identifies the failed node.

● The NameNode retrieves the copies of the file block.

● Replicates them onto new nodes.

Page 18: HDFS Architecture

● Updates the addresses of the new nodes in its mapping (a sketch of this recovery loop follows after this list).

● Any application that wants to read a file block must first approach the NameNode, and can then reach the file block directly (see the read sketch below).

● The NameNode is the single point of failure in Hadoop.

● A Secondary NameNode serves as a checkpoint helper for the NameNode (it is not a live standby).
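
The recovery idea above, as a purely illustrative Java sketch; the structures and helper names (blockMap, pickNewNode, copyBlock) are hypothetical and not Hadoop's real internals:

    import java.util.List;
    import java.util.Map;

    class ReplicationMonitor {
        static final int REPLICATION = 3; // the usual 3x replication factor

        // blockMap: for each block id, the nodes currently holding a replica
        void handleNodeFailure(String failedNode,
                               Map<String, List<String>> blockMap,
                               List<String> liveNodes) {
            for (Map.Entry<String, List<String>> entry : blockMap.entrySet()) {
                List<String> holders = entry.getValue();
                holders.remove(failedNode);              // drop the lost replica from the mapping
                if (holders.isEmpty()) continue;         // all replicas lost; nothing to copy from
                while (holders.size() < REPLICATION) {   // restore the replication factor
                    String source = holders.get(0);      // copy from any surviving replica
                    String target = pickNewNode(liveNodes, holders);
                    copyBlock(entry.getKey(), source, target); // hypothetical block transfer
                    holders.add(target);                 // update the mapping with the new node
                }
            }
        }

        String pickNewNode(List<String> liveNodes, List<String> exclude) {
            for (String node : liveNodes)
                if (!exclude.contains(node)) return node;
            throw new IllegalStateException("no spare node available");
        }

        void copyBlock(String blockId, String from, String to) { /* stub */ }
    }

And the read path, using the real HDFS Java API: the FileSystem client contacts the NameNode to resolve block locations, then streams data from the DataNodes directly. The URI and file path below are placeholder values:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Contacts the NameNode to resolve block locations for the file
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            // The returned stream then reads block data from the DataNodes directly
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }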

Page 19: HDFS

● Basically designed for large files

● Block oriented

● Linux-style commands, e.g. ls, cp, mv, rm (programmatic equivalents are sketched after this list)

● Fault tolerant on node failure

● Self-healing from time to time

● Scalable just by adding nodes
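
The same operations are available programmatically through the HDFS Java API; a minimal sketch, where the paths are placeholder values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsShellOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // ls: list a directory
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.println(status.getPath());
            }

            // mv: rename/move a file
            fs.rename(new Path("/data/old.txt"), new Path("/data/new.txt"));

            // rm: delete a file (second argument: recursive)
            fs.delete(new Path("/data/new.txt"), false);

            fs.close();
        }
    }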

Page 20: MapReduce

● Mapper Phase

● Reduce Phase

Page 21: Mapper Phase

● All nodes do the same computation.

● Computation is done on the data in the node, or near it.

● This reduces the data flow and thus reduces cost.

● Uses data locality.

● File blocks are the same size, so computation takes the same time, adding to the scale of the mapper function (a minimal Mapper sketch follows below).
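
A minimal word-count Mapper in Hadoop's Java API; the class and field names follow the classic tutorial example rather than anything specific to these slides:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Runs on every node, against the file blocks stored on (or near) that node.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {   // emit one (word, 1) pair per token
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }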

Page 22: MapReduce

● Comprises three classes:

● Mapper class

● Reducer class

● Driver class (a minimal sketch follows this list)

● TaskTracker / JobTracker
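
A minimal Driver class sketch that wires the Mapper and Reducer together and submits the job; TokenizerMapper and IntSumReducer are the classes sketched around these slides, and the input/output paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // The Driver configures and submits the job; the JobTracker then schedules
    // map and reduce tasks on TaskTrackers across the cluster.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);  // Mapper sketched earlier
            job.setReducerClass(IntSumReducer.class);   // Reducer sketched on the next slide
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }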

Page 23: Reducer

● The Reducer phase starts only after the Mapper is done.

● Takes (k, v) pairs and emits (k, v) pairs, as in the sketch below.
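
A matching word-count Reducer sketch: after the shuffle, it receives all values for one key at a time and emits a single (key, sum) pair:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {  // all counts for this word
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);       // e.g. ("hadoop", 42)
        }
    }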

Page 24: MapReduce structure (diagram)

Page 25: Hadoop data flow (diagram)

Page 26: Example: counting words (diagram)

Page 27: Counting words (diagram)

Page 28: Counting words (diagram)

Page 29: Counting words

Here the reducer combines the shuffled <key, value> pairs according to their keys on the reducer node, finally adding together all the values corresponding to each key.

In the output we get one <key, value> pair per distinct word.
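
As a concrete trace, assuming the input line below as an example:

    Input line:     "to be or not to be"
    Map output:     (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    After shuffle:  (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
    Reduce output:  (be,2) (not,1) (or,1) (to,2)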

Page 30: When should we choose Hadoop?

● Data is too huge

● Processes are independent

● Online analytical processing

● Better scalability

● Parallelism

● Unstructured data

Page 31: Real World Use Cases

● Clickstream analysis

● Sentiment Analysis

● Recommendation Engines.

● Ad Targeting.

● Search Quality

Page 32: Q / A

● Support Wikipedia (the free encyclopedia).

● Use open-source software.

● Thank you.