Transcript
Page 1: Apache hadoop

Apache Hadoop

- Large Scale Data Processing

Sharath Bandaru & Sai Dinesh Koppuravuri

Advanced Topics Presentation, ISYE 582: Engineering Information Systems

Page 2: Apache hadoop

Overview Understanding Big Data

Structured/Unstructured Data

Limitations Of Existing Data Analytics Structure

Apache Hadoop

Hadoop Architecture

HDFS

Map Reduce

Conclusions

References

Page 3: Apache hadoop

Understanding Big Data

Big Data is creating

Large and growing files

Measured in: terabytes (10^12 bytes) and petabytes (10^15 bytes)

Which are largely unstructured

Page 4: Apache hadoop

Structured/Unstructured Data

Page 5: Apache hadoop

Why Now? Data Growth

[Chart: data growth from 1980 to 2013; structured data ~20%, unstructured data ~80%]

Source : Cloudera, 2013

Page 6: Apache hadoop

Challenges posed by Big Data

Velocity: 400 million tweets in a day on Twitter; 1 million transactions by Wal-Mart every hour

Volume: 2.5 petabytes created by Wal-Mart transactions in an hour

Variety: videos, photos, text messages, images, audio, documents, emails, etc.

Page 7: Apache hadoop

Limitations Of Existing Data Analytics Architecture

Instrumentation → Collection → Storage-Only Grid (original raw data) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps

• Moving data to compute doesn't scale

• Can't explore the original high-fidelity raw data

• Archiving = premature data death

Page 8: Apache hadoop

So What is Apache Hadoop?

• A set of tools that supports running applications on big data.

• Core Hadoop has two main systems:

- HDFS: self-healing, high-bandwidth clustered storage.

- MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.

Page 9: Apache hadoop

History

Source : Cloudera, 2013

Page 10: Apache hadoop

The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):

• Schema must be created before any data can be loaded.

• An explicit load operation transforms the data into the database's internal structure.

• New columns must be added explicitly before data for those columns can be loaded.

• Pros: Read is Fast; Standards/Governance.

Schema-on-Read (Hadoop):

• Data is simply copied to the file store; no transformation is needed.

• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).

• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.

• Pros: Load is Fast; Flexibility/Agility.
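The schema-on-read idea can be sketched in a few lines of Python. This is a toy illustration, not Hive's actual SerDe interface; the record format and column names are invented for the example:

```python
import csv
import io

# Raw records are stored untouched, as on HDFS; no load-time transformation.
raw_store = [
    "2013-04-01,click,homepage",
    "2013-04-02,purchase,checkout,19.99",  # newer records carry an extra column
]

def read_with_schema(records, columns):
    """Apply the schema at read time (late binding), like a SerDe.
    Columns missing from a record simply come back as None."""
    for line in records:
        fields = next(csv.reader(io.StringIO(line)))
        row = dict(zip(columns, fields))
        yield {col: row.get(col) for col in columns}

# Original schema: the extra field in newer records is silently ignored.
v1 = list(read_with_schema(raw_store, ["date", "event", "page"]))

# Updating the "SerDe" exposes the new column retroactively; no reload needed.
v2 = list(read_with_schema(raw_store, ["date", "event", "page", "amount"]))
```

Note that the same raw bytes served both schemas; only the reader changed, which is the agility the slide is pointing at.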

Page 11: Apache hadoop

Use The Right Tool For The Right Job

Relational Databases, use when:
• Interactive OLAP analytics (< 1 sec)
• Multi-step ACID transactions
• 100% SQL compliance

Hadoop, use when:
• Data is structured or not (flexibility)
• Scalability of storage/compute is needed
• Complex data processing

Page 12: Apache hadoop

Traditional Approach

The enterprise approach: move the Big Data to a single powerful computer for processing, which eventually hits that machine's processing limit.

Page 13: Apache hadoop

Hadoop Architecture

[Diagram]
Master: Job Tracker and Name Node (in this diagram, the master also runs a Task Tracker and Data Node)
Slaves: each node runs a Task Tracker and a Data Node
MapReduce layer: Job Tracker + Task Trackers
HDFS layer: Name Node + Data Nodes

Page 14: Apache hadoop

Hadoop Architecture

[Same diagram as Page 13, now with an Application submitting work to the cluster]

Page 15: Apache hadoop

Job Tracker

[Same diagram: the Application submits its job to the Job Tracker on the master, which schedules tasks on the Task Trackers of the slave nodes]


Page 17: Apache hadoop

HDFS: Hadoop Distributed File System

• A given file is broken into blocks (default = 64 MB), then the blocks are replicated across the cluster (default replication factor = 3).

[Diagram: a file's five blocks, each replicated on three Data Nodes]

Optimized for: throughput; put/get/delete; appends.

Block replication provides: durability; availability; throughput.

Block replicas are distributed across servers and racks.
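The block-splitting and replica-placement idea can be sketched as follows. This is a toy model, not HDFS's actual placement policy (the real Name Node also accounts for racks), and the node names are invented:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # default block size cited on this slide: 64 MB
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file (ceiling division)."""
    return (file_size + block_size - 1) // block_size

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block's replicas to distinct nodes, round-robin.
    A stand-in for the Name Node's rack-aware placement policy."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file needs 4 blocks
layout = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
```

With three replicas per block on distinct nodes, any single Data Node can fail without losing data, which is the durability/availability point above.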

Page 18: Apache hadoop

Fault Tolerance for Data

[Architecture diagram highlighting the HDFS layer: the Name Node on the master, Data Nodes on the slaves]

Page 19: Apache hadoop

Fault Tolerance for Processing

[Architecture diagram highlighting the MapReduce layer: the Job Tracker on the master, Task Trackers on the slaves]

Page 20: Apache hadoop

Fault Tolerance for Processing

[Same architecture diagram]

Tables are backed up

Page 21: Apache hadoop

MapReduce

Input Data → Map (five parallel map tasks) → Shuffle → Reduce (two reduce tasks) → Results
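The Map → Shuffle → Reduce flow can be imitated in a single Python process with the classic word-count example. This is a sketch of the programming model only, not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit one <key, value> pair per word."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: collapse each key's value list into one result."""
    return key, sum(values)

input_data = ["big data is big", "data is growing"]
mapped = [pair for line in input_data for pair in map_phase(line)]
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In Hadoop the map calls run in parallel on the Task Trackers and the framework performs the shuffle over the network; the per-record logic stays this small.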

Page 22: Apache hadoop

Understanding the concept of Map Reduce

Mother

Sam

An Apple

• Believed “an apple a day keeps a doctor away”

The Story Of Sam

Page 23: Apache hadoop

Understanding the concept of Map Reduce

• Sam thought of “drinking” the apple.

He used a knife to cut the apple and a blender to make juice.

Page 24: Apache hadoop

Understanding the concept of Map Reduce

Next day, Sam applied his invention to all the fruits he could find in the fruit basket:

(map ‘( )’)

(reduce ‘( )’)

Classical notion of Map Reduce in functional programming: a list of values is mapped into another list of values, which gets reduced into a single value.
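In Python, this classical functional notion looks like the following (the fruit names are illustrative, standing in for the pictures on the slide):

```python
from functools import reduce

# Map a list of values into another list of values (cut each fruit)...
fruits_cut = map(lambda fruit: "cut " + fruit, ["apple", "orange", "pear"])

# ...then reduce that list into a single value (blend everything into one juice).
juice = reduce(lambda acc, piece: acc + " + " + piece, fruits_cut)
```
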

Page 25: Apache hadoop

Understanding the concept of Map Reduce

18 Years Later

• Sam got his first job at “Tropicana” for his expertise in making juices.

• Now it's not just one basket, but a whole container of fruits.

• Also, they produce a list of juice types separately.

• But Sam had just ONE knife and ONE blender. NOT ENOUGH!

• Large data, and a list of values for output. Wait!

Page 26: Apache hadoop

Understanding the concept of Map Reduce

Brave Sam implemented a parallel version of his innovation:

• Each input to a map is a list of <key, value> pairs: (<a, > , <o, > , <p, > , …)

• Each output of a map is a list of <key, value> pairs: (<a’, > , <o’, > , <p’, > , …)

• Grouped by key, each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <a’, ( …)>, which is reduced into a list of values.
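The grouping/hashing mechanism mentioned above can be sketched as a toy partitioner. The hash function and keys here are invented for the example (Hadoop's default HashPartitioner uses the key's hash code modulo the number of reducers):

```python
# Grouped map output: each key carries its list of values.
groups = {"a'": ["apple", "apricot"], "o'": ["orange"], "p'": ["pear"]}
NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    """Toy hash: the key's first character code modulo the reducer count."""
    return ord(key[0]) % num_reducers

# Route each <key, value-list> to the reducer its hash selects.
reducers = {r: [] for r in range(NUM_REDUCERS)}
for key, values in groups.items():
    reducers[partition(key)].append((key, values))
```

All pairs with the same key land on the same reducer, so each reducer sees a complete <key, value-list> and never needs another reducer's data.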

Page 27: Apache hadoop

Understanding the concept of Map Reduce

• Sam realized:

– To create his favorite mixed-fruit juice, he can use a combiner after the reducers.

– If several <key, value-list> pairs fall into the same group (based on the grouping/hashing algorithm), use the blender (reducer) separately on each of them.

– The knife (mapper) and blender (reducer) should not contain residue after use: Side Effect Free.

Source: (Map Reduce, 2010).

Page 28: Apache hadoop

Conclusions

• The key benefits of Apache Hadoop:

1) Agility/Flexibility (Quickest Time to Insight)

2) Complex Data Processing (Any Language, Any Problem)

3) Scalability of Storage/Compute (Freedom to Grow)

4) Economical Storage (Keep All Your Data Alive Forever)

• The key systems of Apache Hadoop:

1) Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.

2) MapReduce: distributed, fault-tolerant resource management coupled with scalable data processing.

Page 29: Apache hadoop

References

• Ekanayake, S. (2010, March). MapReduce: The Story of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html

• Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.

• The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/

• Drost, I. (2010, February). Apache Hadoop: Large Scale Data Analysis Made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8

• Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI