Hadoop Mayuri Agarwal
Page 1: Hadoop

Hadoop

Mayuri Agarwal

Page 2: Hadoop

Data Management!

Page 3: Hadoop
Page 4: Hadoop

Big Data: What does it mean?

Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business. Batch, near-time, real-time, streams.

Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. TB, records, transactions, tables, files.

Variety: Big data extends beyond structured data to include semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more. Structured, unstructured, semi-structured.

Veracity: Quality and provenance of received data. Good, undefined, bad, inconsistency, incompleteness, ambiguity.

Value

Page 5: Hadoop

Big Data

Worldwide Data: 90% created in the last 2 years, 10% in all the time before that.

Page 6: Hadoop

What is Hadoop?

Software project that enables the distributed processing of large data sets across clusters of commodity servers

Works with structured and unstructured data

Open-source software + commodity hardware = IT cost reduction

It is designed to scale up from a single server to thousands of machines

Very high degree of fault tolerance: it relies on the software’s ability to detect and handle failures at the application layer
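
A minimal sketch of what handling failures at the application layer can look like: when a task fails on one node, the software simply re-runs it on another. The function names (`run_with_retries`, `flaky_worker`, `healthy_worker`) and the retry limit are hypothetical, chosen only for illustration; Hadoop's own task re-execution is considerably more involved.

```python
def run_with_retries(task, workers, max_attempts=3):
    """Try `task` on successive workers until one of them succeeds.

    Returns (result, attempts_used). Hypothetical helper, not a Hadoop API.
    """
    last_error = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]  # move on to the next node
        try:
            return worker(task), attempt + 1
        except RuntimeError as exc:               # failure detected in software
            last_error = exc
    raise RuntimeError(f"task {task!r} failed after {max_attempts} attempts") from last_error


def flaky_worker(task):
    """Simulates a node that always fails."""
    raise RuntimeError("simulated node failure")


def healthy_worker(task):
    """Simulates a node that completes the task."""
    return task.upper()


result, attempts = run_with_retries("block-7", [flaky_worker, healthy_worker])
print(result, attempts)
```

The point of the sketch is that no special hardware is assumed: the failure is noticed and worked around entirely in code.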

Page 7: Hadoop

The origin of the name Hadoop

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”

Page 8: Hadoop

Hadoop Sub-projects

HDFS

MapReduce

Page 9: Hadoop

HDFS: Hadoop Distributed File System

Distributed, scalable, and portable file system

A Hadoop cluster typically has a single Namenode; a cluster of Datanodes forms the HDFS cluster

Asynchronous replication.

Data is divided into 64 MB (default) or 128 MB blocks; each block is replicated 3 times (default)

Namenode holds file system metadata.

Files are broken up into blocks and spread over the Datanodes.
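
The block and replication figures above can be worked through with a short sketch. It assumes the 64 MB block size and replication factor of 3 quoted on this slide; the Datanode names and the round-robin placement are hypothetical simplifications (real HDFS placement also takes racks into account).

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the default quoted on the slide
REPLICATION = 3                 # default replication factor quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct Datanodes, round-robin.

    Toy policy for illustration only; not the HDFS placement algorithm.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)            # a 200 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_bytes = sum(blocks) * REPLICATION                    # storage used cluster-wide

print(len(blocks), placement[0], raw_bytes // (1024 * 1024))
```

Under these assumptions a 200 MB file becomes three full 64 MB blocks plus one 8 MB block, and the cluster stores 600 MB of raw data for it.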

Page 10: Hadoop

HDFS: Read & Write

Page 11: Hadoop

MapReduce

Software framework for distributed computation

Input | Map() | Copy/Sort | Reduce() | Output

JobTracker schedules and manages jobs.

TaskTracker executes individual map() and reduce() tasks on each cluster node.

Page 12: Hadoop

Example : MapReduce
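
A minimal single-process sketch of the Input | Map() | Copy/Sort | Reduce() | Output pipeline, assuming a word count as the example job. The function names are hypothetical and nothing here uses the Hadoop API; on a real cluster each phase runs distributed across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_sort(pairs):
    """Copy/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Each map call sees only one line and each reduce call sees only one key, which is what lets the framework spread the work over many machines.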

Page 13: Hadoop

Master – Slave Model

Page 14: Hadoop

Hadoop Ecosystem

Page 15: Hadoop

HBase

HBase is an open-source, non-relational, distributed database

A key-value store

A value is identified by its key

Both key and value are byte arrays

The values are stored in key order

Thus, accessing data by key is very fast

Users create tables in HBase

HBase tables have no fixed schema

Very good for sparse data

Takes lots of disk space
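
To see why storing values in key order makes access by key fast, here is a toy in-memory sketch with byte-array keys and values. The class `SortedKVStore` and its methods are hypothetical and are not the HBase client API; real HBase persists its data on HDFS and adds features that are not modeled here.

```python
import bisect


class SortedKVStore:
    """Toy key-value store that keeps byte-array keys in sorted order."""

    def __init__(self):
        self._keys = []     # sorted list of bytes keys
        self._values = []   # values, aligned index-for-index with _keys

    def put(self, key: bytes, value: bytes) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # overwrite existing key
        else:
            self._keys.insert(i, key)        # insert while preserving order
            self._values.insert(i, value)

    def get(self, key: bytes):
        """Binary search for one key; returns None when it is absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start: bytes, stop: bytes):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


store = SortedKVStore()
store.put(b"row3", b"c")
store.put(b"row1", b"a")
store.put(b"row2", b"b")
rows = store.scan(b"row1", b"row3")
print(store.get(b"row2"), rows)
```

Because the keys are sorted, both a single lookup and a range scan start with a binary search instead of examining every entry.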

Page 16: Hadoop

HBase Architecture

Master: Responsible for coordinating the region servers.

Region server: Serves data for reads and writes

ZooKeeper: Coordinates the HBase cluster

Low latency and random access to data

Page 17: Hadoop

Hive

A system for managing and querying structured data built on Hadoop

SQL-like query language called HQL

Main purpose is analysis and ad hoc querying

Database/table/partition: DDL operations

Not for: small data sets, low-latency queries, OLTP

Page 18: Hadoop

Hadoop-Hive Architecture

Page 19: Hadoop

HBase-Hive configuration

HBase as ETL data sink

HBase as Data Source

Low Latency warehouse

Page 20: Hadoop

Hive and MySQL Database Structure

Page 21: Hadoop

Hadoop Limitations

Not a high-speed SQL database.

Is not a particularly simple technology.

Hadoop is not easy to connect to legacy systems.

Hadoop is not a replacement for traditional data warehouses. It is an adjunctive product to data warehouses.

Normal DBAs will need to learn new skills before they can adopt Hadoop tools.

The architecture around the data - the way you store data, the way you de-normalize data, the way you ingest data, the way you extract data - is different in Hadoop.

Linux and Java skills are critical for making a Hadoop environment a reality.

Page 22: Hadoop

Hadoop’s Capability

Hadoop is a super-powerful environment that can transform your understanding of data.

Hadoop can store vast amounts of data.

Hadoop can run queries on huge data sets.

You can archive data on Hadoop and still query it.

Hadoop allows you to ingest data at incredible speeds and analyze it and report on it in near real-time.

Hadoop massively reduces the latency of data.

Page 23: Hadoop

Hadoop: A hot skill to acquire on the IT job circuit

The market for data technologies, such as databases, is a multi-billion dollar industry.

Many start-ups are working on technology extensions to Hadoop to make it both analytical and transactional. That would be big.

Major companies have a big data strategy and want to build their businesses on top of it.

Google, whose GFS and MapReduce papers inspired Hadoop, has already moved on, suggesting that within a decade either the Hadoop framework will have to be developed beyond all recognition or something newer could be on the way to supplant it.

Every major internet company - be it Google, Twitter, LinkedIn or Facebook - uses some form of Hadoop.

Page 24: Hadoop
Page 25: Hadoop

Thanks
