
Jan 12, 2016

Transcript
Page 1:
Page 2:

Facts

• Data-intensive applications with petabytes of data

• Web pages: 20+ billion web pages x 20 KB = 400+ terabytes

• One computer can read 30-35 MB/sec from disk, so ~four months to read the web

• The same problem with 1,000 machines: < 3 hours
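A rough back-of-the-envelope check of those numbers (assuming ~400 TB of data and a sustained read rate of about 33 MB/sec, the midpoint of the range above):

400 TB ≈ 4 × 10^14 bytes
4 × 10^14 bytes ÷ 33 MB/sec ≈ 1.2 × 10^7 seconds ≈ 140 days, i.e. on the order of four months for one machine
Spread evenly over 1,000 machines: ≈ 1.2 × 10^4 seconds per machine, i.e. a few hours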

Page 3:

• Single-thread performance doesn't matter: we have large problems, and total throughput/price is more important than peak performance

• Stuff breaks, so reliability matters more: if you have one server, it may stay up three years (1,000 days); if you have 10,000 servers, expect to lose about ten a day (10,000 machines / 1,000-day lifetime ≈ 10 failures per day)

• “Ultra-reliable” hardware doesn't really help: at large scale, super-fancy reliable hardware still fails, albeit less often; software still needs to be fault-tolerant, and commodity machines without fancy hardware give better performance per price

Page 4:
Page 5:

What is Hadoop?

It is a framework for running applications on large clusters of commodity hardware, designed to store huge volumes of data and to process them.

In other words, Hadoop is a framework for the distributed processing of big data that is stored across many machines in different physical locations.

Page 6:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Page 7:

Hadoop Includes

• HDFS, a distributed file system

• Map/Reduce, a programming model implemented by Hadoop; it is an offline (batch) computing engine

Page 8:
Page 9:

Hardware failure is the norm rather than the exception.

Moving Computation is Cheaper than Moving Data

Page 10:

• HDFS runs on commodity hardware

• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware

• It provides high-throughput access to application data

• It is suitable for applications that have large data sets

Page 11:

NameNode and DataNodes

HDFS has a master/slave architecture.

• NameNode: manages the file system namespace and regulates access to files by clients

• DataNodes: usually one per node in the cluster; they manage the storage attached to the nodes they run on

• A file is split into one or more blocks

Page 12:

• These blocks are stored in a set of DataNodes

• The NameNode executes file system namespace operations like opening, closing, and renaming files and directories

• It also determines the mapping of blocks to DataNodes

• The DataNodes are responsible for serving read and write requests from the file system's clients

• The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode
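To make that division of labour concrete, here is a minimal client sketch using the standard org.apache.hadoop.fs API (the NameNode address and the file path are made-up values for illustration). The client only names a file; the NameNode resolves it to blocks and DataNode locations, and the DataNodes stream the actual bytes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Classic config key pointing at the NameNode; host and port are hypothetical
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Writing: the NameNode records the file in its namespace and chooses
            // which DataNodes hold its blocks; the client streams the data to them.
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("Hello HDFS");
            }

            // Reading: the NameNode returns the block-to-DataNode mapping and the
            // client reads the bytes directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }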

Page 13:
Page 14:
Page 15:

MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.

The framework sorts the outputs of the maps, which are then input to the reduce tasks

Typically the compute nodes and the storage nodes are the same

Page 16:

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node

The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks

The slaves execute the tasks as directed by the master

Page 17:

• Applications specify the input/output locations and supply map and reduce functions via implementations of the appropriate interfaces and/or abstract classes (a minimal driver sketch follows below)

• The Hadoop job client then submits the job and its configuration to the JobTracker

• The JobTracker assumes responsibility for distributing the software/configuration to the slaves, scheduling the tasks, monitoring them, and providing status and diagnostic information back to the client
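As an illustration of that flow, here is a minimal driver sketch in the classic org.apache.hadoop.mapred API. WordCountMap and WordCountReduce are hypothetical names for the word-count mapper and reducer sketched alongside the example on the following pages, and the input/output paths are taken from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Key/value types produced by the map and reduce functions
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Map and reduce implementations supplied by the application
            conf.setMapperClass(WordCountMap.class);
            conf.setCombinerClass(WordCountReduce.class);
            conf.setReducerClass(WordCountReduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Input/output locations specified by the application
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // The job client submits the job and configuration to the JobTracker
            JobClient.runJob(conf);
        }
    }

Packaged into a jar, a driver like this would typically be launched with Hadoop's jar runner, passing the input and output directories as the two arguments.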

Page 18:

Let's simulate the process.

MapReduce operates on <key, value> pairs: the input to the job is a set of <key, value> pairs, and the job produces a set of <key, value> pairs as its output.

Consider a simple word-count example with two input files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop
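A mapper for this example might look like the following sketch, using the classic org.apache.hadoop.mapred API (the class name WordCountMap is only a label for this illustration). It tokenizes each input line and emits a <word, 1> pair per token, producing the map outputs listed on the next page.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Input key: byte offset of the line; input value: the line of text.
        // For every token in the line, emit the pair <token, 1>.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }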

Page 19:

For the given sample input, the first map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>

The second map emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

Page 20:

After using a combiner, the output of the first map is:
< Bye, 1> < Hello, 1> < World, 2>

The output of the second map is:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>
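A reducer with the same summing logic (it can also serve as the combiner above) might look like this sketch; WordCountReduce is again just a name for the illustration. It adds up the counts received for each word and emits <word, total>, which yields the final job output shown on the next page.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Sum all counts emitted for the same word and emit <word, total>.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }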

Page 21:

Thus the output of the job is:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>

Page 22:
Page 24:

Thank You……….