Transcript
Page 1: Distributed MapReduce Team B

Presented by:
• Christian Bryan
• Matthew Dailey
• Greg Opperman
• Nate Piper
• Brett Ponsler
• Samuel Song
• Alex Ostapenko
• Keilin Bickar

Page 2: Introduction

Page 3: Functional Languages

Page 4: What makes MapReduce Special?

• Map function: Lisp – McCarthy et al., 1958
• Reduce function (paper example): summing up occurrences (sketch below)
• The combination? Behind-the-scenes action
• One user to n computers, where the only insight into n is the speed at which computation is completed
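A minimal Scheme sketch of the "summing up occurrences" idea, in the spirit of the deck's own examples; count-word and the sample data are illustrative, not from the paper:

;; 'map' marks each occurrence of the target word with a 1;
;; the reduce step then sums the marks with a fold.
(define count-word
  (lambda (target words)
    (foldl + 0
           (map (lambda (w) (if (equal? w target) 1 0))
                words))))

(count-word "the" (list "the" "map" "and" "the" "reduce"))
;; => 2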

Page 5: Example of an abstraction

;; keeps even elements as-is and doubles the odd ones
(define appendEven
  (lambda (x)
    (cond ((empty? x) empty)
          (else (begin
                  (cond ((= 0 (remainder (car x) 2))
                         (cons (car x) (appendEven (cdr x))))
                        (else (cons (* 2 (car x))
                                    (appendEven (cdr x))))))))))

;; per-element version: even → unchanged, odd → doubled
(define appendEvenMap
  (lambda (x)
    (cond ((= 0 (remainder x 2)) x)
          (else (* 2 x)))))

;; myL is not shown on the slide; this value reproduces the output below
(define myL (list 1 2 3 4 5 6 7 8 9 10 11 12))

(appendEven myL)

(map appendEvenMap myL)

Both calls produce the same list:

(list 2 2 6 4 10 6 14 8 18 10 22 12)
(list 2 2 6 4 10 6 14 8 18 10 22 12)

Page 7: Goals of Distributed System

• Transparency
• Scalable
• More fault tolerant than a standalone system

Problems when scaling:
• Monotonicity – can't retract statements
• Which computer is correct?
• Many points of failure

Page 8: Naturally Distributable

Page 9: Why?

• The 'map' and 'reduce' functions themselves.
• 'map' takes in a function and a set of data.
• That set of data is partitioned and ready to go.
• Function + Data = Convenient (see the sketch below)
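A small Scheme sketch of the convenience being described, reusing appendEvenMap from page 5; the partition sizes here are arbitrary and only illustrate that each partition could be handled by a different machine:

(define partitions
  (list (list 1 2 3 4) (list 5 6 7 8) (list 9 10 11 12)))

;; Mapping over each partition independently and appending the results
;; gives the same answer as mapping over all of the data at once.
(equal? (map appendEvenMap (apply append partitions))
        (apply append (map (lambda (p) (map appendEvenMap p)) partitions)))
;; => #t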

Page 10: More...

• 'reduce' is less convenient.
• Takes in an operation and a dataset.
• GFS helps out a lot.

Page 11: Distributing Map and Reduce

• User writes a Map function
  o (k1, v1) → list(k2, v2)
• Next, the user writes a Reduce program
  o (k2, list(v2)) → list(v2)
• A specification file defines inputs, outputs, and tuning parameters
  o Passed to the MapReduce function
• The MapReduce library handles the rest! (A word-count sketch in this style follows below.)
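A minimal Scheme sketch of the paper's word-count example in this shape. The real library is C++; word-count-map, word-count-reduce, and the pair representation here are illustrative only:

;; Map: (k1, v1) = (document name, contents) -> list of (word . 1) pairs
(define word-count-map
  (lambda (k1 v1)
    (map (lambda (w) (cons w 1))
         (string-split v1))))

;; Reduce: (k2, list(v2)) = (word, list of counts) -> list holding the total
(define word-count-reduce
  (lambda (k2 v2s)
    (list (foldl + 0 v2s))))

(word-count-map "doc1" "to be or not to be")
;; => (("to" . 1) ("be" . 1) ("or" . 1) ("not" . 1) ("to" . 1) ("be" . 1))
(word-count-reduce "to" (list 1 1))
;; => (2)

Between the two calls the library itself groups every (word . 1) pair by word; that grouping and the distribution of work are the parts the user never writes.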

Page 12:

Page 13: Productivity Improvements

• Programmers no longer have to program for the network

• Simplified library for making a program distributed; it can be reused

• Can focus on problem instead of distributed implementation of it

• Quote from Google: "Fun to use"
  o Programmers having fun are more productive

Page 14: MapReduce Performance

Page 15: Measured Performance

• ~1,800 machines with 2 GHz processors and 4 GB of RAM were used
• First test task – search through ~1 TB of data for a particular pattern
• Second test task – sort ~1 TB of data

Page 16: Test 1 (searching)

• Input split into 64 MB pieces (see the arithmetic below)
• Machines assigned until all are working at ~55 seconds
• Sources of delay: startup, opening files, locality optimization
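Rough arithmetic, assuming the ~1 TB input mentioned on page 15 and 64 MB splits:

$$ \frac{10^{12}\ \text{bytes}}{64 \times 2^{20}\ \text{bytes/split}} \approx 15{,}000\ \text{map tasks} $$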

Page 17: Test 2 (sorting)

Page 18: Criticism from the Database Systems Community

• Very old concepts used
• Poor implementation (e.g., no use of indices)
• Limited set of features (e.g., no notion of views)

Page 19: Fault Tolerance

Page 20: Worker Failure

• Master pings workers periodically
• A worker "fails" if it does not respond within a certain amount of time
• All map tasks completed or in progress by the failed worker are reset to the idle state
• They are then eligible for rescheduling (see the sketch below)
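A hypothetical Scheme sketch of the check described above, not Google's code; the record layout, the worker-timeout value, and the task states are assumptions:

;; A worker record is (id last-ping-seconds map-tasks);
;; each map task is a (task-id . state) pair.
(define worker-timeout 60)   ; assumed timeout, in seconds

(define worker-failed?
  (lambda (worker now)
    (> (- now (cadr worker)) worker-timeout)))

;; Every map task of a failed worker goes back to 'idle so the master
;; can reschedule it on another machine.
(define reset-map-tasks
  (lambda (worker)
    (map (lambda (task) (cons (car task) 'idle))
         (caddr worker))))

(define check-workers
  (lambda (workers now)
    (map (lambda (w)
           (if (worker-failed? w now)
               (list (car w) (cadr w) (reset-map-tasks w))
               w))
         workers)))

Completed map tasks are redone as well because their intermediate output lives on the failed machine's local disk, which is why completed Reduce tasks (next slide) do not need the same treatment.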

Page 21: Worker Failure

• Completed Reduce tasks are not reset because their output is stored in the global file system, not locally on the failed machine
• All workers are notified of the changes in worker assignments
• Resilient to large-scale worker failure

Page 22: Master Failure

• Periodic checkpoints
• Upon failure, a new copy starts from the last checkpoint
• Failure of the single master is unlikely
• The current implementation aborts the computation upon master failure

Page 23:

Page 24: Google Cluster Configuration

• Large clusters of commodity PCs connected together with switched Ethernet

• Typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine

• Inexpensive IDE disks attached directly to individual machines

• Commodity networking hardware is used. Typically either 100 megabits/second or 1 gigabit/second at the machine level

Page 25: Google Cluster Operation

• Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

• A distributed file system (GFS) is used to manage the data stored on the disks.

• Uses replication to provide availability and reliability on top of unreliable hardware.

Page 26: Networking

Page 27: Cost Efficiency

• April 2004, Google spent about $250 million on hardware equipment
  o Includes equipment other than CPUs, such as routers and firewalls
  o Approximately (rough consistency check below):
     63,272 machines
     126,554 CPUs
     253,088 GHz of processing power
     126,544 GB of RAM
     5,062 TB of hard drive space
     About 253 teraflops (trillion floating-point operations per second)
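A rough consistency check on these figures, assuming the dual-processor machines described on page 24:

$$ 63{,}272 \times 2 \approx 126{,}544\ \text{CPUs}, \qquad \frac{253{,}088\ \text{GHz}}{126{,}544\ \text{CPUs}} \approx 2\ \text{GHz per CPU}, \qquad \frac{126{,}544\ \text{GB}}{63{,}272\ \text{machines}} = 2\ \text{GB RAM per machine} $$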

Page 28: Cost Efficiency

• January 2005, Japan's NEC Earth Simulator supercomputer
  o $250 million
  o 41 teraflops
• Much more expensive per teraflop than a large cluster of personal computers (see the comparison below)
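A rough cost-per-teraflop comparison using the figures on these two slides:

$$ \text{Google cluster: } \frac{\$250\text{M}}{253\ \text{TFLOPS}} \approx \$1\text{M per TFLOPS} \qquad\qquad \text{Earth Simulator: } \frac{\$250\text{M}}{41\ \text{TFLOPS}} \approx \$6\text{M per TFLOPS} $$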

Page 29: Cost Efficiency

• 2003, Virginia Tech built a cluster of 1,100 Apple computers
  o Cost $5 million
  o 10 teraflops
  o 3rd most powerful at the time
  o A traditional supercomputer would have cost much more

Page 30: Cost Efficiency

• Disadvantages
  o Must deal with limited network bandwidth
  o Must constantly monitor for hardware failure

Page 31: Conclusion

Page 32: Questions?