CS 4604: Introduction to Database Management Systems
B. Aditya Prakash
Lecture #12: NoSQL and MapReduce

Source: courses.cs.vt.edu/~cs4604/Spring16/lectures/lecture-12.pdf

Jun 30, 2018


NoSQL (some slides from Xiao Yu)

Prakash 2016, VT CS 4604

Why NoSQL?

RDBMS
§ The predominant choice for storing data
– Not so true for data miners, since we mostly work with txt files
§ First formulated in 1969 by Codd
– We are using RDBMSs everywhere

Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”

When RDBMS met Web 2.0

Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when"

What to do if data is really large?
§ Petabytes (exabytes, zettabytes, ...)
§ Google processed 24 PB of data per day (2009)
§ FB adds 0.5 PB per day

BIG data

What’s Wrong with Relational DB?
§ Nothing is wrong. You just need to use the right tool.
§ Relational is hard to scale.
– Easy to scale reads
– Hard to scale writes

What’s NoSQL?
§ The misleading term “NoSQL” is short for “Not Only SQL”.
§ Non-relational, schema-free, non-(quite)-ACID
– More on ACID transactions later in class
§ Horizontally scalable, distributed, easy replication support
§ Simple API

Four (emerging) NoSQL Categories
§ Key-value (K-V) stores
– Based on Distributed Hash Tables / Amazon’s Dynamo paper*
– Data model: (global) collection of K-V pairs
– Example: Voldemort
§ Column Families
– BigTable clones**
– Data model: big table, column families
– Example: HBase, Cassandra, Hypertable

* G. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 07
** F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06

Four (emerging) NoSQL Categories (continued)
§ Document databases
– Inspired by Lotus Notes
– Data model: collections of K-V collections
– Example: CouchDB, MongoDB
§ Graph databases
– Inspired by Euler & graph theory
– Data model: nodes, relations, K-V on both
– Example: AllegroGraph, VertexDB, Neo4j

Focus of Different Data Models

Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”

C-A-P “theorem”
Diagram corners: Consistency, Availability, Partition Tolerance
Systems placed on the diagram: RDBMS, NoSQL (most)

When to use NoSQL?
§ Bigness
§ Massive write performance
– Twitter generates 7 TB per day (2010)
§ Fast key-value access
§ Flexible schema or data types
§ Schema migration
§ Write availability
– Writes need to succeed no matter what (CAP, partitioning)
§ Easier maintainability, administration and operations
§ No single point of failure
§ Generally available parallel computing
§ Programmer ease of use
§ Use the right data model for the right problem
§ Avoid hitting the wall
§ Distributed systems support
§ Tunable CAP tradeoffs

From http://highscalability.com/

Key-Value Stores

Table in relational db:

id    hair_color  age  height
1923  Red         18   6’0”
3371  Blue        34   NA
…     …           …    …

Store/Domain in Key-Value db (figure)

§ Find users whose age is above 18?
§ Find all attributes of user 1923?
§ Find users whose hair color is Red and age is 19? (Join operation)
§ Calculate average age of all grad students?
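The queries on this slide behave very differently in a key-value store. The following is a minimal sketch, assuming a plain in-memory dictionary as the store; the `get` and `scan` helpers are hypothetical names, not the API of Voldemort or any other product. A lookup by key is direct, while any query on a non-key attribute has to visit every pair.

```python
# Hypothetical in-memory key-value store holding the two rows from the slide.
store = {
    1923: {"hair_color": "Red", "age": 18, "height": "6'0\""},
    3371: {"hair_color": "Blue", "age": 34},  # height is NA, so the key is absent
}

def get(key):
    """Key lookup: the one operation a K-V store is built for."""
    return store[key]

def scan(predicate):
    """A query on a non-key attribute must visit every pair."""
    return sorted(k for k, attrs in store.items() if predicate(attrs))

attrs_1923 = get(1923)                          # all attributes of user 1923
older_than_18 = scan(lambda a: a["age"] > 18)   # users whose age is above 18
average_age = sum(a["age"] for a in store.values()) / len(store)
```

The missing `height` for user 3371 also illustrates the flexible schema: a value can hold any set of attributes.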

Voldemort in LinkedIn

Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Voldemort vs MySQL

Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Column Families – BigTable like

F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06

BigTable Data Model

The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page.
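One way to picture this data model is as a nested map from row name to column to versioned values. The sketch below assumes that layout; the helper names (`reverse_host`, `put`, `get_latest`) and all cell values are invented for illustration and are not Bigtable's API. The "family:qualifier" column names and per-cell timestamps follow the Bigtable paper.

```python
def reverse_host(host):
    """Turn 'www.cnn.com' into the row name 'com.cnn.www'."""
    return ".".join(reversed(host.split(".")))

# table[row][column][timestamp] -> value; all cell values are made up.
table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    """Return the value with the largest timestamp for one cell."""
    versions = table[row][column]
    return versions[max(versions)]

row = reverse_host("www.cnn.com")
put(row, "contents:", 1, "<html>old</html>")
put(row, "contents:", 2, "<html>new</html>")
put(row, "anchor:cnnsi.com", 1, "CNN")   # anchor text from a referring page
```

Reversing the host name keeps pages from the same domain next to each other when rows are sorted by name.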

BigTable Performance

Document Database - mongoDB
§ Initial release 2009
§ Open source, document db
§ JSON-like document with dynamic schema

Figure panels: Table in relational db; Documents in a collection
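A "JSON-like document with dynamic schema" can be sketched with plain Python dictionaries. This is an illustration only: the collection, its fields, and the `find` helper are hypothetical and do not use or imitate the MongoDB driver.

```python
import json

# A hypothetical collection: documents need not share the same fields.
collection = [
    {"_id": 1, "name": "Alice", "age": 30},
    {"_id": 2, "name": "Bob", "email": "bob@example.com", "tags": ["grad", "db"]},
]

def find(coll, **criteria):
    """Return documents whose fields equal all the given criteria."""
    return [d for d in coll if all(d.get(k) == v for k, v in criteria.items())]

matches = find(collection, name="Bob")
as_json = json.dumps(matches[0], sort_keys=True)  # documents serialize directly to JSON
```

In a relational table both rows would need the same columns; here the second document simply carries extra fields.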

mongoDB Product Deployment

And much more…

Graph Database

Data Model Abstraction:
• Nodes
• Relations
• Properties
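The three abstractions can be sketched with ordinary Python structures. The node names, the `KNOWS` relation type, and the helper functions below are all hypothetical; a real graph database such as Neo4j has its own API and query language.

```python
# Nodes and relations both carry key-value properties (all names illustrative).
nodes = {
    1: {"name": "Ann"},
    2: {"name": "Bob"},
    3: {"name": "Cat"},
}
relations = [
    (1, "KNOWS", 2, {"since": 2010}),
    (2, "KNOWS", 3, {"since": 2014}),
]

def neighbors(node_id, rel_type):
    """Follow outgoing relations of one type from a node."""
    return [dst for src, typ, dst, _props in relations
            if src == node_id and typ == rel_type]

def friends_of_friends(node_id):
    """Two-hop traversal, the kind of query graph databases target."""
    return sorted({fof for f in neighbors(node_id, "KNOWS")
                   for fof in neighbors(f, "KNOWS")} - {node_id})
```

Here `friends_of_friends(1)` walks two `KNOWS` hops without any join over a table.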

Neo4j - Build a Graph

Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”

A Debatable Performance Evaluation

Conclusion
§ Use the right data model for the right problem

THE HADOOP ECOSYSTEM


Single vs Cluster
How to analyze such large datasets? First thing, how to store them?
§ Single machine? 4 TB HDDs are coming out
§ Cluster of machines?
– How many machines?
– Need to worry about machine and drive failure. Really?
– Need data backup, redundancy, recovery, etc.

3% of 100,000 hard drives fail within the first 3 months
Source: Failure Trends in a Large Disk Drive Population, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

Hadoop
§ Open-source software for reliable, scalable, distributed computing
§ Written in Java
§ Can scale to thousands of machines
– Linear scalability (the ideal): with 2 machines, your job runs about twice as fast
§ Uses a simple programming model (MapReduce)
§ HDFS (Hadoop Distributed File System)
– Fault tolerant: can recover from machine/disk failure (no need to restart the computation)

http://hadoop.apache.org

Idea and Solution
§ Issue: Copying data over a network takes time
§ Idea:
– Bring computation close to the data
– Store files multiple times for reliability
§ Map-reduce addresses these problems
– Google’s computational/data manipulation model
– Elegant way to work with big data
– Storage infrastructure: file system
  • Google: GFS. Hadoop: HDFS
– Programming model
  • Map-Reduce

Map-Reduce [Dean and Ghemawat 2004]
§ Abstraction for simple computing
– Hides details of parallelization, fault-tolerance, data-balancing
– MUST read! http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Hadoop vs NoSQL
§ Hadoop: computing framework
– Supports data-intensive applications
– Includes MapReduce, HDFS etc. (we will study MR mainly next)
§ NoSQL: Not only SQL databases
– Can be built ON Hadoop, e.g., HBase

Storage Infrastructure
§ Problem:
– If nodes fail, how to store data persistently?
§ Answer:
– Distributed File System:
  • Provides global file namespace
  • Google GFS; Hadoop HDFS
§ Typical usage pattern
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common

Distributed File System
§ Chunk servers
– File is split into contiguous chunks
– Typically each chunk is 16-64 MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks
§ Master node
– a.k.a. NameNode in Hadoop’s HDFS
– Stores metadata about where files are stored
– Might be replicated
§ Client library for file access
– Talks to master to find chunk servers
– Connects directly to chunk servers to access data
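The chunking and replica placement described above can be sketched in a few lines. This is a toy model under stated assumptions: the rack and server names are invented, placement is a simple round-robin (real GFS/HDFS placement policies are more involved), and the number of replicas must not exceed the number of racks.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the upper end of the range on the slide
REPLICAS = 3

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) for each contiguous chunk of a file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

def place_replicas(chunk_index, racks, replicas=REPLICAS):
    """Pick one server from each of `replicas` different racks, round-robin."""
    rack_names = sorted(racks)
    chosen = []
    for r in range(replicas):
        rack = rack_names[(chunk_index + r) % len(rack_names)]
        servers = racks[rack]
        chosen.append((rack, servers[chunk_index % len(servers)]))
    return chosen

# Hypothetical cluster layout: three racks with two chunk servers each.
racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1", "c2"]}
chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
# The kind of metadata a master node would keep: chunk index -> replica locations.
metadata = {i: place_replicas(i, racks) for i in range(len(chunks))}
```

A client would ask the master for `metadata[i]` and then read the chunk directly from one of the listed servers.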

Programming Model: MapReduce
Warm-up task:
§ We have a huge text document
§ Count the number of times each distinct word appears in the file
§ Sample application:
– Analyze web server logs to find popular URLs

Task: Word Count
Case 1:
– File too large for memory, but all <word, count> pairs fit in memory
Case 2:
§ Count occurrences of words:
– words(doc.txt) | sort | uniq -c
  • where words takes a file and outputs the words in it, one per line
§ Case 2 captures the essence of MapReduce
– Great thing is that it is naturally parallelizable
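The `words | sort | uniq -c` pipeline can be mirrored step by step in Python. This is a sketch, assuming words are lowercase runs of letters and apostrophes (the slide does not define `words` precisely); `sort_uniq_c` is a made-up name for the last two pipeline stages.

```python
import re
from itertools import groupby

def words(text):
    """Emit the words of a text one at a time (the `words` step)."""
    return re.findall(r"[a-z']+", text.lower())

def sort_uniq_c(items):
    """Equivalent of `sort | uniq -c`: sort, then count each run of equal items."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted(items))]

counts = sort_uniq_c(words("the crew of the space shuttle saw the crew"))
```

The three stages line up with MapReduce: `words` plays the role of map, the sort is the group-by-key step, and counting each run is the reduce.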

MapReduce: Overview
§ Sequentially read a lot of data
§ Map:
– Extract something you care about
§ Group by key: Sort and Shuffle
§ Reduce:
– Aggregate, summarize, filter or transform
§ Write the result

Outline stays the same, Map and Reduce change to fit the problem

MapReduce: The Map Step
Diagram: input key-value pairs (k, v) → map → intermediate key-value pairs (k, v)

MapReduce: The Reduce Step
Diagram: intermediate key-value pairs (k, v) → group by key → key-value groups (k, [v, v, ...]) → reduce → output key-value pairs (k, v)

More Specifically
§ Input: a set of key-value pairs
§ Programmer specifies two methods:
– Map(k, v) → <k’, v’>*
  • Takes a key-value pair and outputs a set of key-value pairs
    – E.g., key is the filename, value is a single line in the file
  • There is one Map call for every (k, v) pair
– Reduce(k’, <v’>*) → <k’, v’’>*
  • All values v’ with same key k’ are reduced together and processed in v’ order
  • There is one Reduce function call per unique key k’
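The two signatures can be exercised with a tiny single-process driver. The sketch below is an assumption-laden stand-in for a real framework: `run_mapreduce`, `line_length_mapper`, and `total_reducer` are hypothetical names, everything runs in memory, and nothing is distributed.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """inputs: iterable of (k, v); mapper(k, v) and reducer(k2, values) yield pairs."""
    groups = defaultdict(list)
    for k, v in inputs:                  # one Map call per input pair
        for k2, v2 in mapper(k, v):
            groups[k2].append(v2)        # group by intermediate key
    output = []
    for k2 in sorted(groups):            # one Reduce call per unique key
        output.extend(reducer(k2, groups[k2]))
    return output

# Example: key is a file name, value is one line of that file.
def line_length_mapper(filename, line):
    yield filename, len(line)

def total_reducer(filename, lengths):
    yield filename, sum(lengths)

result = run_mapreduce(
    [("a.txt", "hello"), ("a.txt", "hi"), ("b.txt", "map")],
    line_length_mapper, total_reducer)
```

Only the mapper and reducer change from problem to problem; the driver stays the same.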

MapReduce: Word Counting

Big document:
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need ...

MAP (provided by the programmer): Read input and produce a set of (key, value) pairs
(The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...

Group by key: Collect all pairs with the same key
(crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) ...

Reduce (provided by the programmer): Collect all values belonging to the key and output (key, value) pairs
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...

Sequentially read the data: only sequential reads

Word Count Using MapReduce

map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
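The pseudocode above translates almost line for line into Python generators. In this sketch the `word_count` function stands in for the framework's group-by-key step, and the two sample documents are invented; splitting on whitespace is an assumption about what counts as a word.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for w in text.split():
        yield w, 1

def reduce_fn(word, counts):
    # key: a word; counts: an iterable of partial counts
    yield word, sum(counts)

def word_count(documents):
    groups = defaultdict(list)  # stand-in for the framework's group-by-key step
    for name, text in documents.items():
        for w, c in map_fn(name, text):
            groups[w].append(c)
    return dict(pair for w in groups for pair in reduce_fn(w, groups[w]))

totals = word_count({"d1": "the crew of the space shuttle", "d2": "the space"})
```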

Map-Reduce (MR) as SQL
§ SELECT word, count(*) FROM DOCUMENT GROUP BY word

Slide annotations mark which parts of the query correspond to the Mapper and which to the Reducer.

Map-Reduce: Environment
Map-Reduce environment takes care of:
§ Partitioning the input data
§ Scheduling the program’s execution across a set of machines
§ Performing the group by key step
§ Handling machine failures
§ Managing required inter-machine communication

Map-Reduce: A diagram
Big document
→ MAP: Read input and produce a set of key-value pairs
→ Group by key: Collect all pairs with the same key (Hash merge, Shuffle, Sort, Partition)
→ Reduce: Collect all values belonging to the key and output

Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work

Map-Reduce
§ Programmer specifies:
– Map and Reduce and input files
§ Workflow:
– Read inputs as a set of key-value pairs
– Map transforms input kv-pairs into a new set of k'v'-pairs
– Sorts & Shuffles the k'v'-pairs to output nodes
– All k’v’-pairs with a given k’ are sent to the same reduce
– Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
– Write the resulting pairs to files
§ All phases are distributed with many tasks doing the work

Diagram: Input 0, Input 1, Input 2 → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → Out 0, Out 1
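The rule that all pairs with a given k' reach the same reduce task is commonly implemented by hash partitioning. The sketch below assumes that approach; `partition` and `shuffle` are hypothetical helper names, and `zlib.crc32` is used because Python's built-in `hash` for strings varies between processes, which would break agreement across machines.

```python
import zlib

def partition(key, num_reducers):
    """Deterministic hash partitioning, so a given k' always maps to one reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """map_outputs: one list of (k', v') pairs per map task."""
    buckets = [dict() for _ in range(num_reducers)]
    for pairs in map_outputs:
        for k, v in pairs:
            buckets[partition(k, num_reducers)].setdefault(k, []).append(v)
    return buckets

# Three map tasks and two reduce tasks, as in the diagram.
map_outputs = [[("the", 1), ("crew", 1)], [("the", 1)], [("space", 1), ("the", 1)]]
buckets = shuffle(map_outputs, num_reducers=2)
```

Each entry of `buckets` is the grouped input of one reduce task; which keys share a bucket depends only on the hash.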

Data Flow
§ Input and final output are stored on a distributed file system (FS):
– Scheduler tries to schedule map tasks “close” to physical storage location of input data
§ Intermediate results are stored on local FS of Map and Reduce workers
§ Output is often input to another MapReduce task

Coordination: Master
§ Master node takes care of coordination:
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
§ Master pings workers periodically to detect failures

Dealing with Failures
§ Map worker failure
– Map tasks completed or in-progress at worker are reset to idle
– Reduce workers are notified when task is rescheduled on another worker
§ Reduce worker failure
– Only in-progress tasks are reset to idle
– Reduce task is restarted
§ Master failure
– MapReduce task is aborted and client is notified
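The worker-failure rules can be written as a small piece of master-side bookkeeping. This is a sketch only: the `Task` class, the worker names, and `handle_worker_failure` are hypothetical, and rescheduling and reducer notification are left out.

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

class Task:
    def __init__(self, kind, task_id):
        self.kind, self.task_id = kind, task_id   # kind is "map" or "reduce"
        self.status, self.worker = IDLE, None

def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be redone when a worker dies; return them."""
    reset = []
    for t in tasks:
        if t.worker != failed_worker:
            continue
        # Map output sits on the failed worker's local disk, so completed map
        # tasks are redone too; completed reduce tasks are left alone.
        redo = t.status == IN_PROGRESS or (t.kind == "map" and t.status == COMPLETED)
        if redo:
            t.status, t.worker = IDLE, None
            reset.append(t)
    return reset

tasks = [Task("map", 0), Task("map", 1), Task("reduce", 0), Task("reduce", 1)]
states = [(COMPLETED, "w1"), (IN_PROGRESS, "w2"), (COMPLETED, "w1"), (IN_PROGRESS, "w1")]
for t, (status, worker) in zip(tasks, states):
    t.status, t.worker = status, worker
redone = [(t.kind, t.task_id) for t in handle_worker_failure(tasks, "w1")]
```

The asymmetry follows from the data flow: intermediate map results live on the worker's local file system, while final reduce output is written to the distributed file system.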

PROBLEMS SUITED FOR MAP-REDUCE

Example: Host size
§ Suppose we have a large web corpus
§ Look at the metadata file
– Lines of the form: (URL, size, date, …)
§ For each host, find the total number of bytes
– That is, the sum of the page sizes for all URLs from that particular host
§ Other examples:
– Link analysis and graph processing
– Machine Learning algorithms
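The host-size task fits the same map and reduce pattern as word count. In the sketch below the metadata records are made up, `host_sizes` stands in for the framework, and the host is taken from the URL with the standard library's `urlparse`.

```python
from collections import defaultdict
from urllib.parse import urlparse

def map_fn(record):
    url, size, *_rest = record          # metadata line of the form (URL, size, date, ...)
    yield urlparse(url).netloc, size    # key: host, value: page size in bytes

def reduce_fn(host, sizes):
    yield host, sum(sizes)

def host_sizes(records):
    groups = defaultdict(list)          # stand-in for group-by-key
    for rec in records:
        for host, size in map_fn(rec):
            groups[host].append(size)
    return dict(p for h in groups for p in reduce_fn(h, groups[h]))

records = [  # made-up metadata for illustration
    ("http://a.example/index.html", 1200, "2016-01-01"),
    ("http://a.example/about.html", 300, "2016-01-02"),
    ("http://b.example/", 50, "2016-01-03"),
]
totals = host_sizes(records)
```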

Example: Language Model
§ Statistical machine translation:
– Need to count number of times every 5-word sequence occurs in a large corpus of documents
§ Very easy with MapReduce:
– Map:
  • Extract (5-word sequence, count) from document
– Reduce:
  • Combine the counts
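A minimal sketch of the 5-word-sequence count, assuming whitespace tokenization and two toy documents; `reduce_all` folds grouping and summing into one step for brevity, which a real framework would do per key.

```python
from collections import Counter

N = 5

def map_fn(text):
    """Emit (5-word sequence, 1) for every window of N consecutive words."""
    tokens = text.split()
    for i in range(len(tokens) - N + 1):
        yield tuple(tokens[i:i + N]), 1

def reduce_all(pairs):
    """Combine the counts per sequence (grouping and summing in one step)."""
    totals = Counter()
    for gram, count in pairs:
        totals[gram] += count
    return totals

docs = ["a b c d e f", "a b c d e"]
totals = reduce_all(p for d in docs for p in map_fn(d))
```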

Degree of graph Example
§ Find degree of every node in a graph
Example: In a friendship graph, what is the number of friends of every person:
Node 6 = 1
Node 2 = 3
Node 4 = 3
Node 1 = 2
Node 3 = 2
Node 5 = 3

Degree of each node in a graph
§ Suppose you have the edge list == a table!
Schema? Edges(from, to)

Edges: (6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4), ...

Degree of each node in a graph
§ Suppose you have the edge list == a table!
Schema? Edges(from, to)

Edges: (6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4), ...

SQL for degree list?
SELECT from, count(*) FROM Edges GROUP BY from

Degree of each node in a graph
§ So in SQL (remember): SELECT from, count(*) FROM Edges GROUP BY from
§ MapReduce?
– Mapper: emit(from, 1)
– Reducer: emit(from, count())

Edges: (6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4), ...

I.e., essentially equivalent to the ‘word-count’ example :)
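The mapper and reducer for the degree computation can be run on the edge rows shown on the slide. This sketch uses only the six visible rows (the slide's list is truncated), so the resulting counts are partial and differ from the full friendship graph; the grouping loop stands in for the framework.

```python
from collections import defaultdict

# Only the rows visible on the slide; the full edge list is truncated there.
edges = [(6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4)]

def mapper(src, dst):
    yield src, 1                 # emit(from, 1)

def reducer(node, ones):
    yield node, sum(ones)        # emit(from, count())

groups = defaultdict(list)       # stand-in for the group-by-key step
for src, dst in edges:
    for k, v in mapper(src, dst):
        groups[k].append(v)
degree = dict(p for n in groups for p in reducer(n, groups[n]))
```

Because each undirected edge appears in both directions, counting the `from` column alone gives the degree.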

In HW5
§ You will have to find the degree distribution of a network.

Conclusions
§ Hadoop is a distributed data-intensive computing framework
§ MapReduce
– Simple programming paradigm
– Surprisingly powerful (may not be suitable for all tasks though)
§ Hadoop has specialized File System, Master-Slave Architecture to scale-up

NoSQL and Hadoop
§ Hot area with several new problems
– Good for academic research
– Good for industry

= Fun AND Profit :)

POINTERS AND FURTHER READING

Implementations
§ Google
– Not available outside Google
§ Hadoop
– An open-source implementation in Java
– Uses HDFS for stable storage
– Download: http://lucene.apache.org/hadoop/
§ Aster Data
– Cluster-optimized SQL Database that also implements MapReduce

Cloud Computing
§ Ability to rent computing by the hour
– Additional services, e.g., persistent storage
§ Amazon’s “Elastic Compute Cloud” (EC2)
§ Aster Data and Hadoop can both be run on EC2

Reading
§ Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
– http://labs.google.com/papers/mapreduce.html
§ Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
– http://labs.google.com/papers/gfs.html

Resources
§ Hadoop Wiki
– Introduction
  • http://wiki.apache.org/lucene-hadoop/
– Getting Started
  • http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
– Map/Reduce Overview
  • http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
  • http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
– Eclipse Environment
  • http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
§ Javadoc
– http://lucene.apache.org/hadoop/docs/api/

Resources
§ Releases from Apache download mirrors
– http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
§ Nightly builds of source
– http://people.apache.org/dist/lucene/hadoop/nightly/
§ Source code from subversion
– http://lucene.apache.org/hadoop/version_control.html

Further Reading
§ Programming model inspired by functional language primitives
§ Partitioning/shuffling similar to many large-scale sorting systems
– NOW-Sort ['97]
§ Re-execution for fault tolerance
– BAD-FS ['04] and TACC ['97]
§ Locality optimization has parallels with Active Disks/Diamond work
– Active Disks ['01], Diamond ['04]
§ Backup tasks similar to Eager Scheduling in Charlotte system
– Charlotte ['96]
§ Dynamic load balancing solves similar problem as River's distributed queues
– River ['99]