MapReduce & Pig & Spark
Ioanna Miliou, Giuseppe Attardi
Advanced Programming, Università di Pisa
Hadoop
• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
• Framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• It is designed to detect and handle failures at the application layer.
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
Hadoop
• The project includes these modules:
– Hadoop Common: the common utilities that support the other Hadoop modules.
– Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
– Hadoop YARN: a framework for job scheduling and cluster resource management.
– Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Hadoop
• Other Hadoop-related projects at Apache include:
– Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop.
– Avro: a data serialization system.
– Cassandra: a scalable multi-master database with no single points of failure.
– Chukwa: a data collection system for managing large distributed systems.
– HBase: a scalable, distributed database that supports structured data storage for large tables.
– Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
– Mahout: a scalable machine learning and data mining library.
– Tez: a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
– ZooKeeper: a high-performance coordination service for distributed applications.
Hadoop
– Pig: a high-level data-flow language and execution framework for parallel computation.
– Spark: a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Hadoop Stack

What is MapReduce?
• MapReduce is the heart of Hadoop®.
• Programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Proposed by Dean and Ghemawat at Google
What is it?
• Processing engine of Hadoop
• Used for big data batch processing
• Parallel processing of huge data volumes
• Fault tolerant
• Scalable
Why use it?
• Your data is in the terabyte/petabyte range
• You have huge I/O
• The Hadoop framework takes care of
– job and task management
– failures
– storage
– replication
You just write Map and Reduce jobs
Big Users
• Users
– Facebook
– Yahoo
– Amazon
– eBay
• Providers
– Amazon
– Cloudera
– Hortonworks
– MapR
Map & Reduce
The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.
1. The map job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
map(k1, v1) → list(k2, v2)
2. The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
reduce(k2, list(v2)) → list(v2)
As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
Typical Problem Solved by MapReduce
• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results

Input Data → Map | Map | Map | Map → Shuffle → Reduce | Reduce → Results
Example: Word Count in Web Pages
A typical exercise for a new engineer in his or her first week
• Input is files with one document per record
• Specify a map function that takes a key/value pair
  key = document URL
  value = document contents
• Output of the map function is (potentially many) key/value pairs. In our case, output (word, "1") once per word in the document

"document1", "Apple Orange Mango Orange Grapes Plum"
→
"Apple", "1"
"Orange", "1"
"Mango", "1"
…
Example continued: Word Count in Web Pages
• The MapReduce library gathers together all pairs with the same key (shuffle/sort)
• The reduce function combines the values for a key. In our case, compute the sum
• The output of reduce is paired with the key and saved

key = "Apple"   values = "1"
key = "Mango"   values = "1"
key = "Orange"  values = "1", "1"
key = "Plum"    values = "1"
key = "Grapes"  values = "1"
→
"Apple", "1"
"Orange", "2"
"Mango", "1"
"Grapes", "1"
"Plum", "1"
Example Pseudo-code
map()   reduce()
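The pseudo-code listing from the original slide is not reproduced in this extract; the following is a minimal Python sketch of the word-count map() and reduce() functions, with the shuffle/sort phase simulated by grouping map outputs on their keys (all names here are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def map_fn(url, contents):
    """map: emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """reduce: sum all partial counts for one word."""
    return (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    """Simulate the shuffle/sort phase: group map outputs by key."""
    groups = defaultdict(list)
    for key, value in documents.items():
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = {"document1": "Apple Orange Mango Orange Grapes Plum"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'Apple': 1, 'Grapes': 1, 'Mango': 1, 'Orange': 2, 'Plum': 1}
```

This mirrors the data flow of the slides above: map emits one pair per word, the framework groups by key, and reduce sums each group.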
MapReduce Wrappers
Wrappers have been developed in order to:
• provide better control over the MapReduce code
• aid in source code development
Some well-known examples:
• Sawzall (Google)
• Pig (originally Yahoo, now Apache)
• Hive (Facebook)
• DryadLINQ (Microsoft)
Widely applicable at Google
• Implemented as a C++ library linked to user programs
• Can read and write many different data types
Example uses:
Example: Generating Language Model Statistics
• Used in the statistical machine translation system
o need to count the number of times every 5-word sequence occurs in a large corpus of documents (and keep all those where count >= 4)
• Easy with MapReduce:
o map: extract 5-word sequences ⇒ count from document
o reduce: combine counts, and keep if count is large enough
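A sketch of this job in plain Python (the tiny corpus and the Counter-based grouping are illustrative stand-ins for the distributed shuffle):

```python
from collections import Counter

def map_5grams(doc_text):
    """map: emit every 5-word sequence found in a document."""
    words = doc_text.split()
    for i in range(len(words) - 4):
        yield " ".join(words[i:i + 5])

def reduce_keep_frequent(counts, threshold=4):
    """reduce: combine counts, keep sequences with count >= threshold."""
    return {seq: n for seq, n in counts.items() if n >= threshold}

corpus = ["the cat sat on the mat"] * 4 + ["a different sentence entirely here"]
totals = Counter()
for doc in corpus:                      # shuffle/sort simulated by Counter
    totals.update(map_5grams(doc))
frequent = reduce_keep_frequent(totals)
```

Only the two 5-grams that occur 4 times survive the threshold; the rare one from the last document is dropped.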
Example: Joining with Other Data
• Example: generate a per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
o per-host information might be in a per-process data structure, or might involve an RPC to a set of machines containing data for all sites
• Easy with MapReduce:
o map: extract host name from URL, look up per-host info, combine with per-doc data and emit
o reduce: identity function (just emit key/value directly)
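A minimal Python sketch of this join pattern (the host_info table and field names are hypothetical; in the real job the per-host lookup might be an RPC, as noted above):

```python
def map_join(url, doc_summary, host_info):
    """map: extract the host name from the URL, look up the per-host
    info, combine it with the per-doc data and emit a key/value pair."""
    host = url.split("/")[2]            # 'http://example.com/a' -> 'example.com'
    return url, {**doc_summary, "host_info": host_info.get(host)}

def reduce_identity(key, value):
    """reduce: identity function -- just emit the key/value directly."""
    return key, value

# hypothetical per-host table, keyed by host name
host_info = {"example.com": {"pages": 120, "top_terms": ["hadoop", "pig"]}}
key, value = reduce_identity(*map_join("http://example.com/a",
                                       {"title": "A"}, host_info))
```

All the work happens in map; reduce passes records through unchanged, which is why an identity reducer suffices here.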
MapReduce: Scheduling
• One master, many workers
o Input data split into M map tasks (typically 64 MB in size)
o Reduce phase partitioned into R reduce tasks
o Tasks are assigned to workers dynamically
o Often: M = 200,000, R = 4,000, workers = 2,000
• Master assigns each map task to a free worker
o Considers locality of data to worker when assigning a task
o Worker reads task input (often from local disk)
o Worker produces R local files containing intermediate k/v pairs
• Master assigns each reduce task to a free worker
o Worker reads intermediate k/v pairs from map workers
o Worker sorts & applies user's Reduce op to produce the output
Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines
• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
Often use 200,000 map / 5,000 reduce tasks with 2,000 machines
Fault tolerance: handled via re-execution
• On worker failure:
o Detect failure via periodic heartbeats
o Re-execute completed and in-progress map tasks
o Re-execute in-progress reduce tasks
o Task completion committed through master
• Master failure:
o State is checkpointed: new master recovers & continues
Robust: once Google lost 1,600 of 1,800 machines, but the job finished fine
Refinement: Backup Tasks
• Slow workers significantly lengthen completion time
o Other jobs consuming resources on the machine
o Bad disks with soft errors transfer data very slowly
o Weird things: processor caches disabled (!!)
• Solution: near the end of a phase, spawn backup copies of tasks
o Whichever one finishes first "wins"
• Effect: dramatically shortens job completion time
Refinement: Locality Optimization
Master scheduling policy:
• Asks for locations of replicas of input file blocks
• Map tasks typically split into 64 MB blocks
• Map tasks scheduled so input block replicas are on the same machine or the same rack
Effect: thousands of machines read input at local disk speed
• Without this, rack switches limit the read rate
Refinement: Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs
• Best solution is to debug & fix, but not always possible
On a seg fault:
• Send a UDP packet to the master from the signal handler
• Include the sequence number of the record being processed
If the master sees K failures for the same record (typically K set to 2 or 3):
• The next worker is told to skip the record
Effect: can work around bugs in third-party libraries
Other Refinements
• Optional secondary keys for ordering
• Compression of intermediate data
• Combiner: useful for saving network bandwidth
• Local execution for debugging/testing
• User-defined counters
“Play around”
• Amazon Elastic MapReduce (Amazon EMR)
• Hortonworks Sandbox
• MapR Sandbox for Hadoop
• Qubole
• Microsoft Azure HDInsight
• Cloudera
MapReduce examples in Java
Serializable vs Writable
• Serializable stores the class name and the object representation to the stream; other instances of the class are referred to by a handle to the class name: this approach is not usable with random access
• For the same reason, the sorting needed for the shuffle-and-sort phase cannot be used with Serializable
• The deserialization process creates a new instance of the object, while Hadoop needs to reuse objects to minimize computation
• Hadoop introduces the two interfaces Writable and WritableComparable that solve these problems
Writable wrappers

Implementing Writable: the SumCount class

Glossary
WordCount
• http://www.gutenberg.org/cache/epub/201/pg201.txt
• Input Data: the text of the book “Flatland” by Edwin Abbott
WordCount mapper

WordCount reducer

WordCount results
TopN: we want to find the top-N most used words of a text file
• http://www.gutenberg.org/cache/epub/201/pg201.txt
• Input Data: the text of the book “Flatland” by Edwin Abbott
TopN mapper

TopN reducer

TopN results
MEAN: we want to find the mean max temperature for every month
• http://archivio-meteo.distile.it/tabelle-dati-archivio-meteo/
• Input Data: temperatures in Milan (DD/MM/YYYY, MIN, MAX)
  02012015, -2, 7
  03012015, -1, 8
  04012015, 1, 16
  …
  29012015, 0, 5
  30012015, 0, 9
  31012015, -3, 6
Mean mapper

Mean reducer

Mean results
TODO: k-means clustering algorithm
• We want to aggregate 2D points in clusters using the k-means algorithm
• Input data: a random set of points
  2.2705 0.9178
  1.8600 2.1002
  2.0915 1.3679
  -0.1612 0.8481
  …
k-means algorithm
Input: data points D, number of clusters k
1. initialize k centroids randomly
2. associate each data point in D with the nearest centroid. This will divide the data points into k clusters.
3. recalculate the position of the centroids.
Repeat steps 2 and 3 until there are no more changes in the membership of the data points.
Output: data points with cluster memberships
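The steps above can be sketched in plain Python (a sequential, single-machine sketch using Euclidean distance, not the distributed version; point values and starting centroids are illustrative):

```python
import math

def closest(point, centroids):
    """Step 2: index of the nearest centroid."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids):
    """Repeat steps 2 and 3 until memberships stop changing."""
    membership = None
    while True:
        new_membership = [closest(p, centroids) for p in points]
        if new_membership == membership:
            return centroids, membership
        membership = new_membership
        # step 3: recompute each centroid as the mean of its cluster
        for c in range(len(centroids)):
            cluster = [p for p, m in zip(points, membership) if m == c]
            if cluster:
                centroids[c] = tuple(sum(xs) / len(cluster)
                                     for xs in zip(*cluster))

points = [(2.27, 0.92), (1.86, 2.10), (2.09, 1.37), (-0.16, 0.85)]
centroids, membership = kmeans(points, [(2.0, 1.0), (0.0, 1.0)])
```

The loop terminates exactly on the slide's condition: when no data point changes cluster between two iterations.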
MapReduce examples in Python
WordCount using mrjob
“a”, 936
“ab”, 6
“abbot”, 3
“abbott”, 2
“abbreviated”, 1
…
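The mrjob listing itself is not reproduced in this extract; the job it describes is equivalent to the following plain-Python sketch (the lowercasing and the word regex are assumptions chosen to match output such as “a”, 936):

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")

def wordcount(lines):
    """Count words the way a WordCount job would:
    the mapper yields (word, 1), the reducer sums per word."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts

text = ["Flatland: A Romance of Many Dimensions", "a plane of Flatland"]
counts = wordcount(text)
```

In the real mrjob version the two steps run as separate mapper and reducer methods of an MRJob subclass; the Counter here stands in for the shuffle plus the summing reducer.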
Product Recommendations
• Goal: for each product a client buys, generate a ‘people who bought this also bought this’ recommendation
• Input Data: product_id_1, product_id_2
Coincident Purchase Frequency

Top Recommendations
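The mapper and reducer for this example are not reproduced here; a plain-Python sketch of the two stages, counting co-purchase frequencies and then ranking them per product (the pair data is illustrative):

```python
from collections import defaultdict

def co_purchase_counts(pairs):
    """Count how often product b is bought together with product a
    (input lines are product_id_1, product_id_2)."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in pairs:
        counts[a][b] += 1
        counts[b][a] += 1
    return counts

def top_recommendations(counts, product, n=3):
    """'People who bought this also bought': most frequent co-purchases."""
    ranked = sorted(counts[product].items(), key=lambda kv: -kv[1])
    return [p for p, _ in ranked[:n]]

pairs = [("p1", "p2"), ("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
counts = co_purchase_counts(pairs)
recs = top_recommendations(counts, "p1")
```

The first function plays the role of the coincident-purchase-frequency job, the second that of the top-recommendations job.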
But… Suppose you have:
• user data in one file,
• website data in another,
and you need to find
• the top 5 most visited pages by users aged 18–25.
In MapReduce

In Pig Latin
What is Apache Pig?
Idea: a MapReduce program essentially performs a group-by-aggregation in parallel over a cluster of machines.
• Pig is a high-level platform for creating MapReduce programs used with Hadoop.
• The language for this platform is called Pig Latin. It combines high-level declarative querying in the spirit of SQL, and low-level, procedural programming à la MapReduce.
Developed at Yahoo
Pig
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
• At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
Pig Latin
Pig Latin has the following key properties:
• Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
• Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
• Extensibility. Users can create their own functions to do special-purpose processing.
Performance
Pig Highlights
• User-defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM)
• UDFs can be written to take advantage of the combiner
• Four join implementations built in: hash, fragment-replicate, merge, skewed
• Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
• Order by provides total ordering across reducers in a balanced way
• Writing load and store functions is easy once an InputFormat and an OutputFormat exist
• Piggybank, a collection of user-contributed UDFs
Who uses Pig for what?
• 70% of production jobs at Yahoo (10k jobs per day)
• Also used by Twitter, LinkedIn, eBay, AOL, …
• Used to
– process web logs
– build user behavior models
– process images
– build maps of the web
– do research on raw data sets
Components
• Pig resides on the user machine; the job executes on the Hadoop cluster.
• No need to install anything extra on your Hadoop cluster.
So, why Pig?
• Faster development
– fewer lines of code
– don't re-invent the wheel
• Flexible
– metadata is optional
– extensible
– procedural programming
But…
• Do you need your program to run faster?
• Does your analytic job run for hours?
Limitations of MapReduce
One of the major drawbacks of MapReduce is its inefficiency in running iterative algorithms.
MapReduce is not designed for iterative processes: after each iteration, the results have to be written to disk to pass them on to the next iteration.
⇒ degradation of performance
Limitations of Pig
Pig uses batch-oriented frameworks, which means your analytics jobs will run for many minutes or hours.
Spark is faster!
What is Apache Spark?
• A fast and general compute engine for large-scale data processing.
• The major feature: the ability to perform in-memory computation (the data can be cached in memory).
• Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Developed at the University of California at Berkeley
Spark
• It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
• For certain tasks, it has been measured to be up to 100x faster (data in memory) or 10x faster (data on disk) than Hadoop MapReduce.
• It can run on the Hadoop YARN resource manager and can read data from HDFS.
Spark
• Designed to be used with a range of programming languages and on a variety of architectures.
• Increasingly popular with a wide range of developers, thanks to its speed, simplicity, and broad support for existing development environments and storage systems.
• Relatively accessible to those learning to work with it for the first time.
• Its community is one of Apache's largest and most vibrant, with over 500 contributors from more than 200 organizations responsible for code in the software release.
Why?
• Spark was developed to overcome MapReduce's shortcoming of not being optimized for iterative algorithms and interactive data analysis, which perform cyclic operations on the same set of data.
• Spark overcomes this problem by providing a new storage primitive called Resilient Distributed Datasets (RDDs).
Resilient Distributed Datasets (RDDs)
The Resilient Distributed Dataset is a concept at the heart of Spark. It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient.
• Fault tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data.
• Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes.
Once data is loaded into an RDD, two basic types of operations can be carried out:
• Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more;
• Actions, such as counts, which measure but do not change the original data.
WordCount in Spark

Another example: logistic regression
A common machine learning algorithm for classifying objects such as, say, spam vs. non-spam emails.
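The Spark listing is not reproduced here; the algorithm it implements is batch gradient descent, which rescans the same data set on every iteration — exactly the access pattern that benefits from Spark's in-memory caching. A minimal plain-Python sketch (the toy one-feature data and hyperparameters are illustrative):

```python
import math

def logistic_regression(data, iterations=100, lr=0.1):
    """Batch gradient descent for binary classification.
    data: list of (features, label) with label in {0, 1}.
    Every iteration rescans the full data set -- the access
    pattern Spark speeds up by caching the data in memory."""
    w = [0.0] * len(data[0][0])
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

# toy "spam" data: one feature, positive value => label 1
data = [([1.0], 1), ([2.0], 1), ([-1.0], 0), ([-2.0], 0)]
w = logistic_regression(data)
```

In the Spark version, the inner pass over the data becomes a map over a cached RDD followed by a reduce that sums the gradients.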
Pig vs Spark
• Pig
– The best data-loading tool available inside Hadoop.
– Uses a scripting language called Pig Latin, which is more workflow driven.
– You don't need to be an expert Java programmer, but you need a few coding skills.
– Also an abstraction layer on top of MapReduce.
– Simple to write and control.
• Spark
– Pretty much the successor to MapReduce in Hadoop, with an emphasis on in-memory computing.
– You'll need to be a pretty good Java programmer to use this.
– Much lower level.
How to choose a platform?
• The decision to choose a particular platform for a certain application usually depends on the following important factors:
– data size
– speed or throughput optimization
– model development (training/applying a model)
Example: k-means clustering algorithm
The k-means algorithm is used for providing more insight into the analytics algorithms on different platforms.
Characteristics:
• popular and widely used
• iterative nature
• compute-intensive task (calculating the centroids)
• aggregation of the local results to obtain a global solution
k-means algorithm
Input: data points D, number of clusters k
1. initialize k centroids randomly
2. associate each data point in D with the nearest centroid. This will divide the data points into k clusters.
3. recalculate the position of the centroids.
Repeat steps 2 and 3 until there are no more changes in the membership of the data points.
Output: data points with cluster memberships
k-means on MapReduce

Map
Input: data points D, number of clusters k, and the centroids
1. for each data point d ∈ D do
2.   assign d to the closest centroid
Output: centroids with associated data points

Reduce
Input: centroids with associated data points
1. compute the new centroids by calculating the average of the data points in each cluster
2. write the global centroids to disk
Output: new centroids
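The two phases above can be sketched as Python functions (a single-iteration, single-machine simulation; the shuffle is represented by grouping points under their centroid index, and the example points are illustrative):

```python
import math
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map phase: assign each data point to its closest centroid."""
    assigned = defaultdict(list)
    for p in points:
        c = min(range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]))
        assigned[c].append(p)
    return assigned

def kmeans_reduce(assigned):
    """Reduce phase: each new centroid is the average of its points."""
    return {c: tuple(sum(xs) / len(pts) for xs in zip(*pts))
            for c, pts in assigned.items()}

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
new_centroids = kmeans_reduce(kmeans_map(points, [(1.0, 1.0), (9.0, 1.0)]))
```

A full MapReduce job would run this map/reduce pair once per iteration, writing the new centroids to disk between iterations; that per-iteration disk write is the overhead the Spark version removes.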
k-means on Pig Latin
REGISTER udf.jar
DEFINE find_centroid FindCentroid('$centroids');
points = LOAD 'points.txt' AS (id:int, pos:double);
centroided = FOREACH points GENERATE pos, find_centroid(pos) AS centroid;
grouped = GROUP centroided BY centroid;
result = FOREACH grouped GENERATE group, AVG(centroided.pos);
STORE result INTO 'output';
k-means on Spark
Similar to the MapReduce-based implementation
• Instead of writing the global centroids to disk, they are kept in memory, which speeds up the processing and reduces the disk I/O overhead.
• The data will be loaded into system memory in order to provide faster access.
References
• Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of Operating Systems Design and Implementation (OSDI), San Francisco, CA, 137–150. 2004.
• Hadoop: open-source implementation of MapReduce. http://lucene.apache.org/hadoop/
• C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proceedings of SIGMOD '08. 2008.
• D. Singh and C. K. Reddy. A Survey on Platforms for Big Data Analytics. Journal of Big Data. 2014.
• M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. USENIX NSDI. 2012.