Spark for Python Developers
Table of Contents
Spark for Python Developers
Credits
About the Author
Acknowledgment
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Setting Up a Spark Virtual Environment
Understanding the architecture of data-intensive applications
Infrastructure layer
Persistence layer
Integration layer
Analytics layer
Engagement layer
Understanding Spark
Spark libraries
PySpark in action
The Resilient Distributed Dataset
Understanding Anaconda
Setting up the Spark powered environment
Setting up an Oracle VirtualBox with Ubuntu
Installing Anaconda with Python 2.7
Installing Java 8
Installing Spark
Enabling IPython Notebook
Building our first app with PySpark
Virtualizing the environment with Vagrant
Moving to the cloud
Deploying apps in Amazon Web Services
Virtualizing the environment with Docker
Summary
2. Building Batch and Streaming Apps with Spark
Architecting data-intensive apps
Processing data at rest
Processing data in motion
Exploring data interactively
Connecting to social networks
Getting Twitter data
Getting GitHub data
Getting Meetup data
Analyzing the data
Discovering the anatomy of tweets
Exploring the GitHub world
Understanding the community through Meetup
Previewing our app
Summary
3. Juggling Data with Spark
Revisiting the data-intensive app architecture
Serializing and deserializing data
Harvesting and storing data
Persisting data in CSV
Persisting data in JSON
Setting up MongoDB
Installing the MongoDB server and client
Running the MongoDB server
Running the Mongo client
Installing the PyMongo driver
Creating the Python client for MongoDB
Harvesting data from Twitter
Exploring data using Blaze
Transferring data using Odo
Exploring data using Spark SQL
Understanding Spark dataframes
Understanding the Spark SQL query optimizer
Loading and processing CSV files with Spark SQL
Querying MongoDB from Spark SQL
Summary
4. Learning from Data Using Spark
Contextualizing Spark MLlib in the app architecture
Classifying Spark MLlib algorithms
Supervised and unsupervised learning
Additional learning algorithms
Spark MLlib data types
Machine learning workflows and data flows
Supervised machine learning workflows
Unsupervised machine learning workflows
Clustering the Twitter dataset
Applying Scikit-Learn on the Twitter dataset
Preprocessing the dataset
Running the clustering algorithm
Evaluating the model and the results
Building machine learning pipelines
Summary
5. Streaming Live Data with Spark
Laying the foundations of streaming architecture
Spark Streaming inner working
Going under the hood of Spark Streaming
Building in fault tolerance
Processing live data with TCP sockets
Setting up TCP sockets
Processing live data
Manipulating Twitter data in real time
Processing Tweets in real time from the Twitter firehose
Building a reliable and scalable streaming app
Setting up Kafka
Installing and testing Kafka
Developing producers
Developing consumers
Developing a Spark Streaming consumer for Kafka
Exploring Flume
Developing data pipelines with Flume, Kafka, and Spark
Closing remarks on the Lambda and Kappa architecture
Understanding Lambda architecture
Understanding Kappa architecture
Summary
6. Visualizing Insights and Trends
Revisiting the data-intensive apps architecture
Preprocessing the data for visualization
Gauging words, moods, and memes at a glance
Setting up word cloud
Creating word clouds
Geo-locating tweets and mapping meetups
Geo-locating tweets
Displaying upcoming meetups on Google Maps
Summary
Index
Spark for Python Developers
Spark for Python Developers
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2015
Production reference: 1171215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-969-6
www.packtpub.com
Credits
Author
Amit Nandi
Reviewers
Manuel Ignacio Franco Galeano
Rahul Kavale
Daniel Lemire
Chet Mancini
Laurence Welch
Commissioning Editor
Amarabha Banerjee
Acquisition Editor
Sonali Vernekar
Content Development Editor
Merint Thomas Mathew
Technical Editor
Naveenkumar Jain
Copy Editor
Roshni Banerjee
Project Coordinator
Suzanne Coutinho
Proofreader
Safis Editing
Indexer
Priya Sane
Graphics
Kirk D'Penha
Production Coordinator
Shantanu N. Zagade
Cover Work
Shantanu N. Zagade
About the Author
Amit Nandi studied physics at the Free University of Brussels in Belgium, where he did his research on computer generated holograms. Computer generated holograms are the key components of an optical computer, which is powered by photons running at the speed of light. He then worked with the university Cray supercomputer, sending batch jobs of programs written in Fortran. This gave him a taste for computing, which kept growing. He has worked extensively on large business reengineering initiatives, using SAP as the main enabler. He focused for the last 15 years on start-ups in the data space, pioneering new areas of the information technology landscape. He is currently focusing on large-scale data-intensive applications as an enterprise architect, data engineer, and software developer. He understands and speaks seven human languages. Although Python is his computer language of choice, he aims to be able to write fluently in seven computer languages too.
Acknowledgment
I want to express my profound gratitude to my parents for their unconditional love and strong support in all my endeavors.
This book arose from an initial discussion with Richard Gall, an acquisition editor at Packt Publishing. Without this initial discussion, this book would never have happened. So, I am grateful to him. The follow-ups on discussions and the contractual terms were agreed with Rebecca Youe. I would like to thank her for her support. I would also like to thank Merint Mathew, a content editor who helped me bring this book to the finish line. I am thankful to Merint for his subtle persistence and tactful support during the write-ups and revisions of this book.
We are standing on the shoulders of giants. I want to acknowledge some of the giants who helped me shape my thinking. I want to recognize the beauty, elegance, and power of Python as envisioned by Guido van Rossum. My respectful gratitude goes to Matei Zaharia and the team at Berkeley AMPLab and Databricks for developing a new approach to computing with Spark and Mesos. Travis Oliphant, Peter Wang, and the team at Continuum.io are doing a tremendous job of keeping Python relevant in a fast-changing computing landscape. Thank you to you all.
About the Reviewers
Manuel Ignacio Franco Galeano is a software developer from Colombia. He holds a computer science degree from the University of Quindío. At the moment of publication of this book, he was studying to get his MSc in computer science from University College Dublin, Ireland. He has a wide range of interests that include distributed systems, machine learning, microservices, and so on. He is looking for a way to apply machine learning techniques to audio data in order to help people learn more about music.
Rahul Kavale works as a software developer at TinyOwl Ltd. He is interested in multiple technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java, and has worked on Apache Spark, Apache Storm, Apache Kafka, Hadoop, and Hive. He enjoys writing Scala. Functional programming and distributed computing are his areas of interest. He has been using Spark since its early stage for varying use cases. He has also helped with the review for the Pragmatic Scala book.
Daniel Lemire has a BSc and MSc in mathematics from the University of Toronto and a PhD in engineering mathematics from the Ecole Polytechnique and the Université de Montréal. He is a professor of computer science at the Université du Québec. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles. He has held competitive research grants for the last 15 years. He has been an expert on several committees with funding agencies (NSERC and FQRNT). He has served as a program committee member on leading computer science conferences (for example, ACM CIKM, ACM WSDM, ACM SIGIR, and ACM RecSys). His open source software has been used by major corporations such as Google and Facebook. His research interests include databases, information retrieval and high-performance programming. He blogs regularly on computer science at http://lemire.me/blog/.
Chet Mancini is a data engineer at Intent Media, Inc in New York, where he works with the data science team to store and process terabytes of web travel data to build predictive models of shopper behavior. He enjoys functional programming, immutable data structures, and machine learning. He writes and speaks on topics surrounding data engineering and information architecture.
He is a contributor to Apache Spark and other libraries in the Spark ecosystem. Chet has a master's degree in computer science from Cornell University.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
Spark for Python Developers aims to combine the elegance and flexibility of Python with the power and versatility of Apache Spark. Spark is written in Scala and runs on the Java virtual machine. It is nevertheless polyglot and offers bindings and APIs for Java, Scala, Python, and R. Python is a well-designed language with an extensive set of specialized libraries. This book looks at PySpark within the PyData ecosystem. Some of the prominent PyData libraries include Pandas, Blaze, Scikit-Learn, Matplotlib, Seaborn, and Bokeh. These libraries are open source. They are developed, used, and maintained by the data scientist and Python developers community. PySpark integrates well with the PyData ecosystem, as endorsed by the Anaconda Python distribution. The book puts forward a journey to build data-intensive apps along with an architectural blueprint that covers the following steps: first, set up the base infrastructure with Spark. Second, acquire, collect, process, and store the data. Third, gain insights from the collected data. Fourth, stream live data and process it in real time. Finally, visualize the information.
The objective of the book is to learn about PySpark and PyData libraries by building apps that analyze the Spark community's interactions on social networks. The focus is on Twitter data.
What this book covers
Chapter 1, Setting Up a Spark Virtual Environment, covers how to create a segregated virtual machine as our sandbox or development environment to experiment with Spark and PyData libraries. It covers how to install Spark and the Python Anaconda distribution, which includes PyData libraries. Along the way, we explain the key Spark concepts, the Python Anaconda ecosystem, and build a Spark word count app.
Chapter 2, Building Batch and Streaming Apps with Spark, lays the foundation of the Data Intensive Apps Architecture. It describes the five layers of the apps architecture blueprint: infrastructure, persistence, integration, analytics, and engagement. We establish API connections with three social networks: Twitter, GitHub, and Meetup. This chapter provides the tools to connect to these three nontrivial APIs so that you can create your own data mashups at a later stage.
Chapter 3, Juggling Data with Spark, covers how to harvest data from Twitter and process it using Pandas, Blaze, and Spark SQL with their respective implementations of the dataframe data structure. We proceed with further investigations and techniques using Spark SQL, leveraging the Spark dataframe data structure.
Chapter 4, Learning from Data Using Spark, gives an overview of the ever-expanding library of algorithms of Spark MLlib. It covers supervised and unsupervised learning, recommender systems, optimization, and feature extraction algorithms. We put the Twitter harvested dataset through Python Scikit-Learn and Spark MLlib K-means clustering in order to segregate the Apache Spark relevant tweets.
Chapter 5, Streaming Live Data with Spark, lays down the foundation of streaming architecture apps and describes their challenges, constraints, and benefits. We illustrate the streaming concepts with TCP sockets, followed by live tweet ingestion and processing directly from the Twitter firehose. We also describe Flume, a reliable, flexible, and scalable data ingestion and transport pipeline system. The combination of Flume, Kafka, and Spark delivers unparalleled robustness, speed, and agility in an ever-changing landscape. We end the chapter with some remarks and observations on two streaming architectural paradigms, the Lambda and Kappa architectures.
Chapter 6, Visualizing Insights and Trends, focuses on a few key visualization techniques. It covers how to build word clouds and expose their intuitive power to reveal a lot of the key words, moods, and memes carried through thousands of tweets. We then focus on interactive mapping visualizations using Bokeh. We build a world map from the ground up and create a scatter plot of critical tweets. Our final visualization is to overlay an actual Google map of London, highlighting upcoming meetups and their respective topics.
What you need for this book
You need inquisitiveness, perseverance, and passion for data, software engineering, application architecture and scalability, and beautiful succinct visualizations. The scope is broad and wide.
You need a good understanding of Python or a similar language with object-oriented and functional programming capabilities. Preliminary experience of data wrangling with Python, R, or any similar tool is helpful.
You need to appreciate how to conceive, build, and scale data applications.
Who this book is for
The target audience includes the following:
Data scientists are the primary interested parties. This book will help you unleash the power of Spark and leverage your Python, R, and machine learning background.
Software developers with a focus on Python will readily expand their skills to create data-intensive apps using Spark as a processing engine and Python visualization libraries and web frameworks.
Data architects who can create rapid data pipelines and build the famous Lambda architecture that encompasses batch and streaming processing to render insights on data in real time, using the Spark and Python rich ecosystem, will also benefit from this book.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Launch PySpark with IPYNB in directory examples/AN_Spark where the Jupyter or IPython Notebooks are stored".
A block of code is set as follows:
# Word count on 1st Chapter of the Book using PySpark
# import regex module
import re
# import add from operator module
from operator import add
# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py20150315')
Any command-line input or output is written as follows:
# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "After installing VirtualBox, let's open the Oracle VM VirtualBox Manager and click the New button."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Setting Up a Spark Virtual Environment
In this chapter, we will build an isolated virtual environment for development purposes. The environment will be powered by Spark and the PyData libraries provided by the Python Anaconda distribution. These libraries include Pandas, Scikit-Learn, Blaze, Matplotlib, Seaborn, and Bokeh. We will perform the following activities:
Setting up the development environment using the Anaconda Python distribution. This will include enabling the IPython Notebook environment powered by PySpark for our data exploration tasks.
Installing and enabling Spark, and the PyData libraries such as Pandas, Scikit-Learn, Blaze, Matplotlib, and Bokeh.
Building a word count example app to ensure that everything is working fine.
The last decade has seen the rise and dominance of data-driven behemoths such as Amazon, Google, Twitter, LinkedIn, and Facebook. These corporations, by seeding, sharing, or disclosing their infrastructure concepts, software practices, and data processing frameworks, have fostered a vibrant open source software community. This has transformed the enterprise technology, systems, and software architecture.
This includes new infrastructure and DevOps (short for development and operations) concepts, leveraging virtualization, cloud technology, and software-defined networks.
To process petabytes of data, Hadoop was developed and open sourced, taking its inspiration from the Google File System (GFS) and the adjoining distributed computing framework, MapReduce. Overcoming the complexities of scaling while keeping costs under control has also led to a proliferation of new data stores. Examples of recent database technology include Cassandra, a columnar database; MongoDB, a document database; and Neo4J, a graph database.
Hadoop, thanks to its ability to process huge datasets, has fostered a vast ecosystem to query data more iteratively and interactively with Pig, Hive, Impala, and Tez. Hadoop is cumbersome as it operates only in batch mode using MapReduce. Spark is creating a revolution in the analytics and data processing realm by targeting the shortcomings of disk input-output and bandwidth-intensive MapReduce jobs.
Spark is written in Scala, and therefore integrates natively with the Java Virtual Machine (JVM) powered ecosystem. Spark had early on provided Python API and bindings by enabling PySpark. The Spark architecture and ecosystem is inherently polyglot, with an obvious strong presence of Java-led systems.
This book will focus on PySpark and the PyData ecosystem. Python is one of the preferred languages in the academic and scientific community for data-intensive processing. Python has developed a rich ecosystem of libraries and tools in data manipulation with Pandas and Blaze, in Machine Learning with Scikit-Learn, and in data visualization with Matplotlib, Seaborn, and Bokeh. Hence, the aim of this book is to build an end-to-end architecture for data-intensive applications powered by Spark and Python. In order to put these concepts into practice, we will analyze social networks such as Twitter, GitHub, and Meetup. We will focus on the activities and social interactions of Spark and the Open Source Software community by tapping into GitHub, Twitter, and Meetup.
Building data-intensive applications requires highly scalable infrastructure, polyglot storage, seamless data integration, multiparadigm analytics processing, and efficient visualization. The following paragraph describes the data-intensive app architecture blueprint that we will adopt throughout the book. It is the backbone of the book. We will discover Spark in the context of the broader PyData ecosystem.
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Understanding the architecture of data-intensive applications
In order to understand the architecture of data-intensive applications, the following conceptual framework is used. This architecture is designed on the following five layers:
Infrastructure layer
Persistence layer
Integration layer
Analytics layer
Engagement layer
The following screenshot depicts the five layers of the Data Intensive App Framework:
From the bottom up, let's go through the layers and their main purpose.
Infrastructure layer
The infrastructure layer is primarily concerned with virtualization, scalability, and continuous integration. In practical terms, and in terms of virtualization, we will go through building our own development environment in a VirtualBox virtual machine powered by Spark and the Anaconda distribution of Python. If we wish to scale from there, we can create a similar environment in the cloud. The practice of creating a segregated development environment and moving into test and production deployment can be automated and can be part of a continuous integration cycle powered by DevOps tools such as Vagrant, Chef, Puppet, and Docker. Docker is a very popular open source project that eases the installation and deployment of new environments. The book will be limited to building the virtual machine using VirtualBox. From a data-intensive app architecture point of view, we are describing the essential steps of the infrastructure layer by mentioning scalability and continuous integration beyond just virtualization.
Persistence layer
The persistence layer manages the various repositories in accordance with data needs and shapes. It ensures the setup and management of the polyglot data stores. It includes relational database management systems such as MySQL and PostgreSQL; key-value data stores such as Hadoop, Riak, and Redis; columnar databases such as HBase and Cassandra; document databases such as MongoDB and Couchbase; and graph databases such as Neo4j. The persistence layer manages various filesystems such as Hadoop's HDFS. It interacts with various storage systems from native hard drives to Amazon S3. It manages various file storage formats such as csv, json, and parquet, which is a column-oriented format.
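The file format choice can be made concrete with a minimal sketch in plain Python (standard library only; the records and field names below are invented for illustration), showing how the same small dataset persists in CSV versus JSON:

```python
import csv
import io
import json

# A couple of hypothetical harvested records
tweets = [
    {'id': '1', 'user': 'an', 'text': 'Spark is fast'},
    {'id': '2', 'user': 'py', 'text': 'PySpark in action'},
]

# CSV: flat rows under a header line, ideal for tabular data
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=['id', 'user', 'text'])
writer.writeheader()
writer.writerows(tweets)

# JSON: the nested structure of each record survives as-is
json_text = json.dumps(tweets)

# Both serializations round-trip back to the original records
assert list(csv.DictReader(io.StringIO(csv_buf.getvalue()))) == tweets
assert json.loads(json_text) == tweets
```

Parquet, being a binary columnar format, requires a third-party library and is therefore left out of this sketch.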
Integration layer
The integration layer focuses on data acquisition, transformation, quality, persistence, consumption, and governance. It is essentially driven by the following five Cs: connect, collect, correct, compose, and consume.
The five steps describe the lifecycle of data. They are focused on how to acquire the dataset of interest, explore it, iteratively refine and enrich the collected information, and get it ready for consumption. So, the steps perform the following operations:
Connect: Targets the best way to acquire data from the various data sources, APIs offered by these sources, the input format, input schemas if they exist, the rate of data collection, and limitations from providers
Correct: Focuses on transforming data for further processing and also ensures that the quality and consistency of the data received are maintained
Collect: Looks at which data to store where and in what format, to ease data composition and consumption at later stages
Compose: Concentrates its attention on how to mash up the various datasets collected, and enrich the information in order to build a compelling data-driven product
Consume: Takes care of data provisioning and rendering and how the right data reaches the right individual at the right time
Control: This sixth additional step will sooner or later be required as the data, the organization, and the participants grow and it is about ensuring data governance
The following diagram depicts the iterative process of data acquisition and refinement for consumption:
Analytics layer
The analytics layer is where Spark processes data with the various models, algorithms, and machine learning pipelines in order to derive insights. For our purpose, in this book, the analytics layer is powered by Spark. We will delve deeper in subsequent chapters into the merits of Spark. In a nutshell, what makes it so powerful is that it allows multiple paradigms of analytics processing in a single unified platform. It allows batch, streaming, and interactive analytics. Batch processing on large datasets with longer latency periods allows us to extract patterns and insights that can feed into real-time events in streaming mode. Interactive and iterative analytics are more suited for data exploration. Spark offers bindings and APIs in Python and R. With its Spark SQL module and the Spark Dataframe, it offers a very familiar analytics interface.
Engagement layer
The engagement layer interacts with the end user and provides dashboards, interactive visualizations, and alerts. We will focus here on the tools provided by the PyData ecosystem such as Matplotlib, Seaborn, and Bokeh.
Understanding Spark
Hadoop scales horizontally as the data grows. Hadoop runs on commodity hardware, so it is cost-effective. Intensive data applications are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. Hadoop is the first open source implementation of map-reduce. Hadoop relies on a distributed framework for storage called HDFS (Hadoop Distributed File System). Hadoop runs map-reduce tasks in batch jobs. Hadoop requires persisting the data to disk at each map, shuffle, and reduce process step. The overhead and the latency of such batch jobs adversely impact the performance.
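The map, shuffle, and reduce steps just described can be sketched in plain Python (no Hadoop needed; the input lines are made up for illustration). In Hadoop, each of the three phases below would persist its output to disk before the next phase starts, which is exactly the overhead being discussed:

```python
from collections import defaultdict

lines = ['spark is fast', 'hadoop is batch', 'spark is in memory']

# Map phase: emit a (word, 1) pair for every word of every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key so each word lands in one bucket
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate each bucket into a final count
counts = {word: sum(ones) for word, ones in groups.items()}

print(counts['spark'], counts['is'])  # 2 3
```

In a real cluster, the map and reduce phases also run in parallel across machines, and the shuffle moves data over the network between them.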
Spark is a fast, distributed general analytics computing engine for large-scale data processing. The major breakthrough from Hadoop is that Spark allows data sharing between processing steps through in-memory processing of data pipelines.
Spark is unique in that it allows four different styles of data analysis and processing. Spark can be used in:
Batch: This mode is used for manipulating large datasets, typically performing large map-reduce jobs
Streaming: This mode is used to process incoming information in near real time
Iterative: This mode is for machine learning algorithms such as a gradient descent where the data is accessed repetitively in order to reach convergence
Interactive: This mode is used for data exploration as large chunks of data are in memory and due to the very quick response time of Spark
The following figure highlights the preceding four processing styles:
Spark operates in three modes: one standalone mode on a single machine, and two distributed modes on a cluster of machines, either on Yarn, the Hadoop distributed resource manager, or on Mesos, the open source cluster manager developed at Berkeley concurrently with Spark:
Spark offers a polyglot interface in Scala, Java, Python, and R.
Spark libraries
Spark comes with batteries included, with some powerful libraries:
Spark SQL: This provides the SQL-like ability to interrogate structured data and interactively explore large datasets
Spark MLlib: This provides major algorithms and a pipeline framework for machine learning
Spark Streaming: This is for near real-time analysis of data using micro batches and sliding windows on incoming streams of data
Spark GraphX: This is for graph processing and computation on complex connected entities and relationships
PySpark in action
Spark is written in Scala. The whole Spark ecosystem naturally leverages the JVM environment and capitalizes on HDFS natively. Hadoop HDFS is one of the many data stores supported by Spark. Spark is agnostic and from the beginning interacted with multiple data sources, types, and formats.
PySpark is not a transcribed version of Spark on a Java-enabled dialect of Python such as Jython. PySpark provides integrated API bindings around Spark and enables full usage of the Python ecosystem within all the nodes of the cluster with the pickle Python serialization and, more importantly, supplies access to the rich ecosystem of Python's machine learning libraries such as Scikit-Learn or data processing such as Pandas.
When we initialize a Spark program, the first thing a Spark program must do is to create a SparkContext object. It tells Spark how to access the cluster. The Python program creates a PySparkContext. Py4J is the gateway that binds the Python program to the Spark JVM SparkContext. The JVM SparkContext serializes the application codes and the closures and sends them to the cluster for execution. The cluster manager allocates resources and schedules, and ships the closures to the Spark workers in the cluster who activate Python virtual machines as required. In each machine, the Spark Worker is managed by an executor that controls computation, storage, and cache.
Here's an example of how the Spark driver manages both the PySpark context and the Spark context with its local filesystems and its interactions with the Spark worker through the cluster manager:
The Resilient Distributed Dataset
Spark applications consist of a driver program that runs the user's main function, creates distributed datasets on the cluster, and executes various parallel operations (transformations and actions) on those datasets.
Spark applications are run as an independent set of processes, coordinated by a SparkContext in a driver program.
The SparkContext will be allocated system resources (machines, memory, CPU) from the Cluster manager.
The SparkContext manages executors who manage workers in the cluster. The driver program has Spark jobs that need to run. The jobs are split into tasks submitted to the executor for completion. The executor takes care of computation, storage, and caching in each machine.
The key building block in Spark is the RDD (Resilient Distributed Dataset). A dataset is a collection of elements. Distributed means the dataset can be on any node in the cluster. Resilient means that the dataset could get lost or partially lost without major harm to the computation in progress as Spark will re-compute from the data lineage in memory, also known as the DAG (short for Directed Acyclic Graph) of operations. Basically, Spark will snapshot in memory a state of the RDD in the cache. If one of the computing machines crashes during operation, Spark rebuilds the RDDs from the cached RDD and the DAG of operations. RDDs recover from node failure.
There are two types of operation on RDDs:
Transformations: A transformation takes an existing RDD and leads to a pointer of a new transformed RDD. An RDD is immutable. Once created, it cannot be changed. Each transformation creates a new RDD. Transformations are lazily evaluated. Transformations are executed only when an action occurs. In the case of failure, the data lineage of transformations rebuilds the RDD.
Actions: An action on an RDD triggers a Spark job and yields a value. An action operation causes Spark to execute the (lazy) transformation operations that are required to compute the RDD returned by the action. The action results in a DAG of operations. The DAG is compiled into stages where each stage is executed as a series of tasks. A task is a fundamental unit of work.
Here’ssomeusefulinformationonRDDs:
RDDsarecreatedfromadatasourcesuchasanHDFSfileoraDBquery.TherearethreewaystocreateanRDD:
ReadingfromadatastoreTransforminganexistingRDDUsinganin-memorycollection
RDDsaretransformedwithfunctionssuchasmaporfilter,whichyieldnewRDDs.Anactionsuchasfirst,take,collect,orcountonanRDDwilldelivertheresultsintotheSparkdriver.TheSparkdriveristheclientthroughwhichtheuserinteractswiththeSparkcluster.
ThefollowingdiagramillustratestheRDDtransformationandaction:
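The lazy-transformation versus eager-action distinction can be mimicked in plain Python with generators (an analogy only, not Spark code; the data and operations are arbitrary): nothing below is computed until the final "action" forces the pipeline.

```python
data = range(1, 6)  # stand-in for an RDD holding 1..5

# 'Transformations' only build a recipe; no element is touched yet
squared = (x * x for x in data)             # analogous to rdd.map(...)
evens = (x for x in squared if x % 2 == 0)  # analogous to rdd.filter(...)

# The 'action' triggers evaluation of the whole chained pipeline
result = list(evens)  # analogous to rdd.collect()
print(result)  # [4, 16]
```

In real PySpark, the equivalent pipeline would read sc.parallelize(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect(), with the added benefits of partitioning across the cluster and lineage-based recovery.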
Understanding Anaconda
Anaconda is a widely used free Python distribution maintained by Continuum (https://www.continuum.io/). We will use the prevailing software stack provided by Anaconda to generate our apps. In this book, we will use PySpark and the PyData ecosystem. The PyData ecosystem is promoted, supported, and maintained by Continuum and powered by the Anaconda Python distribution. The Anaconda Python distribution essentially saves time and aggravation in the installation of the Python environment; we will use it in conjunction with Spark. Anaconda has its own package management that supplements the traditional pip install and easy_install. Anaconda comes with batteries included, namely some of the most important packages such as Pandas, Scikit-Learn, Blaze, Matplotlib, and Bokeh. An upgrade to any of the installed libraries is a simple command at the console:
$ conda update
A list of installed libraries in our environment can be obtained with the command:
$ conda list
The key components of the stack are as follows:
Anaconda: This is a free Python distribution with almost 200 Python packages for science, math, engineering, and data analysis.
Conda: This is a package manager that takes care of all the dependencies of installing a complex software stack. This is not restricted to Python and manages the install process for R and other languages.
Numba: This provides the power to speed up code in Python with high-performance functions and just-in-time compilation.
Blaze: This enables large scale data analytics by offering a uniform and adaptable interface to access a variety of data providers, which include streaming Python, Pandas, SQLAlchemy, and Spark.
Bokeh: This provides interactive data visualizations for large and streaming datasets.
Wakari: This allows us to share and deploy IPython Notebooks and other apps on a hosted environment.
The following figure shows the components of the Anaconda stack:
Setting up the Spark powered environment

In this section, we will learn to set up Spark:

- Create a segregated development environment in a virtual machine running on Ubuntu 14.04, so it does not interfere with any existing system.
- Install Spark 1.3.0 with its dependencies.
- Install the Anaconda Python 2.7 environment with all the required libraries such as Pandas, Scikit-Learn, Blaze, and Bokeh, and enable PySpark, so it can be accessed through IPython Notebooks.
- Set up the backend or data stores of our environment. We will use MySQL as the relational database, MongoDB as the document store, and Cassandra as the columnar database.

Each storage backend serves a specific purpose depending on the nature of the data to be handled. The MySQL RDBMS is used for standard tabular processed information that can be easily queried using SQL. As we will be processing a lot of JSON-type data from various APIs, the easiest way to store them is in a document store. For real-time and time-series-related information, Cassandra is best suited as a columnar database.

The following diagram gives a view of the environment we will build and use throughout the book:
Setting up an Oracle VirtualBox with Ubuntu

Setting up a clean new VirtualBox environment on Ubuntu 14.04 is the safest way to create a development environment that does not conflict with existing libraries and can be later replicated in the cloud using a similar list of commands.

In order to set up an environment with Anaconda and Spark, we will create a VirtualBox virtual machine running Ubuntu 14.04.

Let's go through the steps of using VirtualBox with Ubuntu:

1. Oracle VirtualBox VM is free and can be downloaded from https://www.virtualbox.org/wiki/Downloads. The installation is pretty straightforward.
2. After installing VirtualBox, let's open the Oracle VM VirtualBox Manager and click the New button.
3. We'll give the new VM a name, and select Type Linux and Version Ubuntu (64 bit).
4. You need to download the ISO from the Ubuntu website and allocate sufficient RAM (4 GB recommended) and disk space (20 GB recommended). We will use the Ubuntu 14.04.1 LTS release, which is found here: http://www.ubuntu.com/download/desktop.
5. Once the installation is completed, it is advisable to install the VirtualBox Guest Additions by going to (from the VirtualBox menu, with the new VM running) Devices | Insert Guest Additions CD image. Failing to install the Guest Additions on a Windows host gives a very limited user interface with reduced window sizes.
6. Once the additional installation completes, reboot the VM, and it will be ready to use. It is helpful to enable the shared clipboard by selecting the VM and clicking Settings, then go to General | Advanced | Shared Clipboard and click on Bidirectional.
Installing Anaconda with Python 2.7

PySpark currently runs only on Python 2.7. (There are requests from the community to upgrade to Python 3.3.) To install Anaconda, follow these steps:

1. Download the Anaconda Installer for Linux 64-bit Python 2.7 from http://continuum.io/downloads#all.
2. After downloading the Anaconda installer, open a terminal and navigate to the directory or folder where the installer has been saved. From here, run the following command, replacing the 2.x.x in the command with the version number of the downloaded installer file:

# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh

3. After accepting the license terms, you will be asked to specify the install location (which defaults to ~/anaconda).
4. After the self-extraction is finished, you should add the anaconda binary directory to your PATH environment variable:

# add anaconda to PATH
export PATH=~/anaconda/bin:$PATH
Installing Java 8

Spark runs on the JVM and requires the JDK (short for Java Development Kit) and not just the JRE (short for Java Runtime Environment), as we will build apps with Spark. The recommended version is Java Version 7 or higher. Java 8 is the most suitable, as it includes many of the functional programming techniques available with Scala and Python.

To install Java 8, follow these steps:

1. Install Oracle Java 8 using the following commands:

# install oracle java 8
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

2. Set the JAVA_HOME environment variable and ensure that the Java program is on your PATH.
3. Check that JAVA_HOME is properly set:

# check JAVA_HOME
$ echo $JAVA_HOME
Installing Spark

Head over to the Spark download page at http://spark.apache.org/downloads.html.

The Spark download page offers the possibility to download earlier versions of Spark and different package and download types. We will select the latest release, pre-built for Hadoop 2.6 and later. The easiest way to install Spark is to use a Spark package prebuilt for Hadoop 2.6 and later, rather than build it from source. Move the file to the directory ~/spark under the home directory.

Download the latest release of Spark, Spark 1.5.2, released on November 9, 2015:

1. Select Spark release 1.5.2 (Nov 09 2015).
2. Choose the package type Pre-built for Hadoop 2.6 and later.
3. Choose the download type Direct Download.
4. Download Spark: spark-1.5.2-bin-hadoop2.6.tgz.
5. Verify this release using the 1.5.2 signatures and checksums.

This can also be accomplished by running:

# download spark
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz

Next, we'll extract the files and clean up:

# extract, clean up, move the unzipped files under the spark directory
$ tar -xf spark-1.5.2-bin-hadoop2.6.tgz
$ rm spark-1.5.2-bin-hadoop2.6.tgz
$ sudo mv spark-* spark
Now, we can run the Spark Python interpreter with:

# run spark
$ cd ~/spark
./bin/pyspark

You should see something like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc.
>>>

The interpreter will have already provided us with a Spark context object, sc, which we can see by running:

>>> print(sc)
<pyspark.context.SparkContext object at 0x7f34b61c4e50>
Enabling IPython Notebook

We will work with IPython Notebook for a friendlier user experience than the console.

You can launch IPython Notebook by using the following command:

$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

Launch PySpark with IPYNB in the directory examples/AN_Spark where Jupyter or IPython Notebooks are stored:

# cd to /home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark
# launch command using python 2.7 and the spark-csv package:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0

# launch command using python 3.4 and the spark-csv package:
$ IPYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Building our first app with PySpark

We are now ready to check that everything is working fine. The obligatory word count will be put to the test: we will process a word count on the first chapter of this book.

The code we will be running is listed here:
# Word count on 1st Chapter of the Book using PySpark

# import regex module
import re
# import add from operator module
from operator import add

# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# Get words from the input file
words = file_in.flatMap(lambda line: re.split('\W+', line.lower().strip()))
# words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)

# set count 1 per word
words = words.map(lambda w: (w, 1))
# reduce phase - sum count all the words
words = words.reduceByKey(add)
In this program, we are first reading the file from the directory /home/an/Documents/A00_Documents/Spark4Py20150315 into file_in.

We are then introspecting the file by counting the number of lines and the number of characters per line.

We are splitting the input file into words and getting them in lowercase. For our word count purpose, we are choosing words longer than three characters in order to avoid shorter and much more frequent words such as the, and, and for, which would skew the count in their favor. Generally, they are considered stopwords and should be filtered out in any language processing task.

At this stage, we are getting ready for the MapReduce steps. To each word, we map a value of 1 and reduce by summing the counts for each unique word.
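The same pipeline can be traced without a Spark cluster. This plain-Python sketch (running on an invented two-line sample rather than the chapter file) mirrors each RDD operation step by step:

```python
import re

text = ["Spark is fast", "Spark runs everywhere fast"]  # stand-in for sc.textFile(...)

# flatMap: split each line into lowercase words
words = [w for line in text for w in re.split(r'\W+', line.lower().strip())]
# filter: keep words longer than 3 characters
words = [w for w in words if len(w) > 3]
# map: emit (word, 1) pairs
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n
# swap to (count, word) and sort descending, as sortByKey(False) will do below
top = sorted(((n, w) for w, n in counts.items()), reverse=True)
print(top)  # [(2, 'spark'), (2, 'fast'), (1, 'runs'), (1, 'everywhere')]
```

Spark distributes exactly these steps across the cluster; the local version only makes the dataflow easy to follow.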
Here are illustrations of the code in the IPython Notebook. The first 10 cells are preprocessing the word count on the dataset, which is retrieved from the local file directory.

Swap the word count tuples into the format (count, word) in order to sort by count, which is now the primary key of the tuple:

# create tuple (count, word) and sort in descending
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)
# take top 20 words by frequency
words.take(20)

In order to display our result, we are creating the tuple (count, word) and displaying the top 20 most frequently used words in descending order:
Let's create a histogram function:

# create function for histogram of most frequent words
%matplotlib inline
import matplotlib.pyplot as plt
#
def histogram(words):
    count = map(lambda x: x[1], words)
    word = map(lambda x: x[0], words)
    plt.barh(range(len(count)), count, color='grey')
    plt.yticks(range(len(count)), word)

# Change order of tuple (word, count) from (count, word)
words = words.map(lambda x: (x[1], x[0]))
words.take(25)

# display histogram
histogram(words.take(25))

Here, we visualize the most frequent words by plotting them in a bar chart. We have to first swap the tuple from the original (count, word) to (word, count):

So here you have it: the most frequent words used in the first chapter are Spark, followed by Data and Anaconda.
Virtualizing the environment with Vagrant

In order to create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built from a Vagrantfile.

We will point to the Massive Open Online Courses (MOOCs) delivered by UC Berkeley and Databricks:

- Introduction to Big Data with Apache Spark, Professor Anthony D. Joseph, can be found at https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
- Scalable Machine Learning, Professor Ameet Talwalkar, can be found at https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

The course labs were executed on IPython Notebooks powered by PySpark. They can be found in the following GitHub repository: https://github.com/spark-mooc/mooc-setup/.

Once you have set up Vagrant on your machine, follow these instructions to get started: https://docs.vagrantup.com/v2/getting-started/index.html.

Clone the spark-mooc/mooc-setup/ GitHub repository in your work directory and launch the command $ vagrant up within the cloned directory:

Be aware that the version of Spark may be outdated, as the Vagrantfile may not be up to date.
You will see an output similar to this:

C:\Programs\spark\edx1001\mooc-setup-master>vagrant up
Bringing machine 'sparkvm' up with 'virtualbox' provider...
==> sparkvm: Checking if box 'sparkmooc/base' is up to date...
==> sparkvm: Clearing any previously set forwarded ports...
==> sparkvm: Clearing any previously set network interfaces...
==> sparkvm: Preparing network interfaces based on configuration...
    sparkvm: Adapter 1: nat
==> sparkvm: Forwarding ports...
    sparkvm: 8001 => 8001 (adapter 1)
    sparkvm: 4040 => 4040 (adapter 1)
    sparkvm: 22 => 2222 (adapter 1)
==> sparkvm: Booting VM...
==> sparkvm: Waiting for machine to boot. This may take a few minutes...
    sparkvm: SSH address: 127.0.0.1:2222
    sparkvm: SSH username: vagrant
    sparkvm: SSH auth method: private key
    sparkvm: Warning: Connection timeout. Retrying...
    sparkvm: Warning: Remote connection disconnect. Retrying...
==> sparkvm: Machine booted and ready!
==> sparkvm: Checking for guest additions in VM...
==> sparkvm: Setting hostname...
==> sparkvm: Mounting shared folders...
    sparkvm: /vagrant => C:/Programs/spark/edx1001/mooc-setup-master
==> sparkvm: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> sparkvm: flag to force provisioning. Provisioners marked to run always will still run.

C:\Programs\spark\edx1001\mooc-setup-master>

This will launch the IPython Notebooks powered by PySpark on localhost:8001:
Moving to the cloud

As we are dealing with distributed systems, an environment on a virtual machine running on a single laptop is limited for exploration and learning. We can move to the cloud in order to experience the power and scalability of the Spark distributed framework.

Deploying apps in Amazon Web Services

Once we are ready to scale our apps, we can migrate our development environment to Amazon Web Services (AWS).

How to run Spark on EC2 is clearly described in the following page: https://spark.apache.org/docs/latest/ec2-scripts.html.
We emphasize five key steps in setting up the AWS Spark environment:

1. Create an AWS EC2 key pair via the AWS console at http://aws.amazon.com/console/.
2. Export your key pair to your environment:

export AWS_ACCESS_KEY_ID=accesskeyid
export AWS_SECRET_ACCESS_KEY=secretaccesskey

3. Launch your cluster:

~$ cd $SPARK_HOME/ec2
ec2$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

4. SSH into a cluster to run Spark jobs:

ec2$ ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>

5. Destroy your cluster after usage:

ec2$ ./spark-ec2 destroy <cluster-name>
Virtualizing the environment with Docker

In order to create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built in Docker containers.

We wish to capitalize on Docker's two main functions:

- Creating isolated containers that can be easily deployed on different operating systems or in the cloud.
- Allowing easy sharing of the development environment image with all its dependencies using Docker Hub. Docker Hub is similar to GitHub. It allows easy cloning and version control. The snapshot image of the configured environment can be the baseline for further enhancements.

The following diagram illustrates a Docker-enabled environment with Spark, Anaconda, and the database server and their respective data volumes.

Docker offers the ability to clone and deploy an environment from a Dockerfile.

You can find an example Dockerfile with a PySpark and Anaconda setup at the following address: https://hub.docker.com/r/thisgokeboysef/pyspark-docker/~/dockerfile/.

Install Docker as per the instructions provided at the following links:

- http://docs.docker.com/mac/started/ if you are on Mac OS X
- http://docs.docker.com/linux/started/ if you are on Linux
- http://docs.docker.com/windows/started/ if you are on Windows

Install the Docker container with the Dockerfile provided earlier with the following command:

$ docker pull thisgokeboysef/pyspark-docker

Other great sources of information on how to dockerize your environment can be seen at Lab41. The GitHub repository contains the necessary code:

https://github.com/Lab41/ipython-spark-docker

The supporting blog post is rich in information on the thought processes involved in building the Docker environment: http://lab41.github.io/blog/2015/04/13/ipython-on-spark-on-docker/.
Summary

We set the context of building data-intensive apps by describing the overall architecture structured around the infrastructure, persistence, integration, analytics, and engagement layers. We also discussed Spark and Anaconda with their respective building blocks. We set up an environment in a VirtualBox with Anaconda and Spark and demonstrated a word count app using the text content of the first chapter as input.

In the next chapter, we will delve more deeply into the architecture blueprint for data-intensive apps and tap into the Twitter, GitHub, and Meetup APIs to get a feel of the data we will be mining with Spark.
Chapter 2. Building Batch and Streaming Apps with Spark

The objective of the book is to teach you about PySpark and the PyData libraries by building an app that analyzes the Spark community's interactions on social networks. We will gather information on Apache Spark from GitHub, check the relevant tweets on Twitter, and get a feel for the buzz around Spark in the broader open source software communities using Meetup.

In this chapter, we will outline the various sources of data and information. We will get an understanding of their structure. We will outline the data processing pipeline, from collection to batch and streaming processing.

In this section, we will cover the following points:

- Outline data processing pipelines from collection to batch and stream processing, effectively depicting the architecture of the app we are planning to build.
- Check out the various data sources (GitHub, Twitter, and Meetup), their data structure (JSON, structured information, unstructured text, geo-location, time series data, and so on), and their complexities. We also discuss the tools to connect to three different APIs, so you can build your own data mashups. The book will focus on Twitter in the following chapters.
Architecting data-intensive apps

We defined the data-intensive app framework architecture blueprint in the previous chapter. Let's put back in context the various software components we are going to use throughout the book in our original framework. Here's an illustration of the various components of software mapped in the data-intensive architecture framework:

Spark is an extremely efficient, distributed computing framework. In order to exploit its full power, we need to architect our solution accordingly. For performance reasons, the overall solution needs to also be aware of its usage in terms of CPU, storage, and network.

These imperatives drive the architecture of our solution:

- Latency: This architecture combines slow and fast processing. Slow processing is done on historical data in batch mode. This is also called data at rest. This phase builds precomputed models and data patterns that will be used by the fast processing arm once live continuous data is fed into the system. Fast processing of data or real-time analysis of streaming data refers to data in motion. Data at rest is essentially processing data in batch mode with a longer latency. Data in motion refers to the streaming computation of data ingested in real time.
- Scalability: Spark is natively linearly scalable through its distributed in-memory computing framework. Databases and data stores interacting with Spark need to also be able to scale linearly as data volume grows.
- Fault tolerance: When a failure occurs due to hardware, software, or network reasons, the architecture should be resilient enough and provide availability at all times.
- Flexibility: The data pipelines put in place in this architecture can be adapted and retrofitted very quickly depending on the use case.

Spark is unique as it allows batch processing and streaming analytics on the same unified platform.

We will consider two data processing pipelines:

- The first one handles data at rest and is focused on putting together the pipeline for batch analysis of the data
- The second one, data in motion, targets real-time data ingestion and delivering insights based on precomputed models and data patterns
Processing data at rest

Let's get an understanding of the data at rest or batch processing pipeline. The objective in this pipeline is to ingest the various datasets from Twitter, GitHub, and Meetup; prepare the data for Spark MLlib, the machine learning engine; and derive the base models that will be applied for insight generation in batch mode or in real time.

The following diagram illustrates the data pipeline in order to enable processing data at rest:

Processing data in motion

Processing data in motion introduces a new level of complexity, as we are introducing a new possibility of failure. If we want to scale, we need to consider bringing in distributed message queue systems such as Kafka. We will dedicate a subsequent chapter to understanding streaming analytics.

The following diagram depicts a data pipeline for processing data in motion:
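The decoupling that a distributed message queue such as Kafka provides can be illustrated in miniature with Python's standard queue module (Queue in Python 2). This single-process producer/consumer sketch is only a stand-in for a real broker, with invented event names, but it shows the pattern: the ingesting side and the processing side never talk to each other directly.

```python
import queue
import threading

buffer = queue.Queue()          # stands in for a Kafka topic

def producer():
    # ingest events into the queue, then signal completion with a sentinel
    for event in ['tweet-1', 'tweet-2', 'tweet-3']:
        buffer.put(event)
    buffer.put(None)

consumed = []

def consumer():
    # drain the queue until the sentinel arrives
    while True:
        event = buffer.get()
        if event is None:
            break
        consumed.append(event)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # ['tweet-1', 'tweet-2', 'tweet-3']
```

A real broker adds persistence, partitioning, and replication on top of this buffering role, which is what makes the streaming arm resilient to failures.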
Exploring data interactively

Building a data-intensive app is not as straightforward as exposing a database to a web interface. During the setup of both the data at rest and data in motion processing, we will capitalize on Spark's ability to analyze data interactively and refine the data richness and quality required for the machine learning and streaming activities. Here, we will go through an iterative cycle of data collection, refinement, and investigation in order to get to the dataset of interest for our apps.
Connecting to social networks

Let's delve into the first steps of the data-intensive app architecture's integration layer. We are going to focus on harvesting the data, ensuring its integrity, and preparing for batch and streaming data processing by Spark at the next stage. This phase is described in the five process steps: connect, correct, collect, compose, and consume. These are iterative steps of data exploration that will get us acquainted with the data and help us refine the data structure for further processing.

The following diagram depicts the iterative process of data acquisition and refinement for consumption:

We connect to the social networks of interest: Twitter, GitHub, and Meetup. We will discuss the mode of access to the APIs (short for Application Programming Interfaces) and how to create a RESTful connection with those services while respecting the rate limitation imposed by the social networks. REST (short for Representational State Transfer) is the most widely adopted architectural style on the Internet in order to enable scalable web services. It relies on exchanging messages predominantly in JSON (short for JavaScript Object Notation). RESTful APIs and web services implement the four most prevalent verbs GET, PUT, POST, and DELETE. GET is used to retrieve an element or a collection from a given URI. PUT updates a collection with a new one. POST allows the creation of a new entry, while DELETE eliminates a collection.
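The mechanics of such a RESTful exchange, composing a query string onto a URI and decoding a JSON reply, can be sketched with the standard library alone. The endpoint and payload below are invented for illustration, and no network call is made:

```python
import json
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# compose a GET URI with query parameters, as the API clients below do
params = {'q': 'ApacheSpark', 'count': 10}
url = 'https://api.example.com/search?' + urlencode(sorted(params.items()))
print(url)  # https://api.example.com/search?count=10&q=ApacheSpark

# a REST service typically answers with a JSON message body
reply = '{"statuses": [{"id": 1, "text": "Spark rocks"}]}'
payload = json.loads(reply)
print(payload['statuses'][0]['text'])  # Spark rocks
```

The clients for Twitter, GitHub, and Meetup in the following sections all reduce to this shape: build a parameterized GET request, then walk the decoded JSON tree.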
Getting Twitter data

Twitter allows registered users to access its search and streaming tweet services under an authorization protocol called OAuth, which allows API applications to securely act on a user's behalf. In order to create the connection, the first step is to create an application with Twitter at https://apps.twitter.com/app/new.

Once the application has been created, Twitter will issue the four codes that will allow it to tap into the Twitter hose:

CONSUMER_KEY = 'GetYourKey@Twitter'
CONSUMER_SECRET = 'GetYourKey@Twitter'
OAUTH_TOKEN = 'GetYourToken@Twitter'
OAUTH_TOKEN_SECRET = 'GetYourToken@Twitter'

If you wish to get a feel for the various RESTful queries offered, you can explore the Twitter API on the dev console at https://dev.twitter.com/rest/tools/console:

We will make a programmatic connection on Twitter using the following code, which will activate our OAuth access and allow us to tap into the Twitter API under the rate limitation. In the streaming mode, the limitation is per GET request.
Getting GitHub data

GitHub uses a similar authentication process to Twitter. Head to the developer site and retrieve your credentials after duly registering with GitHub at https://developer.github.com/v3/:

Getting Meetup data

Meetup can be accessed using the token issued in the developer resources to members of Meetup.com. The necessary token or OAuth credentials for Meetup API access can be obtained on their developer's website at https://secure.meetup.com/meetup_api:
Analyzing the data

Let's get a first feel for the data extracted from each of the social networks and get an understanding of the data structure from each of these sources.
Discovering the anatomy of tweets

In this section, we are going to establish connection with the Twitter API. Twitter offers two connection modes: the REST API, which allows us to search historical tweets for a given search term or hashtag, and the streaming API, which delivers real-time tweets under the rate limit in place.

In order to get a better understanding of how to operate with the Twitter API, we will go through the following steps:

1. Install the Twitter Python library.
2. Establish a connection programmatically via OAuth, the authentication required for Twitter.
3. Search for recent tweets for the query Apache Spark and explore the results obtained.
4. Decide on the key attributes of interest and retrieve the information from the JSON output.
Let's go through it step-by-step:

1. Install the Python Twitter library. In order to install it, you need to run pip install twitter from the command line:

$ pip install twitter
2. Create the Python TwitterAPI class and its base methods for authentication, searching, and parsing the results. self.auth gets the credentials from Twitter. It then creates a registered API as self.api. We have implemented two methods: the first one to search Twitter with a given query and the second one to parse the output to retrieve relevant information such as the tweet ID, the tweet text, and the tweet author. The code is as follows:

import twitter
import urlparse
from pprint import pprint as pp

class TwitterAPI(object):
    """
    TwitterAPI class allows the connection to Twitter via OAuth
    once you have registered with Twitter and received the
    necessary credentials
    """

    # initialize and get the twitter credentials
    def __init__(self):
        consumer_key = 'Provide your credentials'
        consumer_secret = 'Provide your credentials'
        access_token = 'Provide your credentials'
        access_secret = 'Provide your credentials'
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_secret = access_secret
        #
        # authenticate credentials with Twitter using OAuth
        self.auth = twitter.oauth.OAuth(access_token, access_secret,
                                        consumer_key, consumer_secret)
        # create registered Twitter API
        self.api = twitter.Twitter(auth=self.auth)

    #
    # search Twitter with query q (i.e. "ApacheSpark") and max. result
    def searchTwitter(self, q, max_res=10, **kwargs):
        search_results = self.api.search.tweets(q=q, count=max_res,
                                                **kwargs)
        statuses = search_results['statuses']
        max_results = min(1000, max_res)
        for _ in range(10):
            try:
                next_results = search_results['search_metadata']['next_results']
            except KeyError as e:
                break
            next_results = urlparse.parse_qsl(next_results[1:])
            kwargs = dict(next_results)
            search_results = self.api.search.tweets(**kwargs)
            statuses += search_results['statuses']
            if len(statuses) > max_results:
                break
        return statuses

    #
    # parse tweets as they are collected to extract id, creation
    # date, user id, tweet text
    def parseTweets(self, statuses):
        return [(status['id'],
                 status['created_at'],
                 status['user']['id'],
                 status['user']['name'],
                 status['text'], url['expanded_url'])
                for status in statuses
                for url in status['entities']['urls']]
3. Instantiate the class with the required authentication:

t = TwitterAPI()

4. Run a search on the query term Apache Spark:

q = "ApacheSpark"
tsearch = t.searchTwitter(q)

5. Analyze the JSON output:

pp(tsearch[1])
{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Sat Apr 25 14:50:57 +0000 2015',
 u'entities': {u'hashtags': [{u'indices': [74, 86], u'text': u'sparksummit'}],
               u'media': [{u'display_url': u'pic.twitter.com/WKUMRXxIWZ',
                           u'expanded_url': u'http://twitter.com/bigdata/status/591976255831969792/photo/1',
                           u'id': 591976255156715520,
                           u'id_str': u'591976255156715520',
                           u'indices': [143, 144],
                           u'media_url':
...(snip)...
 u'text': u'RT @bigdata: Enjoyed catching up with @ApacheSpark users & leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
           u'created_at': u'Sat Apr 04 14:44:31 +0000 2015',
           u'default_profile': True,
           u'default_profile_image': True,
           u'description': u'',
           u'entities': {u'description': {u'urls': []}},
           u'favourites_count': 0,
           u'follow_request_sent': False,
           u'followers_count': 586,
           u'following': False,
           u'friends_count': 2,
           u'geo_enabled': False,
           u'id': 3139047660,
           u'id_str': u'3139047660',
           u'is_translation_enabled': False,
           u'is_translator': False,
           u'lang': u'zh-cn',
           u'listed_count': 749,
           u'location': u'',
           u'name': u'MegaDataMama',
           u'notifications': False,
           u'profile_background_color': u'C0DEED',
           u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
           u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png',
           ...(snip)...
           u'screen_name': u'MegaDataMama',
           u'statuses_count': 26673,
           u'time_zone': None,
           u'url': None,
           u'utc_offset': None,
           u'verified': False}}
6. Parse the Twitter output to retrieve key information of interest:

tparsed = t.parseTweets(tsearch)
pp(tparsed)
[(591980327784046592,
  u'Sat Apr 25 15:01:23 +0000 2015',
  63407360,
  u'Jos\xe9 Carlos Baquero',
  u'Big Data systems are making a difference in the fight against cancer. #BigData #ApacheSpark http://t.co/pnOLmsKdL9',
  u'http://tmblr.co/ZqTggs1jHytN0'),
 (591977704464875520,
  u'Sat Apr 25 14:50:57 +0000 2015',
  3139047660,
  u'MegaDataMama',
  u'RT @bigdata: Enjoyed catching up with @ApacheSpark users & leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
  u'http://goo.gl/eF5xwK'),
 (591977172589539328,
  u'Sat Apr 25 14:48:51 +0000 2015',
  2997608763,
  u'Emma Clark',
  u'RT @bigdata: Enjoyed catching up with @ApacheSpark users & leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
  u'http://goo.gl/eF5xwK'),
 ...(snip)...
 (591879098349268992,
  u'Sat Apr 25 08:19:08 +0000 2015',
  331263208,
  u'Mario Molina',
  u'#ApacheSpark speeds up big data decision-making http://t.co/8hdEXreNfN',
  u'http://www.computerweekly.com/feature/Apache-Spark-speeds-up-big-data-decision-making')]
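The nested structure above explains the double loop in parseTweets: a status may carry several URL entities, and one tuple is emitted per (status, url) pair. The following plain-Python sketch reproduces that behavior on a mocked-up status (all field values invented):

```python
# minimal mock of the Twitter search output structure shown above
statuses = [{'id': 1,
             'created_at': 'Sat Apr 25 14:50:57 +0000 2015',
             'user': {'id': 42, 'name': 'alice'},
             'text': 'Spark news',
             'entities': {'urls': [{'expanded_url': 'http://a.example'},
                                   {'expanded_url': 'http://b.example'}]}}]

# same double comprehension as parseTweets
parsed = [(s['id'], s['created_at'], s['user']['id'], s['user']['name'],
           s['text'], url['expanded_url'])
          for s in statuses
          for url in s['entities']['urls']]

print(len(parsed))  # 2: one tuple per URL entity in the single status
```

Note that a status with an empty 'urls' list contributes no tuple at all, so parseTweets silently drops tweets that carry no link.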
Exploring the GitHub world

In order to get a better understanding of how to operate with the GitHub API, we will go through the following steps:

1. Install the GitHub Python library.
2. Access the API by using the token provided when we registered on the developer website.
3. Retrieve some key facts on the Apache foundation that is hosting the spark repository.

Let's go through the process step-by-step:

1. Install the Python PyGithub library. In order to install it, you need to run pip install PyGithub from the command line:

pip install PyGithub
2. Programmatically create a client to instantiate the GitHub API:

from github import Github

# Get your own access token
ACCESS_TOKEN = 'Get_Your_Own_Access_Token'

# We are focusing our attention on User=apache and Repo=spark
USER = 'apache'
REPO = 'spark'

g = Github(ACCESS_TOKEN, per_page=100)
user = g.get_user(USER)
repo = user.get_repo(REPO)

3. Retrieve key facts from the Apache user. There are 640 active Apache repositories on GitHub:

repos_apache = [repo.name for repo in g.get_user('apache').get_repos()]
len(repos_apache)
640
4. Retrieve key facts from the Spark repository. The programming languages used in the Spark repo are given hereunder:

pp(repo.get_languages())
{u'C': 1493,
 u'CSS': 4472,
 u'Groff': 5379,
 u'Java': 1054894,
 u'JavaScript': 21569,
 u'Makefile': 7771,
 u'Python': 1091048,
 u'R': 339201,
 u'Scala': 10249122,
 u'Shell': 172244}
5. Retrieve a few key participants of the wide Spark GitHub repository network. There are 3,738 stargazers in the Apache Spark repository at the time of writing. The network is immense. The first stargazer is Matei Zaharia, the cofounder of the Spark project when he was doing his PhD at Berkeley.

stargazers = [s for s in repo.get_stargazers()]
print "Number of stargazers", len(stargazers)
Number of stargazers 3738

[stargazers[i].login for i in range(0, 20)]
[u'mateiz',
 u'beyang',
 u'abo',
 u'CodingCat',
 u'andy327',
 u'CrazyJvm',
 u'jyotiska',
 u'BaiGang',
 u'sundstei',
 u'dianacarroll',
 u'ybotco',
 u'xelax',
 u'prabeesh',
 u'invkrh',
 u'bedla',
 u'nadesai',
 u'pcpratts',
 u'narkisr',
 u'Honghe',
 u'Jacke']
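The byte counts returned by get_languages can be post-processed locally. For example, using the figures printed in step 4, this plain-Python sketch ranks the languages by their share of the code base:

```python
# language -> byte count, as returned by repo.get_languages() above
languages = {'C': 1493, 'CSS': 4472, 'Groff': 5379, 'Java': 1054894,
             'JavaScript': 21569, 'Makefile': 7771, 'Python': 1091048,
             'R': 339201, 'Scala': 10249122, 'Shell': 172244}

# convert byte counts into percentage shares and sort descending
total = sum(languages.values())
shares = sorted(((n * 100.0 / total, lang)
                 for lang, n in languages.items()), reverse=True)
for share, lang in shares[:3]:
    print('%-10s %.1f%%' % (lang, share))
# Scala dominates the Spark code base, followed by Python and Java
```

This kind of quick aggregation over API output is exactly the sort of exploratory step we will later hand over to Spark once the datasets grow.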
Understanding the community through Meetup

In order to get a better understanding of how to operate with the Meetup API, we will go through the following steps:

1. Create a Python program to call the Meetup API using an authentication token.
2. Retrieve information on past events for meetup groups such as London Data Science.
3. Retrieve the profiles of the meetup members in order to analyze their participation in similar meetup groups.

Let's go through the process step-by-step:
1. As there is no reliable Meetup API Python library, we will programmatically create a client to instantiate the Meetup API:

import json
import mimeparse
import requests
import urllib
from pprint import pprint as pp

MEETUP_API_HOST = 'https://api.meetup.com'
EVENTS_URL = MEETUP_API_HOST + '/2/events.json'
MEMBERS_URL = MEETUP_API_HOST + '/2/members.json'
GROUPS_URL = MEETUP_API_HOST + '/2/groups.json'
RSVPS_URL = MEETUP_API_HOST + '/2/rsvps.json'
PHOTOS_URL = MEETUP_API_HOST + '/2/photos.json'
GROUP_URLNAME = 'London-Machine-Learning-Meetup'
# GROUP_URLNAME = 'London-Machine-Learning-Meetup' # 'Data-Science-London'

class MeetupAPI(object):
    """
    Retrieves information about meetup.com
    """
    def __init__(self, api_key, num_past_events=10, http_timeout=1,
                 http_retries=2):
        """
        Create a new instance of MeetupAPI
        """
        self._api_key = api_key
        self._http_timeout = http_timeout
        self._http_retries = http_retries
        self._num_past_events = num_past_events

    def get_past_events(self):
        """
        Get past meetup events for a given meetup group
        """
        params = {'key': self._api_key,
                  'group_urlname': GROUP_URLNAME,
                  'status': 'past',
                  'desc': 'true'}
        if self._num_past_events:
            params['page'] = str(self._num_past_events)
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(EVENTS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data

    def get_members(self):
        """
        Get meetup members for a given meetup group
        """
        params = {'key': self._api_key,
                  'group_urlname': GROUP_URLNAME,
                  'offset': '0',
                  'format': 'json',
                  'page': '100',
                  'order': 'name'}
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(MEMBERS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data

    def get_groups_by_member(self, member_id='38680722'):
        """
        Get meetup groups for a given meetup member
        """
        params = {'key': self._api_key,
                  'member_id': member_id,
                  'offset': '0',
                  'format': 'json',
                  'page': '100',
                  'order': 'id'}
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(GROUPS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data
2. Then, we will retrieve past events from a given Meetup group:

m = MeetupAPI(api_key='Get_Your_Own_Key')
last_meetups = m.get_past_events()
pp(last_meetups[5])
{u'created': 1401809093000,
 u'description': u"<p>We are hosting a joint meetup between Spark London and Machine Learning London. Given the excitement in the machine learning community around Spark at the moment a joint meetup is in order!</p> <p>Michael Armbrust from the Apache Spark core team will be flying over from the States to give us a talk in person.\xa0Thanks to our sponsors, Cloudera, MapR and Databricks for helping make this happen.</p> <p>The first part of the talk will be about MLlib, the machine learning library for Spark,\xa0and the second part, on\xa0Spark SQL.</p> <p>Don't sign up if you have already signed up on the Spark London page though!</p> <p>\n\n\nAbstract for part one:</p> <p>In this talk, we\u2019ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark\u2019s high-level API, we can process raw data with familiar libraries in Java, Scala or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We\u2019ll also cover upcoming development work including new built-in algorithms and R bindings.</p> <p>\n\n\n\nAbstract for part two:\xa0</p> <p>In this talk, we'll examine Spark SQL, a new Alpha component that is part of the Apache Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query external data with complex analytics. In addition to Spark SQL, we'll explore the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans to execute more efficiently.</p>",
 u'event_url': u'http://www.meetup.com/London-Machine-Learning-Meetup/events/186883262/',
 u'group': {u'created': 1322826414000,
            u'group_lat': 51.52000045776367,
            u'group_lon': -0.18000000715255737,
            u'id': 2894492,
            u'join_mode': u'open',
            u'name': u'London Machine Learning Meetup',
            u'urlname': u'London-Machine-Learning-Meetup',
            u'who': u'Machine Learning Enthusiasts'},
 u'headcount': 0,
 u'id': u'186883262',
 u'maybe_rsvp_count': 0,
 u'name': u'Joint Spark London and Machine Learning Meetup',
 u'rating': {u'average': 4.800000190734863, u'count': 5},
 u'rsvp_limit': 70,
 u'status': u'past',
 u'time': 1403200800000,
 u'updated': 1403450844000,
 u'utc_offset': 3600000,
 u'venue': {u'address_1': u'12 Errol St, London',
            u'city': u'EC1Y 8LX',
            u'country': u'gb',
            u'id': 19504802,
            u'lat': 51.522533,
            u'lon': -0.090934,
            u'name': u'Royal Statistical Society',
            u'repinned': False},
 u'visibility': u'public',
 u'waitlist_count': 84,
 u'yes_rsvp_count': 70}
3. Get information about the Meetup members:
members = m.get_members()
{u'city': u'London',
 u'country': u'gb',
 u'hometown': u'London',
 u'id': 11337881,
 u'joined': 1421418896000,
 u'lat': 51.53,
 u'link': u'http://www.meetup.com/members/11337881',
 u'lon': -0.09,
 u'name': u'Abhishek Shivkumar',
 u'other_services': {u'twitter': {u'identifier': u'@abhisemweb'}},
 u'photo': {u'highres_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/highres_10898643.jpeg',
            u'photo_id': 10898643,
            u'photo_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/member_10898643.jpeg',
            u'thumb_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/thumb_10898643.jpeg'},
 u'self': {u'common': {}},
 u'state': u'17',
 u'status': u'active',
 u'topics': [{u'id': 1372, u'name': u'Semantic Web', u'urlkey': u'semweb'},
             {u'id': 1512, u'name': u'XML', u'urlkey': u'xml'},
             {u'id': 49585,
              u'name': u'Semantic Social Networks',
              u'urlkey': u'semantic-social-networks'},
             {u'id': 24553,
              u'name': u'Natural Language Processing',
...(snip)...
              u'name': u'Android Development',
              u'urlkey': u'android-developers'}],
 u'visited': 1429281599000}
Previewing our app

Our challenge is to make sense of the data retrieved from these social networks, finding the key relationships and deriving insights. Some of the elements of interest are as follows:

Visualizing the top influencers: Discover the top influencers in the community:
Heavy Twitter users on Apache Spark
Committers in GitHub
Leading Meetup presentations
Understanding the network: Network graph of GitHub committers, watchers, and stargazers
Identifying the hot locations: Locating the most active location for Spark

The following screenshot provides a preview of our app:
Summary

In this chapter, we laid out the overall architecture of our app. We explained the two main paradigms of processing data: batch processing, also called data at rest, and streaming analytics, referred to as data in motion. We proceeded to establish connections to three social networks of interest: Twitter, GitHub, and Meetup. We sampled the data and provided a preview of what we are aiming to build. The remainder of the book will focus on the Twitter dataset. We provided here the tools and API to access three social networks, so you can at a later stage create your own data mashups. We are now ready to investigate the data collected, which will be the topic of the next chapter.

In the next chapter, we will delve deeper into data analysis, extracting the key attributes of interest for our purposes and managing the storage of the information for batch and stream processing.
Chapter 3. Juggling Data with Spark

As per the batch and streaming architecture laid out in the previous chapter, we need data to fuel our applications. We will harvest data focused on Apache Spark from Twitter. The objective of this chapter is to prepare data to be further used by the machine learning and streaming applications. This chapter focuses on how to exchange code and data across the distributed network. We will get practical insights into serialization, persistence, marshaling, and caching. We will get to grips with Spark SQL, the key Spark module to interactively explore structured and semi-structured data. The fundamental data structure powering Spark SQL is the Spark dataframe. The Spark dataframe is inspired by the Python Pandas dataframe and the R dataframe. It is a powerful data structure, well understood and appreciated by data scientists with a background in R or Python.

In this chapter, we will cover the following points:

Connect to Twitter, collect the relevant data, and then persist it in various formats such as JSON and CSV and data stores such as MongoDB
Analyze the data using Blaze and Odo, a spin-off library from Blaze, in order to connect and transfer data from various sources and destinations
Introduce Spark dataframes as the foundation for data interchange between the various Spark modules and explore data interactively using Spark SQL

Revisiting the data-intensive app architecture

Let's first put in context the focus of this chapter with respect to the data-intensive app architecture. We will concentrate our attention on the integration layer and essentially run through iterative cycles of the acquisition, refinement, and persistence of the data. This cycle was termed the five Cs. The five Cs stand for connect, collect, correct, compose, and consume. They are the essential processes we run through in the integration layer in order to get to the right quality and quantity of data retrieved from Twitter. We will also delve deeper into the persistence layer and set up a data store such as MongoDB to collect our data for processing later.

We will explore the data with Blaze, a Python library for data manipulation, and Spark SQL, the interactive module of Spark for data discovery powered by the Spark dataframe. The dataframe paradigm is shared by Python Pandas, Python Blaze, and Spark SQL. We will get a feel for the nuances of the three dataframe flavors.

The following diagram sets the context of the chapter's focus, highlighting the integration layer and the persistence layer:
Serializing and deserializing data

As we are harvesting data from web APIs under rate limit constraints, we need to store it. As the data is processed on a distributed cluster, we need consistent ways to save state and retrieve it for later usage.

Let's now define serialization, persistence, marshaling, and caching or memoization.

Serializing a Python object converts it into a stream of bytes. The Python object needs to be retrieved beyond the scope of its existence, when the program is shut down. The serialized Python object can be transferred over a network or stored in persistent storage. Deserialization is the opposite and converts the stream of bytes into the original Python object so the program can carry on from the saved state. The most popular serialization library in Python is Pickle. As a matter of fact, the PySpark commands are transferred over the wire to the worker nodes via pickled data.
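The roundtrip just described can be seen in a few lines with the standard pickle module. This is a minimal sketch; the tweet dictionary below is illustrative, not part of the book's dataset:

```python
import pickle

# Serialize: a Python object becomes a stream of bytes that can be
# transferred over a network or written to persistent storage.
tweet = {'id': 598831111406510082, 'user_name': 'raulsaeztapia'}
payload = pickle.dumps(tweet)
assert isinstance(payload, bytes)

# Deserialize: the byte stream is converted back into the original object,
# so a program can carry on from the saved state.
restored = pickle.loads(payload)
print(restored == tweet)  # -> True
```

The same mechanism is what PySpark relies on when it ships closures and data to worker nodes.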
Persistence saves a program's state data to disk or memory so that it can carry on where it left off upon restart. It saves a Python object from memory to a file or a database and loads it later with the same state.

Marshalling sends Python code or data over a network TCP connection in a multicore or distributed system.

Caching converts a Python object to a string in memory so that it can be used as a dictionary key later on. Spark supports pulling a dataset into a cluster-wide, in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small reference dataset or running an iterative algorithm such as Google PageRank.
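The idea of caching repeatedly accessed results, keyed by their inputs, can be illustrated in plain Python with functools.lru_cache (a memoization sketch, not Spark's cluster-wide cache):

```python
from functools import lru_cache

# Track how many times the function body actually runs.
calls = []

@lru_cache(maxsize=None)
def slow_square(n):
    # Stand-in for an expensive computation; results are cached by argument.
    calls.append(n)
    return n * n

print(slow_square(4))  # computed -> 16
print(slow_square(4))  # served from the in-memory cache -> 16
print(len(calls))      # -> 1: the second call never re-ran the body
```

Spark's RDD cache applies the same principle at cluster scale: data that would otherwise be recomputed or re-read is kept in memory across repeated accesses.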
Caching is a crucial concept for Spark as it allows us to save RDDs in memory or with a spillage to disk. The caching strategy can be selected based on the lineage of the data or the DAG (short for Directed Acyclic Graph) of transformations applied to the RDDs in order to minimize shuffle or cross-network heavy data exchange. In order to achieve good performance with Spark, beware of data shuffling. A good partitioning policy and use of RDD caching, coupled with avoiding unnecessary action operations, leads to better performance with Spark.
Harvesting and storing data

Before delving into database persistent storage such as MongoDB, we will look at some useful file storages that are widely used: CSV (short for comma-separated values) and JSON (short for JavaScript Object Notation) file storage. The enduring popularity of these two file formats lies in a few key reasons: they are human readable, simple, relatively lightweight, and easy to use.
Persisting data in CSV

The CSV format is lightweight, human readable, and easy to use. It has delimited text columns with an inherent tabular schema.

Python offers a robust csv library that can serialize a csv file into a Python dictionary. For the purpose of our program, we have written a Python class that manages the persistence of data in CSV format and reads from a given CSV.
Let's run through the code of the IO_csv class. The __init__ section of the class basically instantiates the file path, the filename, and the file suffix (in this case, .csv):

class IO_csv(object):

    def __init__(self, filepath, filename, filesuffix='csv'):
        self.filepath = filepath    # /path/to/file without the '/' at the end
        self.filename = filename    # FILE_NAME
        self.filesuffix = filesuffix
The save method of the class uses a Python named tuple and the header fields of the csv file in order to impart a schema while persisting the rows of the CSV. If the csv file already exists, it will be appended and not overwritten; otherwise, it will be created:
    def save(self, data, NTname, fields):
        # NTname = Name of the NamedTuple
        # fields = header of CSV - list of the fields name
        NTuple = namedtuple(NTname, fields)

        if os.path.isfile('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix)):
            # Append existing file
            with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'ab') as f:
                writer = csv.writer(f)
                # writer.writerow(fields)    # fields = header of CSV
                writer.writerows([row for row in map(NTuple._make, data)])
                # List comprehension using map on the NamedTuple._make() iterable and the data file to be saved
                # Notice writer.writerows and not writer.writerow (i.e. list of multiple rows sent to csv file)
        else:
            # Create new file
            with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'wb') as f:
                writer = csv.writer(f)
                writer.writerow(fields)    # fields = header of CSV - list of the fields name
                writer.writerows([row for row in map(NTuple._make, data)])
                # List comprehension using map on the NamedTuple._make() iterable and the data file to be saved
                # Notice writer.writerows and not writer.writerow (i.e. list of multiple rows sent to csv file)
The load method of the class also uses a Python named tuple and the header fields of the csv file in order to retrieve the data using a consistent schema. The load method is a memory-efficient generator to avoid loading a huge file in memory: hence we use yield in place of return:

    def load(self, NTname, fields):
        # NTname = Name of the NamedTuple
        # fields = header of CSV - list of the fields name
        NTuple = namedtuple(NTname, fields)

        with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'rU') as f:
            reader = csv.reader(f)
            for row in map(NTuple._make, reader):
                # Using map on the NamedTuple._make() iterable and the reader file to be loaded
                yield row
Here's the named tuple. We are using it to parse the tweet in order to save or retrieve it to and from the csv file:

fields01 = ['id', 'created_at', 'user_id', 'user_name', 'tweet_text', 'url']
Tweet01 = namedtuple('Tweet01', fields01)

def parse_tweet(data):
    """
    Parse a ``tweet`` from the given response data.
    """
    return Tweet01(
        id=data.get('id', None),
        created_at=data.get('created_at', None),
        user_id=data.get('user_id', None),
        user_name=data.get('user_name', None),
        tweet_text=data.get('tweet_text', None),
        url=data.get('url')
    )
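The save/load pattern above can be exercised end to end. Here is a minimal, runnable Python 3 sketch of the same roundtrip (the book's class targets Python 2, hence its 'wb'/'ab' modes); the sample tweet, file path, and helper names are illustrative:

```python
import csv
import os
import tempfile
from collections import namedtuple

# Same schema as the Tweet01 named tuple above.
fields01 = ['id', 'created_at', 'user_id', 'user_name', 'tweet_text', 'url']
Tweet01 = namedtuple('Tweet01', fields01)

def save_rows(path, fields, rows):
    """Write a header then the rows, mirroring the create branch of IO_csv.save."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(fields)       # header row
        writer.writerows(rows)        # writerows, not writerow: many rows at once

def load_rows(path, nt):
    """Generator mirroring IO_csv.load: yields one named tuple per CSV row."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)                  # skip the header row
        for row in map(nt._make, reader):
            yield row

tweets = [Tweet01('1', '2015-05-14', '42', 'alice', 'Spark is fast', 'http://example.com')]
path = os.path.join(tempfile.mkdtemp(), 'tweets.csv')
save_rows(path, fields01, tweets)
loaded = list(load_rows(path, Tweet01))
print(loaded[0].user_name)  # -> alice
```

Because load_rows is a generator, rows are produced one at a time and a large file never needs to fit in memory.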
Persisting data in JSON

JSON is one of the most popular data formats for Internet-based applications. All the APIs we are dealing with, Twitter, GitHub, and Meetup, deliver their data in JSON format. The JSON format is relatively lightweight compared to XML and human readable, and the schema is embedded in JSON. As opposed to the CSV format, where all records follow exactly the same tabular structure, JSON records can vary in their structure. JSON is semi-structured. A JSON record can be mapped into a Python dictionary of dictionaries.
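A tiny illustration of this semi-structure (the two records below are made up, not from the book's dataset): each JSON record parses into a dictionary of dictionaries, and records need not share the same shape:

```python
import json

records = [
    '{"id": 1, "user": {"name": "alice", "location": {"city": "London"}}}',
    '{"id": 2, "user": {"name": "bob"}, "tags": ["spark", "python"]}',
]
parsed = [json.loads(r) for r in records]

# Nested fields are reached by chained dictionary lookups.
print(parsed[0]['user']['location']['city'])  # -> London

# The second record has no "location" key: semi-structured data
# calls for defensive access with .get() and a default.
print(parsed[1]['user'].get('location', 'unknown'))  # -> unknown
```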
Let's run through the code of the IO_json class. The __init__ section of the class basically instantiates the file path, the filename, and the file suffix (in this case, .json):

class IO_json(object):
    def __init__(self, filepath, filename, filesuffix='json'):
        self.filepath = filepath    # /path/to/file without the '/' at the end
        self.filename = filename    # FILE_NAME
        self.filesuffix = filesuffix
        # self.file_io = os.path.join(dir_name, '.'.join((base_filename, filename_suffix)))
The save method of the class uses utf-8 encoding in order to ensure read and write compatibility of the data. If the JSON file already exists, it will be appended and not overwritten; otherwise, it will be created:

    def save(self, data):
        if os.path.isfile('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix)):
            # Append existing file
            with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'a', encoding='utf-8') as f:
                f.write(unicode(json.dumps(data, ensure_ascii=False)))    # In Python 3, there is no "unicode" function
                # f.write(json.dumps(data, ensure_ascii=False))    # creates a \" escape char for " in the saved file
        else:
            # Create new file
            with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'w', encoding='utf-8') as f:
                f.write(unicode(json.dumps(data, ensure_ascii=False)))
                # f.write(json.dumps(data, ensure_ascii=False))
The load method of the class just returns the file that has been read. A further json.loads function needs to be applied in order to retrieve the json out of the file read:

    def load(self):
        with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), encoding='utf-8') as f:
            return f.read()
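The full cycle, including the extra json.loads step that load leaves to the caller, looks like this in a minimal Python 3 sketch (the sample record and temporary path are illustrative):

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'twtr.json')
data = {'id': 598831111406510082, 'user_name': 'raulsaeztapia'}

# save: object -> JSON text written with utf-8 encoding
with open(path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(data, ensure_ascii=False))

# load: returns the raw text of the file, still a plain string
with open(path, encoding='utf-8') as f:
    raw = f.read()

# the further step the text mentions: parse the string back into objects
record = json.loads(raw)
print(record['user_name'])  # -> raulsaeztapia
```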
Setting up MongoDB

It is crucial to store the information harvested. Thus, we set up MongoDB as our main document data store. As all the information collected is in JSON format and MongoDB stores information in BSON (short for Binary JSON), it is therefore a natural choice.

We will run through the following steps now:

Installing the MongoDB server and client
Running the MongoDB server
Running the Mongo client
Installing the PyMongo driver
Creating the Python Mongo client
Installing the MongoDB server and client

In order to install the MongoDB package, perform the following steps:

1. Import the public key used by the package management system (in our case, Ubuntu's apt). To import the MongoDB public key, we issue the following command:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
2. Create a list file for MongoDB. To create the list file, we use the following command:
echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
3. Update the local package database as sudo:
sudo apt-get update
4. Install the MongoDB packages. We install the latest stable version of MongoDB with the following command:
sudo apt-get install -y mongodb-org
Running the MongoDB server

Let's start the MongoDB server:

1. To start the MongoDB server, we issue the following command to start mongod:
sudo service mongodb start
2. To check whether mongod has started properly, we issue the command:
an@an-VB:/usr/bin$ ps -ef | grep mongo
mongodb    967     1  4 07:03 ?        00:02:02 /usr/bin/mongod --config /etc/mongod.conf
an        3143  3085  0 07:45 pts/3    00:00:00 grep --color=auto mongo
In this case, we see that mongodb is running in process 967.
3. The mongod server sends a message to the effect that it is waiting for connection on port 27017. This is the default port for MongoDB. It can be changed in the configuration file.
4. We can check the contents of the log file at /var/log/mongod/mongod.log:
an@an-VB:/var/lib/mongodb$ ls -lru
total 81936
drwxr-xr-x 2 mongodb nogroup     4096 Apr 25 11:19 _tmp
-rw-r--r-- 1 mongodb nogroup       69 Apr 25 11:19 storage.bson
-rwxr-xr-x 1 mongodb nogroup        5 Apr 25 11:19 mongod.lock
-rw------- 1 mongodb nogroup 16777216 Apr 25 11:19 local.ns
-rw------- 1 mongodb nogroup 67108864 Apr 25 11:19 local.0
drwxr-xr-x 2 mongodb nogroup     4096 Apr 25 11:19 journal
5. In order to stop the mongodb server, just issue the following command:
sudo service mongodb stop
Running the Mongo client

Running the Mongo client in the console is as easy as calling mongo, as highlighted in the following command:

an@an-VB:/usr/bin$ mongo
MongoDB shell version: 3.0.2
connecting to: test
Server has startup warnings:
2015-05-30T07:03:49.387+0200 I CONTROL  [initandlisten]
2015-05-30T07:03:49.388+0200 I CONTROL  [initandlisten]
At the mongo client console prompt, we can see the databases with the following commands:

> show dbs
local  0.078GB
test   0.078GB

We select the test database using use test:

> use test
switched to db test

We display the collections within the test database:

> show collections
restaurants
system.indexes

We check a sample record in the restaurants collection listed previously:

> db.restaurants.find()
{ "_id": ObjectId("553b70055e82e7b824ae0e6f"), "address": { "building": "1007",
"coord": [ -73.856077, 40.848447 ], "street": "Morris Park Ave",
"zipcode": "10462" }, "borough": "Bronx", "cuisine": "Bakery", "grades": [ {
"grade": "A", "score": 2, "date": ISODate("2014-03-03T00:00:00Z") }, {
"date": ISODate("2013-09-11T00:00:00Z"), "grade": "A", "score": 6 }, {
"score": 10, "date": ISODate("2013-01-24T00:00:00Z"), "grade": "A" }, {
"date": ISODate("2011-11-23T00:00:00Z"), "grade": "A", "score": 9 }, {
"date": ISODate("2011-03-10T00:00:00Z"), "grade": "B", "score": 14 } ],
"name": "Morris Park Bake Shop", "restaurant_id": "30075445" }
Installing the PyMongo driver

Installing the Python driver with anaconda is easy. Just run the following command at the terminal:

conda install pymongo
Creating the Python client for MongoDB

We are creating an IO_mongo class that will be used in our harvesting and processing programs to store the data collected and retrieve saved information. In order to create the mongo client, we will import the MongoClient module from pymongo. We connect to the mongodb server on localhost at port 27017. The command is as follows:

from pymongo import MongoClient as MCli

class IO_mongo(object):
    conn = {'host': 'localhost', 'ip': '27017'}

We initialize our class with the client connection, the database (in this case, twtr_db), and the collection (in this case, twtr_coll) to be accessed:

    def __init__(self, db='twtr_db', coll='twtr_coll', **conn):
        # Connects to the MongoDB server
        self.client = MCli(**conn)
        self.db = self.client[db]
        self.coll = self.db[coll]

The save method inserts new records in the preinitialized collection and database:

    def save(self, data):
        # Insert to collection in db
        return self.coll.insert(data)
The load method allows the retrieval of specific records according to criteria and projection. In the case of a large amount of data, it returns a cursor:

    def load(self, return_cursor=False, criteria=None, projection=None):
        if criteria is None:
            criteria = {}
        if projection is None:
            cursor = self.coll.find(criteria)
        else:
            cursor = self.coll.find(criteria, projection)

        # Return a cursor for large amounts of data
        if return_cursor:
            return cursor
        else:
            return [item for item in cursor]
Harvesting data from Twitter

Each social network poses its limitations and challenges. One of the main obstacles for harvesting data is an imposed rate limit. While running repeated or long-running connections between rate limit pauses, we have to be careful to avoid collecting duplicate data.

We have redesigned our connection programs outlined in the previous chapter to take care of the rate limits.

In this TwitterAPI class that connects and collects the tweets according to the search query we specify, we have added the following:

Logging capability using the Python logging library with the aim of collecting any errors or warnings in the case of program failure
Persistence capability using MongoDB, with the IO_mongo class exposed previously, as well as JSON files using the IO_json class
API rate limit and error management capability, so we can ensure more resilient calls to Twitter without getting barred for tapping into the firehose

Let's go through the steps:
1. We initialize by instantiating the Twitter API with our credentials:

class TwitterAPI(object):
    """
    TwitterAPI class allows the connection to Twitter via OAuth
    once you have registered with Twitter and received the
    necessary credentials
    """

    def __init__(self):
        consumer_key = 'get_your_credentials'
        consumer_secret = 'get_your_credentials'
        access_token = 'get_your_credentials'
        access_secret = 'get_your_credentials'
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_secret = access_secret
        self.retries = 3
        self.auth = twitter.oauth.OAuth(access_token, access_secret,
                                        consumer_key, consumer_secret)
        self.api = twitter.Twitter(auth=self.auth)
2. We initialize the logger by providing the log level:

logger.debug(debug message)
logger.info(info message)
logger.warn(warn message)
logger.error(error message)
logger.critical(critical message)
3. We set the log path and the message format:

        # logger initialisation
        appName = 'twt150530'
        self.logger = logging.getLogger(appName)
        # self.logger.setLevel(logging.DEBUG)
        # create console handler and set level to debug
        logPath = '/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data'
        fileName = appName
        fileHandler = logging.FileHandler("{0}/{1}.log".format(logPath, fileName))
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        fileHandler.setFormatter(formatter)
        self.logger.addHandler(fileHandler)
        self.logger.setLevel(logging.DEBUG)
4. We initialize the JSON file persistence instruction:

        # Save to JSON file initialisation
        jsonFpath = '/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data'
        jsonFname = 'twtr15053001'
        self.jsonSaver = IO_json(jsonFpath, jsonFname)

5. We initialize the MongoDB database and collection for persistence:

        # Save to MongoDB initialisation
        self.mongoSaver = IO_mongo(db='twtr01_db', coll='twtr01_coll')
6. The method searchTwitter launches the search according to the query specified:

    def searchTwitter(self, q, max_res=10, **kwargs):
        search_results = self.api.search.tweets(q=q, count=10, **kwargs)
        statuses = search_results['statuses']
        max_results = min(1000, max_res)

        for _ in range(10):
            try:
                next_results = search_results['search_metadata']['next_results']
                # self.logger.info('info in searchTwitter - next_results: %s' % next_results[1:])
            except KeyError as e:
                self.logger.error('error in searchTwitter: %s', e)
                break

            # next_results = urlparse.parse_qsl(next_results[1:])    # python 2.7
            next_results = urllib.parse.parse_qsl(next_results[1:])
            # self.logger.info('info in searchTwitter - next_results[max_id]: %s', next_results[0:])
            kwargs = dict(next_results)
            # self.logger.info('info in searchTwitter - next_results[max_id]: %s' % kwargs['max_id'])
            search_results = self.api.search.tweets(**kwargs)
            statuses += search_results['statuses']
            self.saveTweets(search_results['statuses'])

            if len(statuses) > max_results:
                self.logger.info('info in searchTwitter - got %i tweets - max: %i' % (len(statuses), max_results))
                break
        return statuses
7. The saveTweets method actually saves the collected tweets in JSON and in MongoDB:

    def saveTweets(self, statuses):
        # Saving to JSON file
        self.jsonSaver.save(statuses)

        # Saving to MongoDB
        for s in statuses:
            self.mongoSaver.save(s)
8. The parseTweets method allows us to extract the key tweet information from the vast amount of information provided by the Twitter API:

    def parseTweets(self, statuses):
        return [(status['id'],
                 status['created_at'],
                 status['user']['id'],
                 status['user']['name'],
                 status['text'],
                 url['expanded_url'])
                for status in statuses
                for url in status['entities']['urls']]
9. The getTweets method calls the searchTwitter method described previously. The getTweets method ensures that API calls are made reliably whilst respecting the imposed rate limit. The code is as follows:

    def getTweets(self, q, max_res=10):
        """
        Make a Twitter API call whilst managing rate limit and errors.
        """
        def handleError(e, wait_period=2, sleep_when_rate_limited=True):
            if wait_period > 3600:    # Seconds
                self.logger.error('Too many retries in getTweets: %s', e)
                raise e
            if e.e.code == 401:
                self.logger.error('error 401 * Not Authorised * in getTweets: %s', e)
                return None
            elif e.e.code == 404:
                self.logger.error('error 404 * Not Found * in getTweets: %s', e)
                return None
            elif e.e.code == 429:
                self.logger.error('error 429 * API Rate Limit Exceeded * in getTweets: %s', e)
                if sleep_when_rate_limited:
                    self.logger.error('error 429 * Retrying in 15 minutes * in getTweets: %s', e)
                    sys.stderr.flush()
                    time.sleep(60 * 15 + 5)
                    self.logger.info('error 429 * Retrying now * in getTweets: %s', e)
                    return 2
                else:
                    raise e    # Caller must handle the rate limiting issue
            elif e.e.code in (500, 502, 503, 504):
                self.logger.info('Encountered %i Error. Retrying in %i seconds' % (e.e.code, wait_period))
                time.sleep(wait_period)
                wait_period *= 1.5
                return wait_period
            else:
                self.logger.error('Exit - aborting - %s', e)
                raise e
10. Here, we are calling the searchTwitter API with the relevant query based on the parameters specified. If we encounter any error such as rate limitation from the provider, this will be processed by the handleError method:

        wait_period = 2
        while True:
            try:
                return self.searchTwitter(q, max_res=10)
            except twitter.api.TwitterHTTPError as e:
                error_count = 0
                wait_period = handleError(e, wait_period)
                if wait_period is None:
                    return
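The retry policy handleError applies to transient 5xx errors is an exponential backoff: wait, then grow the wait by a factor of 1.5 per retry, and abort once the wait exceeds one hour. A hypothetical standalone sketch of that schedule (not the book's code, and without the actual sleeps):

```python
def backoff_schedule(wait_period=2.0, cap=3600.0):
    """Yield successive wait periods until the one-hour cap is exceeded."""
    while wait_period <= cap:
        yield wait_period
        wait_period *= 1.5    # same growth factor as handleError

waits = list(backoff_schedule())
print(waits[:4])       # -> [2.0, 3.0, 4.5, 6.75]
print(len(waits))      # -> 19 retries fit under the one-hour cap
```

Growing the delay geometrically keeps the retry count small while giving a struggling server progressively more time to recover.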
Exploring data using Blaze

Blaze is an open source Python library, primarily developed by Continuum.io, leveraging Python Numpy arrays and Pandas dataframes. Blaze extends to out-of-core computing, while Pandas and Numpy are single-core.

Blaze offers an adaptable, unified, and consistent user interface across various backends. Blaze orchestrates the following:

Data: Seamless exchange of data across storages such as CSV, JSON, HDF5, HDFS, and Bcolz files.
Computation: Using the same query processing against computational backends such as Spark, MongoDB, Pandas, or SQLAlchemy.
Symbolic expressions: Abstract expressions such as join, group-by, filter, selection, and projection with a syntax similar to Pandas but limited in scope. Implements the split-apply-combine methods pioneered by the R language.

Blaze expressions are lazily evaluated and in that respect share a similar processing paradigm with Spark RDD transformations.
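The split-apply-combine pattern mentioned above can be shown in plain Python with itertools.groupby; this is an illustrative sketch on made-up tweet counts, not a Blaze expression:

```python
from itertools import groupby
from operator import itemgetter

# (user_name, tweet_count) pairs; the data here is illustrative.
rows = [('raulsaeztapia', 1), ('John Humphreys', 1), ('raulsaeztapia', 1)]

# Split on user name (groupby needs sorted input), apply a sum over
# each group, and combine the per-group results into one dictionary.
rows_sorted = sorted(rows, key=itemgetter(0))
counts = {user: sum(n for _, n in grp)
          for user, grp in groupby(rows_sorted, key=itemgetter(0))}
print(counts)  # -> {'John Humphreys': 1, 'raulsaeztapia': 2}
```

Blaze's by expression (and Spark's groupBy) perform the same split-apply-combine, but lazily and potentially out of core.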
Let's dive into Blaze by first importing the necessary libraries: numpy, pandas, blaze and odo. Odo is a spin-off of Blaze and ensures data migration from various backends. The commands are as follows:

import numpy as np
import pandas as pd
from blaze import Data, by, join, merge
from odo import odo
BokehJS successfully loaded.
We create a Pandas Dataframe by reading the parsed tweets saved in a CSV file, twts_csv:

twts_pd_df = pd.DataFrame(twts_csv_read, columns=Tweet01._fields)
twts_pd_df.head()
Out[65]:
   id                  created_at           user_id   user_name      tweet_text                                         url
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://www.webex.com/ciscospark/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://sparkjava.com/
We run the Tweets Pandas Dataframe through the describe() function to get some overall information on the dataset:

twts_pd_df.describe()
Out[66]:
        id                  created_at           user_id   user_name      tweet_text                                         url
count   19                  19                   19        19             19                                                 19
unique  7                   7                    6         6              6                                                  7
top     598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://bit.ly/1Hfd0Xm
freq    6                   6                    9         9              6                                                  6
We convert the Pandas dataframe into a Blaze dataframe by simply passing it through the Data() function:

#
# Blaze dataframe
#
twts_bz_df = Data(twts_pd_df)
We can retrieve the schema representation of the Blaze dataframe by calling the schema function:

twts_bz_df.schema
Out[73]:
dshape("""{
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

The .dshape function gives a record count and the schema:

twts_bz_df.dshape
Out[74]:
dshape("""19 * {
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")
We can print the Blaze dataframe content:

twts_bz_df.data
Out[75]:
    id                  created_at           user_id     user_name            tweet_text                                         url
1   598831111406510082  2015-05-14 12:43:57  14755521    raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2   598831111406510082  2015-05-14 12:43:57  14755521    raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
...
18  598782970082807808  2015-05-14 09:32:39  1377652806  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w...  http://buff.ly/1QBpk8J
19  598777933730160640  2015-05-14 09:12:38  294862170   Ellen Friedman       I'm still on Euro time. If you are too check o...  http://bit.ly/1Hfd0Xm
We extract the column tweet_text and take the unique values:

twts_bz_df.tweet_text.distinct()
Out[76]:
   tweet_text
0  RT @pacoid: Great recap of @StrataConf EU in L...
1  RT @alvaroagea: Simply @ApacheSpark http://t.c...
2  RT @PrabhaGana: What exactly is @ApacheSpark a...
3  RT @Ellen_Friedman: I'm still on Euro time. If...
4  RT @BigDataTechCon: Moving Rating Prediction w...
5  I'm still on Euro time. If you are too check o...
We extract multiple columns ['id', 'user_name', 'tweet_text'] from the dataframe and take the unique records:

twts_bz_df[['id', 'user_name', 'tweet_text']].distinct()
Out[78]:
   id                  user_name            tweet_text
0  598831111406510082  raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L...
1  598808944719593472  raulsaeztapia        RT @alvaroagea: Simply @ApacheSpark http://t.c...
2  598796205091500032  John Humphreys       RT @PrabhaGana: What exactly is @ApacheSpark a...
3  598788561127735296  Leonardo D'Ambrosi   RT @Ellen_Friedman: I'm still on Euro time. If...
4  598785545557438464  Alexey Kosenkov      RT @Ellen_Friedman: I'm still on Euro time. If...
5  598782970082807808  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w...
6  598777933730160640  Ellen Friedman       I'm still on Euro time. If you are too check o...
Transferring data using Odo

Odo is a spin-off project of Blaze. Odo allows the interchange of data. Odo ensures the migration of data across different formats (CSV, JSON, HDFS, and more) and across different databases (SQL databases, MongoDB, and so on) using a very simple predicate:

odo(source, target)

To transfer to a database, the address is specified using a URL. For example, for a MongoDB database, it would look like this:

mongodb://username:password@hostname:port/database_name::collection_name
Let's run some examples of using Odo. Here, we illustrate odo by reading a CSV file and creating a Blaze dataframe:

filepath = csvFpath
filename = csvFname
filesuffix = csvSuffix
twts_odo_df = Data('{0}/{1}.{2}'.format(filepath, filename, filesuffix))

Count the number of records in the dataframe:

twts_odo_df.count()
Out[81]:
19
Display the five initial records of the dataframe:

twts_odo_df.head(5)
Out[82]:
   id                  created_at           user_id   user_name      tweet_text                                         url
0  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://www.webex.com/ciscospark/
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://sparkjava.com/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  https://www.sparkfun.com/
Get dshape information from the dataframe, which gives us the number of records and the schema:

twts_odo_df.dshape
Out[83]:
dshape("""var * {
  id: int64,
  created_at: ?datetime,
  user_id: int64,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")
Save a processed Blaze dataframe into JSON:

odo(twts_odo_distinct_df, '{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix))
Out[92]:
<odo.backends.json.JSONLines at 0x7f77f0abfc50>

Convert a JSON file to a CSV file:

odo('{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix), '{0}/{1}.{2}'.format(csvFpath, csvFname, csvSuffix))
Out[94]:
<odo.backends.csv.CSV at 0x7f77f0abfe10>
Exploring data using Spark SQL

Spark SQL is a relational query engine built on top of Spark Core. Spark SQL uses a query optimizer called Catalyst.

Relational queries can be expressed using SQL or HiveQL and executed against JSON, CSV, and various databases. Spark SQL gives us the full expressiveness of declarative programming with Spark dataframes on top of functional programming with RDDs.
Understanding Spark dataframes

Here's a diagram depicting the advent of Spark SQL and dataframes. It also highlights the various data sources in the lower part of the diagram. On the top part, we can notice R as the new language that will be gradually supported on top of Scala, Java, and Python. Ultimately, the DataFrame philosophy is pervasive between R, Python, and Spark.

Spark dataframes originate from SchemaRDDs. A dataframe combines an RDD with a schema that can be inferred by Spark, if requested, when registering the dataframe. It allows us to query complex nested JSON data with plain SQL. Lazy evaluation, lineage, partitioning, and persistence apply to dataframes.
Let's query the data with Spark SQL, by first importing SparkContext and SQLContext:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
In [95]:
sc
Out[95]:
<pyspark.context.SparkContext at 0x7f7829581890>
In [96]:
sc.master
Out[96]:
u'local[*]'
In [98]:
# Instantiate Spark SQL context
sqlc = SQLContext(sc)
We read in the JSON file we saved with Odo:

twts_sql_df_01 = sqlc.jsonFile("/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401_distinct.json")
In [101]:
twts_sql_df_01.show()
created_at           id                 tweet_text           user_id     user_name
2015-05-14T12:43:57Z 598831111406510082 RT @pacoid: Great... 14755521    raulsaeztapia
2015-05-14T11:15:52Z 598808944719593472 RT @alvaroagea: S... 14755521    raulsaeztapia
2015-05-14T10:25:15Z 598796205091500032 RT @PrabhaGana: W... 48695135    John Humphreys
2015-05-14T09:54:52Z 598788561127735296 RT @Ellen_Friedma... 2385931712  Leonardo D'Ambrosi
2015-05-14T09:42:53Z 598785545557438464 RT @Ellen_Friedma... 461020977   Alexey Kosenkov
2015-05-14T09:32:39Z 598782970082807808 RT @BigDataTechCo... 1377652806  embeddedcomputer.nl
2015-05-14T09:12:38Z 598777933730160640 I'm still on Euro... 294862170   Ellen Friedman
We print the schema of the Spark dataframe:

twts_sql_df_01.printSchema()
root
 |-- created_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- tweet_text: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)
We select the user_name column from the dataframe:
twts_sql_df_01.select('user_name').show()
user_name
raulsaeztapia
raulsaeztapia
John Humphreys
Leonardo D'Ambrosi
Alexey Kosenkov
embeddedcomputer.nl
Ellen Friedman
We register the dataframe as a table, so we can execute a SQL query on it:
twts_sql_df_01.registerAsTable('tweets_01')
We execute a SQL statement against the dataframe:
twts_sql_df_01_selection = sqlc.sql("SELECT * FROM tweets_01 WHERE user_name = 'raulsaeztapia'")
In [109]:
twts_sql_df_01_selection.show()
created_at           id                 tweet_text         user_id  user_name
2015-05-14T12:43:57Z 598831111406510082 RT @pacoid: Great… 14755521 raulsaeztapia
2015-05-14T11:15:52Z 598808944719593472 RT @alvaroagea: S… 14755521 raulsaeztapia
Let's process some more complex JSON; we read the original Twitter JSON file:
tweets_sqlc_inf = sqlc.jsonFile(infile)
Spark SQL is able to infer the schema of a complex nested JSON file:
tweets_sqlc_inf.printSchema()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |-- indices: array (nullable = true)
... (snip) ...
 |    |-- statuses_count: long (nullable = true)
 |    |-- time_zone: string (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- utc_offset: long (nullable = true)
 |    |-- verified: boolean (nullable = true)
We extract the key information of interest from the wall of data by selecting specific columns in the dataframe (in this case, ['created_at', 'id', 'text', 'user.id', 'user.name', 'entities.urls.expanded_url']):
tweets_extract_sqlc = tweets_sqlc_inf[['created_at', 'id', 'text', 'user.id', 'user.name', 'entities.urls.expanded_url']].distinct()
In [145]:
tweets_extract_sqlc.show()
created_at           id                 text               id         name                expanded_url
Thu May 14 09:32:... 598782970082807808 RT @BigDataTechCo… 1377652806 embeddedcomputer.nl ArrayBuffer(http:...
Thu May 14 12:43:... 598831111406510082 RT @pacoid: Great… 14755521   raulsaeztapia       ArrayBuffer(http:...
Thu May 14 12:18:... 598824733086523393 @rabbitonweb spea…
...
Thu May 14 12:28:... 598827171168264192 RT @baandrzejczak… 20909005   Paweł Szulc         ArrayBuffer()
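Spark SQL resolves those dotted column paths into the nested JSON structure. As a plain-Python illustration of the same idea (the sample tweet below is a simplified, made-up record, not actual harvested data), a dotted-path lookup can be sketched as:

```python
# Plain-Python sketch of the dotted-path selection performed above:
# 'user.id' and 'entities.urls.expanded_url' walk into the nested JSON.
def get_path(record, path):
    """Follow a dotted path into nested dicts; map over lists of dicts."""
    current = record
    for key in path.split('.'):
        if isinstance(current, list):
            current = [item[key] for item in current]
        else:
            current = current[key]
    return current

# Made-up, heavily simplified tweet record for illustration.
tweet = {
    'created_at': '2015-05-14T12:43:57Z',
    'id': 598831111406510082,
    'text': 'RT @pacoid: Great ...',
    'user': {'id': 14755521, 'name': 'raulsaeztapia'},
    'entities': {'urls': [{'expanded_url': 'http://example.com/a'}]},
}

columns = ['created_at', 'id', 'text', 'user.id', 'user.name',
           'entities.urls.expanded_url']
row = [get_path(tweet, c) for c in columns]
print(row)
```

Note how the path through the urls array yields a list of expanded URLs, which is why the dataframe output above shows ArrayBuffer values in that column.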
Understanding the Spark SQL query optimizer
We execute a SQL statement against the dataframe:
tweets_extract_sqlc_sel = sqlc.sql("SELECT * from Tweets_xtr_001 WHERE name = 'raulsaeztapia'")
We get a detailed view of the query plans executed by Spark SQL:
Parsed logical plan
Analyzed logical plan
Optimized logical plan
Physical plan
The query plan uses Spark SQL's Catalyst optimizer. In order to generate the compiled bytecode from the query parts, the Catalyst optimizer runs through logical plan parsing and optimization followed by physical plan evaluation and optimization based on cost.
This is illustrated in the following tweet:
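To get a feel for what a rule-based optimizer does, here is a toy sketch in plain Python. This is not Catalyst itself: it is a single invented rewrite rule, over an invented tuple-based plan representation, that pushes a filter below a projection so that rows are discarded before columns are shaped.

```python
# Toy rule-based plan rewrite (illustration only, not Catalyst).
# A plan node is a tuple: (operator, payload, child).
def push_filter_below_project(plan):
    """If a Filter sits on a Project, swap them so rows drop earlier."""
    if plan[0] == 'Filter' and plan[2][0] == 'Project':
        _, pred, (_, cols, child) = plan
        return ('Project', cols, ('Filter', pred, child))
    return plan

logical_plan = ('Filter', "name = 'raulsaeztapia'",
                ('Project', ['created_at', 'id', 'name'],
                 ('Relation', 'tweets', None)))

optimized = push_filter_below_project(logical_plan)
print(optimized)
```

Catalyst applies many such rules (predicate pushdown, constant folding, column pruning) repeatedly over the logical plan before costing physical alternatives.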
Looking back at our code, we call the .explain function on the Spark SQL query we just executed. It delivers the full details of the steps taken by the Catalyst optimizer in order to assess and optimize the logical and physical plans and get to the result RDD:
tweets_extract_sqlc_sel.explain(extended=True)
== Parsed Logical Plan ==
'Project [*]
 'Filter ('name = raulsaeztapia)
  'UnresolvedRelation [Tweets_xtr_001], None
== Analyzed Logical Plan ==
Project [created_at#7,id#12L,text#27,id#80L,name#81,expanded_url#82]
 Filter (name#81 = raulsaeztapia)
  Distinct
   Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
    Relation[contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29] JSONRelation(/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401.json,1.0,None)
== Optimized Logical Plan ==
Filter (name#81 = raulsaeztapia)
 Distinct
  Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
   Relation[contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29] JSONRelation(/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401.json,1.0,None)
== Physical Plan ==
Filter (name#81 = raulsaeztapia)
 Distinct false
  Exchange (HashPartitioning [created_at#7,id#12L,text#27,id#80L,name#81,expanded_url#82], 200)
   Distinct true
    Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
     PhysicalRDD [contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29], MapPartitionsRDD[165] at map at JsonRDD.scala:41
Code Generation: false
== RDD ==
Finally, here's the result of the query:
tweets_extract_sqlc_sel.show()
created_at           id                 text               id       name          expanded_url
Thu May 14 12:43:... 598831111406510082 RT @pacoid: Great… 14755521 raulsaeztapia ArrayBuffer(http:...
Thu May 14 11:15:... 598808944719593472 RT @alvaroagea: S… 14755521 raulsaeztapia ArrayBuffer(http:...
In [148]:
Loading and processing CSV files with Spark SQL
We will use the Spark package spark-csv_2.11:1.2.0. The command used to launch PySpark with the IPython Notebook and the spark-csv package should explicitly state the --packages argument:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
This will trigger the following output; we can see that the spark-csv package is installed with all its dependencies:
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
... (snip) ...
Ivy Default Cache set to: /home/an/.ivy2/cache
The jars for the packages stored in: /home/an/.ivy2/jars
:: loading settings :: url = jar:file:/home/an/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-csv_2.11;1.2.0 in central
        found org.apache.commons#commons-csv;1.1 in central
        found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 835ms :: artifacts dl 48ms
        :: modules in use:
        com.databricks#spark-csv_2.11;1.2.0 from central in [default]
        com.univocity#univocity-parsers;1.5.1 from central in [default]
        org.apache.commons#commons-csv;1.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/45ms)
We are now ready to load our CSV file and process it. Let's first instantiate the SQLContext:
#
# Read csv in a Spark DF
#
sqlContext = SQLContext(sc)
spdf_in = sqlContext.read.format('com.databricks.spark.csv')\
                    .options(delimiter=";", header='true')\
                    .load(csv_in)
We access the schema of the dataframe created from the loaded CSV:
In [10]:
spdf_in.printSchema()
root
 |-- : string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_name: string (nullable = true)
 |-- tweet_text: string (nullable = true)
We check the columns of the dataframe:
In [12]:
spdf_in.columns
Out[12]:
['', 'id', 'created_at', 'user_id', 'user_name', 'tweet_text']
We introspect the dataframe content:
In [13]:
spdf_in.show()
+---+------------------+--------------------+----------+------------------+--------------------+
|   |                id|          created_at|   user_id|         user_name|          tweet_text|
+---+------------------+--------------------+----------+------------------+--------------------+
|  0|638830426971181057|Tue Sep 01 21:46:...|3276255125|     True Equality| ernestsgantt: Bey…|
|  1|638830426727911424|Tue Sep 01 21:46:...|3276255125|     True Equality| ernestsgantt: Bey…|
|  2|638830425402556417|Tue Sep 01 21:46:...|3276255125|     True Equality| ernestsgantt: Bey…|
... (snip) ...
| 41|638830280988426250|Tue Sep 01 21:46:...| 951081582|      Jack Baldwin|  RT @cloudaus: We…|
| 42|638830276626399232|Tue Sep 01 21:46:...|   6525302|Masayoshi Nakamura|PynamoDB使いやすいです |
+---+------------------+--------------------+----------+------------------+--------------------+
only showing top 20 rows
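As a sanity check on the file layout, the same semicolon-delimited format can be parsed with Python's standard csv module. The two sample rows below are made up to mimic the shape of the harvested file, including the unnamed leading index column:

```python
# Cross-check of the semicolon-delimited layout with the stdlib csv module.
# The sample rows are invented to mimic the harvested file's shape.
import csv
import io

raw = (";id;created_at;user_id;user_name;tweet_text\n"
       "0;638830426971181057;Tue Sep 01 21:46:00;3276255125;True Equality;ernestsgantt: Bey...\n"
       "1;638830426727911424;Tue Sep 01 21:46:01;3276255125;True Equality;ernestsgantt: Bey...\n")

reader = csv.reader(io.StringIO(raw), delimiter=';')
header = next(reader)   # first column name is empty: the unnamed index
rows = list(reader)

print(header)
print(len(rows))
```

This confirms why spdf_in.columns starts with an empty string: the CSV carries an unnamed index column, which spark-csv loads as a column with an empty name.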
Querying MongoDB from Spark SQL
There are two major ways to interact with MongoDB from Spark: the first is through the Hadoop MongoDB connector, and the second is directly from Spark to MongoDB.
The first approach to interact with MongoDB from Spark is to set up a Hadoop environment and query through the Hadoop MongoDB connector. The connector details are hosted on GitHub at https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage. An actual use case is described in the series of blog posts from MongoDB:
Using MongoDB with Hadoop & Spark: Part 1 - Introduction & Setup (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup)
Using MongoDB with Hadoop and Spark: Part 2 - Hive Example (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example)
Using MongoDB with Hadoop & Spark: Part 3 - Spark Example & Key Takeaways (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways)
Setting up a full Hadoop environment is a bit elaborate. We will favor the second approach. We will use the spark-mongodb connector developed and maintained by Stratio, hosted at spark-packages.org. The package information and version can be found at spark-packages.org:
Note
Releases
Version: 0.10.1 ( 8263c8 | zip | jar ) / Date: 2015-11-18 / License: Apache-2.0 / Scala version: 2.10
(http://spark-packages.org/package/Stratio/spark-mongodb)
The command to launch PySpark with the IPython Notebook and the spark-mongodb package should explicitly state the --packages argument:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.1
This will trigger the following output; we can see that the spark-mongodb package is installed with all its dependencies:
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.1
... (snip) ...
Ivy Default Cache set to: /home/an/.ivy2/cache
The jars for the packages stored in: /home/an/.ivy2/jars
:: loading settings :: url = jar:file:/home/an/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.stratio.datasource#spark-mongodb_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.stratio.datasource#spark-mongodb_2.10;0.10.1 in central
[W 22:10:50.910 NotebookApp] Timeout waiting for kernel_info reply from 764081d3-baf9-4978-ad89-7735e6323cb6
        found org.mongodb#casbah-commons_2.10;2.8.0 in central
        found com.github.nscala-time#nscala-time_2.10;1.0.0 in central
        found joda-time#joda-time;2.3 in central
        found org.joda#joda-convert;1.2 in central
        found org.slf4j#slf4j-api;1.6.0 in central
        found org.mongodb#mongo-java-driver;2.13.0 in central
        found org.mongodb#casbah-query_2.10;2.8.0 in central
        found org.mongodb#casbah-core_2.10;2.8.0 in central
downloading https://repo1.maven.org/maven2/com/stratio/datasource/spark-mongodb_2.10/0.10.1/spark-mongodb_2.10-0.10.1.jar ...
        [SUCCESSFUL ] com.stratio.datasource#spark-mongodb_2.10;0.10.1!spark-mongodb_2.10.jar (3130ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-commons_2.10/2.8.0/casbah-commons_2.10-2.8.0.jar ...
        [SUCCESSFUL ] org.mongodb#casbah-commons_2.10;2.8.0!casbah-commons_2.10.jar (2812ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-query_2.10/2.8.0/casbah-query_2.10-2.8.0.jar ...
        [SUCCESSFUL ] org.mongodb#casbah-query_2.10;2.8.0!casbah-query_2.10.jar (1432ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-core_2.10/2.8.0/casbah-core_2.10-2.8.0.jar ...
        [SUCCESSFUL ] org.mongodb#casbah-core_2.10;2.8.0!casbah-core_2.10.jar (2785ms)
downloading https://repo1.maven.org/maven2/com/github/nscala-time/nscala-time_2.10/1.0.0/nscala-time_2.10-1.0.0.jar ...
        [SUCCESSFUL ] com.github.nscala-time#nscala-time_2.10;1.0.0!nscala-time_2.10.jar (2725ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.0/slf4j-api-1.6.0.jar ...
        [SUCCESSFUL ] org.slf4j#slf4j-api;1.6.0!slf4j-api.jar (371ms)
downloading https://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar ...
        [SUCCESSFUL ] org.mongodb#mongo-java-driver;2.13.0!mongo-java-driver.jar (5259ms)
downloading https://repo1.maven.org/maven2/joda-time/joda-time/2.3/joda-time-2.3.jar ...
        [SUCCESSFUL ] joda-time#joda-time;2.3!joda-time.jar (6949ms)
downloading https://repo1.maven.org/maven2/org/joda/joda-convert/1.2/joda-convert-1.2.jar ...
        [SUCCESSFUL ] org.joda#joda-convert;1.2!joda-convert.jar (548ms)
:: resolution report :: resolve 11850ms :: artifacts dl 26075ms
        :: modules in use:
        com.github.nscala-time#nscala-time_2.10;1.0.0 from central in [default]
        com.stratio.datasource#spark-mongodb_2.10;0.10.1 from central in [default]
        joda-time#joda-time;2.3 from central in [default]
        org.joda#joda-convert;1.2 from central in [default]
        org.mongodb#casbah-commons_2.10;2.8.0 from central in [default]
        org.mongodb#casbah-core_2.10;2.8.0 from central in [default]
        org.mongodb#casbah-query_2.10;2.8.0 from central in [default]
        org.mongodb#mongo-java-driver;2.13.0 from central in [default]
        org.slf4j#slf4j-api;1.6.0 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   9   |   9   |   9   |   0   ||   9   |   9   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        9 artifacts copied, 0 already retrieved (2335kB/51ms)
... (snip) ...
We are now ready to query MongoDB on localhost:27017 from the collection twtr01_coll in the database twtr01_db.
We first import the SQLContext and instantiate it:
In [5]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.sql("CREATE TEMPORARY TABLE tweet_table USING com.stratio.datasource.mongodb OPTIONS (host 'localhost:27017', database 'twtr01_db', collection 'twtr01_coll')")
sqlContext.sql("SELECT * FROM tweet_table where id=598830778269769728").collect()
Here's the output of our query:
Out[5]:
[Row(text=u'@spark_io is now @particle - awesome news - now I can enjoy my Particle Cores/Photons + @sparkfun sensors + @ApacheSpark analytics :-)', _id=u'55aa640fd770871cba74cb88', contributors=None, retweeted=False, user=Row(contributors_enabled=False, created_at=u'Mon Aug 25 14:01:26 +0000 2008', default_profile=True, default_profile_image=False, description=u'Building open source tools for and teaching enterprise software developers', entities=Row(description=Row(urls=[]), url=Row(urls=[Row(url=u'http://t.co/TSHp13EWeu', indices=[0, 22],
... (snip) ...
9], name=u'Spark is Particle', screen_name=u'spark_io'), Row(id=487010011, id_str=u'487010011', indices=[17, 26], name=u'Particle', screen_name=u'particle'), Row(id=17877351, id_str=u'17877351', indices=[88, 97], name=u'SparkFun Electronics', screen_name=u'sparkfun'), Row(id=1551361069, id_str=u'1551361069', indices=[108, 120], name=u'Apache Spark', screen_name=u'ApacheSpark')]), is_quote_status=None, lang=u'en', quoted_status_id_str=None, quoted_status_id=None, created_at=u'Thu May 14 12:42:37 +0000 2015', retweeted_status=None, truncated=False, place=None, id=598830778269769728, in_reply_to_user_id=3187046084, retweet_count=0, in_reply_to_status_id=None, in_reply_to_screen_name=u'spark_io', in_reply_to_user_id_str=u'3187046084', source=u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', id_str=u'598830778269769728', coordinates=None, metadata=Row(iso_language_code=u'en', result_type=u'recent'), quoted_status=None)]
Summary
In this chapter, we harvested data from Twitter. Once the data was acquired, we explored the information using Continuum.io's Blaze and Odo libraries. Spark SQL is an important module for interactive data exploration, analysis, and transformation, leveraging the Spark dataframe data structure. The dataframe concept originates from R and was then adopted by Python Pandas with great success. The dataframe is the workhorse of the data scientist. The combination of Spark SQL and dataframes creates a powerful engine for data processing.
We are now gearing up for extracting the insights from the datasets using machine learning from Spark MLlib.
Chapter 4. Learning from Data Using Spark
As we have laid the foundation for data to be harvested in the previous chapter, we are now ready to learn from the data. Machine learning is about drawing insights from data. Our objective is to give an overview of Spark MLlib (short for Machine Learning library) and apply the appropriate algorithms to our dataset in order to derive insights. From the Twitter dataset, we will apply an unsupervised clustering algorithm in order to distinguish Apache Spark-relevant tweets from the rest. We have as initial input a mixed bag of tweets. We first need to preprocess the data in order to extract the relevant features, then apply the machine learning algorithm to our dataset, and finally evaluate the results and the performance of our model.
In this chapter, we will cover the following points:
Providing an overview of the Spark MLlib module with its algorithms and the typical machine learning workflow
Preprocessing the Twitter harvested dataset to extract the relevant features, applying an unsupervised clustering algorithm to identify Apache Spark-relevant tweets, and then evaluating the model and the results obtained
Describing the Spark machine learning pipeline
Contextualizing Spark MLlib in the app architecture
Let's first contextualize the focus of this chapter on the data-intensive app architecture. We will concentrate our attention on the analytics layer and more precisely machine learning. This will serve as a foundation for streaming apps, as we want to apply the learning from the batch processing of data as inference rules for the streaming analysis.
The following diagram sets the context of the chapter's focus, highlighting the machine learning module within the analytics layer, while using tools for exploratory data analysis, Spark SQL, and Pandas.
Classifying Spark MLlib algorithms
Spark MLlib is a rapidly evolving module of Spark, with new algorithms added with each release of Spark.
The following diagram provides a high-level overview of Spark MLlib algorithms grouped by the traditional broad machine learning techniques and by the categorical or continuous nature of the data:
We categorize the Spark MLlib algorithms in two columns, categorical or continuous, depending on the type of data. We distinguish between data that is categorical or more qualitative in nature versus continuous data, which is quantitative in nature. An example of qualitative data is predicting the weather; given the atmospheric pressure, the temperature, and the presence and type of clouds, the weather will be sunny, dry, rainy, or overcast. These are discrete values. On the other hand, let's say we want to predict house prices, given the location, square meterage, and the number of beds; the real estate value can be predicted using linear regression. In this case, we are talking about continuous or quantitative values.
The horizontal grouping reflects the type of machine learning method used. Unsupervised versus supervised machine learning techniques depend on whether the training data is labeled. In an unsupervised learning challenge, no labels are given to the learning algorithm. The goal is to find the hidden structure in its input. In the case of supervised learning, the data is labeled. The focus is on making predictions using regression if the data is continuous, or classification if the data is categorical.
An important category of machine learning is recommender systems, which leverage collaborative filtering techniques. The Amazon web store and Netflix have very powerful recommender systems powering their recommendations.
Stochastic Gradient Descent is one of the machine learning optimization techniques that is well suited for Spark's distributed computation.
For processing large amounts of text, Spark offers crucial libraries for feature extraction and transformation such as TF-IDF (short for Term Frequency-Inverse Document Frequency), Word2Vec, standard scaler, and normalizer.
Supervised and unsupervised learning
We delve more deeply here into the traditional machine learning algorithms offered by Spark MLlib. We distinguish between supervised and unsupervised learning depending on whether the data is labeled. We distinguish between categorical and continuous depending on whether the data is discrete or continuous.
The following diagram explains the Spark MLlib supervised and unsupervised machine learning algorithms and preprocessing techniques:
The following supervised and unsupervised MLlib algorithms and preprocessing techniques are currently available in Spark:
Clustering: This is an unsupervised machine learning technique where the data is not labeled. The aim is to extract structure from the data:
K-Means: This partitions the data into K distinct clusters
Gaussian Mixture: Clusters are assigned based on the maximum posterior probability of the component
Power Iteration Clustering (PIC): This groups vertices of a graph based on pairwise edge similarities
Latent Dirichlet Allocation (LDA): This is used to group collections of text documents into topics
Streaming K-Means: This dynamically clusters streaming data using a windowing function on the incoming data
Dimensionality Reduction: This aims to reduce the number of features under consideration. Essentially, this reduces noise in the data and focuses on the key features:
Singular Value Decomposition (SVD): This breaks the matrix that contains the data into simpler meaningful pieces. It factorizes the initial matrix into three matrices.
Principal Component Analysis (PCA): This approximates a high-dimensional dataset with a low-dimensional subspace.
Regression and Classification: Regression predicts output values using labeled training data, while Classification groups the results into classes. Classification has dependent variables that are categorical or unordered, whilst Regression has dependent variables that are continuous and ordered:
Linear Regression Models (linear regression, logistic regression, and support vector machines): Linear regression algorithms can be expressed as convex optimization problems that aim to minimize an objective function based on a vector of weight variables. The objective function controls the complexity of the model through the regularized part of the function and the error of the model through the loss part of the function.
Naive Bayes: This makes predictions based on the conditional probability distribution of a label given an observation. It assumes that features are mutually independent of each other.
Decision Trees: This performs recursive binary partitioning of the feature space. The information gain at the tree node level is maximized in order to determine the best split for the partition.
Ensembles of trees (Random Forests and Gradient-Boosted Trees): Tree ensemble algorithms combine base decision tree models in order to build a performant model. They are intuitive and very successful for classification and regression tasks.
Isotonic Regression: This minimizes the mean squared error between given data and observed responses.
Additional learning algorithms
Spark MLlib offers more algorithms than the supervised and unsupervised learning ones. Broadly, there are three more additional types of machine learning methods: recommender systems, optimization algorithms, and feature extraction.
The following additional MLlib algorithms are currently available in Spark:
Collaborative filtering: This is the basis for recommender systems. It creates a user-item association matrix and aims to fill the gaps. Based on other users and items, along with their ratings, it recommends an item that the target user has no ratings for. In distributed computing, one of the most successful algorithms is ALS (short for Alternating Least Squares):
Alternating Least Squares: This matrix factorization technique incorporates implicit feedback, temporal effects, and confidence levels. It decomposes the large user-item matrix into lower-dimensional user and item factors. It minimizes a quadratic loss function by alternately fixing one of its factors.
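The alternating update can be sketched for a rank-1 factorization in plain Python. This is an illustration of the idea rather than MLlib's implementation; the 2x2 ratings matrix below is a toy example chosen to be exactly rank 1:

```python
# Toy rank-1 Alternating Least Squares in plain Python.
# We factor R ~ u * v^T by alternately solving for u with v fixed,
# then for v with u fixed; each step is a closed-form least-squares solve.
R = [[4.0, 2.0],
     [2.0, 1.0]]          # toy user-item ratings matrix (rank 1 by design)
u = [1.0, 1.0]            # user factors
v = [1.0, 1.0]            # item factors

for _ in range(10):
    # Fix v, solve for each user factor u[i].
    u = [sum(R[i][j] * v[j] for j in range(2)) / sum(x * x for x in v)
         for i in range(2)]
    # Fix u, solve for each item factor v[j].
    v = [sum(R[i][j] * u[i] for i in range(2)) / sum(x * x for x in u)
         for j in range(2)]

# Squared reconstruction error; close to 0 for this rank-1 matrix.
error = sum((R[i][j] - u[i] * v[j]) ** 2 for i in range(2) for j in range(2))
print(error)
```

Real recommender data is a very large, mostly empty matrix; ALS in MLlib distributes these alternating solves across the cluster and only sums over the observed ratings.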
Feature extraction and transformation: These are essential techniques for large text document processing. They include the following techniques:
Term Frequency: Search engines use TF-IDF to score and rank document relevance in a vast corpus. It is also used in machine learning to determine the importance of a word in a document or corpus. Term frequency statistically determines the weight of a term relative to its frequency in the corpus. Term frequency on its own can be misleading as it overemphasizes words such as the, of, or and that give little information. Inverse Document Frequency provides the specificity or the measure of the amount of information, whether the term is rare or common across all the documents in the corpus.
Word2Vec: This includes two models, Skip-Gram and Continuous Bag of Words. Skip-Gram predicts neighboring words given a word, based on sliding windows of words, while Continuous Bag of Words predicts the current word given the neighboring words.
StandardScaler: As part of preprocessing, the dataset must often be standardized by mean removal and variance scaling. We compute the mean and standard deviation on the training data and apply the same transformation to the test data.
Normalizer: We scale the samples to have unit norm. It is useful for quadratic forms such as the dot product or kernel methods.
Feature selection: This reduces the dimensionality of the vector space by selecting the most relevant features for the model.
Chi-Square Selector: This is a statistical method to measure the independence of two events.
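The TF-IDF weighting just described can be sketched in a few lines of plain Python; the three-document corpus below is a made-up toy, and the formulas are the textbook definitions rather than MLlib's exact variant:

```python
# Minimal TF-IDF in plain Python, showing why raw term frequency alone
# overweights ubiquitous words like 'the'. Toy three-document corpus.
import math

corpus = [
    ['the', 'spark', 'streaming', 'app'],
    ['the', 'spark', 'mllib', 'model'],
    ['the', 'fashion', 'show'],
]

def tf(term, doc):
    """Term frequency: share of the document the term accounts for."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of corpus size over doc count."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

doc = corpus[0]
print(tf_idf('the', doc, corpus))   # 'the' appears everywhere: weight 0
print(tf_idf('streaming', doc, corpus))
print(tf_idf('spark', doc, corpus))
```

Here 'the' gets zero weight because it occurs in every document, while 'streaming' (unique to one document) outweighs 'spark' (shared by two). MLlib's HashingTF/IDF pair applies the same idea at scale, hashing terms into a fixed-size feature vector.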
Optimization: These specific Spark MLlib optimization algorithms focus on various techniques of gradient descent. Spark provides a very efficient implementation of gradient descent on a distributed cluster of machines. It looks for the local minima by iteratively going down the steepest descent. It is compute-intensive as it iterates through all the data available:
Stochastic Gradient Descent: We minimize an objective function that is the sum of differentiable functions. Stochastic Gradient Descent uses only a sample of the training data in order to update a parameter in a particular iteration. It is used for large-scale and sparse machine learning problems such as text classification.
Limited-memory BFGS (L-BFGS): As the name says, L-BFGS uses limited memory and suits the distributed optimization algorithm implementation of Spark MLlib.
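The stochastic update rule can be sketched in plain Python on a toy one-parameter regression. This illustrates the per-sample update idea only, not MLlib's distributed implementation; the data and learning rate are invented for the example:

```python
# Stochastic Gradient Descent sketch: fit y = w * x one sample at a time.
# Toy data generated from the true weight w = 3.
import random

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w = 0.0        # initial weight
lr = 0.05      # learning rate

random.seed(42)
for step in range(200):
    x, y = random.choice(data)      # a single sampled observation
    grad = 2 * (w * x - y) * x      # gradient of the squared error on it
    w -= lr * grad                  # stochastic update

print(w)  # close to 3.0
```

Each update touches one observation instead of the whole dataset, which is what makes the method attractive for large, sparse problems.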
Spark MLlib data types
MLlib supports four essential data types: local vector, labeled point, local matrix, and distributed matrix. These data types are widely used in Spark MLlib algorithms:
Local vector: This resides in a single machine. It can be dense or sparse:
A dense vector is a traditional array of doubles. An example of a dense vector is [5.0, 0.0, 1.0, 7.0].
A sparse vector uses integer indices and double values. The sparse representation of the vector [5.0, 0.0, 1.0, 7.0] would be (4, [0, 2, 3], [5.0, 1.0, 7.0]), where 4 represents the dimension of the vector.
Here's an example of a local vector in PySpark:
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# NumPy array for dense vector.
dvect1 = np.array([5.0, 0.0, 1.0, 7.0])
# Python list for dense vector.
dvect2 = [5.0, 0.0, 1.0, 7.0]
# SparseVector creation
svect1 = Vectors.sparse(4, [0, 2, 3], [5.0, 1.0, 7.0])
# Sparse vector using a single-column SciPy csc_matrix
svect2 = sps.csc_matrix((np.array([5.0, 1.0, 7.0]), np.array([0, 2, 3]), np.array([0, 3])), shape=(4, 1))
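The sparse form (size, indices, values) carries exactly the same information as the dense array. A plain-Python round trip, independent of Spark, makes this explicit:

```python
# Round trip between dense and sparse vector representations.
def to_sparse(dense):
    """Keep only the non-zero entries: (size, indices, values)."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

def to_dense(size, indices, values):
    """Rebuild the full array, filling unlisted positions with 0.0."""
    dense = [0.0] * size
    for i, x in zip(indices, values):
        dense[i] = x
    return dense

dense = [5.0, 0.0, 1.0, 7.0]
size, indices, values = to_sparse(dense)
print((size, indices, values))   # (4, [0, 2, 3], [5.0, 1.0, 7.0])
print(to_dense(size, indices, values) == dense)
```

For feature vectors with thousands of dimensions and only a handful of non-zero entries, as with TF-IDF features, the sparse form saves a great deal of memory.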
Labeled point: A labeled point is a dense or sparse vector with a label, used in supervised learning. In the case of binary labels, 0.0 represents the negative label whilst 1.0 represents the positive value.
Here's an example of a labeled point in PySpark:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Labeled point with a positive label and a dense feature vector.
lp_pos = LabeledPoint(1.0, [5.0, 0.0, 1.0, 7.0])
# Labeled point with a negative label and a sparse feature vector.
lp_neg = LabeledPoint(0.0, SparseVector(4, [0, 2, 3], [5.0, 1.0, 7.0]))
Local matrix: A local matrix resides in a single machine, with integer-typed indices and values of type double.
Here's an example of a local matrix in PySpark:
from pyspark.mllib.linalg import Matrix, Matrices

# Dense matrix ((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
dMatrix = Matrices.dense(2, 3, [1, 2, 3, 4, 5, 6])
# Sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sMatrix = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
Distributed matrix: Leveraging the distributed nature of the RDD, distributed matrices can be shared in a cluster of machines. We distinguish four distributed matrix types: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix:
RowMatrix: This takes an RDD of vectors and creates a distributed matrix of rows with meaningless indices, called RowMatrix, from the RDD of vectors.
IndexedRowMatrix: In this case, the row indices are meaningful. First, we create an RDD of indexed rows using the class IndexedRow, and then we create an IndexedRowMatrix.
CoordinateMatrix: This is useful to represent very large and very sparse matrices. A CoordinateMatrix is created from RDDs of MatrixEntry points, each represented by a tuple of type (long, long, float).
BlockMatrix: These are created from RDDs of sub-matrix blocks, where a sub-matrix block is ((blockRowIndex, blockColIndex), sub-matrix).
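The CoordinateMatrix layout can be illustrated in plain Python: entries are (row, col, value) tuples, and materializing a small toy matrix shows why the format suits very sparse data. The entries below reproduce the sparse local matrix example given earlier:

```python
# A CoordinateMatrix stores only the non-zero (row, col, value) entries.
# Toy 3x2 sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)).
entries = [(0, 0, 9.0), (1, 1, 8.0), (2, 1, 6.0)]

n_rows, n_cols = 3, 2
dense = [[0.0] * n_cols for _ in range(n_rows)]
for row, col, value in entries:
    dense[row][col] = value

print(dense)  # [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

In Spark, those entries would be an RDD of MatrixEntry objects spread across the cluster; only the non-zero cells are ever stored or shuffled.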
Machine learning workflows and dataflows
Beyond algorithms, machine learning is also about processes. We will discuss the typical workflows and dataflows of supervised and unsupervised machine learning.
Supervised machine learning workflows
In supervised machine learning, the input training dataset is labeled. One of the key data practices is to split the input data into training and test sets, and validate the model accordingly.
We typically go through a six-step process flow in supervised learning:
1. Collect the data: This step essentially ties in with the previous chapter and ensures we collect the right data, with the right volume and granularity, in order to enable the machine learning algorithm to provide reliable answers.
2. Preprocess the data: This step is about checking the data quality by sampling, filling in the missing values if any, and scaling and normalizing the data. We also define the feature extraction process. Typically, in the case of large text-based datasets, we apply tokenization, stopwords removal, stemming, and TF-IDF. In the case of supervised learning, we separate the input data into a training and test set. We can also implement various strategies of sampling and splitting the dataset for cross-validation purposes.
3. Ready the data: In this step, we get the data in the format or data type expected by the algorithms. In the case of Spark MLlib, this includes local vectors, dense or sparse vectors, labeled points, local matrices, and distributed matrices with row matrix, indexed row matrix, coordinate matrix, and block matrix.
4. Model: In this step, we apply the algorithms that are suitable for the problem at hand and get the results for evaluation of the most suitable algorithm in the evaluate step. We might have multiple algorithms suitable for the problem; their respective performance will be scored in the evaluate step to select the best performing ones. We can implement an ensemble or combination of models in order to reach the best results.
5. Optimize: We may need to run a grid search for the optimal parameters of certain algorithms. These parameters are determined during training, and fine-tuned during the testing and production phases.
6. Evaluate: We ultimately score the models and select the best one in terms of accuracy, performance, reliability, and scalability. We move the best performing model to test with the held-out test data in order to ascertain the prediction accuracy of our model. Once satisfied with the fine-tuned model, we move it to production to process live data.
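The train/test split mentioned in the preprocessing step can be sketched in plain Python: shuffle with a fixed seed, then hold out a fraction for testing. The 80/20 ratio and stand-in records are illustrative choices:

```python
# Reproducible 80/20 train/test split in plain Python.
import random

samples = list(range(100))        # stand-ins for labeled records
random.seed(7)                    # fixed seed for reproducibility
random.shuffle(samples)

split = int(len(samples) * 0.8)
train, test = samples[:split], samples[split:]

print(len(train), len(test))      # 80 20
print(set(train) & set(test))     # empty: no leakage between the sets
```

The key property is that the test set stays untouched until the evaluate step, so the reported accuracy reflects performance on data the model has never seen.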
The supervised machine learning workflow and dataflow are represented in the following diagram:
Unsupervised machine learning workflows
As opposed to supervised learning, our initial data is not labeled in the case of unsupervised learning, which is most often the case in real life. We will extract the structure from the data by using clustering or dimensionality reduction algorithms. In the unsupervised learning case, we do not split the data into training and test sets, as we cannot make any predictions because the data is not labeled. We will train the data along six steps similar to those in supervised learning. Once the model is trained, we will evaluate the results, fine-tune the model, and then release it for production.
Unsupervised learning can be a preliminary step to supervised learning. Namely, we look at reducing the dimensionality of the data prior to attacking the learning phase.
The unsupervised machine learning workflows and dataflow are represented as follows:
ClusteringtheTwitterdatasetLet’sfirstgetafeelforthedataextractedfromTwitterandgetanunderstandingofthedatastructureinordertoprepareandrunitthroughtheK-Meansclusteringalgorithms.Ourplanofattackusestheprocessanddataflowdepictedearlierforunsupervisedlearning.Thestepsareasfollows:
1. Combine all tweet files into a single dataframe.
2. Parse the tweets, remove stop words, extract emoticons, extract URLs, and finally normalize the words (for example, mapping them to lowercase and removing punctuation and numbers).
3. Feature extraction includes the following:
   Tokenization: This breaks down the parsed tweet text into individual words or tokens
   TF-IDF: This applies the TF-IDF algorithm to create feature vectors from the tokenized tweet texts
   Hash TF-IDF: This applies a hashing function to the token vectors
4. Run the K-Means clustering algorithm.
5. Evaluate the results of the K-Means clustering:
   Identify tweet membership to clusters
   Perform dimensionality reduction to two dimensions with the Multi-Dimensional Scaling or the Principal Component Analysis algorithm
   Plot the clusters
6. Pipeline:
   Fine-tune the number of relevant clusters K
   Measure the model cost
   Select the optimal model
Applying Scikit-Learn on the Twitter dataset

Python's own Scikit-Learn machine learning library is one of the most reliable, intuitive, and robust tools around. Let's run through preprocessing and unsupervised learning using Pandas and Scikit-Learn. It is often beneficial to explore a sample of the data using Scikit-Learn before spinning off clusters with Spark MLlib.
We have a mixed bag of 7,540 tweets. It contains tweets related to Apache Spark, Python, the upcoming presidential election with Hillary Clinton and Donald Trump as protagonists, and some tweets related to fashion and music with Lady Gaga and Justin Bieber. We are running the K-Means clustering algorithm using Python Scikit-Learn on the harvested Twitter dataset. We first load the sample data into a Pandas dataframe:
import pandas as pd

csv_in = 'C:\\Users\\Amit\\Documents\\IPython Notebooks\\AN00_Data\\unq_tweetstxt.csv'
twts_df01 = pd.read_csv(csv_in, sep=';', encoding='utf-8')

In [24]:
twts_df01.count()
Out[24]:
Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64
#
# Introspecting the tweets' text
#
In [82]:
twtstxt_ls01[6910:6920]
Out[82]:
['RT @deroach_Ismoke: I am NOT voting for #hilaryclinton
http://t.co/jaZZpcHkkJ',
 'RT @AnimalRightsJen: #HilaryClinton What do Bernie Sanders and Donald
Trump Have in Common?: He has so far been th… http://t.co/t2YRcGCh6…',
 'I understand why Bill was out banging other chicks….....I mean look at
what he is married to…..\n@HilaryClinton',
 '#HilaryClinton What do Bernie Sanders and Donald Trump Have in Common?:
He has so far been th… http://t.co/t2YRcGCh67 #Tcot #UniteBlue']
We first perform a feature extraction from the tweets' text. We apply a sparse vectorizer to the dataset using a TF-IDF vectorizer with 10,000 features and English stop words:
In [37]:
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()

Extracting features from the training dataset using a sparse vectorizer

In [38]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words='english',
                             use_idf=True)
X = vectorizer.fit_transform(twtstxt_ls01)
#
# Output of the TF-IDF feature vectorizer
#
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

done in 5.232165s
n_samples: 7540, n_features: 6638
As the dataset is now broken into 7,540 samples with vectors of 6,638 features, we are ready to feed this sparse matrix to the K-Means clustering algorithm. We will choose seven clusters and 100 maximum iterations initially:
In [47]:
km = KMeans(n_clusters=7, init='k-means++', max_iter=100, n_init=1,
            verbose=1)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))

Clustering sparse data with KMeans(copy_x=True, init='k-means++',
    max_iter=100, n_clusters=7, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=1)
Initialization complete
Iteration  0, inertia 13635.141
Iteration  1, inertia 6943.485
Iteration  2, inertia 6924.093
Iteration  3, inertia 6915.004
Iteration  4, inertia 6909.212
Iteration  5, inertia 6903.848
Iteration  6, inertia 6888.606
Iteration  7, inertia 6863.226
Iteration  8, inertia 6860.026
Iteration  9, inertia 6859.338
Iteration 10, inertia 6859.213
Iteration 11, inertia 6859.102
Iteration 12, inertia 6859.080
Iteration 13, inertia 6859.060
Iteration 14, inertia 6859.047
Iteration 15, inertia 6859.039
Iteration 16, inertia 6859.032
Iteration 17, inertia 6859.031
Iteration 18, inertia 6859.029
Converged at iteration 18
done in 1.701s
The K-Means clustering algorithm converged after 18 iterations. We see in the following results the seven clusters with their respective keywords. Clusters 0 and 6 are about music and fashion with Justin Bieber- and Lady Gaga-related tweets. Clusters 1 and 5 are related to the U.S. presidential elections with Donald Trump- and Hillary Clinton-related tweets. Clusters 2 and 3 are the ones of interest to us as they are about Apache Spark and Python. Cluster 4 contains Thailand-related tweets:
#
# Introspect top terms per cluster
#
In [49]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(7):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind], end='')
    print()

Top terms per cluster:
Cluster 0: justinbieber love mean rt follow thank hi https whatdoyoumean video wanna hear whatdoyoumean viral rorykramer happy lol making person dream justin
Cluster 1: donaldtrump hilaryclinton rt https trump2016 realdonaldtrump trump gop amp justinbieber president clinton emails oy8ltkstze tcot like berniesanders hilary people email
Cluster 2: bigdata apachespark hadoop analytics rt spark training chennai ibm datascience apache processing cloudera mapreduce data sap https vora transforming development
Cluster 3: apachespark python https rt spark data amp databricks using new learn hadoop ibm big apache continuumio bluemix learning join open
Cluster 4: ernestsgantt simbata3 jdhm2015 elsahel12 phuketdailynews dreamintentions beyhiveinfrance almtorta18 civipartnership 9_a_6 25whu72ep0 k7erhvu7wn fdmxxxcm3h osxuh2fxnt 5o5rmb0xhp jnbgkqn0dj ovap57ujdh dtzsz3lb6x sunnysai12345 sdcvulih6g
Cluster 5: trump donald donaldtrump starbucks trumpquote trumpforpresident oy8ltkstze https zfns7pxysx sillygoystump trump2016 news jeremy coffee corbyn ok7vc8aetz rt tonight
Cluster 6: ladygaga gaga lady rt https love follow horror cd story ahshotel american japan hotel humantrafficking music fashion diet queen ahs
We will visualize the results by plotting the clusters. We have 7,540 samples with 6,638 features. It will be impossible to visualize that many dimensions. We will use the Multi-Dimensional Scaling (MDS) algorithm to bring down the multidimensional features of the clusters into two tractable dimensions to be able to picture them:
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.manifold import MDS

MDS()
#
# Bring down the MDS to two dimensions (components) as we will plot
# the clusters
#
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]
In [67]:
#
# Set up colors per cluster using a dict
#
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a',
                  4: '#66a61e', 5: '#9990b3', 6: '#e8888a'}
#
# Set up cluster names using a dict
#
cluster_names = {0: 'Music, Pop',
                 1: 'USA Politics, Election',
                 2: 'Big Data, Spark',
                 3: 'Spark, Python',
                 4: 'Thailand',
                 5: 'USA Politics, Election',
                 6: 'Music, Pop'}

In [115]:
#
# ipython magic to show the matplotlib plots inline
#
%matplotlib inline
#
# Create dataframe which includes MDS results, cluster numbers and tweet
# texts to be displayed
#
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, txt=twtstxt_ls02_utf8))
ix_start = 2000
ix_stop = 2050
df01 = df[ix_start:ix_stop]

print(df01[['label', 'txt']])
print(len(df01))
print()

# Group by cluster
groups = df.groupby('label')
groups01 = df01.groupby('label')

# Set up the plot
fig, ax = plt.subplots(figsize=(17, 10))
ax.margins(0.05)
#
# Build the plot object
#
for name, group in groups01:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_names[name], color=cluster_colors[name],
            mec='none')
    ax.set_aspect('auto')
    ax.tick_params(
        axis='x',        # settings for x-axis
        which='both',
        bottom='off',
        top='off',
        labelbottom='off')
    ax.tick_params(
        axis='y',        # settings for y-axis
        which='both',
        left='off',
        top='off',
        labelleft='off')

ax.legend(numpoints=1)
#
# Add label in x, y position with tweet text
#
for i in range(ix_start, ix_stop):
    ax.text(df01.ix[i]['x'], df01.ix[i]['y'], df01.ix[i]['txt'], size=10)

plt.show()  # Display the plot
      label                     txt
2000      2  b'RT @BigDataTechCon: '
2001      3  b"@4Quant's presentat"
2002      2  b'Cassandra Summit 201'
Here's a plot of Cluster 2, Big Data and Spark, represented by blue dots, along with Cluster 3, Spark and Python, represented by red dots, and some sample tweets related to the respective clusters:
We have gained some good insights into the data with the exploration and processing done with Scikit-Learn. We will now focus our attention on Spark MLlib and take it for a ride on the Twitter dataset.
Preprocessing the dataset

Now, we will focus on feature extraction and engineering in order to ready the data for the clustering algorithm run. We instantiate the SparkContext and read the Twitter dataset into a Spark dataframe. We will then successively tokenize the tweet text data, apply a hashing term frequency algorithm to the tokens, and finally apply the Inverse Document Frequency algorithm and rescale the data. The code is as follows:
In [3]:
#
# Read csv in a Pandas DF
#
import pandas as pd

csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';',
                      encoding='utf-8')

In [4]:
sqlContext = SQLContext(sc)

In [5]:
#
# Convert a Pandas DF to a Spark DF
#
spdf_02 = sqlContext.createDataFrame(pddf_in[['id', 'user_id', 'user_name',
                                              'tweet_text']])

In [8]:
spdf_02.show()

In [7]:
spdf_02.take(3)
Out[7]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0'),
 Row(id=638830426727911424, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: PhuketDailyNews:
dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: CiviPa\u2026
http://t.co/VpD7FoqMr0'),
 Row(id=638830425402556417, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsgantt: elsahel12:
simbata3: JDHM2015: almtorta18: CiviPartnership: dr\u2026
http://t.co/EMDOn8chPK')]
In [9]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [10]:
#
# Tokenize the tweet_text
#
tokenizer = Tokenizer(inputCol="tweet_text", outputCol="tokens")
tokensData = tokenizer.transform(spdf_02)

In [11]:
tokensData.take(1)
Out[11]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:',
u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'])]
In [14]:
#
# Apply Hashing TF to the tokens
#
hashingTF = HashingTF(inputCol="tokens", outputCol="rawFeatures",
                      numFeatures=2000)
featuresData = hashingTF.transform(tokensData)

In [15]:
featuresData.take(1)
Out[15]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:',
u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'],
rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185:
1.0, 742: 1.0, 856: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}))]
In [16]:
#
# Apply IDF to the raw features and rescale the data
#
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featuresData)
rescaledData = idfModel.transform(featuresData)

for features in rescaledData.select("features").take(3):
    print(features)

In [17]:
rescaledData.take(2)
Out[17]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:',
u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'],
rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185:
1.0, 742: 1.0, 856: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}),
features=SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160:
2.9985, 185: 2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694,
1620: 3.073})),
 Row(id=638830426727911424, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: PhuketDailyNews:
dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: CiviPa\u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'phuketdailynews:', u'dreamintentions:', u'elsahel12:', u'simbata3:',
u'jdhm2015:', u'almtorta18:', u'civipa\u2026', u'http://t.co/vpd7foqmr0'],
rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185:
1.0, 460: 1.0, 987: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}),
features=SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160:
2.9985, 185: 2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694,
1620: 3.073}))]

In [21]:
rs_pddf = rescaledData.toPandas()

In [22]:
rs_pddf.count()
Out[22]:
id             7540
user_id        7540
user_name      7540
tweet_text     7540
tokens         7540
rawFeatures    7540
features       7540
dtype: int64

In [27]:
feat_lst = rs_pddf.features.tolist()

In [28]:
feat_lst[:2]
Out[28]:
[SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694, 1620: 3.073})]
Running the clustering algorithm

We will use the K-Means algorithm against the Twitter dataset. As an unlabeled and shuffled bag of tweets, we want to see if the Apache Spark tweets are grouped in a single cluster. From the previous steps, the TF-IDF sparse vector of features is converted into an RDD that will be the input to the Spark MLlib program. We initialize the K-Means model with 5 clusters, 10 iterations, and 10 runs:
In [32]:
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

In [34]:
# Load and parse the data
in_Data = sc.parallelize(feat_lst)

In [35]:
in_Data.take(3)
Out[35]:
[SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {20: 4.3534, 74: 2.6762, 97: 1.8625, 100: 5.2768, 185:
2.7481, 856: 4.1406, 991: 2.9518, 1039: 3.073, 1620: 3.073, 1864: 4.6377})]

In [37]:
in_Data.count()
Out[37]:
7540

In [38]:
# Build the model (cluster the data)
clusters = KMeans.train(in_Data, 5, maxIterations=10,
                        runs=10, initializationMode="random")

In [53]:
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = in_Data.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
Evaluating the model and the results

One way to fine-tune the clustering algorithm is by varying the number of clusters and verifying the output. Let's check the clusters and get a feel for the clustering results so far:
In [43]:
cluster_membership = in_Data.map(lambda x: clusters.predict(x))

In [54]:
cluster_idx = cluster_membership.zipWithIndex()

In [55]:
type(cluster_idx)
Out[55]:
pyspark.rdd.PipelinedRDD

In [58]:
cluster_idx.take(20)
Out[58]:
[(3, 0),
 (3, 1),
 (3, 2),
 (3, 3),
 (3, 4),
 (3, 5),
 (1, 6),
 (3, 7),
 (3, 8),
 (3, 9),
 (3, 10),
 (3, 11),
 (3, 12),
 (3, 13),
 (3, 14),
 (1, 15),
 (3, 16),
 (3, 17),
 (1, 18),
 (1, 19)]

In [59]:
cluster_df = cluster_idx.toDF()

In [65]:
pddf_with_cluster = pd.concat([pddf_in, cluster_pddf], axis=1)

In [76]:
pddf_with_cluster._1.unique()
Out[76]:
array([3, 1, 4, 0, 2])
In [79]:
pddf_with_cluster[pddf_with_cluster['_1'] == 0].head(10)
Out[79]:
      Unnamed: 0                  id                      created_at    user_id           user_name                                         tweet_text  _1    _2
6227           3  642418116819988480  Fri Sep 11 19:23:09 +0000 2015   49693598        Ajinkya Kale  RT @bigdata: Distributed Matrix Computations i…   0  6227
6257          45  642391207205859328  Fri Sep 11 17:36:13 +0000 2015  937467860        Angela Bassa  [Auto] I'm reading ""Distributed Matrix Comput…   0  6257
6297         119  642348577147064320  Fri Sep 11 14:46:49 +0000 2015   18318677          Ben Lorica  Distributed Matrix Computations in @ApacheSpar…   0  6297

In [80]:
pddf_with_cluster[pddf_with_cluster['_1'] == 1].head(10)
Out[80]:
    Unnamed: 0                  id                      created_at     user_id           user_name                                         tweet_text  _1  _2
6            6  638830419090079746  Tue Sep 01 21:46:55 +0000 2015  2241040634     Massimo Carrisi  Python: Python: Removing \xa0 from string? - I…   1   6
15          17  638830380578045953  Tue Sep 01 21:46:46 +0000 2015    57699376     Rafael Monnerat  RT @ramalhoorg: Noite de autógrafos do Fluent…    1  15
18          41  638830280988426250  Tue Sep 01 21:46:22 +0000 2015   951081582        Jack Baldwin  RT @cloudaus: We are 3/4 full! 2-day @swcarpen…   1  18
19          42  638830276626399232  Tue Sep 01 21:46:21 +0000 2015     6525302  Masayoshi Nakamura  PynamoDB #AWS #DynamoDB #Python http://...        1  19
20          43  638830213288235008  Tue Sep 01 21:46:06 +0000 2015  3153874869    Baltimore Python  Flexx: Python UI tookit based on web technolog…   1  20
21          44  638830117645516800  Tue Sep 01 21:45:43 +0000 2015    48474625   Radio Free Denali  Hmm, emerge --depclean wants to remove somethi…   1  21
22          46  638829977014636544  Tue Sep 01 21:45:10 +0000 2015   154915461     Luciano Ramalho  Noite de autógrafos do Fluent Python no Garoa…    1  22
23          47  638829882928070656  Tue Sep 01 21:44:47 +0000 2015   917320920     bsbafflesbrains  @DanSWright Harper channeling Monty Python. "...  1  23
24          48  638829868679954432  Tue Sep 01 21:44:44 +0000 2015   134280898  Lannick Technology  RT @SergeyKalnish: I am #hiring: Senior Backe…    1  24
25          49  638829707484508161  Tue Sep 01 21:44:05 +0000 2015  2839203454        Joshua Jones  RT @LindseyPelas: Surviving Monty Python in Fl…   1  25

In [81]:
pddf_with_cluster[pddf_with_cluster['_1'] == 2].head(10)
Out[81]:
      Unnamed: 0                  id                      created_at     user_id  user_name                                                tweet_text  _1    _2
7280         688  639056941592014848  Wed Sep 02 12:47:02 +0000 2015  2735137484      Chris  A true gay icon when will @ladygaga @[email protected]…   2  7280

In [82]:
pddf_with_cluster[pddf_with_cluster['_1'] == 3].head(10)
Out[82]:
    Unnamed: 0                  id                      created_at     user_id      user_name                                        tweet_text  _1  _2
0            0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint…   3   0
1            1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   1
2            2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg…   3   2
3            3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   3
4            4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   4
5            5  638830420159655936  Tue Sep 01 21:46:55 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   5
7            7  638830418330980352  Tue Sep 01 21:46:55 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   7
8            8  638830397648822272  Tue Sep 01 21:46:50 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   8
9            9  638830395375529984  Tue Sep 01 21:46:49 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   9
10          10  638830392389177344  Tue Sep 01 21:46:49 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3  10

In [83]:
pddf_with_cluster[pddf_with_cluster['_1'] == 4].head(10)
Out[83]:
      Unnamed: 0                  id                      created_at     user_id          user_name                                        tweet_text  _1    _2
1361         882  642648214454317056  Sat Sep 12 10:37:28 +0000 2015    27415756    Raymond Enisuoh  LA Chosen For US 2024 Olympic Bid - LA2016 See…   4  1361
1363         885  642647848744583168  Sat Sep 12 10:36:01 +0000 2015    27415756    Raymond Enisuoh  Prison See: https://t.co/x3EKAExeFi… … … … …...  4  1363
5412          11  640480770369286144  Sun Sep 06 11:04:49 +0000 2015  3242403023  Donald Trump 2016  "igiboooy! @Starbucks https://t.co/97wdL…         4  5412
5428          27  640477140660518912  Sun Sep 06 10:50:24 +0000 2015  3242403023  Donald Trump 2016  "@Starbucks https://t.co/wsEYFIefk7" - D…         4  5428
5455          61  640469542272110592  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "starbucks @Starbucks Mam Plaza https://t.co…     4  5455
5456          62  640469541370372096  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "Aaahhh the pumpkin spice latte is back, fall…    4  5456
5457          63  640469539524898817  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "RT kayyleighferry: Oh my goddd Harry Potter…     4  5457
5458          64  640469537176031232  Sun Sep 06 10:20:11 +0000 2015  3242403023  Donald Trump 2016  "Starbucks https://t.co/3xYYXlwNkf" - Donald…     4  5458
5459          65  640469536119070720  Sun Sep 06 10:20:11 +0000 2015  3242403023  Donald Trump 2016  "A Starbucks is under construction in my neig…    4  5459
5460          66  640469530435813376  Sun Sep 06 10:20:10 +0000 2015  3242403023  Donald Trump 2016  "Babam starbucks'tan fotogtaf atıyor bende du…    4  5460
We map the 5 clusters with some sample tweets. Cluster 0 is about Spark. Cluster 1 is about Python. Cluster 2 is about Lady Gaga. Cluster 3 is about Thailand's Phuket News. Cluster 4 is about Donald Trump.
Building machine learning pipelines

We want to compose the feature extraction, preparatory activities, training, testing, and prediction activities while optimizing the best tuning parameters to get the best performing model.

The following tweet captures perfectly in five lines of code a powerful machine learning pipeline implemented in Spark MLlib:

The Spark ML pipeline is inspired by Python's Scikit-Learn and creates a succinct, declarative statement of the successive transformations to the data in order to quickly deliver a tunable model.
Summary

In this chapter, we got an overview of Spark MLlib's ever-expanding library of algorithms. We discussed supervised and unsupervised learning, recommender systems, optimization, and feature extraction algorithms. We then put the harvested data from Twitter through the machine learning process, algorithms, and evaluation to derive insights from the data. We put the Twitter-harvested dataset through Python Scikit-Learn and Spark MLlib K-Means clustering in order to segregate the tweets relevant to Apache Spark. We also evaluated the performance of the models.

This gets us ready for the next chapter, which will cover streaming analytics using Spark. Let's jump right in.
Chapter 5. Streaming Live Data with Spark

In this chapter, we will focus on live streaming data flowing into Spark and processing it. So far, we have discussed machine learning and data mining with batch processing. We are now looking at processing continuously flowing data and detecting facts and patterns on the fly. We are navigating from a lake to a river.

We will first investigate the challenges arising from such a dynamic and ever-changing environment. After laying the grounds on the prerequisites of a streaming application, we will investigate various implementations using live sources of data, such as TCP sockets, up to the Twitter firehose, and put in place a low latency, high throughput, and scalable data pipeline combining Spark, Kafka, and Flume.

In this chapter, we will cover the following points:

Analyzing a streaming application's architectural challenges, constraints, and requirements
Processing live data from a TCP socket with Spark Streaming
Connecting to the Twitter firehose directly to parse tweets in quasi real time
Establishing a reliable, fault tolerant, scalable, high throughput, low latency integrated application using Spark, Kafka, and Flume
Closing remarks on the Lambda and Kappa architecture paradigms
Laying the foundations of streaming architecture

As customary, let's first go back to our original drawing of the data-intensive apps architecture blueprint and highlight the Spark Streaming module that will be the topic of interest.

The following diagram sets the context by highlighting the Spark Streaming module and the interactions with Spark SQL and Spark MLlib within the overall data-intensive apps framework.

Data flows from stock market time series, enterprise transactions, interactions, events, web traffic, clickstreams, and sensors. All events are time-stamped data and urgent. This is the case for fraud detection and prevention, mobile cross-sell and upsell, or traffic alerts. Those streams of data require immediate processing for monitoring purposes, such as detecting anomalies, outliers, spam, fraud, and intrusion; and also for providing basic statistics, insights, trends, and recommendations. In some cases, the summarized aggregated information is sufficient to be stored for later usage. From an architecture paradigm perspective, we are moving from a service-oriented architecture to an event-driven architecture.
Two models emerge for processing streams of data:

Processing one record at a time as they come in. We do not buffer the incoming records in a container before processing them. This is the case of Twitter's Storm, Yahoo's S4, and Google's MillWheel.
Micro-batching or batch computations on small intervals as performed by Spark Streaming and Storm Trident. In this case, we buffer the incoming records in a container according to the time window prescribed in the micro-batching settings.

Spark Streaming has often been compared against Storm. They are two different models of streaming data. Spark Streaming is based on micro-batching. Storm is based on processing records as they come in. Storm also offers a micro-batching option, with its Storm Trident option.
The driving factor in a streaming application is latency. Latency varies from the milliseconds range in the case of RPC (short for Remote Procedure Call) to several seconds or minutes for micro-batching solutions such as Spark Streaming.

RPC allows synchronous operations between the requesting programs waiting for the results from the remote server's procedure. Threads allow concurrency of multiple RPC calls to the server.

An example of software implementing a distributed RPC model is Apache Storm.

Storm implements stateless sub-millisecond latency processing of unbounded tuples using topologies, or directed acyclic graphs, combining spouts as sources of data streams and bolts for operations such as filter, join, aggregation, and transformation. Storm also implements a higher level abstraction called Trident which, similarly to Spark, processes data streams in micro-batches.

So, looking at the latency continuum, from sub-millisecond to second, Storm is a good candidate. For the seconds to minutes scale, Spark Streaming and Storm Trident are excellent fits. For several minutes onward, Spark and a NoSQL database such as Cassandra or HBase are adequate solutions. For ranges beyond the hour and with high volumes of data, Hadoop is the ideal contender.
Although throughput is correlated to latency, it is not a simple inversely linear relationship. If processing a message takes 2 ms, which determines the latency, then one would assume the throughput is limited to 500 messages per second. Batching messages allows for higher throughput if we allow our messages to be buffered for 8 ms more. With a latency of 10 ms, the system can buffer up to 10,000 messages. For a bearable increase in latency, we have substantially increased throughput. This is the magic of micro-batching that Spark Streaming exploits.
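The latency/throughput trade-off above can be checked with a few lines of arithmetic. This is a toy model, assuming batching amortizes the per-message cost over every message buffered in the window; the helper names are illustrative.

```python
# Toy model of the latency/throughput trade-off described above.
# Assumes one-at-a-time processing costs a fixed time per message,
# while batching amortizes work over all messages in the window.

def one_at_a_time_throughput(per_msg_ms):
    """Messages per second when each message is processed individually."""
    return 1000 / per_msg_ms

def batched_throughput(window_ms, msgs_per_window):
    """Messages per second when msgs_per_window are handled per window."""
    return msgs_per_window * 1000 / window_ms

print(one_at_a_time_throughput(2))    # 500.0 msg/s at 2 ms latency
print(batched_throughput(10, 10000))  # 1000000.0 msg/s at 10 ms latency
```

The factor-of-2000 jump in throughput costs only 8 ms of extra latency, which is the trade Spark Streaming makes.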
Spark Streaming inner working

The Spark Streaming architecture leverages the Spark core architecture. It overlays on the SparkContext a StreamingContext as the entry point to the streaming functionality. The cluster manager will dedicate at least one worker node as a receiver, which will be an executor with a long task to process the incoming stream. The executor creates Discretized Streams or DStreams from the input data stream and replicates, by default, each DStream to the cache of another worker. One receiver serves one input data stream. Multiple receivers improve parallelism and generate multiple DStreams that Spark can unite or join as Resilient Distributed Datasets (RDDs).

The following diagram gives an overview of the inner working of Spark Streaming. The client interacts with the Spark cluster via the cluster manager, while Spark Streaming has a dedicated worker with a long running task ingesting the input data stream and transforming it into discretized streams or DStreams. The data is collected, buffered, and replicated by a receiver and then pushed to a stream of RDDs.

Spark receivers can ingest data from many sources. Core input sources range from TCP sockets and HDFS/Amazon S3 to Akka Actors. Additional sources include Apache Kafka, Apache Flume, Amazon Kinesis, ZeroMQ, Twitter, and custom or user-defined receivers.

We distinguish between reliable receivers, which acknowledge receipt of data to the source and replicate it for possible resend, versus unreliable receivers, which do not acknowledge receipt of the message. Spark scales out in terms of the number of workers, partitions, and receivers.

The following diagram gives an overview of Spark Streaming with the possible sources and the persistence options:
Going under the hood of Spark Streaming

Spark Streaming is composed of receivers, and is powered by Discretized Streams and Spark connectors for persistence.

Just as the essential data structure for Spark Core is the RDD, the fundamental programming abstraction for Spark Streaming is the Discretized Stream or DStream.

The following diagram illustrates Discretized Streams as continuous sequences of RDDs. The batch intervals of a DStream are configurable.

DStreams snapshot the incoming data in batch intervals. Those time steps typically range from 500 ms to several seconds. The underlying structure of a DStream is an RDD.

A DStream is essentially a continuous sequence of RDDs. This is powerful as it allows us to leverage from Spark Streaming all the traditional functions, transformations, and actions available in Spark Core, and allows us to dialogue with Spark SQL, performing SQL queries on incoming streams of data, and with Spark MLlib. Transformations similar to those on generic and key-value pair RDDs are applicable. The DStreams benefit from the inner RDDs' lineage and fault tolerance. Additional transformation and output operations exist for discretized stream operations. Most generic operations on DStreams are transform and foreachRDD.
The following diagram gives an overview of the lifecycle of DStreams, from the creation of the micro-batches of messages materialized to RDDs, on which transformation functions and actions that trigger Spark jobs are applied. Breaking down the steps illustrated in the diagram, we read the diagram top down:

1. In the input stream, the incoming messages are buffered in a container according to the time window allocated for the micro-batching.
2. In the discretized stream step, the buffered micro-batches are transformed as DStream RDDs.
3. The mapped DStream step is obtained by applying a transformation function to the original DStream. These first three steps constitute the transformation of the original data received in predefined time windows. As the underlying data structure is the RDD, we conserve the data lineage of the transformations.
4. The final step is an action on the RDD. It triggers the Spark job.
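The four steps above can be mimicked in plain Python: a toy micro-batching word count, with lists standing in for RDDs and a fixed batch size standing in for the time window. This is a conceptual sketch, not the Spark API.

```python
from collections import Counter

# Toy illustration of the DStream lifecycle: buffer messages into
# micro-batches (steps 1-2), apply a transformation (step 3), then
# an action that emits the result (step 4). Not the Spark API.

def micro_batches(stream, batch_size):
    """Steps 1-2: cut the incoming stream into fixed-size batches."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

stream = ["hello world", "hello spark", "cool it works"]
for batch in micro_batches(stream, 2):
    # Step 3: transformation - split lines into words
    words = [w for line in batch for w in line.split()]
    # Step 4: action - count words and emit the result
    print(Counter(words))
```

In real Spark Streaming, the batch boundary is a time window rather than a count, and each batch is a genuine RDD with lineage and fault tolerance.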
Transformations can be stateless or stateful. Stateless means that no state is maintained by the program, while stateful means the program keeps a state, in which case previous transactions are remembered and may affect the current transaction. A stateful operation modifies or requires some state of the system, and a stateless operation does not.

Stateless transformations process each batch in a DStream one at a time. Stateful transformations process multiple batches to obtain results. Stateful transformations require the checkpoint directory to be configured. Checkpointing is the main mechanism for fault tolerance in Spark Streaming, periodically saving data and metadata about an application.

There are two types of stateful transformations for Spark Streaming: updateStateByKey and windowed transformations.

updateStateByKey transformations maintain state for each key in a stream of pair RDDs. They return a new state DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of each key. An example would be a running count of given hashtags in a stream of tweets.
Windowed transformations are carried over multiple batches in a sliding window. A window has a defined length or duration specified in time units. It must be a multiple of a DStream batch interval. It defines how many batches are included in a windowed transformation.

A window has a sliding interval or sliding duration specified in time units. It must be a multiple of a DStream batch interval. It defines how many batches to slide a window, or how frequently to compute a windowed transformation.
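The interplay of window length and sliding interval can be pictured with a toy sliding-window count in plain Python, where each list element stands for the event count of one batch interval. This is a conceptual sketch only, not the Spark API.

```python
# Toy sliding-window count: each element of `batches` is the number of
# events seen in one batch interval. A window of length 3 batches that
# slides by 2 batches yields one aggregated count per slide.
# Conceptual sketch only - not the Spark API.

def windowed_counts(batches, window_length, slide_interval):
    counts = []
    for start in range(0, len(batches) - window_length + 1, slide_interval):
        counts.append(sum(batches[start:start + window_length]))
    return counts

batches = [4, 1, 3, 2, 5, 0]
print(windowed_counts(batches, window_length=3, slide_interval=2))
# [8, 10] -> windows [4, 1, 3] and [3, 2, 5]
```

Note how the two windows overlap by one batch: that overlap is exactly window length minus sliding interval, as in Spark's windowed operations.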
The following schema depicts the windowing operation on DStreams to derive window DStreams with a given length and sliding interval:

A sample function is countByWindow(windowLength, slideInterval). It returns a new DStream in which each RDD has a single element generated by counting the number of elements in a sliding window over this DStream. An illustration in this case would be a running count of given hashtags in a stream of tweets every 60 seconds. The window time frame is specified.

Minute-scale window lengths are reasonable. Hour-scale window lengths are not recommended as they are compute and memory intensive. It would be more convenient to aggregate the data in a database such as Cassandra or HBase.

Windowed transformations compute results based on the window length and window slide interval. Spark performance is primarily affected by the window length, the window slide interval, and persistence.
Building in fault tolerance

Real-time stream processing systems must be operational 24/7. They need to be resilient to all sorts of failures in the system. Spark and its RDD abstraction are designed to seamlessly handle failures of any worker nodes in the cluster.

The main Spark Streaming fault tolerance mechanisms are checkpointing, automatic driver restart, and automatic failover. Spark enables recovery from driver failure using checkpointing, which preserves the application state.

Write ahead logs, reliable receivers, and file streams guarantee zero data loss as of Spark version 1.2. Write ahead logs represent a fault tolerant storage for received data.

Failures require recomputing results. DStream operations have exactly-once semantics. Transformations can be recomputed multiple times but will yield the same result. DStream output operations have at-least-once semantics. Output operations may be executed multiple times.
Processing live data with TCP sockets

As a stepping stone to the overall understanding of streaming operations, we will first experiment with TCP sockets. A TCP socket establishes two-way communication between client and server, and it can exchange data through the established connection. WebSocket connections are long-lived, unlike typical HTTP connections. HTTP is not meant to keep an open connection from the server to continuously push data to the web browsers. Most web applications hence resorted to long polling via frequent Asynchronous JavaScript and XML (AJAX) requests. WebSockets, standardized and implemented in HTML5, are moving beyond web browsers and are becoming a cross-platform standard for real-time communication between client and server.
Setting up TCP sockets

We create a TCP socket server by running netcat, a small utility found in most Linux systems, as a data server with the command > nc -lk 9999, where 9999 is the port where we are sending data:

#
# Socket Server
#
an@an-VB:~$ nc -lk 9999
hello world
how are you
hello world
cool it works
Once netcat is running, we will open a second console with our Spark Streaming client to receive the data and process it. As soon as the Spark Streaming client console is listening, we start typing the words to be processed, that is, hello world.
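The same exchange can also be reproduced without netcat. Below is a minimal, self-contained Python sketch (all names are hypothetical, for illustration only) in which a thread stands in for the nc -lk server and a client sends it the same newline-delimited lines that our Spark Streaming client will later read:

```python
import socket
import threading

received = []

def serve(server_sock):
    # Accept one client and collect newline-delimited lines, like nc -lk
    conn, _ = server_sock.accept()
    data = b''
    while True:
        chunk = conn.recv(1024)
        if not chunk:          # client closed the connection
            break
        data += chunk
    received.extend(data.decode('utf-8').splitlines())
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))  # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=serve, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
for line in ['hello world', 'how are you', 'hello world', 'cool it works']:
    client.sendall((line + '\n').encode('utf-8'))
client.close()
t.join()
server.close()
print(received)
```

ssc.socketTextStream plays the role of the client side here: it connects to the given host and port and splits the incoming byte stream on '\n' into lines.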
Processing live data

We will be using the example program provided in the Spark bundle for Spark Streaming called network_wordcount.py. It can be found in the GitHub repository under https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py. The code is as follows:
"""
CountswordsinUTF8encoded,'\n'delimitedtextreceivedfromthe
networkeverysecond.
Usage:network_wordcount.py<hostname><port>
<hostname>and<port>describetheTCPserverthatSparkStreamingwould
connecttoreceivedata.
Torunthisonyourlocalmachine,youneedtofirstrunaNetcatserver
`$nc-lk9999`
andthenruntheexample
`$bin/spark-submit
examples/src/main/python/streaming/network_wordcount.pylocalhost9999`
"""
from__future__importprint_function
importsys
frompysparkimportSparkContext
frompyspark.streamingimportStreamingContext
if__name__=="__main__":
iflen(sys.argv)!=3:
print("Usage:network_wordcount.py<hostname><port>",
file=sys.stderr)
exit(-1)
sc=SparkContext(appName="PythonStreamingNetworkWordCount")
ssc=StreamingContext(sc,1)
lines=ssc.socketTextStream(sys.argv[1],int(sys.argv[2]))
counts=lines.flatMap(lambdaline:line.split(""))\
.map(lambdaword:(word,1))\
.reduceByKey(lambdaa,b:a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()
Here, we explain the steps of the program:

1. The code first initializes a Spark Streaming context with the command:

   ssc = StreamingContext(sc, 1)

2. Next, the streaming computation is set up.
3. One or more DStream objects that receive data are defined to connect to localhost or 127.0.0.1 on port 9999:

   stream = ssc.socketTextStream("127.0.0.1", 9999)

4. The DStream computation is defined: transformations and output operations:

   stream.map(lambda x: (x, 1)) \
         .reduceByKey(lambda a, b: a + b) \
         .pprint()

5. Computation is started:

   ssc.start()

6. Program termination is pending manual or error processing completion:

   ssc.awaitTermination()

7. Manual completion is an option when a completion condition is known:

   ssc.stop()
We can monitor the Spark Streaming application by visiting the Spark monitoring home page at localhost:4040.

Here's the result of running the program and feeding the words on the netcat server console:
#
# Socket Client
#
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999

Run the Spark Streaming network_wordcount program by connecting to the socket localhost on port 9999:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
-------------------------------------------
Time: 2015-10-18 20:06:06
-------------------------------------------
(u'world', 1)
(u'hello', 1)
-------------------------------------------
Time: 2015-10-18 20:06:07
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:17
-------------------------------------------
(u'you', 1)
(u'how', 1)
(u'are', 1)
-------------------------------------------
Time: 2015-10-18 20:06:18
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:26
-------------------------------------------
(u'', 1)
(u'world', 1)
(u'hello', 1)
-------------------------------------------
Time: 2015-10-18 20:06:27
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:37
-------------------------------------------
(u'works', 1)
(u'it', 1)
(u'cool', 1)
-------------------------------------------
Time: 2015-10-18 20:06:38
-------------------------------------------
Thus, we have established a connection through the socket on port 9999, streamed the data sent by the netcat server, and performed a word count on the messages sent.
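The flatMap/map/reduceByKey pipeline applied to each batch is equivalent to the following plain-Python reduction, shown here only to make the per-batch computation explicit (the function name is ours, not Spark's):

```python
def word_count(batch_lines):
    """Per-batch word count, mirroring
    lines.flatMap(split).map(word -> (word, 1)).reduceByKey(+)."""
    counts = {}
    for line in batch_lines:                        # flatMap: lines -> words
        for word in line.split(' '):
            counts[word] = counts.get(word, 0) + 1  # map + reduceByKey
    return counts

print(word_count(['hello world', 'how are you']))
```

In Spark Streaming this computation is applied independently to each one-second batch, which is why each Time block in the output above only counts the words typed during that interval.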
Manipulating Twitter data in real time

Twitter offers two APIs. One is a search API that essentially allows us to retrieve past tweets based on search terms. This is how we have been collecting our data from Twitter in the previous chapters of the book. Interestingly, for our current purpose, Twitter also offers a live streaming API which allows the ingestion of tweets as they are emitted in the blogosphere.
Processing Tweets in real time from the Twitter firehose

The following program connects to the Twitter firehose and processes the incoming tweets to exclude deleted or invalid tweets, and parses on the fly only the relevant ones to extract the screen name, the actual tweet or tweet text, the retweet count, and the geo-location information. The processed tweets are gathered into an RDD queue by Spark Streaming and then displayed on the console at a one-second interval:
"""
TwitterStreamingAPISparkStreamingintoanRDD-Queuetoprocesstweets
live
CreateaqueueofRDDsthatwillbemapped/reducedoneatatimein
1secondintervals.
Torunthisexampleuse
'$bin/spark-submit
examples/AN_Spark/AN_Spark_Code/s07_twitterstreaming.py'
"""
#
importtime
frompysparkimportSparkContext
frompyspark.streamingimportStreamingContext
importtwitter
importdateutil.parser
importjson
#ConnectingStreamingTwitterwithStreamingSparkviaQueue
classTweet(dict):
def__init__(self,tweet_in):
super(Tweet,self).__init__(self)
iftweet_inand'delete'notintweet_in:
self['timestamp']=
dateutil.parser.parse(tweet_in[u'created_at']
).replace(tzinfo=None).isoformat()
self['text']=tweet_in['text'].encode('utf-8')
#self['text']=tweet_in['text']
self['hashtags']=[x['text'].encode('utf-8')forxin
tweet_in['entities']['hashtags']]
#self['hashtags']=[x['text']forxintweet_in['entities']
['hashtags']]
self['geo']=tweet_in['geo']['coordinates']iftweet_in['geo']
elseNone
self['id']=tweet_in['id']
self['screen_name']=tweet_in['user']
['screen_name'].encode('utf-8')
#self['screen_name']=tweet_in['user']['screen_name']
self['user_id']=tweet_in['user']['id']
defconnect_twitter():
twitter_stream=twitter.TwitterStream(auth=twitter.OAuth(
token="get_your_own_credentials",
token_secret="get_your_own_credentials",
consumer_key="get_your_own_credentials",
consumer_secret="get_your_own_credentials"))
returntwitter_stream
defget_next_tweet(twitter_stream):
stream=twitter_stream.statuses.sample(block=True)
tweet_in=None
whilenottweet_inor'delete'intweet_in:
tweet_in=stream.next()
tweet_parsed=Tweet(tweet_in)
returnjson.dumps(tweet_parsed)
defprocess_rdd_queue(twitter_stream):
#CreatethequeuethroughwhichRDDscanbepushedto
#aQueueInputDStream
rddQueue=[]
foriinrange(3):
rddQueue+=
[ssc.sparkContext.parallelize([get_next_tweet(twitter_stream)],5)]
lines=ssc.queueStream(rddQueue)
lines.pprint()
if__name__=="__main__":
sc=SparkContext(appName="PythonStreamingQueueStream")
ssc=StreamingContext(sc,1)
#Instantiatethetwitter_stream
twitter_stream=connect_twitter()
#GetRDDqueueofthestreamsjsonorparsed
process_rdd_queue(twitter_stream)
ssc.start()
time.sleep(2)
ssc.stop(stopSparkContext=True,stopGraceFully=True)
When we run this program, it delivers the following output:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ bin/spark-submit examples/AN_Spark/AN_Spark_Code/s07_twitterstreaming.py
-------------------------------------------
Time: 2015-11-03 21:53:14
-------------------------------------------
{"user_id": 3242732207, "screen_name": "cypuqygoducu", "timestamp": "2015-11-03T20:53:04", "hashtags": [], "text": "RT @VIralBuzzNewss: Our Distinctive Edition Holiday break Challenge Is In this article! Hooray!... - https://t.co/9d8wumrd5v https://t.co/\u2026", "geo": null, "id": 661647303678259200}
-------------------------------------------
Time: 2015-11-03 21:53:15
-------------------------------------------
{"user_id": 352673159, "screen_name": "melly_boo_orig", "timestamp": "2015-11-03T20:53:05", "hashtags": ["eminem"], "text": "#eminem https://t.co/GlEjPJnwxy", "geo": null, "id": 661647307847409668}
-------------------------------------------
Time: 2015-11-03 21:53:16
-------------------------------------------
{"user_id": 500620889, "screen_name": "NBAtheist", "timestamp": "2015-11-03T20:53:06", "hashtags": ["tehInterwebbies", "Nutters"], "text": "See? That didn't take long or any actual effort. This is #tehInterwebbies… #NuttersAbound! https://t.co/QS8gLStYFO", "geo": null, "id": 661647312062709761}
So, we got an example of streaming tweets with Spark and processing them on the fly.
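The flattening performed by the Tweet class can be checked offline against a hand-crafted status dictionary. The sketch below mirrors the fields the class extracts (minus the dateutil timestamp parsing); the sample values are made up for illustration:

```python
import json

def flatten_tweet(tweet_in):
    """Extract the same fields as the Tweet class above,
    skipping deleted or empty statuses."""
    if not tweet_in or 'delete' in tweet_in:
        return None
    return {
        'text': tweet_in['text'],
        'hashtags': [h['text'] for h in tweet_in['entities']['hashtags']],
        'geo': tweet_in['geo']['coordinates'] if tweet_in['geo'] else None,
        'id': tweet_in['id'],
        'screen_name': tweet_in['user']['screen_name'],
        'user_id': tweet_in['user']['id'],
    }

# A fabricated sample status, for illustration only
sample = {'text': '#spark is fun', 'geo': None, 'id': 1,
          'entities': {'hashtags': [{'text': 'spark'}]},
          'user': {'screen_name': 'someone', 'id': 42}}
print(json.dumps(flatten_tweet(sample), sort_keys=True))
```

Filtering out 'delete' records before extraction is what keeps the stream free of the deletion notices Twitter interleaves with statuses.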
Building a reliable and scalable streaming app

Ingesting data is the process of acquiring data from various sources and storing it for processing immediately or at a later stage. Data consuming systems are dispersed and can be physically and architecturally far from the sources. Data ingestion is often implemented manually with scripts and rudimentary automation. It actually calls for higher level frameworks like Flume and Kafka.

The challenges of data ingestion arise from the fact that the sources are physically spread out and are transient, which makes the integration brittle. Data production is continuous for weather, traffic, social media, network activity, shop floor sensors, security, and surveillance. Ever increasing data volumes and rates coupled with ever changing data structures and semantics make data ingestion ad hoc and error prone.

The aim is to become more agile, reliable, and scalable. The agility, reliability, and scalability of the data ingestion determine the overall health of the pipeline. Agility means integrating new sources as they arise and incorporating changes to existing sources as needed. In order to ensure safety and reliability, we need to protect the infrastructure against data loss and the downstream applications from silent data corruption at ingress. Scalability avoids ingest bottlenecks while keeping costs tractable.
Ingest Mode           Description                                               Example
Manual or Scripted    File copy using command line interface or GUI interface   HDFS Client, Cloudera Hue
Batch Data Transport  Bulk data transport using tools                           DistCp, Sqoop
Micro Batch           Transport of small batches of data                        Sqoop, Sqoop2, Storm
Pipelining            Flow-like transport of event streams                      Flume, Scribe
Message Queue         Publish-subscribe message bus of events                   Kafka, Kinesis
In order to enable an event-driven business that is able to ingest multiple streams of data, process it in flight, and make sense of it all to get to rapid decisions, the key driver is the Unified Log.

A Unified Log is a centralized enterprise structured log available for real-time subscription. All the organization's data is put in a central log for subscription. Records are numbered beginning with zero in the order that they are written. It is also known as a commit log or journal. The concept of the Unified Log is the central tenet of the Kappa architecture.
The properties of the Unified Log are as follows:

Unified: There is a single deployment for the entire organization
Append only: Events are immutable and are appended
Ordered: Each event has a unique offset within a shard
Distributed: For fault tolerance purposes, the Unified Log is distributed redundantly on a cluster of computers
Fast: The system ingests thousands of messages per second
Setting up Kafka

In order to isolate the downstream consumption of data from the vagaries of the upstream emission of data, we need to decouple the providers of data from the receivers or consumers of data. As they are living in two different worlds with different cycles and constraints, Kafka decouples the data pipelines.

Apache Kafka is a distributed publish-subscribe messaging system rethought as a distributed commit log. The messages are stored by topic.
Apache Kafka has the following properties. It supports:

High throughput for high volumes of event feeds
Real-time processing of new and derived feeds
Large data backlogs and persistence for offline consumption
Low latency as an enterprise-wide messaging system
Fault tolerance thanks to its distributed nature

Messages are stored in partitions with a unique sequential ID called the offset. Consumers track their pointers via a tuple of (offset, partition, topic).
Let's dive deeper into the anatomy of Kafka.

Kafka has essentially three components: producers, consumers, and brokers. Producers push and write data to brokers. Consumers pull and read data from brokers. Brokers do not push messages to consumers; consumers pull messages from brokers. The setup is distributed and coordinated by Apache ZooKeeper.

The brokers manage and store the data in topics. Topics are split into replicated partitions. The data is persisted in the broker; it is not removed upon consumption, but only after the retention period expires. If a consumer fails, it can always go back to the broker to fetch the data.

Kafka requires Apache ZooKeeper. ZooKeeper is a high-performance coordination service for distributed applications. It centrally manages configuration, registry or naming services, group membership, locks, and synchronization for coordination between servers. It provides a hierarchical namespace with metadata, monitoring statistics, and the state of the cluster. ZooKeeper can introduce brokers and consumers on the fly and then rebalances the cluster.

Kafka producers do not need ZooKeeper. Kafka brokers use ZooKeeper to provide general state information as well as to elect a leader in case of failure. Kafka consumers use ZooKeeper to track message offsets. Newer versions of Kafka spare the consumers from going through ZooKeeper and let them retrieve that information from special Kafka topics. Kafka provides automatic load balancing for producers.

The following diagram gives an overview of the Kafka setup:
Installing and testing Kafka

We will download the Apache Kafka binaries from the dedicated web page at http://kafka.apache.org/downloads.html and install the software on our machine using the following steps:

1. Download the code.
2. Download the 0.8.2.0 release and un-tar it:

   > tar -xzf kafka_2.10-0.8.2.0.tgz
   > cd kafka_2.10-0.8.2.0
3. Start ZooKeeper. Kafka uses ZooKeeper, so we need to first start a ZooKeeper server. We will use the convenience script packaged with Kafka to get a single-node ZooKeeper instance:

   > bin/zookeeper-server-start.sh config/zookeeper.properties

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/zookeeper-server-start.sh config/zookeeper.properties
[2015-10-31 22:49:14,808] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2015-10-31 22:49:14,816] INFO autopurge.snapRetainCount set to 3 (org.apache.zookeeper.server.DatadirCleanupManager)
...
4. Now launch the Kafka server:

   > bin/kafka-server-start.sh config/server.properties

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-server-start.sh config/server.properties
[2015-10-31 22:52:04,643] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2015-10-31 22:52:04,714] INFO Property broker.id is overridden to 0 (kafka.utils.VerifiableProperties)
[2015-10-31 22:52:04,715] INFO Property log.cleaner.enable is overridden to false (kafka.utils.VerifiableProperties)
[2015-10-31 22:52:04,715] INFO Property log.dirs is overridden to /tmp/kafka-logs (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
5. Create a topic. Let's create a topic named test with a single partition and only one replica:

   > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

6. We can now see that topic if we run the list topic command:

   > bin/kafka-topics.sh --list --zookeeper localhost:2181
   test

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Created topic "test".
an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-topics.sh --list --zookeeper localhost:2181
test
7. Check the Kafka installation by creating a producer and a consumer. We first launch a producer and type a message in the console:

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
[2015-10-31 22:54:43,698] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
This is a message
This is another message

8. We then launch a consumer to check that we receive the messages:

an@an-VB:~$ cd kafka/
an@an-VB:~/kafka$ cd kafka_2.10-0.8.2.0/
an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message

The messages were appropriately received by the consumer:
1. Check the Kafka and Spark Streaming consumer. We will be using the Spark Streaming Kafka word count example provided in the Spark bundle. A word of caution: we have to bind the Kafka packages, --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0, when we submit the Spark job. The command is as follows:

   ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 \
   examples/src/main/python/streaming/kafka_wordcount.py \
   localhost:2181 test

2. When we launch the Spark Streaming word count program with Kafka, we get the following output:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
-------------------------------------------
Time: 2015-10-31 23:46:33
-------------------------------------------
(u'', 1)
(u'from', 2)
(u'Hello', 2)
(u'Kafka', 2)
-------------------------------------------
Time: 2015-10-31 23:46:34
-------------------------------------------
-------------------------------------------
Time: 2015-10-31 23:46:35
-------------------------------------------
3. Install the Kafka Python driver in order to be able to programmatically develop producers and consumers and interact with Kafka and Spark using Python. We will use the road-tested library from David Arthur, aka Mumrah on GitHub (https://github.com/mumrah). We can pip install it as follows:

   > pip install kafka-python

an@an-VB:~$ pip install kafka-python
Collecting kafka-python
  Downloading kafka-python-0.9.4.tar.gz (63kB)
...
Successfully installed kafka-python-0.9.4
Developing producers

The following program creates a Simple Kafka Producer that will emit the message This is a message sent from the Kafka producer: five times, followed by a timestamp, every second:
#
# kafka producer
#
#
import time
from kafka.common import LeaderNotAvailableError
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer
from datetime import datetime

def print_response(response=None):
    if response:
        print('Error: {0}'.format(response[0].error))
        print('Offset: {0}'.format(response[0].offset))

def main():
    kafka = KafkaClient("localhost:9092")
    producer = SimpleProducer(kafka)
    try:
        time.sleep(5)
        topic = 'test'
        for i in range(5):
            time.sleep(1)
            msg = 'This is a message sent from the kafka producer: ' \
                  + str(datetime.now().time()) + ' — ' \
                  + str(datetime.now().strftime("%A, %d %B %Y %I:%M%p"))
            print_response(producer.send_messages(topic, msg))
    except LeaderNotAvailableError:
        # https://github.com/mumrah/kafka-python/issues/249
        time.sleep(1)
        print_response(producer.send_messages(topic, msg))
    kafka.close()

if __name__ == "__main__":
    main()
When we run this program, the following output is generated:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$ python s08_kafka_producer_01.py
Error: 0
Offset: 13
Error: 0
Offset: 14
Error: 0
Offset: 15
Error: 0
Offset: 16
Error: 0
Offset: 17
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$

It tells us there were no errors and gives the offsets of the messages assigned by the Kafka broker.
Developing consumers

To fetch the messages from the Kafka brokers, we develop a Kafka consumer:

# kafka consumer
# consumes messages from "test" topic and writes them to console.
#
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

def main():
    kafka = KafkaClient("localhost:9092")
    print("Consumer established connection to kafka")
    consumer = SimpleConsumer(kafka, "my-group", "test")
    for message in consumer:
        # This will wait and print messages as they become available
        print(message)

if __name__ == "__main__":
    main()
When we run this program, we effectively confirm that the consumer received all the messages:

an@an-VB:~$ cd ~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code/
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$ python s08_kafka_consumer_01.py
Consumer established connection to kafka
OffsetAndMessage(offset=13, message=Message(magic=0, attributes=0, key=None, value='This is a message sent from the kafka producer: 11:50:17.867309 Sunday, 01 November 2015 11:50AM'))
...
OffsetAndMessage(offset=17, message=Message(magic=0, attributes=0, key=None, value='This is a message sent from the kafka producer: 11:50:22.051423 Sunday, 01 November 2015 11:50AM'))
Developing a Spark Streaming consumer for Kafka

Based on the example code provided in the Spark Streaming bundle, we will create a Spark Streaming consumer for Kafka and perform a word count on the messages stored with the brokers:

#
# Kafka Spark Streaming Consumer
#
from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_spark_consumer_01.py <zk> <topic>",
              file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)

    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
Run this program with the following Spark submit command:

./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 examples/AN_Spark/AN_Spark_Code/s08_kafka_spark_consumer_01.py localhost:2181 test

We get the following output:

an@an-VB:~$ cd spark/spark-1.5.0-bin-hadoop2.6/
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit \
> --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 \
> examples/AN_Spark/AN_Spark_Code/s08_kafka_spark_consumer_01.py localhost:2181 test
...
:: retrieving :: org.apache.spark#spark-submit-parent
  confs: [default]
  0 artifacts copied, 10 already retrieved (0kB/18ms)
-------------------------------------------
Time: 2015-11-01 12:13:16
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:17
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:18
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:19
-------------------------------------------
(u'a', 5)
(u'the', 5)
(u'11:50AM', 5)
(u'from', 5)
(u'This', 5)
(u'11:50:21.044374 Sunday,', 1)
(u'message', 5)
(u'11:50:20.036422 Sunday,', 1)
(u'11:50:22.051423 Sunday,', 1)
(u'11:50:17.867309 Sunday,', 1)
...
-------------------------------------------
Time: 2015-11-01 12:13:20
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:21
-------------------------------------------
Exploring flume

Flume is a continuous ingestion system. It was originally designed to be a log aggregation system, but it evolved to handle any type of streaming event data.

Flume is a distributed, reliable, scalable, and available pipeline system for the efficient collection, aggregation, and transport of large volumes of data. It has built-in support for contextual routing, filtering, replication, and multiplexing. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for real-time analytic applications.

Flume offers the following:

Guaranteed delivery semantics
Low-latency reliable data transfer
Declarative configuration with no coding required
Extendable and customizable settings
Integration with most commonly used end-points
The anatomy of Flume contains the following elements:

Event: An event is the fundamental unit of data that is transported by Flume from source to destination. It is like a message with a byte array payload opaque to Flume and optional headers used for contextual routing.
Client: A client produces and transmits events. A client decouples Flume from the data consumers. It is an entity that generates events and sends them to one or more agents. A custom client, a Flume log4j appender, or an embedded application agent can be a client.
Agent: An agent is a container hosting sources, channels, sinks, and other elements that enable the transportation of events from one place to another. It provides configuration, life-cycle management, and monitoring for hosted components. An agent is a physical Java virtual machine running Flume.
Source: A source is the entity through which Flume receives events. Sources require at least one channel to function, in order to either actively poll data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
Sink: A sink is the entity that drains data from the channel and delivers it to the next destination. A variety of sinks allow data to be streamed to a range of destinations. Sinks support serialization to the user's format. One example is the HDFS sink that writes events to HDFS.
Channel: A channel is the conduit between the source and the sink that buffers incoming events until drained by sinks. Sources feed events into the channel and the sinks drain the channel. Channels decouple the impedance of upstream and downstream systems. A burst of data upstream is damped by the channels. Failures downstream are transparently absorbed by the channels. Sizing the channel capacity to cope with these events is key to realizing these benefits. Channels offer two levels of persistence: either a memory channel, which is volatile if the JVM crashes, or a file channel backed by a Write Ahead Log that stores the information to disk. Channels are fully transactional.

Let's illustrate all these concepts:
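A channel's damping role can be sketched with a bounded in-memory queue in plain Python (standing in conceptually for a Flume memory channel, not using any Flume API): the source pushes a burst, and the sink drains at its own pace, so neither blocks the other as long as the channel is sized for the burst:

```python
from queue import Queue

channel = Queue(maxsize=100)   # capacity sized for the expected burst

def source_put(events):
    for e in events:
        channel.put(e)         # would block only if the channel overflowed

def sink_drain():
    drained = []
    while not channel.empty():
        drained.append(channel.get())
    return drained

source_put(['evt-%d' % i for i in range(5)])   # upstream burst
print(sink_drain())  # → ['evt-0', 'evt-1', 'evt-2', 'evt-3', 'evt-4']
```

Choosing the channel capacity is the sizing decision mentioned above: too small and upstream bursts block the source; large enough and downstream hiccups are absorbed transparently.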
Developing data pipelines with Flume, Kafka, and Spark

Building a resilient data pipeline leverages the learnings from the previous sections. We are plumbing together data ingestion and transport with Flume, data brokerage with a reliable and sophisticated publish and subscribe messaging system such as Kafka, and finally process computation on the fly using Spark Streaming.

The following diagram illustrates the composition of streaming data pipelines as a sequence of connect, collect, conduct, compose, consume, consign, and control activities. These activities are configurable based on the use case:

Connect establishes the binding with the streaming API. Collect creates collection threads. Conduct decouples the data producers from the consumers by creating a buffer queue or publish-subscribe mechanism. Compose is focused on processing the data. Consume provisions the processed data for the consuming systems. Consign takes care of the data persistence. Control caters to the governance and monitoring of the systems, data, and applications.

The following diagram illustrates the concepts of the streaming data pipelines with their key components: Spark Streaming, Kafka, Flume, and low-latency databases. In the consuming or controlling applications, we are monitoring our systems in real time (depicted by a monitor) or sending real-time alerts (depicted by red lights) in case certain thresholds are crossed.

The following diagram illustrates Spark's unique ability to process, in a single platform, data in motion and data at rest while seamlessly interfacing with multiple persistence data stores as per the use case requirement.

This diagram brings into one unified whole all the concepts discussed up to now. The top part describes the streaming processing pipeline. The bottom part describes the batch processing pipeline. They both share a common persistence layer in the middle of the diagram depicting the various modes of persistence and serialization.
Closing remarks on the Lambda and Kappa architecture

Two architecture paradigms are currently in vogue: the Lambda and Kappa architectures.

Lambda is the brainchild of the Storm creator and main committer, Nathan Marz. It essentially advocates building a functional architecture on all data. The architecture has two branches. The first is a batch arm envisioned to be powered by Hadoop, where historical, high-latency, high-throughput data are pre-processed and made ready for consumption. The real-time arm is envisioned to be powered by Storm, and it processes incrementally streaming data, derives insights on the fly, and feeds aggregated information back to the batch storage.

Kappa is the brainchild of one of the main committers of Kafka, Jay Kreps, and his colleagues at Confluent (previously at LinkedIn). It advocates a full streaming pipeline, effectively implementing, at the enterprise level, the unified log announced in the previous pages.
Understanding Lambda architecture

The Lambda architecture combines batch and streaming data to provide a unified query mechanism on all available data. The Lambda architecture envisions three layers: a batch layer where precomputed information is stored, a speed layer where real-time incremental information is processed as data streams, and finally a serving layer that merges batch and real-time views for ad hoc queries. The following diagram gives an overview of the Lambda architecture:
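The serving layer's merge of the two views can be sketched as follows. This is an illustrative sketch, assuming both views map keys to counts and the speed layer only holds the increments accumulated since the last batch run:

```python
def serve(batch_view, speed_view):
    """Serving layer: merge precomputed batch counts with
    real-time increments from the speed layer."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {'#spark': 120, '#kafka': 45}   # recomputed by the batch arm
speed_view = {'#spark': 3, '#flume': 1}      # increments since the last batch run
print(serve(batch_view, speed_view))
```

When the batch arm finishes a recomputation, the speed view for the covered period is discarded, which is how the architecture bounds the accumulation of approximation error in the real-time arm.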
Understanding Kappa architecture

The Kappa architecture proposes to drive the full enterprise in streaming mode. The Kappa architecture arose from a critique by Jay Kreps and his colleagues, at LinkedIn at the time. Since then, they moved on and created Confluent, with Apache Kafka as the main enabler of the Kappa architecture vision. The basic tenet is to move to an all-streaming mode with a Unified Log as the main backbone of the enterprise information architecture.

A Unified Log is a centralized enterprise structured log available for real-time subscription. All the organization's data is put in a central log for subscription. Records are numbered beginning with zero in the order that they are written. It is also known as a commit log or journal. The concept of the Unified Log is the central tenet of the Kappa architecture.
The properties of the unified log are as follows:

Unified: There is a single deployment for the entire organization
Append only: Events are immutable and are appended
Ordered: Each event has a unique offset within a shard
Distributed: For fault tolerance purposes, the unified log is distributed redundantly on a cluster of computers
Fast: The system ingests thousands of messages per second
The following screenshot captures the moment Jay Kreps announced his reservations about the Lambda architecture. His main reservation about the Lambda architecture is implementing the same job in two different systems, Hadoop and Storm, with each of their specific idiosyncrasies, and with all the complexities that come along with it. The Kappa architecture processes the real-time data and reprocesses historical data in the same framework powered by Apache Kafka.
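Reprocessing in the Kappa style amounts to replaying the retained log from offset zero through a new version of the streaming job, as in this toy sketch (the event names and job logic are invented for illustration):

```python
log = ['click', 'click', 'buy', 'click']   # the retained event log

def job_v1(events):
    # the original streaming job: count clicks only
    return sum(1 for e in events if e == 'click')

def job_v2(events):
    # a corrected job: weigh purchases double
    return sum(2 if e == 'buy' else 1 for e in events)

# Live processing ran with v1; reprocessing history with v2 is just
# replaying the same log from offset 0 -- no separate batch system.
print(job_v1(log), job_v2(log))  # → 3 5
```

Once the v2 replay catches up with the head of the log, the v1 output is simply retired, which is the Kappa answer to the Lambda batch arm.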
Summary

In this chapter, we laid out the foundations of streaming architecture apps and described their challenges, constraints, and benefits. We went under the hood and examined the inner workings of Spark Streaming and how it fits with Spark Core and dialogues with Spark SQL and Spark MLlib. We illustrated the streaming concepts with TCP sockets, followed by live tweet ingestion and processing directly from the Twitter firehose. We discussed the notions of decoupling upstream data publishing from downstream data subscription and consumption using Kafka in order to maximize the resilience of the overall streaming architecture. We also discussed Flume, a reliable, flexible, and scalable data ingestion and transport pipeline system. The combination of Flume, Kafka, and Spark delivers unparalleled robustness, speed, and agility in an ever changing landscape. We closed the chapter with some remarks and observations on two streaming architectural paradigms, the Lambda and Kappa architectures.

The Lambda architecture combines batch and streaming data in a common query front-end. It was envisioned with Hadoop and Storm in mind initially. Spark has its own batch and streaming paradigms, and it offers a single environment with a common code base to effectively bring this architecture paradigm to life.

The Kappa architecture promulgates the concept of the unified log, which creates an event-oriented architecture where all events in the enterprise are channeled in a centralized commit log that is available to all consuming systems in real time.

We are now ready for the visualization of the data collected and processed so far.
Chapter 6. Visualizing Insights and Trends

So far, we have focused on the collection, analysis, and processing of data from Twitter. We have set the stage to use our data for visual rendering and extracting insights and trends. We will give a quick lay of the land about visualization tools in the Python ecosystem. We will highlight Bokeh as a powerful tool for rendering and viewing large datasets. Bokeh is part of the Python Anaconda Distribution ecosystem.

In this chapter, we will cover the following points:

Gauging the key words and memes within a social network community using charts and word clouds
Mapping the most active locations where communities are growing around certain themes or topics
Revisiting the data-intensive apps architecture

We have reached the final layer of the data-intensive apps architecture: the engagement layer. This layer focuses on how to synthesize, emphasize, and visualize the key context-relevant information for the data consumers. A bunch of numbers in a console will not suffice to engage with end users. It is critical to present the mass of information in a rapid, digestible, and attractive fashion.

The following diagram sets the context of the chapter's focus, highlighting the engagement layer.
For Python plotting and visualizations, we have quite a few tools and libraries. The most interesting and relevant ones for our purpose are the following:

Matplotlib is the grandfather of the Python plotting libraries. Matplotlib was originally the brainchild of John Hunter, who was an open source software proponent and established Matplotlib as one of the most prevalent plotting libraries both in the academic and the data scientific communities. Matplotlib allows the generation of plots, histograms, power spectra, bar charts, error charts, scatter plots, and so on. Examples can be found on the Matplotlib dedicated website at http://matplotlib.org/examples/index.html.
Seaborn, developed by Michael Waskom, is a great library to quickly visualize statistical information. It is built on top of Matplotlib and integrates seamlessly with Pandas and the Python data stack, including Numpy. A gallery of graphs from Seaborn at http://stanford.edu/~mwaskom/software/seaborn/examples/index.html shows the potential of the library.
ggplot is relatively new and aims to offer the equivalent of the famous ggplot2 from the R ecosystem for the Python data wranglers. It has the same look and feel as ggplot2 and uses the same grammar of graphics as expounded by Hadley Wickham. The ggplot Python port is developed by the team at yhat. More information can be found at http://ggplot.yhathq.com.
D3.js is a very popular JavaScript library developed by Mike Bostock. D3 stands for Data Driven Documents and brings data to life on any modern browser leveraging HTML, SVG, and CSS. It delivers dynamic, powerful, interactive visualizations by manipulating the DOM, the Document Object Model. The Python community could not wait to integrate D3 with Matplotlib. Under the impulse of Jake Vanderplas, mpld3 was created with the aim of bringing Matplotlib to the browser. Example graphics are hosted at the following address: http://mpld3.github.io/index.html.
Bokeh aims to deliver high-performance interactivity over very large or streaming datasets whilst leveraging lots of the concepts of D3.js without the burden of writing some intimidating JavaScript and CSS code. Bokeh delivers dynamic visualizations on the browser with or without a server. It integrates seamlessly with Matplotlib, Seaborn, and ggplot, and renders beautifully in IPython notebooks or Jupyter notebooks. Bokeh is actively developed by the team at Continuum.io and is an integral part of the Anaconda Python data stack.

Bokeh server provides a full-fledged, dynamic plotting engine that materializes a reactive scene graph from JSON. It uses web sockets to keep state and update the HTML5 canvas using Backbone.js and Coffee-script under the hood. Bokeh, as it is fueled by data in JSON, creates easy bindings for other languages such as R, Scala, and Julia.

This gives a high-level overview of the main plotting and visualization libraries. It is not exhaustive. Let's move to concrete examples of visualizations.
Preprocessing the data for visualization

Before jumping into the visualizations, we will do some preparatory work on the data harvested:

In [16]:
# Read harvested data stored in csv in a Pandas DF
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')
In [20]:
print('tweets pandas dataframe - count:', pddf_in.count())
print('tweets pandas dataframe - shape:', pddf_in.shape)
print('tweets pandas dataframe - colns:', pddf_in.columns)
('tweets pandas dataframe - count:', Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64)
('tweets pandas dataframe - shape:', (7540, 6))
('tweets pandas dataframe - colns:', Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object'))
For the purpose of our visualization activity, we will use a dataset of 7,540 tweets. The key information is stored in the tweet_text column. We preview the data stored in the dataframe by calling the head() function on the dataframe:

In [21]:
pddf_in.head()
Out[21]:
   Unnamed: 0                  id                      created_at     user_id      user_name                                         tweet_text
0           0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1           1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2           2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3           3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4           4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...
We will now create some utility functions to clean up the tweet text and parse the Twitter date. First, we import the Python regular expression library re and the time library to parse dates and times:

In [72]:
import re
import time

We create a dictionary of regexes that will be compiled and then passed as functions:

RT: The first regex with key RT looks for the keyword RT at the beginning of the tweet text:
re.compile(r'^RT'),

ALNUM: The second regex with key ALNUM looks for words including alphanumeric characters and the underscore sign, preceded by the @ symbol, in the tweet text:
re.compile(r'(@[a-zA-Z0-9_]+)'),

HASHTAG: The third regex with key HASHTAG looks for words including alphanumeric characters preceded by the # symbol in the tweet text:
re.compile(r'(#[\w\d]+)'),

SPACES: The fourth regex with key SPACES looks for blank or line space characters in the tweet text:
re.compile(r'\s+'),

URL: The fifth regex with key URL looks for URL addresses, including alphanumeric characters preceded by the https:// or http:// markers, in the tweet text (note that the leading square brackets form a character class, so this pattern only loosely matches the URL scheme):
re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')

In [24]:
regexp = {"RT": "^RT", "ALNUM": r"(@[a-zA-Z0-9_]+)",
          "HASHTAG": r"(#[\w\d]+)",
          "URL": r"([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)",
          "SPACES": r"\s+"}
regexp = dict((key, re.compile(value)) for key, value in regexp.items())
In [25]:
regexp
Out[25]:
{'ALNUM': re.compile(r'(@[a-zA-Z0-9_]+)'),
 'HASHTAG': re.compile(r'(#[\w\d]+)'),
 'RT': re.compile(r'^RT'),
 'SPACES': re.compile(r'\s+'),
 'URL': re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')}
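As a quick sanity check, we can try the compiled patterns on a made-up tweet. The sample text below is purely illustrative and not taken from the harvested dataset:

```python
import re

# Hypothetical sample tweet, for illustration only
sample = "RT @ApacheSpark: Real-time streaming with #Spark http://t.co/abc123"

regexp = {"RT": r"^RT", "ALNUM": r"(@[a-zA-Z0-9_]+)", "HASHTAG": r"(#[\w\d]+)"}
regexp = dict((key, re.compile(value)) for key, value in regexp.items())

print(re.search(regexp["RT"], sample) is not None)   # True: the tweet is a retweet
print(re.findall(regexp["ALNUM"], sample))           # ['@ApacheSpark']
print(re.findall(regexp["HASHTAG"], sample))         # ['#Spark']
```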
We create a utility function to identify whether a tweet is a retweet or an original tweet:

In [77]:
def getAttributeRT(tweet):
    """see if tweet is a RT"""
    return re.search(regexp["RT"], tweet.strip()) != None

Then, we extract all user handles in a tweet:

def getUserHandles(tweet):
    """given a tweet we try and extract all user handles"""
    return re.findall(regexp["ALNUM"], tweet)

We also extract all hashtags in a tweet:

def getHashtags(tweet):
    """return all hashtags"""
    return re.findall(regexp["HASHTAG"], tweet)

Extract all URL links in a tweet as follows:

def getURLs(tweet):
    """URL: [http://]?[\w\.?/]+"""
    return re.findall(regexp["URL"], tweet)

We strip all URL links and user handles preceded by the @ sign from the tweet text. This function will be the basis of the word cloud we will build soon:

def getTextNoURLsUsers(tweet):
    """return parsed text terms stripped of URLs and user names in tweet text
    ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x).split())"""
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|(RT)", " ", tweet).lower().split())

We label the data so we can create groups of datasets for the word cloud:

def setTag(tweet):
    """set tags to tweet_text based on search terms from tags_list"""
    tags_list = ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']
    lower_text = tweet.lower()
    return filter(lambda x: x.lower() in lower_text, tags_list)

We parse the Twitter date into the yyyy-mm-dd hh:mm:ss format:

def decode_date(s):
    """parse Twitter date into format yyyy-mm-dd hh:mm:ss"""
    return time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(s, '%a %b %d %H:%M:%S +0000 %Y'))
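As a quick illustration of what decode_date does, the Twitter timestamp format seen in the created_at column round-trips through time.strptime and time.strftime like this:

```python
import time

def decode_date(s):
    """Parse a Twitter date into the yyyy-mm-dd hh:mm:ss format."""
    return time.strftime('%Y-%m-%d %H:%M:%S',
                         time.strptime(s, '%a %b %d %H:%M:%S +0000 %Y'))

# Timestamp taken from the dataframe preview above
print(decode_date('Tue Sep 01 21:46:57 +0000 2015'))  # 2015-09-01 21:46:57
```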
We preview the data prior to processing:

In [43]:
pddf_in.columns
Out[43]:
Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object')
In [45]:
# df.drop([Column Name or list], inplace=True, axis=1)
pddf_in.drop(['Unnamed: 0'], inplace=True, axis=1)
In [46]:
pddf_in.head()
Out[46]:
                   id                      created_at     user_id      user_name                                         tweet_text
0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...
We create new dataframe columns by applying the utility functions described above: one column each for the hashtags, the user handles, the URLs, the text terms stripped of URLs and unwanted characters, and the labels. Finally, we parse the date:

In [82]:
pddf_in['htag'] = pddf_in.tweet_text.apply(getHashtags)
pddf_in['user_handles'] = pddf_in.tweet_text.apply(getUserHandles)
pddf_in['urls'] = pddf_in.tweet_text.apply(getURLs)
pddf_in['txt_terms'] = pddf_in.tweet_text.apply(getTextNoURLsUsers)
pddf_in['search_grp'] = pddf_in.tweet_text.apply(setTag)
pddf_in['date'] = pddf_in.created_at.apply(decode_date)
The following code gives a quick snapshot of the newly generated dataframe:

In [83]:
pddf_in[2200:2210]
Out[83]:
      id  created_at  user_id  user_name  tweet_text  htag  urls  ptxt  tgrp  date  user_handles  txt_terms  search_grp
2200  638242693374681088  Mon Aug 31 06:51:30 +0000 2015  19525954  CENATIC  El impacto de @ApacheSpark en el procesamiento...  [#sparkSpecial]  [://t.co/4PQmJNuEJB]  el impacto de en el procesamiento de datos y e...  [spark]  2015-08-31 06:51:30  [@ApacheSpark]  el impacto de en el procesamiento de datos y e...  [spark]
2201  638238014695575552  Mon Aug 31 06:32:55 +0000 2015  51115854  Nawfal  Real Time Streaming with Apache Spark\nhttp://...  [#IoT, #SmartMelboune, #BigData, #Apachespark]  [://t.co/GW5PaqwVab]  real time streaming with apache spark iot smar...  [spark]  2015-08-31 06:32:55  []  real time streaming with apache spark iot smar...  [spark]
2202  638236084124516352  Mon Aug 31 06:25:14 +0000 2015  62885987  Mithun Katti  RT @differentsachin: Spark the flame of digita...  [#IBMHackathon, #SparkHackathon, #ISLconnectIN...  []  spark the flame of digital india ibm hackathon...  [spark]  2015-08-31 06:25:14  [@differentsachin, @ApacheSpark]  spark the flame of digital india ibm hackathon...  [spark]
2203  638234734649176064  Mon Aug 31 06:19:53 +0000 2015  140462395  solaimurugan v  Installing @ApacheMahout with @ApacheSpark 1.4...  []  [1.4.1, ://t.co/3c5dGbfaZe.]  installing with 141 got many more issue whil...  [spark]  2015-08-31 06:19:53  [@ApacheMahout, @ApacheSpark]  installing with 141 got many more issue whil...  [spark]
2204  638233517307072512  Mon Aug 31 06:15:02 +0000 2015  2428473836  Ralf Heineke  RT @RomeoKienzler: Join me @velocityconf on #m...  [#machinelearning, #devOps, #Bl]  [://t.co/U5xL7pYEmF]  join me on machine learning based devops operat...  [spark]  2015-08-31 06:15:02  [@RomeoKienzler, @velocityconf, @ApacheSpark]  join me on machine learning based devops operat...  [spark]
2205  638230184848687106  Mon Aug 31 06:01:48 +0000 2015  289355748  Akim Boyko  RT @databricks: Watch live today at 10am PT is...  []  [1.5, ://t.co/16cix6ASti]  watch live today at 10am pt is 15 presented b...  [spark]  2015-08-31 06:01:48  [@databricks, @ApacheSpark, @databricks, @pwen...  watch live today at 10am pt is 15 presented b...  [spark]
2206  638227830443110400  Mon Aug 31 05:52:27 +0000 2015  145001241  sachin aggarwal  Spark the flame of digital India @ #IBMHackath...  [#IBMHackathon, #SparkHackathon, #ISLconnectIN...  [://t.co/C1AO3uNexe]  spark the flame of digital india ibm hackathon...  [spark]  2015-08-31 05:52:27  [@ApacheSpark]  spark the flame of digital india ibm hackathon...  [spark]
2207  638227031268810752  Mon Aug 31 05:49:16 +0000 2015  145001241  sachin aggarwal  RT @pravin_gadakh: Imagine, innovate and Igni...  [#IBMHackathon, #ISLconnectIN2015]  []  gadakh imagine innovate and ignite digital ind...  [spark]  2015-08-31 05:49:16  [@pravin_gadakh, @ApacheSpark]  gadakh imagine innovate and ignite digital ind...  [spark]
2208  638224591920336896  Mon Aug 31 05:39:35 +0000 2015  494725634  IBM Asia Pacific  RT @sachinparmar: Passionate about Spark?? Hav...  [#IBMHackathon, #ISLconnectIN]  [India..]  passionate about spark have dreams of clean sa...  [spark]  2015-08-31 05:39:35  [@sachinparmar]  passionate about spark have dreams of clean sa...  [spark]
2209  638223327467692032  Mon Aug 31 05:34:33 +0000 2015  3158070968  Open Source India  "Game Changer" #ApacheSpark speeds up #bigdata...  [#ApacheSpark, #bigdata]  [://t.co/ieTQ9ocMim]  game changer apache spark speeds up big data pro...  [spark]  2015-08-31 05:34:33  []  game changer apache spark speeds up big data pro...  [spark]
We save the processed information in CSV format. We have 7,540 records and 13 columns. In your case, the output will vary according to the dataset you chose:

In [84]:
f_name = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweets_processed.csv'
pddf_in.to_csv(f_name, sep=';', encoding='utf-8', index=False)
In [85]:
pddf_in.shape
Out[85]:
(7540, 13)
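One caveat worth noting: when the list-valued columns (htag, urls, user_handles, and so on) are written to CSV, pandas stores them as their string representation. A minimal sketch, assuming a stored cell looks like the previews above, of how the standard library's ast.literal_eval can restore the lists on reload:

```python
import ast

# Hypothetical cell value as it would appear in the saved CSV
cell = "['#IBMHackathon', '#SparkHackathon']"

tags = ast.literal_eval(cell)
print(tags)     # ['#IBMHackathon', '#SparkHackathon']
print(tags[0])  # #IBMHackathon
```

On reload, such a conversion can be applied column-wise, for example with pd.read_csv(..., converters={'htag': ast.literal_eval}).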
Gauging words, moods, and memes at a glance

We are now ready to proceed with building the word clouds, which will give us a sense of the important words carried in those tweets. We will create word clouds for the datasets harvested. Word clouds extract the top words in a list of words and create a scatter plot of the words, where the size of each word is correlated with its frequency: the more frequent a word is in the dataset, the bigger the font size in the word cloud rendering. The datasets include three very different themes and two competing or analogous entities each. Our first theme is obviously data processing and analytics, with Apache Spark and Python as our entities. Our second theme is the 2016 presidential election campaign, with the two contenders: Hilary Clinton and Donald Trump. Our last theme is the world of pop music, with Justin Bieber and Lady Gaga as the two exponents.
Setting up wordcloud

We will illustrate the programming steps by analyzing the Spark-related tweets. We load the data and preview the dataframe:

In [21]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets.csv'
tspark_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')
In [3]:
tspark_df.head(3)
Out[3]:
   id  created_at  user_id  user_name  tweet_text  htag  urls  ptxt  tgrp  date  user_handles  txt_terms  search_grp
0  638818911773856000  Tue Sep 01 21:01:11 +0000 2015  2511247075  Noor Din  RT @kdnuggets: R leads RapidMiner, Python catc...  [#KDN]  [://t.co/3bsaTT7eUs]  r leads rapidminer python catches up big data...  [spark, python]  2015-09-01 21:01:11  [@kdnuggets]  r leads rapidminer python catches up big data...  [spark, python]
1  622142176768737000  Fri Jul 17 20:33:48 +0000 2015  24537879  IBM Cloudant  Be one of the first to sign-up for IBM Analyti...  [#ApacheSpark, #SparkInsight]  [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]  be one of the first to sign up for ibm analyti...  [spark]  2015-07-17 20:33:48  []  be one of the first to sign up for ibm analyti...  [spark]
2  622140453069169000  Fri Jul 17 20:26:57 +0000 2015  515145898  Arno Candel  Nice article on #apachespark, #hadoop and #dat...  [#apachespark, #hadoop, #datascience]  [://t.co/IyF44pV0f3]  nice article on apachespark hadoop and data sci...  [spark]  2015-07-17 20:26:57  [@h2oai]  nice article on apachespark hadoop and data sci...  [spark]
Note: The wordcloud library we will use is the one developed by Andreas Mueller and hosted on his GitHub account at https://github.com/amueller/word_cloud.

The library requires PIL (short for Python Imaging Library). PIL is easily installable by invoking conda install pil. PIL is a complex library to install and has not yet been ported to Python 3.4, so we need to run a Python 2.7+ environment to be able to see our word cloud:

#
# Install PIL (does not work with Python 3.4)
#
an@an-VB:~$ conda install pil

Fetching package metadata: ....
Solving package specifications: ..................
Package plan for installation in environment /home/an/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libpng-1.6.17              |                0         214 KB
    freetype-2.5.5             |                0         2.2 MB
    conda-env-2.4.4            |           py27_0          24 KB
    pil-1.1.7                  |           py27_2         650 KB
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following packages will be UPDATED:

    conda-env: 2.4.2-py27_0 --> 2.4.4-py27_0
    freetype:  2.5.2-0      --> 2.5.5-0
    libpng:    1.5.13-1     --> 1.6.17-0
    pil:       1.1.7-py27_1 --> 1.1.7-py27_2

Proceed ([y]/n)? y
Next, we install the wordcloud library:

#
# Install wordcloud
# Andreas Mueller
# https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py
#
an@an-VB:~$ pip install wordcloud
Collecting wordcloud
  Downloading wordcloud-1.1.3.tar.gz (163kB)
    100% |████████████████████████████████| 163kB 548kB/s
Building wheels for collected packages: wordcloud
  Running setup.py bdist_wheel for wordcloud
  Stored in directory: /home/an/.cache/pip/wheels/32/a9/74/58e379e5dc614bfd9dd9832d67608faac9b2bc6c194d6f6df5
Successfully built wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.1.3
Creating wordclouds

At this stage, we are ready to invoke the word cloud program with the generated list of terms from the tweet text. Let's get started with the word cloud program by first calling %matplotlib inline to display the word cloud in our notebook:

In [4]:
%matplotlib inline

We convert the dataframe txt_terms column into a list of words. We make sure it is all converted into the str type to avoid any bad surprises, and check the list's first four records:

In [11]:
len(tspark_df['txt_terms'].tolist())
Out[11]:
2024
In [22]:
tspark_ls_str = [str(t) for t in tspark_df['txt_terms'].tolist()]
In [14]:
len(tspark_ls_str)
Out[14]:
2024
In [15]:
tspark_ls_str[:4]
Out[15]:
['r leads rapidminer python catches up big data tools grow spark ignites kdn',
 'be one of the first to sign up for ibm analytics for apache spark today sparkinsight',
 'nice article on apache spark hadoop and data science',
 'spark 101 running spark and mapreduce together in production hadoopsummit2015 apache spark altiscale']
We first call the Matplotlib and the wordcloud libraries:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

From the input list of terms, we create a unified string of terms separated by whitespace as the input to the word cloud program. The word cloud program removes stopwords:

# join tweets to a single string
words = ' '.join(tspark_ls_str)

# create wordcloud
wordcloud = WordCloud(
    # remove stopwords
    stopwords=STOPWORDS,
    background_color='black',
    width=1800,
    height=1400
).generate(words)

# render wordcloud image
plt.imshow(wordcloud)
plt.axis('off')

# save wordcloud image on disk
plt.savefig('./spark_tweets_wordcloud_1.png', dpi=300)

# display image in Jupyter notebook
plt.show()
Here, we can visualize the word clouds for Apache Spark and Python. Clearly, in the case of Spark, Hadoop, big data, and analytics are the memes, while Python recalls the root of its name, Monty Python, with a strong focus on developer, apache spark, and programming, with some hints at java and ruby.

We can also get a glimpse in the following word clouds of the words preoccupying the North American 2016 presidential election candidates: Hilary Clinton and Donald Trump. Seemingly, Hilary Clinton is overshadowed by the presence of her opponents Donald Trump and Bernie Sanders, while Trump is heavily centered only on himself.

Interestingly, in the case of Justin Bieber and Lady Gaga, the word love appears. In the case of Bieber, follow and belieber are key words, while diet, weight loss, and fashion are the preoccupations of the Lady Gaga crowd.
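The word frequencies behind a word cloud can also be inspected without rendering anything. A minimal sketch using the standard library's collections.Counter on a few of the cleaned tweet-term strings shown earlier (the list is abbreviated here purely for illustration):

```python
from collections import Counter

# A few cleaned tweet-term strings, mirroring the txt_terms column above
tweets = ['r leads rapidminer python catches up big data',
          'nice article on apache spark hadoop and data science',
          'spark 101 running spark and mapreduce together in production']

# Tokenize, drop very short tokens, and count occurrences
counts = Counter(w for t in tweets for w in t.split() if len(w) > 2)
print(counts.most_common(1))  # [('spark', 3)]
```

This is essentially what the word cloud does before mapping counts to font sizes, minus the stopword filtering handled by the STOPWORDS set.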
Geo-locating tweets and mapping meetups

Now, we will dive into the creation of interactive maps with Bokeh. First, we create a world map where we geo-locate sample tweets and, on moving our mouse over these locations, we can see the users and their respective tweets in a hover box.

The second map is focused on mapping upcoming meetups in London. It could be an interactive map that would act as a reminder of the date, time, and location of upcoming meetups in a specific city.

Geo-locating tweets

The objective is to create a world map scatter plot of the locations of important tweets, where the tweets and authors are revealed on hovering over these points. We will go through three steps to build this interactive visualization:

1. Create the background world map by first loading a dictionary of all the world country boundaries, defined by their respective longitudes and latitudes.
2. Load the important tweets we wish to geo-locate, with their respective coordinates and authors.
3. Finally, scatter plot the tweet coordinates on the world map and activate the hover tool to interactively visualize the tweets and authors on the highlighted dots on the map.

In step one, we create a Python dictionary called data that will contain all the world country boundaries with their respective latitudes and longitudes:
In [4]:
#
# This module exposes geometry data for World Country Boundaries.
#
import csv
import codecs
import gzip
import xml.etree.cElementTree as et
import os
from os.path import dirname, join

nan = float('NaN')
__file__ = os.getcwd()

data = {}
with gzip.open(join(dirname(__file__), 'AN_Spark/data/World_Country_Boundaries.csv.gz')) as f:
    decoded = codecs.iterdecode(f, "utf-8")
    next(decoded)
    reader = csv.reader(decoded, delimiter=',', quotechar='"')
    for row in reader:
        geometry, code, name = row
        xml = et.fromstring(geometry)
        lats = []
        lons = []
        for i, poly in enumerate(xml.findall('.//outerBoundaryIs/LinearRing/coordinates')):
            if i > 0:
                lats.append(nan)
                lons.append(nan)
            coords = (c.split(',')[:2] for c in poly.text.split())
            lat, lon = list(zip(*[(float(lat), float(lon)) for lon, lat in coords]))
            lats.extend(lat)
            lons.extend(lon)
        data[code] = {
            'name': name,
            'lats': lats,
            'lons': lons,
        }
In [5]:
len(data)
Out[5]:
235
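The KML-style parsing inside the loop above can be illustrated on a single polygon. The geometry string below is a hypothetical fragment, not taken from the actual boundaries dataset:

```python
import xml.etree.ElementTree as et  # plain ElementTree behaves the same as cElementTree here

# Hypothetical KML fragment for one country polygon
geometry = """<Polygon><outerBoundaryIs><LinearRing>
<coordinates>-0.1,51.5,0 -0.2,51.6,0 -0.3,51.4,0</coordinates>
</LinearRing></outerBoundaryIs></Polygon>"""

xml = et.fromstring(geometry)
poly = xml.find('.//outerBoundaryIs/LinearRing/coordinates')
# Each whitespace-separated token is "lon,lat,alt"; keep only lon and lat
coords = (c.split(',')[:2] for c in poly.text.split())
lats, lons = zip(*[(float(lat), float(lon)) for lon, lat in coords])
print(lats)  # (51.5, 51.6, 51.4)
print(lons)  # (-0.1, -0.2, -0.3)
```

Note that KML stores coordinates as longitude first, which is why the loop unpacks each pair as lon, lat before swapping them into the lats and lons lists.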
In step two, we load a sample set of important tweets that we wish to visualize, with their respective geo-location information:

In [8]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets_20.csv'
t20_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')
In [9]:
t20_df.head(3)
Out[9]:
   id  created_at  user_id  user_name  tweet_text  htag  urls  ptxt  tgrp  date  user_handles  txt_terms  search_grp  lat  lon
0  638818911773856000  Tue Sep 01 21:01:11 +0000 2015  2511247075  Noor Din  RT @kdnuggets: R leads RapidMiner, Python catc...  [#KDN]  [://t.co/3bsaTT7eUs]  r leads rapidminer python catches up big data...  [spark, python]  2015-09-01 21:01:11  [@kdnuggets]  r leads rapidminer python catches up big data...  [spark, python]  37.279518  -121.867905
1  622142176768737000  Fri Jul 17 20:33:48 +0000 2015  24537879  IBM Cloudant  Be one of the first to sign-up for IBM Analyti...  [#ApacheSpark, #SparkInsight]  [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]  be one of the first to sign up for ibm analyti...  [spark]  2015-07-17 20:33:48  []  be one of the first to sign up for ibm analyti...  [spark]  37.774930  -122.419420
2  622140453069169000  Fri Jul 17 20:26:57 +0000 2015  515145898  Arno Candel  Nice article on #apachespark, #hadoop and #dat...  [#apachespark, #hadoop, #datascience]  [://t.co/IyF44pV0f3]  nice article on apachespark hadoop and data sci...  [spark]  2015-07-17 20:26:57  [@h2oai]  nice article on apachespark hadoop and data sci...  [spark]  51.500130  -0.126305
In [98]:
len(t20_df.user_id.unique())
Out[98]:
19
In [17]:
t20_geo = t20_df[['date', 'lat', 'lon', 'user_name', 'tweet_text']]
In [24]:
t20_geo.rename(columns={'user_name': 'user', 'tweet_text': 'text'}, inplace=True)
In [25]:
t20_geo.head(4)
Out[25]:
                  date        lat         lon                user                                               text
0  2015-09-01 21:01:11  37.279518 -121.867905            Noor Din  RT @kdnuggets: R leads RapidMiner, Python catc...
1  2015-07-17 20:33:48  37.774930 -122.419420        IBM Cloudant  Be one of the first to sign-up for IBM Analyti...
2  2015-07-17 20:26:57  51.500130   -0.126305         Arno Candel  Nice article on #apachespark, #hadoop and #dat...
3  2015-07-17 19:35:31  51.500130   -0.126305  Ira Michael Blonder  Spark 101: Running Spark and #MapReduce togeth...
In [22]:
df = t20_geo
In step three, we first import all the necessary Bokeh libraries. We will instantiate the output in the Jupyter Notebook. We get the world country boundary information loaded. We get the geo-located tweet data. We instantiate the Bokeh interactive tools, such as the wheel and box zoom as well as the hover tool:

In [29]:
#
# Bokeh Visualization of tweets on world map
#
from bokeh.plotting import *
from bokeh.models import HoverTool, ColumnDataSource
from collections import OrderedDict

# Output in Jupyter Notebook
output_notebook()

# Get the world map
world_countries = data.copy()

# Get the tweet data
tweets_source = ColumnDataSource(df)

# Create world map
countries_source = ColumnDataSource(data=dict(
    countries_xs=[world_countries[code]['lons'] for code in world_countries],
    countries_ys=[world_countries[code]['lats'] for code in world_countries],
    country=[world_countries[code]['name'] for code in world_countries],
))

# Instantiate the Bokeh interactive tools
TOOLS = "pan,wheel_zoom,box_zoom,reset,resize,hover,save"
We are now ready to layer the various elements gathered into an object figure called p. Define the title, width, and height of p. Attach the tools. Create the world map background from patches with a light background color and borders. Scatter plot the tweets according to their respective geo-coordinates. Then, activate the hover tool with the users and their respective tweets. Finally, render the picture in the browser. The code is as follows:

# Instantiate the figure object
p = figure(
    title="%s tweets" % (str(len(df.index))),
    title_text_font_size="20pt",
    plot_width=1000,
    plot_height=600,
    tools=TOOLS)

# Create world patches background
p.patches(xs="countries_xs", ys="countries_ys", source=countries_source,
          fill_color="#F1EEF6", fill_alpha=0.3,
          line_color="#999999", line_width=0.5)

# Scatter plots by longitude and latitude
p.scatter(x="lon", y="lat", source=tweets_source, fill_color="#FF0000",
          line_color="#FF0000")

# Activate hover tool with user and corresponding tweet information
hover = p.select(dict(type=HoverTool))
hover.point_policy = "follow_mouse"
hover.tooltips = OrderedDict([
    ("user", "@user"),
    ("tweet", "@text"),
])

# Render the figure on the browser
show(p)
BokehJS successfully loaded.
The following screenshot gives an overview of the world map, with the red dots representing the locations of the tweets' origins:
We can hover over a specific dot to reveal the tweets in that location:
We can zoom into a specific location:
Finally, we can reveal the tweets in the given zoomed-in location:
Displaying upcoming meetups on Google Maps

Now, our objective is to focus on upcoming meetups in London. We are mapping three meetups: Data Science London, Apache Spark, and Machine Learning. We embed a Google Map within a Bokeh visualization, geo-locate the three meetups according to their coordinates, and get information such as the name of the upcoming event for each meetup with a hover tool.

First, import all the necessary Bokeh libraries:

In []:
#
# Bokeh Google Map Visualization of London with hover on specific points
#
from __future__ import print_function
from bokeh.browserlib import view
from bokeh.document import Document
from bokeh.embed import file_html
from bokeh.models.glyphs import Circle
from bokeh.models import (
    GMapPlot, Range1d, ColumnDataSource,
    PanTool, WheelZoomTool, BoxSelectTool,
    HoverTool, ResetTool,
    BoxSelectionOverlay, GMapOptions)
from bokeh.resources import INLINE

x_range = Range1d()
y_range = Range1d()
We will instantiate the Google Map that will act as the substrate upon which our Bokeh visualization will be layered:

# JSON style string taken from: https://snazzymaps.com/style/1/pale-dawn
map_options = GMapOptions(lat=51.50013, lng=-0.126305, map_type="roadmap", zoom=13, styles="""
[{"featureType":"administrative","elementType":"all","stylers":[{"visibility":"on"},{"lightness":33}]},
 {"featureType":"landscape","elementType":"all","stylers":[{"color":"#f2e5d4"}]},
 {"featureType":"poi.park","elementType":"geometry","stylers":[{"color":"#c5dac6"}]},
 {"featureType":"poi.park","elementType":"labels","stylers":[{"visibility":"on"},{"lightness":20}]},
 {"featureType":"road","elementType":"all","stylers":[{"lightness":20}]},
 {"featureType":"road.highway","elementType":"geometry","stylers":[{"color":"#c5c6c6"}]},
 {"featureType":"road.arterial","elementType":"geometry","stylers":[{"color":"#e4d7c6"}]},
 {"featureType":"road.local","elementType":"geometry","stylers":[{"color":"#fbfaf7"}]},
 {"featureType":"water","elementType":"all","stylers":[{"visibility":"on"},{"color":"#acbcc9"}]}]
""")
Instantiate the Bokeh object plot from the class GMapPlot, with the dimensions and map options from the previous step:

# Instantiate Google Map Plot
plot = GMapPlot(
    x_range=x_range, y_range=y_range,
    map_options=map_options,
    title="London Meetups"
)

Bring in the information on the three meetups we wish to plot, to be revealed on hovering above the respective coordinates:

source = ColumnDataSource(
    data=dict(
        lat=[51.49013, 51.50013, 51.51013],
        lon=[-0.130305, -0.126305, -0.120305],
        fill=['orange', 'blue', 'green'],
        name=['London Data Science', 'Spark', 'Machine Learning'],
        text=['Graph Data & Algorithms', 'Spark Internals', 'Deep Learning on Spark']
    )
)

Define the dots to be drawn on the Google Map:

circle = Circle(x="lon", y="lat", size=15, fill_color="fill", line_color=None)
plot.add_glyph(source, circle)
Define the set of Bokeh tools to be used in this visualization:

# TOOLS = "pan,wheel_zoom,box_zoom,reset,hover,save"
pan = PanTool()
wheel_zoom = WheelZoomTool()
box_select = BoxSelectTool()
reset = ResetTool()
hover = HoverTool()
# save = SaveTool()

plot.add_tools(pan, wheel_zoom, box_select, reset, hover)
overlay = BoxSelectionOverlay(tool=box_select)
plot.add_layout(overlay)

Activate the hover tool with the information that will be carried:

hover = plot.select(dict(type=HoverTool))
hover.point_policy = "follow_mouse"
hover.tooltips = OrderedDict([
    ("Name", "@name"),
    ("Text", "@text"),
    ("(Long, Lat)", "(@lon, @lat)"),
])

show(plot)
Render the plot, which gives a pretty good view of London:
Once we hover over a highlighted dot, we can get the information for the given meetup:
Full smooth zooming capability is preserved, as the following screenshot shows:

Summary

In this chapter, we focused on a few visualization techniques. We saw how to build word clouds and their intuitive power to reveal, at a glance, many of the key words, moods, and memes carried through thousands of tweets.

We then discussed interactive mapping visualizations using Bokeh. We built a world map from the ground up and created a scatter plot of critical tweets. Once the map was rendered in the browser, we could interactively hover from dot to dot and reveal the tweets originating from different parts of the world.

Our final visualization focused on mapping upcoming meetups in London on Spark, data science, and machine learning, with their respective topics, making a beautiful interactive visualization with an actual Google Map.
IndexA
AmazonWebServices(AWS)apps,deployingwith/DeployingappsinAmazonWebServicesabout/DeployingappsinAmazonWebServices
Anacondadefining/UnderstandingAnaconda
AnacondaInstallerURL/InstallingAnacondawithPython2.7
AnacondastackAnaconda/UnderstandingAnacondaConda/UnderstandingAnacondaNumba/UnderstandingAnacondaBlaze/UnderstandingAnacondaBokeh/UnderstandingAnacondaWakari/UnderstandingAnaconda
analyticslayer/AnalyticslayerApacheKafka
about/SettingupKafkaproperties/SettingupKafka
ApacheSparkabout/DisplayingupcomingmeetupsonGoogleMaps
APIs(ApplicationProgrammingInterface)about/Connectingtosocialnetworks
apppreviewing/Previewingourapp
appsdeploying,withAmazonWebServices(AWS)/DeployingappsinAmazonWebServices
architecture,data-intensiveapplicationsabout/Understandingthearchitectureofdata-intensiveapplicationsinfrastructurelayer/Infrastructurelayerpersistencelayer/Persistencelayerintegrationlayer/Integrationlayeranalyticslayer/Analyticslayerengagementlayer/Engagementlayer
AsynchronousJavaScript(AJAX)about/ProcessinglivedatawithTCPsockets
AWSconsoleURL/DeployingappsinAmazonWebServices
BBigData,withApacheSpark
references/VirtualizingtheenvironmentwithVagrantBlaze
used,forexploringdata/ExploringdatausingBlazeBSON(BinaryJSON)
about/SettingupMongoDB
CCatalyst
about/ExploringdatausingSparkSQLChef
about/InfrastructurelayerClustering
K-Means/SupervisedandunsupervisedlearningGaussianMixture/SupervisedandunsupervisedlearningPowerIterationClustering(PIC)/SupervisedandunsupervisedlearningLatentDirichletAllocation(LDA)/Supervisedandunsupervisedlearning
Clustermanagerabout/TheResilientDistributedDataset
comma-separatedvalues(CSV)about/Harvestingandstoringdata
ContinuumURL/UnderstandingAnaconda
Couchbaseabout/Persistencelayer
DD3.js
about/Revisitingthedata-intensiveappsarchitectureURL/Revisitingthedata-intensiveappsarchitecture
DAG(DirectedAcyclicGraph)about/TheResilientDistributedDataset,Serializinganddeserializingdata
dataserializing/Serializinganddeserializingdatadeserializing/Serializinganddeserializingdataharvesting/Harvestingandstoringdatastoring/Harvestingandstoringdatapersisting,inCSV/PersistingdatainCSVpersisting,inJSON/PersistingdatainJSONMongoDB,settingup/SettingupMongoDB,harvestingfromTwitter/HarvestingdatafromTwitterexploring,Blazeused/ExploringdatausingBlazetransferring,Odoused/TransferringdatausingOdoexploring,SparkSQLused/ExploringdatausingSparkSQLpre-processing,forvisualization/Preprocessingthedataforvisualization
data-intensiveappsarchitecting/Architectingdata-intensiveappslatency/Architectingdata-intensiveappsscalability/Architectingdata-intensiveappsfaulttolerance/Architectingdata-intensiveappsflexibility/Architectingdata-intensiveappsdataatrest,processing/Processingdataatrestdatainmotion,processing/Processingdatainmotiondata,exploring/Exploringdatainteractively
data-intensiveappsarchitectureabout/Revisitingthedata-intensiveappsarchitecture
dataanalysisdefining/AnalyzingthedataTweetsanatomy,discovering/Discoveringtheanatomyoftweets
DataDrivenDocuments(D3)about/Revisitingthedata-intensiveappsarchitecture
dataflowsabout/Machinelearningworkflowsanddataflows
dataintensiveappsarchitecturedefining/Revisitingthedata-intensiveapparchitecture
datalifecycleConnect/IntegrationlayerCorrect/IntegrationlayerCollect/Integrationlayer
Compose/IntegrationlayerConsume/IntegrationlayerControl/Integrationlayer
DataScienceLondonabout/DisplayingupcomingmeetupsonGoogleMaps
datatypes,SparkMLliblocalvector/SparkMLlibdatatypeslabeledpoint/SparkMLlibdatatypeslocalmatrix/SparkMLlibdatatypesdistributedmatrix/SparkMLlibdatatypes
DecisionTreesabout/Supervisedandunsupervisedlearning
DimensionalityReductionSingularValueDecomposition(SVD)/SupervisedandunsupervisedlearningPrincipalComponentAnalysis(PCA)/Supervisedandunsupervisedlearning
Dockerabout/Infrastructurelayerenvironment,virtualizingwith/VirtualizingtheenvironmentwithDockerreferences/VirtualizingtheenvironmentwithDocker
DStream(DiscretizedStream)defining/GoingunderthehoodofSparkStreaming
Eelements,Flume
Event/ExploringflumeClient/ExploringflumeSource/ExploringflumeSink/ExploringflumeChannel/Exploringflume
engagementlayer/EngagementlayerEnsemblesoftrees
about/Supervisedandunsupervisedlearningenvironment
virtualizing,withVagrant/VirtualizingtheenvironmentwithVagrantvirtualizing,withDocker/VirtualizingtheenvironmentwithDocker
FFirstApp
building,withPySpark/BuildingourfirstappwithPySparkFlume
about/Exploringflumeadvantages/Exploringflumeelements/Exploringflume
Gggplot
about/Revisitingthedata-intensiveappsarchitectureURL/Revisitingthedata-intensiveappsarchitecture
GitHubURL/GettingGitHubdataabout/ExploringtheGitHubworldoperating,withMeetupAPI/UnderstandingthecommunitythroughMeetup
GoogleMapsupcomingmeetups,displayingon/DisplayingupcomingmeetupsonGoogleMaps
HHadoopMongoDBconnector
URL/QueryingMongoDBfromSparkSQLHbaseandCassandra
about/PersistencelayerHDFS(HadoopDistributedFileSystem)
about/UnderstandingSpark
Iinfrastructurelayer/InfrastructurelayerIngestMode
BatchDataTransport/BuildingareliableandscalablestreamingappMicroBatch/BuildingareliableandscalablestreamingappPipelining/BuildingareliableandscalablestreamingappMessageQueue/Buildingareliableandscalablestreamingapp
integrationlayer/Integrationlayer
JJava8
installing/InstallingJava8JRE(JavaRuntimeEnvironment)
about/InstallingJava8JSON(JavaScriptObjectNotation)
about/Connectingtosocialnetworks,Harvestingandstoringdata
KKafka
settingup/SettingupKafkainstalling/InstallingandtestingKafkatesting/InstallingandtestingKafkaURL/InstallingandtestingKafkaproducers,developing/Developingproducersconsumers,developing/DevelopingconsumersSparkStreamingconsumer,developingfor/DevelopingaSparkStreamingconsumerforKafka
Kappaarchitecturedefining/ClosingremarksontheLambdaandKappaarchitecture,UnderstandingKappaarchitecture
LLambdaarchitecture
defining/ClosingremarksontheLambdaandKappaarchitecture,UnderstandingLambdaarchitecture
LinearRegressionModelsabout/Supervisedandunsupervisedlearning
M
Machine Learning
  about / Displaying upcoming meetups on Google Maps
machine learning pipelines
  building / Building machine learning pipelines
machine learning workflows
  about / Machine learning workflows and data flows
Massive Open Online Courses (MOOCs)
  about / Virtualizing the environment with Vagrant
Matplotlib
  about / Revisiting the data-intensive app architecture
  URL / Revisiting the data-intensive app architecture
Meetup API
  URL / Getting Meetup data
meetups
  mapping / Geo-locating tweets and mapping meetups
MLlib algorithms
  Collaborative filtering / Additional learning algorithms
  feature extraction and transformation / Additional learning algorithms
  optimization / Additional learning algorithms
  Limited-memory BFGS (L-BFGS) / Additional learning algorithms
models
  defining, for processing streams of data / Laying the foundations of streaming architecture
MongoDB
  about / Persistence layer
  setting up / Setting up MongoDB
  server and client, installing / Installing the MongoDB server and client
  server, running / Running the MongoDB server
  Mongo client, running / Running the Mongo client
  PyMongo driver, installing / Installing the PyMongo driver
  Python client, creating for / Creating the Python client for MongoDB
  references / Querying MongoDB from Spark SQL
MongoDB, from Spark SQL
  URL / Querying MongoDB from Spark SQL
Multi-Dimensional Scaling (MDS) algorithm
  about / Applying Scikit-Learn on the Twitter dataset
Mumrah, on GitHub
  URL / Installing and testing Kafka
MySQL
  about / Persistence layer
N
Naive Bayes
  about / Supervised and unsupervised learning
Neo4j
  about / Persistence layer
network_wordcount.py
  URL / Processing live data
O
Odo
  about / Transferring data using Odo
  used, for transferring data / Transferring data using Odo
operations, on RDDs
  transformations / The Resilient Distributed Dataset
  actions / The Resilient Distributed Dataset
P
persistence layer / Persistence layer
PIL (Python Imaging Library)
  about / Setting up wordcloud
PostgreSQL
  about / Persistence layer
Puppet
  about / Infrastructure layer
PySpark
  First App, building with / Building our first app with PySpark
R
RDD (Resilient Distributed Dataset)
  about / The Resilient Distributed Dataset
Resilient Distributed Datasets (RDD)
  about / Spark Streaming inner working
REST (Representational State Transfer)
  about / Connecting to social networks
RPC (Remote Procedure Call)
  about / Laying the foundations of streaming architecture
S
SDK (Software Development Kit)
  about / Installing Java 8
Seaborn
  about / Revisiting the data-intensive app architecture
  URL / Revisiting the data-intensive app architecture
social networks
  connecting to / Connecting to social networks
  Twitter data, obtaining / Getting Twitter data
  GitHub data, obtaining / Getting GitHub data
  Meetup data, obtaining / Getting Meetup data
Spark
  defining / Understanding Spark
  Batch / Understanding Spark
  Streaming / Understanding Spark
  Iterative / Understanding Spark
  Interactive / Understanding Spark
  libraries / Spark libraries
  URL / Installing Spark
  Clustering / Supervised and unsupervised learning
  Dimensionality Reduction / Supervised and unsupervised learning
  Regression and Classification / Supervised and unsupervised learning
  Isotonic Regression / Supervised and unsupervised learning
  MLlib algorithms / Additional learning algorithms
Spark, on EC2
  URL / Deploying apps in Amazon Web Services
SparkContext
  about / Spark Streaming inner working
Spark dataframes
  defining / Understanding Spark dataframes
Spark libraries
  Spark SQL / Spark libraries
  Spark MLlib / Spark libraries
  Spark Streaming / Spark libraries
  Spark GraphX / Spark libraries
  PySpark, defining / PySpark in action
  RDD (Resilient Distributed Dataset) / The Resilient Distributed Dataset
Spark MLlib
  contextualizing, in app architecture / Contextualizing Spark MLlib in the app architecture
  data types / Spark MLlib data types
Spark MLlib algorithms
  classifying / Classifying Spark MLlib algorithms
  supervised learning / Supervised and unsupervised learning
  unsupervised learning / Supervised and unsupervised learning
  additional learning algorithms / Additional learning algorithms
Spark powered environment
  setting up / Setting up the Spark powered environment
  Oracle VirtualBox, setting up with Ubuntu / Setting up an Oracle VirtualBox with Ubuntu
  Anaconda, installing with Python 2.7 / Installing Anaconda with Python 2.7
  Java 8, installing / Installing Java 8
  Spark, installing / Installing Spark
  IPython Notebook, enabling / Enabling IPython Notebook
Spark SQL
  used, for exploring data / Exploring data using Spark SQL
  about / Exploring data using Spark SQL
  CSV files, loading with / Loading and processing CSV files with Spark SQL
  CSV files, processing with / Loading and processing CSV files with Spark SQL
  MongoDB, querying from / Querying MongoDB from Spark SQL
Spark SQL module
  about / Analytics layer
Spark SQL query optimizer
  defining / Understanding the Spark SQL query optimizer
Spark Streaming
  defining / Spark Streaming inner working, Going under the hood of Spark Streaming
  fault tolerance, building in / Building in fault tolerance
Stochastic Gradient Descent
  about / Classifying Spark MLlib algorithms
streaming app
  building / Building a reliable and scalable streaming app
  Kafka, setting up / Setting up Kafka
  Flume, exploring / Exploring flume
  data pipelines, developing with Flume / Developing data pipelines with Flume, Kafka, and Spark
  data pipelines, developing with Kafka / Developing data pipelines with Flume, Kafka, and Spark
  data pipelines, developing with Spark / Developing data pipelines with Flume, Kafka, and Spark
streaming architecture
  about / Laying the foundations of streaming architecture
StreamingContext
  about / Spark Streaming inner working
supervised machine learning workflow
  about / Supervised machine learning workflows
T
TCP sockets
  live data, processing with / Processing live data with TCP sockets, Processing live data
  setting up / Setting up TCP sockets
TF-IDF (Term Frequency-Inverse Document Frequency)
  about / Classifying Spark MLlib algorithms
Trident
  about / Laying the foundations of streaming architecture
tweets
  geo-locating / Geo-locating tweets and mapping meetups, Geo-locating tweets
Twitter
  URL / Getting Twitter data
Twitter API, on dev console
  URL / Getting Twitter data
Twitter data
  manipulating / Manipulating Twitter data in real time
  tweets, processing from Twitter firehose / Processing Tweets in real time from the Twitter firehose
Twitter dataset
  clustering / Clustering the Twitter dataset
  Scikit-Learn, applying on / Applying Scikit-Learn on the Twitter dataset
  dataset, preprocessing / Preprocessing the dataset
  clustering algorithm, running / Running the clustering algorithm
  model and results, evaluating / Evaluating the model and the results
U
Ubuntu 14.04.1 LTS release
  URL / Setting up an Oracle VirtualBox with Ubuntu
unified log
  properties / Understanding Kappa architecture
Unified Log
  properties / Building a reliable and scalable streaming app
unsupervised machine learning workflow
  about / Unsupervised machine learning workflows
V
Vagrant
  about / Infrastructure layer
  environment, virtualizing with / Virtualizing the environment with Vagrant
  reference / Virtualizing the environment with Vagrant
VirtualBox VM
  URL / Setting up an Oracle VirtualBox with Ubuntu
visualization
  data, pre-processing for / Preprocessing the data for visualization
W
wordclouds
  creating / Gauging words, moods, and memes at a glance, Creating wordclouds
  setting up / Setting up wordcloud
  URL / Setting up wordcloud