Spark for Python Developers
Table of Contents
Spark for Python Developers
Credits
About the Author
Acknowledgment
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Setting Up a Spark Virtual Environment
Understanding the architecture of data-intensive applications
Infrastructure layer
Persistence layer
Integration layer
Analytics layer
Engagement layer
Understanding Spark
Spark libraries
PySpark in action
The Resilient Distributed Dataset
Understanding Anaconda
Setting up the Spark powered environment
Setting up an Oracle VirtualBox with Ubuntu
Installing Anaconda with Python 2.7
Installing Java 8
Installing Spark
Enabling IPython Notebook
Building our first app with PySpark
Virtualizing the environment with Vagrant
Moving to the cloud
Deploying apps in Amazon Web Services
Virtualizing the environment with Docker
Summary
2. Building Batch and Streaming Apps with Spark
Architecting data-intensive apps
Processing data at rest
Processing data in motion
Exploring data interactively
Connecting to social networks
Getting Twitter data
Getting GitHub data
Getting Meetup data
Analyzing the data
Discovering the anatomy of tweets
Exploring the GitHub world
Understanding the community through Meetup
Previewing our app
Summary
3. Juggling Data with Spark
Revisiting the data-intensive app architecture
Serializing and deserializing data
Harvesting and storing data
Persisting data in CSV
Persisting data in JSON
Setting up MongoDB
Installing the MongoDB server and client
Running the MongoDB server
Running the Mongo client
Installing the PyMongo driver
Creating the Python client for MongoDB
Harvesting data from Twitter
Exploring data using Blaze
Transferring data using Odo
Exploring data using Spark SQL
Understanding Spark dataframes
Understanding the Spark SQL query optimizer
Loading and processing CSV files with Spark SQL
Querying MongoDB from Spark SQL
Summary
4. Learning from Data Using Spark
Contextualizing Spark MLlib in the app architecture
Classifying Spark MLlib algorithms
Supervised and unsupervised learning
Additional learning algorithms
Spark MLlib data types
Machine learning workflows and data flows
Supervised machine learning workflows
Unsupervised machine learning workflows
Clustering the Twitter dataset
Applying Scikit-Learn on the Twitter dataset
Preprocessing the dataset
Running the clustering algorithm
Evaluating the model and the results
Building machine learning pipelines
Summary
5. Streaming Live Data with Spark
Laying the foundations of streaming architecture
Spark Streaming inner working
Going under the hood of Spark Streaming
Building in fault tolerance
Processing live data with TCP sockets
Setting up TCP sockets
Processing live data
Manipulating Twitter data in real time
Processing Tweets in real time from the Twitter firehose
Building a reliable and scalable streaming app
Setting up Kafka
Installing and testing Kafka
Developing producers
Developing consumers
Developing a Spark Streaming consumer for Kafka
Exploring Flume
Developing data pipelines with Flume, Kafka, and Spark
Closing remarks on the Lambda and Kappa architecture
Understanding Lambda architecture
Understanding Kappa architecture
Summary
6. Visualizing Insights and Trends
Revisiting the data-intensive apps architecture
Preprocessing the data for visualization
Gauging words, moods, and memes at a glance
Setting up word cloud
Creating word clouds
Geo-locating tweets and mapping meetups
Geo-locating tweets
Displaying upcoming meetups on Google Maps
Summary
Index
Spark for Python Developers
Spark for Python Developers
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2015
Production reference: 1171215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-969-6
www.packtpub.com
Credits
Author
Amit Nandi
Reviewers
Manuel Ignacio Franco Galeano
Rahul Kavale
Daniel Lemire
Chet Mancini
Laurence Welch
Commissioning Editor
Amarabha Banerjee
Acquisition Editor
Sonali Vernekar
Content Development Editor
Merint Thomas Mathew
Technical Editor
Naveenkumar Jain
Copy Editor
Roshni Banerjee
Project Coordinator
Suzanne Coutinho
Proofreader
Safis Editing
Indexer
Priya Sane
Graphics
Kirk D'Penha
Production Coordinator
Shantanu N. Zagade
Cover Work
Shantanu N. Zagade
About the Author
Amit Nandi studied physics at the Free University of Brussels in Belgium, where he did his research on computer generated holograms. Computer generated holograms are the key components of an optical computer, which is powered by photons running at the speed of light. He then worked with the university Cray supercomputer, sending batch jobs of programs written in Fortran. This gave him a taste for computing, which kept growing. He has worked extensively on large business reengineering initiatives, using SAP as the main enabler. He focused for the last 15 years on start-ups in the data space, pioneering new areas of the information technology landscape. He is currently focusing on large-scale data-intensive applications as an enterprise architect, data engineer, and software developer. He understands and speaks seven human languages. Although Python is his computer language of choice, he aims to be able to write fluently in seven computer languages too.
Acknowledgment
I want to express my profound gratitude to my parents for their unconditional love and strong support in all my endeavors.
This book arose from an initial discussion with Richard Gall, an acquisition editor at Packt Publishing. Without this initial discussion, this book would never have happened. So, I am grateful to him. The follow-ups on discussions and the contractual terms were agreed with Rebecca Youe. I would like to thank her for her support. I would also like to thank Merint Mathew, a content editor who helped me bring this book to the finish line. I am thankful to Merint for his subtle persistence and tactful support during the write-ups and revisions of this book.
We are standing on the shoulders of giants. I want to acknowledge some of the giants who helped me shape my thinking. I want to recognize the beauty, elegance, and power of Python as envisioned by Guido van Rossum. My respectful gratitude goes to Matei Zaharia and the team at Berkeley AMPLab and Databricks for developing a new approach to computing with Spark and Mesos. Travis Oliphant, Peter Wang, and the team at Continuum.io are doing a tremendous job of keeping Python relevant in a fast-changing computing landscape. Thank you to you all.
About the Reviewers
Manuel Ignacio Franco Galeano is a software developer from Colombia. He holds a computer science degree from the University of Quindío. At the moment of publication of this book, he was studying to get his MSc in computer science from University College Dublin, Ireland. He has a wide range of interests that include distributed systems, machine learning, microservices, and so on. He is looking for a way to apply machine learning techniques to audio data in order to help people learn more about music.
Rahul Kavale works as a software developer at TinyOwl Ltd. He is interested in multiple technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java, and has worked on Apache Spark, Apache Storm, Apache Kafka, Hadoop, and Hive. He enjoys writing Scala. Functional programming and distributed computing are his areas of interest. He has been using Spark since its early stage for varying use cases. He has also helped with the review for the Pragmatic Scala book.
Daniel Lemire has a BSc and MSc in mathematics from the University of Toronto and a PhD in engineering mathematics from the Ecole Polytechnique and the Université de Montréal. He is a professor of computer science at the Université du Québec. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles. He has held competitive research grants for the last 15 years. He has been an expert on several committees with funding agencies (NSERC and FQRNT). He has served as a program committee member on leading computer science conferences (for example, ACM CIKM, ACM WSDM, ACM SIGIR, and ACM RecSys). His open source software has been used by major corporations such as Google and Facebook. His research interests include databases, information retrieval and high-performance programming. He blogs regularly on computer science at http://lemire.me/blog/.
Chet Mancini is a data engineer at Intent Media, Inc in New York, where he works with the data science team to store and process terabytes of web travel data to build predictive models of shopper behavior. He enjoys functional programming, immutable data structures, and machine learning. He writes and speaks on topics surrounding data engineering and information architecture.
He is a contributor to Apache Spark and other libraries in the Spark ecosystem. Chet has a master's degree in computer science from Cornell University.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
Spark for Python Developers aims to combine the elegance and flexibility of Python with the power and versatility of Apache Spark. Spark is written in Scala and runs on the Java virtual machine. It is nevertheless polyglot and offers bindings and APIs for Java, Scala, Python, and R. Python is a well-designed language with an extensive set of specialized libraries. This book looks at PySpark within the PyData ecosystem. Some of the prominent PyData libraries include Pandas, Blaze, Scikit-Learn, Matplotlib, Seaborn, and Bokeh. These libraries are open source. They are developed, used, and maintained by the data scientist and Python developers community. PySpark integrates well with the PyData ecosystem, as endorsed by the Anaconda Python distribution. The book puts forward a journey to build data-intensive apps along with an architectural blueprint that covers the following steps: first, set up the base infrastructure with Spark. Second, acquire, collect, process, and store the data. Third, gain insights from the collected data. Fourth, stream live data and process it in real time. Finally, visualize the information.
The objective of the book is to learn about PySpark and PyData libraries by building apps that analyze the Spark community's interactions on social networks. The focus is on Twitter data.
What this book covers
Chapter 1, Setting Up a Spark Virtual Environment, covers how to create a segregated virtual machine as our sandbox or development environment to experiment with Spark and PyData libraries. It covers how to install Spark and the Python Anaconda distribution, which includes PyData libraries. Along the way, we explain the key Spark concepts, the Python Anaconda ecosystem, and build a Spark word count app.
Chapter 2, Building Batch and Streaming Apps with Spark, lays the foundation of the Data Intensive Apps Architecture. It describes the five layers of the apps architecture blueprint: infrastructure, persistence, integration, analytics, and engagement. We establish API connections with three social networks: Twitter, GitHub, and Meetup. This chapter provides the tools to connect to these three nontrivial APIs so that you can create your own data mashups at a later stage.
Chapter 3, Juggling Data with Spark, covers how to harvest data from Twitter and process it using Pandas, Blaze, and Spark SQL with their respective implementations of the dataframe data structure. We proceed with further investigations and techniques using Spark SQL, leveraging the Spark dataframe data structure.
Chapter 4, Learning from Data Using Spark, gives an overview of the ever-expanding library of algorithms of Spark MLlib. It covers supervised and unsupervised learning, recommender systems, optimization, and feature extraction algorithms. We put the Twitter harvested dataset through Python Scikit-Learn and Spark MLlib K-means clustering in order to segregate the Apache Spark relevant tweets.
Chapter 5, Streaming Live Data with Spark, lays down the foundation of streaming architecture apps and describes their challenges, constraints, and benefits. We illustrate the streaming concepts with TCP sockets, followed by live tweet ingestion and processing directly from the Twitter firehose. We also describe Flume, a reliable, flexible, and scalable data ingestion and transport pipeline system. The combination of Flume, Kafka, and Spark delivers unparalleled robustness, speed, and agility in an ever-changing landscape. We end the chapter with some remarks and observations on two streaming architectural paradigms, the Lambda and Kappa architectures.
Chapter 6, Visualizing Insights and Trends, focuses on a few key visualization techniques. It covers how to build word clouds and expose their intuitive power to reveal a lot of the key words, moods, and memes carried through thousands of tweets. We then focus on interactive mapping visualizations using Bokeh. We build a world map from the ground up and create a scatter plot of critical tweets. Our final visualization is to overlay an actual Google map of London, highlighting upcoming meetups and their respective topics.
What you need for this book
You need inquisitiveness, perseverance, and passion for data, software engineering, application architecture and scalability, and beautiful succinct visualizations. The scope is broad and wide.
You need a good understanding of Python or a similar language with object-oriented and functional programming capabilities. Preliminary experience of data wrangling with Python, R, or any similar tool is helpful.
You need to appreciate how to conceive, build, and scale data applications.
Who this book is for
The target audience includes the following:
Data scientists are the primary interested parties. This book will help you unleash the power of Spark and leverage your Python, R, and machine learning background.
Software developers with a focus on Python will readily expand their skills to create data-intensive apps using Spark as a processing engine and Python visualization libraries and web frameworks.
Data architects who can create rapid data pipelines and build the famous Lambda architecture that encompasses batch and streaming processing to render insights on data in real time, using the Spark and Python rich ecosystem, will also benefit from this book.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Launch PySpark with IPYNB in directory examples/AN_Spark where the Jupyter or IPython Notebooks are stored".
A block of code is set as follows:
# Word count on 1st Chapter of the Book using PySpark
# import regex module
import re
# import add from operator module
from operator import add
# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py20150315')
Any command-line input or output is written as follows:
# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "After installing VirtualBox, let's open the Oracle VM VirtualBox Manager and click the New button."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Setting Up a Spark Virtual Environment
In this chapter, we will build an isolated virtual environment for development purposes. The environment will be powered by Spark and the PyData libraries provided by the Python Anaconda distribution. These libraries include Pandas, Scikit-Learn, Blaze, Matplotlib, Seaborn, and Bokeh. We will perform the following activities:
Setting up the development environment using the Anaconda Python distribution. This will include enabling the IPython Notebook environment powered by PySpark for our data exploration tasks.
Installing and enabling Spark, and the PyData libraries such as Pandas, Scikit-Learn, Blaze, Matplotlib, and Bokeh.
Building a word count example app to ensure that everything is working fine.
The last decade has seen the rise and dominance of data-driven behemoths such as Amazon, Google, Twitter, LinkedIn, and Facebook. These corporations, by seeding, sharing, or disclosing their infrastructure concepts, software practices, and data processing frameworks, have fostered a vibrant open source software community. This has transformed the enterprise technology, systems, and software architecture.
This includes new infrastructure and DevOps (short for development and operations) concepts, leveraging virtualization, cloud technology, and software-defined networks.
To process petabytes of data, Hadoop was developed and open sourced, taking its inspiration from the Google File System (GFS) and the adjoining distributed computing framework, MapReduce. Overcoming the complexities of scaling while keeping costs under control has also led to a proliferation of new data stores. Examples of recent database technology include Cassandra, a columnar database; MongoDB, a document database; and Neo4J, a graph database.
Hadoop, thanks to its ability to process huge datasets, has fostered a vast ecosystem to query data more iteratively and interactively with Pig, Hive, Impala, and Tez. Hadoop is cumbersome as it operates only in batch mode using MapReduce. Spark is creating a revolution in the analytics and data processing realm by targeting the shortcomings of disk input-output and bandwidth-intensive MapReduce jobs.
Spark is written in Scala, and therefore integrates natively with the Java Virtual Machine (JVM) powered ecosystem. Spark had early on provided Python API and bindings by enabling PySpark. The Spark architecture and ecosystem is inherently polyglot, with an obvious strong presence of Java-led systems.
This book will focus on PySpark and the PyData ecosystem. Python is one of the preferred languages in the academic and scientific community for data-intensive processing. Python has developed a rich ecosystem of libraries and tools in data manipulation with Pandas and Blaze, in Machine Learning with Scikit-Learn, and in data visualization with Matplotlib, Seaborn, and Bokeh. Hence, the aim of this book is to build an end-to-end architecture for data-intensive applications powered by Spark and Python. In order to put these concepts into practice, we will analyze social networks such as Twitter, GitHub, and Meetup. We will focus on the activities and social interactions of Spark and the Open Source Software community by tapping into GitHub, Twitter, and Meetup.
Building data-intensive applications requires highly scalable infrastructure, polyglot storage, seamless data integration, multiparadigm analytics processing, and efficient visualization. The following paragraph describes the data-intensive app architecture blueprint that we will adopt throughout the book. It is the backbone of the book. We will discover Spark in the context of the broader PyData ecosystem.
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Understanding the architecture of data-intensive applications
In order to understand the architecture of data-intensive applications, the following conceptual framework is used. This architecture is designed on the following five layers:
Infrastructure layer
Persistence layer
Integration layer
Analytics layer
Engagement layer
The following screenshot depicts the five layers of the Data Intensive App Framework:
From the bottom up, let's go through the layers and their main purpose.
Infrastructure layer
The infrastructure layer is primarily concerned with virtualization, scalability, and continuous integration. In practical terms, and in terms of virtualization, we will go through building our own development environment in a VirtualBox virtual machine powered by Spark and the Anaconda distribution of Python. If we wish to scale from there, we can create a similar environment in the cloud. The practice of creating a segregated development environment and moving into test and production deployment can be automated and can be part of a continuous integration cycle powered by DevOps tools such as Vagrant, Chef, Puppet, and Docker. Docker is a very popular open source project that eases the installation and deployment of new environments. The book will be limited to building the virtual machine using VirtualBox. From a data-intensive app architecture point of view, we are describing the essential steps of the infrastructure layer by mentioning scalability and continuous integration beyond just virtualization.
Persistence layer
The persistence layer manages the various repositories in accordance with data needs and shapes. It ensures the setup and management of the polyglot data stores. It includes relational database management systems such as MySQL and PostgreSQL; key-value data stores such as Hadoop, Riak, and Redis; columnar databases such as HBase and Cassandra; document databases such as MongoDB and Couchbase; and graph databases such as Neo4j. The persistence layer manages various filesystems such as Hadoop's HDFS. It interacts with various storage systems from native hard drives to Amazon S3. It manages various file storage formats such as csv, json, and parquet, which is a column-oriented format.
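The file format choice can be made concrete with a minimal sketch in plain Python (standard library only; the records and field names below are invented for illustration), showing how the same small dataset persists in CSV versus JSON:

```python
import csv
import io
import json

# A couple of hypothetical harvested records
tweets = [
    {'id': '1', 'user': 'an', 'text': 'Spark is fast'},
    {'id': '2', 'user': 'py', 'text': 'PySpark in action'},
]

# CSV: flat rows under a header line, ideal for tabular data
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=['id', 'user', 'text'])
writer.writeheader()
writer.writerows(tweets)

# JSON: the nested structure of each record survives as-is
json_text = json.dumps(tweets)

# Both serializations round-trip back to the original records
assert list(csv.DictReader(io.StringIO(csv_buf.getvalue()))) == tweets
assert json.loads(json_text) == tweets
```

Parquet, being a binary columnar format, requires a third-party library and is therefore left out of this sketch.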
Integration layer
The integration layer focuses on data acquisition, transformation, quality, persistence, consumption, and governance. It is essentially driven by the following five Cs: connect, collect, correct, compose, and consume.
The five steps describe the lifecycle of data. They are focused on how to acquire the dataset of interest, explore it, iteratively refine and enrich the collected information, and get it ready for consumption. So, the steps perform the following operations:
Connect: Targets the best way to acquire data from the various data sources, APIs offered by these sources, the input format, input schemas if they exist, the rate of data collection, and limitations from providers
Correct: Focuses on transforming data for further processing and also ensures that the quality and consistency of the data received are maintained
Collect: Looks at which data to store where and in what format, to ease data composition and consumption at later stages
Compose: Concentrates its attention on how to mash up the various datasets collected, and enrich the information in order to build a compelling data-driven product
Consume: Takes care of data provisioning and rendering and how the right data reaches the right individual at the right time
Control: This sixth additional step will sooner or later be required as the data, the organization, and the participants grow and it is about ensuring data governance
The following diagram depicts the iterative process of data acquisition and refinement for consumption:
Analytics layer
The analytics layer is where Spark processes data with the various models, algorithms, and machine learning pipelines in order to derive insights. For our purpose, in this book, the analytics layer is powered by Spark. We will delve deeper in subsequent chapters into the merits of Spark. In a nutshell, what makes it so powerful is that it allows multiple paradigms of analytics processing in a single unified platform. It allows batch, streaming, and interactive analytics. Batch processing on large datasets with longer latency periods allows us to extract patterns and insights that can feed into real-time events in streaming mode. Interactive and iterative analytics are more suited for data exploration. Spark offers bindings and APIs in Python and R. With its Spark SQL module and the Spark Dataframe, it offers a very familiar analytics interface.
Engagement layer
The engagement layer interacts with the end user and provides dashboards, interactive visualizations, and alerts. We will focus here on the tools provided by the PyData ecosystem such as Matplotlib, Seaborn, and Bokeh.
Understanding Spark
Hadoop scales horizontally as the data grows. Hadoop runs on commodity hardware, so it is cost-effective. Intensive data applications are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. Hadoop is the first open source implementation of map-reduce. Hadoop relies on a distributed framework for storage called HDFS (Hadoop Distributed File System). Hadoop runs map-reduce tasks in batch jobs. Hadoop requires persisting the data to disk at each map, shuffle, and reduce process step. The overhead and the latency of such batch jobs adversely impact the performance.
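The map, shuffle, and reduce steps just described can be sketched in plain Python (no Hadoop needed; the input lines are made up for illustration). In Hadoop, each of the three phases below would persist its output to disk before the next phase starts, which is exactly the overhead being discussed:

```python
from collections import defaultdict

lines = ['spark is fast', 'hadoop is batch', 'spark is in memory']

# Map phase: emit a (word, 1) pair for every word of every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key so each word lands in one bucket
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate each bucket into a final count
counts = {word: sum(ones) for word, ones in groups.items()}

print(counts['spark'], counts['is'])  # 2 3
```

In a real cluster, the map and reduce phases also run in parallel across machines, and the shuffle moves data over the network between them.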
Spark is a fast, distributed general analytics computing engine for large-scale data processing. The major breakthrough from Hadoop is that Spark allows data sharing between processing steps through in-memory processing of data pipelines.
Spark is unique in that it allows four different styles of data analysis and processing. Spark can be used in:
Batch: This mode is used for manipulating large datasets, typically performing large map-reduce jobs
Streaming: This mode is used to process incoming information in near real time
Iterative: This mode is for machine learning algorithms such as a gradient descent where the data is accessed repetitively in order to reach convergence
Interactive: This mode is used for data exploration as large chunks of data are in memory and due to the very quick response time of Spark
The following figure highlights the preceding four processing styles:
Spark operates in three modes: one standalone mode on a single machine, and two distributed modes on a cluster of machines, either on Yarn, the Hadoop distributed resource manager, or on Mesos, the open source cluster manager developed at Berkeley concurrently with Spark:
Spark offers a polyglot interface in Scala, Java, Python, and R.
Spark libraries
Spark comes with batteries included, with some powerful libraries:
Spark SQL: This provides the SQL-like ability to interrogate structured data and interactively explore large datasets
Spark MLlib: This provides major algorithms and a pipeline framework for machine learning
Spark Streaming: This is for near real-time analysis of data using micro batches and sliding windows on incoming streams of data
Spark GraphX: This is for graph processing and computation on complex connected entities and relationships
PySpark in action
Spark is written in Scala. The whole Spark ecosystem naturally leverages the JVM environment and capitalizes on HDFS natively. Hadoop HDFS is one of the many data stores supported by Spark. Spark is agnostic and from the beginning interacted with multiple data sources, types, and formats.
PySpark is not a transcribed version of Spark on a Java-enabled dialect of Python such as Jython. PySpark provides integrated API bindings around Spark and enables full usage of the Python ecosystem within all the nodes of the cluster with the pickle Python serialization and, more importantly, supplies access to the rich ecosystem of Python's machine learning libraries such as Scikit-Learn or data processing such as Pandas.
When we initialize a Spark program, the first thing a Spark program must do is to create a SparkContext object. It tells Spark how to access the cluster. The Python program creates a PySparkContext. Py4J is the gateway that binds the Python program to the Spark JVM SparkContext. The JVM SparkContext serializes the application codes and the closures and sends them to the cluster for execution. The cluster manager allocates resources and schedules, and ships the closures to the Spark workers in the cluster who activate Python virtual machines as required. In each machine, the Spark Worker is managed by an executor that controls computation, storage, and cache.
Here's an example of how the Spark driver manages both the PySpark context and the Spark context with its local filesystems and its interactions with the Spark worker through the cluster manager:
The Resilient Distributed Dataset
Spark applications consist of a driver program that runs the user's main function, creates distributed datasets on the cluster, and executes various parallel operations (transformations and actions) on those datasets.
Spark applications are run as an independent set of processes, coordinated by a SparkContext in a driver program.
The SparkContext will be allocated system resources (machines, memory, CPU) from the Cluster manager.
The SparkContext manages executors who manage workers in the cluster. The driver program has Spark jobs that need to run. The jobs are split into tasks submitted to the executor for completion. The executor takes care of computation, storage, and caching in each machine.
The key building block in Spark is the RDD (Resilient Distributed Dataset). A dataset is a collection of elements. Distributed means the dataset can be on any node in the cluster. Resilient means that the dataset could get lost or partially lost without major harm to the computation in progress as Spark will re-compute from the data lineage in memory, also known as the DAG (short for Directed Acyclic Graph) of operations. Basically, Spark will snapshot in memory a state of the RDD in the cache. If one of the computing machines crashes during operation, Spark rebuilds the RDDs from the cached RDD and the DAG of operations. RDDs recover from node failure.
There are two types of operation on RDDs:
Transformations: A transformation takes an existing RDD and leads to a pointer of a new transformed RDD. An RDD is immutable. Once created, it cannot be changed. Each transformation creates a new RDD. Transformations are lazily evaluated. Transformations are executed only when an action occurs. In the case of failure, the data lineage of transformations rebuilds the RDD.
Actions: An action on an RDD triggers a Spark job and yields a value. An action operation causes Spark to execute the (lazy) transformation operations that are required to compute the RDD returned by the action. The action results in a DAG of operations. The DAG is compiled into stages where each stage is executed as a series of tasks. A task is a fundamental unit of work.
Here’ssomeusefulinformationonRDDs:
RDDsarecreatedfromadatasourcesuchasanHDFSfileoraDBquery.TherearethreewaystocreateanRDD:
ReadingfromadatastoreTransforminganexistingRDDUsinganin-memorycollection
RDDsaretransformedwithfunctionssuchasmaporfilter,whichyieldnewRDDs.Anactionsuchasfirst,take,collect,orcountonanRDDwilldelivertheresultsintotheSparkdriver.TheSparkdriveristheclientthroughwhichtheuserinteractswiththeSparkcluster.
ThefollowingdiagramillustratestheRDDtransformationandaction:
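The lazy-transformation versus eager-action distinction can be mimicked in plain Python with generators (an analogy only, not Spark code; the data and operations are arbitrary): nothing below is computed until the final "action" forces the pipeline.

```python
data = range(1, 6)  # stand-in for an RDD holding 1..5

# 'Transformations' only build a recipe; no element is touched yet
squared = (x * x for x in data)             # analogous to rdd.map(...)
evens = (x for x in squared if x % 2 == 0)  # analogous to rdd.filter(...)

# The 'action' triggers evaluation of the whole chained pipeline
result = list(evens)  # analogous to rdd.collect()
print(result)  # [4, 16]
```

In real PySpark, the equivalent pipeline would read sc.parallelize(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect(), with the added benefits of partitioning across the cluster and lineage-based recovery.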
Understanding Anaconda
Anaconda is a widely used free Python distribution maintained by Continuum (https://www.continuum.io/). We will use the prevailing software stack provided by Anaconda to generate our apps. In this book, we will use PySpark and the PyData ecosystem. The PyData ecosystem is promoted, supported, and maintained by Continuum and powered by the Anaconda Python distribution. The Anaconda Python distribution essentially saves time and aggravation in the installation of the Python environment; we will use it in conjunction with Spark. Anaconda has its own package management that supplements the traditional pip install and easy_install. Anaconda comes with batteries included, namely some of the most important packages such as Pandas, Scikit-Learn, Blaze, Matplotlib, and Bokeh. An upgrade to any of the installed libraries is a simple command at the console:
$ conda update
A list of installed libraries in our environment can be obtained with the command:
$ conda list
The key components of the stack are as follows:
Anaconda: This is a free Python distribution with almost 200 Python packages for science, math, engineering, and data analysis.
Conda: This is a package manager that takes care of all the dependencies of installing a complex software stack. This is not restricted to Python and manages the install process for R and other languages.
Numba: This provides the power to speed up code in Python with high-performance functions and just-in-time compilation.
Blaze: This enables large scale data analytics by offering a uniform and adaptable interface to access a variety of data providers, which include streaming Python, Pandas, SQLAlchemy, and Spark.
Bokeh: This provides interactive data visualizations for large and streaming datasets.
Wakari: This allows us to share and deploy IPython Notebooks and other apps on a hosted environment.
The following figure shows the components of the Anaconda stack:
Setting up the Spark powered environment

In this section, we will learn to set up Spark:

- Create a segregated development environment in a virtual machine running on Ubuntu 14.04, so it does not interfere with any existing system.
- Install Spark 1.3.0 with its dependencies.
- Install the Anaconda Python 2.7 environment with all the required libraries such as Pandas, Scikit-Learn, Blaze, and Bokeh, and enable PySpark, so it can be accessed through IPython Notebooks.
- Set up the backend or data stores of our environment. We will use MySQL as the relational database, MongoDB as the document store, and Cassandra as the columnar database.

Each storage backend serves a specific purpose depending on the nature of the data to be handled. The MySQL RDBMS is used for standard tabular processed information that can be easily queried using SQL. As we will be processing a lot of JSON-type data from various APIs, the easiest way to store them is in a document store. For real-time and time-series-related information, Cassandra is best suited as a columnar database.

The following diagram gives a view of the environment we will build and use throughout the book:
Setting up an Oracle VirtualBox with Ubuntu

Setting up a clean new VirtualBox environment on Ubuntu 14.04 is the safest way to create a development environment that does not conflict with existing libraries and can be later replicated in the cloud using a similar list of commands.

In order to set up an environment with Anaconda and Spark, we will create a VirtualBox virtual machine running Ubuntu 14.04.

Let's go through the steps of using VirtualBox with Ubuntu:

1. Oracle VirtualBox VM is free and can be downloaded from https://www.virtualbox.org/wiki/Downloads. The installation is pretty straightforward.
2. After installing VirtualBox, let's open the Oracle VM VirtualBox Manager and click the New button.
3. We'll give the new VM a name, and select Type Linux and Version Ubuntu (64 bit).
4. You need to download the ISO from the Ubuntu website and allocate sufficient RAM (4 GB recommended) and disk space (20 GB recommended). We will use the Ubuntu 14.04.1 LTS release, which is found here: http://www.ubuntu.com/download/desktop.
5. Once the installation is completed, it is advisable to install the VirtualBox Guest Additions by going to (from the VirtualBox menu, with the new VM running) Devices | Insert Guest Additions CD image. Failing to install the Guest Additions on a Windows host gives a very limited user interface with reduced window sizes.
6. Once the additional installation completes, reboot the VM, and it will be ready to use. It is helpful to enable the shared clipboard by selecting the VM and clicking Settings, then go to General | Advanced | Shared Clipboard and click on Bidirectional.
Installing Anaconda with Python 2.7

PySpark currently runs only on Python 2.7. (There are requests from the community to upgrade to Python 3.3.) To install Anaconda, follow these steps:

1. Download the Anaconda Installer for Linux 64-bit Python 2.7 from http://continuum.io/downloads#all.
2. After downloading the Anaconda installer, open a terminal and navigate to the directory or folder where the installer has been saved. From here, run the following command, replacing the 2.x.x in the command with the version number of the downloaded installer file:

# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh

3. After accepting the license terms, you will be asked to specify the install location (which defaults to ~/anaconda).
4. After the self-extraction is finished, you should add the anaconda binary directory to your PATH environment variable:

# add anaconda to PATH
export PATH=~/anaconda/bin:$PATH
Installing Java 8

Spark runs on the JVM and requires the JDK (short for Java Development Kit) and not just the JRE (short for Java Runtime Environment), as we will build apps with Spark. The recommended version is Java Version 7 or higher. Java 8 is the most suitable, as it includes many of the functional programming techniques available with Scala and Python.

To install Java 8, follow these steps:

1. Install Oracle Java 8 using the following commands:

# install oracle java 8
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

2. Set the JAVA_HOME environment variable and ensure that the Java program is on your PATH.
3. Check that JAVA_HOME is properly set:

# check JAVA_HOME
$ echo $JAVA_HOME
Installing Spark

Head over to the Spark download page at http://spark.apache.org/downloads.html.

The Spark download page offers the possibility to download earlier versions of Spark and different package and download types. We will select the latest release, pre-built for Hadoop 2.6 and later. The easiest way to install Spark is to use a Spark package prebuilt for Hadoop 2.6 and later, rather than build it from source. Move the file to the directory ~/spark under the home directory.

Download the latest release of Spark, Spark 1.5.2, released on November 9, 2015:

1. Select Spark release 1.5.2 (Nov 09 2015).
2. Choose the package type Pre-built for Hadoop 2.6 and later.
3. Choose the download type Direct Download.
4. Download Spark: spark-1.5.2-bin-hadoop2.6.tgz.
5. Verify this release using the 1.5.2 signatures and checksums.

This can also be accomplished by running:

# download spark
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz

Next, we'll extract the files and clean up:

# extract, clean up, move the unzipped files under the spark directory
$ tar -xf spark-1.5.2-bin-hadoop2.6.tgz
$ rm spark-1.5.2-bin-hadoop2.6.tgz
$ sudo mv spark-* spark
Now, we can run the Spark Python interpreter with:

# run spark
$ cd ~/spark
./bin/pyspark

You should see something like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc.
>>>

The interpreter will have already provided us with a Spark context object, sc, which we can see by running:

>>> print(sc)
<pyspark.context.SparkContext object at 0x7f34b61c4e50>
Enabling IPython Notebook

We will work with IPython Notebook for a friendlier user experience than the console.

You can launch IPython Notebook by using the following command:

$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

Launch PySpark with IPYNB in the directory examples/AN_Spark where Jupyter or IPython Notebooks are stored:

# cd to /home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark
# launch command using python 2.7 and the spark-csv package:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0

# launch command using python 3.4 and the spark-csv package:
$ IPYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Building our first app with PySpark

We are now ready to check that everything is working fine. The obligatory word count will be put to the test: we will process a word count on the first chapter of this book.

The code we will be running is listed here:
# Word count on 1st Chapter of the Book using PySpark

# import regex module
import re
# import add from operator module
from operator import add

# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# Get words from the input file
words = file_in.flatMap(lambda line: re.split('\W+', line.lower().strip()))
# words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)

# set count 1 per word
words = words.map(lambda w: (w, 1))
# reduce phase - sum count all the words
words = words.reduceByKey(add)
In this program, we are first reading the file from the directory /home/an/Documents/A00_Documents/Spark4Py20150315 into file_in.

We are then introspecting the file by counting the number of lines and the number of characters per line.

We are splitting the input file into words and getting them in lowercase. For our word count purpose, we are choosing words longer than three characters in order to avoid shorter and much more frequent words such as the, and, and for, which would skew the count in their favor. Generally, they are considered stopwords and should be filtered out in any language processing task.

At this stage, we are getting ready for the MapReduce steps. To each word, we map a value of 1 and reduce by summing the counts for each unique word.
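The same pipeline can be traced without a Spark cluster. This plain-Python sketch (running on an invented two-line sample rather than the chapter file) mirrors each RDD operation step by step:

```python
import re

text = ["Spark is fast", "Spark runs everywhere fast"]  # stand-in for sc.textFile(...)

# flatMap: split each line into lowercase words
words = [w for line in text for w in re.split(r'\W+', line.lower().strip())]
# filter: keep words longer than 3 characters
words = [w for w in words if len(w) > 3]
# map: emit (word, 1) pairs
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n
# swap to (count, word) and sort descending, as sortByKey(False) will do below
top = sorted(((n, w) for w, n in counts.items()), reverse=True)
print(top)  # [(2, 'spark'), (2, 'fast'), (1, 'runs'), (1, 'everywhere')]
```

Spark distributes exactly these steps across the cluster; the local version only makes the dataflow easy to follow.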
Here are illustrations of the code in the IPython Notebook. The first 10 cells are preprocessing the word count on the dataset, which is retrieved from the local file directory.

Swap the word count tuples into the format (count, word) in order to sort by count, which is now the primary key of the tuple:

# create tuple (count, word) and sort in descending
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)
# take top 20 words by frequency
words.take(20)

In order to display our result, we are creating the tuple (count, word) and displaying the top 20 most frequently used words in descending order:
Let's create a histogram function:

# create function for histogram of most frequent words
%matplotlib inline
import matplotlib.pyplot as plt
#
def histogram(words):
    count = map(lambda x: x[1], words)
    word = map(lambda x: x[0], words)
    plt.barh(range(len(count)), count, color='grey')
    plt.yticks(range(len(count)), word)

# Change order of tuple (word, count) from (count, word)
words = words.map(lambda x: (x[1], x[0]))
words.take(25)

# display histogram
histogram(words.take(25))

Here, we visualize the most frequent words by plotting them in a bar chart. We have to first swap the tuple from the original (count, word) to (word, count):

So here you have it: the most frequent words used in the first chapter are Spark, followed by Data and Anaconda.
Virtualizing the environment with Vagrant

In order to create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built from a Vagrantfile.

We will point to the Massive Open Online Courses (MOOCs) delivered by UC Berkeley and Databricks:

- Introduction to Big Data with Apache Spark, Professor Anthony D. Joseph, can be found at https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
- Scalable Machine Learning, Professor Ameet Talwalkar, can be found at https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

The course labs were executed on IPython Notebooks powered by PySpark. They can be found in the following GitHub repository: https://github.com/spark-mooc/mooc-setup/.

Once you have set up Vagrant on your machine, follow these instructions to get started: https://docs.vagrantup.com/v2/getting-started/index.html.

Clone the spark-mooc/mooc-setup/ GitHub repository in your work directory and launch the command $ vagrant up within the cloned directory:

Be aware that the version of Spark may be outdated, as the Vagrantfile may not be up to date.
You will see an output similar to this:

C:\Programs\spark\edx1001\mooc-setup-master>vagrant up
Bringing machine 'sparkvm' up with 'virtualbox' provider...
==> sparkvm: Checking if box 'sparkmooc/base' is up to date...
==> sparkvm: Clearing any previously set forwarded ports...
==> sparkvm: Clearing any previously set network interfaces...
==> sparkvm: Preparing network interfaces based on configuration...
    sparkvm: Adapter 1: nat
==> sparkvm: Forwarding ports...
    sparkvm: 8001 => 8001 (adapter 1)
    sparkvm: 4040 => 4040 (adapter 1)
    sparkvm: 22 => 2222 (adapter 1)
==> sparkvm: Booting VM...
==> sparkvm: Waiting for machine to boot. This may take a few minutes...
    sparkvm: SSH address: 127.0.0.1:2222
    sparkvm: SSH username: vagrant
    sparkvm: SSH auth method: private key
    sparkvm: Warning: Connection timeout. Retrying...
    sparkvm: Warning: Remote connection disconnect. Retrying...
==> sparkvm: Machine booted and ready!
==> sparkvm: Checking for guest additions in VM...
==> sparkvm: Setting hostname...
==> sparkvm: Mounting shared folders...
    sparkvm: /vagrant => C:/Programs/spark/edx1001/mooc-setup-master
==> sparkvm: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> sparkvm: flag to force provisioning. Provisioners marked to run always will still run.

C:\Programs\spark\edx1001\mooc-setup-master>

This will launch the IPython Notebooks powered by PySpark on localhost:8001:
Moving to the cloud

As we are dealing with distributed systems, an environment on a virtual machine running on a single laptop is limited for exploration and learning. We can move to the cloud in order to experience the power and scalability of the Spark distributed framework.

Deploying apps in Amazon Web Services

Once we are ready to scale our apps, we can migrate our development environment to Amazon Web Services (AWS).

How to run Spark on EC2 is clearly described in the following page: https://spark.apache.org/docs/latest/ec2-scripts.html.
We emphasize five key steps in setting up the AWS Spark environment:

1. Create an AWS EC2 key pair via the AWS console at http://aws.amazon.com/console/.
2. Export your key pair to your environment:

export AWS_ACCESS_KEY_ID=accesskeyid
export AWS_SECRET_ACCESS_KEY=secretaccesskey

3. Launch your cluster:

~$ cd $SPARK_HOME/ec2
ec2$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

4. SSH into a cluster to run Spark jobs:

ec2$ ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>

5. Destroy your cluster after usage:

ec2$ ./spark-ec2 destroy <cluster-name>
Virtualizing the environment with Docker

In order to create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built in Docker containers.

We wish to capitalize on Docker's two main functions:

- Creating isolated containers that can be easily deployed on different operating systems or in the cloud.
- Allowing easy sharing of the development environment image with all its dependencies using Docker Hub. Docker Hub is similar to GitHub. It allows easy cloning and version control. The snapshot image of the configured environment can be the baseline for further enhancements.

The following diagram illustrates a Docker-enabled environment with Spark, Anaconda, and the database server and their respective data volumes.

Docker offers the ability to clone and deploy an environment from a Dockerfile.

You can find an example Dockerfile with a PySpark and Anaconda setup at the following address: https://hub.docker.com/r/thisgokeboysef/pyspark-docker/~/dockerfile/.

Install Docker as per the instructions provided at the following links:

- http://docs.docker.com/mac/started/ if you are on Mac OS X
- http://docs.docker.com/linux/started/ if you are on Linux
- http://docs.docker.com/windows/started/ if you are on Windows

Install the Docker container with the Dockerfile provided earlier with the following command:

$ docker pull thisgokeboysef/pyspark-docker

Other great sources of information on how to dockerize your environment can be seen at Lab41. The GitHub repository contains the necessary code:

https://github.com/Lab41/ipython-spark-docker

The supporting blog post is rich in information on the thought processes involved in building the Docker environment: http://lab41.github.io/blog/2015/04/13/ipython-on-spark-on-docker/.
Summary

We set the context of building data-intensive apps by describing the overall architecture structured around the infrastructure, persistence, integration, analytics, and engagement layers. We also discussed Spark and Anaconda with their respective building blocks. We set up an environment in a VirtualBox with Anaconda and Spark and demonstrated a word count app using the text content of the first chapter as input.

In the next chapter, we will delve more deeply into the architecture blueprint for data-intensive apps and tap into the Twitter, GitHub, and Meetup APIs to get a feel of the data we will be mining with Spark.
Chapter 2. Building Batch and Streaming Apps with Spark

The objective of the book is to teach you about PySpark and the PyData libraries by building an app that analyzes the Spark community's interactions on social networks. We will gather information on Apache Spark from GitHub, check the relevant tweets on Twitter, and get a feel for the buzz around Spark in the broader open source software communities using Meetup.

In this chapter, we will outline the various sources of data and information. We will get an understanding of their structure. We will outline the data processing pipeline, from collection to batch and streaming processing.

In this section, we will cover the following points:

- Outline data processing pipelines from collection to batch and stream processing, effectively depicting the architecture of the app we are planning to build.
- Check out the various data sources (GitHub, Twitter, and Meetup), their data structure (JSON, structured information, unstructured text, geo-location, time series data, and so on), and their complexities. We also discuss the tools to connect to three different APIs, so you can build your own data mashups. The book will focus on Twitter in the following chapters.
Architecting data-intensive apps

We defined the data-intensive app framework architecture blueprint in the previous chapter. Let's put back in context the various software components we are going to use throughout the book in our original framework. Here's an illustration of the various components of software mapped in the data-intensive architecture framework:

Spark is an extremely efficient, distributed computing framework. In order to exploit its full power, we need to architect our solution accordingly. For performance reasons, the overall solution needs to also be aware of its usage in terms of CPU, storage, and network.

These imperatives drive the architecture of our solution:

- Latency: This architecture combines slow and fast processing. Slow processing is done on historical data in batch mode. This is also called data at rest. This phase builds precomputed models and data patterns that will be used by the fast processing arm once live continuous data is fed into the system. Fast processing of data or real-time analysis of streaming data refers to data in motion. Data at rest is essentially processing data in batch mode with a longer latency. Data in motion refers to the streaming computation of data ingested in real time.
- Scalability: Spark is natively linearly scalable through its distributed in-memory computing framework. Databases and data stores interacting with Spark need to also be able to scale linearly as data volume grows.
- Fault tolerance: When a failure occurs due to hardware, software, or network reasons, the architecture should be resilient enough and provide availability at all times.
- Flexibility: The data pipelines put in place in this architecture can be adapted and retrofitted very quickly depending on the use case.

Spark is unique as it allows batch processing and streaming analytics on the same unified platform.

We will consider two data processing pipelines:

- The first one handles data at rest and is focused on putting together the pipeline for batch analysis of the data
- The second one, data in motion, targets real-time data ingestion and delivering insights based on precomputed models and data patterns
Processing data at rest

Let's get an understanding of the data at rest or batch processing pipeline. The objective in this pipeline is to ingest the various datasets from Twitter, GitHub, and Meetup; prepare the data for Spark MLlib, the machine learning engine; and derive the base models that will be applied for insight generation in batch mode or in real time.

The following diagram illustrates the data pipeline in order to enable processing data at rest:

Processing data in motion

Processing data in motion introduces a new level of complexity, as we are introducing a new possibility of failure. If we want to scale, we need to consider bringing in distributed message queue systems such as Kafka. We will dedicate a subsequent chapter to understanding streaming analytics.

The following diagram depicts a data pipeline for processing data in motion:
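The decoupling that a distributed message queue such as Kafka provides can be illustrated in miniature with Python's standard queue module (Queue in Python 2). This single-process producer/consumer sketch is only a stand-in for a real broker, with invented event names, but it shows the pattern: the ingesting side and the processing side never talk to each other directly.

```python
import queue
import threading

buffer = queue.Queue()          # stands in for a Kafka topic

def producer():
    # ingest events into the queue, then signal completion with a sentinel
    for event in ['tweet-1', 'tweet-2', 'tweet-3']:
        buffer.put(event)
    buffer.put(None)

consumed = []

def consumer():
    # drain the queue until the sentinel arrives
    while True:
        event = buffer.get()
        if event is None:
            break
        consumed.append(event)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # ['tweet-1', 'tweet-2', 'tweet-3']
```

A real broker adds persistence, partitioning, and replication on top of this buffering role, which is what makes the streaming arm resilient to failures.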
Exploring data interactively

Building a data-intensive app is not as straightforward as exposing a database to a web interface. During the setup of both the data at rest and data in motion processing, we will capitalize on Spark's ability to analyze data interactively and refine the data richness and quality required for the machine learning and streaming activities. Here, we will go through an iterative cycle of data collection, refinement, and investigation in order to get to the dataset of interest for our apps.
Connecting to social networks

Let's delve into the first steps of the data-intensive app architecture's integration layer. We are going to focus on harvesting the data, ensuring its integrity, and preparing for batch and streaming data processing by Spark at the next stage. This phase is described in the five process steps: connect, correct, collect, compose, and consume. These are iterative steps of data exploration that will get us acquainted with the data and help us refine the data structure for further processing.

The following diagram depicts the iterative process of data acquisition and refinement for consumption:

We connect to the social networks of interest: Twitter, GitHub, and Meetup. We will discuss the mode of access to the APIs (short for Application Programming Interfaces) and how to create a RESTful connection with those services while respecting the rate limitation imposed by the social networks. REST (short for Representational State Transfer) is the most widely adopted architectural style on the Internet in order to enable scalable web services. It relies on exchanging messages predominantly in JSON (short for JavaScript Object Notation). RESTful APIs and web services implement the four most prevalent verbs GET, PUT, POST, and DELETE. GET is used to retrieve an element or a collection from a given URI. PUT updates a collection with a new one. POST allows the creation of a new entry, while DELETE eliminates a collection.
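The mechanics of such a RESTful exchange, composing a query string onto a URI and decoding a JSON reply, can be sketched with the standard library alone. The endpoint and payload below are invented for illustration, and no network call is made:

```python
import json
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# compose a GET URI with query parameters, as the API clients below do
params = {'q': 'ApacheSpark', 'count': 10}
url = 'https://api.example.com/search?' + urlencode(sorted(params.items()))
print(url)  # https://api.example.com/search?count=10&q=ApacheSpark

# a REST service typically answers with a JSON message body
reply = '{"statuses": [{"id": 1, "text": "Spark rocks"}]}'
payload = json.loads(reply)
print(payload['statuses'][0]['text'])  # Spark rocks
```

The clients for Twitter, GitHub, and Meetup in the following sections all reduce to this shape: build a parameterized GET request, then walk the decoded JSON tree.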
Getting Twitter data

Twitter allows registered users to access its search and streaming tweet services under an authorization protocol called OAuth, which allows API applications to securely act on a user's behalf. In order to create the connection, the first step is to create an application with Twitter at https://apps.twitter.com/app/new.

Once the application has been created, Twitter will issue the four codes that will allow it to tap into the Twitter hose:

CONSUMER_KEY = 'GetYourKey@Twitter'
CONSUMER_SECRET = 'GetYourKey@Twitter'
OAUTH_TOKEN = 'GetYourToken@Twitter'
OAUTH_TOKEN_SECRET = 'GetYourToken@Twitter'

If you wish to get a feel for the various RESTful queries offered, you can explore the Twitter API on the dev console at https://dev.twitter.com/rest/tools/console:

We will make a programmatic connection on Twitter using the following code, which will activate our OAuth access and allow us to tap into the Twitter API under the rate limitation. In the streaming mode, the limitation is per GET request.
Getting GitHub data

GitHub uses a similar authentication process to Twitter. Head to the developer site and retrieve your credentials after duly registering with GitHub at https://developer.github.com/v3/:

Getting Meetup data

Meetup can be accessed using the token issued in the developer resources to members of Meetup.com. The necessary token or OAuth credentials for Meetup API access can be obtained on their developer's website at https://secure.meetup.com/meetup_api:
Analyzing the data

Let's get a first feel for the data extracted from each of the social networks and get an understanding of the data structure from each of these sources.
Discovering the anatomy of tweets

In this section, we are going to establish connection with the Twitter API. Twitter offers two connection modes: the REST API, which allows us to search historical tweets for a given search term or hashtag, and the streaming API, which delivers real-time tweets under the rate limit in place.

In order to get a better understanding of how to operate with the Twitter API, we will go through the following steps:

1. Install the Twitter Python library.
2. Establish a connection programmatically via OAuth, the authentication required for Twitter.
3. Search for recent tweets for the query Apache Spark and explore the results obtained.
4. Decide on the key attributes of interest and retrieve the information from the JSON output.
Let's go through it step-by-step:

1. Install the Python Twitter library. In order to install it, you need to run pip install twitter from the command line:

$ pip install twitter
2. Create the Python TwitterAPI class and its base methods for authentication, searching, and parsing the results. self.auth gets the credentials from Twitter. It then creates a registered API as self.api. We have implemented two methods: the first one to search Twitter with a given query and the second one to parse the output to retrieve relevant information such as the tweet ID, the tweet text, and the tweet author. The code is as follows:

import twitter
import urlparse
from pprint import pprint as pp

class TwitterAPI(object):
    """
    TwitterAPI class allows the connection to Twitter via OAuth
    once you have registered with Twitter and received the
    necessary credentials
    """

    # initialize and get the twitter credentials
    def __init__(self):
        consumer_key = 'Provide your credentials'
        consumer_secret = 'Provide your credentials'
        access_token = 'Provide your credentials'
        access_secret = 'Provide your credentials'
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_secret = access_secret
        #
        # authenticate credentials with Twitter using OAuth
        self.auth = twitter.oauth.OAuth(access_token, access_secret,
                                        consumer_key, consumer_secret)
        # create registered Twitter API
        self.api = twitter.Twitter(auth=self.auth)

    #
    # search Twitter with query q (i.e. "ApacheSpark") and max. result
    def searchTwitter(self, q, max_res=10, **kwargs):
        search_results = self.api.search.tweets(q=q, count=max_res,
                                                **kwargs)
        statuses = search_results['statuses']
        max_results = min(1000, max_res)
        for _ in range(10):
            try:
                next_results = search_results['search_metadata']['next_results']
            except KeyError as e:
                break
            next_results = urlparse.parse_qsl(next_results[1:])
            kwargs = dict(next_results)
            search_results = self.api.search.tweets(**kwargs)
            statuses += search_results['statuses']
            if len(statuses) > max_results:
                break
        return statuses

    #
    # parse tweets as they are collected to extract id, creation
    # date, user id, tweet text
    def parseTweets(self, statuses):
        return [(status['id'],
                 status['created_at'],
                 status['user']['id'],
                 status['user']['name'],
                 status['text'], url['expanded_url'])
                for status in statuses
                for url in status['entities']['urls']]
3. Instantiate the class with the required authentication:

t = TwitterAPI()

4. Run a search on the query term Apache Spark:

q = "ApacheSpark"
tsearch = t.searchTwitter(q)

5. Analyze the JSON output:

pp(tsearch[1])
{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Sat Apr 25 14:50:57 +0000 2015',
 u'entities': {u'hashtags': [{u'indices': [74, 86], u'text': u'sparksummit'}],
               u'media': [{u'display_url': u'pic.twitter.com/WKUMRXxIWZ',
                           u'expanded_url': u'http://twitter.com/bigdata/status/591976255831969792/photo/1',
                           u'id': 591976255156715520,
                           u'id_str': u'591976255156715520',
                           u'indices': [143, 144],
                           u'media_url':
...(snip)...
 u'text': u'RT @bigdata: Enjoyed catching up with @ApacheSpark users & leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
           u'created_at': u'Sat Apr 04 14:44:31 +0000 2015',
           u'default_profile': True,
           u'default_profile_image': True,
           u'description': u'',
           u'entities': {u'description': {u'urls': []}},
           u'favourites_count': 0,
           u'follow_request_sent': False,
           u'followers_count': 586,
           u'following': False,
           u'friends_count': 2,
           u'geo_enabled': False,
           u'id': 3139047660,
           u'id_str': u'3139047660',
           u'is_translation_enabled': False,
           u'is_translator': False,
           u'lang': u'zh-cn',
           u'listed_count': 749,
           u'location': u'',
           u'name': u'MegaDataMama',
           u'notifications': False,
           u'profile_background_color': u'C0DEED',
           u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
           u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png',
           ...(snip)...
           u'screen_name': u'MegaDataMama',
           u'statuses_count': 26673,
           u'time_zone': None,
           u'url': None,
           u'utc_offset': None,
           u'verified': False}}
6. Parse the Twitter output to retrieve key information of interest:

tparsed = t.parseTweets(tsearch)
pp(tparsed)
[(591980327784046592,
  u'Sat Apr 25 15:01:23 +0000 2015',
  63407360,
  u'Jos\xe9 Carlos Baquero',
  u'Big Data systems are making a difference in the fight against cancer. #BigData #ApacheSpark http://t.co/pnOLmsKdL9',
  u'http://tmblr.co/ZqTggs1jHytN0'),
 (591977704464875520,
  u'Sat Apr 25 14:50:57 +0000 2015',
  3139047660,
  u'MegaDataMama',
  u'RT @bigdata: Enjoyed catching up with @ApacheSpark users & leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
  u'http://goo.gl/eF5xwK'),
 (591977172589539328,
  u'Sat Apr 25 14:48:51 +0000 2015',
  2997608763,
  u'Emma Clark',
  u'RT @bigdata: Enjoyed catching up with @ApacheSpark users & leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
  u'http://goo.gl/eF5xwK'),
 ...(snip)...
 (591879098349268992,
  u'Sat Apr 25 08:19:08 +0000 2015',
  331263208,
  u'Mario Molina',
  u'#ApacheSpark speeds up big data decision-making http://t.co/8hdEXreNfN',
  u'http://www.computerweekly.com/feature/Apache-Spark-speeds-up-big-data-decision-making')]
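The nested structure above explains the double loop in parseTweets: a status may carry several URL entities, and one tuple is emitted per (status, url) pair. The following plain-Python sketch reproduces that behavior on a mocked-up status (all field values invented):

```python
# minimal mock of the Twitter search output structure shown above
statuses = [{'id': 1,
             'created_at': 'Sat Apr 25 14:50:57 +0000 2015',
             'user': {'id': 42, 'name': 'alice'},
             'text': 'Spark news',
             'entities': {'urls': [{'expanded_url': 'http://a.example'},
                                   {'expanded_url': 'http://b.example'}]}}]

# same double comprehension as parseTweets
parsed = [(s['id'], s['created_at'], s['user']['id'], s['user']['name'],
           s['text'], url['expanded_url'])
          for s in statuses
          for url in s['entities']['urls']]

print(len(parsed))  # 2: one tuple per URL entity in the single status
```

Note that a status with an empty 'urls' list contributes no tuple at all, so parseTweets silently drops tweets that carry no link.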
Exploring the GitHub world

In order to get a better understanding of how to operate with the GitHub API, we will go through the following steps:

1. Install the GitHub Python library.
2. Access the API by using the token provided when we registered on the developer website.
3. Retrieve some key facts on the Apache foundation that is hosting the spark repository.

Let's go through the process step-by-step:

1. Install the Python PyGithub library. In order to install it, you need to run pip install PyGithub from the command line:

pip install PyGithub
2. Programmatically create a client to instantiate the GitHub API:

from github import Github

# Get your own access token
ACCESS_TOKEN = 'Get_Your_Own_Access_Token'

# We are focusing our attention on User=apache and Repo=spark
USER = 'apache'
REPO = 'spark'

g = Github(ACCESS_TOKEN, per_page=100)
user = g.get_user(USER)
repo = user.get_repo(REPO)

3. Retrieve key facts from the Apache user. There are 640 active Apache repositories on GitHub:

repos_apache = [repo.name for repo in g.get_user('apache').get_repos()]
len(repos_apache)
640
4. Retrieve key facts from the Spark repository. The programming languages used in the Spark repo are given hereunder:

pp(repo.get_languages())
{u'C': 1493,
 u'CSS': 4472,
 u'Groff': 5379,
 u'Java': 1054894,
 u'JavaScript': 21569,
 u'Makefile': 7771,
 u'Python': 1091048,
 u'R': 339201,
 u'Scala': 10249122,
 u'Shell': 172244}
5. Retrieve a few key participants of the wide Spark GitHub repository network. There are 3,738 stargazers in the Apache Spark repository at the time of writing. The network is immense. The first stargazer is Matei Zaharia, the cofounder of the Spark project when he was doing his PhD at Berkeley.

stargazers = [s for s in repo.get_stargazers()]
print "Number of stargazers", len(stargazers)
Number of stargazers 3738

[stargazers[i].login for i in range(0, 20)]
[u'mateiz',
 u'beyang',
 u'abo',
 u'CodingCat',
 u'andy327',
 u'CrazyJvm',
 u'jyotiska',
 u'BaiGang',
 u'sundstei',
 u'dianacarroll',
 u'ybotco',
 u'xelax',
 u'prabeesh',
 u'invkrh',
 u'bedla',
 u'nadesai',
 u'pcpratts',
 u'narkisr',
 u'Honghe',
 u'Jacke']
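The byte counts returned by get_languages can be post-processed locally. For example, using the figures printed in step 4, this plain-Python sketch ranks the languages by their share of the code base:

```python
# language -> byte count, as returned by repo.get_languages() above
languages = {'C': 1493, 'CSS': 4472, 'Groff': 5379, 'Java': 1054894,
             'JavaScript': 21569, 'Makefile': 7771, 'Python': 1091048,
             'R': 339201, 'Scala': 10249122, 'Shell': 172244}

# convert byte counts into percentage shares and sort descending
total = sum(languages.values())
shares = sorted(((n * 100.0 / total, lang)
                 for lang, n in languages.items()), reverse=True)
for share, lang in shares[:3]:
    print('%-10s %.1f%%' % (lang, share))
# Scala dominates the Spark code base, followed by Python and Java
```

This kind of quick aggregation over API output is exactly the sort of exploratory step we will later hand over to Spark once the datasets grow.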
Understanding the community through Meetup

In order to get a better understanding of how to operate with the Meetup API, we will go through the following steps:

1. Create a Python program to call the Meetup API using an authentication token.
2. Retrieve information on past events for meetup groups such as London Data Science.
3. Retrieve the profiles of the meetup members in order to analyze their participation in similar meetup groups.

Let's go through the process step-by-step:
1. As there is no reliable Meetup API Python library, we will programmatically create a client to instantiate the Meetup API:

import json
import mimeparse
import requests
import urllib
from pprint import pprint as pp

MEETUP_API_HOST = 'https://api.meetup.com'
EVENTS_URL = MEETUP_API_HOST + '/2/events.json'
MEMBERS_URL = MEETUP_API_HOST + '/2/members.json'
GROUPS_URL = MEETUP_API_HOST + '/2/groups.json'
RSVPS_URL = MEETUP_API_HOST + '/2/rsvps.json'
PHOTOS_URL = MEETUP_API_HOST + '/2/photos.json'
GROUP_URLNAME = 'London-Machine-Learning-Meetup'
# GROUP_URLNAME = 'London-Machine-Learning-Meetup' # 'Data-Science-London'

class MeetupAPI(object):
    """
    Retrieves information about meetup.com
    """
    def __init__(self, api_key, num_past_events=10, http_timeout=1,
                 http_retries=2):
        """
        Create a new instance of MeetupAPI
        """
        self._api_key = api_key
        self._http_timeout = http_timeout
        self._http_retries = http_retries
        self._num_past_events = num_past_events

    def get_past_events(self):
        """
        Get past meetup events for a given meetup group
        """
        params = {'key': self._api_key,
                  'group_urlname': GROUP_URLNAME,
                  'status': 'past',
                  'desc': 'true'}
        if self._num_past_events:
            params['page'] = str(self._num_past_events)
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(EVENTS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data

    def get_members(self):
        """
        Get meetup members for a given meetup group
        """
        params = {'key': self._api_key,
                  'group_urlname': GROUP_URLNAME,
                  'offset': '0',
                  'format': 'json',
                  'page': '100',
                  'order': 'name'}
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(MEMBERS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data

    def get_groups_by_member(self, member_id='38680722'):
        """
        Get meetup groups for a given meetup member
        """
        params = {'key': self._api_key,
                  'member_id': member_id,
                  'offset': '0',
                  'format': 'json',
                  'page': '100',
                  'order': 'id'}
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(GROUPS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data
2. Then, we will retrieve past events from a given Meetup group:

m = MeetupAPI(api_key='Get_Your_Own_Key')
last_meetups = m.get_past_events()
pp(last_meetups[5])
{u'created': 1401809093000,
 u'description': u"<p>We are hosting a joint meetup between Spark London and Machine Learning London. Given the excitement in the machine learning community around Spark at the moment a joint meetup is in order!</p> <p>Michael Armbrust from the Apache Spark core team will be flying over from the States to give us a talk in person.\xa0Thanks to our sponsors, Cloudera, MapR and Databricks for helping make this happen.</p> <p>The first part of the talk will be about MLlib, the machine learning library for Spark,\xa0and the second part, on\xa0Spark SQL.</p> <p>Don't sign up if you have already signed up on the Spark London page though!</p> <p>\n\n\nAbstract for part one:</p> <p>In this talk, we\u2019ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark\u2019s high-level API, we can process raw data with familiar libraries in Java, Scala or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We\u2019ll also cover upcoming development work including new built-in algorithms and R bindings.</p> <p>\n\n\n\nAbstract for part two:\xa0</p> <p>In this talk, we'll examine Spark SQL, a new Alpha component that is part of the Apache Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query external data with complex analytics. In addition to Spark SQL, we'll explore the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans to execute more efficiently.</p>",
 u'event_url': u'http://www.meetup.com/London-Machine-Learning-Meetup/events/186883262/',
 u'group': {u'created': 1322826414000,
            u'group_lat': 51.52000045776367,
            u'group_lon': -0.18000000715255737,
            u'id': 2894492,
            u'join_mode': u'open',
            u'name': u'London Machine Learning Meetup',
            u'urlname': u'London-Machine-Learning-Meetup',
            u'who': u'Machine Learning Enthusiasts'},
 u'headcount': 0,
 u'id': u'186883262',
 u'maybe_rsvp_count': 0,
 u'name': u'Joint Spark London and Machine Learning Meetup',
 u'rating': {u'average': 4.800000190734863, u'count': 5},
 u'rsvp_limit': 70,
 u'status': u'past',
 u'time': 1403200800000,
 u'updated': 1403450844000,
 u'utc_offset': 3600000,
 u'venue': {u'address_1': u'12 Errol St, London',
            u'city': u'EC1Y 8LX',
            u'country': u'gb',
            u'id': 19504802,
            u'lat': 51.522533,
            u'lon': -0.090934,
            u'name': u'Royal Statistical Society',
            u'repinned': False},
 u'visibility': u'public',
 u'waitlist_count': 84,
 u'yes_rsvp_count': 70}
3. Get information about the Meetup members:
members = m.get_members()
{u'city': u'London',
 u'country': u'gb',
 u'hometown': u'London',
 u'id': 11337881,
 u'joined': 1421418896000,
 u'lat': 51.53,
 u'link': u'http://www.meetup.com/members/11337881',
 u'lon': -0.09,
 u'name': u'Abhishek Shivkumar',
 u'other_services': {u'twitter': {u'identifier': u'@abhisemweb'}},
 u'photo': {u'highres_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/highres_10898643.jpeg',
            u'photo_id': 10898643,
            u'photo_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/member_10898643.jpeg',
            u'thumb_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/thumb_10898643.jpeg'},
 u'self': {u'common': {}},
 u'state': u'17',
 u'status': u'active',
 u'topics': [{u'id': 1372, u'name': u'Semantic Web', u'urlkey': u'semweb'},
             {u'id': 1512, u'name': u'XML', u'urlkey': u'xml'},
             {u'id': 49585,
              u'name': u'Semantic Social Networks',
              u'urlkey': u'semantic-social-networks'},
             {u'id': 24553,
              u'name': u'Natural Language Processing',
...(snip)...
              u'name': u'Android Development',
              u'urlkey': u'android-developers'}],
 u'visited': 1429281599000}
Previewing our app

Our challenge is to make sense of the data retrieved from these social networks, finding the key relationships and deriving insights. Some of the elements of interest are as follows:

Visualizing the top influencers: Discover the top influencers in the community:
Heavy Twitter users on Apache Spark
Committers in GitHub
Leading Meetup presentations
Understanding the network: Network graph of GitHub committers, watchers, and stargazers
Identifying the hot locations: Locating the most active location for Spark

The following screenshot provides a preview of our app:
Summary

In this chapter, we laid out the overall architecture of our app. We explained the two main paradigms of processing data: batch processing, also called data at rest, and streaming analytics, referred to as data in motion. We proceeded to establish connections to three social networks of interest: Twitter, GitHub, and Meetup. We sampled the data and provided a preview of what we are aiming to build. The remainder of the book will focus on the Twitter dataset. We provided here the tools and API to access three social networks, so you can at a later stage create your own data mashups. We are now ready to investigate the data collected, which will be the topic of the next chapter.

In the next chapter, we will delve deeper into data analysis, extracting the key attributes of interest for our purposes and managing the storage of the information for batch and stream processing.
Chapter 3. Juggling Data with Spark

As per the batch and streaming architecture laid out in the previous chapter, we need data to fuel our applications. We will harvest data focused on Apache Spark from Twitter. The objective of this chapter is to prepare data to be further used by the machine learning and streaming applications. This chapter focuses on how to exchange code and data across the distributed network. We will get practical insights into serialization, persistence, marshaling, and caching. We will get to grips with Spark SQL, the key Spark module to interactively explore structured and semi-structured data. The fundamental data structure powering Spark SQL is the Spark dataframe. The Spark dataframe is inspired by the Python Pandas dataframe and the R dataframe. It is a powerful data structure, well understood and appreciated by data scientists with a background in R or Python.

In this chapter, we will cover the following points:

Connect to Twitter, collect the relevant data, and then persist it in various formats such as JSON and CSV and data stores such as MongoDB
Analyze the data using Blaze and Odo, a spin-off library from Blaze, in order to connect and transfer data from various sources and destinations
Introduce Spark dataframes as the foundation for data interchange between the various Spark modules and explore data interactively using Spark SQL

Revisiting the data-intensive app architecture

Let's first put in context the focus of this chapter with respect to the data-intensive app architecture. We will concentrate our attention on the integration layer and essentially run through iterative cycles of the acquisition, refinement, and persistence of the data. This cycle was termed the five Cs. The five Cs stand for connect, collect, correct, compose, and consume. They are the essential processes we run through in the integration layer in order to get to the right quality and quantity of data retrieved from Twitter. We will also delve deeper into the persistence layer and set up a data store such as MongoDB to collect our data for processing later.

We will explore the data with Blaze, a Python library for data manipulation, and Spark SQL, the interactive module of Spark for data discovery powered by the Spark dataframe. The dataframe paradigm is shared by Python Pandas, Python Blaze, and Spark SQL. We will get a feel for the nuances of the three dataframe flavors.

The following diagram sets the context of the chapter's focus, highlighting the integration layer and the persistence layer:
Serializing and deserializing data

As we are harvesting data from web APIs under rate limit constraints, we need to store it. As the data is processed on a distributed cluster, we need consistent ways to save state and retrieve it for later usage.

Let's now define serialization, persistence, marshaling, and caching or memoization.

Serializing a Python object converts it into a stream of bytes. The Python object needs to be retrieved beyond the scope of its existence, when the program is shut down. The serialized Python object can be transferred over a network or stored in persistent storage. Deserialization is the opposite and converts the stream of bytes into the original Python object so the program can carry on from the saved state. The most popular serialization library in Python is Pickle. As a matter of fact, the PySpark commands are transferred over the wire to the worker nodes via pickled data.
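The roundtrip just described can be seen in a few lines with the standard pickle module. This is a minimal sketch; the tweet dictionary below is illustrative, not part of the book's dataset:

```python
import pickle

# Serialize: a Python object becomes a stream of bytes that can be
# transferred over a network or written to persistent storage.
tweet = {'id': 598831111406510082, 'user_name': 'raulsaeztapia'}
payload = pickle.dumps(tweet)
assert isinstance(payload, bytes)

# Deserialize: the byte stream is converted back into the original object,
# so a program can carry on from the saved state.
restored = pickle.loads(payload)
print(restored == tweet)  # -> True
```

The same mechanism is what PySpark relies on when it ships closures and data to worker nodes.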
Persistence saves a program's state data to disk or memory so that it can carry on where it left off upon restart. It saves a Python object from memory to a file or a database and loads it later with the same state.

Marshalling sends Python code or data over a network TCP connection in a multicore or distributed system.

Caching converts a Python object to a string in memory so that it can be used as a dictionary key later on. Spark supports pulling a dataset into a cluster-wide, in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small reference dataset or running an iterative algorithm such as Google PageRank.
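The idea of caching repeatedly accessed results, keyed by their inputs, can be illustrated in plain Python with functools.lru_cache (a memoization sketch, not Spark's cluster-wide cache):

```python
from functools import lru_cache

# Track how many times the function body actually runs.
calls = []

@lru_cache(maxsize=None)
def slow_square(n):
    # Stand-in for an expensive computation; results are cached by argument.
    calls.append(n)
    return n * n

print(slow_square(4))  # computed -> 16
print(slow_square(4))  # served from the in-memory cache -> 16
print(len(calls))      # -> 1: the second call never re-ran the body
```

Spark's RDD cache applies the same principle at cluster scale: data that would otherwise be recomputed or re-read is kept in memory across repeated accesses.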
Caching is a crucial concept for Spark as it allows us to save RDDs in memory or with a spillage to disk. The caching strategy can be selected based on the lineage of the data or the DAG (short for Directed Acyclic Graph) of transformations applied to the RDDs in order to minimize shuffle or cross-network heavy data exchange. In order to achieve good performance with Spark, beware of data shuffling. A good partitioning policy and use of RDD caching, coupled with avoiding unnecessary action operations, leads to better performance with Spark.
Harvesting and storing data

Before delving into database persistent storage such as MongoDB, we will look at some useful file storages that are widely used: CSV (short for comma-separated values) and JSON (short for JavaScript Object Notation) file storage. The enduring popularity of these two file formats lies in a few key reasons: they are human readable, simple, relatively lightweight, and easy to use.
Persisting data in CSV

The CSV format is lightweight, human readable, and easy to use. It has delimited text columns with an inherent tabular schema.

Python offers a robust csv library that can serialize a csv file into a Python dictionary. For the purpose of our program, we have written a Python class that manages the persistence of data in CSV format and reads from a given CSV.
Let's run through the code of the IO_csv class. The __init__ section of the class basically instantiates the file path, the filename, and the file suffix (in this case, .csv):

class IO_csv(object):

    def __init__(self, filepath, filename, filesuffix='csv'):
        self.filepath = filepath    # /path/to/file without the '/' at the end
        self.filename = filename    # FILE_NAME
        self.filesuffix = filesuffix
The save method of the class uses a Python named tuple and the header fields of the csv file in order to impart a schema while persisting the rows of the CSV. If the csv file already exists, it will be appended and not overwritten; otherwise, it will be created:
    def save(self, data, NTname, fields):
        # NTname = Name of the NamedTuple
        # fields = header of CSV - list of the fields name
        NTuple = namedtuple(NTname, fields)

        if os.path.isfile('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix)):
            # Append existing file
            with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'ab') as f:
                writer = csv.writer(f)
                # writer.writerow(fields)    # fields = header of CSV
                writer.writerows([row for row in map(NTuple._make, data)])
                # List comprehension using map on the NamedTuple._make() iterable and the data file to be saved
                # Notice writer.writerows and not writer.writerow (i.e. list of multiple rows sent to csv file)
        else:
            # Create new file
            with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'wb') as f:
                writer = csv.writer(f)
                writer.writerow(fields)    # fields = header of CSV - list of the fields name
                writer.writerows([row for row in map(NTuple._make, data)])
                # List comprehension using map on the NamedTuple._make() iterable and the data file to be saved
                # Notice writer.writerows and not writer.writerow (i.e. list of multiple rows sent to csv file)
The load method of the class also uses a Python named tuple and the header fields of the csv file in order to retrieve the data using a consistent schema. The load method is a memory-efficient generator to avoid loading a huge file in memory: hence we use yield in place of return:

    def load(self, NTname, fields):
        # NTname = Name of the NamedTuple
        # fields = header of CSV - list of the fields name
        NTuple = namedtuple(NTname, fields)

        with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'rU') as f:
            reader = csv.reader(f)
            for row in map(NTuple._make, reader):
                # Using map on the NamedTuple._make() iterable and the reader file to be loaded
                yield row
Here's the named tuple. We are using it to parse the tweet in order to save or retrieve it to and from the csv file:

fields01 = ['id', 'created_at', 'user_id', 'user_name', 'tweet_text', 'url']
Tweet01 = namedtuple('Tweet01', fields01)

def parse_tweet(data):
    """
    Parse a ``tweet`` from the given response data.
    """
    return Tweet01(
        id=data.get('id', None),
        created_at=data.get('created_at', None),
        user_id=data.get('user_id', None),
        user_name=data.get('user_name', None),
        tweet_text=data.get('tweet_text', None),
        url=data.get('url')
    )
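The save/load pattern above can be exercised end to end. Here is a minimal, runnable Python 3 sketch of the same roundtrip (the book's class targets Python 2, hence its 'wb'/'ab' modes); the sample tweet, file path, and helper names are illustrative:

```python
import csv
import os
import tempfile
from collections import namedtuple

# Same schema as the Tweet01 named tuple above.
fields01 = ['id', 'created_at', 'user_id', 'user_name', 'tweet_text', 'url']
Tweet01 = namedtuple('Tweet01', fields01)

def save_rows(path, fields, rows):
    """Write a header then the rows, mirroring the create branch of IO_csv.save."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(fields)       # header row
        writer.writerows(rows)        # writerows, not writerow: many rows at once

def load_rows(path, nt):
    """Generator mirroring IO_csv.load: yields one named tuple per CSV row."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)                  # skip the header row
        for row in map(nt._make, reader):
            yield row

tweets = [Tweet01('1', '2015-05-14', '42', 'alice', 'Spark is fast', 'http://example.com')]
path = os.path.join(tempfile.mkdtemp(), 'tweets.csv')
save_rows(path, fields01, tweets)
loaded = list(load_rows(path, Tweet01))
print(loaded[0].user_name)  # -> alice
```

Because load_rows is a generator, rows are produced one at a time and a large file never needs to fit in memory.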
Persisting data in JSON

JSON is one of the most popular data formats for Internet-based applications. All the APIs we are dealing with, Twitter, GitHub, and Meetup, deliver their data in JSON format. The JSON format is relatively lightweight compared to XML and human readable, and the schema is embedded in JSON. As opposed to the CSV format, where all records follow exactly the same tabular structure, JSON records can vary in their structure. JSON is semi-structured. A JSON record can be mapped into a Python dictionary of dictionaries.
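A tiny illustration of this semi-structure (the two records below are made up, not from the book's dataset): each JSON record parses into a dictionary of dictionaries, and records need not share the same shape:

```python
import json

records = [
    '{"id": 1, "user": {"name": "alice", "location": {"city": "London"}}}',
    '{"id": 2, "user": {"name": "bob"}, "tags": ["spark", "python"]}',
]
parsed = [json.loads(r) for r in records]

# Nested fields are reached by chained dictionary lookups.
print(parsed[0]['user']['location']['city'])  # -> London

# The second record has no "location" key: semi-structured data
# calls for defensive access with .get() and a default.
print(parsed[1]['user'].get('location', 'unknown'))  # -> unknown
```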
Let's run through the code of the IO_json class. The __init__ section of the class basically instantiates the file path, the filename, and the file suffix (in this case, .json):

class IO_json(object):
    def __init__(self, filepath, filename, filesuffix='json'):
        self.filepath = filepath    # /path/to/file without the '/' at the end
        self.filename = filename    # FILE_NAME
        self.filesuffix = filesuffix
        # self.file_io = os.path.join(dir_name, '.'.join((base_filename, filename_suffix)))
The save method of the class uses utf-8 encoding in order to ensure read and write compatibility of the data. If the JSON file already exists, it will be appended and not overwritten; otherwise, it will be created:

    def save(self, data):
        if os.path.isfile('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix)):
            # Append existing file
            with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'a', encoding='utf-8') as f:
                f.write(unicode(json.dumps(data, ensure_ascii=False)))    # In Python 3, there is no "unicode" function
                # f.write(json.dumps(data, ensure_ascii=False))    # creates a \" escape char for " in the saved file
        else:
            # Create new file
            with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'w', encoding='utf-8') as f:
                f.write(unicode(json.dumps(data, ensure_ascii=False)))
                # f.write(json.dumps(data, ensure_ascii=False))
The load method of the class just returns the file that has been read. A further json.loads function needs to be applied in order to retrieve the json out of the file read:

    def load(self):
        with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), encoding='utf-8') as f:
            return f.read()
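The full cycle, including the extra json.loads step that load leaves to the caller, looks like this in a minimal Python 3 sketch (the sample record and temporary path are illustrative):

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'twtr.json')
data = {'id': 598831111406510082, 'user_name': 'raulsaeztapia'}

# save: object -> JSON text written with utf-8 encoding
with open(path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(data, ensure_ascii=False))

# load: returns the raw text of the file, still a plain string
with open(path, encoding='utf-8') as f:
    raw = f.read()

# the further step the text mentions: parse the string back into objects
record = json.loads(raw)
print(record['user_name'])  # -> raulsaeztapia
```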
Setting up MongoDB

It is crucial to store the information harvested. Thus, we set up MongoDB as our main document data store. As all the information collected is in JSON format and MongoDB stores information in BSON (short for Binary JSON), it is therefore a natural choice.

We will run through the following steps now:

Installing the MongoDB server and client
Running the MongoDB server
Running the Mongo client
Installing the PyMongo driver
Creating the Python Mongo client
Installing the MongoDB server and client

In order to install the MongoDB package, perform the following steps:

1. Import the public key used by the package management system (in our case, Ubuntu's apt). To import the MongoDB public key, we issue the following command:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
2. Create a list file for MongoDB. To create the list file, we use the following command:
echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
3. Update the local package database as sudo:
sudo apt-get update
4. Install the MongoDB packages. We install the latest stable version of MongoDB with the following command:
sudo apt-get install -y mongodb-org
Running the MongoDB server

Let's start the MongoDB server:

1. To start the MongoDB server, we issue the following command to start mongod:
sudo service mongodb start
2. To check whether mongod has started properly, we issue the command:
an@an-VB:/usr/bin$ ps -ef | grep mongo
mongodb    967     1  4 07:03 ?        00:02:02 /usr/bin/mongod --config /etc/mongod.conf
an        3143  3085  0 07:45 pts/3    00:00:00 grep --color=auto mongo
In this case, we see that mongodb is running in process 967.
3. The mongod server sends a message to the effect that it is waiting for connection on port 27017. This is the default port for MongoDB. It can be changed in the configuration file.
4. We can check the contents of the log file at /var/log/mongod/mongod.log:
an@an-VB:/var/lib/mongodb$ ls -lru
total 81936
drwxr-xr-x 2 mongodb nogroup     4096 Apr 25 11:19 _tmp
-rw-r--r-- 1 mongodb nogroup       69 Apr 25 11:19 storage.bson
-rwxr-xr-x 1 mongodb nogroup        5 Apr 25 11:19 mongod.lock
-rw------- 1 mongodb nogroup 16777216 Apr 25 11:19 local.ns
-rw------- 1 mongodb nogroup 67108864 Apr 25 11:19 local.0
drwxr-xr-x 2 mongodb nogroup     4096 Apr 25 11:19 journal
5. In order to stop the mongodb server, just issue the following command:
sudo service mongodb stop
Running the Mongo client

Running the Mongo client in the console is as easy as calling mongo, as highlighted in the following command:

an@an-VB:/usr/bin$ mongo
MongoDB shell version: 3.0.2
connecting to: test
Server has startup warnings:
2015-05-30T07:03:49.387+0200 I CONTROL  [initandlisten]
2015-05-30T07:03:49.388+0200 I CONTROL  [initandlisten]
At the mongo client console prompt, we can see the databases with the following commands:

> show dbs
local  0.078GB
test   0.078GB

We select the test database using use test:

> use test
switched to db test

We display the collections within the test database:

> show collections
restaurants
system.indexes

We check a sample record in the restaurants collection listed previously:

> db.restaurants.find()
{ "_id": ObjectId("553b70055e82e7b824ae0e6f"), "address": { "building": "1007",
"coord": [ -73.856077, 40.848447 ], "street": "Morris Park Ave",
"zipcode": "10462" }, "borough": "Bronx", "cuisine": "Bakery", "grades": [ {
"grade": "A", "score": 2, "date": ISODate("2014-03-03T00:00:00Z") }, {
"date": ISODate("2013-09-11T00:00:00Z"), "grade": "A", "score": 6 }, {
"score": 10, "date": ISODate("2013-01-24T00:00:00Z"), "grade": "A" }, {
"date": ISODate("2011-11-23T00:00:00Z"), "grade": "A", "score": 9 }, {
"date": ISODate("2011-03-10T00:00:00Z"), "grade": "B", "score": 14 } ],
"name": "Morris Park Bake Shop", "restaurant_id": "30075445" }
Installing the PyMongo driver

Installing the Python driver with anaconda is easy. Just run the following command at the terminal:

conda install pymongo
Creating the Python client for MongoDB

We are creating an IO_mongo class that will be used in our harvesting and processing programs to store the data collected and retrieve saved information. In order to create the mongo client, we will import the MongoClient module from pymongo. We connect to the mongodb server on localhost at port 27017. The command is as follows:

from pymongo import MongoClient as MCli

class IO_mongo(object):
    conn = {'host': 'localhost', 'ip': '27017'}

We initialize our class with the client connection, the database (in this case, twtr_db), and the collection (in this case, twtr_coll) to be accessed:

    def __init__(self, db='twtr_db', coll='twtr_coll', **conn):
        # Connects to the MongoDB server
        self.client = MCli(**conn)
        self.db = self.client[db]
        self.coll = self.db[coll]

The save method inserts new records in the preinitialized collection and database:

    def save(self, data):
        # Insert to collection in db
        return self.coll.insert(data)
The load method allows the retrieval of specific records according to criteria and projection. In the case of a large amount of data, it returns a cursor:

    def load(self, return_cursor=False, criteria=None, projection=None):
        if criteria is None:
            criteria = {}
        if projection is None:
            cursor = self.coll.find(criteria)
        else:
            cursor = self.coll.find(criteria, projection)

        # Return a cursor for large amounts of data
        if return_cursor:
            return cursor
        else:
            return [item for item in cursor]
Harvesting data from Twitter

Each social network poses its limitations and challenges. One of the main obstacles for harvesting data is an imposed rate limit. While running repeated or long-running connections between rate limit pauses, we have to be careful to avoid collecting duplicate data.

We have redesigned our connection programs outlined in the previous chapter to take care of the rate limits.

In this TwitterAPI class that connects and collects the tweets according to the search query we specify, we have added the following:

Logging capability using the Python logging library with the aim of collecting any errors or warnings in the case of program failure
Persistence capability using MongoDB, with the IO_mongo class exposed previously, as well as JSON files using the IO_json class
API rate limit and error management capability, so we can ensure more resilient calls to Twitter without getting barred for tapping into the firehose

Let's go through the steps:
1. We initialize by instantiating the Twitter API with our credentials:

class TwitterAPI(object):
    """
    TwitterAPI class allows the connection to Twitter via OAuth
    once you have registered with Twitter and received the
    necessary credentials
    """

    def __init__(self):
        consumer_key = 'get_your_credentials'
        consumer_secret = 'get_your_credentials'
        access_token = 'get_your_credentials'
        access_secret = 'get_your_credentials'
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_secret = access_secret
        self.retries = 3
        self.auth = twitter.oauth.OAuth(access_token, access_secret,
                                        consumer_key, consumer_secret)
        self.api = twitter.Twitter(auth=self.auth)
2. We initialize the logger by providing the log level:

logger.debug(debug message)
logger.info(info message)
logger.warn(warn message)
logger.error(error message)
logger.critical(critical message)
3. We set the log path and the message format:

        # logger initialisation
        appName = 'twt150530'
        self.logger = logging.getLogger(appName)
        # self.logger.setLevel(logging.DEBUG)
        # create console handler and set level to debug
        logPath = '/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data'
        fileName = appName
        fileHandler = logging.FileHandler("{0}/{1}.log".format(logPath, fileName))
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        fileHandler.setFormatter(formatter)
        self.logger.addHandler(fileHandler)
        self.logger.setLevel(logging.DEBUG)
4. We initialize the JSON file persistence instruction:

        # Save to JSON file initialisation
        jsonFpath = '/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data'
        jsonFname = 'twtr15053001'
        self.jsonSaver = IO_json(jsonFpath, jsonFname)

5. We initialize the MongoDB database and collection for persistence:

        # Save to MongoDB initialisation
        self.mongoSaver = IO_mongo(db='twtr01_db', coll='twtr01_coll')
6. The method searchTwitter launches the search according to the query specified:

    def searchTwitter(self, q, max_res=10, **kwargs):
        search_results = self.api.search.tweets(q=q, count=10, **kwargs)
        statuses = search_results['statuses']
        max_results = min(1000, max_res)

        for _ in range(10):
            try:
                next_results = search_results['search_metadata']['next_results']
                # self.logger.info('info in searchTwitter - next_results: %s' % next_results[1:])
            except KeyError as e:
                self.logger.error('error in searchTwitter: %s', e)
                break

            # next_results = urlparse.parse_qsl(next_results[1:])    # python 2.7
            next_results = urllib.parse.parse_qsl(next_results[1:])
            # self.logger.info('info in searchTwitter - next_results[max_id]: %s', next_results[0:])
            kwargs = dict(next_results)
            # self.logger.info('info in searchTwitter - next_results[max_id]: %s' % kwargs['max_id'])
            search_results = self.api.search.tweets(**kwargs)
            statuses += search_results['statuses']
            self.saveTweets(search_results['statuses'])

            if len(statuses) > max_results:
                self.logger.info('info in searchTwitter - got %i tweets - max: %i' % (len(statuses), max_results))
                break
        return statuses
7. The saveTweets method actually saves the collected tweets in JSON and in MongoDB:

    def saveTweets(self, statuses):
        # Saving to JSON file
        self.jsonSaver.save(statuses)

        # Saving to MongoDB
        for s in statuses:
            self.mongoSaver.save(s)
8. The parseTweets method allows us to extract the key tweet information from the vast amount of information provided by the Twitter API:

    def parseTweets(self, statuses):
        return [(status['id'],
                 status['created_at'],
                 status['user']['id'],
                 status['user']['name'],
                 status['text'],
                 url['expanded_url'])
                for status in statuses
                for url in status['entities']['urls']]
9. The getTweets method calls the searchTwitter method described previously. The getTweets method ensures that API calls are made reliably whilst respecting the imposed rate limit. The code is as follows:

    def getTweets(self, q, max_res=10):
        """
        Make a Twitter API call whilst managing rate limit and errors.
        """
        def handleError(e, wait_period=2, sleep_when_rate_limited=True):
            if wait_period > 3600:    # Seconds
                self.logger.error('Too many retries in getTweets: %s', e)
                raise e
            if e.e.code == 401:
                self.logger.error('error 401 * Not Authorised * in getTweets: %s', e)
                return None
            elif e.e.code == 404:
                self.logger.error('error 404 * Not Found * in getTweets: %s', e)
                return None
            elif e.e.code == 429:
                self.logger.error('error 429 * API Rate Limit Exceeded * in getTweets: %s', e)
                if sleep_when_rate_limited:
                    self.logger.error('error 429 * Retrying in 15 minutes * in getTweets: %s', e)
                    sys.stderr.flush()
                    time.sleep(60 * 15 + 5)
                    self.logger.info('error 429 * Retrying now * in getTweets: %s', e)
                    return 2
                else:
                    raise e    # Caller must handle the rate limiting issue
            elif e.e.code in (500, 502, 503, 504):
                self.logger.info('Encountered %i Error. Retrying in %i seconds' % (e.e.code, wait_period))
                time.sleep(wait_period)
                wait_period *= 1.5
                return wait_period
            else:
                self.logger.error('Exit - aborting - %s', e)
                raise e
10. Here, we are calling the searchTwitter API with the relevant query based on the parameters specified. If we encounter any error such as rate limitation from the provider, this will be processed by the handleError method:

        wait_period = 2
        while True:
            try:
                return self.searchTwitter(q, max_res=10)
            except twitter.api.TwitterHTTPError as e:
                error_count = 0
                wait_period = handleError(e, wait_period)
                if wait_period is None:
                    return
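The retry policy handleError applies to transient 5xx errors is an exponential backoff: wait, then grow the wait by a factor of 1.5 per retry, and abort once the wait exceeds one hour. A hypothetical standalone sketch of that schedule (not the book's code, and without the actual sleeps):

```python
def backoff_schedule(wait_period=2.0, cap=3600.0):
    """Yield successive wait periods until the one-hour cap is exceeded."""
    while wait_period <= cap:
        yield wait_period
        wait_period *= 1.5    # same growth factor as handleError

waits = list(backoff_schedule())
print(waits[:4])       # -> [2.0, 3.0, 4.5, 6.75]
print(len(waits))      # -> 19 retries fit under the one-hour cap
```

Growing the delay geometrically keeps the retry count small while giving a struggling server progressively more time to recover.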
Exploring data using Blaze

Blaze is an open source Python library, primarily developed by Continuum.io, leveraging Python Numpy arrays and Pandas dataframes. Blaze extends to out-of-core computing, while Pandas and Numpy are single-core.

Blaze offers an adaptable, unified, and consistent user interface across various backends. Blaze orchestrates the following:

Data: Seamless exchange of data across storages such as CSV, JSON, HDF5, HDFS, and Bcolz files.
Computation: Using the same query processing against computational backends such as Spark, MongoDB, Pandas, or SQLAlchemy.
Symbolic expressions: Abstract expressions such as join, group-by, filter, selection, and projection with a syntax similar to Pandas but limited in scope. Implements the split-apply-combine methods pioneered by the R language.

Blaze expressions are lazily evaluated and in that respect share a similar processing paradigm with Spark RDD transformations.
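The split-apply-combine pattern mentioned above can be shown in plain Python with itertools.groupby; this is an illustrative sketch on made-up tweet counts, not a Blaze expression:

```python
from itertools import groupby
from operator import itemgetter

# (user_name, tweet_count) pairs; the data here is illustrative.
rows = [('raulsaeztapia', 1), ('John Humphreys', 1), ('raulsaeztapia', 1)]

# Split on user name (groupby needs sorted input), apply a sum over
# each group, and combine the per-group results into one dictionary.
rows_sorted = sorted(rows, key=itemgetter(0))
counts = {user: sum(n for _, n in grp)
          for user, grp in groupby(rows_sorted, key=itemgetter(0))}
print(counts)  # -> {'John Humphreys': 1, 'raulsaeztapia': 2}
```

Blaze's by expression (and Spark's groupBy) perform the same split-apply-combine, but lazily and potentially out of core.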
Let's dive into Blaze by first importing the necessary libraries: numpy, pandas, blaze and odo. Odo is a spin-off of Blaze and ensures data migration from various backends. The commands are as follows:

import numpy as np
import pandas as pd
from blaze import Data, by, join, merge
from odo import odo
BokehJS successfully loaded.
We create a Pandas Dataframe by reading the parsed tweets saved in a CSV file, twts_csv:

twts_pd_df = pd.DataFrame(twts_csv_read, columns=Tweet01._fields)
twts_pd_df.head()
Out[65]:
   id                  created_at           user_id   user_name      tweet_text                                         url
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://www.webex.com/ciscospark/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://sparkjava.com/
We run the Tweets Pandas Dataframe through the describe() function to get some overall information on the dataset:

twts_pd_df.describe()
Out[66]:
        id                  created_at           user_id   user_name      tweet_text                                         url
count   19                  19                   19        19             19                                                 19
unique  7                   7                    6         6              6                                                  7
top     598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://bit.ly/1Hfd0Xm
freq    6                   6                    9         9              6                                                  6
We convert the Pandas dataframe into a Blaze dataframe by simply passing it through the Data() function:

#
# Blaze dataframe
#
twts_bz_df = Data(twts_pd_df)
We can retrieve the schema representation of the Blaze dataframe by calling the schema function:

twts_bz_df.schema
Out[73]:
dshape("""{
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

The .dshape function gives a record count and the schema:

twts_bz_df.dshape
Out[74]:
dshape("""19 * {
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")
We can print the Blaze dataframe content:

twts_bz_df.data
Out[75]:
    id                  created_at           user_id     user_name            tweet_text                                         url
1   598831111406510082  2015-05-14 12:43:57  14755521    raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2   598831111406510082  2015-05-14 12:43:57  14755521    raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
...
18  598782970082807808  2015-05-14 09:32:39  1377652806  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w...  http://buff.ly/1QBpk8J
19  598777933730160640  2015-05-14 09:12:38  294862170   Ellen Friedman       I'm still on Euro time. If you are too check o...  http://bit.ly/1Hfd0Xm
We extract the column tweet_text and take the unique values:

twts_bz_df.tweet_text.distinct()
Out[76]:
   tweet_text
0  RT @pacoid: Great recap of @StrataConf EU in L...
1  RT @alvaroagea: Simply @ApacheSpark http://t.c...
2  RT @PrabhaGana: What exactly is @ApacheSpark a...
3  RT @Ellen_Friedman: I'm still on Euro time. If...
4  RT @BigDataTechCon: Moving Rating Prediction w...
5  I'm still on Euro time. If you are too check o...
We extract multiple columns ['id', 'user_name', 'tweet_text'] from the dataframe and take the unique records:

twts_bz_df[['id', 'user_name', 'tweet_text']].distinct()
Out[78]:
   id                  user_name            tweet_text
0  598831111406510082  raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L...
1  598808944719593472  raulsaeztapia        RT @alvaroagea: Simply @ApacheSpark http://t.c...
2  598796205091500032  John Humphreys       RT @PrabhaGana: What exactly is @ApacheSpark a...
3  598788561127735296  Leonardo D'Ambrosi   RT @Ellen_Friedman: I'm still on Euro time. If...
4  598785545557438464  Alexey Kosenkov      RT @Ellen_Friedman: I'm still on Euro time. If...
5  598782970082807808  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w...
6  598777933730160640  Ellen Friedman       I'm still on Euro time. If you are too check o...
Transferring data using Odo

Odo is a spin-off project of Blaze. Odo allows the interchange of data. Odo ensures the migration of data across different formats (CSV, JSON, HDFS, and more) and across different databases (SQL databases, MongoDB, and so on) using a very simple predicate:

odo(source, target)

To transfer to a database, the address is specified using a URL. For example, for a MongoDB database, it would look like this:

mongodb://username:password@hostname:port/database_name::collection_name
Let's run some examples of using Odo. Here, we illustrate odo by reading a CSV file and creating a Blaze dataframe:

filepath = csvFpath
filename = csvFname
filesuffix = csvSuffix
twts_odo_df = Data('{0}/{1}.{2}'.format(filepath, filename, filesuffix))

Count the number of records in the dataframe:

twts_odo_df.count()
Out[81]:
19
Display the five initial records of the dataframe:

twts_odo_df.head(5)
Out[82]:
   id                  created_at           user_id   user_name      tweet_text                                         url
0  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://www.webex.com/ciscospark/
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://sparkjava.com/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  https://www.sparkfun.com/
Get dshape information from the dataframe, which gives us the number of records and the schema:

twts_odo_df.dshape
Out[83]:
dshape("""var * {
  id: int64,
  created_at: ?datetime,
  user_id: int64,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")
Save a processed Blaze dataframe into JSON:

odo(twts_odo_distinct_df, '{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix))
Out[92]:
<odo.backends.json.JSONLines at 0x7f77f0abfc50>

Convert a JSON file to a CSV file:

odo('{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix), '{0}/{1}.{2}'.format(csvFpath, csvFname, csvSuffix))
Out[94]:
<odo.backends.csv.CSV at 0x7f77f0abfe10>
Exploring data using Spark SQL

Spark SQL is a relational query engine built on top of Spark Core. Spark SQL uses a query optimizer called Catalyst.

Relational queries can be expressed using SQL or HiveQL and executed against JSON, CSV, and various databases. Spark SQL gives us the full expressiveness of declarative programming with Spark dataframes on top of functional programming with RDDs.
Understanding Spark dataframes

Here's a diagram depicting the advent of Spark SQL and dataframes. It also highlights the various data sources in the lower part of the diagram. On the top part, we can notice R as the new language that will be gradually supported on top of Scala, Java, and Python. Ultimately, the DataFrame philosophy is pervasive between R, Python, and Spark.

Spark dataframes originate from SchemaRDDs. A dataframe combines an RDD with a schema that can be inferred by Spark, if requested, when registering the dataframe. It allows us to query complex nested JSON data with plain SQL. Lazy evaluation, lineage, partitioning, and persistence apply to dataframes.
Let's query the data with Spark SQL, by first importing SparkContext and SQLContext:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
In [95]:
sc
Out[95]:
<pyspark.context.SparkContext at 0x7f7829581890>
In [96]:
sc.master
Out[96]:
u'local[*]'
In [98]:
# Instantiate Spark SQL context
sqlc = SQLContext(sc)
We read in the JSON file we saved with Odo:

twts_sql_df_01 = sqlc.jsonFile("/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401_distinct.json")
In [101]:
twts_sql_df_01.show()
created_at           id                 tweet_text           user_id     user_name
2015-05-14T12:43:57Z 598831111406510082 RT @pacoid: Great... 14755521    raulsaeztapia
2015-05-14T11:15:52Z 598808944719593472 RT @alvaroagea: S... 14755521    raulsaeztapia
2015-05-14T10:25:15Z 598796205091500032 RT @PrabhaGana: W... 48695135    John Humphreys
2015-05-14T09:54:52Z 598788561127735296 RT @Ellen_Friedma... 2385931712  Leonardo D'Ambrosi
2015-05-14T09:42:53Z 598785545557438464 RT @Ellen_Friedma... 461020977   Alexey Kosenkov
2015-05-14T09:32:39Z 598782970082807808 RT @BigDataTechCo... 1377652806  embeddedcomputer.nl
2015-05-14T09:12:38Z 598777933730160640 I'm still on Euro... 294862170   Ellen Friedman
We print the schema of the Spark dataframe:

twts_sql_df_01.printSchema()
root
 |-- created_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- tweet_text: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)
We select the user_name column from the dataframe:
twts_sql_df_01.select('user_name').show()
user_name
raulsaeztapia
raulsaeztapia
John Humphreys
Leonardo D'Ambrosi
Alexey Kosenkov
embeddedcomputer.nl
Ellen Friedman
We register the dataframe as a table, so we can execute a SQL query on it:
twts_sql_df_01.registerAsTable('tweets_01')
We execute a SQL statement against the dataframe:
twts_sql_df_01_selection = sqlc.sql("SELECT * FROM tweets_01 WHERE user_name = 'raulsaeztapia'")
In [109]:
twts_sql_df_01_selection.show()
created_at           id                 tweet_text         user_id  user_name
2015-05-14T12:43:57Z 598831111406510082 RT @pacoid: Great… 14755521 raulsaeztapia
2015-05-14T11:15:52Z 598808944719593472 RT @alvaroagea: S… 14755521 raulsaeztapia
Let's process some more complex JSON; we read the original Twitter JSON file:
tweets_sqlc_inf = sqlc.jsonFile(infile)
Spark SQL is able to infer the schema of a complex nested JSON file:
tweets_sqlc_inf.printSchema()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |-- indices: array (nullable = true)
... (snip) ...
 |    |-- statuses_count: long (nullable = true)
 |    |-- time_zone: string (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- utc_offset: long (nullable = true)
 |    |-- verified: boolean (nullable = true)
We extract the key information of interest from the wall of data by selecting specific columns in the dataframe (in this case, ['created_at', 'id', 'text', 'user.id', 'user.name', 'entities.urls.expanded_url']):
tweets_extract_sqlc = tweets_sqlc_inf[['created_at', 'id', 'text', 'user.id', 'user.name', 'entities.urls.expanded_url']].distinct()
In [145]:
tweets_extract_sqlc.show()
created_at           id                 text               id         name                expanded_url
Thu May 14 09:32:... 598782970082807808 RT @BigDataTechCo… 1377652806 embeddedcomputer.nl ArrayBuffer(http:...
Thu May 14 12:43:... 598831111406510082 RT @pacoid: Great… 14755521   raulsaeztapia       ArrayBuffer(http:...
Thu May 14 12:18:... 598824733086523393 @rabbitonweb spea…
...
Thu May 14 12:28:... 598827171168264192 RT @baandrzejczak… 20909005   Paweł Szulc         ArrayBuffer()
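Spark SQL resolves those dotted column paths into the nested JSON structure. As a plain-Python illustration of the same idea (the sample tweet below is a simplified, made-up record, not actual harvested data), a dotted-path lookup can be sketched as:

```python
# Plain-Python sketch of the dotted-path selection performed above:
# 'user.id' and 'entities.urls.expanded_url' walk into the nested JSON.
def get_path(record, path):
    """Follow a dotted path into nested dicts; map over lists of dicts."""
    current = record
    for key in path.split('.'):
        if isinstance(current, list):
            current = [item[key] for item in current]
        else:
            current = current[key]
    return current

# Made-up, heavily simplified tweet record for illustration.
tweet = {
    'created_at': '2015-05-14T12:43:57Z',
    'id': 598831111406510082,
    'text': 'RT @pacoid: Great ...',
    'user': {'id': 14755521, 'name': 'raulsaeztapia'},
    'entities': {'urls': [{'expanded_url': 'http://example.com/a'}]},
}

columns = ['created_at', 'id', 'text', 'user.id', 'user.name',
           'entities.urls.expanded_url']
row = [get_path(tweet, c) for c in columns]
print(row)
```

Note how the path through the urls array yields a list of expanded URLs, which is why the dataframe output above shows ArrayBuffer values in that column.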
Understanding the Spark SQL query optimizer
We execute a SQL statement against the dataframe:
tweets_extract_sqlc_sel = sqlc.sql("SELECT * from Tweets_xtr_001 WHERE name = 'raulsaeztapia'")
We get a detailed view of the query plans executed by Spark SQL:
Parsed logical plan
Analyzed logical plan
Optimized logical plan
Physical plan
The query plan uses Spark SQL's Catalyst optimizer. In order to generate the compiled bytecode from the query parts, the Catalyst optimizer runs through logical plan parsing and optimization followed by physical plan evaluation and optimization based on cost.
This is illustrated in the following tweet:
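To get a feel for what a rule-based optimizer does, here is a toy sketch in plain Python. This is not Catalyst itself: it is a single invented rewrite rule, over an invented tuple-based plan representation, that pushes a filter below a projection so that rows are discarded before columns are shaped.

```python
# Toy rule-based plan rewrite (illustration only, not Catalyst).
# A plan node is a tuple: (operator, payload, child).
def push_filter_below_project(plan):
    """If a Filter sits on a Project, swap them so rows drop earlier."""
    if plan[0] == 'Filter' and plan[2][0] == 'Project':
        _, pred, (_, cols, child) = plan
        return ('Project', cols, ('Filter', pred, child))
    return plan

logical_plan = ('Filter', "name = 'raulsaeztapia'",
                ('Project', ['created_at', 'id', 'name'],
                 ('Relation', 'tweets', None)))

optimized = push_filter_below_project(logical_plan)
print(optimized)
```

Catalyst applies many such rules (predicate pushdown, constant folding, column pruning) repeatedly over the logical plan before costing physical alternatives.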
Looking back at our code, we call the .explain function on the Spark SQL query we just executed. It delivers the full details of the steps taken by the Catalyst optimizer in order to assess and optimize the logical and physical plans and get to the result RDD:
tweets_extract_sqlc_sel.explain(extended=True)
== Parsed Logical Plan ==
'Project [*]
 'Filter ('name = raulsaeztapia)
  'UnresolvedRelation [Tweets_xtr_001], None
== Analyzed Logical Plan ==
Project [created_at#7,id#12L,text#27,id#80L,name#81,expanded_url#82]
 Filter (name#81 = raulsaeztapia)
  Distinct
   Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
    Relation[contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29] JSONRelation(/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401.json,1.0,None)
== Optimized Logical Plan ==
Filter (name#81 = raulsaeztapia)
 Distinct
  Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
   Relation[contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29] JSONRelation(/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401.json,1.0,None)
== Physical Plan ==
Filter (name#81 = raulsaeztapia)
 Distinct false
  Exchange (HashPartitioning [created_at#7,id#12L,text#27,id#80L,name#81,expanded_url#82], 200)
   Distinct true
    Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
     PhysicalRDD [contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29], MapPartitionsRDD[165] at map at JsonRDD.scala:41
Code Generation: false
== RDD ==
Finally, here's the result of the query:
tweets_extract_sqlc_sel.show()
created_at           id                 text               id       name          expanded_url
Thu May 14 12:43:... 598831111406510082 RT @pacoid: Great… 14755521 raulsaeztapia ArrayBuffer(http:...
Thu May 14 11:15:... 598808944719593472 RT @alvaroagea: S… 14755521 raulsaeztapia ArrayBuffer(http:...
In [148]:
Loading and processing CSV files with Spark SQL
We will use the Spark package spark-csv_2.11:1.2.0. The command used to launch PySpark with the IPython Notebook and the spark-csv package should explicitly state the --packages argument:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
This will trigger the following output; we can see that the spark-csv package is installed with all its dependencies:
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
... (snip) ...
Ivy Default Cache set to: /home/an/.ivy2/cache
The jars for the packages stored in: /home/an/.ivy2/jars
:: loading settings :: url = jar:file:/home/an/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-csv_2.11;1.2.0 in central
        found org.apache.commons#commons-csv;1.1 in central
        found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 835ms :: artifacts dl 48ms
        :: modules in use:
        com.databricks#spark-csv_2.11;1.2.0 from central in [default]
        com.univocity#univocity-parsers;1.5.1 from central in [default]
        org.apache.commons#commons-csv;1.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/45ms)
We are now ready to load our CSV file and process it. Let's first instantiate the SQLContext:
#
# Read csv in a Spark DF
#
sqlContext = SQLContext(sc)
spdf_in = sqlContext.read.format('com.databricks.spark.csv')\
                    .options(delimiter=";", header='true')\
                    .load(csv_in)
We access the schema of the dataframe created from the loaded CSV:
In [10]:
spdf_in.printSchema()
root
 |-- : string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_name: string (nullable = true)
 |-- tweet_text: string (nullable = true)
We check the columns of the dataframe:
In [12]:
spdf_in.columns
Out[12]:
['', 'id', 'created_at', 'user_id', 'user_name', 'tweet_text']
We introspect the dataframe content:
In [13]:
spdf_in.show()
+---+------------------+--------------------+----------+------------------+--------------------+
|   |                id|          created_at|   user_id|         user_name|          tweet_text|
+---+------------------+--------------------+----------+------------------+--------------------+
|  0|638830426971181057|Tue Sep 01 21:46:...|3276255125|     True Equality| ernestsgantt: Bey…|
|  1|638830426727911424|Tue Sep 01 21:46:...|3276255125|     True Equality| ernestsgantt: Bey…|
|  2|638830425402556417|Tue Sep 01 21:46:...|3276255125|     True Equality| ernestsgantt: Bey…|
... (snip) ...
| 41|638830280988426250|Tue Sep 01 21:46:...| 951081582|      Jack Baldwin|  RT @cloudaus: We…|
| 42|638830276626399232|Tue Sep 01 21:46:...|   6525302|Masayoshi Nakamura|PynamoDB使いやすいです |
+---+------------------+--------------------+----------+------------------+--------------------+
only showing top 20 rows
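As a sanity check on the file layout, the same semicolon-delimited format can be parsed with Python's standard csv module. The two sample rows below are made up to mimic the shape of the harvested file, including the unnamed leading index column:

```python
# Cross-check of the semicolon-delimited layout with the stdlib csv module.
# The sample rows are invented to mimic the harvested file's shape.
import csv
import io

raw = (";id;created_at;user_id;user_name;tweet_text\n"
       "0;638830426971181057;Tue Sep 01 21:46:00;3276255125;True Equality;ernestsgantt: Bey...\n"
       "1;638830426727911424;Tue Sep 01 21:46:01;3276255125;True Equality;ernestsgantt: Bey...\n")

reader = csv.reader(io.StringIO(raw), delimiter=';')
header = next(reader)   # first column name is empty: the unnamed index
rows = list(reader)

print(header)
print(len(rows))
```

This confirms why spdf_in.columns starts with an empty string: the CSV carries an unnamed index column, which spark-csv loads as a column with an empty name.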
Querying MongoDB from Spark SQL
There are two major ways to interact with MongoDB from Spark: the first is through the Hadoop MongoDB connector, and the second is directly from Spark to MongoDB.
The first approach to interact with MongoDB from Spark is to set up a Hadoop environment and query through the Hadoop MongoDB connector. The connector details are hosted on GitHub at https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage. An actual use case is described in the series of blog posts from MongoDB:
Using MongoDB with Hadoop & Spark: Part 1 - Introduction & Setup (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup)
Using MongoDB with Hadoop and Spark: Part 2 - Hive Example (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example)
Using MongoDB with Hadoop & Spark: Part 3 - Spark Example & Key Takeaways (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways)
Setting up a full Hadoop environment is a bit elaborate. We will favor the second approach. We will use the spark-mongodb connector developed and maintained by Stratio, hosted at spark-packages.org. The package information and version can be found at spark-packages.org:
Note
Releases
Version: 0.10.1 ( 8263c8 | zip | jar ) / Date: 2015-11-18 / License: Apache-2.0 / Scala version: 2.10
(http://spark-packages.org/package/Stratio/spark-mongodb)
The command to launch PySpark with the IPython Notebook and the spark-mongodb package should explicitly state the --packages argument:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.1
This will trigger the following output; we can see that the spark-mongodb package is installed with all its dependencies:
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.1
... (snip) ...
Ivy Default Cache set to: /home/an/.ivy2/cache
The jars for the packages stored in: /home/an/.ivy2/jars
:: loading settings :: url = jar:file:/home/an/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.stratio.datasource#spark-mongodb_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.stratio.datasource#spark-mongodb_2.10;0.10.1 in central
[W 22:10:50.910 NotebookApp] Timeout waiting for kernel_info reply from 764081d3-baf9-4978-ad89-7735e6323cb6
        found org.mongodb#casbah-commons_2.10;2.8.0 in central
        found com.github.nscala-time#nscala-time_2.10;1.0.0 in central
        found joda-time#joda-time;2.3 in central
        found org.joda#joda-convert;1.2 in central
        found org.slf4j#slf4j-api;1.6.0 in central
        found org.mongodb#mongo-java-driver;2.13.0 in central
        found org.mongodb#casbah-query_2.10;2.8.0 in central
        found org.mongodb#casbah-core_2.10;2.8.0 in central
downloading https://repo1.maven.org/maven2/com/stratio/datasource/spark-mongodb_2.10/0.10.1/spark-mongodb_2.10-0.10.1.jar ...
        [SUCCESSFUL ] com.stratio.datasource#spark-mongodb_2.10;0.10.1!spark-mongodb_2.10.jar (3130ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-commons_2.10/2.8.0/casbah-commons_2.10-2.8.0.jar ...
        [SUCCESSFUL ] org.mongodb#casbah-commons_2.10;2.8.0!casbah-commons_2.10.jar (2812ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-query_2.10/2.8.0/casbah-query_2.10-2.8.0.jar ...
        [SUCCESSFUL ] org.mongodb#casbah-query_2.10;2.8.0!casbah-query_2.10.jar (1432ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-core_2.10/2.8.0/casbah-core_2.10-2.8.0.jar ...
        [SUCCESSFUL ] org.mongodb#casbah-core_2.10;2.8.0!casbah-core_2.10.jar (2785ms)
downloading https://repo1.maven.org/maven2/com/github/nscala-time/nscala-time_2.10/1.0.0/nscala-time_2.10-1.0.0.jar ...
        [SUCCESSFUL ] com.github.nscala-time#nscala-time_2.10;1.0.0!nscala-time_2.10.jar (2725ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.0/slf4j-api-1.6.0.jar ...
        [SUCCESSFUL ] org.slf4j#slf4j-api;1.6.0!slf4j-api.jar (371ms)
downloading https://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar ...
        [SUCCESSFUL ] org.mongodb#mongo-java-driver;2.13.0!mongo-java-driver.jar (5259ms)
downloading https://repo1.maven.org/maven2/joda-time/joda-time/2.3/joda-time-2.3.jar ...
        [SUCCESSFUL ] joda-time#joda-time;2.3!joda-time.jar (6949ms)
downloading https://repo1.maven.org/maven2/org/joda/joda-convert/1.2/joda-convert-1.2.jar ...
        [SUCCESSFUL ] org.joda#joda-convert;1.2!joda-convert.jar (548ms)
:: resolution report :: resolve 11850ms :: artifacts dl 26075ms
        :: modules in use:
        com.github.nscala-time#nscala-time_2.10;1.0.0 from central in [default]
        com.stratio.datasource#spark-mongodb_2.10;0.10.1 from central in [default]
        joda-time#joda-time;2.3 from central in [default]
        org.joda#joda-convert;1.2 from central in [default]
        org.mongodb#casbah-commons_2.10;2.8.0 from central in [default]
        org.mongodb#casbah-core_2.10;2.8.0 from central in [default]
        org.mongodb#casbah-query_2.10;2.8.0 from central in [default]
        org.mongodb#mongo-java-driver;2.13.0 from central in [default]
        org.slf4j#slf4j-api;1.6.0 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   9   |   9   |   9   |   0   ||   9   |   9   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        9 artifacts copied, 0 already retrieved (2335kB/51ms)
... (snip) ...
We are now ready to query MongoDB on localhost:27017 from the collection twtr01_coll in the database twtr01_db.
We first import the SQLContext and instantiate it:
In [5]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.sql("CREATE TEMPORARY TABLE tweet_table USING com.stratio.datasource.mongodb OPTIONS (host 'localhost:27017', database 'twtr01_db', collection 'twtr01_coll')")
sqlContext.sql("SELECT * FROM tweet_table where id=598830778269769728").collect()
Here's the output of our query:
Out[5]:
[Row(text=u'@spark_io is now @particle - awesome news - now I can enjoy my Particle Cores/Photons + @sparkfun sensors + @ApacheSpark analytics :-)', _id=u'55aa640fd770871cba74cb88', contributors=None, retweeted=False, user=Row(contributors_enabled=False, created_at=u'Mon Aug 25 14:01:26 +0000 2008', default_profile=True, default_profile_image=False, description=u'Building open source tools for and teaching enterprise software developers', entities=Row(description=Row(urls=[]), url=Row(urls=[Row(url=u'http://t.co/TSHp13EWeu', indices=[0, 22],
... (snip) ...
9], name=u'Spark is Particle', screen_name=u'spark_io'), Row(id=487010011, id_str=u'487010011', indices=[17, 26], name=u'Particle', screen_name=u'particle'), Row(id=17877351, id_str=u'17877351', indices=[88, 97], name=u'SparkFun Electronics', screen_name=u'sparkfun'), Row(id=1551361069, id_str=u'1551361069', indices=[108, 120], name=u'Apache Spark', screen_name=u'ApacheSpark')]), is_quote_status=None, lang=u'en', quoted_status_id_str=None, quoted_status_id=None, created_at=u'Thu May 14 12:42:37 +0000 2015', retweeted_status=None, truncated=False, place=None, id=598830778269769728, in_reply_to_user_id=3187046084, retweet_count=0, in_reply_to_status_id=None, in_reply_to_screen_name=u'spark_io', in_reply_to_user_id_str=u'3187046084', source=u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', id_str=u'598830778269769728', coordinates=None, metadata=Row(iso_language_code=u'en', result_type=u'recent'), quoted_status=None)]
Summary
In this chapter, we harvested data from Twitter. Once the data was acquired, we explored the information using Continuum.io's Blaze and Odo libraries. Spark SQL is an important module for interactive data exploration, analysis, and transformation, leveraging the Spark dataframe data structure. The dataframe concept originates from R and was then adopted by Python Pandas with great success. The dataframe is the workhorse of the data scientist. The combination of Spark SQL and dataframes creates a powerful engine for data processing.
We are now gearing up for extracting the insights from the datasets using machine learning from Spark MLlib.
Chapter 4. Learning from Data Using Spark
As we have laid the foundation for data to be harvested in the previous chapter, we are now ready to learn from the data. Machine learning is about drawing insights from data. Our objective is to give an overview of Spark MLlib (short for Machine Learning library) and apply the appropriate algorithms to our dataset in order to derive insights. From the Twitter dataset, we will apply an unsupervised clustering algorithm in order to distinguish Apache Spark-relevant tweets from the rest. We have as initial input a mixed bag of tweets. We first need to preprocess the data in order to extract the relevant features, then apply the machine learning algorithm to our dataset, and finally evaluate the results and the performance of our model.
In this chapter, we will cover the following points:
Providing an overview of the Spark MLlib module with its algorithms and the typical machine learning workflow
Preprocessing the Twitter harvested dataset to extract the relevant features, applying an unsupervised clustering algorithm to identify Apache Spark-relevant tweets, and then evaluating the model and the results obtained
Describing the Spark machine learning pipeline
Contextualizing Spark MLlib in the app architecture
Let's first contextualize the focus of this chapter on the data-intensive app architecture. We will concentrate our attention on the analytics layer and more precisely machine learning. This will serve as a foundation for streaming apps, as we want to apply the learning from the batch processing of data as inference rules for the streaming analysis.
The following diagram sets the context of the chapter's focus, highlighting the machine learning module within the analytics layer, while using tools for exploratory data analysis, Spark SQL, and Pandas.
Classifying Spark MLlib algorithms
Spark MLlib is a rapidly evolving module of Spark, with new algorithms added with each release of Spark.
The following diagram provides a high-level overview of Spark MLlib algorithms grouped by the traditional broad machine learning techniques and by the categorical or continuous nature of the data:
We categorize the Spark MLlib algorithms in two columns, categorical or continuous, depending on the type of data. We distinguish between data that is categorical or more qualitative in nature versus continuous data, which is quantitative in nature. An example of qualitative data is predicting the weather; given the atmospheric pressure, the temperature, and the presence and type of clouds, the weather will be sunny, dry, rainy, or overcast. These are discrete values. On the other hand, let's say we want to predict house prices, given the location, square meterage, and the number of beds; the real estate value can be predicted using linear regression. In this case, we are talking about continuous or quantitative values.
The horizontal grouping reflects the type of machine learning method used. Unsupervised versus supervised machine learning techniques depend on whether the training data is labeled. In an unsupervised learning challenge, no labels are given to the learning algorithm. The goal is to find the hidden structure in its input. In the case of supervised learning, the data is labeled. The focus is on making predictions using regression if the data is continuous, or classification if the data is categorical.
An important category of machine learning is recommender systems, which leverage collaborative filtering techniques. The Amazon web store and Netflix have very powerful recommender systems powering their recommendations.
Stochastic Gradient Descent is one of the machine learning optimization techniques that is well suited for Spark's distributed computation.
For processing large amounts of text, Spark offers crucial libraries for feature extraction and transformation such as TF-IDF (short for Term Frequency-Inverse Document Frequency), Word2Vec, standard scaler, and normalizer.
Supervised and unsupervised learning
We delve more deeply here into the traditional machine learning algorithms offered by Spark MLlib. We distinguish between supervised and unsupervised learning depending on whether the data is labeled. We distinguish between categorical and continuous depending on whether the data is discrete or continuous.
The following diagram explains the Spark MLlib supervised and unsupervised machine learning algorithms and preprocessing techniques:
The following supervised and unsupervised MLlib algorithms and preprocessing techniques are currently available in Spark:
Clustering: This is an unsupervised machine learning technique where the data is not labeled. The aim is to extract structure from the data:
K-Means: This partitions the data into K distinct clusters
Gaussian Mixture: Clusters are assigned based on the maximum posterior probability of the component
Power Iteration Clustering (PIC): This groups vertices of a graph based on pairwise edge similarities
Latent Dirichlet Allocation (LDA): This is used to group collections of text documents into topics
Streaming K-Means: This dynamically clusters streaming data using a windowing function on the incoming data
Dimensionality Reduction: This aims to reduce the number of features under consideration. Essentially, this reduces noise in the data and focuses on the key features:
Singular Value Decomposition (SVD): This breaks the matrix that contains the data into simpler meaningful pieces. It factorizes the initial matrix into three matrices.
Principal Component Analysis (PCA): This approximates a high-dimensional dataset with a low-dimensional subspace.
Regression and Classification: Regression predicts output values using labeled training data, while Classification groups the results into classes. Classification has dependent variables that are categorical or unordered, whilst Regression has dependent variables that are continuous and ordered:
Linear Regression Models (linear regression, logistic regression, and support vector machines): Linear regression algorithms can be expressed as convex optimization problems that aim to minimize an objective function based on a vector of weight variables. The objective function controls the complexity of the model through the regularized part of the function and the error of the model through the loss part of the function.
Naive Bayes: This makes predictions based on the conditional probability distribution of a label given an observation. It assumes that features are mutually independent of each other.
Decision Trees: This performs recursive binary partitioning of the feature space. The information gain at the tree node level is maximized in order to determine the best split for the partition.
Ensembles of trees (Random Forests and Gradient-Boosted Trees): Tree ensemble algorithms combine base decision tree models in order to build a performant model. They are intuitive and very successful for classification and regression tasks.
Isotonic Regression: This minimizes the mean squared error between given data and observed responses.
Additional learning algorithms
Spark MLlib offers more algorithms than the supervised and unsupervised learning ones. Broadly, there are three more additional types of machine learning methods: recommender systems, optimization algorithms, and feature extraction.
The following additional MLlib algorithms are currently available in Spark:
Collaborative filtering: This is the basis for recommender systems. It creates a user-item association matrix and aims to fill the gaps. Based on other users and items, along with their ratings, it recommends an item that the target user has no ratings for. In distributed computing, one of the most successful algorithms is ALS (short for Alternating Least Squares):
Alternating Least Squares: This matrix factorization technique incorporates implicit feedback, temporal effects, and confidence levels. It decomposes the large user-item matrix into lower-dimensional user and item factors. It minimizes a quadratic loss function by alternately fixing one of its factors.
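The alternating update can be sketched for a rank-1 factorization in plain Python. This is an illustration of the idea rather than MLlib's implementation; the 2x2 ratings matrix below is a toy example chosen to be exactly rank 1:

```python
# Toy rank-1 Alternating Least Squares in plain Python.
# We factor R ~ u * v^T by alternately solving for u with v fixed,
# then for v with u fixed; each step is a closed-form least-squares solve.
R = [[4.0, 2.0],
     [2.0, 1.0]]          # toy user-item ratings matrix (rank 1 by design)
u = [1.0, 1.0]            # user factors
v = [1.0, 1.0]            # item factors

for _ in range(10):
    # Fix v, solve for each user factor u[i].
    u = [sum(R[i][j] * v[j] for j in range(2)) / sum(x * x for x in v)
         for i in range(2)]
    # Fix u, solve for each item factor v[j].
    v = [sum(R[i][j] * u[i] for i in range(2)) / sum(x * x for x in u)
         for j in range(2)]

# Squared reconstruction error; close to 0 for this rank-1 matrix.
error = sum((R[i][j] - u[i] * v[j]) ** 2 for i in range(2) for j in range(2))
print(error)
```

Real recommender data is a very large, mostly empty matrix; ALS in MLlib distributes these alternating solves across the cluster and only sums over the observed ratings.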
Feature extraction and transformation: These are essential techniques for large text document processing. They include the following techniques:
Term Frequency: Search engines use TF-IDF to score and rank document relevance in a vast corpus. It is also used in machine learning to determine the importance of a word in a document or corpus. Term frequency statistically determines the weight of a term relative to its frequency in the corpus. Term frequency on its own can be misleading as it overemphasizes words such as the, of, or and that give little information. Inverse Document Frequency provides the specificity or the measure of the amount of information, whether the term is rare or common across all the documents in the corpus.
Word2Vec: This includes two models, Skip-Gram and Continuous Bag of Words. Skip-Gram predicts neighboring words given a word, based on sliding windows of words, while Continuous Bag of Words predicts the current word given the neighboring words.
StandardScaler: As part of preprocessing, the dataset must often be standardized by mean removal and variance scaling. We compute the mean and standard deviation on the training data and apply the same transformation to the test data.
Normalizer: We scale the samples to have unit norm. It is useful for quadratic forms such as the dot product or kernel methods.
Feature selection: This reduces the dimensionality of the vector space by selecting the most relevant features for the model.
Chi-Square Selector: This is a statistical method to measure the independence of two events.
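The TF-IDF weighting just described can be sketched in a few lines of plain Python; the three-document corpus below is a made-up toy, and the formulas are the textbook definitions rather than MLlib's exact variant:

```python
# Minimal TF-IDF in plain Python, showing why raw term frequency alone
# overweights ubiquitous words like 'the'. Toy three-document corpus.
import math

corpus = [
    ['the', 'spark', 'streaming', 'app'],
    ['the', 'spark', 'mllib', 'model'],
    ['the', 'fashion', 'show'],
]

def tf(term, doc):
    """Term frequency: share of the document the term accounts for."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of corpus size over doc count."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

doc = corpus[0]
print(tf_idf('the', doc, corpus))   # 'the' appears everywhere: weight 0
print(tf_idf('streaming', doc, corpus))
print(tf_idf('spark', doc, corpus))
```

Here 'the' gets zero weight because it occurs in every document, while 'streaming' (unique to one document) outweighs 'spark' (shared by two). MLlib's HashingTF/IDF pair applies the same idea at scale, hashing terms into a fixed-size feature vector.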
Optimization: These specific Spark MLlib optimization algorithms focus on various techniques of gradient descent. Spark provides a very efficient implementation of gradient descent on a distributed cluster of machines. It looks for the local minima by iteratively going down the steepest descent. It is compute-intensive as it iterates through all the data available:
Stochastic Gradient Descent: We minimize an objective function that is the sum of differentiable functions. Stochastic Gradient Descent uses only a sample of the training data in order to update a parameter in a particular iteration. It is used for large-scale and sparse machine learning problems such as text classification.
Limited-memory BFGS (L-BFGS): As the name says, L-BFGS uses limited memory and suits the distributed optimization algorithm implementation of Spark MLlib.
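The stochastic update rule can be sketched in plain Python on a toy one-parameter regression. This illustrates the per-sample update idea only, not MLlib's distributed implementation; the data and learning rate are invented for the example:

```python
# Stochastic Gradient Descent sketch: fit y = w * x one sample at a time.
# Toy data generated from the true weight w = 3.
import random

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w = 0.0        # initial weight
lr = 0.05      # learning rate

random.seed(42)
for step in range(200):
    x, y = random.choice(data)      # a single sampled observation
    grad = 2 * (w * x - y) * x      # gradient of the squared error on it
    w -= lr * grad                  # stochastic update

print(w)  # close to 3.0
```

Each update touches one observation instead of the whole dataset, which is what makes the method attractive for large, sparse problems.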
Spark MLlib data types
MLlib supports four essential data types: local vector, labeled point, local matrix, and distributed matrix. These data types are widely used in Spark MLlib algorithms:
Local vector: This resides in a single machine. It can be dense or sparse:
A dense vector is a traditional array of doubles. An example of a dense vector is [5.0, 0.0, 1.0, 7.0].
A sparse vector uses integer indices and double values. The sparse representation of the vector [5.0, 0.0, 1.0, 7.0] would be (4, [0, 2, 3], [5.0, 1.0, 7.0]), where 4 represents the dimension of the vector.
Here's an example of a local vector in PySpark:
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# NumPy array for dense vector.
dvect1 = np.array([5.0, 0.0, 1.0, 7.0])
# Python list for dense vector.
dvect2 = [5.0, 0.0, 1.0, 7.0]
# SparseVector creation
svect1 = Vectors.sparse(4, [0, 2, 3], [5.0, 1.0, 7.0])
# Sparse vector using a single-column SciPy csc_matrix
svect2 = sps.csc_matrix((np.array([5.0, 1.0, 7.0]), np.array([0, 2, 3]), np.array([0, 3])), shape=(4, 1))
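The sparse form (size, indices, values) carries exactly the same information as the dense array. A plain-Python round trip, independent of Spark, makes this explicit:

```python
# Round trip between dense and sparse vector representations.
def to_sparse(dense):
    """Keep only the non-zero entries: (size, indices, values)."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

def to_dense(size, indices, values):
    """Rebuild the full array, filling unlisted positions with 0.0."""
    dense = [0.0] * size
    for i, x in zip(indices, values):
        dense[i] = x
    return dense

dense = [5.0, 0.0, 1.0, 7.0]
size, indices, values = to_sparse(dense)
print((size, indices, values))   # (4, [0, 2, 3], [5.0, 1.0, 7.0])
print(to_dense(size, indices, values) == dense)
```

For feature vectors with thousands of dimensions and only a handful of non-zero entries, as with TF-IDF features, the sparse form saves a great deal of memory.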
Labeled point: A labeled point is a dense or sparse vector with a label, used in supervised learning. In the case of binary labels, 0.0 represents the negative label whilst 1.0 represents the positive value.
Here's an example of a labeled point in PySpark:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Labeled point with a positive label and a dense feature vector.
lp_pos = LabeledPoint(1.0, [5.0, 0.0, 1.0, 7.0])
# Labeled point with a negative label and a sparse feature vector.
lp_neg = LabeledPoint(0.0, SparseVector(4, [0, 2, 3], [5.0, 1.0, 7.0]))
Local matrix: A local matrix resides in a single machine, with integer-typed indices and values of type double.
Here's an example of a local matrix in PySpark:
from pyspark.mllib.linalg import Matrix, Matrices

# Dense matrix ((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
dMatrix = Matrices.dense(2, 3, [1, 2, 3, 4, 5, 6])
# Sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sMatrix = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
Distributed matrix: Leveraging the distributed nature of the RDD, distributed matrices can be shared in a cluster of machines. We distinguish four distributed matrix types: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix:
RowMatrix: This takes an RDD of vectors and creates a distributed matrix of rows with meaningless indices, called RowMatrix, from the RDD of vectors.
IndexedRowMatrix: In this case, the row indices are meaningful. First, we create an RDD of indexed rows using the class IndexedRow, and then we create an IndexedRowMatrix.
CoordinateMatrix: This is useful to represent very large and very sparse matrices. A CoordinateMatrix is created from RDDs of MatrixEntry points, each represented by a tuple of type (long, long, float).
BlockMatrix: These are created from RDDs of sub-matrix blocks, where a sub-matrix block is ((blockRowIndex, blockColIndex), sub-matrix).
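The CoordinateMatrix layout can be illustrated in plain Python: entries are (row, col, value) tuples, and materializing a small toy matrix shows why the format suits very sparse data. The entries below reproduce the sparse local matrix example given earlier:

```python
# A CoordinateMatrix stores only the non-zero (row, col, value) entries.
# Toy 3x2 sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)).
entries = [(0, 0, 9.0), (1, 1, 8.0), (2, 1, 6.0)]

n_rows, n_cols = 3, 2
dense = [[0.0] * n_cols for _ in range(n_rows)]
for row, col, value in entries:
    dense[row][col] = value

print(dense)  # [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

In Spark, those entries would be an RDD of MatrixEntry objects spread across the cluster; only the non-zero cells are ever stored or shuffled.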
Machine learning workflows and dataflows
Beyond algorithms, machine learning is also about processes. We will discuss the typical workflows and dataflows of supervised and unsupervised machine learning.
Supervised machine learning workflows
In supervised machine learning, the input training dataset is labeled. One of the key data practices is to split the input data into training and test sets, and validate the model accordingly.
We typically go through a six-step process flow in supervised learning:
1. Collect the data: This step essentially ties in with the previous chapter and ensures we collect the right data, with the right volume and granularity, in order to enable the machine learning algorithm to provide reliable answers.
2. Preprocess the data: This step is about checking the data quality by sampling, filling in the missing values if any, and scaling and normalizing the data. We also define the feature extraction process. Typically, in the case of large text-based datasets, we apply tokenization, stopwords removal, stemming, and TF-IDF. In the case of supervised learning, we separate the input data into a training and test set. We can also implement various strategies of sampling and splitting the dataset for cross-validation purposes.
3. Ready the data: In this step, we get the data in the format or data type expected by the algorithms. In the case of Spark MLlib, this includes local vectors, dense or sparse vectors, labeled points, local matrices, and distributed matrices with row matrix, indexed row matrix, coordinate matrix, and block matrix.
4. Model: In this step, we apply the algorithms that are suitable for the problem at hand and get the results for evaluation of the most suitable algorithm in the evaluate step. We might have multiple algorithms suitable for the problem; their respective performance will be scored in the evaluate step to select the best performing ones. We can implement an ensemble or combination of models in order to reach the best results.
5. Optimize: We may need to run a grid search for the optimal parameters of certain algorithms. These parameters are determined during training, and fine-tuned during the testing and production phases.
6. Evaluate: We ultimately score the models and select the best one in terms of accuracy, performance, reliability, and scalability. We move the best performing model to test with the held-out test data in order to ascertain the prediction accuracy of our model. Once satisfied with the fine-tuned model, we move it to production to process live data.
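The train/test split mentioned in the preprocessing step can be sketched in plain Python: shuffle with a fixed seed, then hold out a fraction for testing. The 80/20 ratio and stand-in records are illustrative choices:

```python
# Reproducible 80/20 train/test split in plain Python.
import random

samples = list(range(100))        # stand-ins for labeled records
random.seed(7)                    # fixed seed for reproducibility
random.shuffle(samples)

split = int(len(samples) * 0.8)
train, test = samples[:split], samples[split:]

print(len(train), len(test))      # 80 20
print(set(train) & set(test))     # empty: no leakage between the sets
```

The key property is that the test set stays untouched until the evaluate step, so the reported accuracy reflects performance on data the model has never seen.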
The supervised machine learning workflow and dataflow are represented in the following diagram:
Unsupervised machine learning workflows
As opposed to supervised learning, our initial data is not labeled in the case of unsupervised learning, which is most often the case in real life. We will extract the structure from the data by using clustering or dimensionality reduction algorithms. In the unsupervised learning case, we do not split the data into training and test sets, as we cannot make any predictions because the data is not labeled. We will train the data along six steps similar to those in supervised learning. Once the model is trained, we will evaluate the results, fine-tune the model, and then release it for production.
Unsupervised learning can be a preliminary step to supervised learning. Namely, we look at reducing the dimensionality of the data prior to attacking the learning phase.
The unsupervised machine learning workflows and dataflow are represented as follows:
ClusteringtheTwitterdatasetLet’sfirstgetafeelforthedataextractedfromTwitterandgetanunderstandingofthedatastructureinordertoprepareandrunitthroughtheK-Meansclusteringalgorithms.Ourplanofattackusestheprocessanddataflowdepictedearlierforunsupervisedlearning.Thestepsareasfollows:
1. Combine all tweet files into a single dataframe.
2. Parse the tweets, remove stop words, extract emoticons, extract URLs, and finally normalize the words (for example, mapping them to lowercase and removing punctuation and numbers).
3. Feature extraction includes the following:
   Tokenization: This breaks down the parsed tweet text into individual words or tokens
   TF-IDF: This applies the TF-IDF algorithm to create feature vectors from the tokenized tweet texts
   Hash TF-IDF: This applies a hashing function to the token vectors
4. Run the K-Means clustering algorithm.
5. Evaluate the results of the K-Means clustering:
   Identify tweet membership to clusters
   Perform dimensionality reduction to two dimensions with the Multi-Dimensional Scaling or the Principal Component Analysis algorithm
   Plot the clusters
6. Pipeline:
   Fine-tune the number of relevant clusters K
   Measure the model cost
   Select the optimal model
Applying Scikit-Learn on the Twitter dataset

Python's own Scikit-Learn machine learning library is one of the most reliable, intuitive, and robust tools around. Let's run through preprocessing and unsupervised learning using Pandas and Scikit-Learn. It is often beneficial to explore a sample of the data using Scikit-Learn before spinning off clusters with Spark MLlib.
We have a mixed bag of 7,540 tweets. It contains tweets related to Apache Spark, Python, the upcoming presidential election with Hillary Clinton and Donald Trump as protagonists, and some tweets related to fashion and music with Lady Gaga and Justin Bieber. We are running the K-Means clustering algorithm using Python Scikit-Learn on the harvested Twitter dataset. We first load the sample data into a Pandas dataframe:
import pandas as pd

csv_in = 'C:\\Users\\Amit\\Documents\\IPython Notebooks\\AN00_Data\\unq_tweetstxt.csv'
twts_df01 = pd.read_csv(csv_in, sep=';', encoding='utf-8')

In [24]:
twts_df01.count()
Out[24]:
Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64
#
# Introspecting the tweets' text
#
In [82]:
twtstxt_ls01[6910:6920]
Out[82]:
['RT @deroach_Ismoke: I am NOT voting for #hilaryclinton
http://t.co/jaZZpcHkkJ',
 'RT @AnimalRightsJen: #HilaryClinton What do Bernie Sanders and Donald
Trump Have in Common?: He has so far been th… http://t.co/t2YRcGCh6…',
 'I understand why Bill was out banging other chicks….....I mean look at
what he is married to…..\n@HilaryClinton',
 '#HilaryClinton What do Bernie Sanders and Donald Trump Have in Common?:
He has so far been th… http://t.co/t2YRcGCh67 #Tcot #UniteBlue']
We first perform a feature extraction from the tweets' text. We apply a sparse vectorizer to the dataset using a TF-IDF vectorizer with 10,000 features and English stop words:
In [37]:
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()

Extracting features from the training dataset using a sparse vectorizer

In [38]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words='english',
                             use_idf=True)
X = vectorizer.fit_transform(twtstxt_ls01)
#
# Output of the TF-IDF feature vectorizer
#
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

done in 5.232165s
n_samples: 7540, n_features: 6638
As the dataset is now broken into 7,540 samples with vectors of 6,638 features, we are ready to feed this sparse matrix to the K-Means clustering algorithm. We will choose seven clusters and 100 maximum iterations initially:
In [47]:
km = KMeans(n_clusters=7, init='k-means++', max_iter=100, n_init=1,
            verbose=1)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))

Clustering sparse data with KMeans(copy_x=True, init='k-means++',
    max_iter=100, n_clusters=7, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=1)
Initialization complete
Iteration  0, inertia 13635.141
Iteration  1, inertia 6943.485
Iteration  2, inertia 6924.093
Iteration  3, inertia 6915.004
Iteration  4, inertia 6909.212
Iteration  5, inertia 6903.848
Iteration  6, inertia 6888.606
Iteration  7, inertia 6863.226
Iteration  8, inertia 6860.026
Iteration  9, inertia 6859.338
Iteration 10, inertia 6859.213
Iteration 11, inertia 6859.102
Iteration 12, inertia 6859.080
Iteration 13, inertia 6859.060
Iteration 14, inertia 6859.047
Iteration 15, inertia 6859.039
Iteration 16, inertia 6859.032
Iteration 17, inertia 6859.031
Iteration 18, inertia 6859.029
Converged at iteration 18
done in 1.701s
The K-Means clustering algorithm converged after 18 iterations. We see in the following results the seven clusters with their respective keywords. Clusters 0 and 6 are about music and fashion with Justin Bieber- and Lady Gaga-related tweets. Clusters 1 and 5 are related to the U.S. presidential elections with Donald Trump- and Hillary Clinton-related tweets. Clusters 2 and 3 are the ones of interest to us as they are about Apache Spark and Python. Cluster 4 contains Thailand-related tweets:
#
# Introspect top terms per cluster
#
In [49]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(7):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind], end='')
    print()

Top terms per cluster:
Cluster 0: justinbieber love mean rt follow thank hi https whatdoyoumean video wanna hear whatdoyoumean viral rorykramer happy lol making person dream justin
Cluster 1: donaldtrump hilaryclinton rt https trump2016 realdonaldtrump trump gop amp justinbieber president clinton emails oy8ltkstze tcot like berniesanders hilary people email
Cluster 2: bigdata apachespark hadoop analytics rt spark training chennai ibm datascience apache processing cloudera mapreduce data sap https vora transforming development
Cluster 3: apachespark python https rt spark data amp databricks using new learn hadoop ibm big apache continuumio bluemix learning join open
Cluster 4: ernestsgantt simbata3 jdhm2015 elsahel12 phuketdailynews dreamintentions beyhiveinfrance almtorta18 civipartnership 9_a_6 25whu72ep0 k7erhvu7wn fdmxxxcm3h osxuh2fxnt 5o5rmb0xhp jnbgkqn0dj ovap57ujdh dtzsz3lb6x sunnysai12345 sdcvulih6g
Cluster 5: trump donald donaldtrump starbucks trumpquote trumpforpresident oy8ltkstze https zfns7pxysx sillygoystump trump2016 news jeremy coffee corbyn ok7vc8aetz rt tonight
Cluster 6: ladygaga gaga lady rt https love follow horror cd story ahshotel american japan hotel humantrafficking music fashion diet queen ahs
We will visualize the results by plotting the clusters. We have 7,540 samples with 6,638 features. It will be impossible to visualize that many dimensions. We will use the Multi-Dimensional Scaling (MDS) algorithm to bring down the multidimensional features of the clusters into two tractable dimensions to be able to picture them:
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.manifold import MDS

MDS()
#
# Bring down the MDS to two dimensions (components) as we will plot
# the clusters
#
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]
In [67]:
#
# Set up colors per cluster using a dict
#
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a',
                  4: '#66a61e', 5: '#9990b3', 6: '#e8888a'}
#
# Set up cluster names using a dict
#
cluster_names = {0: 'Music, Pop',
                 1: 'USA Politics, Election',
                 2: 'Big Data, Spark',
                 3: 'Spark, Python',
                 4: 'Thailand',
                 5: 'USA Politics, Election',
                 6: 'Music, Pop'}

In [115]:
#
# ipython magic to show the matplotlib plots inline
#
%matplotlib inline
#
# Create dataframe which includes MDS results, cluster numbers and tweet
# texts to be displayed
#
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, txt=twtstxt_ls02_utf8))
ix_start = 2000
ix_stop = 2050
df01 = df[ix_start:ix_stop]

print(df01[['label', 'txt']])
print(len(df01))
print()

# Group by cluster
groups = df.groupby('label')
groups01 = df01.groupby('label')

# Set up the plot
fig, ax = plt.subplots(figsize=(17, 10))
ax.margins(0.05)
#
# Build the plot object
#
for name, group in groups01:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_names[name], color=cluster_colors[name],
            mec='none')
    ax.set_aspect('auto')
    ax.tick_params(
        axis='x',        # settings for x-axis
        which='both',
        bottom='off',
        top='off',
        labelbottom='off')
    ax.tick_params(
        axis='y',        # settings for y-axis
        which='both',
        left='off',
        top='off',
        labelleft='off')

ax.legend(numpoints=1)
#
# Add label in x, y position with tweet text
#
for i in range(ix_start, ix_stop):
    ax.text(df01.ix[i]['x'], df01.ix[i]['y'], df01.ix[i]['txt'], size=10)

plt.show()  # Display the plot
      label                     txt
2000      2  b'RT @BigDataTechCon: '
2001      3  b"@4Quant's presentat"
2002      2  b'Cassandra Summit 201'
Here's a plot of Cluster 2, Big Data and Spark, represented by blue dots, along with Cluster 3, Spark and Python, represented by red dots, and some sample tweets related to the respective clusters:
We have gained some good insights into the data with the exploration and processing done with Scikit-Learn. We will now focus our attention on Spark MLlib and take it for a ride on the Twitter dataset.
Preprocessing the dataset

Now, we will focus on feature extraction and engineering in order to ready the data for the clustering algorithm run. We instantiate the SparkContext and read the Twitter dataset into a Spark dataframe. We will then successively tokenize the tweet text data, apply a hashing term frequency algorithm to the tokens, and finally apply the Inverse Document Frequency algorithm and rescale the data. The code is as follows:
In [3]:
#
# Read csv in a Pandas DF
#
import pandas as pd

csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';',
                      encoding='utf-8')

In [4]:
sqlContext = SQLContext(sc)

In [5]:
#
# Convert a Pandas DF to a Spark DF
#
spdf_02 = sqlContext.createDataFrame(pddf_in[['id', 'user_id', 'user_name',
                                              'tweet_text']])

In [8]:
spdf_02.show()

In [7]:
spdf_02.take(3)
Out[7]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0'),
 Row(id=638830426727911424, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: PhuketDailyNews:
dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: CiviPa\u2026
http://t.co/VpD7FoqMr0'),
 Row(id=638830425402556417, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsgantt: elsahel12:
simbata3: JDHM2015: almtorta18: CiviPartnership: dr\u2026
http://t.co/EMDOn8chPK')]
In [9]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [10]:
#
# Tokenize the tweet_text
#
tokenizer = Tokenizer(inputCol="tweet_text", outputCol="tokens")
tokensData = tokenizer.transform(spdf_02)

In [11]:
tokensData.take(1)
Out[11]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:',
u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'])]
In [14]:
#
# Apply Hashing TF to the tokens
#
hashingTF = HashingTF(inputCol="tokens", outputCol="rawFeatures",
                      numFeatures=2000)
featuresData = hashingTF.transform(tokensData)

In [15]:
featuresData.take(1)
Out[15]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:',
u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'],
rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185:
1.0, 742: 1.0, 856: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}))]
In [16]:
#
# Apply IDF to the raw features and rescale the data
#
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featuresData)
rescaledData = idfModel.transform(featuresData)

for features in rescaledData.select("features").take(3):
    print(features)

In [17]:
rescaledData.take(2)
Out[17]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions:
elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:',
u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'],
rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185:
1.0, 742: 1.0, 856: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}),
features=SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160:
2.9985, 185: 2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694,
1620: 3.073})),
 Row(id=638830426727911424, user_id=3276255125, user_name=u'True Equality',
tweet_text=u'ernestsgantt: BeyHiveInFrance: PhuketDailyNews:
dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: CiviPa\u2026
http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:',
u'phuketdailynews:', u'dreamintentions:', u'elsahel12:', u'simbata3:',
u'jdhm2015:', u'almtorta18:', u'civipa\u2026', u'http://t.co/vpd7foqmr0'],
rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185:
1.0, 460: 1.0, 987: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}),
features=SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160:
2.9985, 185: 2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694,
1620: 3.073}))]

In [21]:
rs_pddf = rescaledData.toPandas()

In [22]:
rs_pddf.count()
Out[22]:
id             7540
user_id        7540
user_name      7540
tweet_text     7540
tokens         7540
rawFeatures    7540
features       7540
dtype: int64

In [27]:
feat_lst = rs_pddf.features.tolist()

In [28]:
feat_lst[:2]
Out[28]:
[SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694, 1620: 3.073})]
Running the clustering algorithm

We will use the K-Means algorithm against the Twitter dataset. As an unlabeled and shuffled bag of tweets, we want to see if the Apache Spark tweets are grouped in a single cluster. From the previous steps, the TF-IDF sparse vector of features is converted into an RDD that will be the input to the Spark MLlib program. We initialize the K-Means model with 5 clusters, 10 iterations, and 10 runs:
In [32]:
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

In [34]:
# Load and parse the data
in_Data = sc.parallelize(feat_lst)

In [35]:
in_Data.take(3)
Out[35]:
[SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185:
2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {20: 4.3534, 74: 2.6762, 97: 1.8625, 100: 5.2768, 185:
2.7481, 856: 4.1406, 991: 2.9518, 1039: 3.073, 1620: 3.073, 1864: 4.6377})]

In [37]:
in_Data.count()
Out[37]:
7540

In [38]:
# Build the model (cluster the data)
clusters = KMeans.train(in_Data, 5, maxIterations=10,
                        runs=10, initializationMode="random")

In [53]:
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = in_Data.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
Evaluating the model and the results

One way to fine-tune the clustering algorithm is by varying the number of clusters and verifying the output. Let's check the clusters and get a feel for the clustering results so far:
In [43]:
cluster_membership = in_Data.map(lambda x: clusters.predict(x))

In [54]:
cluster_idx = cluster_membership.zipWithIndex()

In [55]:
type(cluster_idx)
Out[55]:
pyspark.rdd.PipelinedRDD

In [58]:
cluster_idx.take(20)
Out[58]:
[(3, 0),
 (3, 1),
 (3, 2),
 (3, 3),
 (3, 4),
 (3, 5),
 (1, 6),
 (3, 7),
 (3, 8),
 (3, 9),
 (3, 10),
 (3, 11),
 (3, 12),
 (3, 13),
 (3, 14),
 (1, 15),
 (3, 16),
 (3, 17),
 (1, 18),
 (1, 19)]

In [59]:
cluster_df = cluster_idx.toDF()

In [65]:
pddf_with_cluster = pd.concat([pddf_in, cluster_pddf], axis=1)

In [76]:
pddf_with_cluster._1.unique()
Out[76]:
array([3, 1, 4, 0, 2])
In [79]:
pddf_with_cluster[pddf_with_cluster['_1'] == 0].head(10)
Out[79]:
      Unnamed: 0                  id                      created_at    user_id           user_name                                         tweet_text  _1    _2
6227           3  642418116819988480  Fri Sep 11 19:23:09 +0000 2015   49693598        Ajinkya Kale  RT @bigdata: Distributed Matrix Computations i…   0  6227
6257          45  642391207205859328  Fri Sep 11 17:36:13 +0000 2015  937467860        Angela Bassa  [Auto] I'm reading ""Distributed Matrix Comput…   0  6257
6297         119  642348577147064320  Fri Sep 11 14:46:49 +0000 2015   18318677          Ben Lorica  Distributed Matrix Computations in @ApacheSpar…   0  6297

In [80]:
pddf_with_cluster[pddf_with_cluster['_1'] == 1].head(10)
Out[80]:
    Unnamed: 0                  id                      created_at     user_id           user_name                                         tweet_text  _1  _2
6            6  638830419090079746  Tue Sep 01 21:46:55 +0000 2015  2241040634     Massimo Carrisi  Python: Python: Removing \xa0 from string? - I…   1   6
15          17  638830380578045953  Tue Sep 01 21:46:46 +0000 2015    57699376     Rafael Monnerat  RT @ramalhoorg: Noite de autógrafos do Fluent…    1  15
18          41  638830280988426250  Tue Sep 01 21:46:22 +0000 2015   951081582        Jack Baldwin  RT @cloudaus: We are 3/4 full! 2-day @swcarpen…   1  18
19          42  638830276626399232  Tue Sep 01 21:46:21 +0000 2015     6525302  Masayoshi Nakamura  PynamoDB #AWS #DynamoDB #Python http://...        1  19
20          43  638830213288235008  Tue Sep 01 21:46:06 +0000 2015  3153874869    Baltimore Python  Flexx: Python UI tookit based on web technolog…   1  20
21          44  638830117645516800  Tue Sep 01 21:45:43 +0000 2015    48474625   Radio Free Denali  Hmm, emerge --depclean wants to remove somethi…   1  21
22          46  638829977014636544  Tue Sep 01 21:45:10 +0000 2015   154915461     Luciano Ramalho  Noite de autógrafos do Fluent Python no Garoa…    1  22
23          47  638829882928070656  Tue Sep 01 21:44:47 +0000 2015   917320920     bsbafflesbrains  @DanSWright Harper channeling Monty Python. "...  1  23
24          48  638829868679954432  Tue Sep 01 21:44:44 +0000 2015   134280898  Lannick Technology  RT @SergeyKalnish: I am #hiring: Senior Backe…    1  24
25          49  638829707484508161  Tue Sep 01 21:44:05 +0000 2015  2839203454        Joshua Jones  RT @LindseyPelas: Surviving Monty Python in Fl…   1  25

In [81]:
pddf_with_cluster[pddf_with_cluster['_1'] == 2].head(10)
Out[81]:
      Unnamed: 0                  id                      created_at     user_id  user_name                                                tweet_text  _1    _2
7280         688  639056941592014848  Wed Sep 02 12:47:02 +0000 2015  2735137484      Chris  A true gay icon when will @ladygaga @[email protected]…   2  7280

In [82]:
pddf_with_cluster[pddf_with_cluster['_1'] == 3].head(10)
Out[82]:
    Unnamed: 0                  id                      created_at     user_id      user_name                                        tweet_text  _1  _2
0            0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint…   3   0
1            1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   1
2            2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg…   3   2
3            3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   3
4            4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   4
5            5  638830420159655936  Tue Sep 01 21:46:55 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   5
7            7  638830418330980352  Tue Sep 01 21:46:55 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   7
8            8  638830397648822272  Tue Sep 01 21:46:50 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   8
9            9  638830395375529984  Tue Sep 01 21:46:49 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   9
10          10  638830392389177344  Tue Sep 01 21:46:49 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3  10

In [83]:
pddf_with_cluster[pddf_with_cluster['_1'] == 4].head(10)
Out[83]:
      Unnamed: 0                  id                      created_at     user_id          user_name                                        tweet_text  _1    _2
1361         882  642648214454317056  Sat Sep 12 10:37:28 +0000 2015    27415756    Raymond Enisuoh  LA Chosen For US 2024 Olympic Bid - LA2016 See…   4  1361
1363         885  642647848744583168  Sat Sep 12 10:36:01 +0000 2015    27415756    Raymond Enisuoh  Prison See: https://t.co/x3EKAExeFi… … … … …...  4  1363
5412          11  640480770369286144  Sun Sep 06 11:04:49 +0000 2015  3242403023  Donald Trump 2016  "igiboooy! @Starbucks https://t.co/97wdL…         4  5412
5428          27  640477140660518912  Sun Sep 06 10:50:24 +0000 2015  3242403023  Donald Trump 2016  "@Starbucks https://t.co/wsEYFIefk7" - D…         4  5428
5455          61  640469542272110592  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "starbucks @Starbucks Mam Plaza https://t.co…     4  5455
5456          62  640469541370372096  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "Aaahhh the pumpkin spice latte is back, fall…    4  5456
5457          63  640469539524898817  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "RT kayyleighferry: Oh my goddd Harry Potter…     4  5457
5458          64  640469537176031232  Sun Sep 06 10:20:11 +0000 2015  3242403023  Donald Trump 2016  "Starbucks https://t.co/3xYYXlwNkf" - Donald…     4  5458
5459          65  640469536119070720  Sun Sep 06 10:20:11 +0000 2015  3242403023  Donald Trump 2016  "A Starbucks is under construction in my neig…    4  5459
5460          66  640469530435813376  Sun Sep 06 10:20:10 +0000 2015  3242403023  Donald Trump 2016  "Babam starbucks'tan fotogtaf atıyor bende du…    4  5460
We map the 5 clusters with some sample tweets. Cluster 0 is about Spark. Cluster 1 is about Python. Cluster 2 is about Lady Gaga. Cluster 3 is about Thailand's Phuket News. Cluster 4 is about Donald Trump.
Building machine learning pipelines

We want to compose the feature extraction, preparatory activities, training, testing, and prediction activities while optimizing the best tuning parameters to get the best performing model.

The following tweet captures perfectly in five lines of code a powerful machine learning pipeline implemented in Spark MLlib:

The Spark ML pipeline is inspired by Python's Scikit-Learn and creates a succinct, declarative statement of the successive transformations to the data in order to quickly deliver a tunable model.
Summary

In this chapter, we got an overview of Spark MLlib's ever-expanding library of algorithms. We discussed supervised and unsupervised learning, recommender systems, optimization, and feature extraction algorithms. We then put the harvested data from Twitter through the machine learning process, algorithms, and evaluation to derive insights from the data. We put the Twitter-harvested dataset through Python Scikit-Learn and Spark MLlib K-Means clustering in order to segregate the tweets relevant to Apache Spark. We also evaluated the performance of the models.

This gets us ready for the next chapter, which will cover streaming analytics using Spark. Let's jump right in.
Chapter 5. Streaming Live Data with Spark

In this chapter, we will focus on live streaming data flowing into Spark and processing it. So far, we have discussed machine learning and data mining with batch processing. We are now looking at processing continuously flowing data and detecting facts and patterns on the fly. We are navigating from a lake to a river.

We will first investigate the challenges arising from such a dynamic and ever-changing environment. After laying the grounds on the prerequisites of a streaming application, we will investigate various implementations using live sources of data, such as TCP sockets, up to the Twitter firehose, and put in place a low latency, high throughput, and scalable data pipeline combining Spark, Kafka, and Flume.

In this chapter, we will cover the following points:

Analyzing a streaming application's architectural challenges, constraints, and requirements
Processing live data from a TCP socket with Spark Streaming
Connecting to the Twitter firehose directly to parse tweets in quasi real time
Establishing a reliable, fault tolerant, scalable, high throughput, low latency integrated application using Spark, Kafka, and Flume
Closing remarks on the Lambda and Kappa architecture paradigms
Laying the foundations of streaming architecture

As customary, let's first go back to our original drawing of the data-intensive apps architecture blueprint and highlight the Spark Streaming module that will be the topic of interest.

The following diagram sets the context by highlighting the Spark Streaming module and the interactions with Spark SQL and Spark MLlib within the overall data-intensive apps framework.

Data flows from stock market time series, enterprise transactions, interactions, events, web traffic, clickstreams, and sensors. All events are time-stamped data and urgent. This is the case for fraud detection and prevention, mobile cross-sell and upsell, or traffic alerts. Those streams of data require immediate processing for monitoring purposes, such as detecting anomalies, outliers, spam, fraud, and intrusion; and also for providing basic statistics, insights, trends, and recommendations. In some cases, the summarized aggregated information is sufficient to be stored for later usage. From an architecture paradigm perspective, we are moving from a service-oriented architecture to an event-driven architecture.
Two models emerge for processing streams of data:

Processing one record at a time as they come in. We do not buffer the incoming records in a container before processing them. This is the case of Twitter's Storm, Yahoo's S4, and Google's MillWheel.
Micro-batching or batch computations on small intervals as performed by Spark Streaming and Storm Trident. In this case, we buffer the incoming records in a container according to the time window prescribed in the micro-batching settings.

Spark Streaming has often been compared against Storm. They are two different models of streaming data. Spark Streaming is based on micro-batching. Storm is based on processing records as they come in. Storm also offers a micro-batching option, with its Storm Trident option.
The driving factor in a streaming application is latency. Latency varies from the milliseconds range in the case of RPC (short for Remote Procedure Call) to several seconds or minutes for micro-batching solutions such as Spark Streaming.

RPC allows synchronous operations between the requesting programs waiting for the results from the remote server's procedure. Threads allow concurrency of multiple RPC calls to the server.

An example of software implementing a distributed RPC model is Apache Storm.

Storm implements stateless sub-millisecond latency processing of unbounded tuples using topologies, or directed acyclic graphs, combining spouts as sources of data streams and bolts for operations such as filter, join, aggregation, and transformation. Storm also implements a higher level abstraction called Trident which, similarly to Spark, processes data streams in micro-batches.

So, looking at the latency continuum, from sub-millisecond to second, Storm is a good candidate. For the seconds to minutes scale, Spark Streaming and Storm Trident are excellent fits. For several minutes onward, Spark and a NoSQL database such as Cassandra or HBase are adequate solutions. For ranges beyond the hour and with high volumes of data, Hadoop is the ideal contender.
Although throughput is correlated to latency, it is not a simple inversely linear relationship. If processing a message takes 2 ms, which determines the latency, then one would assume the throughput is limited to 500 messages per second. Batching messages allows for higher throughput if we allow our messages to be buffered for 8 ms more. With a latency of 10 ms, the system can buffer up to 10,000 messages. For a bearable increase in latency, we have substantially increased throughput. This is the magic of micro-batching that Spark Streaming exploits.
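The latency/throughput trade-off above can be checked with a few lines of arithmetic. This is a toy model, assuming batching amortizes the per-message cost over every message buffered in the window; the helper names are illustrative.

```python
# Toy model of the latency/throughput trade-off described above.
# Assumes one-at-a-time processing costs a fixed time per message,
# while batching amortizes work over all messages in the window.

def one_at_a_time_throughput(per_msg_ms):
    """Messages per second when each message is processed individually."""
    return 1000 / per_msg_ms

def batched_throughput(window_ms, msgs_per_window):
    """Messages per second when msgs_per_window are handled per window."""
    return msgs_per_window * 1000 / window_ms

print(one_at_a_time_throughput(2))    # 500.0 msg/s at 2 ms latency
print(batched_throughput(10, 10000))  # 1000000.0 msg/s at 10 ms latency
```

The factor-of-2000 jump in throughput costs only 8 ms of extra latency, which is the trade Spark Streaming makes.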
Spark Streaming inner working

The Spark Streaming architecture leverages the Spark core architecture. It overlays on the SparkContext a StreamingContext as the entry point to the streaming functionality. The cluster manager will dedicate at least one worker node as a receiver, which will be an executor with a long task to process the incoming stream. The executor creates Discretized Streams or DStreams from the input data stream and replicates, by default, each DStream to the cache of another worker. One receiver serves one input data stream. Multiple receivers improve parallelism and generate multiple DStreams that Spark can unite or join as Resilient Distributed Datasets (RDDs).

The following diagram gives an overview of the inner working of Spark Streaming. The client interacts with the Spark cluster via the cluster manager, while Spark Streaming has a dedicated worker with a long running task ingesting the input data stream and transforming it into discretized streams or DStreams. The data is collected, buffered, and replicated by a receiver and then pushed to a stream of RDDs.

Spark receivers can ingest data from many sources. Core input sources range from TCP sockets and HDFS/Amazon S3 to Akka Actors. Additional sources include Apache Kafka, Apache Flume, Amazon Kinesis, ZeroMQ, Twitter, and custom or user-defined receivers.

We distinguish between reliable receivers, which acknowledge receipt of data to the source and replicate it for possible resend, versus unreliable receivers, which do not acknowledge receipt of the message. Spark scales out in terms of the number of workers, partitions, and receivers.

The following diagram gives an overview of Spark Streaming with the possible sources and the persistence options:
Going under the hood of Spark Streaming

Spark Streaming is composed of receivers, and is powered by Discretized Streams and Spark connectors for persistence.

Just as the essential data structure for Spark Core is the RDD, the fundamental programming abstraction for Spark Streaming is the Discretized Stream or DStream.

The following diagram illustrates Discretized Streams as continuous sequences of RDDs. The batch intervals of a DStream are configurable.

DStreams snapshot the incoming data in batch intervals. Those time steps typically range from 500 ms to several seconds. The underlying structure of a DStream is an RDD.

A DStream is essentially a continuous sequence of RDDs. This is powerful as it allows us to leverage from Spark Streaming all the traditional functions, transformations, and actions available in Spark Core, and allows us to dialogue with Spark SQL, performing SQL queries on incoming streams of data, and with Spark MLlib. Transformations similar to those on generic and key-value pair RDDs are applicable. The DStreams benefit from the inner RDDs' lineage and fault tolerance. Additional transformation and output operations exist for discretized stream operations. Most generic operations on DStreams are transform and foreachRDD.
The following diagram gives an overview of the lifecycle of DStreams, from the creation of the micro-batches of messages materialized to RDDs, on which transformation functions and actions that trigger Spark jobs are applied. Breaking down the steps illustrated in the diagram, we read the diagram top down:

1. In the input stream, the incoming messages are buffered in a container according to the time window allocated for the micro-batching.
2. In the discretized stream step, the buffered micro-batches are transformed as DStream RDDs.
3. The mapped DStream step is obtained by applying a transformation function to the original DStream. These first three steps constitute the transformation of the original data received in predefined time windows. As the underlying data structure is the RDD, we conserve the data lineage of the transformations.
4. The final step is an action on the RDD. It triggers the Spark job.
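The four steps above can be mimicked in plain Python: a toy micro-batching word count, with lists standing in for RDDs and a fixed batch size standing in for the time window. This is a conceptual sketch, not the Spark API.

```python
from collections import Counter

# Toy illustration of the DStream lifecycle: buffer messages into
# micro-batches (steps 1-2), apply a transformation (step 3), then
# an action that emits the result (step 4). Not the Spark API.

def micro_batches(stream, batch_size):
    """Steps 1-2: cut the incoming stream into fixed-size batches."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

stream = ["hello world", "hello spark", "cool it works"]
for batch in micro_batches(stream, 2):
    # Step 3: transformation - split lines into words
    words = [w for line in batch for w in line.split()]
    # Step 4: action - count words and emit the result
    print(Counter(words))
```

In real Spark Streaming, the batch boundary is a time window rather than a count, and each batch is a genuine RDD with lineage and fault tolerance.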
Transformations can be stateless or stateful. Stateless means that no state is maintained by the program, while stateful means the program keeps a state, in which case previous transactions are remembered and may affect the current transaction. A stateful operation modifies or requires some state of the system, and a stateless operation does not.

Stateless transformations process each batch in a DStream one at a time. Stateful transformations process multiple batches to obtain results. Stateful transformations require the checkpoint directory to be configured. Checkpointing is the main mechanism for fault tolerance in Spark Streaming, periodically saving data and metadata about an application.

There are two types of stateful transformations for Spark Streaming: updateStateByKey and windowed transformations.

updateStateByKey transformations maintain state for each key in a stream of pair RDDs. They return a new state DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of each key. An example would be a running count of given hashtags in a stream of tweets.
Windowed transformations are carried over multiple batches in a sliding window. A window has a defined length or duration specified in time units. It must be a multiple of a DStream batch interval. It defines how many batches are included in a windowed transformation.

A window has a sliding interval or sliding duration specified in time units. It must be a multiple of a DStream batch interval. It defines how many batches to slide a window, or how frequently to compute a windowed transformation.
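The interplay of window length and sliding interval can be pictured with a toy sliding-window count in plain Python, where each list element stands for the event count of one batch interval. This is a conceptual sketch only, not the Spark API.

```python
# Toy sliding-window count: each element of `batches` is the number of
# events seen in one batch interval. A window of length 3 batches that
# slides by 2 batches yields one aggregated count per slide.
# Conceptual sketch only - not the Spark API.

def windowed_counts(batches, window_length, slide_interval):
    counts = []
    for start in range(0, len(batches) - window_length + 1, slide_interval):
        counts.append(sum(batches[start:start + window_length]))
    return counts

batches = [4, 1, 3, 2, 5, 0]
print(windowed_counts(batches, window_length=3, slide_interval=2))
# [8, 10] -> windows [4, 1, 3] and [3, 2, 5]
```

Note how the two windows overlap by one batch: that overlap is exactly window length minus sliding interval, as in Spark's windowed operations.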
The following schema depicts the windowing operation on DStreams to derive window DStreams with a given length and sliding interval:

A sample function is countByWindow(windowLength, slideInterval). It returns a new DStream in which each RDD has a single element generated by counting the number of elements in a sliding window over this DStream. An illustration in this case would be a running count of given hashtags in a stream of tweets every 60 seconds. The window time frame is specified.

Minute-scale window lengths are reasonable. Hour-scale window lengths are not recommended as they are compute and memory intensive. It would be more convenient to aggregate the data in a database such as Cassandra or HBase.

Windowed transformations compute results based on the window length and window slide interval. Spark performance is primarily affected by the window length, the window slide interval, and persistence.
Building in fault tolerance

Real-time stream processing systems must be operational 24/7. They need to be resilient to all sorts of failures in the system. Spark and its RDD abstraction are designed to seamlessly handle failures of any worker nodes in the cluster.

The main Spark Streaming fault tolerance mechanisms are checkpointing, automatic driver restart, and automatic failover. Spark enables recovery from driver failure using checkpointing, which preserves the application state.

Write ahead logs, reliable receivers, and file streams guarantee zero data loss as of Spark version 1.2. Write ahead logs represent a fault tolerant storage for received data.

Failures require recomputing results. DStream operations have exactly-once semantics. Transformations can be recomputed multiple times but will yield the same result. DStream output operations have at-least-once semantics. Output operations may be executed multiple times.
Processing live data with TCP sockets

As a stepping stone to the overall understanding of streaming operations, we will first experiment with TCP sockets. A TCP socket establishes two-way communication between client and server, and it can exchange data through the established connection. WebSocket connections are long-lived, unlike typical HTTP connections. HTTP is not meant to keep an open connection from the server to continuously push data to the web browsers. Most web applications hence resorted to long polling via frequent Asynchronous JavaScript and XML (AJAX) requests. WebSockets, standardized and implemented in HTML5, are moving beyond web browsers and are becoming a cross-platform standard for real-time communication between client and server.
Setting up TCP sockets

We create a TCP socket server by running netcat, a small utility found in most Linux systems, as a data server with the command > nc -lk 9999, where 9999 is the port where we are sending data:

#
# Socket Server
#
an@an-VB:~$ nc -lk 9999
hello world
how are you
hello world
cool it works
Once netcat is running, we will open a second console with our Spark Streaming client to receive the data and process it. As soon as the Spark Streaming client console is listening, we start typing the words to be processed, that is, hello world.
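The same exchange can also be reproduced without netcat. Below is a minimal, self-contained Python sketch (all names are hypothetical, for illustration only) in which a thread stands in for the nc -lk server and a client sends it the same newline-delimited lines that our Spark Streaming client will later read:

```python
import socket
import threading

received = []

def serve(server_sock):
    # Accept one client and collect newline-delimited lines, like nc -lk
    conn, _ = server_sock.accept()
    data = b''
    while True:
        chunk = conn.recv(1024)
        if not chunk:          # client closed the connection
            break
        data += chunk
    received.extend(data.decode('utf-8').splitlines())
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))  # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=serve, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
for line in ['hello world', 'how are you', 'hello world', 'cool it works']:
    client.sendall((line + '\n').encode('utf-8'))
client.close()
t.join()
server.close()
print(received)
```

ssc.socketTextStream plays the role of the client side here: it connects to the given host and port and splits the incoming byte stream on '\n' into lines.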
Processing live data

We will be using the example program provided in the Spark bundle for Spark Streaming called network_wordcount.py. It can be found in the GitHub repository under https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py. The code is as follows:
"""
CountswordsinUTF8encoded,'\n'delimitedtextreceivedfromthe
networkeverysecond.
Usage:network_wordcount.py<hostname><port>
<hostname>and<port>describetheTCPserverthatSparkStreamingwould
connecttoreceivedata.
Torunthisonyourlocalmachine,youneedtofirstrunaNetcatserver
`$nc-lk9999`
andthenruntheexample
`$bin/spark-submit
examples/src/main/python/streaming/network_wordcount.pylocalhost9999`
"""
from__future__importprint_function
importsys
frompysparkimportSparkContext
frompyspark.streamingimportStreamingContext
if__name__=="__main__":
iflen(sys.argv)!=3:
print("Usage:network_wordcount.py<hostname><port>",
file=sys.stderr)
exit(-1)
sc=SparkContext(appName="PythonStreamingNetworkWordCount")
ssc=StreamingContext(sc,1)
lines=ssc.socketTextStream(sys.argv[1],int(sys.argv[2]))
counts=lines.flatMap(lambdaline:line.split(""))\
.map(lambdaword:(word,1))\
.reduceByKey(lambdaa,b:a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()
Here, we explain the steps of the program:

1. The code first initializes a Spark Streaming context with the command:

   ssc = StreamingContext(sc, 1)

2. Next, the streaming computation is set up.
3. One or more DStream objects that receive data are defined to connect to localhost or 127.0.0.1 on port 9999:

   stream = ssc.socketTextStream("127.0.0.1", 9999)

4. The DStream computation is defined: transformations and output operations:

   stream.map(lambda x: (x, 1)) \
         .reduceByKey(lambda a, b: a + b) \
         .pprint()

5. Computation is started:

   ssc.start()

6. Program termination is pending manual or error processing completion:

   ssc.awaitTermination()

7. Manual completion is an option when a completion condition is known:

   ssc.stop()
We can monitor the Spark Streaming application by visiting the Spark monitoring home page at localhost:4040.

Here's the result of running the program and feeding the words on the netcat server console:
#
# Socket Client
#
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999

Run the Spark Streaming network_wordcount program by connecting to the socket localhost on port 9999:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
-------------------------------------------
Time: 2015-10-18 20:06:06
-------------------------------------------
(u'world', 1)
(u'hello', 1)
-------------------------------------------
Time: 2015-10-18 20:06:07
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:17
-------------------------------------------
(u'you', 1)
(u'how', 1)
(u'are', 1)
-------------------------------------------
Time: 2015-10-18 20:06:18
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:26
-------------------------------------------
(u'', 1)
(u'world', 1)
(u'hello', 1)
-------------------------------------------
Time: 2015-10-18 20:06:27
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:37
-------------------------------------------
(u'works', 1)
(u'it', 1)
(u'cool', 1)
-------------------------------------------
Time: 2015-10-18 20:06:38
-------------------------------------------
Thus, we have established a connection through the socket on port 9999, streamed the data sent by the netcat server, and performed a word count on the messages sent.
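The flatMap/map/reduceByKey pipeline applied to each batch is equivalent to the following plain-Python reduction, shown here only to make the per-batch computation explicit (the function name is ours, not Spark's):

```python
def word_count(batch_lines):
    """Per-batch word count, mirroring
    lines.flatMap(split).map(word -> (word, 1)).reduceByKey(+)."""
    counts = {}
    for line in batch_lines:                        # flatMap: lines -> words
        for word in line.split(' '):
            counts[word] = counts.get(word, 0) + 1  # map + reduceByKey
    return counts

print(word_count(['hello world', 'how are you']))
```

In Spark Streaming this computation is applied independently to each one-second batch, which is why each Time block in the output above only counts the words typed during that interval.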
Manipulating Twitter data in real time

Twitter offers two APIs. One is a search API that essentially allows us to retrieve past tweets based on search terms. This is how we have been collecting our data from Twitter in the previous chapters of the book. Interestingly, for our current purpose, Twitter also offers a live streaming API which allows the ingestion of tweets as they are emitted in the blogosphere.
Processing Tweets in real time from the Twitter firehose

The following program connects to the Twitter firehose and processes the incoming tweets to exclude deleted or invalid tweets, and parses on the fly only the relevant ones to extract the screen name, the actual tweet or tweet text, the retweet count, and the geo-location information. The processed tweets are gathered into an RDD queue by Spark Streaming and then displayed on the console at a one-second interval:
"""
TwitterStreamingAPISparkStreamingintoanRDD-Queuetoprocesstweets
live
CreateaqueueofRDDsthatwillbemapped/reducedoneatatimein
1secondintervals.
Torunthisexampleuse
'$bin/spark-submit
examples/AN_Spark/AN_Spark_Code/s07_twitterstreaming.py'
"""
#
importtime
frompysparkimportSparkContext
frompyspark.streamingimportStreamingContext
importtwitter
importdateutil.parser
importjson
#ConnectingStreamingTwitterwithStreamingSparkviaQueue
classTweet(dict):
def__init__(self,tweet_in):
super(Tweet,self).__init__(self)
iftweet_inand'delete'notintweet_in:
self['timestamp']=
dateutil.parser.parse(tweet_in[u'created_at']
).replace(tzinfo=None).isoformat()
self['text']=tweet_in['text'].encode('utf-8')
#self['text']=tweet_in['text']
self['hashtags']=[x['text'].encode('utf-8')forxin
tweet_in['entities']['hashtags']]
#self['hashtags']=[x['text']forxintweet_in['entities']
['hashtags']]
self['geo']=tweet_in['geo']['coordinates']iftweet_in['geo']
elseNone
self['id']=tweet_in['id']
self['screen_name']=tweet_in['user']
['screen_name'].encode('utf-8')
#self['screen_name']=tweet_in['user']['screen_name']
self['user_id']=tweet_in['user']['id']
defconnect_twitter():
twitter_stream=twitter.TwitterStream(auth=twitter.OAuth(
token="get_your_own_credentials",
token_secret="get_your_own_credentials",
consumer_key="get_your_own_credentials",
consumer_secret="get_your_own_credentials"))
returntwitter_stream
defget_next_tweet(twitter_stream):
stream=twitter_stream.statuses.sample(block=True)
tweet_in=None
whilenottweet_inor'delete'intweet_in:
tweet_in=stream.next()
tweet_parsed=Tweet(tweet_in)
returnjson.dumps(tweet_parsed)
defprocess_rdd_queue(twitter_stream):
#CreatethequeuethroughwhichRDDscanbepushedto
#aQueueInputDStream
rddQueue=[]
foriinrange(3):
rddQueue+=
[ssc.sparkContext.parallelize([get_next_tweet(twitter_stream)],5)]
lines=ssc.queueStream(rddQueue)
lines.pprint()
if__name__=="__main__":
sc=SparkContext(appName="PythonStreamingQueueStream")
ssc=StreamingContext(sc,1)
#Instantiatethetwitter_stream
twitter_stream=connect_twitter()
#GetRDDqueueofthestreamsjsonorparsed
process_rdd_queue(twitter_stream)
ssc.start()
time.sleep(2)
ssc.stop(stopSparkContext=True,stopGraceFully=True)
When we run this program, it delivers the following output:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ bin/spark-submit examples/AN_Spark/AN_Spark_Code/s07_twitterstreaming.py
-------------------------------------------
Time: 2015-11-03 21:53:14
-------------------------------------------
{"user_id": 3242732207, "screen_name": "cypuqygoducu", "timestamp": "2015-11-03T20:53:04", "hashtags": [], "text": "RT @VIralBuzzNewss: Our Distinctive Edition Holiday break Challenge Is In this article! Hooray!... - https://t.co/9d8wumrd5v https://t.co/\u2026", "geo": null, "id": 661647303678259200}
-------------------------------------------
Time: 2015-11-03 21:53:15
-------------------------------------------
{"user_id": 352673159, "screen_name": "melly_boo_orig", "timestamp": "2015-11-03T20:53:05", "hashtags": ["eminem"], "text": "#eminem https://t.co/GlEjPJnwxy", "geo": null, "id": 661647307847409668}
-------------------------------------------
Time: 2015-11-03 21:53:16
-------------------------------------------
{"user_id": 500620889, "screen_name": "NBAtheist", "timestamp": "2015-11-03T20:53:06", "hashtags": ["tehInterwebbies", "Nutters"], "text": "See? That didn't take long or any actual effort. This is #tehInterwebbies… #NuttersAbound! https://t.co/QS8gLStYFO", "geo": null, "id": 661647312062709761}
So, we got an example of streaming tweets with Spark and processing them on the fly.
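The flattening performed by the Tweet class can be checked offline against a hand-crafted status dictionary. The sketch below mirrors the fields the class extracts (minus the dateutil timestamp parsing); the sample values are made up for illustration:

```python
import json

def flatten_tweet(tweet_in):
    """Extract the same fields as the Tweet class above,
    skipping deleted or empty statuses."""
    if not tweet_in or 'delete' in tweet_in:
        return None
    return {
        'text': tweet_in['text'],
        'hashtags': [h['text'] for h in tweet_in['entities']['hashtags']],
        'geo': tweet_in['geo']['coordinates'] if tweet_in['geo'] else None,
        'id': tweet_in['id'],
        'screen_name': tweet_in['user']['screen_name'],
        'user_id': tweet_in['user']['id'],
    }

# A fabricated sample status, for illustration only
sample = {'text': '#spark is fun', 'geo': None, 'id': 1,
          'entities': {'hashtags': [{'text': 'spark'}]},
          'user': {'screen_name': 'someone', 'id': 42}}
print(json.dumps(flatten_tweet(sample), sort_keys=True))
```

Filtering out 'delete' records before extraction is what keeps the stream free of the deletion notices Twitter interleaves with statuses.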
Building a reliable and scalable streaming app

Ingesting data is the process of acquiring data from various sources and storing it for processing immediately or at a later stage. Data consuming systems are dispersed and can be physically and architecturally far from the sources. Data ingestion is often implemented manually with scripts and rudimentary automation. It actually calls for higher level frameworks like Flume and Kafka.

The challenges of data ingestion arise from the fact that the sources are physically spread out and are transient, which makes the integration brittle. Data production is continuous for weather, traffic, social media, network activity, shop floor sensors, security, and surveillance. Ever increasing data volumes and rates coupled with ever changing data structures and semantics make data ingestion ad hoc and error prone.

The aim is to become more agile, reliable, and scalable. The agility, reliability, and scalability of the data ingestion determine the overall health of the pipeline. Agility means integrating new sources as they arise and incorporating changes to existing sources as needed. In order to ensure safety and reliability, we need to protect the infrastructure against data loss and the downstream applications from silent data corruption at ingress. Scalability avoids ingest bottlenecks while keeping costs tractable.
Ingest Mode           Description                                               Example
Manual or Scripted    File copy using command line interface or GUI interface   HDFS Client, Cloudera Hue
Batch Data Transport  Bulk data transport using tools                           DistCp, Sqoop
Micro Batch           Transport of small batches of data                        Sqoop, Sqoop2, Storm
Pipelining            Flow-like transport of event streams                      Flume, Scribe
Message Queue         Publish-subscribe message bus of events                   Kafka, Kinesis
In order to enable an event-driven business that is able to ingest multiple streams of data, process it in flight, and make sense of it all to get to rapid decisions, the key driver is the Unified Log.

A Unified Log is a centralized enterprise structured log available for real-time subscription. All the organization's data is put in a central log for subscription. Records are numbered beginning with zero in the order that they are written. It is also known as a commit log or journal. The concept of the Unified Log is the central tenet of the Kappa architecture.
The properties of the Unified Log are as follows:

Unified: There is a single deployment for the entire organization
Append only: Events are immutable and are appended
Ordered: Each event has a unique offset within a shard
Distributed: For fault tolerance purposes, the Unified Log is distributed redundantly on a cluster of computers
Fast: The system ingests thousands of messages per second
Setting up Kafka

In order to isolate the downstream consumption of data from the vagaries of the upstream emission of data, we need to decouple the providers of data from the receivers or consumers of data. As they are living in two different worlds with different cycles and constraints, Kafka decouples the data pipelines.

Apache Kafka is a distributed publish-subscribe messaging system rethought as a distributed commit log. The messages are stored by topic.
Apache Kafka has the following properties. It supports:

High throughput for high volumes of event feeds
Real-time processing of new and derived feeds
Large data backlogs and persistence for offline consumption
Low latency as an enterprise-wide messaging system
Fault tolerance thanks to its distributed nature

Messages are stored in partitions with a unique sequential ID called the offset. Consumers track their pointers via a tuple of (offset, partition, topic).
Let's dive deeper into the anatomy of Kafka.

Kafka has essentially three components: producers, consumers, and brokers. Producers push and write data to brokers. Consumers pull and read data from brokers. Brokers do not push messages to consumers; consumers pull messages from brokers. The setup is distributed and coordinated by Apache ZooKeeper.

The brokers manage and store the data in topics. Topics are split into replicated partitions. The data is persisted in the broker; it is not removed upon consumption, but only after the retention period expires. If a consumer fails, it can always go back to the broker to fetch the data.

Kafka requires Apache ZooKeeper. ZooKeeper is a high-performance coordination service for distributed applications. It centrally manages configuration, registry or naming services, group membership, locks, and synchronization for coordination between servers. It provides a hierarchical namespace with metadata, monitoring statistics, and the state of the cluster. ZooKeeper can introduce brokers and consumers on the fly and then rebalances the cluster.

Kafka producers do not need ZooKeeper. Kafka brokers use ZooKeeper to provide general state information as well as to elect a leader in case of failure. Kafka consumers use ZooKeeper to track message offsets. Newer versions of Kafka spare the consumers from going through ZooKeeper and let them retrieve that information from special Kafka topics. Kafka provides automatic load balancing for producers.

The following diagram gives an overview of the Kafka setup:
Installing and testing Kafka

We will download the Apache Kafka binaries from the dedicated web page at http://kafka.apache.org/downloads.html and install the software on our machine using the following steps:

1. Download the code.
2. Download the 0.8.2.0 release and un-tar it:

   > tar -xzf kafka_2.10-0.8.2.0.tgz
   > cd kafka_2.10-0.8.2.0
3. Start ZooKeeper. Kafka uses ZooKeeper, so we need to first start a ZooKeeper server. We will use the convenience script packaged with Kafka to get a single-node ZooKeeper instance:

   > bin/zookeeper-server-start.sh config/zookeeper.properties

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/zookeeper-server-start.sh config/zookeeper.properties
[2015-10-31 22:49:14,808] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2015-10-31 22:49:14,816] INFO autopurge.snapRetainCount set to 3 (org.apache.zookeeper.server.DatadirCleanupManager)
...
4. Now launch the Kafka server:

   > bin/kafka-server-start.sh config/server.properties

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-server-start.sh config/server.properties
[2015-10-31 22:52:04,643] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2015-10-31 22:52:04,714] INFO Property broker.id is overridden to 0 (kafka.utils.VerifiableProperties)
[2015-10-31 22:52:04,715] INFO Property log.cleaner.enable is overridden to false (kafka.utils.VerifiableProperties)
[2015-10-31 22:52:04,715] INFO Property log.dirs is overridden to /tmp/kafka-logs (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
5. Create a topic. Let's create a topic named test with a single partition and only one replica:

   > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

6. We can now see that topic if we run the list topic command:

   > bin/kafka-topics.sh --list --zookeeper localhost:2181
   test

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Created topic "test".
an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-topics.sh --list --zookeeper localhost:2181
test
7. Check the Kafka installation by creating a producer and a consumer. We first launch a producer and type a message in the console:

an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
[2015-10-31 22:54:43,698] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
This is a message
This is another message

8. We then launch a consumer to check that we receive the messages:

an@an-VB:~$ cd kafka/
an@an-VB:~/kafka$ cd kafka_2.10-0.8.2.0/
an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message

The messages were appropriately received by the consumer:
1. Check the Kafka and Spark Streaming consumer. We will be using the Spark Streaming Kafka word count example provided in the Spark bundle. A word of caution: we have to bind the Kafka packages, --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0, when we submit the Spark job. The command is as follows:

   ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 \
   examples/src/main/python/streaming/kafka_wordcount.py \
   localhost:2181 test

2. When we launch the Spark Streaming word count program with Kafka, we get the following output:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
-------------------------------------------
Time: 2015-10-31 23:46:33
-------------------------------------------
(u'', 1)
(u'from', 2)
(u'Hello', 2)
(u'Kafka', 2)
-------------------------------------------
Time: 2015-10-31 23:46:34
-------------------------------------------
-------------------------------------------
Time: 2015-10-31 23:46:35
-------------------------------------------
3. Install the Kafka Python driver in order to be able to programmatically develop producers and consumers and interact with Kafka and Spark using Python. We will use the road-tested library from David Arthur, aka Mumrah on GitHub (https://github.com/mumrah). We can pip install it as follows:

   > pip install kafka-python

an@an-VB:~$ pip install kafka-python
Collecting kafka-python
  Downloading kafka-python-0.9.4.tar.gz (63kB)
...
Successfully installed kafka-python-0.9.4
Developing producers

The following program creates a Simple Kafka Producer that will emit the message This is a message sent from the Kafka producer: five times, followed by a timestamp, every second:
#
# kafka producer
#
#
import time
from kafka.common import LeaderNotAvailableError
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer
from datetime import datetime

def print_response(response=None):
    if response:
        print('Error: {0}'.format(response[0].error))
        print('Offset: {0}'.format(response[0].offset))

def main():
    kafka = KafkaClient("localhost:9092")
    producer = SimpleProducer(kafka)
    try:
        time.sleep(5)
        topic = 'test'
        for i in range(5):
            time.sleep(1)
            msg = 'This is a message sent from the kafka producer: ' \
                  + str(datetime.now().time()) + ' — ' \
                  + str(datetime.now().strftime("%A, %d %B %Y %I:%M%p"))
            print_response(producer.send_messages(topic, msg))
    except LeaderNotAvailableError:
        # https://github.com/mumrah/kafka-python/issues/249
        time.sleep(1)
        print_response(producer.send_messages(topic, msg))
    kafka.close()

if __name__ == "__main__":
    main()
When we run this program, the following output is generated:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$ python s08_kafka_producer_01.py
Error: 0
Offset: 13
Error: 0
Offset: 14
Error: 0
Offset: 15
Error: 0
Offset: 16
Error: 0
Offset: 17
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$

It tells us there were no errors and gives the offsets of the messages assigned by the Kafka broker.
Developing consumers

To fetch the messages from the Kafka brokers, we develop a Kafka consumer:

# kafka consumer
# consumes messages from "test" topic and writes them to console.
#
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

def main():
    kafka = KafkaClient("localhost:9092")
    print("Consumer established connection to kafka")
    consumer = SimpleConsumer(kafka, "my-group", "test")
    for message in consumer:
        # This will wait and print messages as they become available
        print(message)

if __name__ == "__main__":
    main()
When we run this program, we effectively confirm that the consumer received all the messages:

an@an-VB:~$ cd ~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code/
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$ python s08_kafka_consumer_01.py
Consumer established connection to kafka
OffsetAndMessage(offset=13, message=Message(magic=0, attributes=0, key=None, value='This is a message sent from the kafka producer: 11:50:17.867309 Sunday, 01 November 2015 11:50AM'))
...
OffsetAndMessage(offset=17, message=Message(magic=0, attributes=0, key=None, value='This is a message sent from the kafka producer: 11:50:22.051423 Sunday, 01 November 2015 11:50AM'))
Developing a Spark Streaming consumer for Kafka

Based on the example code provided in the Spark Streaming bundle, we will create a Spark Streaming consumer for Kafka and perform a word count on the messages stored with the brokers:

#
# Kafka Spark Streaming Consumer
#
from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_spark_consumer_01.py <zk> <topic>",
              file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)

    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
Run this program with the following Spark submit command:

./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 examples/AN_Spark/AN_Spark_Code/s08_kafka_spark_consumer_01.py localhost:2181 test

We get the following output:

an@an-VB:~$ cd spark/spark-1.5.0-bin-hadoop2.6/
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit \
> --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 \
> examples/AN_Spark/AN_Spark_Code/s08_kafka_spark_consumer_01.py localhost:2181 test
...
:: retrieving :: org.apache.spark#spark-submit-parent
  confs: [default]
  0 artifacts copied, 10 already retrieved (0kB/18ms)
-------------------------------------------
Time: 2015-11-01 12:13:16
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:17
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:18
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:19
-------------------------------------------
(u'a', 5)
(u'the', 5)
(u'11:50AM', 5)
(u'from', 5)
(u'This', 5)
(u'11:50:21.044374 Sunday,', 1)
(u'message', 5)
(u'11:50:20.036422 Sunday,', 1)
(u'11:50:22.051423 Sunday,', 1)
(u'11:50:17.867309 Sunday,', 1)
...
-------------------------------------------
Time: 2015-11-01 12:13:20
-------------------------------------------
-------------------------------------------
Time: 2015-11-01 12:13:21
-------------------------------------------
Exploring flume

Flume is a continuous ingestion system. It was originally designed to be a log aggregation system, but it evolved to handle any type of streaming event data.

Flume is a distributed, reliable, scalable, and available pipeline system for the efficient collection, aggregation, and transport of large volumes of data. It has built-in support for contextual routing, filtering, replication, and multiplexing. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for real-time analytic applications.

Flume offers the following:

Guaranteed delivery semantics
Low-latency reliable data transfer
Declarative configuration with no coding required
Extendable and customizable settings
Integration with most commonly used end-points
The anatomy of Flume contains the following elements:

Event: An event is the fundamental unit of data that is transported by Flume from source to destination. It is like a message with a byte array payload opaque to Flume and optional headers used for contextual routing.
Client: A client produces and transmits events. A client decouples Flume from the data consumers. It is an entity that generates events and sends them to one or more agents. A custom client, a Flume log4j appender, or an embedded application agent can be a client.
Agent: An agent is a container hosting sources, channels, sinks, and other elements that enable the transportation of events from one place to another. It provides configuration, life-cycle management, and monitoring for hosted components. An agent is a physical Java virtual machine running Flume.
Source: A source is the entity through which Flume receives events. Sources require at least one channel to function, in order to either actively poll data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
Sink: A sink is the entity that drains data from the channel and delivers it to the next destination. A variety of sinks allow data to be streamed to a range of destinations. Sinks support serialization to the user's format. One example is the HDFS sink that writes events to HDFS.
Channel: A channel is the conduit between the source and the sink that buffers incoming events until drained by sinks. Sources feed events into the channel and the sinks drain the channel. Channels decouple the impedance of upstream and downstream systems. A burst of data upstream is damped by the channels. Failures downstream are transparently absorbed by the channels. Sizing the channel capacity to cope with these events is key to realizing these benefits. Channels offer two levels of persistence: either a memory channel, which is volatile if the JVM crashes, or a file channel backed by a Write Ahead Log that stores the information to disk. Channels are fully transactional.

Let's illustrate all these concepts:
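A channel's damping role can be sketched with a bounded in-memory queue in plain Python (standing in conceptually for a Flume memory channel, not using any Flume API): the source pushes a burst, and the sink drains at its own pace, so neither blocks the other as long as the channel is sized for the burst:

```python
from queue import Queue

channel = Queue(maxsize=100)   # capacity sized for the expected burst

def source_put(events):
    for e in events:
        channel.put(e)         # would block only if the channel overflowed

def sink_drain():
    drained = []
    while not channel.empty():
        drained.append(channel.get())
    return drained

source_put(['evt-%d' % i for i in range(5)])   # upstream burst
print(sink_drain())  # → ['evt-0', 'evt-1', 'evt-2', 'evt-3', 'evt-4']
```

Choosing the channel capacity is the sizing decision mentioned above: too small and upstream bursts block the source; large enough and downstream hiccups are absorbed transparently.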
Developing data pipelines with Flume, Kafka, and Spark

Building a resilient data pipeline leverages the learnings from the previous sections. We are plumbing together data ingestion and transport with Flume, data brokerage with a reliable and sophisticated publish and subscribe messaging system such as Kafka, and finally process computation on the fly using Spark Streaming.

The following diagram illustrates the composition of streaming data pipelines as a sequence of connect, collect, conduct, compose, consume, consign, and control activities. These activities are configurable based on the use case:

Connect establishes the binding with the streaming API. Collect creates collection threads. Conduct decouples the data producers from the consumers by creating a buffer queue or publish-subscribe mechanism. Compose is focused on processing the data. Consume provisions the processed data for the consuming systems. Consign takes care of the data persistence. Control caters to the governance and monitoring of the systems, data, and applications.

The following diagram illustrates the concepts of the streaming data pipelines with their key components: Spark Streaming, Kafka, Flume, and low-latency databases. In the consuming or controlling applications, we are monitoring our systems in real time (depicted by a monitor) or sending real-time alerts (depicted by red lights) in case certain thresholds are crossed.

The following diagram illustrates Spark's unique ability to process, in a single platform, data in motion and data at rest while seamlessly interfacing with multiple persistence data stores as per the use case requirement.

This diagram brings into one unified whole all the concepts discussed up to now. The top part describes the streaming processing pipeline. The bottom part describes the batch processing pipeline. They both share a common persistence layer in the middle of the diagram depicting the various modes of persistence and serialization.
Closing remarks on the Lambda and Kappa architecture

Two architecture paradigms are currently in vogue: the Lambda and Kappa architectures.

Lambda is the brainchild of the Storm creator and main committer, Nathan Marz. It essentially advocates building a functional architecture on all data. The architecture has two branches. The first is a batch arm envisioned to be powered by Hadoop, where historical, high-latency, high-throughput data are pre-processed and made ready for consumption. The real-time arm is envisioned to be powered by Storm, and it processes incrementally streaming data, derives insights on the fly, and feeds aggregated information back to the batch storage.

Kappa is the brainchild of one of the main committers of Kafka, Jay Kreps, and his colleagues at Confluent (previously at LinkedIn). It advocates a full streaming pipeline, effectively implementing, at the enterprise level, the unified log announced in the previous pages.
Understanding Lambda architecture

The Lambda architecture combines batch and streaming data to provide a unified query mechanism on all available data. The Lambda architecture envisions three layers: a batch layer where precomputed information is stored, a speed layer where real-time incremental information is processed as data streams, and finally a serving layer that merges batch and real-time views for ad hoc queries. The following diagram gives an overview of the Lambda architecture:
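The serving layer's merge of the two views can be sketched as follows. This is an illustrative sketch, assuming both views map keys to counts and the speed layer only holds the increments accumulated since the last batch run:

```python
def serve(batch_view, speed_view):
    """Serving layer: merge precomputed batch counts with
    real-time increments from the speed layer."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {'#spark': 120, '#kafka': 45}   # recomputed by the batch arm
speed_view = {'#spark': 3, '#flume': 1}      # increments since the last batch run
print(serve(batch_view, speed_view))
```

When the batch arm finishes a recomputation, the speed view for the covered period is discarded, which is how the architecture bounds the accumulation of approximation error in the real-time arm.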
Understanding Kappa architecture

The Kappa architecture proposes to drive the full enterprise in streaming mode. The Kappa architecture arose from a critique by Jay Kreps and his colleagues, at LinkedIn at the time. Since then, they moved on and created Confluent, with Apache Kafka as the main enabler of the Kappa architecture vision. The basic tenet is to move to an all-streaming mode with a Unified Log as the main backbone of the enterprise information architecture.

A Unified Log is a centralized enterprise structured log available for real-time subscription. All the organization's data is put in a central log for subscription. Records are numbered beginning with zero in the order that they are written. It is also known as a commit log or journal. The concept of the Unified Log is the central tenet of the Kappa architecture.
The properties of the unified log are as follows:

Unified: There is a single deployment for the entire organization
Append only: Events are immutable and are appended
Ordered: Each event has a unique offset within a shard
Distributed: For fault tolerance purposes, the unified log is distributed redundantly on a cluster of computers
Fast: The system ingests thousands of messages per second
The following screenshot captures the moment Jay Kreps announced his reservations about the Lambda architecture. His main reservation about the Lambda architecture is implementing the same job in two different systems, Hadoop and Storm, with each of their specific idiosyncrasies, and with all the complexities that come along with it. The Kappa architecture processes the real-time data and reprocesses historical data in the same framework powered by Apache Kafka.
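Reprocessing in the Kappa style amounts to replaying the retained log from offset zero through a new version of the streaming job, as in this toy sketch (the event names and job logic are invented for illustration):

```python
log = ['click', 'click', 'buy', 'click']   # the retained event log

def job_v1(events):
    # the original streaming job: count clicks only
    return sum(1 for e in events if e == 'click')

def job_v2(events):
    # a corrected job: weigh purchases double
    return sum(2 if e == 'buy' else 1 for e in events)

# Live processing ran with v1; reprocessing history with v2 is just
# replaying the same log from offset 0 -- no separate batch system.
print(job_v1(log), job_v2(log))  # → 3 5
```

Once the v2 replay catches up with the head of the log, the v1 output is simply retired, which is the Kappa answer to the Lambda batch arm.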
Summary

In this chapter, we laid out the foundations of streaming architecture apps and described their challenges, constraints, and benefits. We went under the hood and examined the inner workings of Spark Streaming and how it fits with Spark Core and dialogues with Spark SQL and Spark MLlib. We illustrated the streaming concepts with TCP sockets, followed by live tweet ingestion and processing directly from the Twitter firehose. We discussed the notions of decoupling upstream data publishing from downstream data subscription and consumption using Kafka in order to maximize the resilience of the overall streaming architecture. We also discussed Flume, a reliable, flexible, and scalable data ingestion and transport pipeline system. The combination of Flume, Kafka, and Spark delivers unparalleled robustness, speed, and agility in an ever changing landscape. We closed the chapter with some remarks and observations on two streaming architectural paradigms, the Lambda and Kappa architectures.

The Lambda architecture combines batch and streaming data in a common query front-end. It was envisioned with Hadoop and Storm in mind initially. Spark has its own batch and streaming paradigms, and it offers a single environment with a common code base to effectively bring this architecture paradigm to life.

The Kappa architecture promulgates the concept of the unified log, which creates an event-oriented architecture where all events in the enterprise are channeled in a centralized commit log that is available to all consuming systems in real time.

We are now ready for the visualization of the data collected and processed so far.
Chapter 6. Visualizing Insights and Trends

So far, we have focused on the collection, analysis, and processing of data from Twitter. We have set the stage to use our data for visual rendering and extracting insights and trends. We will give a quick lay of the land about visualization tools in the Python ecosystem. We will highlight Bokeh as a powerful tool for rendering and viewing large datasets. Bokeh is part of the Python Anaconda Distribution ecosystem.

In this chapter, we will cover the following points:

Gauging the key words and memes within a social network community using charts and word clouds
Mapping the most active locations where communities are growing around certain themes or topics
Revisiting the data-intensive apps architecture

We have reached the final layer of the data-intensive apps architecture: the engagement layer. This layer focuses on how to synthesize, emphasize, and visualize the key context-relevant information for the data consumers. A bunch of numbers in a console will not suffice to engage with end users. It is critical to present the mass of information in a rapid, digestible, and attractive fashion.

The following diagram sets the context of the chapter's focus, highlighting the engagement layer.
For Python plotting and visualizations, we have quite a few tools and libraries. The most interesting and relevant ones for our purpose are the following:

Matplotlib is the grandfather of the Python plotting libraries. Matplotlib was originally the brainchild of John Hunter, who was an open source software proponent and established Matplotlib as one of the most prevalent plotting libraries both in the academic and the data scientific communities. Matplotlib allows the generation of plots, histograms, power spectra, bar charts, error charts, scatter plots, and so on. Examples can be found on the Matplotlib dedicated website at http://matplotlib.org/examples/index.html.
Seaborn, developed by Michael Waskom, is a great library to quickly visualize statistical information. It is built on top of Matplotlib and integrates seamlessly with Pandas and the Python data stack, including Numpy. A gallery of graphs from Seaborn at http://stanford.edu/~mwaskom/software/seaborn/examples/index.html shows the potential of the library.
ggplot is relatively new and aims to offer the equivalent of the famous ggplot2 from the R ecosystem for the Python data wranglers. It has the same look and feel as ggplot2 and uses the same grammar of graphics as expounded by Hadley Wickham. The ggplot Python port is developed by the team at yhat. More information can be found at http://ggplot.yhathq.com.
D3.js is a very popular JavaScript library developed by Mike Bostock. D3 stands for Data Driven Documents and brings data to life on any modern browser leveraging HTML, SVG, and CSS. It delivers dynamic, powerful, interactive visualizations by manipulating the DOM, the Document Object Model. The Python community could not wait to integrate D3 with Matplotlib. Under the impulse of Jake Vanderplas, mpld3 was created with the aim of bringing Matplotlib to the browser. Example graphics are hosted at the following address: http://mpld3.github.io/index.html.
Bokeh aims to deliver high-performance interactivity over very large or streaming datasets whilst leveraging lots of the concepts of D3.js without the burden of writing some intimidating JavaScript and CSS code. Bokeh delivers dynamic visualizations on the browser with or without a server. It integrates seamlessly with Matplotlib, Seaborn, and ggplot, and renders beautifully in IPython notebooks or Jupyter notebooks. Bokeh is actively developed by the team at Continuum.io and is an integral part of the Anaconda Python data stack.

Bokeh server provides a full-fledged, dynamic plotting engine that materializes a reactive scene graph from JSON. It uses web sockets to keep state and update the HTML5 canvas using Backbone.js and Coffee-script under the hood. Bokeh, as it is fueled by data in JSON, creates easy bindings for other languages such as R, Scala, and Julia.

This gives a high-level overview of the main plotting and visualization libraries. It is not exhaustive. Let's move to concrete examples of visualizations.
Preprocessing the data for visualization

Before jumping into the visualizations, we will do some preparatory work on the data harvested:

In [16]:
# Read harvested data stored in csv in a Pandas DF
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')
In [20]:
print('tweets pandas dataframe - count:', pddf_in.count())
print('tweets pandas dataframe - shape:', pddf_in.shape)
print('tweets pandas dataframe - colns:', pddf_in.columns)
('tweets pandas dataframe - count:', Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64)
('tweets pandas dataframe - shape:', (7540, 6))
('tweets pandas dataframe - colns:', Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object'))
For the purpose of our visualization activity, we will use a dataset of 7,540 tweets. The key information is stored in the tweet_text column. We preview the data stored in the dataframe by calling the head() function on the dataframe:

In [21]:
pddf_in.head()
Out[21]:
   Unnamed: 0                  id                      created_at     user_id      user_name                                         tweet_text
0           0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1           1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2           2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3           3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4           4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...
We will now create some utility functions to clean up the tweet text and parse the Twitter date. First, we import the Python regular expression library re and the time library to parse dates and times:

In [72]:
import re
import time

We create a dictionary of regexes that will be compiled and then passed as functions:

RT: The first regex with key RT looks for the keyword RT at the beginning of the tweet text:
re.compile(r'^RT'),

ALNUM: The second regex with key ALNUM looks for words including alphanumeric characters and the underscore sign, preceded by the @ symbol, in the tweet text:
re.compile(r'(@[a-zA-Z0-9_]+)'),

HASHTAG: The third regex with key HASHTAG looks for words including alphanumeric characters preceded by the # symbol in the tweet text:
re.compile(r'(#[\w\d]+)'),

SPACES: The fourth regex with key SPACES looks for blank or line space characters in the tweet text:
re.compile(r'\s+'),

URL: The fifth regex with key URL looks for URL addresses, including alphanumeric characters preceded by the https:// or http:// markers, in the tweet text (note that the leading square brackets form a character class, so this pattern only loosely matches the URL scheme):
re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')

In [24]:
regexp = {"RT": "^RT", "ALNUM": r"(@[a-zA-Z0-9_]+)",
          "HASHTAG": r"(#[\w\d]+)",
          "URL": r"([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)",
          "SPACES": r"\s+"}
regexp = dict((key, re.compile(value)) for key, value in regexp.items())
In [25]:
regexp
Out[25]:
{'ALNUM': re.compile(r'(@[a-zA-Z0-9_]+)'),
 'HASHTAG': re.compile(r'(#[\w\d]+)'),
 'RT': re.compile(r'^RT'),
 'SPACES': re.compile(r'\s+'),
 'URL': re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')}
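As a quick sanity check, we can try the compiled patterns on a made-up tweet. The sample text below is purely illustrative and not taken from the harvested dataset:

```python
import re

# Hypothetical sample tweet, for illustration only
sample = "RT @ApacheSpark: Real-time streaming with #Spark http://t.co/abc123"

regexp = {"RT": r"^RT", "ALNUM": r"(@[a-zA-Z0-9_]+)", "HASHTAG": r"(#[\w\d]+)"}
regexp = dict((key, re.compile(value)) for key, value in regexp.items())

print(re.search(regexp["RT"], sample) is not None)   # True: the tweet is a retweet
print(re.findall(regexp["ALNUM"], sample))           # ['@ApacheSpark']
print(re.findall(regexp["HASHTAG"], sample))         # ['#Spark']
```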
We create a utility function to identify whether a tweet is a retweet or an original tweet:

In [77]:
def getAttributeRT(tweet):
    """see if tweet is a RT"""
    return re.search(regexp["RT"], tweet.strip()) != None

Then, we extract all user handles in a tweet:

def getUserHandles(tweet):
    """given a tweet we try and extract all user handles"""
    return re.findall(regexp["ALNUM"], tweet)

We also extract all hashtags in a tweet:

def getHashtags(tweet):
    """return all hashtags"""
    return re.findall(regexp["HASHTAG"], tweet)

Extract all URL links in a tweet as follows:

def getURLs(tweet):
    """URL: [http://]?[\w\.?/]+"""
    return re.findall(regexp["URL"], tweet)

We strip all URL links and user handles preceded by the @ sign from the tweet text. This function will be the basis of the word cloud we will build soon:

def getTextNoURLsUsers(tweet):
    """return parsed text terms stripped of URLs and user names in tweet text
    ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x).split())"""
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|(RT)", " ", tweet).lower().split())

We label the data so we can create groups of datasets for the word cloud:

def setTag(tweet):
    """set tags to tweet_text based on search terms from tags_list"""
    tags_list = ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']
    lower_text = tweet.lower()
    return filter(lambda x: x.lower() in lower_text, tags_list)

We parse the Twitter date into the yyyy-mm-dd hh:mm:ss format:

def decode_date(s):
    """parse Twitter date into format yyyy-mm-dd hh:mm:ss"""
    return time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(s, '%a %b %d %H:%M:%S +0000 %Y'))
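As a quick illustration of what decode_date does, the Twitter timestamp format seen in the created_at column round-trips through time.strptime and time.strftime like this:

```python
import time

def decode_date(s):
    """Parse a Twitter date into the yyyy-mm-dd hh:mm:ss format."""
    return time.strftime('%Y-%m-%d %H:%M:%S',
                         time.strptime(s, '%a %b %d %H:%M:%S +0000 %Y'))

# Timestamp taken from the dataframe preview above
print(decode_date('Tue Sep 01 21:46:57 +0000 2015'))  # 2015-09-01 21:46:57
```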
We preview the data prior to processing:

In [43]:
pddf_in.columns
Out[43]:
Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object')
In [45]:
# df.drop([Column Name or list], inplace=True, axis=1)
pddf_in.drop(['Unnamed: 0'], inplace=True, axis=1)
In [46]:
pddf_in.head()
Out[46]:
                   id                      created_at     user_id      user_name                                         tweet_text
0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...
We create new dataframe columns by applying the utility functions described above: one column each for the hashtags, the user handles, the URLs, the text terms stripped of URLs and unwanted characters, and the labels. Finally, we parse the date:

In [82]:
pddf_in['htag'] = pddf_in.tweet_text.apply(getHashtags)
pddf_in['user_handles'] = pddf_in.tweet_text.apply(getUserHandles)
pddf_in['urls'] = pddf_in.tweet_text.apply(getURLs)
pddf_in['txt_terms'] = pddf_in.tweet_text.apply(getTextNoURLsUsers)
pddf_in['search_grp'] = pddf_in.tweet_text.apply(setTag)
pddf_in['date'] = pddf_in.created_at.apply(decode_date)
The following code gives a quick snapshot of the newly generated dataframe:

In [83]:
pddf_in[2200:2210]
Out[83]:
      id  created_at  user_id  user_name  tweet_text  htag  urls  ptxt  tgrp  date  user_handles  txt_terms  search_grp
2200  638242693374681088  Mon Aug 31 06:51:30 +0000 2015  19525954  CENATIC  El impacto de @ApacheSpark en el procesamiento...  [#sparkSpecial]  [://t.co/4PQmJNuEJB]  el impacto de en el procesamiento de datos y e...  [spark]  2015-08-31 06:51:30  [@ApacheSpark]  el impacto de en el procesamiento de datos y e...  [spark]
2201  638238014695575552  Mon Aug 31 06:32:55 +0000 2015  51115854  Nawfal  Real Time Streaming with Apache Spark\nhttp://...  [#IoT, #SmartMelboune, #BigData, #Apachespark]  [://t.co/GW5PaqwVab]  real time streaming with apache spark iot smar...  [spark]  2015-08-31 06:32:55  []  real time streaming with apache spark iot smar...  [spark]
2202  638236084124516352  Mon Aug 31 06:25:14 +0000 2015  62885987  Mithun Katti  RT @differentsachin: Spark the flame of digita...  [#IBMHackathon, #SparkHackathon, #ISLconnectIN...  []  spark the flame of digital india ibm hackathon...  [spark]  2015-08-31 06:25:14  [@differentsachin, @ApacheSpark]  spark the flame of digital india ibm hackathon...  [spark]
2203  638234734649176064  Mon Aug 31 06:19:53 +0000 2015  140462395  solaimurugan v  Installing @ApacheMahout with @ApacheSpark 1.4...  []  [1.4.1, ://t.co/3c5dGbfaZe.]  installing with 141 got many more issue whil...  [spark]  2015-08-31 06:19:53  [@ApacheMahout, @ApacheSpark]  installing with 141 got many more issue whil...  [spark]
2204  638233517307072512  Mon Aug 31 06:15:02 +0000 2015  2428473836  Ralf Heineke  RT @RomeoKienzler: Join me @velocityconf on #m...  [#machinelearning, #devOps, #Bl]  [://t.co/U5xL7pYEmF]  join me on machine learning based devops operat...  [spark]  2015-08-31 06:15:02  [@RomeoKienzler, @velocityconf, @ApacheSpark]  join me on machine learning based devops operat...  [spark]
2205  638230184848687106  Mon Aug 31 06:01:48 +0000 2015  289355748  Akim Boyko  RT @databricks: Watch live today at 10am PT is...  []  [1.5, ://t.co/16cix6ASti]  watch live today at 10am pt is 15 presented b...  [spark]  2015-08-31 06:01:48  [@databricks, @ApacheSpark, @databricks, @pwen...  watch live today at 10am pt is 15 presented b...  [spark]
2206  638227830443110400  Mon Aug 31 05:52:27 +0000 2015  145001241  sachin aggarwal  Spark the flame of digital India @ #IBMHackath...  [#IBMHackathon, #SparkHackathon, #ISLconnectIN...  [://t.co/C1AO3uNexe]  spark the flame of digital india ibm hackathon...  [spark]  2015-08-31 05:52:27  [@ApacheSpark]  spark the flame of digital india ibm hackathon...  [spark]
2207  638227031268810752  Mon Aug 31 05:49:16 +0000 2015  145001241  sachin aggarwal  RT @pravin_gadakh: Imagine, innovate and Igni...  [#IBMHackathon, #ISLconnectIN2015]  []  gadakh imagine innovate and ignite digital ind...  [spark]  2015-08-31 05:49:16  [@pravin_gadakh, @ApacheSpark]  gadakh imagine innovate and ignite digital ind...  [spark]
2208  638224591920336896  Mon Aug 31 05:39:35 +0000 2015  494725634  IBM Asia Pacific  RT @sachinparmar: Passionate about Spark?? Hav...  [#IBMHackathon, #ISLconnectIN]  [India..]  passionate about spark have dreams of clean sa...  [spark]  2015-08-31 05:39:35  [@sachinparmar]  passionate about spark have dreams of clean sa...  [spark]
2209  638223327467692032  Mon Aug 31 05:34:33 +0000 2015  3158070968  Open Source India  "Game Changer" #ApacheSpark speeds up #bigdata...  [#ApacheSpark, #bigdata]  [://t.co/ieTQ9ocMim]  game changer apache spark speeds up big data pro...  [spark]  2015-08-31 05:34:33  []  game changer apache spark speeds up big data pro...  [spark]
We save the processed information in CSV format. We have 7,540 records and 13 columns. In your case, the output will vary according to the dataset you chose:

In [84]:
f_name = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweets_processed.csv'
pddf_in.to_csv(f_name, sep=';', encoding='utf-8', index=False)
In [85]:
pddf_in.shape
Out[85]:
(7540, 13)
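One caveat worth noting: when the list-valued columns (htag, urls, user_handles, and so on) are written to CSV, pandas stores them as their string representation. A minimal sketch, assuming a stored cell looks like the previews above, of how the standard library's ast.literal_eval can restore the lists on reload:

```python
import ast

# Hypothetical cell value as it would appear in the saved CSV
cell = "['#IBMHackathon', '#SparkHackathon']"

tags = ast.literal_eval(cell)
print(tags)     # ['#IBMHackathon', '#SparkHackathon']
print(tags[0])  # #IBMHackathon
```

On reload, such a conversion can be applied column-wise, for example with pd.read_csv(..., converters={'htag': ast.literal_eval}).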
Gauging words, moods, and memes at a glance

We are now ready to proceed with building the word clouds, which will give us a sense of the important words carried in those tweets. We will create word clouds for the datasets harvested. Word clouds extract the top words in a list of words and create a scatter plot of the words, where the size of each word is correlated with its frequency: the more frequent a word is in the dataset, the bigger the font size in the word cloud rendering. The datasets include three very different themes and two competing or analogous entities each. Our first theme is obviously data processing and analytics, with Apache Spark and Python as our entities. Our second theme is the 2016 presidential election campaign, with the two contenders: Hilary Clinton and Donald Trump. Our last theme is the world of pop music, with Justin Bieber and Lady Gaga as the two exponents.
Setting up wordcloud

We will illustrate the programming steps by analyzing the Spark-related tweets. We load the data and preview the dataframe:

In [21]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets.csv'
tspark_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')
In [3]:
tspark_df.head(3)
Out[3]:
   id  created_at  user_id  user_name  tweet_text  htag  urls  ptxt  tgrp  date  user_handles  txt_terms  search_grp
0  638818911773856000  Tue Sep 01 21:01:11 +0000 2015  2511247075  Noor Din  RT @kdnuggets: R leads RapidMiner, Python catc...  [#KDN]  [://t.co/3bsaTT7eUs]  r leads rapidminer python catches up big data...  [spark, python]  2015-09-01 21:01:11  [@kdnuggets]  r leads rapidminer python catches up big data...  [spark, python]
1  622142176768737000  Fri Jul 17 20:33:48 +0000 2015  24537879  IBM Cloudant  Be one of the first to sign-up for IBM Analyti...  [#ApacheSpark, #SparkInsight]  [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]  be one of the first to sign up for ibm analyti...  [spark]  2015-07-17 20:33:48  []  be one of the first to sign up for ibm analyti...  [spark]
2  622140453069169000  Fri Jul 17 20:26:57 +0000 2015  515145898  Arno Candel  Nice article on #apachespark, #hadoop and #dat...  [#apachespark, #hadoop, #datascience]  [://t.co/IyF44pV0f3]  nice article on apachespark hadoop and data sci...  [spark]  2015-07-17 20:26:57  [@h2oai]  nice article on apachespark hadoop and data sci...  [spark]
Note: The wordcloud library we will use is the one developed by Andreas Mueller and hosted on his GitHub account at https://github.com/amueller/word_cloud.

The library requires PIL (short for Python Imaging Library). PIL is easily installable by invoking conda install pil. PIL is a complex library to install and has not yet been ported to Python 3.4, so we need to run a Python 2.7+ environment to be able to see our word cloud:

#
# Install PIL (does not work with Python 3.4)
#
an@an-VB:~$ conda install pil

Fetching package metadata: ....
Solving package specifications: ..................
Package plan for installation in environment /home/an/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libpng-1.6.17              |                0         214 KB
    freetype-2.5.5             |                0         2.2 MB
    conda-env-2.4.4            |           py27_0          24 KB
    pil-1.1.7                  |           py27_2         650 KB
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following packages will be UPDATED:

    conda-env: 2.4.2-py27_0 --> 2.4.4-py27_0
    freetype:  2.5.2-0      --> 2.5.5-0
    libpng:    1.5.13-1     --> 1.6.17-0
    pil:       1.1.7-py27_1 --> 1.1.7-py27_2

Proceed ([y]/n)? y
Next, we install the wordcloud library:

#
# Install wordcloud
# Andreas Mueller
# https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py
#
an@an-VB:~$ pip install wordcloud
Collecting wordcloud
  Downloading wordcloud-1.1.3.tar.gz (163kB)
    100% |████████████████████████████████| 163kB 548kB/s
Building wheels for collected packages: wordcloud
  Running setup.py bdist_wheel for wordcloud
  Stored in directory: /home/an/.cache/pip/wheels/32/a9/74/58e379e5dc614bfd9dd9832d67608faac9b2bc6c194d6f6df5
Successfully built wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.1.3
Creating wordclouds

At this stage, we are ready to invoke the word cloud program with the generated list of terms from the tweet text. Let's get started with the word cloud program by first calling %matplotlib inline to display the word cloud in our notebook:

In [4]:
%matplotlib inline

We convert the dataframe txt_terms column into a list of words. We make sure it is all converted into the str type to avoid any bad surprises, and check the list's first four records:

In [11]:
len(tspark_df['txt_terms'].tolist())
Out[11]:
2024
In [22]:
tspark_ls_str = [str(t) for t in tspark_df['txt_terms'].tolist()]
In [14]:
len(tspark_ls_str)
Out[14]:
2024
In [15]:
tspark_ls_str[:4]
Out[15]:
['r leads rapidminer python catches up big data tools grow spark ignites kdn',
 'be one of the first to sign up for ibm analytics for apache spark today sparkinsight',
 'nice article on apache spark hadoop and data science',
 'spark 101 running spark and mapreduce together in production hadoopsummit2015 apache spark altiscale']
We first call the Matplotlib and the wordcloud libraries:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

From the input list of terms, we create a unified string of terms separated by whitespace as the input to the word cloud program. The word cloud program removes stopwords:

# join tweets to a single string
words = ' '.join(tspark_ls_str)

# create wordcloud
wordcloud = WordCloud(
    # remove stopwords
    stopwords=STOPWORDS,
    background_color='black',
    width=1800,
    height=1400
).generate(words)

# render wordcloud image
plt.imshow(wordcloud)
plt.axis('off')

# save wordcloud image on disk
plt.savefig('./spark_tweets_wordcloud_1.png', dpi=300)

# display image in Jupyter notebook
plt.show()
Here, we can visualize the word clouds for Apache Spark and Python. Clearly, in the case of Spark, Hadoop, big data, and analytics are the memes, while Python recalls the root of its name, Monty Python, with a strong focus on developer, apache spark, and programming, with some hints at java and ruby.

We can also get a glimpse in the following word clouds of the words preoccupying the North American 2016 presidential election candidates: Hilary Clinton and Donald Trump. Seemingly, Hilary Clinton is overshadowed by the presence of her opponents Donald Trump and Bernie Sanders, while Trump is heavily centered only on himself.

Interestingly, in the case of Justin Bieber and Lady Gaga, the word love appears. In the case of Bieber, follow and belieber are key words, while diet, weight loss, and fashion are the preoccupations of the Lady Gaga crowd.
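The word frequencies behind a word cloud can also be inspected without rendering anything. A minimal sketch using the standard library's collections.Counter on a few of the cleaned tweet-term strings shown earlier (the list is abbreviated here purely for illustration):

```python
from collections import Counter

# A few cleaned tweet-term strings, mirroring the txt_terms column above
tweets = ['r leads rapidminer python catches up big data',
          'nice article on apache spark hadoop and data science',
          'spark 101 running spark and mapreduce together in production']

# Tokenize, drop very short tokens, and count occurrences
counts = Counter(w for t in tweets for w in t.split() if len(w) > 2)
print(counts.most_common(1))  # [('spark', 3)]
```

This is essentially what the word cloud does before mapping counts to font sizes, minus the stopword filtering handled by the STOPWORDS set.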
Geo-locating tweets and mapping meetups

Now, we will dive into the creation of interactive maps with Bokeh. First, we create a world map where we geo-locate sample tweets and, on moving our mouse over these locations, we can see the users and their respective tweets in a hover box.

The second map is focused on mapping upcoming meetups in London. It could be an interactive map that would act as a reminder of the date, time, and location of upcoming meetups in a specific city.

Geo-locating tweets

The objective is to create a world map scatter plot of the locations of important tweets, where the tweets and authors are revealed on hovering over these points. We will go through three steps to build this interactive visualization:

1. Create the background world map by first loading a dictionary of all the world country boundaries, defined by their respective longitudes and latitudes.
2. Load the important tweets we wish to geo-locate, with their respective coordinates and authors.
3. Finally, scatter plot the tweet coordinates on the world map and activate the hover tool to interactively visualize the tweets and authors on the highlighted dots on the map.

In step one, we create a Python dictionary called data that will contain all the world country boundaries with their respective latitudes and longitudes:
In [4]:
#
# This module exposes geometry data for World Country Boundaries.
#
import csv
import codecs
import gzip
import xml.etree.cElementTree as et
import os
from os.path import dirname, join

nan = float('NaN')
__file__ = os.getcwd()

data = {}
with gzip.open(join(dirname(__file__), 'AN_Spark/data/World_Country_Boundaries.csv.gz')) as f:
    decoded = codecs.iterdecode(f, "utf-8")
    next(decoded)
    reader = csv.reader(decoded, delimiter=',', quotechar='"')
    for row in reader:
        geometry, code, name = row
        xml = et.fromstring(geometry)
        lats = []
        lons = []
        for i, poly in enumerate(xml.findall('.//outerBoundaryIs/LinearRing/coordinates')):
            if i > 0:
                lats.append(nan)
                lons.append(nan)
            coords = (c.split(',')[:2] for c in poly.text.split())
            lat, lon = list(zip(*[(float(lat), float(lon)) for lon, lat in coords]))
            lats.extend(lat)
            lons.extend(lon)
        data[code] = {
            'name': name,
            'lats': lats,
            'lons': lons,
        }
In [5]:
len(data)
Out[5]:
235
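The KML-style parsing inside the loop above can be illustrated on a single polygon. The geometry string below is a hypothetical fragment, not taken from the actual boundaries dataset:

```python
import xml.etree.ElementTree as et  # plain ElementTree behaves the same as cElementTree here

# Hypothetical KML fragment for one country polygon
geometry = """<Polygon><outerBoundaryIs><LinearRing>
<coordinates>-0.1,51.5,0 -0.2,51.6,0 -0.3,51.4,0</coordinates>
</LinearRing></outerBoundaryIs></Polygon>"""

xml = et.fromstring(geometry)
poly = xml.find('.//outerBoundaryIs/LinearRing/coordinates')
# Each whitespace-separated token is "lon,lat,alt"; keep only lon and lat
coords = (c.split(',')[:2] for c in poly.text.split())
lats, lons = zip(*[(float(lat), float(lon)) for lon, lat in coords])
print(lats)  # (51.5, 51.6, 51.4)
print(lons)  # (-0.1, -0.2, -0.3)
```

Note that KML stores coordinates as longitude first, which is why the loop unpacks each pair as lon, lat before swapping them into the lats and lons lists.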
In step two, we load a sample set of important tweets that we wish to visualize, with their respective geo-location information:

In [8]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets_20.csv'
t20_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')
In [9]:
t20_df.head(3)
Out[9]:
   id  created_at  user_id  user_name  tweet_text  htag  urls  ptxt  tgrp  date  user_handles  txt_terms  search_grp  lat  lon
0  638818911773856000  Tue Sep 01 21:01:11 +0000 2015  2511247075  Noor Din  RT @kdnuggets: R leads RapidMiner, Python catc...  [#KDN]  [://t.co/3bsaTT7eUs]  r leads rapidminer python catches up big data...  [spark, python]  2015-09-01 21:01:11  [@kdnuggets]  r leads rapidminer python catches up big data...  [spark, python]  37.279518  -121.867905
1  622142176768737000  Fri Jul 17 20:33:48 +0000 2015  24537879  IBM Cloudant  Be one of the first to sign-up for IBM Analyti...  [#ApacheSpark, #SparkInsight]  [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]  be one of the first to sign up for ibm analyti...  [spark]  2015-07-17 20:33:48  []  be one of the first to sign up for ibm analyti...  [spark]  37.774930  -122.419420
2  622140453069169000  Fri Jul 17 20:26:57 +0000 2015  515145898  Arno Candel  Nice article on #apachespark, #hadoop and #dat...  [#apachespark, #hadoop, #datascience]  [://t.co/IyF44pV0f3]  nice article on apachespark hadoop and data sci...  [spark]  2015-07-17 20:26:57  [@h2oai]  nice article on apachespark hadoop and data sci...  [spark]  51.500130  -0.126305
In [98]:
len(t20_df.user_id.unique())
Out[98]:
19
In [17]:
t20_geo = t20_df[['date', 'lat', 'lon', 'user_name', 'tweet_text']]
In [24]:
t20_geo.rename(columns={'user_name': 'user', 'tweet_text': 'text'}, inplace=True)
In [25]:
t20_geo.head(4)
Out[25]:
                  date        lat         lon                user                                               text
0  2015-09-01 21:01:11  37.279518 -121.867905            Noor Din  RT @kdnuggets: R leads RapidMiner, Python catc...
1  2015-07-17 20:33:48  37.774930 -122.419420        IBM Cloudant  Be one of the first to sign-up for IBM Analyti...
2  2015-07-17 20:26:57  51.500130   -0.126305         Arno Candel  Nice article on #apachespark, #hadoop and #dat...
3  2015-07-17 19:35:31  51.500130   -0.126305  Ira Michael Blonder  Spark 101: Running Spark and #MapReduce togeth...
In [22]:
df = t20_geo
In step three, we first import all the necessary Bokeh libraries. We will instantiate the output in the Jupyter Notebook. We get the world country boundary information loaded. We get the geo-located tweet data. We instantiate the Bokeh interactive tools, such as the wheel and box zoom as well as the hover tool:

In [29]:
#
# Bokeh Visualization of tweets on world map
#
from bokeh.plotting import *
from bokeh.models import HoverTool, ColumnDataSource
from collections import OrderedDict

# Output in Jupyter Notebook
output_notebook()

# Get the world map
world_countries = data.copy()

# Get the tweet data
tweets_source = ColumnDataSource(df)

# Create world map
countries_source = ColumnDataSource(data=dict(
    countries_xs=[world_countries[code]['lons'] for code in world_countries],
    countries_ys=[world_countries[code]['lats'] for code in world_countries],
    country=[world_countries[code]['name'] for code in world_countries],
))

# Instantiate the Bokeh interactive tools
TOOLS = "pan,wheel_zoom,box_zoom,reset,resize,hover,save"
We are now ready to layer the various elements gathered into an object figure called p. Define the title, width, and height of p. Attach the tools. Create the world map background from patches with a light background color and borders. Scatter plot the tweets according to their respective geo-coordinates. Then, activate the hover tool with the users and their respective tweets. Finally, render the picture in the browser. The code is as follows:

# Instantiate the figure object
p = figure(
    title="%s tweets" % (str(len(df.index))),
    title_text_font_size="20pt",
    plot_width=1000,
    plot_height=600,
    tools=TOOLS)

# Create world patches background
p.patches(xs="countries_xs", ys="countries_ys", source=countries_source,
          fill_color="#F1EEF6", fill_alpha=0.3,
          line_color="#999999", line_width=0.5)

# Scatter plots by longitude and latitude
p.scatter(x="lon", y="lat", source=tweets_source, fill_color="#FF0000",
          line_color="#FF0000")

# Activate hover tool with user and corresponding tweet information
hover = p.select(dict(type=HoverTool))
hover.point_policy = "follow_mouse"
hover.tooltips = OrderedDict([
    ("user", "@user"),
    ("tweet", "@text"),
])

# Render the figure on the browser
show(p)
BokehJS successfully loaded.
The following screenshot gives an overview of the world map, with the red dots representing the locations of the tweets' origins:
We can hover over a specific dot to reveal the tweets in that location:
We can zoom into a specific location:
Finally, we can reveal the tweets in the given zoomed-in location:
Displaying upcoming meetups on Google Maps

Now, our objective is to focus on upcoming meetups in London. We are mapping three meetups: Data Science London, Apache Spark, and Machine Learning. We embed a Google Map within a Bokeh visualization, geo-locate the three meetups according to their coordinates, and get information such as the name of the upcoming event for each meetup with a hover tool.

First, import all the necessary Bokeh libraries:

In []:
#
# Bokeh Google Map Visualization of London with hover on specific points
#
from __future__ import print_function
from bokeh.browserlib import view
from bokeh.document import Document
from bokeh.embed import file_html
from bokeh.models.glyphs import Circle
from bokeh.models import (
    GMapPlot, Range1d, ColumnDataSource,
    PanTool, WheelZoomTool, BoxSelectTool,
    HoverTool, ResetTool,
    BoxSelectionOverlay, GMapOptions)
from bokeh.resources import INLINE

x_range = Range1d()
y_range = Range1d()
We will instantiate the Google Map that will act as the substrate upon which our Bokeh visualization will be layered:

# JSON style string taken from: https://snazzymaps.com/style/1/pale-dawn
map_options = GMapOptions(lat=51.50013, lng=-0.126305, map_type="roadmap", zoom=13, styles="""
[{"featureType":"administrative","elementType":"all","stylers":[{"visibility":"on"},{"lightness":33}]},
 {"featureType":"landscape","elementType":"all","stylers":[{"color":"#f2e5d4"}]},
 {"featureType":"poi.park","elementType":"geometry","stylers":[{"color":"#c5dac6"}]},
 {"featureType":"poi.park","elementType":"labels","stylers":[{"visibility":"on"},{"lightness":20}]},
 {"featureType":"road","elementType":"all","stylers":[{"lightness":20}]},
 {"featureType":"road.highway","elementType":"geometry","stylers":[{"color":"#c5c6c6"}]},
 {"featureType":"road.arterial","elementType":"geometry","stylers":[{"color":"#e4d7c6"}]},
 {"featureType":"road.local","elementType":"geometry","stylers":[{"color":"#fbfaf7"}]},
 {"featureType":"water","elementType":"all","stylers":[{"visibility":"on"},{"color":"#acbcc9"}]}]
""")
Instantiate the Bokeh object plot from the class GMapPlot, with the dimensions and map options from the previous step:

# Instantiate Google Map Plot
plot = GMapPlot(
    x_range=x_range, y_range=y_range,
    map_options=map_options,
    title="London Meetups"
)

Bring in the information on the three meetups we wish to plot, to be revealed on hovering above the respective coordinates:

source = ColumnDataSource(
    data=dict(
        lat=[51.49013, 51.50013, 51.51013],
        lon=[-0.130305, -0.126305, -0.120305],
        fill=['orange', 'blue', 'green'],
        name=['London Data Science', 'Spark', 'Machine Learning'],
        text=['Graph Data & Algorithms', 'Spark Internals', 'Deep Learning on Spark']
    )
)

Define the dots to be drawn on the Google Map:

circle = Circle(x="lon", y="lat", size=15, fill_color="fill", line_color=None)
plot.add_glyph(source, circle)
Define the set of Bokeh tools to be used in this visualization:

# TOOLS = "pan,wheel_zoom,box_zoom,reset,hover,save"
pan = PanTool()
wheel_zoom = WheelZoomTool()
box_select = BoxSelectTool()
reset = ResetTool()
hover = HoverTool()
# save = SaveTool()

plot.add_tools(pan, wheel_zoom, box_select, reset, hover)
overlay = BoxSelectionOverlay(tool=box_select)
plot.add_layout(overlay)

Activate the hover tool with the information that will be carried:

hover = plot.select(dict(type=HoverTool))
hover.point_policy = "follow_mouse"
hover.tooltips = OrderedDict([
    ("Name", "@name"),
    ("Text", "@text"),
    ("(Long, Lat)", "(@lon, @lat)"),
])

show(plot)
Render the plot, which gives a pretty good view of London:
Once we hover over a highlighted dot, we can get the information for the given meetup:
Full smooth zooming capability is preserved, as the following screenshot shows:

Summary

In this chapter, we focused on a few visualization techniques. We saw how to build word clouds and their intuitive power to reveal, at a glance, many of the key words, moods, and memes carried through thousands of tweets.

We then discussed interactive mapping visualizations using Bokeh. We built a world map from the ground up and created a scatter plot of critical tweets. Once the map was rendered in the browser, we could interactively hover from dot to dot and reveal the tweets originating from different parts of the world.

Our final visualization focused on mapping upcoming meetups in London on Spark, data science, and machine learning, with their respective topics, making a beautiful interactive visualization with an actual Google Map.
IndexA
AmazonWebServices(AWS)apps,deployingwith/DeployingappsinAmazonWebServicesabout/DeployingappsinAmazonWebServices
Anacondadefining/UnderstandingAnaconda
AnacondaInstallerURL/InstallingAnacondawithPython2.7
AnacondastackAnaconda/UnderstandingAnacondaConda/UnderstandingAnacondaNumba/UnderstandingAnacondaBlaze/UnderstandingAnacondaBokeh/UnderstandingAnacondaWakari/UnderstandingAnaconda
analyticslayer/AnalyticslayerApacheKafka
about/SettingupKafkaproperties/SettingupKafka
ApacheSparkabout/DisplayingupcomingmeetupsonGoogleMaps
APIs(ApplicationProgrammingInterface)about/Connectingtosocialnetworks
apppreviewing/Previewingourapp
appsdeploying,withAmazonWebServices(AWS)/DeployingappsinAmazonWebServices
architecture,data-intensiveapplicationsabout/Understandingthearchitectureofdata-intensiveapplicationsinfrastructurelayer/Infrastructurelayerpersistencelayer/Persistencelayerintegrationlayer/Integrationlayeranalyticslayer/Analyticslayerengagementlayer/Engagementlayer
AsynchronousJavaScript(AJAX)about/ProcessinglivedatawithTCPsockets
AWSconsoleURL/DeployingappsinAmazonWebServices
BBigData,withApacheSpark
references/VirtualizingtheenvironmentwithVagrantBlaze
used,forexploringdata/ExploringdatausingBlazeBSON(BinaryJSON)
about/SettingupMongoDB
CCatalyst
about/ExploringdatausingSparkSQLChef
about/InfrastructurelayerClustering
K-Means/SupervisedandunsupervisedlearningGaussianMixture/SupervisedandunsupervisedlearningPowerIterationClustering(PIC)/SupervisedandunsupervisedlearningLatentDirichletAllocation(LDA)/Supervisedandunsupervisedlearning
Clustermanagerabout/TheResilientDistributedDataset
comma-separatedvalues(CSV)about/Harvestingandstoringdata
ContinuumURL/UnderstandingAnaconda
Couchbaseabout/Persistencelayer
DD3.js
about/Revisitingthedata-intensiveappsarchitectureURL/Revisitingthedata-intensiveappsarchitecture
DAG(DirectedAcyclicGraph)about/TheResilientDistributedDataset,Serializinganddeserializingdata
dataserializing/Serializinganddeserializingdatadeserializing/Serializinganddeserializingdataharvesting/Harvestingandstoringdatastoring/Harvestingandstoringdatapersisting,inCSV/PersistingdatainCSVpersisting,inJSON/PersistingdatainJSONMongoDB,settingup/SettingupMongoDB,harvestingfromTwitter/HarvestingdatafromTwitterexploring,Blazeused/ExploringdatausingBlazetransferring,Odoused/TransferringdatausingOdoexploring,SparkSQLused/ExploringdatausingSparkSQLpre-processing,forvisualization/Preprocessingthedataforvisualization
data-intensiveappsarchitecting/Architectingdata-intensiveappslatency/Architectingdata-intensiveappsscalability/Architectingdata-intensiveappsfaulttolerance/Architectingdata-intensiveappsflexibility/Architectingdata-intensiveappsdataatrest,processing/Processingdataatrestdatainmotion,processing/Processingdatainmotiondata,exploring/Exploringdatainteractively
data-intensiveappsarchitectureabout/Revisitingthedata-intensiveappsarchitecture
dataanalysisdefining/AnalyzingthedataTweetsanatomy,discovering/Discoveringtheanatomyoftweets
DataDrivenDocuments(D3)about/Revisitingthedata-intensiveappsarchitecture
dataflowsabout/Machinelearningworkflowsanddataflows
dataintensiveappsarchitecturedefining/Revisitingthedata-intensiveapparchitecture
datalifecycleConnect/IntegrationlayerCorrect/IntegrationlayerCollect/Integrationlayer
Compose/IntegrationlayerConsume/IntegrationlayerControl/Integrationlayer
DataScienceLondonabout/DisplayingupcomingmeetupsonGoogleMaps
datatypes,SparkMLliblocalvector/SparkMLlibdatatypeslabeledpoint/SparkMLlibdatatypeslocalmatrix/SparkMLlibdatatypesdistributedmatrix/SparkMLlibdatatypes
DecisionTreesabout/Supervisedandunsupervisedlearning
DimensionalityReductionSingularValueDecomposition(SVD)/SupervisedandunsupervisedlearningPrincipalComponentAnalysis(PCA)/Supervisedandunsupervisedlearning
Dockerabout/Infrastructurelayerenvironment,virtualizingwith/VirtualizingtheenvironmentwithDockerreferences/VirtualizingtheenvironmentwithDocker
DStream(DiscretizedStream)defining/GoingunderthehoodofSparkStreaming
Eelements,Flume
Event/ExploringflumeClient/ExploringflumeSource/ExploringflumeSink/ExploringflumeChannel/Exploringflume
engagementlayer/EngagementlayerEnsemblesoftrees
about/Supervisedandunsupervisedlearningenvironment
virtualizing,withVagrant/VirtualizingtheenvironmentwithVagrantvirtualizing,withDocker/VirtualizingtheenvironmentwithDocker
FFirstApp
building,withPySpark/BuildingourfirstappwithPySparkFlume
about/Exploringflumeadvantages/Exploringflumeelements/Exploringflume
Gggplot
about/Revisitingthedata-intensiveappsarchitectureURL/Revisitingthedata-intensiveappsarchitecture
GitHubURL/GettingGitHubdataabout/ExploringtheGitHubworldoperating,withMeetupAPI/UnderstandingthecommunitythroughMeetup
GoogleMapsupcomingmeetups,displayingon/DisplayingupcomingmeetupsonGoogleMaps
HHadoopMongoDBconnector
URL/QueryingMongoDBfromSparkSQLHbaseandCassandra
about/PersistencelayerHDFS(HadoopDistributedFileSystem)
about/UnderstandingSpark
Iinfrastructurelayer/InfrastructurelayerIngestMode
BatchDataTransport/BuildingareliableandscalablestreamingappMicroBatch/BuildingareliableandscalablestreamingappPipelining/BuildingareliableandscalablestreamingappMessageQueue/Buildingareliableandscalablestreamingapp
integrationlayer/Integrationlayer
JJava8
installing/InstallingJava8JRE(JavaRuntimeEnvironment)
about/InstallingJava8JSON(JavaScriptObjectNotation)
about/Connectingtosocialnetworks,Harvestingandstoringdata
KKafka
settingup/SettingupKafkainstalling/InstallingandtestingKafkatesting/InstallingandtestingKafkaURL/InstallingandtestingKafkaproducers,developing/Developingproducersconsumers,developing/DevelopingconsumersSparkStreamingconsumer,developingfor/DevelopingaSparkStreamingconsumerforKafka
Kappaarchitecturedefining/ClosingremarksontheLambdaandKappaarchitecture,UnderstandingKappaarchitecture
LLambdaarchitecture
defining/ClosingremarksontheLambdaandKappaarchitecture,UnderstandingLambdaarchitecture
LinearRegressionModelsabout/Supervisedandunsupervisedlearning
M
Machine Learning
  about / Displaying upcoming meetups on Google Maps
machine learning pipelines
  building / Building machine learning pipelines
machine learning workflows
  about / Machine learning workflows and data flows
Massive Open Online Courses (MOOCs)
  about / Virtualizing the environment with Vagrant
Matplotlib
  about / Revisiting the data-intensive app architecture
  URL / Revisiting the data-intensive app architecture
Meetup API
  URL / Getting Meetup data
meetups
  mapping / Geo-locating tweets and mapping meetups
MLlib algorithms
  Collaborative filtering / Additional learning algorithms
  feature extraction and transformation / Additional learning algorithms
  optimization / Additional learning algorithms
  Limited-memory BFGS (L-BFGS) / Additional learning algorithms
models
  defining, for processing streams of data / Laying the foundations of streaming architecture
MongoDB
  about / Persistence layer
  setting up / Setting up MongoDB
  server and client, installing / Installing the MongoDB server and client
  server, running / Running the MongoDB server
  Mongo client, running / Running the Mongo client
  PyMongo driver, installing / Installing the PyMongo driver
  Python client, creating for / Creating the Python client for MongoDB
  references / Querying MongoDB from Spark SQL
MongoDB, from Spark SQL
  URL / Querying MongoDB from Spark SQL
Multi-Dimensional Scaling (MDS) algorithm
  about / Applying Scikit-Learn on the Twitter dataset
Mumrah, on GitHub
  URL / Installing and testing Kafka
MySQL
  about / Persistence layer
N
Naive Bayes
  about / Supervised and unsupervised learning
Neo4j
  about / Persistence layer
network_wordcount.py
  URL / Processing live data
O
Odo
  about / Transferring data using Odo
  used, for transferring data / Transferring data using Odo
operations, on RDDs
  transformations / The Resilient Distributed Dataset
  actions / The Resilient Distributed Dataset
P
persistence layer / Persistence layer
PIL (Python Imaging Library)
  about / Setting up wordcloud
PostgreSQL
  about / Persistence layer
Puppet
  about / Infrastructure layer
PySpark
  First App, building with / Building our first app with PySpark
R
RDD (Resilient Distributed Dataset)
  about / The Resilient Distributed Dataset
Resilient Distributed Datasets (RDD)
  about / Spark Streaming inner working
REST (Representational State Transfer)
  about / Connecting to social networks
RPC (Remote Procedure Call)
  about / Laying the foundations of streaming architecture
S
SDK (Software Development Kit)
  about / Installing Java 8
Seaborn
  about / Revisiting the data-intensive app architecture
  URL / Revisiting the data-intensive app architecture
social networks
  connecting to / Connecting to social networks
  Twitter data, obtaining / Getting Twitter data
  GitHub data, obtaining / Getting GitHub data
  Meetup data, obtaining / Getting Meetup data
Spark
  defining / Understanding Spark
  Batch / Understanding Spark
  Streaming / Understanding Spark
  Iterative / Understanding Spark
  Interactive / Understanding Spark
  libraries / Spark libraries
  URL / Installing Spark
  Clustering / Supervised and unsupervised learning
  Dimensionality Reduction / Supervised and unsupervised learning
  Regression and Classification / Supervised and unsupervised learning
  Isotonic Regression / Supervised and unsupervised learning
  MLlib algorithms / Additional learning algorithms
Spark, on EC2
  URL / Deploying apps in Amazon Web Services
SparkContext
  about / Spark Streaming inner working
Spark dataframes
  defining / Understanding Spark dataframes
Spark libraries
  Spark SQL / Spark libraries
  Spark MLlib / Spark libraries
  Spark Streaming / Spark libraries
  Spark GraphX / Spark libraries
  PySpark, defining / PySpark in action
  RDD (Resilient Distributed Dataset) / The Resilient Distributed Dataset
Spark MLlib
  contextualizing, in app architecture / Contextualizing Spark MLlib in the app architecture
  data types / Spark MLlib data types
Spark MLlib algorithms
  classifying / Classifying Spark MLlib algorithms
  supervised learning / Supervised and unsupervised learning
  unsupervised learning / Supervised and unsupervised learning
  additional learning algorithms / Additional learning algorithms
Spark powered environment
  setting up / Setting up the Spark powered environment
  Oracle VirtualBox, setting up with Ubuntu / Setting up an Oracle VirtualBox with Ubuntu
  Anaconda, installing with Python 2.7 / Installing Anaconda with Python 2.7
  Java 8, installing / Installing Java 8
  Spark, installing / Installing Spark
  IPython Notebook, enabling / Enabling IPython Notebook
Spark SQL
  used, for exploring data / Exploring data using Spark SQL
  about / Exploring data using Spark SQL
  CSV files, loading with / Loading and processing CSV files with Spark SQL
  CSV files, processing with / Loading and processing CSV files with Spark SQL
  MongoDB, querying from / Querying MongoDB from Spark SQL
Spark SQL module
  about / Analytics layer
Spark SQL query optimizer
  defining / Understanding the Spark SQL query optimizer
Spark Streaming
  defining / Spark Streaming inner working, Going under the hood of Spark Streaming
  fault tolerance, building in / Building in fault tolerance
Stochastic Gradient Descent
  about / Classifying Spark MLlib algorithms
streaming app
  building / Building a reliable and scalable streaming app
  Kafka, setting up / Setting up Kafka
  Flume, exploring / Exploring flume
  data pipelines, developing with Flume / Developing data pipelines with Flume, Kafka, and Spark
  data pipelines, developing with Kafka / Developing data pipelines with Flume, Kafka, and Spark
  data pipelines, developing with Spark / Developing data pipelines with Flume, Kafka, and Spark
streaming architecture
  about / Laying the foundations of streaming architecture
StreamingContext
  about / Spark Streaming inner working
supervised machine learning workflow
  about / Supervised machine learning workflows
T
TCP sockets
  live data, processing with / Processing live data with TCP sockets, Processing live data
  setting up / Setting up TCP sockets
TF-IDF (Term Frequency-Inverse Document Frequency)
  about / Classifying Spark MLlib algorithms
Trident
  about / Laying the foundations of streaming architecture
tweets
  geo-locating / Geo-locating tweets and mapping meetups, Geo-locating tweets
Twitter
  URL / Getting Twitter data
Twitter API, on dev console
  URL / Getting Twitter data
Twitter data
  manipulating / Manipulating Twitter data in real time
  tweets, processing from Twitter firehose / Processing Tweets in real time from the Twitter firehose
Twitter dataset
  clustering / Clustering the Twitter dataset
  Scikit-Learn, applying on / Applying Scikit-Learn on the Twitter dataset
  dataset, preprocessing / Preprocessing the dataset
  clustering algorithm, running / Running the clustering algorithm
  model and results, evaluating / Evaluating the model and the results
U
Ubuntu 14.04.1 LTS release
  URL / Setting up an Oracle VirtualBox with Ubuntu
unified log
  properties / Understanding Kappa architecture
Unified Log
  properties / Building a reliable and scalable streaming app
unsupervised machine learning workflow
  about / Unsupervised machine learning workflows
V
Vagrant
  about / Infrastructure layer
  environment, virtualizing with / Virtualizing the environment with Vagrant
  reference / Virtualizing the environment with Vagrant
VirtualBox VM
  URL / Setting up an Oracle VirtualBox with Ubuntu
visualization
  data, pre-processing for / Preprocessing the data for visualization
W
wordclouds
  creating / Gauging words, moods, and memes at a glance, Creating wordclouds
  setting up / Setting up wordcloud
  URL / Setting up wordcloud