Spark for Python Developers


Table of Contents

Spark for Python Developers
Credits
About the Author
Acknowledgment
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Setting Up a Spark Virtual Environment
Understanding the architecture of data-intensive applications
Infrastructure layer
Persistence layer
Integration layer
Analytics layer
Engagement layer
Understanding Spark
Spark libraries
PySpark in action
The Resilient Distributed Dataset
Understanding Anaconda
Setting up the Spark powered environment
Setting up an Oracle VirtualBox with Ubuntu
Installing Anaconda with Python 2.7
Installing Java 8
Installing Spark
Enabling IPython Notebook
Building our first app with PySpark
Virtualizing the environment with Vagrant
Moving to the cloud
Deploying apps in Amazon Web Services
Virtualizing the environment with Docker
Summary
2. Building Batch and Streaming Apps with Spark
Architecting data-intensive apps
Processing data at rest
Processing data in motion
Exploring data interactively
Connecting to social networks
Getting Twitter data
Getting GitHub data
Getting Meetup data
Analyzing the data
Discovering the anatomy of tweets
Exploring the GitHub world
Understanding the community through Meetup
Previewing our app
Summary
3. Juggling Data with Spark
Revisiting the data-intensive app architecture
Serializing and deserializing data
Harvesting and storing data
Persisting data in CSV
Persisting data in JSON
Setting up MongoDB
Installing the MongoDB server and client
Running the MongoDB server
Running the Mongo client
Installing the PyMongo driver
Creating the Python client for MongoDB
Harvesting data from Twitter
Exploring data using Blaze
Transferring data using Odo
Exploring data using Spark SQL
Understanding Spark dataframes
Understanding the Spark SQL query optimizer
Loading and processing CSV files with Spark SQL
Querying MongoDB from Spark SQL
Summary
4. Learning from Data Using Spark
Contextualizing Spark MLlib in the app architecture
Classifying Spark MLlib algorithms
Supervised and unsupervised learning
Additional learning algorithms
Spark MLlib data types
Machine learning workflows and data flows
Supervised machine learning workflows
Unsupervised machine learning workflows
Clustering the Twitter dataset
Applying Scikit-Learn on the Twitter dataset
Preprocessing the dataset
Running the clustering algorithm
Evaluating the model and the results
Building machine learning pipelines
Summary
5. Streaming Live Data with Spark
Laying the foundations of streaming architecture
Spark Streaming inner working
Going under the hood of Spark Streaming
Building in fault tolerance
Processing live data with TCP sockets
Setting up TCP sockets
Processing live data
Manipulating Twitter data in real time
Processing Tweets in real time from the Twitter firehose
Building a reliable and scalable streaming app
Setting up Kafka
Installing and testing Kafka
Developing producers
Developing consumers
Developing a Spark Streaming consumer for Kafka
Exploring Flume
Developing data pipelines with Flume, Kafka, and Spark
Closing remarks on the Lambda and Kappa architecture
Understanding Lambda architecture
Understanding Kappa architecture
Summary
6. Visualizing Insights and Trends
Revisiting the data-intensive apps architecture
Preprocessing the data for visualization
Gauging words, moods, and memes at a glance
Setting up wordcloud
Creating wordclouds
Geo-locating tweets and mapping meetups
Geo-locating tweets
Displaying upcoming meetups on Google Maps
Summary
Index


Spark for Python Developers


Spark for Python Developers

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2015

Production reference: 1171215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-969-6

www.packtpub.com


Credits

Author
Amit Nandi

Reviewers
Manuel Ignacio Franco Galeano
Rahul Kavale
Daniel Lemire
Chet Mancini
Laurence Welch

Commissioning Editor
Amarabha Banerjee

Acquisition Editor
Sonali Vernekar

Content Development Editor
Merint Thomas Mathew

Technical Editor
Naveenkumar Jain

Copy Editor
Roshni Banerjee

Project Coordinator
Suzanne Coutinho

Proofreader
Safis Editing

Indexer
Priya Sane

Graphics
Kirk D'Penha

Production Coordinator
Shantanu N. Zagade

Cover Work
Shantanu N. Zagade


About the Author

Amit Nandi studied physics at the Free University of Brussels in Belgium, where he did his research on computer generated holograms. Computer generated holograms are the key components of an optical computer, which is powered by photons running at the speed of light. He then worked with the university Cray supercomputer, sending batch jobs of programs written in Fortran. This gave him a taste for computing, which kept growing. He has worked extensively on large business reengineering initiatives, using SAP as the main enabler. He focused for the last 15 years on start-ups in the data space, pioneering new areas of the information technology landscape. He is currently focusing on large-scale data-intensive applications as an enterprise architect, data engineer, and software developer. He understands and speaks seven human languages. Although Python is his computer language of choice, he aims to be able to write fluently in seven computer languages too.


Acknowledgment

I want to express my profound gratitude to my parents for their unconditional love and strong support in all my endeavors.

This book arose from an initial discussion with Richard Gall, an acquisition editor at Packt Publishing. Without this initial discussion, this book would never have happened. So, I am grateful to him. The follow-up discussions and the contractual terms were agreed upon with Rebecca Youe. I would like to thank her for her support. I would also like to thank Merint Mathew, a content editor who helped me bring this book to the finish line. I am thankful to Merint for his subtle persistence and tactful support during the write-ups and revisions of this book.

We are standing on the shoulders of giants. I want to acknowledge some of the giants who helped me shape my thinking. I want to recognize the beauty, elegance, and power of Python as envisioned by Guido van Rossum. My respectful gratitude goes to Matei Zaharia and the team at Berkeley AMPLab and Databricks for developing a new approach to computing with Spark and Mesos. Travis Oliphant, Peter Wang, and the team at Continuum.io are doing a tremendous job of keeping Python relevant in a fast-changing computing landscape. Thank you to you all.


About the Reviewers

Manuel Ignacio Franco Galeano is a software developer from Colombia. He holds a computer science degree from the University of Quindío. At the moment of publication of this book, he was studying for his MSc in computer science at University College Dublin, Ireland. He has a wide range of interests that include distributed systems, machine learning, microservices, and so on. He is looking for a way to apply machine learning techniques to audio data in order to help people learn more about music.

Rahul Kavale works as a software developer at TinyOwl Ltd. He is interested in multiple technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java, and has worked on Apache Spark, Apache Storm, Apache Kafka, Hadoop, and Hive. He enjoys writing Scala. Functional programming and distributed computing are his areas of interest. He has been using Spark since its early stage for varying use cases. He has also helped with the review for the Pragmatic Scala book.

Daniel Lemire has a BSc and MSc in mathematics from the University of Toronto and a PhD in engineering mathematics from the Ecole Polytechnique and the Université de Montréal. He is a professor of computer science at the Université du Québec. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles. He has held competitive research grants for the last 15 years. He has been an expert on several committees with funding agencies (NSERC and FQRNT). He has served as a program committee member on leading computer science conferences (for example, ACM CIKM, ACM WSDM, ACM SIGIR, and ACM RecSys). His open source software has been used by major corporations such as Google and Facebook. His research interests include databases, information retrieval, and high-performance programming. He blogs regularly on computer science at http://lemire.me/blog/.

Chet Mancini is a data engineer at Intent Media, Inc. in New York, where he works with the data science team to store and process terabytes of web travel data to build predictive models of shopper behavior. He enjoys functional programming, immutable data structures, and machine learning. He writes and speaks on topics surrounding data engineering and information architecture.

He is a contributor to Apache Spark and other libraries in the Spark ecosystem. Chet has a master's degree in computer science from Cornell University.


www.PacktPub.com


Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.


Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser


Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.


Preface

Spark for Python Developers aims to combine the elegance and flexibility of Python with the power and versatility of Apache Spark. Spark is written in Scala and runs on the Java virtual machine. It is nevertheless polyglot and offers bindings and APIs for Java, Scala, Python, and R. Python is a well-designed language with an extensive set of specialized libraries. This book looks at PySpark within the PyData ecosystem. Some of the prominent PyData libraries include Pandas, Blaze, Scikit-Learn, Matplotlib, Seaborn, and Bokeh. These libraries are open source. They are developed, used, and maintained by the data scientist and Python developers community. PySpark integrates well with the PyData ecosystem, as endorsed by the Anaconda Python distribution. The book puts forward a journey to build data-intensive apps along with an architectural blueprint that covers the following steps: first, set up the base infrastructure with Spark. Second, acquire, collect, process, and store the data. Third, gain insights from the collected data. Fourth, stream live data and process it in real time. Finally, visualize the information.

The objective of the book is to learn about PySpark and PyData libraries by building apps that analyze the Spark community's interactions on social networks. The focus is on Twitter data.


What this book covers

Chapter 1, Setting Up a Spark Virtual Environment, covers how to create a segregated virtual machine as our sandbox or development environment to experiment with Spark and PyData libraries. It covers how to install Spark and the Python Anaconda distribution, which includes PyData libraries. Along the way, we explain the key Spark concepts and the Python Anaconda ecosystem, and build a Spark word count app.

Chapter 2, Building Batch and Streaming Apps with Spark, lays the foundation of the Data Intensive Apps Architecture. It describes the five layers of the apps' architecture blueprint: infrastructure, persistence, integration, analytics, and engagement. We establish API connections with three social networks: Twitter, GitHub, and Meetup. This chapter provides the tools to connect to these three nontrivial APIs so that you can create your own data mashups at a later stage.

Chapter 3, Juggling Data with Spark, covers how to harvest data from Twitter and process it using Pandas, Blaze, and Spark SQL with their respective implementations of the dataframe data structure. We proceed with further investigations and techniques using Spark SQL, leveraging the Spark dataframe data structure.

Chapter 4, Learning from Data Using Spark, gives an overview of the ever-expanding library of algorithms of Spark MLlib. It covers supervised and unsupervised learning, recommender systems, optimization, and feature extraction algorithms. We put the Twitter-harvested dataset through Python Scikit-Learn and Spark MLlib K-means clustering in order to segregate the tweets relevant to Apache Spark.

Chapter 5, Streaming Live Data with Spark, lays down the foundation of streaming architecture apps and describes their challenges, constraints, and benefits. We illustrate the streaming concepts with TCP sockets, followed by live tweet ingestion and processing directly from the Twitter firehose. We also describe Flume, a reliable, flexible, and scalable data ingestion and transport pipeline system. The combination of Flume, Kafka, and Spark delivers unparalleled robustness, speed, and agility in an ever-changing landscape. We end the chapter with some remarks and observations on two streaming architectural paradigms, the Lambda and Kappa architectures.

Chapter 6, Visualizing Insights and Trends, focuses on a few key visualization techniques. It covers how to build word clouds and expose their intuitive power to reveal a lot of the keywords, moods, and memes carried through thousands of tweets. We then focus on interactive mapping visualizations using Bokeh. We build a world map from the ground up and create a scatter plot of critical tweets. Our final visualization is to overlay an actual Google map of London, highlighting upcoming meetups and their respective topics.


What you need for this book

You need inquisitiveness, perseverance, and a passion for data, software engineering, application architecture and scalability, and beautiful, succinct visualizations. The scope is broad.

You need a good understanding of Python or a similar language with object-oriented and functional programming capabilities. Preliminary experience of data wrangling with Python, R, or any similar tool is helpful.

You need to appreciate how to conceive, build, and scale data applications.


Who this book is for

The target audience includes the following:

Data scientists are the primary interested parties. This book will help you unleash the power of Spark and leverage your Python, R, and machine learning background.
Software developers with a focus on Python will readily expand their skills to create data-intensive apps using Spark as a processing engine and Python visualization libraries and web frameworks.
Data architects who can create rapid data pipelines and build the famous Lambda architecture that encompasses batch and streaming processing to render insights on data in real time, using the Spark and Python rich ecosystem, will also benefit from this book.


Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Launch PySpark with IPYNB in directory examples/AN_Spark where the Jupyter or IPython Notebooks are stored".

A block of code is set as follows:

# Word count on 1st Chapter of the Book using PySpark
# import regex module
import re
# import add from operator module
from operator import add
# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py20150315')

Any command-line input or output is written as follows:

# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "After installing VirtualBox, let's open the Oracle VM VirtualBox Manager and click the New button."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.


Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.


Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.


Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.


Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.


Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.


Chapter 1. Setting Up a Spark Virtual Environment

In this chapter, we will build an isolated virtual environment for development purposes. The environment will be powered by Spark and the PyData libraries provided by the Python Anaconda distribution. These libraries include Pandas, Scikit-Learn, Blaze, Matplotlib, Seaborn, and Bokeh. We will perform the following activities:

Setting up the development environment using the Anaconda Python distribution. This will include enabling the IPython Notebook environment powered by PySpark for our data exploration tasks.
Installing and enabling Spark, and the PyData libraries such as Pandas, Scikit-Learn, Blaze, Matplotlib, and Bokeh.
Building a word count example app to ensure that everything is working fine.

The last decade has seen the rise and dominance of data-driven behemoths such as Amazon, Google, Twitter, LinkedIn, and Facebook. These corporations, by seeding, sharing, or disclosing their infrastructure concepts, software practices, and data processing frameworks, have fostered a vibrant open source software community. This has transformed enterprise technology, systems, and software architecture.

This includes new infrastructure and DevOps (short for development and operations) concepts leveraging virtualization, cloud technology, and software-defined networks.

To process petabytes of data, Hadoop was developed and open sourced, taking its inspiration from the Google File System (GFS) and the adjoining distributed computing framework, MapReduce. Overcoming the complexities of scaling while keeping costs under control has also led to a proliferation of new data stores. Examples of recent database technology include Cassandra, a columnar database; MongoDB, a document database; and Neo4j, a graph database.

Hadoop, thanks to its ability to process huge datasets, has fostered a vast ecosystem to query data more iteratively and interactively with Pig, Hive, Impala, and Tez. Hadoop is cumbersome as it operates only in batch mode using MapReduce. Spark is creating a revolution in the analytics and data processing realm by targeting the shortcomings of disk input-output and bandwidth-intensive MapReduce jobs.

Spark is written in Scala, and therefore integrates natively with the Java Virtual Machine (JVM) powered ecosystem. Spark provided a Python API and bindings early on by enabling PySpark. The Spark architecture and ecosystem is inherently polyglot, with an obvious strong presence of Java-led systems.

This book will focus on PySpark and the PyData ecosystem. Python is one of the preferred languages in the academic and scientific community for data-intensive processing. Python has developed a rich ecosystem of libraries and tools in data manipulation with Pandas and Blaze, in machine learning with Scikit-Learn, and in data visualization with Matplotlib, Seaborn, and Bokeh. Hence, the aim of this book is to build an end-to-end architecture for data-intensive applications powered by Spark and Python. In order to put these concepts into practice, we will analyze social networks such as Twitter, GitHub, and Meetup. We will focus on the activities and social interactions of Spark and the open source software community by tapping into GitHub, Twitter, and Meetup.

Building data-intensive applications requires highly scalable infrastructure, polyglot storage, seamless data integration, multiparadigm analytics processing, and efficient visualization. The following paragraph describes the data-intensive app architecture blueprint that we will adopt throughout the book. It is the backbone of the book. We will discover Spark in the context of the broader PyData ecosystem.

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.


Understanding the architecture of data-intensive applications

In order to understand the architecture of data-intensive applications, the following conceptual framework is used. This architecture is designed on the following five layers:

Infrastructure layer
Persistence layer
Integration layer
Analytics layer
Engagement layer

The following screenshot depicts the five layers of the Data Intensive App Framework:

From the bottom up, let's go through the layers and their main purpose.


Infrastructure layer

The infrastructure layer is primarily concerned with virtualization, scalability, and continuous integration. In practical terms, and in terms of virtualization, we will go through building our own development environment in a VirtualBox virtual machine powered by Spark and the Anaconda distribution of Python. If we wish to scale from there, we can create a similar environment in the cloud. The practice of creating a segregated development environment and moving into test and production deployment can be automated and can be part of a continuous integration cycle powered by DevOps tools such as Vagrant, Chef, Puppet, and Docker. Docker is a very popular open source project that eases the installation and deployment of new environments. The book will be limited to building the virtual machine using VirtualBox. From a data-intensive app architecture point of view, we are describing the essential steps of the infrastructure layer by mentioning scalability and continuous integration beyond just virtualization.


Persistence layer

The persistence layer manages the various repositories in accordance with data needs and shapes. It ensures the setup and management of the polyglot data stores. It includes relational database management systems such as MySQL and PostgreSQL; key-value data stores such as Hadoop, Riak, and Redis; columnar databases such as HBase and Cassandra; document databases such as MongoDB and Couchbase; and graph databases such as Neo4j. The persistence layer manages various filesystems such as Hadoop's HDFS. It interacts with various storage systems from native hard drives to Amazon S3. It manages various file storage formats such as csv, json, and parquet, which is a column-oriented format.


Integration layer

The integration layer focuses on data acquisition, transformation, quality, persistence, consumption, and governance. It is essentially driven by the following five Cs: connect, collect, correct, compose, and consume.

The five steps describe the lifecycle of data. They are focused on how to acquire the dataset of interest, explore it, iteratively refine and enrich the collected information, and get it ready for consumption. So, the steps perform the following operations:

Connect: Targets the best way to acquire data from the various data sources, the APIs offered by these sources, the input format, input schemas if they exist, the rate of data collection, and limitations from providers
Correct: Focuses on transforming data for further processing and also ensures that the quality and consistency of the data received are maintained
Collect: Looks at which data to store where and in what format, to ease data composition and consumption at later stages
Compose: Concentrates its attention on how to mash up the various datasets collected, and enrich the information in order to build a compelling data-driven product
Consume: Takes care of data provisioning and rendering and how the right data reaches the right individual at the right time
Control: This sixth additional step will sooner or later be required as the data, the organization, and the participants grow, and it is about ensuring data governance

The following diagram depicts the iterative process of data acquisition and refinement for consumption:


Analytics layer

The analytics layer is where Spark processes data with the various models, algorithms, and machine learning pipelines in order to derive insights. For our purpose, in this book, the analytics layer is powered by Spark. We will delve deeper in subsequent chapters into the merits of Spark. In a nutshell, what makes it so powerful is that it allows multiple paradigms of analytics processing in a single unified platform. It allows batch, streaming, and interactive analytics. Batch processing on large datasets with longer latency periods allows us to extract patterns and insights that can feed into real-time events in streaming mode. Interactive and iterative analytics are more suited for data exploration. Spark offers bindings and APIs in Python and R. With its Spark SQL module and the Spark dataframe, it offers a very familiar analytics interface.


Engagement layer

The engagement layer interacts with the end user and provides dashboards, interactive visualizations, and alerts. We will focus here on the tools provided by the PyData ecosystem such as Matplotlib, Seaborn, and Bokeh.


Understanding Spark

Hadoop scales horizontally as the data grows. Hadoop runs on commodity hardware, so it is cost-effective. Intensive data applications are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. Hadoop is the first open source implementation of map-reduce. Hadoop relies on a distributed framework for storage called HDFS (Hadoop Distributed File System). Hadoop runs map-reduce tasks in batch jobs. Hadoop requires persisting the data to disk at each map, shuffle, and reduce process step. The overhead and the latency of such batch jobs adversely impact the performance.

Spark is a fast, distributed general analytics computing engine for large-scale data processing. The major breakthrough from Hadoop is that Spark allows data sharing between processing steps through in-memory processing of data pipelines.

Spark is unique in that it allows four different styles of data analysis and processing. Spark can be used in:

Batch: This mode is used for manipulating large datasets, typically performing large map-reduce jobs
Streaming: This mode is used to process incoming information in near real time
Iterative: This mode is for machine learning algorithms such as a gradient descent where the data is accessed repetitively in order to reach convergence
Interactive: This mode is used for data exploration, as large chunks of data are in memory and due to the very quick response time of Spark

The following figure highlights the preceding four processing styles:

Spark operates in three modes: one single mode, standalone on a single machine, and two distributed modes on a cluster of machines: on Yarn, the Hadoop distributed resource manager, or on Mesos, the open source cluster manager developed at Berkeley concurrently with Spark.
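
As a rough sketch (the application file my_app.py and the Mesos host and port are placeholders, not from the book), the same PySpark application can be pointed at each of these modes through the --master option of spark-submit:

# local (standalone) mode on a single machine, using all available cores
$ ./bin/spark-submit --master local[*] my_app.py
# on a Hadoop YARN cluster (client deploy mode)
$ ./bin/spark-submit --master yarn-client my_app.py
# on a Mesos cluster
$ ./bin/spark-submit --master mesos://mesos-master:5050 my_app.py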


Spark offers a polyglot interface in Scala, Java, Python, and R.


Spark libraries

Spark comes with batteries included, with some powerful libraries:

Spark SQL: This provides the SQL-like ability to interrogate structured data and interactively explore large datasets
Spark MLlib: This provides major algorithms and a pipeline framework for machine learning
Spark Streaming: This is for near real-time analysis of data using micro batches and sliding windows on incoming streams of data
Spark GraphX: This is for graph processing and computation on complex connected entities and relationships
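
As a quick orientation (a sketch reusing the sc object from the PySpark shell; the 10-second batch interval is arbitrary), the first three libraries are reached from PySpark through their own entry points, while GraphX does not expose a Python API in this version of Spark:

# Spark SQL entry point
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# Spark Streaming entry point, with 10-second micro batches
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 10)
# MLlib algorithms are imported directly, for example K-means
from pyspark.mllib.clustering import KMeans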

PySpark in action

Spark is written in Scala. The whole Spark ecosystem naturally leverages the JVM environment and capitalizes on HDFS natively. Hadoop HDFS is one of the many data stores supported by Spark. Spark is agnostic and from the beginning interacted with multiple data sources, types, and formats.

PySpark is not a transcribed version of Spark on a Java-enabled dialect of Python such as Jython. PySpark provides integrated API bindings around Spark and enables full usage of the Python ecosystem within all the nodes of the cluster with the pickle Python serialization and, more importantly, supplies access to the rich ecosystem of Python's machine learning libraries such as Scikit-Learn or data processing such as Pandas.

When we initialize a Spark program, the first thing a Spark program must do is to create a SparkContext object. It tells Spark how to access the cluster. The Python program creates a PySparkContext. Py4J is the gateway that binds the Python program to the Spark JVM SparkContext. The JVM SparkContext serializes the application codes and the closures and sends them to the cluster for execution. The cluster manager allocates resources and schedules, and ships the closures to the Spark workers in the cluster, which activate Python virtual machines as required. In each machine, the Spark Worker is managed by an executor that controls computation, storage, and cache.
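
As a minimal sketch (the application name, master URL, and script are illustrative and not from the book), a standalone PySpark script creates its own SparkContext explicitly instead of relying on the sc object provided by the PySpark shell:

# minimal standalone PySpark script
from pyspark import SparkConf, SparkContext

# describe the application and where to run it
conf = SparkConf().setAppName('first_app').setMaster('local[*]')
# create the SparkContext; Py4J relays this to the JVM SparkContext
sc = SparkContext(conf=conf)
print(sc.version)
# release the resources when done
sc.stop()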

Here’sanexampleofhowtheSparkdrivermanagesboththePySparkcontextandtheSparkcontextwithitslocalfilesystemsanditsinteractionswiththeSparkworkerthroughtheclustermanager:


The Resilient Distributed Dataset

Spark applications consist of a driver program that runs the user's main function, creates distributed datasets on the cluster, and executes various parallel operations (transformations and actions) on those datasets.

Spark applications are run as an independent set of processes, coordinated by a SparkContext in a driver program.

The SparkContext will be allocated system resources (machines, memory, CPU) from the cluster manager.

The SparkContext manages executors who manage workers in the cluster. The driver program has Spark jobs that need to run. The jobs are split into tasks submitted to the executor for completion. The executor takes care of computation, storage, and caching in each machine.

The key building block in Spark is the RDD (Resilient Distributed Dataset). A dataset is a collection of elements. Distributed means the dataset can be on any node in the cluster. Resilient means that the dataset could get lost or partially lost without major harm to the computation in progress, as Spark will re-compute from the data lineage in memory, also known as the DAG (short for Directed Acyclic Graph) of operations. Basically, Spark will snapshot in memory a state of the RDD in the cache. If one of the computing machines crashes during operation, Spark rebuilds the RDDs from the cached RDD and the DAG of operations. RDDs recover from node failure.
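
To make the caching idea concrete, here is a small sketch (the file path is illustrative): an RDD is explicitly cached so that later actions reuse the in-memory partitions instead of recomputing the whole lineage:

# cache an RDD so that repeated actions reuse the in-memory copy
lines = sc.textFile('/home/an/data/sample.txt')
words = lines.flatMap(lambda line: line.split())
words.cache()            # keep this RDD in memory after it is first computed
print(words.count())     # first action computes the RDD and populates the cache
print(words.count())     # second action reads from the cached partitions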

There are two types of operation on RDDs:

Transformations: A transformation takes an existing RDD and leads to a pointer to a new, transformed RDD. An RDD is immutable. Once created, it cannot be changed. Each transformation creates a new RDD. Transformations are lazily evaluated. Transformations are executed only when an action occurs. In the case of failure, the data lineage of transformations rebuilds the RDD.
Actions: An action on an RDD triggers a Spark job and yields a value. An action operation causes Spark to execute the (lazy) transformation operations that are required to compute the RDD returned by the action. The action results in a DAG of operations. The DAG is compiled into stages where each stage is executed as a series of tasks. A task is a fundamental unit of work.

Here's some useful information on RDDs:

RDDs are created from a data source such as an HDFS file or a DB query. There are three ways to create an RDD:

Reading from a data store
Transforming an existing RDD
Using an in-memory collection

RDDs are transformed with functions such as map or filter, which yield new RDDs. An action such as first, take, collect, or count on an RDD will deliver the results into the Spark driver. The Spark driver is the client through which the user interacts with the Spark cluster.
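
The following short sketch (the file path and numbers are illustrative) shows the three ways of creating an RDD, followed by lazy transformations and the actions that trigger the actual computation:

# create RDDs from a data store and from an in-memory collection
rdd_from_file = sc.textFile('/home/an/data/sample.txt')
rdd_from_list = sc.parallelize([1, 2, 3, 4, 5])
# transformations are lazy: nothing is computed yet
squares = rdd_from_list.map(lambda x: x * x)
evens = rdd_from_list.filter(lambda x: x % 2 == 0)
# actions trigger the job and return results to the driver
print(squares.collect())       # [1, 4, 9, 16, 25]
print(evens.count())           # 2
print(rdd_from_file.take(3))   # first three lines of the file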

The following diagram illustrates the RDD transformation and action:


Understanding Anaconda

Anaconda is a widely used free Python distribution maintained by Continuum (https://www.continuum.io/). We will use the prevailing software stack provided by Anaconda to generate our apps. In this book, we will use PySpark and the PyData ecosystem. The PyData ecosystem is promoted, supported, and maintained by Continuum and powered by the Anaconda Python distribution. The Anaconda Python distribution essentially saves time and aggravation in the installation of the Python environment; we will use it in conjunction with Spark. Anaconda has its own package management that supplements the traditional pip install and easy_install. Anaconda comes with batteries included, namely some of the most important packages such as Pandas, Scikit-Learn, Blaze, Matplotlib, and Bokeh. An upgrade to any of the installed libraries is a simple command at the console:

$ conda update <package name>

A list of the installed libraries in our environment can be obtained with the command:

$ conda list

The key components of the stack are as follows:

Anaconda: This is a free Python distribution with almost 200 Python packages for science, math, engineering, and data analysis.
Conda: This is a package manager that takes care of all the dependencies of installing a complex software stack. This is not restricted to Python and manages the install process for R and other languages.
Numba: This provides the power to speed up code in Python with high-performance functions and just-in-time compilation.
Blaze: This enables large-scale data analytics by offering a uniform and adaptable interface to access a variety of data providers, which include streaming Python, Pandas, SQLAlchemy, and Spark.
Bokeh: This provides interactive data visualizations for large and streaming datasets.
Wakari: This allows us to share and deploy IPython Notebooks and other apps on a hosted environment.
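
For instance (the environment name and package list are illustrative), conda can create an isolated environment with the key PyData packages in a single command and switch to it:

# create a dedicated environment with the main PyData packages
$ conda create -n pydata-env python=2.7 pandas scikit-learn matplotlib bokeh blaze
# activate the environment
$ source activate pydata-env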

The following figure shows the components of the Anaconda stack:


Setting up the Spark powered environment

In this section, we will learn to set up Spark:

Create a segregated development environment in a virtual machine running on Ubuntu 14.04, so it does not interfere with any existing system.
Install Spark 1.5.2 with its dependencies.
Install the Anaconda Python 2.7 environment with all the required libraries such as Pandas, Scikit-Learn, Blaze, and Bokeh, and enable PySpark, so it can be accessed through IPython Notebooks.
Set up the backend or data stores of our environment. We will use MySQL as the relational database, MongoDB as the document store, and Cassandra as the columnar database.

Each storage backend serves a specific purpose depending on the nature of the data to be handled. The MySQL RDBMS is used for standard tabular processed information that can be easily queried using SQL. As we will be processing a lot of JSON-type data from various APIs, the easiest way to store them is in a document store. For real-time and time-series-related information, Cassandra is best suited as a columnar database.

The following diagram gives a view of the environment we will build and use throughout the book:


Setting up an Oracle VirtualBox with Ubuntu

Setting up a clean new VirtualBox environment on Ubuntu 14.04 is the safest way to create a development environment that does not conflict with existing libraries and can later be replicated in the cloud using a similar list of commands.

In order to set up an environment with Anaconda and Spark, we will create a VirtualBox virtual machine running Ubuntu 14.04.

Let's go through the steps of using VirtualBox with Ubuntu:

1. Oracle VirtualBox VM is free and can be downloaded from https://www.virtualbox.org/wiki/Downloads. The installation is pretty straightforward.

2. After installing VirtualBox, let's open the Oracle VM VirtualBox Manager and click the New button.

3. We'll give the new VM a name, and select Type Linux and Version Ubuntu (64 bit).

4. You need to download the ISO from the Ubuntu website and allocate sufficient RAM (4 GB recommended) and disk space (20 GB recommended). We will use the Ubuntu 14.04.1 LTS release, which is found here: http://www.ubuntu.com/download/desktop.

5. Once the installation is completed, it is advisable to install the VirtualBox Guest Additions by going to (from the VirtualBox menu, with the new VM running) Devices | Insert Guest Additions CD image. Failing to provide the Guest Additions in a Windows host gives a very limited user interface with reduced window sizes.

6. Once the additions installation completes, reboot the VM, and it will be ready to use. It is helpful to enable the shared clipboard by selecting the VM and clicking Settings, then go to General | Advanced | Shared Clipboard and click on Bidirectional.


Installing Anaconda with Python 2.7

PySpark currently runs only on Python 2.7. (There are requests from the community to upgrade to Python 3.3.) To install Anaconda, follow these steps:

1. Download the Anaconda installer for Linux 64-bit Python 2.7 from http://continuum.io/downloads#all.

2. After downloading the Anaconda installer, open a terminal and navigate to the directory or folder where the installer has been saved. From here, run the following command, replacing the 2.x.x in the command with the version number of the downloaded installer file:

# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh

3. After accepting the license terms, you will be asked to specify the install location (which defaults to ~/anaconda).

4. After the self-extraction is finished, you should add the Anaconda binary directory to your PATH environment variable:

# add anaconda to PATH (assuming the default install location ~/anaconda)
export PATH="$HOME/anaconda/bin:$PATH"


Installing Java 8

Spark runs on the JVM and requires the Java SDK (short for Software Development Kit) and not the JRE (short for Java Runtime Environment), as we will build apps with Spark. The recommended version is Java Version 7 or higher. Java 8 is the most suitable, as it includes many of the functional programming techniques available with Scala and Python.

To install Java 8, follow these steps:

1. Install Oracle Java 8 using the following commands:

# install oracle java 8
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

2. Set the JAVA_HOME environment variable and ensure that the Java program is on your PATH.

3. Check that JAVA_HOME is properly installed:

# check JAVA_HOME
$ echo $JAVA_HOME
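
If JAVA_HOME is not yet set for step 2, a common approach (assuming the Oracle installer placed Java under /usr/lib/jvm/java-8-oracle; adjust the path to your installation) is to export it from ~/.bashrc:

# set JAVA_HOME and put the Java binaries on the PATH
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> ~/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc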


Installing Spark

Head over to the Spark download page at http://spark.apache.org/downloads.html.

The Spark download page offers the possibility to download earlier versions of Spark and different package and download types. We will select the latest release, pre-built for Hadoop 2.6 and later. The easiest way to install Spark is to use a Spark package prebuilt for Hadoop 2.6 and later, rather than building it from source. Move the file to the directory ~/spark under your home directory.

Download the latest release of Spark, Spark 1.5.2, released on November 9, 2015:

1. Select Spark release 1.5.2 (Nov 09 2015).
2. Choose the package type Pre-built for Hadoop 2.6 and later.
3. Choose the download type Direct Download.
4. Download Spark: spark-1.5.2-bin-hadoop2.6.tgz.
5. Verify this release using the 1.5.2 signatures and checksums.

This can also be accomplished by running:

# download spark
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz

Next, we'll extract the files and clean up:

# extract, clean up, and move the unzipped files under the spark directory
$ tar -xf spark-1.5.2-bin-hadoop2.6.tgz
$ rm spark-1.5.2-bin-hadoop2.6.tgz
$ sudo mv spark-* spark

Now, we can run the Spark Python interpreter with:

# run spark
$ cd ~/spark
$ ./bin/pyspark

You should see something like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc.
>>>

The interpreter will have already provided us with a Spark context object, sc, which we can see by running:

>>> print(sc)
<pyspark.context.SparkContext object at 0x7f34b61c4e50>


Enabling IPython Notebook

We will work with IPython Notebook for a friendlier user experience than the console.

You can launch IPython Notebook by using the following command:

$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

Launch PySpark with IPYNB in the directory examples/AN_Spark where the Jupyter or IPython Notebooks are stored:

# cd to /home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark
# launch command using python 2.7 and the spark-csv package:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
# launch command using python 3.4 and the spark-csv package:
$ IPYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0


Building our first app with PySpark

We are now ready to check that everything is working fine. The obligatory word count will be put to the test by processing a word count on the first chapter of this book.

The code we will be running is listed here:

# Word count on 1st Chapter of the Book using PySpark
# import regex module
import re
# import add from operator module
from operator import add

# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# Get words from the input file
words = file_in.flatMap(lambda line: re.split('\W+', line.lower().strip()))

# words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)

# set count 1 per word
words = words.map(lambda w: (w, 1))

# reduce phase - sum count all the words
words = words.reduceByKey(add)

In this program, we are first reading the file from the directory /home/an/Documents/A00_Documents/Spark4Py20150315 into file_in.

We are then introspecting the file by counting the number of lines and the number of characters per line.

We are splitting the input file into words and getting them in lowercase. For our word count purpose, we are choosing words longer than three characters in order to avoid shorter and much more frequent words such as the, and, and for, which would skew the count in their favor. Generally, they are considered stop words and should be filtered out in any language processing task.

At this stage, we are getting ready for the MapReduce steps. To each word, we map a value of 1 and reduce it by summing all the unique words.

Here are illustrations of the code in the IPython Notebook. The first 10 cells are preprocessing the word count on the dataset, which is retrieved from the local file directory.


Swap the word count tuples into the format (count, word) in order to sort by count, which is now the primary key of the tuple:

# create tuple (count, word) and sort in descending order
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)
# take top 20 words by frequency
words.take(20)


In order to display our result, we are creating the tuple (count, word) and displaying the top 20 most frequently used words in descending order:


Let’screateahistogramfunction:

#createfunctionforhistogramofmostfrequentwords

%matplotlibinline

importmatplotlib.pyplotasplt

#

defhistogram(words):

count=map(lambdax:x[1],words)

word=map(lambdax:x[0],words)

plt.barh(range(len(count)),count,color='grey')

plt.yticks(range(len(count)),word)

#Changeorderoftuple(word,count)from(count,word)

words=words.map(lambdax:(x[1],x[0]))

words.take(25)

#displayhistogram

histogram(words.take(25))

Here, we visualize the most frequent words by plotting them in a bar chart. We have to first swap the tuple from the original (count, word) to (word, count):


So here you have it: the most frequent words used in the first chapter are Spark, followed by Data and Anaconda.


Virtualizing the environment with Vagrant

In order to create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built with a vagrantfile.

We will point to the Massive Open Online Courses (MOOCs) delivered by UC Berkeley and Databricks:

Introduction to Big Data with Apache Spark, Professor Anthony D. Joseph, can be found at https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
Scalable Machine Learning, Professor Ameet Talwalkar, can be found at https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

The course labs were executed on IPython Notebooks powered by PySpark. They can be found in the following GitHub repository: https://github.com/spark-mooc/mooc-setup/.

Once you have set up Vagrant on your machine, follow these instructions to get started: https://docs.vagrantup.com/v2/getting-started/index.html.

Clone the spark-mooc/mooc-setup/ GitHub repository in your work directory and launch the command $ vagrant up within the cloned directory:

Be aware that the version of Spark may be outdated as the vagrantfile may not be up to date.

You will see an output similar to this:

C:\Programs\spark\edx1001\mooc-setup-master>vagrant up
Bringing machine 'sparkvm' up with 'virtualbox' provider…
==> sparkvm: Checking if box 'sparkmooc/base' is up to date…
==> sparkvm: Clearing any previously set forwarded ports…
==> sparkvm: Clearing any previously set network interfaces…
==> sparkvm: Preparing network interfaces based on configuration…
    sparkvm: Adapter 1: nat
==> sparkvm: Forwarding ports…
    sparkvm: 8001 => 8001 (adapter 1)
    sparkvm: 4040 => 4040 (adapter 1)
    sparkvm: 22 => 2222 (adapter 1)
==> sparkvm: Booting VM…
==> sparkvm: Waiting for machine to boot. This may take a few minutes…
    sparkvm: SSH address: 127.0.0.1:2222
    sparkvm: SSH username: vagrant
    sparkvm: SSH auth method: private key
    sparkvm: Warning: Connection timeout. Retrying…
    sparkvm: Warning: Remote connection disconnect. Retrying…
==> sparkvm: Machine booted and ready!
==> sparkvm: Checking for guest additions in VM…
==> sparkvm: Setting hostname…
==> sparkvm: Mounting shared folders…
    sparkvm: /vagrant => C:/Programs/spark/edx1001/mooc-setup-master
==> sparkvm: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> sparkvm: flag to force provisioning. Provisioners marked to run always will still run.

C:\Programs\spark\edx1001\mooc-setup-master>

This will launch the IPython Notebooks powered by PySpark on localhost:8001:


Moving to the cloud

As we are dealing with distributed systems, an environment on a virtual machine running on a single laptop is limited for exploration and learning. We can move to the cloud in order to experience the power and scalability of the Spark distributed framework.


Deploying apps in Amazon Web Services

Once we are ready to scale our apps, we can migrate our development environment to Amazon Web Services (AWS).

How to run Spark on EC2 is clearly described on the following page: https://spark.apache.org/docs/latest/ec2-scripts.html.

We emphasize five key steps in setting up the AWS Spark environment:

1. Create an AWS EC2 key pair via the AWS console at http://aws.amazon.com/console/.
2. Export your key pair to your environment:

export AWS_ACCESS_KEY_ID=accesskeyid
export AWS_SECRET_ACCESS_KEY=secretaccesskey

3. Launch your cluster:

~$ cd $SPARK_HOME/ec2
ec2$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

4. SSH into a cluster to run Spark jobs:

ec2$ ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>

5. Destroy your cluster after usage:

ec2$ ./spark-ec2 destroy <cluster-name>


Virtualizing the environment with Docker

In order to create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built in Docker containers.

We wish to capitalize on Docker's two main functions:

Creating isolated containers that can be easily deployed on different operating systems or in the cloud.
Allowing easy sharing of the development environment image with all its dependencies using the Docker Hub. The Docker Hub is similar to GitHub. It allows easy cloning and version control. The snapshot image of the configured environment can be the baseline for further enhancements.

The following diagram illustrates a Docker-enabled environment with Spark, Anaconda, and the database server and their respective data volumes.


Docker offers the ability to clone and deploy an environment from the Dockerfile.

You can find an example Dockerfile with a PySpark and Anaconda setup at the following address: https://hub.docker.com/r/thisgokeboysef/pyspark-docker/~/dockerfile/.

Install Docker as per the instructions provided at the following links:

http://docs.docker.com/mac/started/ if you are on Mac OS X
http://docs.docker.com/linux/started/ if you are on Linux
http://docs.docker.com/windows/started/ if you are on Windows

Install the Docker container built from the Dockerfile provided earlier with the following command:

$ docker pull thisgokeboysef/pyspark-docker

Other great sources of information on how to dockerize your environment can be found at Lab41. The GitHub repository contains the necessary code:

https://github.com/Lab41/ipython-spark-docker

The supporting blog post is rich in information on the thought processes involved in building the docker environment: http://lab41.github.io/blog/2015/04/13/ipython-on-spark-on-docker/.


Summary

We set the context of building data-intensive apps by describing the overall architecture structured around the infrastructure, persistence, integration, analytics, and engagement layers. We also discussed Spark and Anaconda with their respective building blocks. We set up an environment in a VirtualBox with Anaconda and Spark and demonstrated a word count app using the text content of the first chapter as input.

In the next chapter, we will delve more deeply into the architecture blueprint for data-intensive apps and tap into the Twitter, GitHub, and Meetup APIs to get a feel of the data we will be mining with Spark.


Chapter 2. Building Batch and Streaming Apps with Spark

The objective of the book is to teach you about PySpark and the PyData libraries by building an app that analyzes the Spark community's interactions on social networks. We will gather information on Apache Spark from GitHub, check the relevant tweets on Twitter, and get a feel for the buzz around Spark in the broader open source software communities using Meetup.

In this chapter, we will outline the various sources of data and information. We will get an understanding of their structure. We will outline the data processing pipeline, from collection to batch and streaming processing.

In this section, we will cover the following points:

Outline data processing pipelines from collection to batch and stream processing, effectively depicting the architecture of the app we are planning to build.
Check out the various data sources (GitHub, Twitter, and Meetup), their data structure (JSON, structured information, unstructured text, geo-location, time series data, and so on), and their complexities. We also discuss the tools to connect to three different APIs, so you can build your own data mashups. The book will focus on Twitter in the following chapters.


Architecting data-intensive apps

We defined the data-intensive app framework architecture blueprint in the previous chapter. Let's put back in context the various software components we are going to use throughout the book in our original framework. Here's an illustration of the various components of software mapped in the data-intensive architecture framework:

Spark is an extremely efficient, distributed computing framework. In order to exploit its full power, we need to architect our solution accordingly. For performance reasons, the overall solution also needs to be aware of its usage in terms of CPU, storage, and network.

These imperatives drive the architecture of our solution:

Latency: This architecture combines slow and fast processing. Slow processing is done on historical data in batch mode. This is also called data at rest. This phase builds precomputed models and data patterns that will be used by the fast processing arm once live continuous data is fed into the system. Fast processing of data or real-time analysis of streaming data refers to data in motion. Data at rest is essentially processing data in batch mode with a longer latency. Data in motion refers to the streaming computation of data ingested in real time.
Scalability: Spark is natively linearly scalable through its distributed in-memory computing framework. Databases and data stores interacting with Spark need to also be able to scale linearly as data volume grows.
Fault tolerance: When a failure occurs due to hardware, software, or network reasons, the architecture should be resilient enough and provide availability at all times.
Flexibility: The data pipelines put in place in this architecture can be adapted and retrofitted very quickly depending on the use case.

Spark is unique as it allows batch processing and streaming analytics on the same unified platform.

We will consider two data processing pipelines:

The first one handles data at rest and is focused on putting together the pipeline for batch analysis of the data
The second one, data in motion, targets real-time data ingestion and delivering insights based on precomputed models and data patterns


Processing data at rest

Let's get an understanding of the data at rest or batch processing pipeline. The objective in this pipeline is to ingest the various datasets from Twitter, GitHub, and Meetup; prepare the data for Spark MLlib, the machine learning engine; and derive the base models that will be applied for insight generation in batch mode or in real time.

The following diagram illustrates the data pipeline in order to enable processing data at rest:


Processing data in motion

Processing data in motion introduces a new level of complexity, as we are introducing a new possibility of failure. If we want to scale, we need to consider bringing in distributed message queue systems such as Kafka. We will dedicate a subsequent chapter to understanding streaming analytics.

The following diagram depicts a data pipeline for processing data in motion:


Exploring data interactively

Building a data-intensive app is not as straightforward as exposing a database to a web interface. During the setup of both the data at rest and data in motion processing, we will capitalize on Spark's ability to analyze data interactively and refine the data richness and quality required for the machine learning and streaming activities. Here, we will go through an iterative cycle of data collection, refinement, and investigation in order to get to the dataset of interest for our apps.


Connecting to social networks

Let's delve into the first steps of the data-intensive app architecture's integration layer. We are going to focus on harvesting the data, ensuring its integrity and preparing for batch and streaming data processing by Spark at the next stage. This phase is described in the five process steps: connect, correct, collect, compose, and consume. These are iterative steps of data exploration that will get us acquainted with the data and help us refine the data structure for further processing.

The following diagram depicts the iterative process of data acquisition and refinement for consumption:

We connect to the social networks of interest: Twitter, GitHub, and Meetup. We will discuss the mode of access to the APIs (short for Application Programming Interface) and how to create a RESTful connection with those services while respecting the rate limitation imposed by the social networks. REST (short for Representational State Transfer) is the most widely adopted architectural style on the Internet in order to enable scalable web services. It relies on exchanging messages predominantly in JSON (short for JavaScript Object Notation). RESTful APIs and web services implement the four most prevalent verbs GET, PUT, POST, and DELETE. GET is used to retrieve an element or a collection from a given URI. PUT updates a collection with a new one. POST allows the creation of a new entry, while DELETE eliminates a collection.
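As a minimal sketch of what such a RESTful exchange looks like from Python (the endpoint URL and parameters here are placeholders, not one of the actual APIs used later), a GET request returning a JSON payload can be issued with the requests library:

import requests

# hypothetical RESTful GET request returning a JSON payload
response = requests.get('https://api.example.com/v1/items', params={'q': 'spark'})
items = response.json()   # decode the JSON body into Python objects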


Getting Twitter data

Twitter allows access to registered users to its search and streaming tweet services under an authorization protocol called OAuth that allows API applications to securely act on a user's behalf. In order to create the connection, the first step is to create an application with Twitter at https://apps.twitter.com/app/new.

Once the application has been created, Twitter will issue the four codes that will allow it to tap into the Twitter hose:

CONSUMER_KEY = 'GetYourKey@Twitter'
CONSUMER_SECRET = 'GetYourKey@Twitter'
OAUTH_TOKEN = 'GetYourToken@Twitter'
OAUTH_TOKEN_SECRET = 'GetYourToken@Twitter'

If you wish to get a feel for the various RESTful queries offered, you can explore the Twitter API on the dev console at https://dev.twitter.com/rest/tools/console:


We will make a programmatic connection on Twitter using the following code, which will activate our OAuth access and allow us to tap into the Twitter API under the rate limitation. In the streaming mode, the limitation is per GET request.
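A minimal sketch of this connection with the twitter library is shown here; the credential variables are the placeholders issued above, and the full TwitterAPI class used throughout the book follows in the next sections:

import twitter

# placeholder credentials issued at https://apps.twitter.com
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
api = twitter.Twitter(auth=auth)
# a sample search call, subject to the API rate limit
results = api.search.tweets(q='Apache Spark', count=10)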


Getting GitHub data

GitHub uses a similar authentication process to Twitter. Head to the developer site and retrieve your credentials after duly registering with GitHub at https://developer.github.com/v3/:


Getting Meetup data

Meetup can be accessed using the token issued in the developer resources to members of Meetup.com. The necessary token or OAuth credential for Meetup API access can be obtained on their developer's website at https://secure.meetup.com/meetup_api:


Analyzing the data

Let's get a first feel for the data extracted from each of the social networks and get an understanding of the data structure from each of these sources.


Discovering the anatomy of tweets

In this section, we are going to establish connection with the Twitter API. Twitter offers two connection modes: the REST API, which allows us to search historical tweets for a given search term or hashtag, and the streaming API, which delivers real-time tweets under the rate limit in place.

In order to get a better understanding of how to operate with the Twitter API, we will go through the following steps:

1. Install the Twitter Python library.
2. Establish a connection programmatically via OAuth, the authentication required for Twitter.
3. Search for recent tweets for the query Apache Spark and explore the results obtained.
4. Decide on the key attributes of interest and retrieve the information from the JSON output.

Let's go through it step-by-step:

1. Install the Python Twitter library. In order to install it, you need to write pip install twitter from the command line:

$ pip install twitter

2. Create the Python Twitter API class and its base methods for authentication, searching, and parsing the results. self.auth gets the credentials from Twitter. It then creates a registered API as self.api. We have implemented two methods: the first one to search Twitter with a given query and the second one to parse the output to retrieve relevant information such as the tweet ID, the tweet text, and the tweet author. The code is as follows:

import twitter
import urlparse
from pprint import pprint as pp

class TwitterAPI(object):
    """
    TwitterAPI class allows the Connection to Twitter via OAuth
    once you have registered with Twitter and receive the
    necessary credentials
    """

    # initialize and get the twitter credentials
    def __init__(self):
        consumer_key = 'Provide your credentials'
        consumer_secret = 'Provide your credentials'
        access_token = 'Provide your credentials'
        access_secret = 'Provide your credentials'

        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_secret = access_secret
        #
        # authenticate credentials with Twitter using OAuth
        self.auth = twitter.oauth.OAuth(access_token, access_secret,
                                        consumer_key, consumer_secret)
        # creates registered Twitter API
        self.api = twitter.Twitter(auth=self.auth)
        #

    # search Twitter with query q (i.e. "Apache Spark") and max. result
    def searchTwitter(self, q, max_res=10, **kwargs):
        search_results = self.api.search.tweets(q=q, count=10, **kwargs)
        statuses = search_results['statuses']
        max_results = min(1000, max_res)

        for _ in range(10):
            try:
                next_results = search_results['search_metadata']['next_results']
            except KeyError as e:
                break

            next_results = urlparse.parse_qsl(next_results[1:])
            kwargs = dict(next_results)
            search_results = self.api.search.tweets(**kwargs)
            statuses += search_results['statuses']

            if len(statuses) > max_results:
                break
        return statuses

    #
    # parse tweets as it is collected to extract id, creation
    # date, user id, tweet text
    def parseTweets(self, statuses):
        return [(status['id'],
                 status['created_at'],
                 status['user']['id'],
                 status['user']['name'],
                 status['text'], url['expanded_url'])
                for status in statuses
                for url in status['entities']['urls']]

3. Instantiate the class with the required authentication:

t = TwitterAPI()

4. Run a search on the query term Apache Spark:

q = "Apache Spark"
tsearch = t.searchTwitter(q)

5. Analyze the JSON output:

pp(tsearch[1])


{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Sat Apr 25 14:50:57 +0000 2015',
 u'entities': {u'hashtags': [{u'indices': [74, 86], u'text': u'sparksummit'}],
               u'media': [{u'display_url': u'pic.twitter.com/WKUMRXxIWZ',
                           u'expanded_url': u'http://twitter.com/bigdata/status/591976255831969792/photo/1',
                           u'id': 591976255156715520,
                           u'id_str': u'591976255156715520',
                           u'indices': [143, 144],
                           u'media_url':
...(snip)...
 u'text': u'RT @bigdata: Enjoyed catching up with @ApacheSpark users &amp; leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
           u'created_at': u'Sat Apr 04 14:44:31 +0000 2015',
           u'default_profile': True,
           u'default_profile_image': True,
           u'description': u'',
           u'entities': {u'description': {u'urls': []}},
           u'favourites_count': 0,
           u'follow_request_sent': False,
           u'followers_count': 586,
           u'following': False,
           u'friends_count': 2,
           u'geo_enabled': False,
           u'id': 3139047660,
           u'id_str': u'3139047660',
           u'is_translation_enabled': False,
           u'is_translator': False,
           u'lang': u'zh-cn',
           u'listed_count': 749,
           u'location': u'',
           u'name': u'MegaDataMama',
           u'notifications': False,
           u'profile_background_color': u'C0DEED',
           u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
           u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png',
...(snip)...
           u'screen_name': u'MegaDataMama',
           u'statuses_count': 26673,
           u'time_zone': None,
           u'url': None,
           u'utc_offset': None,
           u'verified': False}}

6. Parse the Twitter output to retrieve key information of interest:

tparsed = t.parseTweets(tsearch)
pp(tparsed)


[(591980327784046592,
  u'Sat Apr 25 15:01:23 +0000 2015',
  63407360,
  u'Jos\xe9 Carlos Baquero',
  u'Big Data systems are making a difference in the fight against cancer. #BigData #ApacheSpark http://t.co/pnOLmsKdL9',
  u'http://tmblr.co/ZqTggs1jHytN0'),
 (591977704464875520,
  u'Sat Apr 25 14:50:57 +0000 2015',
  3139047660,
  u'MegaDataMama',
  u'RT @bigdata: Enjoyed catching up with @ApacheSpark users &amp; leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
  u'http://goo.gl/eF5xwK'),
 (591977172589539328,
  u'Sat Apr 25 14:48:51 +0000 2015',
  2997608763,
  u'Emma Clark',
  u'RT @bigdata: Enjoyed catching up with @ApacheSpark users &amp; leaders at #sparksummit NYC: video clips are out http://t.co/qrqpP6cG9s http://t\u2026',
  u'http://goo.gl/eF5xwK'),
 ...(snip)...
 (591879098349268992,
  u'Sat Apr 25 08:19:08 +0000 2015',
  331263208,
  u'Mario Molina',
  u'#ApacheSpark speeds up big data decision-making http://t.co/8hdEXreNfN',
  u'http://www.computerweekly.com/feature/Apache-Spark-speeds-up-big-data-decision-making')]


Exploring the GitHub world

In order to get a better understanding on how to operate with the GitHub API, we will go through the following steps:

1. Install the GitHub Python library.
2. Access the API by using the token provided when we registered in the developer website.
3. Retrieve some key facts on the Apache foundation that is hosting the spark repository.

Let's go through the process step-by-step:

1. Install the Python PyGithub library. In order to install it, you need to run pip install PyGithub from the command line:

pip install PyGithub

2. Programmatically create a client to instantiate the GitHub API:

from github import Github

# Get your own access token
ACCESS_TOKEN = 'Get_Your_Own_Access_Token'

# We are focusing our attention to User = apache and Repo = spark
USER = 'apache'
REPO = 'spark'

g = Github(ACCESS_TOKEN, per_page=100)
user = g.get_user(USER)
repo = user.get_repo(REPO)

3. Retrieve key facts from the Apache User. There are 640 active Apache repositories in GitHub:

repos_apache = [repo.name for repo in g.get_user('apache').get_repos()]
len(repos_apache)
640

4. Retrieve key facts from the Spark repository. The programming languages used in the Spark repo are given here under:

pp(repo.get_languages())

{u'C': 1493,
 u'CSS': 4472,
 u'Groff': 5379,
 u'Java': 1054894,
 u'JavaScript': 21569,
 u'Makefile': 7771,
 u'Python': 1091048,
 u'R': 339201,
 u'Scala': 10249122,
 u'Shell': 172244}

5. Retrieve a few key participants of the wide Spark GitHub repository network. There are 3,738 stargazers in the Apache Spark repository at the time of writing. The network is immense. The first stargazer is Matei Zaharia, the cofounder of the Spark project when he was doing his PhD in Berkeley.

stargazers = [s for s in repo.get_stargazers()]
print "Number of stargazers", len(stargazers)
Number of stargazers 3738

[stargazers[i].login for i in range(0, 20)]
[u'mateiz',
 u'beyang',
 u'abo',
 u'CodingCat',
 u'andy327',
 u'CrazyJvm',
 u'jyotiska',
 u'BaiGang',
 u'sundstei',
 u'dianacarroll',
 u'ybotco',
 u'xelax',
 u'prabeesh',
 u'invkrh',
 u'bedla',
 u'nadesai',
 u'pcpratts',
 u'narkisr',
 u'Honghe',
 u'Jacke']


Understanding the community through Meetup

In order to get a better understanding of how to operate with the Meetup API, we will go through the following steps:

1. Create a Python program to call the Meetup API using an authentication token.
2. Retrieve information of past events for meetup groups such as London Data Science.
3. Retrieve the profile of the meetup members in order to analyze their participation in similar meetup groups.

Let's go through the process step-by-step:

1. As there is no reliable Meetup API Python library, we will programmatically create a client to instantiate the Meetup API:

import json
import mimeparse
import requests
import urllib
from pprint import pprint as pp

MEETUP_API_HOST = 'https://api.meetup.com'
EVENTS_URL = MEETUP_API_HOST + '/2/events.json'
MEMBERS_URL = MEETUP_API_HOST + '/2/members.json'
GROUPS_URL = MEETUP_API_HOST + '/2/groups.json'
RSVPS_URL = MEETUP_API_HOST + '/2/rsvps.json'
PHOTOS_URL = MEETUP_API_HOST + '/2/photos.json'
GROUP_URLNAME = 'London-Machine-Learning-Meetup'
# GROUP_URLNAME = 'London-Machine-Learning-Meetup'  # 'Data-Science-London'

class MeetupAPI(object):
    """
    Retrieves information about meetup.com
    """
    def __init__(self, api_key, num_past_events=10, http_timeout=1,
                 http_retries=2):
        """
        Create a new instance of MeetupAPI
        """
        self._api_key = api_key
        self._http_timeout = http_timeout
        self._http_retries = http_retries
        self._num_past_events = num_past_events

    def get_past_events(self):
        """
        Get past meetup events for a given meetup group
        """
        params = {'key': self._api_key,
                  'group_urlname': GROUP_URLNAME,
                  'status': 'past',
                  'desc': 'true'}
        if self._num_past_events:
            params['page'] = str(self._num_past_events)

        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(EVENTS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data

    def get_members(self):
        """
        Get meetup members for a given meetup group
        """
        params = {'key': self._api_key,
                  'group_urlname': GROUP_URLNAME,
                  'offset': '0',
                  'format': 'json',
                  'page': '100',
                  'order': 'name'}
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(MEMBERS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data

    def get_groups_by_member(self, member_id='38680722'):
        """
        Get meetup groups for a given meetup member
        """
        params = {'key': self._api_key,
                  'member_id': member_id,
                  'offset': '0',
                  'format': 'json',
                  'page': '100',
                  'order': 'id'}
        query = urllib.urlencode(params)
        url = '{0}?{1}'.format(GROUPS_URL, query)
        response = requests.get(url, timeout=self._http_timeout)
        data = response.json()['results']
        return data

2. Then, we will retrieve past events from a given Meetup group:

m = MeetupAPI(api_key='Get_Your_Own_Key')
last_meetups = m.get_past_events()
pp(last_meetups[5])

{u'created': 1401809093000,
 u'description': u"<p>We are hosting a joint meetup between Spark London and Machine Learning London. Given the excitement in the machine learning community around Spark at the moment a joint meetup is in order!</p> <p>Michael Armbrust from the Apache Spark core team will be flying over from the States to give us a talk in person.\xa0Thanks to our sponsors, Cloudera, MapR and Databricks for helping make this happen.</p> <p>The first part of the talk will be about MLlib, the machine learning library for Spark,\xa0and the second part, on\xa0Spark SQL.</p> <p>Don't sign up if you have already signed up on the Spark London page though!</p> <p>\n\n\nAbstract for part one:</p> <p>In this talk, we\u2019ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark\u2019s high-level API, we can process raw data with familiar libraries in Java, Scala or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We\u2019ll also cover upcoming development work including new built-in algorithms and R bindings.</p> <p>\n\n\n\nAbstract for part two:\xa0</p> <p>In this talk, we'll examine Spark SQL, a new Alpha component that is part of the Apache Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query external data with complex analytics. In addition to Spark SQL, we'll explore the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans to execute more efficiently.</p>",
 u'event_url': u'http://www.meetup.com/London-Machine-Learning-Meetup/events/186883262/',
 u'group': {u'created': 1322826414000,
            u'group_lat': 51.52000045776367,
            u'group_lon': -0.18000000715255737,
            u'id': 2894492,
            u'join_mode': u'open',
            u'name': u'London Machine Learning Meetup',
            u'urlname': u'London-Machine-Learning-Meetup',
            u'who': u'Machine Learning Enthusiasts'},
 u'headcount': 0,
 u'id': u'186883262',
 u'maybe_rsvp_count': 0,
 u'name': u'Joint Spark London and Machine Learning Meetup',
 u'rating': {u'average': 4.800000190734863, u'count': 5},
 u'rsvp_limit': 70,
 u'status': u'past',
 u'time': 1403200800000,
 u'updated': 1403450844000,
 u'utc_offset': 3600000,
 u'venue': {u'address_1': u'12 Errol St, London',
            u'city': u'EC1Y 8LX',
            u'country': u'gb',
            u'id': 19504802,
            u'lat': 51.522533,
            u'lon': -0.090934,
            u'name': u'Royal Statistical Society',
            u'repinned': False},
 u'visibility': u'public',
 u'waitlist_count': 84,
 u'yes_rsvp_count': 70}

3. Get information about the Meetup members:

members = m.get_members()


{u'city': u'London',
 u'country': u'gb',
 u'hometown': u'London',
 u'id': 11337881,
 u'joined': 1421418896000,
 u'lat': 51.53,
 u'link': u'http://www.meetup.com/members/11337881',
 u'lon': -0.09,
 u'name': u'Abhishek Shivkumar',
 u'other_services': {u'twitter': {u'identifier': u'@abhisemweb'}},
 u'photo': {u'highres_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/highres_10898643.jpeg',
            u'photo_id': 10898643,
            u'photo_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/member_10898643.jpeg',
            u'thumb_link': u'http://photos3.meetupstatic.com/photos/member/9/6/f/3/thumb_10898643.jpeg'},
 u'self': {u'common': {}},
 u'state': u'17',
 u'status': u'active',
 u'topics': [{u'id': 1372, u'name': u'Semantic Web', u'urlkey': u'semweb'},
             {u'id': 1512, u'name': u'XML', u'urlkey': u'xml'},
             {u'id': 49585,
              u'name': u'Semantic Social Networks',
              u'urlkey': u'semantic-social-networks'},
             {u'id': 24553,
              u'name': u'Natural Language Processing',
...(snip)...
              u'name': u'Android Development',
              u'urlkey': u'android-developers'}],
 u'visited': 1429281599000}


Previewing our app

Our challenge is to make sense of the data retrieved from these social networks, finding the key relationships and deriving insights. Some of the elements of interest are as follows:

Visualizing the top influencers: Discover the top influencers in the community:
    Heavy Twitter users on Apache Spark
    Committers in GitHub
    Leading Meetup presentations
Understanding the Network: Network graph of GitHub committers, watchers, and stargazers
Identifying the Hot Locations: Locating the most active location for Spark

The following screenshot provides a preview of our app:


Summary

In this chapter, we laid out the overall architecture of our app. We explained the two main paradigms of processing data: batch processing, also called data at rest, and streaming analytics, referred to as data in motion. We proceeded to establish connections to three social networks of interest: Twitter, GitHub, and Meetup. We sampled the data and provided a preview of what we are aiming to build. The remainder of the book will focus on the Twitter dataset. We provided here the tools and API to access three social networks, so you can at a later stage create your own data mashups. We are now ready to investigate the data collected, which will be the topic of the next chapter.

In the next chapter, we will delve deeper into data analysis, extracting the key attributes of interest for our purposes and managing the storage of the information for batch and stream processing.


Chapter 3. Juggling Data with Spark

As per the batch and streaming architecture laid out in the previous chapter, we need data to fuel our applications. We will harvest data focused on Apache Spark from Twitter. The objective of this chapter is to prepare data to be further used by the machine learning and streaming applications. This chapter focuses on how to exchange code and data across the distributed network. We will get practical insights into serialization, persistence, marshaling, and caching. We will get to grips with Spark SQL, the key Spark module to interactively explore structured and semi-structured data. The fundamental data structure powering Spark SQL is the Spark dataframe. The Spark dataframe is inspired by the Python Pandas dataframe and the R dataframe. It is a powerful data structure, well understood and appreciated by data scientists with a background in R or Python.

In this chapter, we will cover the following points:

Connect to Twitter, collect the relevant data, and then persist it in various formats such as JSON and CSV and data stores such as MongoDB
Analyze the data using Blaze and Odo, a spin-off library from Blaze, in order to connect and transfer data from various sources and destinations
Introduce Spark dataframes as the foundation for data interchange between the various Spark modules and explore data interactively using Spark SQL


Revisiting the data-intensive app architecture

Let's first put in context the focus of this chapter with respect to the data-intensive app architecture. We will concentrate our attention on the integration layer and essentially run through iterative cycles of the acquisition, refinement, and persistence of the data. This cycle was termed the five Cs. The five Cs stand for connect, collect, correct, compose, and consume. They are the essential processes we run through in the integration layer in order to get to the right quality and quantity of data retrieved from Twitter. We will also delve deeper in the persistence layer and set up a data store such as MongoDB to collect our data for processing later.

We will explore the data with Blaze, a Python library for data manipulation, and Spark SQL, the interactive module of Spark for data discovery powered by the Spark dataframe. The dataframe paradigm is shared by Python Pandas, Python Blaze, and Spark SQL. We will get a feel for the nuances of the three dataframe flavors.

The following diagram sets the context of the chapter's focus, highlighting the integration layer and the persistence layer:


Serializing and deserializing data

As we are harvesting data from web APIs under rate limit constraints, we need to store it. As the data is processed on a distributed cluster, we need consistent ways to save state and retrieve it for later usage.

Let's now define serialization, persistence, marshaling, and caching or memoization.

Serializing a Python object converts it into a stream of bytes. The Python object needs to be retrieved beyond the scope of its existence, when the program is shut down. The serialized Python object can be transferred over a network or stored in a persistent storage. Deserialization is the opposite and converts the stream of bytes into the original Python object so the program can carry on from the saved state. The most popular serialization library in Python is Pickle. As a matter of fact, the PySpark commands are transferred over the wire to the worker nodes via pickled data.
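As a minimal illustration of serializing and deserializing with Pickle (the object and the file name are chosen purely for the example):

import pickle

tweet = {'id': 1, 'text': 'Apache Spark is fast'}
# serialize the Python object to a byte stream on disk
with open('tweet.pkl', 'wb') as f:
    pickle.dump(tweet, f)
# deserialize the byte stream back into an equivalent Python object
with open('tweet.pkl', 'rb') as f:
    restored = pickle.load(f)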

Persistence saves a program's state data to disk or memory so that it can carry on where it left off upon restart. It saves a Python object from memory to a file or a database and loads it later with the same state.

Marshalling sends Python code or data over a network TCP connection in a multicore or distributed system.

Caching converts a Python object to a string in memory so that it can be used as a dictionary key later on. Spark supports pulling a dataset into a cluster-wide, in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small reference dataset or running an iterative algorithm such as Google PageRank.

Caching is a crucial concept for Spark as it allows us to save RDDs in memory or with a spillage to disk. The caching strategy can be selected based on the lineage of the data or the DAG (short for Directed Acyclic Graph) of transformations applied to the RDDs in order to minimize shuffle or cross network heavy data exchange. In order to achieve good performance with Spark, beware of data shuffling. A good partitioning policy and use of RDD caching, coupled with avoiding unnecessary action operations, leads to better performance with Spark.
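A minimal sketch of RDD caching, assuming a SparkContext sc and an illustrative input file:

# illustrative RDD reused by several actions
words = sc.textFile('hdfs:///data/tweets.txt').flatMap(lambda line: line.split())
words.cache()                     # keep the computed partitions in cluster memory
print(words.count())              # first action materializes and caches the RDD
print(words.distinct().count())   # later actions reuse the cached partitions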


Harvesting and storing data

Before delving into database persistent storage such as MongoDB, we will look at some useful file storages that are widely used: CSV (short for comma-separated values) and JSON (short for JavaScript Object Notation) file storage. The enduring popularity of these two file formats lies in a few key reasons: they are human readable, simple, relatively lightweight, and easy to use.


Persisting data in CSV

The CSV format is lightweight, human readable, and easy to use. It has delimited text columns with an inherent tabular schema.

Python offers a robust csv library that can serialize a csv file into a Python dictionary. For the purpose of our program, we have written a Python class that manages to persist data in CSV format and read from a given CSV.

Let's run through the code of the class IO_csv object. The __init__ section of the class basically instantiates the file path, the filename, and the file suffix (in this case, .csv):

class IO_csv(object):

    def __init__(self, filepath, filename, filesuffix='csv'):
        self.filepath = filepath      # /path/to/file without the '/' at the end
        self.filename = filename      # FILE_NAME
        self.filesuffix = filesuffix

The save method of the class uses a Python named tuple and the header fields of the csv file in order to impart a schema while persisting the rows of the CSV. If the csv file already exists, it will be appended and not overwritten; otherwise, it will be created:

    def save(self, data, NTname, fields):
        # NTname = Name of the NamedTuple
        # fields = header of CSV - list of the fields name
        NTuple = namedtuple(NTname, fields)

        if os.path.isfile('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix)):
            # Append existing file
            with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'ab') as f:
                writer = csv.writer(f)
                # writer.writerow(fields)  # fields = header of CSV
                writer.writerows([row for row in map(NTuple._make, data)])
                # list comprehension using map on the NamedTuple._make() iterable and the data file to be saved
                # Notice writer.writerows and not writer.writerow (i.e. list of multiple rows sent to csv file)
        else:
            # Create new file
            with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'wb') as f:
                writer = csv.writer(f)
                writer.writerow(fields)  # fields = header of CSV - list of the fields name
                writer.writerows([row for row in map(NTuple._make, data)])
                # list comprehension using map on the NamedTuple._make() iterable and the data file to be saved
                # Notice writer.writerows and not writer.writerow (i.e. list of multiple rows sent to csv file)

The load method of the class also uses a Python named tuple and the header fields of the csv file in order to retrieve the data using a consistent schema. The load method is a memory-efficient generator to avoid loading a huge file in memory: hence we use yield in place of return:

    def load(self, NTname, fields):
        # NTname = Name of the NamedTuple
        # fields = header of CSV - list of the fields name
        NTuple = namedtuple(NTname, fields)

        with open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix), 'rU') as f:
            reader = csv.reader(f)
            for row in map(NTuple._make, reader):
                # Using map on the NamedTuple._make() iterable and the reader file to be loaded
                yield row

Here's the named tuple. We are using it to parse the tweet in order to save or retrieve them to and from the csv file:

fields01 = ['id', 'created_at', 'user_id', 'user_name', 'tweet_text', 'url']
Tweet01 = namedtuple('Tweet01', fields01)

def parse_tweet(data):
    """
    Parse a ``tweet`` from the given response data.
    """
    return Tweet01(
        id=data.get('id', None),
        created_at=data.get('created_at', None),
        user_id=data.get('user_id', None),
        user_name=data.get('user_name', None),
        tweet_text=data.get('tweet_text', None),
        url=data.get('url')
    )
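Putting the pieces together, a hedged usage sketch of the IO_csv class with the parse_tweet helper could look like the following; the path and the raw_tweets list are illustrative placeholders, not names used elsewhere in the book:

# raw_tweets is assumed to be a list of tweet dictionaries collected earlier
csv_io = IO_csv('/home/an/data', 'twtr_tweets')
parsed = [parse_tweet(t) for t in raw_tweets]
csv_io.save(parsed, 'Tweet01', fields01)         # persist rows with the Tweet01 schema
for row in csv_io.load('Tweet01', fields01):     # lazily iterate over the saved rows
    print(row.user_name, row.tweet_text)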


Persisting data in JSON

JSON is one of the most popular data formats for Internet-based applications. All the APIs we are dealing with, Twitter, GitHub, and Meetup, deliver their data in JSON format. The JSON format is relatively lightweight compared to XML and human readable, and the schema is embedded in JSON. As opposed to the CSV format, where all records follow exactly the same tabular structure, JSON records can vary in their structure. JSON is semi-structured. A JSON record can be mapped into a Python dictionary of dictionaries.

Let's run through the code of the class IO_json object. The __init__ section of the class basically instantiates the file path, the filename, and the file suffix (in this case, .json):

class IO_json(object):

    def __init__(self, filepath, filename, filesuffix='json'):
        self.filepath = filepath      # /path/to/file without the '/' at the end
        self.filename = filename      # FILE_NAME
        self.filesuffix = filesuffix
        # self.file_io = os.path.join(dir_name, '.'.join((base_filename, filename_suffix)))

The save method of the class uses utf-8 encoding in order to ensure read and write compatibility of the data. If the JSON file already exists, it will be appended and not overwritten; otherwise it will be created:

    def save(self, data):
        if os.path.isfile('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix)):
            # Append existing file
            with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix),
                         'a', encoding='utf-8') as f:
                f.write(unicode(json.dumps(data, ensure_ascii=False)))
                # In python 3, there is no "unicode" function
                # f.write(json.dumps(data, ensure_ascii=False))  # creates a \" escape char for " in the saved file
        else:
            # Create new file
            with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix),
                         'w', encoding='utf-8') as f:
                f.write(unicode(json.dumps(data, ensure_ascii=False)))
                # f.write(json.dumps(data, ensure_ascii=False))

The load method of the class just returns the file that has been read. A further json.loads function needs to be applied in order to retrieve the json out of the file read:

    def load(self):
        with io.open('{0}/{1}.{2}'.format(self.filepath, self.filename, self.filesuffix),
                     encoding='utf-8') as f:
            return f.read()
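A minimal usage sketch for a single save, with an illustrative path and statuses standing for a list of tweet dictionaries returned by the API, showing the extra json.loads step on load:

import json

json_io = IO_json('/home/an/data', 'twtr_tweets')
json_io.save(statuses)        # statuses: list of tweet dicts collected from the API
raw = json_io.load()          # returns the file content as a string
tweets = json.loads(raw)      # decode the string back into Python objects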


Setting up MongoDB

It is crucial to store the information harvested. Thus, we set up MongoDB as our main document data store. As all the information collected is in JSON format and MongoDB stores information in BSON (short for Binary JSON), it is therefore a natural choice.

We will run through the following steps now:

Installing the MongoDB server and client
Running the MongoDB server
Running the Mongo client
Installing the PyMongo driver
Creating the Python Mongo client

Installing the MongoDB server and client

In order to install the MongoDB package, perform the following steps:

1. Import the public key used by the package management system (in our case, Ubuntu's apt). To import the MongoDB public key, we issue the following command:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

2. Create a list file for MongoDB. To create the list file, we use the following command:

echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list

3. Update the local package database as sudo:

sudo apt-get update

4. Install the MongoDB packages. We install the latest stable version of MongoDB with the following command:

sudo apt-get install -y mongodb-org

Running the MongoDB server

Let's start the MongoDB server:

1. To start the MongoDB server, we issue the following command to start mongod:

sudo service mongodb start

2. To check whether mongod has started properly, we issue the command:

an@an-VB:/usr/bin$ ps -ef | grep mongo
mongodb    967     1  4 07:03 ?        00:02:02 /usr/bin/mongod --config /etc/mongod.conf
an        3143  3085  0 07:45 pts/3    00:00:00 grep --color=auto mongo


In this case, we see that mongodb is running in process 967.

3. The mongod server sends a message to the effect that it is waiting for connection on port 27017. This is the default port for MongoDB. It can be changed in the configuration file.

4. We can check the contents of the log file at /var/log/mongod/mongod.log:

an@an-VB:/var/lib/mongodb$ ls -lru
total 81936
drwxr-xr-x 2 mongodb nogroup     4096 Apr 25 11:19 _tmp
-rw-r--r-- 1 mongodb nogroup       69 Apr 25 11:19 storage.bson
-rwxr-xr-x 1 mongodb nogroup        5 Apr 25 11:19 mongod.lock
-rw------- 1 mongodb nogroup 16777216 Apr 25 11:19 local.ns
-rw------- 1 mongodb nogroup 67108864 Apr 25 11:19 local.0
drwxr-xr-x 2 mongodb nogroup     4096 Apr 25 11:19 journal

5. In order to stop the mongodb server, just issue the following command:

sudo service mongodb stop

Running the Mongo client

Running the Mongo client in the console is as easy as calling mongo, as highlighted in the following command:

an@an-VB:/usr/bin$ mongo
MongoDB shell version: 3.0.2
connecting to: test
Server has startup warnings:
2015-05-30T07:03:49.387+0200 I CONTROL  [initandlisten]
2015-05-30T07:03:49.388+0200 I CONTROL  [initandlisten]

At the mongo client console prompt, we can see the databases with the following commands:

> show dbs
local  0.078GB
test   0.078GB

We select the test database using use test:

> use test
switched to db test

We display the collections within the test database:

> show collections
restaurants
system.indexes

We check a sample record in the restaurants collection listed previously:

> db.restaurants.find()
{ "_id" : ObjectId("553b70055e82e7b824ae0e6f"), "address" : { "building" : "1007", "coord" : [ -73.856077, 40.848447 ], "street" : "Morris Park Ave", "zipcode" : "10462" }, "borough" : "Bronx", "cuisine" : "Bakery", "grades" : [ { "grade" : "A", "score" : 2, "date" : ISODate("2014-03-03T00:00:00Z") }, { "date" : ISODate("2013-09-11T00:00:00Z"), "grade" : "A", "score" : 6 }, { "score" : 10, "date" : ISODate("2013-01-24T00:00:00Z"), "grade" : "A" }, { "date" : ISODate("2011-11-23T00:00:00Z"), "grade" : "A", "score" : 9 }, { "date" : ISODate("2011-03-10T00:00:00Z"), "grade" : "B", "score" : 14 } ], "name" : "Morris Park Bake Shop", "restaurant_id" : "30075445" }

Installing the PyMongo driver

Installing the Python driver with anaconda is easy. Just run the following command at the terminal:

conda install pymongo

Creating the Python client for MongoDB

We are creating an IO_mongo class that will be used in our harvesting and processing programs to store the data collected and retrieve saved information. In order to create the mongo client, we will import the MongoClient module from pymongo. We connect to the mongodb server on localhost at port 27017. The command is as follows:

from pymongo import MongoClient as MCli

class IO_mongo(object):

    conn = {'host': 'localhost', 'ip': '27017'}

We initialize our class with the client connection, the database (in this case, twtr_db), and the collection (in this case, twtr_coll) to be accessed:

    def __init__(self, db='twtr_db', coll='twtr_coll', **conn):
        # Connects to the MongoDB server
        self.client = MCli(**conn)
        self.db = self.client[db]
        self.coll = self.db[coll]

The save method inserts new records in the preinitialized collection and database:

    def save(self, data):
        # Insert to collection in db
        return self.coll.insert(data)

The load method allows the retrieval of specific records according to criteria and projection. In the case of a large amount of data, it returns a cursor:

    def load(self, return_cursor=False, criteria=None, projection=None):
        if criteria is None:
            criteria = {}

        if projection is None:
            cursor = self.coll.find(criteria)
        else:
            cursor = self.coll.find(criteria, projection)

        # Return a cursor for large amounts of data
        if return_cursor:
            return cursor
        else:
            return [item for item in cursor]
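An illustrative usage sketch of the IO_mongo class, relying on the default database and collection names defined above; the record inserted is a made-up example:

mongo_io = IO_mongo()                               # defaults to twtr_db / twtr_coll on localhost
mongo_io.save({'id': 1, 'text': 'Spark tweet'})     # insert a single record
records = mongo_io.load(criteria={'id': 1})         # retrieve matching records as a list
cursor = mongo_io.load(return_cursor=True)          # or get a cursor for large result sets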


Harvesting data from Twitter

Each social network poses its limitations and challenges. One of the main obstacles for harvesting data is an imposed rate limit. While running repeated or long-running connections between rate limit pauses, we have to be careful to avoid collecting duplicate data.

We have redesigned our connection programs outlined in the previous chapter to take care of the rate limits.

In this TwitterAPI class that connects and collects the tweets according to the search query we specify, we have added the following:

Logging capability using the Python logging library with the aim of collecting any errors or warnings in the case of program failure
Persistence capability using MongoDB, with the IO_mongo class exposed previously, as well as JSON file using the IO_json class
API rate limit and error management capability, so we can ensure more resilient calls to Twitter without getting barred for tapping into the firehose

Let's go through the steps:

1. We initialize by instantiating the Twitter API with our credentials:

class TwitterAPI(object):
    """
    TwitterAPI class allows the Connection to Twitter via OAuth
    once you have registered with Twitter and receive the
    necessary credentials
    """

    def __init__(self):
        consumer_key = 'get_your_credentials'
        consumer_secret = 'get_your_credentials'
        access_token = 'get_your_credentials'
        access_secret = 'get_your_credentials'

        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_secret = access_secret
        self.retries = 3

        self.auth = twitter.oauth.OAuth(access_token, access_secret,
                                        consumer_key, consumer_secret)
        self.api = twitter.Twitter(auth=self.auth)

2. We initialize the logger by providing the log level:

logger.debug('debug message')
logger.info('info message')
logger.warn('warn message')
logger.error('error message')
logger.critical('critical message')


3. We set the log path and the message format:

# logger initialisation
appName = 'twt150530'
self.logger = logging.getLogger(appName)
# self.logger.setLevel(logging.DEBUG)
# create console handler and set level to debug
logPath = '/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data'
fileName = appName
fileHandler = logging.FileHandler("{0}/{1}.log".format(logPath, fileName))
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fileHandler.setFormatter(formatter)
self.logger.addHandler(fileHandler)
self.logger.setLevel(logging.DEBUG)

4. We initialize the JSON file persistence instruction:

# Save to JSON file initialisation
jsonFpath = '/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data'
jsonFname = 'twtr15053001'
self.jsonSaver = IO_json(jsonFpath, jsonFname)

5. We initialize the MongoDB database and collection for persistence:

# Save to MongoDB Initialisation
self.mongoSaver = IO_mongo(db='twtr01_db', coll='twtr01_coll')

6. The method searchTwitter launches the search according to the query specified:

def searchTwitter(self, q, max_res=10, **kwargs):
    search_results = self.api.search.tweets(q=q, count=10, **kwargs)
    statuses = search_results['statuses']
    max_results = min(1000, max_res)

    for _ in range(10):
        try:
            next_results = search_results['search_metadata']['next_results']
            # self.logger.info('info in searchTwitter - next_results: %s' % next_results[1:])
        except KeyError as e:
            self.logger.error('error in searchTwitter: %s' % (e))
            break

        # next_results = urlparse.parse_qsl(next_results[1:])  # python 2.7
        next_results = urllib.parse.parse_qsl(next_results[1:])
        # self.logger.info('info in searchTwitter - next_results[max_id]: %s' % next_results[0:])
        kwargs = dict(next_results)
        # self.logger.info('info in searchTwitter - next_results[max_id]: %s' % kwargs['max_id'])
        search_results = self.api.search.tweets(**kwargs)
        statuses += search_results['statuses']
        self.saveTweets(search_results['statuses'])

        if len(statuses) > max_results:
            self.logger.info('info in searchTwitter - got %i tweets - max: %i' % (len(statuses), max_results))
            break
    return statuses

7. The saveTweets method actually saves the collected tweets in JSON and in MongoDB:

def saveTweets(self, statuses):
    # Saving to JSON File
    self.jsonSaver.save(statuses)

    # Saving to MongoDB
    for s in statuses:
        self.mongoSaver.save(s)

8. The parseTweets method allows us to extract the key tweet information from the vast amount of information provided by the Twitter API:

def parseTweets(self, statuses):
    return [(status['id'],
             status['created_at'],
             status['user']['id'],
             status['user']['name'],
             status['text'],
             url['expanded_url'])
            for status in statuses
            for url in status['entities']['urls']]

9. The getTweets method calls the searchTwitter method described previously. The getTweets method ensures that API calls are made reliably whilst respecting the imposed rate limit. The code is as follows:

def getTweets(self, q, max_res=10):
    """
    Make a Twitter API call whilst managing rate limit and errors.
    """
    def handleError(e, wait_period=2, sleep_when_rate_limited=True):
        if wait_period > 3600:  # Seconds
            self.logger.error('Too many retries in getTweets: %s', e)
            raise e
        if e.e.code == 401:
            self.logger.error('error 401 * Not Authorised * in getTweets: %s', e)
            return None
        elif e.e.code == 404:
            self.logger.error('error 404 * Not Found * in getTweets: %s', e)
            return None
        elif e.e.code == 429:
            self.logger.error('error 429 * API Rate Limit Exceeded * in getTweets: %s', e)
            if sleep_when_rate_limited:
                self.logger.error('error 429 * Retrying in 15 minutes * in getTweets: %s', e)
                sys.stderr.flush()
                time.sleep(60 * 15 + 5)
                self.logger.info('error 429 * Retrying now * in getTweets: %s', e)
                return 2
            else:
                raise e  # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            self.logger.info('Encountered %i Error. Retrying in %i seconds'
                             % (e.e.code, wait_period))
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            self.logger.error('Exit - aborting - %s', e)
            raise e

10. Here, we are calling the searchTwitter API with the relevant query based on the parameters specified. If we encounter any error such as rate limitation from the provider, this will be processed by the handleError method:

    while True:
        try:
            self.searchTwitter(q, max_res=10)
        except twitter.api.TwitterHTTPError as e:
            error_count = 0
            wait_period = handleError(e, wait_period)
            if wait_period is None:
                return
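With the harvester class in place, a search becomes a one-liner. The following is a minimal usage sketch; the class name TwitterAPI, the query string, and the result cap are illustrative assumptions rather than code taken from the chapter:

# Hypothetical usage: instantiate the harvester (class name assumed) and
# launch a search; searchTwitter persists results to JSON and MongoDB as it runs.
twitter_api = TwitterAPI()
tweets = twitter_api.searchTwitter('ApacheSpark', max_res=100)
print(len(tweets))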


Exploring data using Blaze

Blaze is an open source Python library, primarily developed by Continuum.io, leveraging Python NumPy arrays and Pandas dataframes. Blaze extends to out-of-core computing, while Pandas and NumPy are single-core.

Blaze offers an adaptable, unified, and consistent user interface across various backends. Blaze orchestrates the following:

Data: Seamless exchange of data across storages such as CSV, JSON, HDF5, HDFS, and Bcolz files.
Computation: Using the same query processing against computational backends such as Spark, MongoDB, Pandas, or SQLAlchemy.
Symbolic expressions: Abstract expressions such as join, group-by, filter, selection, and projection with a syntax similar to Pandas but limited in scope. Implements the split-apply-combine methods pioneered by the R language.

Blaze expressions are lazily evaluated and in that respect share a similar processing paradigm with Spark RDD transformations.

Let's dive into Blaze by first importing the necessary libraries: numpy, pandas, blaze, and odo. Odo is a spin-off of Blaze and ensures data migration between various backends. The commands are as follows:

import numpy as np
import pandas as pd
from blaze import Data, by, join, merge
from odo import odo

BokehJS successfully loaded.

We create a Pandas dataframe by reading the parsed tweets saved in a CSV file, twts_csv:

twts_pd_df = pd.DataFrame(twts_csv_read, columns=Tweet01._fields)
twts_pd_df.head()

Out[65]:
   id                  created_at           user_id   user_name      tweet_text                                        url
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L…  http://www.mango-solutions.com/wp/2015/05/the-...
2  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L…  http://www.mango-solutions.com/wp/2015/05/the-...
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c…  http://www.webex.com/ciscospark/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c…  http://sparkjava.com/

We run the describe() function on the tweets Pandas dataframe to get some overall information on the dataset:

twts_pd_df.describe()

Out[66]:
        id                   created_at           user_id   user_name      tweet_text                                        url
count   19                   19                   19        19             19                                                19
unique  7                    7                    6         6              6                                                 7
top     598808944719593472   2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c…  http://bit.ly/1Hfd0Xm
freq    6                    6                    9         9              6                                                 6

We convert the Pandas dataframe into a Blaze dataframe by simply passing it through the Data() function:

#
# Blaze dataframe
#
twts_bz_df = Data(twts_pd_df)

We can retrieve the schema representation of the Blaze dataframe by calling the schema function:

twts_bz_df.schema

Out[73]:
dshape("""{
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

The .dshape function gives a record count and the schema:

twts_bz_df.dshape

Out[74]:
dshape("""19 * {
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

We can print the Blaze dataframe content:

twts_bz_df.data

Out[75]:
    id                  created_at           user_id     user_name            tweet_text                                        url
1   598831111406510082  2015-05-14 12:43:57  14755521    raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L…  http://www.mango-solutions.com/wp/2015/05/the-...
2   598831111406510082  2015-05-14 12:43:57  14755521    raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L…  http://www.mango-solutions.com/wp/2015/05/the-...
...
18  598782970082807808  2015-05-14 09:32:39  1377652806  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w…  http://buff.ly/1QBpk8J
19  598777933730160640  2015-05-14 09:12:38  294862170   Ellen Friedman       I'm still on Euro time. If you are too checko…   http://bit.ly/1Hfd0Xm

We extract the column tweet_text and take the unique values:

twts_bz_df.tweet_text.distinct()

Out[76]:
   tweet_text
0  RT @pacoid: Great recap of @StrataConf EU in L…
1  RT @alvaroagea: Simply @ApacheSpark http://t.c…
2  RT @PrabhaGana: What exactly is @ApacheSpark a…
3  RT @Ellen_Friedman: I'm still on Euro time. If…
4  RT @BigDataTechCon: Moving Rating Prediction w…
5  I'm still on Euro time. If you are too checko…

We extract multiple columns ['id', 'user_name', 'tweet_text'] from the dataframe and take the unique records:

twts_bz_df[['id', 'user_name', 'tweet_text']].distinct()

Out[78]:
   id                  user_name            tweet_text
0  598831111406510082  raulsaeztapia        RT @pacoid: Great recap of @StrataConf EU in L…
1  598808944719593472  raulsaeztapia        RT @alvaroagea: Simply @ApacheSpark http://t.c…
2  598796205091500032  John Humphreys       RT @PrabhaGana: What exactly is @ApacheSpark a…
3  598788561127735296  Leonardo D'Ambrosi   RT @Ellen_Friedman: I'm still on Euro time. If…
4  598785545557438464  Alexey Kosenkov      RT @Ellen_Friedman: I'm still on Euro time. If…
5  598782970082807808  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w…
6  598777933730160640  Ellen Friedman       I'm still on Euro time. If you are too checko…
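The Blaze introduction above also mentioned group-by among its symbolic expressions. A minimal sketch of the split-apply-combine pattern on the same Blaze dataframe might look like the following; the aggregation itself is illustrative and not taken from the original notebook:

from blaze import by

# Group the Blaze dataframe by user_name and count the tweets per user;
# like all Blaze expressions, this is lazily evaluated and only computed
# when displayed or materialized (for example with odo).
tweets_per_user = by(twts_bz_df.user_name,
                     tweet_count=twts_bz_df.user_name.count())
tweets_per_user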


Transferring data using Odo

Odo is a spin-off project of Blaze. Odo allows the interchange of data. Odo ensures the migration of data across different formats (CSV, JSON, HDFS, and more) and across different databases (SQL databases, MongoDB, and so on) using a very simple predicate:

odo(source, target)

To transfer to a database, the address is specified using a URL. For example, for a MongoDB database, it would look like this:

mongodb://username:password@hostname:port/database_name::collection_name

Let's run some examples of using Odo. Here, we illustrate odo by reading a CSV file and creating a Blaze dataframe:

filepath = csvFpath
filename = csvFname
filesuffix = csvSuffix
twts_odo_df = Data('{0}/{1}.{2}'.format(filepath, filename, filesuffix))

Count the number of records in the dataframe:

twts_odo_df.count()

Out[81]:
19

Display the five initial records of the dataframe:

twts_odo_df.head(5)

Out[82]:
   id                  created_at           user_id   user_name      tweet_text                                        url
0  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L…  http://www.mango-solutions.com/wp/2015/05/the-...
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L…  http://www.mango-solutions.com/wp/2015/05/the-...
2  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c…  http://www.webex.com/ciscospark/
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c…  http://sparkjava.com/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c…  https://www.sparkfun.com/

Get dshape information from the dataframe, which gives us the number of records and the schema:

twts_odo_df.dshape

Out[83]:
dshape("""var * {
  id: int64,
  created_at: ?datetime,
  user_id: int64,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

Save a processed Blaze dataframe into JSON:

odo(twts_odo_distinct_df, '{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix))

Out[92]:
<odo.backends.json.JSONLines at 0x7f77f0abfc50>

Convert a JSON file to a CSV file:

odo('{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix), '{0}/{1}.{2}'.format(csvFpath, csvFname, csvSuffix))

Out[94]:
<odo.backends.csv.CSV at 0x7f77f0abfe10>
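Using the MongoDB URL scheme shown earlier, the same dataframe could be pushed straight into a collection. The following is a minimal sketch; the database and collection names are illustrative only:

# Transfer the distinct tweets into a MongoDB collection; the URL follows the
# mongodb://hostname:port/database_name::collection_name convention described above.
odo(twts_odo_distinct_df, 'mongodb://localhost:27017/twtr01_db::twtr01_odo_coll')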


Exploring data using Spark SQL

Spark SQL is a relational query engine built on top of Spark Core. Spark SQL uses a query optimizer called Catalyst.

Relational queries can be expressed using SQL or HiveQL and executed against JSON, CSV, and various databases. Spark SQL gives us the full expressiveness of declarative programming with Spark dataframes on top of functional programming with RDDs.

Understanding Spark dataframes

Here's a tweet from [email protected], announcing the advent of Spark SQL and dataframes. It also highlights the various data sources in the lower part of the diagram. On the top part, we can notice R as the new language that will be gradually supported on top of Scala, Java, and Python. Ultimately, the dataframe philosophy is pervasive between R, Python, and Spark.

Spark dataframes originate from SchemaRDDs. A dataframe combines an RDD with a schema that can be inferred by Spark, if requested, when registering the dataframe. It allows us to query complex nested JSON data with plain SQL. Lazy evaluation, lineage, partitioning, and persistence apply to dataframes.

Let's query the data with Spark SQL, by first importing SparkContext and SQLContext:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

In [95]:
sc
Out[95]:
<pyspark.context.SparkContext at 0x7f7829581890>

In [96]:
sc.master
Out[96]:
u'local[*]'

In [98]:
# Instantiate Spark SQL context
sqlc = SQLContext(sc)

We read in the JSON file we saved with Odo:

twts_sql_df_01 = sqlc.jsonFile("/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401_distinct.json")

In [101]:
twts_sql_df_01.show()

created_at            id                  tweet_text          user_id     user_name
2015-05-14T12:43:57Z  598831111406510082  RT @pacoid: Great…  14755521    raulsaeztapia
2015-05-14T11:15:52Z  598808944719593472  RT @alvaroagea: S…  14755521    raulsaeztapia
2015-05-14T10:25:15Z  598796205091500032  RT @PrabhaGana: W…  48695135    John Humphreys
2015-05-14T09:54:52Z  598788561127735296  RT @Ellen_Friedma…  2385931712  Leonardo D'Ambrosi
2015-05-14T09:42:53Z  598785545557438464  RT @Ellen_Friedma…  461020977   Alexey Kosenkov
2015-05-14T09:32:39Z  598782970082807808  RT @BigDataTechCo…  1377652806  embeddedcomputer.nl
2015-05-14T09:12:38Z  598777933730160640  I'm still on Euro…  294862170   Ellen Friedman

We print the schema of the Spark dataframe:

twts_sql_df_01.printSchema()

root
 |-- created_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- tweet_text: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)

We select the user_name column from the dataframe:

twts_sql_df_01.select('user_name').show()

user_name
raulsaeztapia
raulsaeztapia
John Humphreys
Leonardo D'Ambrosi
Alexey Kosenkov
embeddedcomputer.nl
Ellen Friedman
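The dataframe API also supports aggregations directly. As a small sketch (not part of the original notebook), counting the tweets contributed by each user can be done without writing any SQL:

# Count how many tweets each user contributed, using the dataframe API;
# groupBy and count are standard Spark dataframe operations.
twts_sql_df_01.groupBy('user_name').count().show()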


We register the dataframe as a table, so we can execute a SQL query on it:

twts_sql_df_01.registerAsTable('tweets_01')

We execute a SQL statement against the dataframe:

twts_sql_df_01_selection = sqlc.sql("SELECT * FROM tweets_01 WHERE user_name = 'raulsaeztapia'")

In [109]:
twts_sql_df_01_selection.show()

created_at            id                  tweet_text          user_id   user_name
2015-05-14T12:43:57Z  598831111406510082  RT @pacoid: Great…  14755521  raulsaeztapia
2015-05-14T11:15:52Z  598808944719593472  RT @alvaroagea: S…  14755521  raulsaeztapia
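The same selection can also be expressed without SQL through the dataframe API; a minimal sketch of the equivalent filter:

# Equivalent selection using column expressions instead of a SQL string.
twts_sql_df_01.filter(twts_sql_df_01.user_name == 'raulsaeztapia').show()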

Let’sprocesssomemorecomplexJSON;wereadtheoriginalTwitterJSONfile:

tweets_sqlc_inf=sqlc.jsonFile(infile)

SparkSQLisabletoinfertheschemaofacomplexnestedJSONfile:

tweets_sqlc_inf.printSchema()

root

|--contributors:string(nullable=true)

|--coordinates:string(nullable=true)

|--created_at:string(nullable=true)

|--entities:struct(nullable=true)

||--hashtags:array(nullable=true)

|||--element:struct(containsNull=true)

||||--indices:array(nullable=true)

|||||--element:long(containsNull=true)

||||--text:string(nullable=true)

||--media:array(nullable=true)

|||--element:struct(containsNull=true)

||||--display_url:string(nullable=true)

||||--expanded_url:string(nullable=true)

||||--id:long(nullable=true)

||||--id_str:string(nullable=true)

||||--indices:array(nullable=true)

...(snip)...

||--statuses_count:long(nullable=true)

||--time_zone:string(nullable=true)

||--url:string(nullable=true)

||--utc_offset:long(nullable=true)

||--verified:boolean(nullable=true)

We extract the key information of interest from the wall of data by selecting specific columns in the dataframe (in this case, ['created_at', 'id', 'text', 'user.id', 'user.name', 'entities.urls.expanded_url']):

tweets_extract_sqlc = tweets_sqlc_inf[['created_at', 'id', 'text', 'user.id', 'user.name', 'entities.urls.expanded_url']].distinct()

In [145]:
tweets_extract_sqlc.show()

created_at            id                  text                id          name                 expanded_url
Thu May 14 09:32:...  598782970082807808  RT @BigDataTechCo…  1377652806  embeddedcomputer.nl  ArrayBuffer(http:...
Thu May 14 12:43:...  598831111406510082  RT @pacoid: Great…  14755521    raulsaeztapia        ArrayBuffer(http:...
Thu May 14 12:18:...  598824733086523393  @rabbitonweb spea…
...
Thu May 14 12:28:...  598827171168264192  RT @baandrzejczak…  20909005    Paweł Szulc          ArrayBuffer()


Understanding the Spark SQL query optimizer

We execute a SQL statement against the dataframe:

tweets_extract_sqlc_sel = sqlc.sql("SELECT * from Tweets_xtr_001 WHERE name='raulsaeztapia'")

We get a detailed view of the query plans executed by Spark SQL:

Parsed logical plan
Analyzed logical plan
Optimized logical plan
Physical plan

The query plan uses Spark SQL's Catalyst optimizer. In order to generate the compiled bytecode from the query parts, the Catalyst optimizer runs through logical plan parsing and optimization followed by physical plan evaluation and optimization based on cost.

This is illustrated in the following tweet:

Looking back at our code, we call the .explain function on the Spark SQL query we just executed, and it delivers the full details of the steps taken by the Catalyst optimizer in order to assess and optimize the logical plan and the physical plan and get to the result RDD:

tweets_extract_sqlc_sel.explain(extended=True)

== Parsed Logical Plan ==
'Project [*]
 'Filter ('name = raulsaeztapia)
  'UnresolvedRelation [Tweets_xtr_001], None
== Analyzed Logical Plan ==
Project [created_at#7,id#12L,text#27,id#80L,name#81,expanded_url#82]
 Filter (name#81 = raulsaeztapia)
  Distinct
   Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
    Relation[contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29] JSONRelation(/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401.json,1.0,None)
== Optimized Logical Plan ==
Filter (name#81 = raulsaeztapia)
 Distinct
  Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
   Relation[contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29] JSONRelation(/home/an/spark/spark-1.3.0-bin-hadoop2.4/examples/AN_Spark/data/twtr15051401.json,1.0,None)
== Physical Plan ==
Filter (name#81 = raulsaeztapia)
 Distinct false
  Exchange (HashPartitioning [created_at#7,id#12L,text#27,id#80L,name#81,expanded_url#82], 200)
   Distinct true
    Project [created_at#7,id#12L,text#27,user#29.id AS id#80L,user#29.name AS name#81,entities#8.urls.expanded_url AS expanded_url#82]
     PhysicalRDD [contributors#5,coordinates#6,created_at#7,entities#8,favorite_count#9L,favorited#10,geo#11,id#12L,id_str#13,in_reply_to_screen_name#14,in_reply_to_status_id#15,in_reply_to_status_id_str#16,in_reply_to_user_id#17L,in_reply_to_user_id_str#18,lang#19,metadata#20,place#21,possibly_sensitive#22,retweet_count#23L,retweeted#24,retweeted_status#25,source#26,text#27,truncated#28,user#29], MapPartitionsRDD[165] at map at JsonRDD.scala:41
Code Generation: false
== RDD ==

Finally,here’stheresultofthequery:

tweets_extract_sqlc_sel.show()

created_atidtextidname

expanded_url

ThuMay1412:43:...598831111406510082RT@pacoid:Great…14755521

raulsaeztapiaArrayBuffer(http:...

ThuMay1411:15:...598808944719593472RT@alvaroagea:S…14755521

raulsaeztapiaArrayBuffer(http:...

In[148]:


Loading and processing CSV files with Spark SQL

We will use the Spark package spark-csv_2.11:1.2.0. The command to be used to launch PySpark with the IPython Notebook and the spark-csv package should explicitly state the --packages argument:

$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0

This will trigger the following output; we can see that the spark-csv package is installed with all its dependencies:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
...(snip)...
Ivy Default Cache set to: /home/an/.ivy2/cache
The jars for the packages stored in: /home/an/.ivy2/jars
:: loading settings :: url = jar:file:/home/an/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.2.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 835ms :: artifacts dl 48ms
    :: modules in use:
    com.databricks#spark-csv_2.11;1.2.0 from central in [default]
    com.univocity#univocity-parsers;1.5.1 from central in [default]
    org.apache.commons#commons-csv;1.1 from central in [default]
    ----------------------------------------------------------------
    |         |          modules          ||       artifacts       |
    |  conf   | number| search|dwnlded|evicted|| number|dwnlded|
    ----------------------------------------------------------------
    | default |   3   |   0   |   0   |   0   ||   3   |   0   |
    ----------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 3 already retrieved (0kB/45ms)

We are now ready to load our CSV file and process it. Let's first instantiate the SQLContext and read the CSV into a Spark dataframe:

#
# Read csv in a Spark DF
#
sqlContext = SQLContext(sc)
spdf_in = sqlContext.read.format('com.databricks.spark.csv')\
                    .options(delimiter=';')\
                    .options(header='true').load(csv_in)

We access the schema of the dataframe created from the loaded CSV:

In [10]:
spdf_in.printSchema()

root
 |-- : string (nullable = true)
 |-- id: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_name: string (nullable = true)
 |-- tweet_text: string (nullable = true)

We check the columns of the dataframe:

In [12]:
spdf_in.columns
Out[12]:
['', 'id', 'created_at', 'user_id', 'user_name', 'tweet_text']

We introspect the dataframe content:

In [13]:
spdf_in.show()

+---+------------------+--------------------+----------+------------------+--------------------+
|   |                id|          created_at|   user_id|         user_name|          tweet_text|
+---+------------------+--------------------+----------+------------------+--------------------+
|  0|638830426971181057|Tue Sep 01 21:46:...|3276255125|     True Equality|ernestsgantt: Bey...|
|  1|638830426727911424|Tue Sep 01 21:46:...|3276255125|     True Equality|ernestsgantt: Bey...|
|  2|638830425402556417|Tue Sep 01 21:46:...|3276255125|     True Equality|ernestsgantt: Bey...|
...(snip)...
| 41|638830280988426250|Tue Sep 01 21:46:...| 951081582|      Jack Baldwin|RT @cloudaus: We ...|
| 42|638830276626399232|Tue Sep 01 21:46:...|   6525302|Masayoshi Nakamura|PynamoDB 使いやすいです  |
+---+------------------+--------------------+----------+------------------+--------------------+
only showing top 20 rows
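Writing a processed dataframe back out is symmetrical to reading it. A minimal sketch using the same spark-csv package; csv_out stands in for whatever output path you choose and is not part of the chapter's code:

# Persist the dataframe back to disk as a semicolon-delimited CSV with a header row.
spdf_in.write.format('com.databricks.spark.csv') \
       .options(delimiter=';', header='true') \
       .save(csv_out)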


Querying MongoDB from Spark SQL

There are two major ways to interact with MongoDB from Spark: the first is through the Hadoop MongoDB connector, and the second one is directly from Spark to MongoDB.

The first approach to interact with MongoDB from Spark is to set up a Hadoop environment and query through the Hadoop MongoDB connector. The connector details are hosted on GitHub at https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage. An actual use case is described in the series of blog posts from MongoDB:

Using MongoDB with Hadoop & Spark: Part 1 - Introduction & Setup (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup)
Using MongoDB with Hadoop and Spark: Part 2 - Hive Example (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example)
Using MongoDB with Hadoop & Spark: Part 3 - Spark Example & Key Takeaways (https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways)

Setting up a full Hadoop environment is a bit elaborate. We will favor the second approach. We will use the spark-mongodb connector developed and maintained by Stratio. We are using the Stratio spark-mongodb package hosted at spark-packages.org. The package information and version can be found on spark-packages.org:

Note
Releases
Version: 0.10.1 ( 8263c8 | zip | jar ) / Date: 2015-11-18 / License: Apache-2.0 / Scala version: 2.10
(http://spark-packages.org/package/Stratio/spark-mongodb)

The command to launch PySpark with the IPython Notebook and the spark-mongodb package should explicitly state the --packages argument:

$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.1

This will trigger the following output; we can see that the spark-mongodb package is installed with all its dependencies:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.1
...(snip)...
Ivy Default Cache set to: /home/an/.ivy2/cache
The jars for the packages stored in: /home/an/.ivy2/jars
:: loading settings :: url = jar:file:/home/an/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.stratio.datasource#spark-mongodb_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.stratio.datasource#spark-mongodb_2.10;0.10.1 in central
[W 22:10:50.910 NotebookApp] Timeout waiting for kernel_info reply from 764081d3-baf9-4978-ad89-7735e6323cb6
    found org.mongodb#casbah-commons_2.10;2.8.0 in central
    found com.github.nscala-time#nscala-time_2.10;1.0.0 in central
    found joda-time#joda-time;2.3 in central
    found org.joda#joda-convert;1.2 in central
    found org.slf4j#slf4j-api;1.6.0 in central
    found org.mongodb#mongo-java-driver;2.13.0 in central
    found org.mongodb#casbah-query_2.10;2.8.0 in central
    found org.mongodb#casbah-core_2.10;2.8.0 in central
downloading https://repo1.maven.org/maven2/com/stratio/datasource/spark-mongodb_2.10/0.10.1/spark-mongodb_2.10-0.10.1.jar ...
    [SUCCESSFUL ] com.stratio.datasource#spark-mongodb_2.10;0.10.1!spark-mongodb_2.10.jar (3130ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-commons_2.10/2.8.0/casbah-commons_2.10-2.8.0.jar ...
    [SUCCESSFUL ] org.mongodb#casbah-commons_2.10;2.8.0!casbah-commons_2.10.jar (2812ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-query_2.10/2.8.0/casbah-query_2.10-2.8.0.jar ...
    [SUCCESSFUL ] org.mongodb#casbah-query_2.10;2.8.0!casbah-query_2.10.jar (1432ms)
downloading https://repo1.maven.org/maven2/org/mongodb/casbah-core_2.10/2.8.0/casbah-core_2.10-2.8.0.jar ...
    [SUCCESSFUL ] org.mongodb#casbah-core_2.10;2.8.0!casbah-core_2.10.jar (2785ms)
downloading https://repo1.maven.org/maven2/com/github/nscala-time/nscala-time_2.10/1.0.0/nscala-time_2.10-1.0.0.jar ...
    [SUCCESSFUL ] com.github.nscala-time#nscala-time_2.10;1.0.0!nscala-time_2.10.jar (2725ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.0/slf4j-api-1.6.0.jar ...
    [SUCCESSFUL ] org.slf4j#slf4j-api;1.6.0!slf4j-api.jar (371ms)
downloading https://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar ...
    [SUCCESSFUL ] org.mongodb#mongo-java-driver;2.13.0!mongo-java-driver.jar (5259ms)
downloading https://repo1.maven.org/maven2/joda-time/joda-time/2.3/joda-time-2.3.jar ...
    [SUCCESSFUL ] joda-time#joda-time;2.3!joda-time.jar (6949ms)
downloading https://repo1.maven.org/maven2/org/joda/joda-convert/1.2/joda-convert-1.2.jar ...
    [SUCCESSFUL ] org.joda#joda-convert;1.2!joda-convert.jar (548ms)
:: resolution report :: resolve 11850ms :: artifacts dl 26075ms
    :: modules in use:
    com.github.nscala-time#nscala-time_2.10;1.0.0 from central in [default]
    com.stratio.datasource#spark-mongodb_2.10;0.10.1 from central in [default]
    joda-time#joda-time;2.3 from central in [default]
    org.joda#joda-convert;1.2 from central in [default]
    org.mongodb#casbah-commons_2.10;2.8.0 from central in [default]
    org.mongodb#casbah-core_2.10;2.8.0 from central in [default]
    org.mongodb#casbah-query_2.10;2.8.0 from central in [default]
    org.mongodb#mongo-java-driver;2.13.0 from central in [default]
    org.slf4j#slf4j-api;1.6.0 from central in [default]
    ---------------------------------------------------------------------
    |         |          modules          ||       artifacts       |
    |  conf   | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    | default |   9   |   9   |   9   |   0   ||   9   |   9   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    9 artifacts copied, 0 already retrieved (2335kB/51ms)
...(snip)...

We are now ready to query MongoDB on localhost:27017 from the collection twtr01_coll in the database twtr01_db.

We first import the SQLContext:

In [5]:
from pyspark.sql import SQLContext
sqlContext.sql("CREATE TEMPORARY TABLE tweet_table USING com.stratio.datasource.mongodb OPTIONS (host 'localhost:27017', database 'twtr01_db', collection 'twtr01_coll')")
sqlContext.sql("SELECT * FROM tweet_table where id=598830778269769728 ").collect()

Here’stheoutputofourquery:

Out[5]:

[Row(text=u'@spark_ioisnow@particle-awesomenews-nowIcanenjoymy

ParticleCores/Photons+@sparkfunsensors+@ApacheSparkanalytics:-)',

_id=u'55aa640fd770871cba74cb88',contributors=None,retweeted=False,

user=Row(contributors_enabled=False,created_at=u'MonAug2514:01:26+0000

2008',default_profile=True,default_profile_image=False,

description=u'Buildingopensourcetoolsforandteachingenterprise

softwaredevelopers',entities=Row(description=Row(urls=[]),url=Row(urls=

[Row(url=u'http://t.co/TSHp13EWeu',indices=[0,22],

...(snip)...

9],name=u'SparkisParticle',screen_name=u'spark_io'),Row(id=487010011,

id_str=u'487010011',indices=[17,26],name=u'Particle',

screen_name=u'particle'),Row(id=17877351,id_str=u'17877351',indices=[88,

97],name=u'SparkFunElectronics',screen_name=u'sparkfun'),

Row(id=1551361069,id_str=u'1551361069',indices=[108,120],name=u'Apache

Spark',screen_name=u'ApacheSpark')]),is_quote_status=None,lang=u'en',

quoted_status_id_str=None,quoted_status_id=None,created_at=u'ThuMay14

12:42:37+00002015',retweeted_status=None,truncated=False,place=None,

id=598830778269769728,in_reply_to_user_id=3187046084,retweet_count=0,

in_reply_to_status_id=None,in_reply_to_screen_name=u'spark_io',

in_reply_to_user_id_str=u'3187046084',source=u'<a

href="http://twitter.com"rel="nofollow">TwitterWebClient</a>',

id_str=u'598830778269769728',coordinates=None,

metadata=Row(iso_language_code=u'en',result_type=u'recent'),

quoted_status=None)]


Summary

In this chapter, we harvested data from Twitter. Once the data was acquired, we explored the information using Continuum.io's Blaze and Odo libraries. Spark SQL is an important module for interactive data exploration, analysis, and transformation, leveraging the Spark dataframe data structure. The dataframe concept originates from R and was then adopted by Python Pandas with great success. The dataframe is the workhorse of the data scientist. The combination of Spark SQL and dataframes creates a powerful engine for data processing.

We are now gearing up for extracting the insights from the datasets using machine learning from Spark MLlib.


Chapter 4. Learning from Data Using Spark

As we have laid the foundation for data to be harvested in the previous chapter, we are now ready to learn from the data. Machine learning is about drawing insights from data. Our objective is to give an overview of Spark MLlib (short for Machine Learning library) and apply the appropriate algorithms to our dataset in order to derive insights. From the Twitter dataset, we will be applying an unsupervised clustering algorithm in order to distinguish between Apache Spark-relevant tweets versus the rest. We have as initial input a mixed bag of tweets. We first need to preprocess the data in order to extract the relevant features, then apply the machine learning algorithm to our dataset, and finally evaluate the results and the performance of our model.

In this chapter, we will cover the following points:

Providing an overview of the Spark MLlib module with its algorithms and the typical machine learning workflow.
Preprocessing the Twitter harvested dataset to extract the relevant features, applying an unsupervised clustering algorithm to identify Apache Spark-relevant tweets. Then, evaluating the model and the results obtained.
Describing the Spark machine learning pipeline.


Contextualizing Spark MLlib in the app architecture

Let's first contextualize the focus of this chapter on the data-intensive app architecture. We will concentrate our attention on the analytics layer and more precisely machine learning. This will serve as a foundation for streaming apps, as we want to apply the learning from the batch processing of data as inference rules for the streaming analysis.

The following diagram sets the context of the chapter's focus, highlighting the machine learning module within the analytics layer, while using tools for exploratory data analysis, Spark SQL, and Pandas.


Classifying Spark MLlib algorithms

Spark MLlib is a rapidly evolving module of Spark, with new algorithms added with each release of Spark.

The following diagram provides a high-level overview of Spark MLlib algorithms grouped in the traditional broad machine learning techniques and following the categorical or continuous nature of the data:

We categorize the Spark MLlib algorithms in two columns, categorical or continuous, depending on the type of data. We distinguish between data that is categorical or more qualitative in nature versus continuous data, which is quantitative in nature. An example of qualitative data is predicting the weather; given the atmospheric pressure, the temperature, and the presence and type of clouds, the weather will be sunny, dry, rainy, or overcast. These are discrete values. On the other hand, let's say we want to predict house prices, given the location, square meterage, and the number of beds; the real estate value can be predicted using linear regression. In this case, we are talking about continuous or quantitative values.

The horizontal grouping reflects the types of machine learning method used. Unsupervised versus supervised machine learning techniques are dependent on whether the training data is labeled. In an unsupervised learning challenge, no labels are given to the learning algorithm. The goal is to find the hidden structure in its input. In the case of supervised learning, the data is labeled. The focus is on making predictions using regression if the data is continuous or classification if the data is categorical.

An important category of machine learning is recommender systems, which leverage collaborative filtering techniques. The Amazon web store and Netflix have very powerful recommender systems powering their recommendations.

Stochastic Gradient Descent is one of the machine learning optimization techniques that is well suited for Spark distributed computation.

For processing large amounts of text, Spark offers crucial libraries for feature extraction and transformation such as TF-IDF (short for Term Frequency-Inverse Document Frequency), Word2Vec, standard scaler, and normalizer.


Supervised and unsupervised learning

We delve more deeply here into the traditional machine learning algorithms offered by Spark MLlib. We distinguish between supervised and unsupervised learning depending on whether the data is labeled. We distinguish between categorical or continuous depending on whether the data is discrete or continuous.

The following diagram explains the Spark MLlib supervised and unsupervised machine learning algorithms and preprocessing techniques:

The following supervised and unsupervised MLlib algorithms and preprocessing techniques are currently available in Spark:

Clustering: This is an unsupervised machine learning technique where the data is not labeled. The aim is to extract structure from the data (a minimal K-Means sketch follows this list):
K-Means: This partitions the data in K distinct clusters
Gaussian Mixture: Clusters are assigned based on the maximum posterior probability of the component
Power Iteration Clustering (PIC): This groups vertices of a graph based on pairwise edge similarities
Latent Dirichlet Allocation (LDA): This is used to group collections of text documents into topics
Streaming K-Means: This clusters streaming data dynamically using a windowing function on the incoming data

Dimensionality Reduction: This aims to reduce the number of features under consideration. Essentially, this reduces noise in the data and focuses on the key features:
Singular Value Decomposition (SVD): This breaks the matrix that contains the data into simpler meaningful pieces. It factorizes the initial matrix into three matrices.
Principal Component Analysis (PCA): This approximates a high dimensional dataset with a low dimensional subspace.

Regression and Classification: Regression predicts output values using labeled training data, while Classification groups the results into classes. Classification has dependent variables that are categorical or unordered, whilst Regression has dependent variables that are continuous and ordered:
Linear Regression Models (linear regression, logistic regression, and support vector machines): Linear regression algorithms can be expressed as convex optimization problems that aim to minimize an objective function based on a vector of weight variables. The objective function controls the complexity of the model through the regularized part of the function and the error of the model through the loss part of the function.
Naive Bayes: This makes predictions based on the conditional probability distribution of a label given an observation. It assumes that features are mutually independent of each other.
Decision Trees: This performs recursive binary partitioning of the feature space. The information gain at the tree node level is maximized in order to determine the best split for the partition.
Ensembles of trees (Random Forests and Gradient-Boosted Trees): Tree ensemble algorithms combine base decision tree models in order to build a performant model. They are intuitive and very successful for classification and regression tasks.
Isotonic Regression: This minimizes the mean squared error between given data and observed responses.
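To make the clustering entry above concrete, here is a minimal K-Means sketch using the RDD-based MLlib API; the toy points and parameter values are illustrative only and not part of the chapter's dataset:

from pyspark.mllib.clustering import KMeans
from numpy import array

# A toy two-dimensional dataset: two obvious groups of points.
data = sc.parallelize([
    array([0.0, 0.0]), array([1.0, 1.0]),
    array([9.0, 8.0]), array([8.0, 9.0])
])

# Train a K-Means model with k=2 clusters.
model = KMeans.train(data, 2, maxIterations=10)

# Assign a new point to its nearest cluster.
print(model.predict(array([0.5, 0.5])))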


Additional learning algorithms

Spark MLlib offers more algorithms than the supervised and unsupervised learning ones. We have broadly three more additional types of machine learning methods: recommender systems, optimization algorithms, and feature extraction.

The following additional MLlib algorithms are currently available in Spark:

Collaborative filtering: This is the basis for recommender systems. It creates a user-item association matrix and aims to fill the gaps. Based on other users and items along with their ratings, it recommends an item that the target user has no ratings for. In distributed computing, one of the most successful algorithms is ALS (short for Alternating Least Squares):
Alternating Least Squares: This matrix factorization technique incorporates implicit feedback, temporal effects, and confidence levels. It decomposes the large user-item matrix into lower dimensional user and item factors. It minimizes a quadratic loss function by alternately fixing its factors.

Feature extraction and transformation: These are essential techniques for large text document processing (a short TF-IDF sketch follows this list). They include the following techniques:
Term Frequency: Search engines use TF-IDF to score and rank document relevance in a vast corpus. It is also used in machine learning to determine the importance of a word in a document or corpus. Term frequency statistically determines the weight of a term relative to its frequency in the corpus. Term frequency on its own can be misleading, as it overemphasizes words such as the, of, or and that give little information. Inverse Document Frequency provides the specificity or the measure of the amount of information, whether the term is rare or common across all documents in the corpus.
Word2Vec: This includes two models, Skip-Gram and Continuous Bag of Words. The Skip-Gram predicts neighboring words given a word, based on sliding windows of words, while Continuous Bag of Words predicts the current word given the neighboring words.
StandardScaler: As part of preprocessing, the dataset must often be standardized by mean removal and variance scaling. We compute the mean and standard deviation on the training data and apply the same transformation to the test data.
Normalizer: We scale the samples to have unit norm. It is useful for quadratic forms such as the dot product or kernel methods.
Feature selection: This reduces the dimensionality of the vector space by selecting the most relevant features for the model.
Chi-Square Selector: This is a statistical method to measure the independence of two events.

Optimization: These specific Spark MLlib optimization algorithms focus on various techniques of gradient descent. Spark provides a very efficient implementation of gradient descent on a distributed cluster of machines. It looks for the local minima by iteratively going down the steepest descent. It is compute-intensive as it iterates through all the data available:
Stochastic Gradient Descent: We minimize an objective function that is the sum of differentiable functions. Stochastic Gradient Descent uses only a sample of the training data in order to update a parameter in a particular iteration. It is used for large-scale and sparse machine learning problems such as text classification.
Limited-memory BFGS (L-BFGS): As the name says, L-BFGS uses limited memory and suits the distributed optimization algorithm implementation of Spark MLlib.
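To make the feature extraction entry above concrete, here is a minimal TF-IDF sketch with the RDD-based MLlib API; the toy corpus is illustrative only:

from pyspark.mllib.feature import HashingTF, IDF

# Toy corpus: each document is a list of tokens.
documents = sc.parallelize([
    ['spark', 'is', 'fast'],
    ['python', 'and', 'spark'],
    ['tweets', 'about', 'spark']
])

# Hash each document into a sparse term-frequency vector.
hashingTF = HashingTF()
tf = hashingTF.transform(documents)

# Fit the IDF model on the corpus and rescale the term frequencies.
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
print(tfidf.first())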


Spark MLlib data types

MLlib supports four essential data types: local vector, labeled point, local matrix, and distributed matrix. These data types are widely used in Spark MLlib algorithms:

Local vector: This resides in a single machine. It can be dense or sparse:
A dense vector is a traditional array of doubles. An example of a dense vector is [5.0, 0.0, 1.0, 7.0].
A sparse vector uses integer indices and double values. So the sparse representation of the vector [5.0, 0.0, 1.0, 7.0] would be (4, [0, 2, 3], [5.0, 1.0, 7.0]), where 4 represents the dimension of the vector.

Here's an example of a local vector in PySpark:

import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# NumPy array for dense vector.
dvect1 = np.array([5.0, 0.0, 1.0, 7.0])
# Python list for dense vector.
dvect2 = [5.0, 0.0, 1.0, 7.0]
# SparseVector creation
svect1 = Vectors.sparse(4, [0, 2, 3], [5.0, 1.0, 7.0])
# Sparse vector using a single-column SciPy csc_matrix (data, indices, indptr)
svect2 = sps.csc_matrix((np.array([5.0, 1.0, 7.0]), np.array([0, 2, 3]), np.array([0, 3])), shape=(4, 1))

Labeled point: A labeled point is a dense or sparse vector with a label, used in supervised learning. In the case of binary labels, 0.0 represents the negative label whilst 1.0 represents the positive value.

Here's an example of a labeled point in PySpark:

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Labeled point with a positive label and a dense feature vector.
lp_pos = LabeledPoint(1.0, [5.0, 0.0, 1.0, 7.0])
# Labeled point with a negative label and a sparse feature vector.
lp_neg = LabeledPoint(0.0, SparseVector(4, [0, 2, 3], [5.0, 1.0, 7.0]))

Local Matrix: This local matrix resides in a single machine with integer-type indices and values of type double.

Here's an example of a local matrix in PySpark:

from pyspark.mllib.linalg import Matrix, Matrices

# Dense matrix ((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
dMatrix = Matrices.dense(2, 3, [1, 2, 3, 4, 5, 6])
# Sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sMatrix = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])

Distributed Matrix: Leveraging the distributed nature of the RDD, distributed matrices can be shared in a cluster of machines. We distinguish four distributed matrix types: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix (a short RowMatrix sketch follows this list):
RowMatrix: This takes an RDD of vectors and creates a distributed matrix of rows with meaningless indices, called RowMatrix, from the RDD of vectors.
IndexedRowMatrix: In this case, row indices are meaningful. First, we create an RDD of indexed rows using the class IndexedRow and then create an IndexedRowMatrix.
CoordinateMatrix: This is useful to represent very large and very sparse matrices. A CoordinateMatrix is created from RDDs of MatrixEntry points, represented by a tuple of type (long, long, float).
BlockMatrix: These are created from RDDs of sub-matrix blocks, where a sub-matrix block is ((blockRowIndex, blockColIndex), sub-matrix).
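As an illustration of the distributed matrix types, here is a minimal RowMatrix sketch. Note that this is an assumption-laden example: the Python API for distributed matrices (pyspark.mllib.linalg.distributed) only appeared in Spark releases later than the one used in this book, so treat it as a sketch rather than code to run against Spark 1.3:

from pyspark.mllib.linalg.distributed import RowMatrix

# An RDD of local vectors, one per row of the distributed matrix.
rows = sc.parallelize([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0],
                       [7.0, 8.0, 9.0]])

# Build a RowMatrix (row indices are meaningless) and inspect its shape.
mat = RowMatrix(rows)
print(mat.numRows(), mat.numCols())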


Machine learning workflows and dataflows

Beyond algorithms, machine learning is also about processes. We will discuss the typical workflows and dataflows of supervised and unsupervised machine learning.


Supervised machine learning workflows

In supervised machine learning, the input training dataset is labeled. One of the key data practices is to split the input data into training and test sets, and validate the model accordingly.

We typically go through a six-step process flow in supervised learning:

Collect the data: This step essentially ties in with the previous chapter and ensures we collect the right data with the right volume and granularity in order to enable the machine learning algorithm to provide reliable answers.
Preprocess the data: This step is about checking the data quality by sampling, filling in the missing values if any, and scaling and normalizing the data. We also define the feature extraction process. Typically, in the case of large text-based datasets, we apply tokenization, stop words removal, stemming, and TF-IDF.

In the case of supervised learning, we separate the input data into a training and test set (see the randomSplit sketch after this list). We can also implement various strategies of sampling and splitting the dataset for cross-validation purposes.

Ready the data: In this step, we get the data in the format or data type expected by the algorithms. In the case of Spark MLlib, this includes local vector, dense or sparse vectors, labeled points, local matrix, and distributed matrix with row matrix, indexed row matrix, coordinate matrix, and block matrix.
Model: In this step, we apply the algorithms that are suitable for the problem at hand and get the results for evaluation of the most suitable algorithm in the evaluate step. We might have multiple algorithms suitable for the problem; their respective performance will be scored in the evaluate step to select the best performing ones. We can implement an ensemble or combination of models in order to reach the best results.
Optimize: We may need to run a grid search for the optimal parameters of certain algorithms. These parameters are determined during training, and fine-tuned during the testing and production phase.
Evaluate: We ultimately score the models and select the best one in terms of accuracy, performance, reliability, and scalability. We move the best performing model to test with the held-out test data in order to ascertain the prediction accuracy of our model. Once satisfied with the fine-tuned model, we move it to production to process live data.
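A minimal sketch of the train/test split step mentioned above, using the RDD API; the 80/20 ratio and the labeled_points_rdd name are illustrative assumptions:

# Split a labeled RDD into an 80% training set and a 20% held-out test set;
# the seed makes the split reproducible.
train_rdd, test_rdd = labeled_points_rdd.randomSplit([0.8, 0.2], seed=42)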

The supervised machine learning workflow and dataflow are represented in the following diagram:


Unsupervised machine learning workflows

As opposed to supervised learning, our initial data is not labeled in the case of unsupervised learning, which is most often the case in real life. We will extract the structure from the data by using clustering or dimensionality reduction algorithms. In the unsupervised learning case, we do not split the data into training and test sets, as we cannot make any prediction because the data is not labeled. We will train the data along six steps similar to those in supervised learning. Once the model is trained, we will evaluate the results and fine-tune the model and then release it for production.

Unsupervised learning can be a preliminary step to supervised learning. Namely, we look at reducing the dimensionality of the data prior to attacking the learning phase.

The unsupervised machine learning workflows and dataflow are represented as follows:


Clustering the Twitter dataset

Let's first get a feel for the data extracted from Twitter and get an understanding of the data structure in order to prepare and run it through the K-Means clustering algorithms. Our plan of attack uses the process and dataflow depicted earlier for unsupervised learning. The steps are as follows:

1. Combine all tweet files into a single dataframe.
2. Parse the tweets, remove stop words, extract emoticons, extract URLs, and finally normalize the words (for example, mapping them to lowercase and removing punctuation and numbers).
3. Feature extraction includes the following:
   Tokenization: This breaks down the parsed tweet text into individual words or tokens
   TF-IDF: This applies the TF-IDF algorithm to create feature vectors from the tokenized tweet texts
   Hash TF-IDF: This applies a hashing function to the token vectors
4. Run the K-Means clustering algorithm.
5. Evaluate the results of the K-Means clustering:
   Identify tweet membership to clusters
   Perform dimensionality reduction to two dimensions with the Multi-Dimensional Scaling or the Principal Component Analysis algorithm
   Plot the clusters
6. Pipeline:
   Fine-tune the number of relevant clusters K
   Measure the model cost
   Select the optimal model


Applying Scikit-Learn on the Twitter dataset

Python's own Scikit-Learn machine learning library is one of the most reliable, intuitive, and robust tools around. Let's run through a preprocessing and unsupervised learning using Pandas and Scikit-Learn. It is often beneficial to explore a sample of the data using Scikit-Learn before spinning off clusters with Spark MLlib.

We have a mixed bag of 7,540 tweets. It contains tweets related to Apache Spark, Python, the upcoming presidential election with Hillary Clinton and Donald Trump as protagonists, and some tweets related to fashion and music with Lady Gaga and Justin Bieber. We are running the K-Means clustering algorithm using Python Scikit-Learn on the Twitter dataset harvested. We first load the sample data into a Pandas dataframe:

import pandas as pd

csv_in = 'C:\\Users\\Amit\\Documents\\IPython Notebooks\\AN00_Data\\unq_tweetstxt.csv'
twts_df01 = pd.read_csv(csv_in, sep=';', encoding='utf-8')

In [24]:
twts_df01.count()
Out[24]:
Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64

#
# Introspecting the tweets text
#
In [82]:
twtstxt_ls01[6910:6920]
Out[82]:
['RT @deroach_Ismoke: I am NOT voting for #hilaryclinton http://t.co/jaZZpcHkkJ',
 'RT @AnimalRightsJen: #HilaryClinton What do Bernie Sanders and Donald Trump Have in Common?: He has so far been th… http://t.co/t2YRcGCh6…',
 'I understand why Bill was out banging other chicks….....I mean look at what he is married to…..\n@HilaryClinton',
 '#HilaryClinton What do Bernie Sanders and Donald Trump Have in Common?: He has so far been th… http://t.co/t2YRcGCh67 #Tcot #UniteBlue']

We first perform a feature extraction from the tweets' text. We apply a sparse vectorizer to the dataset using a TF-IDF vectorizer with 10,000 features and English stop words:

In [37]:
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()

Extracting features from the training dataset using a sparse vectorizer

In [38]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words='english',
                             use_idf=True)
X = vectorizer.fit_transform(twtstxt_ls01)
#
# Output of the TFIDF Feature vectorizer
#
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

done in 5.232165s
n_samples: 7540, n_features: 6638

As the dataset is now broken into 7,540 samples with vectors of 6,638 features, we are ready to feed this sparse matrix to the K-Means clustering algorithm. We will choose seven clusters and 100 maximum iterations initially:

In [47]:
km = KMeans(n_clusters=7, init='k-means++', max_iter=100, n_init=1,
            verbose=1)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))

Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=7, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=1)
Initialization complete
Iteration  0, inertia 13635.141
Iteration  1, inertia 6943.485
Iteration  2, inertia 6924.093
Iteration  3, inertia 6915.004
Iteration  4, inertia 6909.212
Iteration  5, inertia 6903.848
Iteration  6, inertia 6888.606
Iteration  7, inertia 6863.226
Iteration  8, inertia 6860.026
Iteration  9, inertia 6859.338
Iteration 10, inertia 6859.213
Iteration 11, inertia 6859.102
Iteration 12, inertia 6859.080
Iteration 13, inertia 6859.060
Iteration 14, inertia 6859.047
Iteration 15, inertia 6859.039
Iteration 16, inertia 6859.032
Iteration 17, inertia 6859.031
Iteration 18, inertia 6859.029
Converged at iteration 18
done in 1.701s

The K-Means clustering algorithm converged after 18 iterations. We see in the following results the seven clusters with their respective keywords. Clusters 0 and 6 are about music and fashion with Justin Bieber- and Lady Gaga-related tweets. Clusters 1 and 5 are related to the U.S.A. presidential elections with Donald Trump- and Hillary Clinton-related tweets. Clusters 2 and 3 are the ones of interest to us as they are about Apache Spark and Python. Cluster 4 contains Thailand-related tweets:

#
# Introspect top terms per cluster
#
In [49]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(7):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind], end='')
    print()

Top terms per cluster:
Cluster 0: justinbieber love mean rt follow thank hi https whatdoyoumean video wanna hear whatdoyoumeanviral rorykramer happy lol making person dream justin
Cluster 1: donaldtrump hilaryclinton rt https trump2016 realdonaldtrump trump gop amp justinbieber president clinton emails oy8ltkstze tcot like berniesanders hilary people email
Cluster 2: bigdata apachespark hadoop analytics rt spark training chennai ibm datascience apache processing cloudera mapreduce data sap https vora transforming development
Cluster 3: apachespark python https rt spark data amp databricks using new learn hadoop ibm big apache continuumio bluemix learning join open
Cluster 4: ernestsgantt simbata3 jdhm2015 elsahel12 phuketdailynews dreamintentions beyhiveinfrance almtorta18 civipartnership 9_a_6 25whu72ep0 k7erhvu7wn fdmxxxcm3h osxuh2fxnt 5o5rmb0xhp jnbgkqn0dj ovap57ujdh dtzsz3lb6x sunnysai12345 sdcvulih6g
Cluster 5: trump donald donaldtrump starbucks trumpquote trumpforpresident oy8ltkstze https zfns7pxysx silly goy stump trump2016 news jeremy coffee corbyn ok7vc8aetz rt tonight
Cluster 6: ladygaga gaga lady rt https love follow horror cd story ahshotel american japan hotel humantrafficking music fashion diet queen ahs

We will visualize the results by plotting the clusters. We have 7,540 samples with 6,638 features. It will be impossible to visualize that many dimensions. We will use the Multi-Dimensional Scaling (MDS) algorithm to bring down the multidimensional features of the clusters into two tractable dimensions to be able to picture them:

import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.manifold import MDS

MDS()

#
# Bring down the MDS to two dimensions (components) as we will plot
# the clusters
#
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

pos = mds.fit_transform(dist)  # shape (n_components, n_samples)

xs, ys = pos[:, 0], pos[:, 1]

In [67]:

#
# Set up colors per clusters using a dict
#
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a',
                  4: '#66a61e', 5: '#9990b3', 6: '#e8888a'}

#
# Set up cluster names using a dict
#
cluster_names = {0: 'Music, Pop',
                 1: 'USA Politics, Election',
                 2: 'Big Data, Spark',
                 3: 'Spark, Python',
                 4: 'Thailand',
                 5: 'USA Politics, Election',
                 6: 'Music, Pop'}

In [115]:

#
# ipython magic to show the matplotlib plots inline
#
%matplotlib inline

#
# Create data frame which includes MDS results, cluster numbers and tweet
# texts to be displayed
#
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, txt=twtstxt_ls02_utf8))

ix_start = 2000
ix_stop = 2050
df01 = df[ix_start:ix_stop]

print(df01[['label', 'txt']])
print(len(df01))
print()

# Group by cluster
groups = df.groupby('label')
groups01 = df01.groupby('label')

# Set up the plot
fig, ax = plt.subplots(figsize=(17, 10))


ax.margins(0.05)

#
# Build the plot object
#
for name, group in groups01:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_names[name], color=cluster_colors[name],
            mec='none')
    ax.set_aspect('auto')
    ax.tick_params(\
        axis='x',          # settings for x-axis
        which='both',      #
        bottom='off',      #
        top='off',         #
        labelbottom='off')
    ax.tick_params(\
        axis='y',          # settings for y-axis
        which='both',      #
        left='off',        #
        top='off',         #
        labelleft='off')

ax.legend(numpoints=1)  #

#
# Add label in x, y position with tweet text
#
for i in range(ix_start, ix_stop):
    ax.text(df01.ix[i]['x'], df01.ix[i]['y'], df01.ix[i]['txt'], size=10)

plt.show()  # Display the plot

      label                      txt
2000      2  b'RT @BigDataTechCon: '
2001      3  b"@4Quant's presentat"
2002      2  b'Cassandra Summit 201'

Here's a plot of Cluster 2, Big Data and Spark, represented by blue dots along with Cluster 3, Spark and Python, represented by red dots, and some sample tweets related to the respective clusters:


We have gained some good insights into the data with the exploration and processing done with Scikit-Learn. We will now focus our attention on Spark MLlib and take it for a ride on the Twitter dataset.


Preprocessing the dataset

Now, we will focus on feature extraction and engineering in order to ready the data for the clustering algorithm run. We instantiate the Spark Context and read the Twitter dataset into a Spark dataframe. We will then successively tokenize the tweet text data, apply a hashing Term Frequency algorithm to the tokens, and finally apply the Inverse Document Frequency algorithm and rescale the data. The code is as follows:

In [3]:
#
# Read csv in a Panda DF
#
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')

In [4]:
sqlContext = SQLContext(sc)

In [5]:
#
# Convert a Panda DF to a Spark DF
#
spdf_02 = sqlContext.createDataFrame(pddf_in[['id', 'user_id', 'user_name', 'tweet_text']])

In [8]:
spdf_02.show()

In [7]:
spdf_02.take(3)

Out[7]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'TrueEquality', tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026 http://t.co/VpD7FoqMr0'),
 Row(id=638830426727911424, user_id=3276255125, user_name=u'TrueEquality', tweet_text=u'ernestsgantt: BeyHiveInFrance: PhuketDailyNews: dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: CiviPa\u2026 http://t.co/VpD7FoqMr0'),
 Row(id=638830425402556417, user_id=3276255125, user_name=u'TrueEquality', tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsgantt: elsahel12: simbata3: JDHM2015: almtorta18: CiviPartnership: dr\u2026 http://t.co/EMDOn8chPK')]

In [9]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [10]:
#
# Tokenize the tweet_text
#
tokenizer = Tokenizer(inputCol="tweet_text", outputCol="tokens")
tokensData = tokenizer.transform(spdf_02)

In [11]:
tokensData.take(1)

Out[11]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'TrueEquality', tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026 http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:', u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:', u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'])]

In [14]:
#
# Apply Hashing TF to the tokens
#
hashingTF = HashingTF(inputCol="tokens", outputCol="rawFeatures", numFeatures=2000)
featuresData = hashingTF.transform(tokensData)

In [15]:
featuresData.take(1)

Out[15]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'TrueEquality', tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026 http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:', u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:', u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'], rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185: 1.0, 742: 1.0, 856: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}))]

In [16]:
#
# Apply IDF to the raw features and rescale the data
#


idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featuresData)
rescaledData = idfModel.transform(featuresData)
for features in rescaledData.select("features").take(3):
    print(features)

In [17]:
rescaledData.take(2)

Out[17]:
[Row(id=638830426971181057, user_id=3276255125, user_name=u'TrueEquality', tweet_text=u'ernestsgantt: BeyHiveInFrance: 9_A_6: dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: dreamintentions: \u2026 http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:', u'9_a_6:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:', u'almtorta18:', u'dreamintentions:\u2026', u'http://t.co/vpd7foqmr0'], rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185: 1.0, 742: 1.0, 856: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}), features=SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185: 2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694, 1620: 3.073})),
 Row(id=638830426727911424, user_id=3276255125, user_name=u'TrueEquality', tweet_text=u'ernestsgantt: BeyHiveInFrance: PhuketDailyNews: dreamintentions: elsahel12: simbata3: JDHM2015: almtorta18: CiviPa\u2026 http://t.co/VpD7FoqMr0', tokens=[u'ernestsgantt:', u'beyhiveinfrance:', u'phuketdailynews:', u'dreamintentions:', u'elsahel12:', u'simbata3:', u'jdhm2015:', u'almtorta18:', u'civipa\u2026', u'http://t.co/vpd7foqmr0'], rawFeatures=SparseVector(2000, {74: 1.0, 97: 1.0, 100: 1.0, 160: 1.0, 185: 1.0, 460: 1.0, 987: 1.0, 991: 1.0, 1383: 1.0, 1620: 1.0}), features=SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185: 2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694, 1620: 3.073}))]

In [21]:
rs_pddf = rescaledData.toPandas()

In [22]:
rs_pddf.count()

Out[22]:
id             7540
user_id        7540
user_name      7540
tweet_text     7540
tokens         7540
rawFeatures    7540
features       7540
dtype: int64

In [27]:
feat_lst = rs_pddf.features.tolist()

In [28]:
feat_lst[:2]

Out[28]:
[SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185: 2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185: 2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694, 1620: 3.073})]


Running the clustering algorithm

We will use the K-Means algorithm against the Twitter dataset. As an unlabeled and shuffled bag of tweets, we want to see if the Apache Spark tweets are grouped in a single cluster. From the previous steps, the TF-IDF sparse vector of features is converted into an RDD that will be the input to the Spark MLlib program. We initialize the K-Means model with 5 clusters, 10 iterations, and 10 runs:

In [32]:
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

In [34]:
# Load and parse the data
in_Data = sc.parallelize(feat_lst)

In [35]:
in_Data.take(3)

Out[35]:
[SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185: 2.7481, 742: 5.5269, 856: 4.1406, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {74: 2.6762, 97: 1.8625, 100: 2.6384, 160: 2.9985, 185: 2.7481, 460: 6.4432, 987: 2.9959, 991: 2.9518, 1383: 4.694, 1620: 3.073}),
 SparseVector(2000, {20: 4.3534, 74: 2.6762, 97: 1.8625, 100: 5.2768, 185: 2.7481, 856: 4.1406, 991: 2.9518, 1039: 3.073, 1620: 3.073, 1864: 4.6377})]

In [37]:
in_Data.count()

Out[37]:
7540

In [38]:
# Build the model (cluster the data)
clusters = KMeans.train(in_Data, 5, maxIterations=10,
                        runs=10, initializationMode="random")

In [53]:
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = in_Data.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

Page 192: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

Evaluating the model and the results

One way to fine-tune the clustering algorithm is by varying the number of clusters and verifying the output. Let's check the clusters and get a feel for the clustering results so far:

In [43]:
cluster_membership = in_Data.map(lambda x: clusters.predict(x))

In [54]:
cluster_idx = cluster_membership.zipWithIndex()

In [55]:
type(cluster_idx)

Out[55]:
pyspark.rdd.PipelinedRDD

In [58]:
cluster_idx.take(20)

Out[58]:
[(3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (1, 6), (3, 7), (3, 8), (3, 9), (3, 10), (3, 11), (3, 12), (3, 13), (3, 14), (1, 15), (3, 16), (3, 17), (1, 18), (1, 19)]

In [59]:
cluster_df = cluster_idx.toDF()

In [65]:
pddf_with_cluster = pd.concat([pddf_in, cluster_pddf], axis=1)


In [76]:
pddf_with_cluster._1.unique()

Out[76]:
array([3, 1, 4, 0, 2])

In [79]:
pddf_with_cluster[pddf_with_cluster['_1'] == 0].head(10)

Out[79]:
      Unnamed: 0                  id                      created_at     user_id           user_name                                        tweet_text  _1    _2
6227           3  642418116819988480  Fri Sep 11 19:23:09 +0000 2015    49693598        Ajinkya Kale  RT @bigdata: Distributed Matrix Computations i…   0  6227
6257          45  642391207205859328  Fri Sep 11 17:36:13 +0000 2015   937467860        Angela Bassa  [Auto] I'm reading ""Distributed Matrix Comput…   0  6257
6297         119  642348577147064320  Fri Sep 11 14:46:49 +0000 2015    18318677          Ben Lorica  Distributed Matrix Computations in @ApacheSpar…   0  6297

In [80]:
pddf_with_cluster[pddf_with_cluster['_1'] == 1].head(10)

Out[80]:
    Unnamed: 0                  id                      created_at     user_id           user_name                                        tweet_text  _1  _2
6            6  638830419090079746  Tue Sep 01 21:46:55 +0000 2015  2241040634     Massimo Carrisi  Python: Python: Removing \xa0 from string? - I…   1   6
15          17  638830380578045953  Tue Sep 01 21:46:46 +0000 2015    57699376     Rafael Monnerat  RT @ramalhoorg: Noite de autógrafos do Fluent…    1  15
18          41  638830280988426250  Tue Sep 01 21:46:22 +0000 2015   951081582        Jack Baldwin  RT @cloudaus: We are 3/4 full! 2-day @swcarpen…   1  18
19          42  638830276626399232  Tue Sep 01 21:46:21 +0000 2015     6525302  Masayoshi Nakamura  PynamoDB #AWS #DynamoDB #Python http://...        1  19
20          43  638830213288235008  Tue Sep 01 21:46:06 +0000 2015  3153874869    Baltimore Python  Flexx: Python UI tookit based on web technolog…   1  20
21          44  638830117645516800  Tue Sep 01 21:45:43 +0000 2015    48474625   Radio Free Denali  Hmm, emerge --depclean wants to remove somethi…   1  21
22          46  638829977014636544  Tue Sep 01 21:45:10 +0000 2015   154915461     Luciano Ramalho  Noite de autógrafos do Fluent Python no Garoa…    1  22
23          47  638829882928070656  Tue Sep 01 21:44:47 +0000 2015   917320920     bsbafflesbrains  @DanSWright Harper channeling Monty Python. "…    1  23
24          48  638829868679954432  Tue Sep 01 21:44:44 +0000 2015   134280898  Lannick Technology  RT @SergeyKalnish: I am #hiring: Senior Backe…    1  24
25          49  638829707484508161  Tue Sep 01 21:44:05 +0000 2015  2839203454        Joshua Jones  RT @LindseyPelas: Surviving Monty Python in Fl…   1  25

In [81]:
pddf_with_cluster[pddf_with_cluster['_1'] == 2].head(10)

Out[81]:
      Unnamed: 0                  id                      created_at     user_id  user_name                               tweet_text  _1    _2
7280         688  639056941592014848  Wed Sep 02 12:47:02 +0000 2015  2735137484      Chris  A true gay icon when will @ladygaga @…   2  7280

In [82]:
pddf_with_cluster[pddf_with_cluster['_1'] == 3].head(10)

Out[82]:
   Unnamed: 0                  id                      created_at     user_id     user_name                                        tweet_text  _1  _2
0           0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  TrueEquality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint…    3   0
1           1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  TrueEquality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   1
2           2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  TrueEquality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg…    3   2
3           3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  TrueEquality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   3
4           4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  TrueEquality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   4
5           5  638830420159655936  Tue Sep 01 21:46:55 +0000 2015  3276255125  TrueEquality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   5
7           7  638830418330980352  Tue Sep 01 21:46:55 +0000 2015  3276255125  TrueEquality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   7
8           8  638830397648822272  Tue Sep 01 21:46:50 +0000 2015  3276255125  TrueEquality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3   8
9           9  638830395375529984  Tue Sep 01 21:46:49 +0000 2015  3276255125  TrueEquality  ernestsgantt: elsahel12: 9_A_6: dreamintention…   3   9
10         10  638830392389177344  Tue Sep 01 21:46:49 +0000 2015  3276255125  TrueEquality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews…   3  10

In [83]:
pddf_with_cluster[pddf_with_cluster['_1'] == 4].head(10)

Out[83]:
      Unnamed: 0                  id                      created_at     user_id          user_name                                        tweet_text  _1    _2
1361         882  642648214454317056  Sat Sep 12 10:37:28 +0000 2015    27415756    Raymond Enisuoh  LA Chosen For US 2024 Olympic Bid - LA2016 See…   4  1361
1363         885  642647848744583168  Sat Sep 12 10:36:01 +0000 2015    27415756    Raymond Enisuoh  Prison See: https://t.co/x3EKAExeFi … … … …...    4  1363
5412          11  640480770369286144  Sun Sep 06 11:04:49 +0000 2015  3242403023  Donald Trump 2016  "igiboooy! @Starbucks https://t.co/97wdL…         4  5412
5428          27  640477140660518912  Sun Sep 06 10:50:24 +0000 2015  3242403023  Donald Trump 2016  "@Starbucks https://t.co/wsEYFIefk7" - D…         4  5428
5455          61  640469542272110592  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "starbucks @Starbucks Mam Plaza https://t.co…     4  5455
5456          62  640469541370372096  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "Aaahhh the pumpkin spice latte is back, fall…    4  5456
5457          63  640469539524898817  Sun Sep 06 10:20:12 +0000 2015  3242403023  Donald Trump 2016  "RT kayyleighferry: Oh my goddd Harry Potter…     4  5457
5458          64  640469537176031232  Sun Sep 06 10:20:11 +0000 2015  3242403023  Donald Trump 2016  "Starbucks https://t.co/3xYYXlwNkf" - Donald…     4  5458
5459          65  640469536119070720  Sun Sep 06 10:20:11 +0000 2015  3242403023  Donald Trump 2016  "A Starbucks is under construction in my neig…    4  5459
5460          66  640469530435813376  Sun Sep 06 10:20:10 +0000 2015  3242403023  Donald Trump 2016  "Babam starbucks'tan fotogtaf atıyor bende du…    4  5460

We map the 5 clusters with some sample tweets. Cluster 0 is about Spark. Cluster 1 is about Python. Cluster 2 is about Lady Gaga. Cluster 3 is about Thailand's Phuket News. Cluster 4 is about Donald Trump.


Building machine learning pipelines

We want to compose the feature extraction, preparatory activities, training, testing, and prediction activities into a single workflow, while optimizing the tuning parameters to get the best performing model.

The following tweet captures perfectly, in five lines of code, a powerful machine learning pipeline implemented in Spark MLlib:

The Spark ML pipeline is inspired by Python's Scikit-Learn and creates a succinct, declarative statement of the successive transformations to the data in order to quickly deliver a tunable model.
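As an illustration of that declarative style, here is a minimal, hypothetical sketch of such a pipeline on our tweet dataframe, chaining the Tokenizer, HashingTF, and IDF stages we used earlier (the column names mirror the ones defined above; this is a sketch, not the tweet's exact code):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

# Declare the successive transformations as pipeline stages
tokenizer = Tokenizer(inputCol="tweet_text", outputCol="tokens")
hashingTF = HashingTF(inputCol="tokens", outputCol="rawFeatures", numFeatures=2000)
idf = IDF(inputCol="rawFeatures", outputCol="features")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])

# Fit the whole pipeline in one call and transform the dataframe
model = pipeline.fit(spdf_02)
features_df = model.transform(spdf_02)

The appeal of the pipeline is that the whole chain can be fitted, persisted, and tuned (for example with a parameter grid) as a single object.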


Summary

In this chapter, we got an overview of Spark MLlib's ever-expanding library of algorithms. We discussed supervised and unsupervised learning, recommender systems, optimization, and feature extraction algorithms. We then put the harvested data from Twitter into the machine learning process, algorithms, and evaluation to derive insights from the data. We put the Twitter-harvested dataset through a Python Scikit-Learn and Spark MLlib K-Means clustering in order to segregate the tweets relevant to Apache Spark. We also evaluated the performance of the model.

This gets us ready for the next chapter, which will cover streaming analytics using Spark. Let's jump right in.


Chapter 5. Streaming Live Data with Spark

In this chapter, we will focus on live streaming data flowing into Spark and processing it. So far, we have discussed machine learning and data mining with batch processing. We are now looking at processing continuously flowing data and detecting facts and patterns on the fly. We are navigating from a lake to a river.

We will first investigate the challenges arising from such a dynamic and ever-changing environment. After laying the grounds on the prerequisites of a streaming application, we will investigate various implementations using live sources of data, from TCP sockets to the Twitter firehose, and put in place a low latency, high throughput, and scalable data pipeline combining Spark, Kafka, and Flume.

In this chapter, we will cover the following points:

Analyzing a streaming application's architectural challenges, constraints, and requirements
Processing live data from a TCP socket with Spark Streaming
Connecting to the Twitter firehose directly to parse tweets in quasi real time
Establishing a reliable, fault tolerant, scalable, high throughput, low latency integrated application using Spark, Kafka, and Flume
Closing remarks on Lambda and Kappa architecture paradigms


Laying the foundations of streaming architecture

As customary, let's first go back to our original drawing of the data-intensive apps architecture blueprint and highlight the Spark Streaming module that will be the topic of interest.

The following diagram sets the context by highlighting the Spark Streaming module and its interactions with Spark SQL and Spark MLlib within the overall data-intensive apps framework.

Data flows from stock market time series, enterprise transactions, interactions, events, web traffic, clickstreams, and sensors. All events are time-stamped data and urgent. This is the case for fraud detection and prevention, mobile cross-sell and upsell, or traffic alerts. Those streams of data require immediate processing for monitoring purposes, such as detecting anomalies, outliers, spam, fraud, and intrusion; and also for providing basic statistics, insights, trends, and recommendations. In some cases, the summarized aggregated information is sufficient to be stored for later usage. From an architecture paradigm perspective, we are moving from a service-oriented architecture to an event-driven architecture.

Two models emerge for processing streams of data:

Processing one record at a time as they come in. We do not buffer the incoming records in a container before processing them. This is the case of Twitter's Storm, Yahoo's S4, and Google's MillWheel.
Micro-batching or batch computations on small intervals, as performed by Spark Streaming and Storm Trident. In this case, we buffer the incoming records in a container according to the time window prescribed in the micro-batching settings.

Spark Streaming has often been compared against Storm. They are two different models of streaming data. Spark Streaming is based on micro-batching. Storm is based on processing records as they come in. Storm also offers a micro-batching option, with its Storm Trident option.

The driving factor in a streaming application is latency. Latency varies from the milliseconds range in the case of RPC (short for Remote Procedure Call) to several seconds or minutes for micro-batching solutions such as Spark Streaming.

RPC allows synchronous operations between the requesting programs waiting for the results from the remote server's procedure. Threads allow concurrency of multiple RPC calls to the server.

An example of software implementing a distributed RPC model is Apache Storm.

Storm implements stateless sub-millisecond latency processing of unbounded tuples using topologies or directed acyclic graphs, combining spouts as sources of data streams and bolts for operations such as filter, join, aggregation, and transformation. Storm also implements a higher level abstraction called Trident which, similarly to Spark, processes data streams in micro batches.

So, looking at the latency continuum, from sub-millisecond to second, Storm is a good candidate. For the seconds-to-minutes scale, Spark Streaming and Storm Trident are excellent fits. For several minutes onward, Spark and a NoSQL database such as Cassandra or HBase are adequate solutions. For ranges beyond the hour and with high volumes of data, Hadoop is the ideal contender.

Although throughput is correlated to latency, it is not a simple inversely linear relationship. If processing a message takes 2 ms, which determines the latency, then one would assume the throughput is limited to 500 messages per sec. Batching messages allows for higher throughput if we allow our messages to be buffered for 8 ms more. With a latency of 10 ms, the system can buffer up to 10,000 messages. For a bearable increase in latency, we have substantially increased throughput. This is the magic of micro-batching that Spark Streaming exploits.


Spark Streaming inner working

The Spark Streaming architecture leverages the Spark core architecture. It overlays on the SparkContext a StreamingContext as the entry point to the streaming functionality. The cluster manager will dedicate at least one worker node as a receiver, which will be an executor with a long running task to process the incoming stream. The executor creates Discretized Streams or DStreams from the input data stream and replicates, by default, the DStream to the cache of another worker. One receiver serves one input data stream. Multiple receivers improve parallelism and generate multiple DStreams that Spark can unite or join as Resilient Distributed Datasets (RDDs).

The following diagram gives an overview of the inner working of Spark Streaming. The client interacts with the Spark cluster via the cluster manager, while Spark Streaming has a dedicated worker with a long running task ingesting the input data stream and transforming it into discretized streams or DStreams. The data is collected, buffered, and replicated by a receiver and then pushed to a stream of RDDs.

Spark receivers can ingest data from many sources. Core input sources range from TCP sockets and HDFS/Amazon S3 to Akka Actors. Additional sources include Apache Kafka, Apache Flume, Amazon Kinesis, ZeroMQ, Twitter, and custom or user-defined receivers.


We distinguish between reliable receivers, which acknowledge receipt of data to the source and replicate it for a possible resend, and unreliable receivers, which do not acknowledge receipt of the message. Spark scales out in terms of the number of workers, partitions, and receivers.

The following diagram gives an overview of Spark Streaming with the possible sources and the persistence options:


Going under the hood of Spark Streaming

Spark Streaming is composed of receivers and is powered by Discretized Streams and Spark connectors for persistence.

Just as the essential data structure for Spark Core is the RDD, the fundamental programming abstraction for Spark Streaming is the Discretized Stream or DStream.

The following diagram illustrates Discretized Streams as continuous sequences of RDDs. The batch intervals of a DStream are configurable.

DStreams snapshot the incoming data in batch intervals. Those time steps typically range from 500 ms to several seconds. The underlying structure of a DStream is an RDD.

A DStream is essentially a continuous sequence of RDDs. This is powerful, as it allows us to leverage from Spark Streaming all the traditional functions, transformations, and actions available in Spark Core, and allows us to dialogue with Spark SQL, performing SQL queries on incoming streams of data, and with Spark MLlib. Transformations similar to those on generic and key-value pair RDDs are applicable. The DStreams benefit from the inner RDDs' lineage and fault tolerance. Additional transformation and output operations exist for discretized stream operations. The most generic operations on a DStream are transform and foreachRDD.
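A minimal sketch of these two generic operations, assuming a StreamingContext ssc, a SQLContext sqlContext, and an input DStream of tweet texts named lines (the names are illustrative, not from a specific example in this book):

# transform: apply an arbitrary RDD-to-RDD function to every batch
hashtags = lines.transform(
    lambda rdd: rdd.flatMap(lambda line: line.split(" "))
                   .filter(lambda word: word.startswith("#")))

# foreachRDD: push each batch out, here by registering it as a temporary table
def save_batch(rdd):
    if not rdd.isEmpty():
        df = sqlContext.createDataFrame(rdd.map(lambda tag: (tag,)), ["hashtag"])
        df.registerTempTable("hashtags")

hashtags.foreachRDD(save_batch)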

The following diagram gives an overview of the lifecycle of DStreams, from the creation of the micro-batches of messages materialized into RDDs, to the transformation functions and the actions that trigger Spark jobs. Breaking down the steps illustrated in the diagram, we read the diagram top down:

1. In the Input Stream, the incoming messages are buffered in a container according to the time window allocated for the micro-batching.

2. In the discretized stream step, the buffered micro-batches are transformed as DStream RDDs.

3. The Mapped DStream step is obtained by applying a transformation function to the original DStream. These first three steps constitute the transformation of the original data received in predefined time windows. As the underlying data structure is the RDD, we conserve the data lineage of the transformations.

4. The final step is an action on the RDD. It triggers the Spark job.


Transformations can be stateless or stateful. Stateless means that no state is maintained by the program, while stateful means the program keeps a state, in which case previous transactions are remembered and may affect the current transaction. A stateful operation modifies or requires some state of the system, and a stateless operation does not.

Stateless transformations process each batch in a DStream one at a time. Stateful transformations process multiple batches to obtain results. Stateful transformations require the checkpoint directory to be configured. Checkpointing is the main mechanism for fault tolerance in Spark Streaming as it periodically saves data and metadata about an application.

There are two types of stateful transformations in Spark Streaming: updateStateByKey and windowed transformations.

updateStateByKey is a transformation that maintains state for each key in a stream of pair RDDs. It returns a new state DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of each key. An example would be a running count of given hashtags in a stream of tweets.
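A minimal sketch of that running hashtag count, assuming a StreamingContext ssc and a DStream of tweet texts named tweets (illustrative names only; note the checkpoint directory required by stateful transformations):

# Stateful transformations need a checkpoint directory
ssc.checkpoint("checkpoint")

def update_count(new_values, running_count):
    # new_values: counts from the current batch; running_count: previous state
    return sum(new_values) + (running_count or 0)

hashtag_counts = tweets.flatMap(lambda text: text.split(" ")) \
    .filter(lambda word: word.startswith("#")) \
    .map(lambda tag: (tag, 1)) \
    .updateStateByKey(update_count)

hashtag_counts.pprint()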


Windowed transformations are carried over multiple batches in a sliding window. A window has a defined length or duration specified in time units. It must be a multiple of the DStream batch interval. It defines how many batches are included in a windowed transformation.

A window has a sliding interval or sliding duration, also specified in time units. It must be a multiple of the DStream batch interval. It defines how many batches to slide a window by, or how frequently to compute a windowed transformation.

The following schema depicts the windowing operation on DStreams to derive window DStreams with a given length and sliding interval:

A sample function is countByWindow(windowLength, slideInterval). It returns a new DStream in which each RDD has a single element generated by counting the number of elements in a sliding window over this DStream. An illustration in this case would be a running count of given hashtags in a stream of tweets every 60 seconds. The window time frame is specified.
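A minimal sketch, reusing the hashtag idea from the previous example but over a sliding window (assuming a batch interval of 10 seconds, a 60-second window, a 20-second slide, and the checkpoint directory set above; names are illustrative):

# Count each hashtag seen in the last 60 seconds, recomputed every 20 seconds
windowed_counts = tweets.flatMap(lambda text: text.split(" ")) \
    .filter(lambda word: word.startswith("#")) \
    .map(lambda tag: (tag, 1)) \
    .reduceByKeyAndWindow(lambda a, b: a + b, None, 60, 20)

# Total number of hashtags in the same window
total_in_window = tweets.flatMap(lambda text: text.split(" ")) \
    .filter(lambda word: word.startswith("#")) \
    .countByWindow(60, 20)

windowed_counts.pprint()
total_in_window.pprint()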

A minute-scale window length is reasonable. An hour-scale window length is not recommended as it is compute and memory intensive. It would be more convenient to aggregate the data in a database such as Cassandra or HBase.

Windowed transformations compute results based on the window length and the window slide interval. Spark performance is primarily affected by the window length, the window slide interval, and persistence.


Building in fault tolerance

Real-time stream processing systems must be operational 24/7. They need to be resilient to all sorts of failures in the system. Spark and its RDD abstraction are designed to seamlessly handle failures of any worker nodes in the cluster.

The main Spark Streaming fault tolerance mechanisms are checkpointing, automatic driver restart, and automatic failover. Spark enables recovery from driver failure using checkpointing, which preserves the application state.

Write-ahead logs, reliable receivers, and file streams guarantee zero data loss as of Spark version 1.2. Write-ahead logs represent a fault-tolerant storage for received data.

Failures require recomputing results. DStream operations have exactly-once semantics. Transformations can be recomputed multiple times but will yield the same result. DStream output operations have at-least-once semantics. Output operations may be executed multiple times.


Processing live data with TCP sockets

As a stepping stone to the overall understanding of streaming operations, we will first experiment with TCP sockets. A TCP socket establishes two-way communication between client and server, and data can be exchanged through the established connection. WebSocket connections are long lived, unlike typical HTTP connections. HTTP is not meant to keep an open connection from the server to push data continuously to the web browsers. Most web applications hence resorted to long polling via frequent Asynchronous JavaScript and XML (AJAX) requests. WebSockets, standardized and implemented in HTML5, are moving beyond web browsers and are becoming a cross-platform standard for real-time communication between client and server.


Setting up TCP sockets

We create a TCP Socket Server by running netcat, a small utility found in most Linux systems, as a data server with the command > nc -lk 9999, where 9999 is the port where we are sending data:

#
# Socket Server
#
an@an-VB:~$ nc -lk 9999
hello world
how are you
hello world
cool it works

Once netcat is running, we will open a second console with our Spark Streaming client to receive the data and process it. As soon as the Spark Streaming client console is listening, we start typing the words to be processed, that is, hello world.


Processing live data

We will be using the example program provided in the Spark bundle for Spark Streaming called network_wordcount.py. It can be found in the GitHub repository under https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py. The code is as follows:

"""

CountswordsinUTF8encoded,'\n'delimitedtextreceivedfromthe

networkeverysecond.

Usage:network_wordcount.py<hostname><port>

<hostname>and<port>describetheTCPserverthatSparkStreamingwould

connecttoreceivedata.

Torunthisonyourlocalmachine,youneedtofirstrunaNetcatserver

`$nc-lk9999`

andthenruntheexample

`$bin/spark-submit

examples/src/main/python/streaming/network_wordcount.pylocalhost9999`

"""

from__future__importprint_function

importsys

frompysparkimportSparkContext

frompyspark.streamingimportStreamingContext

if__name__=="__main__":

iflen(sys.argv)!=3:

print("Usage:network_wordcount.py<hostname><port>",

file=sys.stderr)

exit(-1)

sc=SparkContext(appName="PythonStreamingNetworkWordCount")

ssc=StreamingContext(sc,1)

lines=ssc.socketTextStream(sys.argv[1],int(sys.argv[2]))

counts=lines.flatMap(lambdaline:line.split(""))\

.map(lambdaword:(word,1))\

.reduceByKey(lambdaa,b:a+b)

counts.pprint()

ssc.start()

ssc.awaitTermination()

Here, we explain the steps of the program:

1. The code first initializes a Spark Streaming context with the command:

   ssc = StreamingContext(sc, 1)

2. Next, the streaming computation is set up.

3. One or more DStream objects that receive data are defined to connect to localhost or 127.0.0.1 on port 9999:

   stream = ssc.socketTextStream("127.0.0.1", 9999)

4. The DStream computation is defined with transformations and output operations (mirroring the word count above):

   stream.flatMap(lambda line: line.split(" ")) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda a, b: a + b) \
         .pprint()

5. Computation is started:

   ssc.start()

6. Program termination is pending manual or error processing completion:

   ssc.awaitTermination()

7. Manual completion is an option when a completion condition is known:

   ssc.stop()

We can monitor the Spark Streaming application by visiting the Spark monitoring home page at localhost:4040.

Here's the result of running the program and feeding the words on the netcat server console:

#
# Socket Client
#
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999

Run the Spark Streaming network_wordcount program by connecting to the socket localhost on port 9999:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
-------------------------------------------
Time: 2015-10-18 20:06:06
-------------------------------------------
(u'world', 1)
(u'hello', 1)

-------------------------------------------
Time: 2015-10-18 20:06:07
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:17
-------------------------------------------
(u'you', 1)
(u'how', 1)
(u'are', 1)

-------------------------------------------
Time: 2015-10-18 20:06:18
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:26
-------------------------------------------
(u'', 1)
(u'world', 1)
(u'hello', 1)

-------------------------------------------
Time: 2015-10-18 20:06:27
-------------------------------------------
...
-------------------------------------------
Time: 2015-10-18 20:06:37
-------------------------------------------
(u'works', 1)
(u'it', 1)
(u'cool', 1)

-------------------------------------------
Time: 2015-10-18 20:06:38
-------------------------------------------
...

Thus, we have established a connection through the socket on port 9999, streamed the data sent by the netcat server, and performed a word count on the messages sent.


Manipulating Twitter data in real time

Twitter offers two APIs. One is a search API that essentially allows us to retrieve past tweets based on search terms. This is how we have been collecting our data from Twitter in the previous chapters of the book. Interestingly, for our current purpose, Twitter also offers a live streaming API which allows us to ingest tweets as they are emitted in the blogosphere.


Processing Tweets in real time from the Twitter firehose

The following program connects to the Twitter firehose and processes the incoming tweets to exclude deleted or invalid tweets, and parses on the fly only the relevant ones to extract the screen name, the actual tweet or tweet text, the retweet count, and the geo-location information. The processed tweets are gathered into an RDD queue by Spark Streaming and then displayed on the console at a one-second interval:

"""

TwitterStreamingAPISparkStreamingintoanRDD-Queuetoprocesstweets

live

CreateaqueueofRDDsthatwillbemapped/reducedoneatatimein

1secondintervals.

Torunthisexampleuse

'$bin/spark-submit

examples/AN_Spark/AN_Spark_Code/s07_twitterstreaming.py'

"""

#

importtime

frompysparkimportSparkContext

frompyspark.streamingimportStreamingContext

importtwitter

importdateutil.parser

importjson

#ConnectingStreamingTwitterwithStreamingSparkviaQueue

classTweet(dict):

def__init__(self,tweet_in):

super(Tweet,self).__init__(self)

iftweet_inand'delete'notintweet_in:

self['timestamp']=

dateutil.parser.parse(tweet_in[u'created_at']

).replace(tzinfo=None).isoformat()

self['text']=tweet_in['text'].encode('utf-8')

#self['text']=tweet_in['text']

self['hashtags']=[x['text'].encode('utf-8')forxin

tweet_in['entities']['hashtags']]

#self['hashtags']=[x['text']forxintweet_in['entities']

['hashtags']]

self['geo']=tweet_in['geo']['coordinates']iftweet_in['geo']

elseNone

self['id']=tweet_in['id']

self['screen_name']=tweet_in['user']

['screen_name'].encode('utf-8')

#self['screen_name']=tweet_in['user']['screen_name']

self['user_id']=tweet_in['user']['id']

defconnect_twitter():

Page 219: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

twitter_stream=twitter.TwitterStream(auth=twitter.OAuth(

token="get_your_own_credentials",

token_secret="get_your_own_credentials",

consumer_key="get_your_own_credentials",

consumer_secret="get_your_own_credentials"))

returntwitter_stream

defget_next_tweet(twitter_stream):

stream=twitter_stream.statuses.sample(block=True)

tweet_in=None

whilenottweet_inor'delete'intweet_in:

tweet_in=stream.next()

tweet_parsed=Tweet(tweet_in)

returnjson.dumps(tweet_parsed)

defprocess_rdd_queue(twitter_stream):

#CreatethequeuethroughwhichRDDscanbepushedto

#aQueueInputDStream

rddQueue=[]

foriinrange(3):

rddQueue+=

[ssc.sparkContext.parallelize([get_next_tweet(twitter_stream)],5)]

lines=ssc.queueStream(rddQueue)

lines.pprint()

if__name__=="__main__":

sc=SparkContext(appName="PythonStreamingQueueStream")

ssc=StreamingContext(sc,1)

#Instantiatethetwitter_stream

twitter_stream=connect_twitter()

#GetRDDqueueofthestreamsjsonorparsed

process_rdd_queue(twitter_stream)

ssc.start()

time.sleep(2)

ssc.stop(stopSparkContext=True,stopGraceFully=True)

When we run this program, it delivers the following output:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ bin/spark-submit examples/AN_Spark/AN_Spark_Code/s07_twitterstreaming.py
-------------------------------------------
Time: 2015-11-03 21:53:14
-------------------------------------------
{"user_id": 3242732207, "screen_name": "cypuqygoducu", "timestamp": "2015-11-03T20:53:04", "hashtags": [], "text": "RT @VIralBuzzNewss: Our Distinctive Edition Holiday break Challenge Is In this article! Hooray!... - https://t.co/9d8wumrd5v https://t.co/\u2026", "geo": null, "id": 661647303678259200}

-------------------------------------------
Time: 2015-11-03 21:53:15
-------------------------------------------
{"user_id": 352673159, "screen_name": "melly_boo_orig", "timestamp": "2015-11-03T20:53:05", "hashtags": ["eminem"], "text": "#eminem https://t.co/GlEjPJnwxy", "geo": null, "id": 661647307847409668}

-------------------------------------------
Time: 2015-11-03 21:53:16
-------------------------------------------
{"user_id": 500620889, "screen_name": "NBAtheist", "timestamp": "2015-11-03T20:53:06", "hashtags": ["tehInterwebbies", "Nutters"], "text": "See? That didn't take long or any actual effort. This is #tehInterwebbies … #NuttersAbound! https://t.co/QS8gLStYFO", "geo": null, "id": 661647312062709761}

So, we got an example of streaming tweets with Spark and processing them on the fly.


Building a reliable and scalable streaming app

Ingesting data is the process of acquiring data from various sources and storing it for processing immediately or at a later stage. Data consuming systems are dispersed and can be physically and architecturally far from the sources. Data ingestion is often implemented manually with scripts and rudimentary automation. It actually calls for higher level frameworks like Flume and Kafka.

The challenges of data ingestion arise from the fact that the sources are physically spread out and are transient, which makes the integration brittle. Data production is continuous for weather, traffic, social media, network activity, shop floor sensors, security, and surveillance. Ever increasing data volumes and rates, coupled with ever changing data structures and semantics, make data ingestion ad hoc and error prone.

The aim is to become more agile, reliable, and scalable. The agility, reliability, and scalability of the data ingestion determine the overall health of the pipeline. Agility means integrating new sources as they arise and incorporating changes to existing sources as needed. In order to ensure safety and reliability, we need to protect the infrastructure against data loss and the downstream applications from silent data corruption at ingress. Scalability avoids ingest bottlenecks while keeping cost tractable.

Ingest Mode            Description                                                Example
Manual or Scripted     File copy using command line interface or GUI interface   HDFS Client, Cloudera Hue
Batch Data Transport   Bulk data transport using tools                           DistCp, Sqoop
Micro Batch            Transport of small batches of data                        Sqoop, Sqoop2, Storm
Pipelining             Flow-like transport of event streams                      Flume, Scribe
Message Queue          Publish Subscribe message bus of events                   Kafka, Kinesis

In order to enable an event-driven business that is able to ingest multiple streams of data, process it in flight, and make sense of it all to get to rapid decisions, the key driver is the Unified Log.

A Unified Log is a centralized enterprise structured log available for real-time subscription. All the organization's data is put in a central log for subscription. Records are numbered beginning with zero in the order that they are written. It is also known as a commit log or journal. The concept of the Unified Log is the central tenet of the Kappa architecture.

The properties of the Unified Log are as follows:

Unified: There is a single deployment for the entire organization
Append only: Events are immutable and are appended
Ordered: Each event has a unique offset within a shard
Distributed: For fault tolerance purposes, the Unified Log is distributed redundantly on a cluster of computers
Fast: The system ingests thousands of messages per second
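As a toy illustration of the append-only and ordered properties (a sketch for intuition, not a production design), a log that assigns zero-based offsets could look like this:

class UnifiedLog(object):
    """Toy append-only log: records are immutable and numbered from zero."""
    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)   # next zero-based offset
        self._records.append(record)  # records are only ever appended
        return offset

    def read(self, offset):
        return self._records[offset]

log = UnifiedLog()
print(log.append({'event': 'page_view'}))   # 0
print(log.append({'event': 'click'}))       # 1
print(log.read(0))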


Setting up Kafka

In order to isolate the downstream consumption of data from the vagaries of the upstream emission of data, we need to decouple the providers of data from the receivers or consumers of data. As they are living in two different worlds with different cycles and constraints, Kafka decouples the data pipelines.

Apache Kafka is a distributed publish-subscribe messaging system rethought as a distributed commit log. The messages are stored by topic.

Apache Kafka has the following properties. It supports:

High throughput for high volumes of event feeds
Real-time processing of new and derived feeds
Large data backlogs and persistence for offline consumption
Low latency as an enterprise-wide messaging system
Fault tolerance thanks to its distributed nature

Messages are stored in partitions with a unique sequential ID called the offset. Consumers track their pointers via the tuple of (offset, partition, topic).

Let's dive deeper into the anatomy of Kafka.

Kafka has essentially three components: producers, consumers, and brokers. Producers push and write data to brokers. Consumers pull and read data from brokers. Brokers do not push messages to consumers; consumers pull messages from brokers. The setup is distributed and coordinated by Apache ZooKeeper.

The brokers manage and store the data in topics. Topics are split into replicated partitions. The data is persisted in the broker, and is not removed upon consumption but kept until the retention period expires. If a consumer fails, it can always go back to the broker to fetch the data.

Kafka requires Apache ZooKeeper. ZooKeeper is a high-performance coordination service for distributed applications. It centrally manages configuration, registry or naming services, group membership, locks, and synchronization for coordination between servers. It provides a hierarchical namespace with metadata, monitoring statistics, and the state of the cluster. ZooKeeper can introduce brokers and consumers on the fly and then rebalances the cluster.

Kafka producers do not need ZooKeeper. Kafka brokers use ZooKeeper to provide general state information as well as to elect a leader in case of failure. Kafka consumers use ZooKeeper to track message offsets. Newer versions of Kafka spare the consumers from going through ZooKeeper and let them retrieve this information from special Kafka topics. Kafka provides automatic load balancing for producers.

The following diagram gives an overview of the Kafka setup:


Installing and testing Kafka

We will download the Apache Kafka binaries from the dedicated web page at http://kafka.apache.org/downloads.html and install the software on our machine using the following steps:

1. Download the code.

2. Download the 0.8.2.0 release and un-tar it:

   > tar -xzf kafka_2.10-0.8.2.0.tgz
   > cd kafka_2.10-0.8.2.0

3. Start ZooKeeper. Kafka uses ZooKeeper, so we need to first start a ZooKeeper server. We will use the convenience script packaged with Kafka to get a single-node ZooKeeper instance:

   > bin/zookeeper-server-start.sh config/zookeeper.properties

   an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/zookeeper-server-start.sh config/zookeeper.properties
   [2015-10-31 22:49:14,808] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
   [2015-10-31 22:49:14,816] INFO autopurge.snapRetainCount set to 3 (org.apache.zookeeper.server.DatadirCleanupManager)...

4. Now launch the Kafka server:

   > bin/kafka-server-start.sh config/server.properties

   an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-server-start.sh config/server.properties
   [2015-10-31 22:52:04,643] INFO Verifying properties (kafka.utils.VerifiableProperties)
   [2015-10-31 22:52:04,714] INFO Property broker.id is overridden to 0 (kafka.utils.VerifiableProperties)
   [2015-10-31 22:52:04,715] INFO Property log.cleaner.enable is overridden to false (kafka.utils.VerifiableProperties)
   [2015-10-31 22:52:04,715] INFO Property log.dirs is overridden to /tmp/kafka-logs (kafka.utils.VerifiableProperties) [2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)

5. Create a topic. Let's create a topic named test with a single partition and only one replica:

   > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

6. We can now see that topic if we run the list topic command:

   > bin/kafka-topics.sh --list --zookeeper localhost:2181
   test

   an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
   Created topic "test".
   an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-topics.sh --list --zookeeper localhost:2181
   test

7. Check the Kafka installation by creating a producer and consumer. We first launch a producer and type a message in the console:

   an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
   [2015-10-31 22:54:43,698] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
   This is a message
   This is another message

8. We then launch a consumer to check that we receive the message:

   an@an-VB:~$ cd kafka/
   an@an-VB:~/kafka$ cd kafka_2.10-0.8.2.0/
   an@an-VB:~/kafka/kafka_2.10-0.8.2.0$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
   This is a message
   This is another message

The messages were appropriately received by the consumer:

1. Check the Kafka and Spark Streaming consumer. We will be using the Spark Streaming Kafka word count example provided in the Spark bundle. A word of caution: we have to bind the Kafka packages, --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0, when we submit the Spark job. The command is as follows:

   ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 \
   examples/src/main/python/streaming/kafka_wordcount.py \
   localhost:2181 test

2. When we launch the Spark Streaming word count program with Kafka, we get the following output:

   an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$ ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.0 examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
   -------------------------------------------
   Time: 2015-10-31 23:46:33
   -------------------------------------------
   (u'', 1)
   (u'from', 2)
   (u'Hello', 2)
   (u'Kafka', 2)

   -------------------------------------------
   Time: 2015-10-31 23:46:34
   -------------------------------------------

   -------------------------------------------
   Time: 2015-10-31 23:46:35
   -------------------------------------------

3. Install the Kafka Python driver in order to be able to programmatically develop producers and consumers and interact with Kafka and Spark using Python. We will use the road-tested library from David Arthur, aka Mumrah, on GitHub (https://github.com/mumrah). We can pip install it as follows:

   > pip install kafka-python

   an@an-VB:~$ pip install kafka-python
   Collecting kafka-python
     Downloading kafka-python-0.9.4.tar.gz (63kB)
   ...
   Successfully installed kafka-python-0.9.4

Developing producers

The following program creates a simple Kafka producer that will emit the message this is a message sent from the Kafka producer five times, followed by a time stamp, every second:

#
# kafka producer
#
import time
from kafka.common import LeaderNotAvailableError
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer
from datetime import datetime

def print_response(response=None):
    if response:
        print('Error: {0}'.format(response[0].error))
        print('Offset: {0}'.format(response[0].offset))

def main():
    kafka = KafkaClient("localhost:9092")
    producer = SimpleProducer(kafka)
    try:
        time.sleep(5)
        topic = 'test'
        for i in range(5):
            time.sleep(1)
            msg = 'This is a message sent from the kafka producer: ' \
                  + str(datetime.now().time()) + '—' \
                  + str(datetime.now().strftime("%A, %d %B %Y %I:%M%p"))
            print_response(producer.send_messages(topic, msg))
    except LeaderNotAvailableError:
        # https://github.com/mumrah/kafka-python/issues/249
        time.sleep(1)
        print_response(producer.send_messages(topic, msg))

    kafka.close()

if __name__ == "__main__":
    main()

When we run this program, the following output is generated:

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$ python s08_kafka_producer_01.py
Error: 0
Offset: 13
Error: 0
Offset: 14
Error: 0
Offset: 15
Error: 0
Offset: 16
Error: 0
Offset: 17
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$


It tells us there were no errors and gives the offset of the messages given by the Kafka broker.

Developing consumers

To fetch the messages from the Kafka brokers, we develop a Kafka consumer:

# kafka consumer
# consumes messages from "test" topic and writes them to console.
#
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

def main():
    kafka = KafkaClient("localhost:9092")
    print("Consumer established connection to kafka")
    consumer = SimpleConsumer(kafka, "my-group", "test")
    for message in consumer:
        # This will wait and print messages as they become available
        print(message)

if __name__ == "__main__":
    main()

When we run this program, we effectively confirm that the consumer received all the messages:

an@an-VB:~$ cd ~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code/
an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/AN_Spark_Code$ python s08_kafka_consumer_01.py
Consumer established connection to kafka
OffsetAndMessage(offset=13, message=Message(magic=0, attributes=0, key=None, value='This is a message sent from the kafka producer: 11:50:17.867309—Sunday, 01 November 2015 11:50AM'))
...
OffsetAndMessage(offset=17, message=Message(magic=0, attributes=0, key=None, value='This is a message sent from the kafka producer: 11:50:22.051423—Sunday, 01 November 2015 11:50AM'))

Developing a Spark Streaming consumer for Kafka

Based on the example code provided in the Spark Streaming bundle, we will create a Spark Streaming consumer for Kafka and perform a word count on the messages stored with the brokers:

#
# Kafka Spark Streaming Consumer
#
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_spark_consumer_01.py <zk> <topic>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)

    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

RunthisprogramwiththefollowingSparksubmitcommand:

./bin/spark-submit--packagesorg.apache.spark:spark-streaming-

kafka_2.10:1.5.0

examples/AN_Spark/AN_Spark_Code/s08_kafka_spark_consumer_01.py

localhost:2181test

Wegetthefollowingoutput:

an@an-VB:~$cdspark/spark-1.5.0-bin-hadoop2.6/

an@an-VB:~/spark/spark-1.5.0-bin-hadoop2.6$./bin/spark-submit\

>--packagesorg.apache.spark:spark-streaming-kafka_2.10:1.5.0\

>examples/AN_Spark/AN_Spark_Code/s08_kafka_spark_consumer_01.py

localhost:2181test…

::retrieving::org.apache.spark#spark-submit-parent

confs:[default]

0artifactscopied,10alreadyretrieved(0kB/18ms)

-------------------------------------------

Time:2015-11-0112:13:16

-------------------------------------------

-------------------------------------------

Time:2015-11-0112:13:17

-------------------------------------------

-------------------------------------------

Time:2015-11-0112:13:18

-------------------------------------------

-------------------------------------------

Time:2015-11-0112:13:19

-------------------------------------------

(u'a',5)

(u'the',5)

(u'11:50AM',5)

Page 231: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

(u'from',5)

(u'This',5)

(u'11:50:21.044374Sunday,',1)

(u'message',5)

(u'11:50:20.036422Sunday,',1)

(u'11:50:22.051423Sunday,',1)

(u'11:50:17.867309Sunday,',1)

...

-------------------------------------------

Time:2015-11-0112:13:20

-------------------------------------------

-------------------------------------------

Time:2015-11-0112:13:21

-------------------------------------------
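For completeness, Spark also offers a receiver-less "direct" approach to Kafka through KafkaUtils.createDirectStream, which reads from the brokers themselves instead of going through a ZooKeeper quorum. The following is a minimal sketch rather than the book's own code; the broker address and topic are illustrative, and this Python API is available in recent Spark 1.x releases:

#
# Sketch only: direct (receiver-less) Kafka word count.
# Assumptions: a Kafka broker on localhost:9092 and the 'test' topic used above.
#
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
ssc = StreamingContext(sc, 1)

# connect straight to the broker list instead of a ZooKeeper quorum
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
counts = kvs.map(lambda x: x[1]) \
            .flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)
counts.pprint()
ssc.start()
ssc.awaitTermination()

It is submitted in the same way with spark-submit and the Kafka streaming package.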


Exploring Flume

Flume is a continuous ingestion system. It was originally designed to be a log aggregation system, but it evolved to handle any type of streaming event data.

Flume is a distributed, reliable, scalable, and available pipeline system for the efficient collection, aggregation, and transport of large volumes of data. It has built-in support for contextual routing, filtering, replication, and multiplexing. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for real-time analytic applications.

Flume offers the following:

Guaranteed delivery semantics
Low-latency reliable data transfer
Declarative configuration with no coding required
Extendable and customizable settings
Integration with most commonly used end-points

The anatomy of Flume contains the following elements:

Event: An event is the fundamental unit of data that is transported by Flume from source to destination. It is like a message with a byte array payload that is opaque to Flume, plus optional headers used for contextual routing.
Client: A client produces and transmits events. A client decouples Flume from the data consumers. It is an entity that generates events and sends them to one or more agents. A custom client, a Flume log4j appender program, or an embedded application agent can act as a client.
Agent: An agent is a container hosting sources, channels, sinks, and other elements that enable the transportation of events from one place to the other. It provides configuration, lifecycle management, and monitoring for the hosted components. An agent is a physical Java virtual machine running Flume.
Source: A source is the entity through which Flume receives events. Sources require at least one channel to function, in order to either actively poll data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
Sink: A sink is the entity that drains data from the channel and delivers it to the next destination. A variety of sinks allow data to be streamed to a range of destinations. Sinks support serialization to the user's format. One example is the HDFS sink that writes events to HDFS.
Channel: A channel is the conduit between the source and the sink that buffers incoming events until they are drained by sinks. Sources feed events into the channel and the sinks drain the channel. Channels decouple the impedance of upstream and downstream systems. A burst of data upstream is damped by the channels. Failures downstream are transparently absorbed by the channels. Sizing the channel capacity to cope with these events is key to realizing these benefits. Channels offer two levels of persistence: either a memory channel, which is volatile if the JVM crashes, or a file channel backed by a Write Ahead Log that stores the information to disk. Channels are fully transactional.

Let's illustrate all these concepts:


Developing data pipelines with Flume, Kafka, and Spark

Building resilient data pipelines leverages the learnings from the previous sections. We are plumbing together data ingestion and transport with Flume, data brokerage with a reliable and sophisticated publish-subscribe messaging system such as Kafka, and finally processing the computation on the fly using Spark Streaming. A minimal sketch of Spark ingesting Flume events directly follows the activity list below.

The following diagram illustrates the composition of streaming data pipelines as a sequence of connect, collect, conduct, compose, consume, consign, and control activities. These activities are configurable based on the use case:

Connect establishes the binding with the streaming API.
Collect creates collection threads.
Conduct decouples the data producers from the consumers by creating a buffer queue or a publish-subscribe mechanism.
Compose is focused on processing the data.
Consume provisions the processed data for the consuming systems.
Consign takes care of the data persistence.
Control caters to the governance and monitoring of the systems, data, and applications.
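As referenced above, here is a minimal sketch of the conduct-and-compose steps with Spark reading Flume events directly. This is not the book's code: it assumes a Spark release whose Python API ships the Flume connector (pyspark.streaming.flume) and a Flume agent configured with an Avro sink pointing at the host and port given below:

#
# Sketch only: Spark Streaming receiving events pushed by a Flume Avro sink.
# Assumptions: pyspark.streaming.flume is available in your Spark release and
# the Flume agent's Avro sink targets localhost:9999.
#
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="PythonStreamingFlumeEvents")
ssc = StreamingContext(sc, 1)

# each element of the stream is a (headers, body) pair
flume_stream = FlumeUtils.createStream(ssc, "localhost", 9999)
flume_stream.count().map(lambda c: "Received %d Flume events" % c).pprint()

ssc.start()
ssc.awaitTermination()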

The following diagram illustrates the concepts of the streaming data pipelines with their key components: Spark Streaming, Kafka, Flume, and low-latency databases. In the consuming or controlling applications, we are monitoring our systems in real time (depicted by a monitor) or sending real-time alerts (depicted by red lights) in case certain thresholds are crossed.

The following diagram illustrates Spark's unique ability to process, in a single platform, data in motion and data at rest, while seamlessly interfacing with multiple persistence data stores as per the use case requirements.

This diagram brings into one unified whole all the concepts discussed up to now. The top part describes the streaming processing pipeline. The bottom part describes the batch processing pipeline. They both share a common persistence layer in the middle of the diagram, depicting the various modes of persistence and serialization.


Closing remarks on the Lambda and Kappa architecture

Two architecture paradigms are currently in vogue: the Lambda and Kappa architectures.

Lambda is the brainchild of the Storm creator and main committer, Nathan Marz. It essentially advocates building a functional architecture on all data. The architecture has two branches. The first is a batch arm envisioned to be powered by Hadoop, where historical, high-latency, high-throughput data are pre-processed and made ready for consumption. The real-time arm is envisioned to be powered by Storm, and it processes incrementally streaming data, derives insights on the fly, and feeds aggregated information back to the batch storage.

Kappa is the brainchild of one of the main committers of Kafka, Jay Kreps, and his colleagues at Confluent (previously at LinkedIn). It advocates a full streaming pipeline, effectively implementing, at the enterprise level, the unified log introduced in the previous pages.


Understanding Lambda architecture

The Lambda architecture combines batch and streaming data to provide a unified query mechanism on all available data. The Lambda architecture envisions three layers: a batch layer where precomputed information is stored, a speed layer where real-time incremental information is processed as data streams, and finally the serving layer that merges batch and real-time views for ad hoc queries. The following diagram gives an overview of the Lambda architecture:

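As a toy illustration of the three layers just described (purely illustrative; the names and numbers below are made up and this is not the book's code), the serving layer simply merges a precomputed batch view with the speed layer's running increments at query time:

# Toy sketch of the Lambda serving layer: merge batch and real-time views.
# All data below is invented for illustration.
batch_view = {'#spark': 120, '#python': 95}   # precomputed by the batch layer
speed_view = {'#spark': 7, '#bigdata': 3}     # incremental counts from the speed layer

def query(hashtag):
    """Serving layer: answer an ad hoc query from both views."""
    return batch_view.get(hashtag, 0) + speed_view.get(hashtag, 0)

print(query('#spark'))   # 127 = 120 from the batch view + 7 from the speed view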

Understanding Kappa architecture

The Kappa architecture proposes to drive the full enterprise in streaming mode. The Kappa architecture arose from a critique by Jay Kreps and his colleagues, at LinkedIn at the time. Since then, they moved on and created Confluent, with Apache Kafka as the main enabler of the Kappa architecture vision. The basic tenet is to move to an all-streaming mode, with a Unified Log as the main backbone of the enterprise information architecture. A minimal sketch of replaying such a log from offset zero follows the property list below.

A Unified Log is a centralized enterprise structured log available for real-time subscription. All the organization's data is put in a central log for subscription. Records are numbered beginning with zero, in the order in which they are written. It is also known as a commit log or journal. The concept of the Unified Log is the central tenet of the Kappa architecture.

The properties of the unified log are as follows:

Unified: There is a single deployment for the entire organization
Append only: Events are immutable and are appended
Ordered: Each event has a unique offset within a shard
Distributed: For fault tolerance purposes, the unified log is distributed redundantly on a cluster of computers
Fast: The system ingests thousands of messages per second
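The Kappa idea of reprocessing history with the same code path as live data can be sketched with the kafka-python library used earlier in this chapter: a consumer can seek back to the beginning of the log and re-read every event. This is a minimal illustration on the test topic, not the book's code:

# Sketch only: replay the unified log from the beginning with kafka-python.
# Assumptions: the Kafka broker and 'test' topic set up earlier in this chapter.
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kafka = KafkaClient("localhost:9092")
consumer = SimpleConsumer(kafka, "replay-group", "test")
consumer.seek(0, 0)   # rewind to the earliest available offset
for message in consumer:
    # reprocess historical events exactly as live events would be processed
    print(message.offset, message.message.value)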

The following screenshot captures the moment Jay Kreps announced his reservations about the Lambda architecture. His main reservation about the Lambda architecture is that it implements the same job in two different systems, Hadoop and Storm, each with its specific idiosyncrasies and with all the complexities that come along with it. The Kappa architecture processes the real-time data and reprocesses historical data in the same framework, powered by Apache Kafka.


Summary

In this chapter, we laid out the foundations of streaming architecture apps and described their challenges, constraints, and benefits. We went under the hood and examined the inner workings of Spark Streaming and how it fits with Spark Core and dialogues with Spark SQL and Spark MLlib. We illustrated the streaming concepts with TCP sockets, followed by live tweet ingestion and processing directly from the Twitter firehose. We discussed the notions of decoupling upstream data publishing from downstream data subscription and consumption using Kafka in order to maximize the resilience of the overall streaming architecture. We also discussed Flume, a reliable, flexible, and scalable data ingestion and transport pipeline system. The combination of Flume, Kafka, and Spark delivers unparalleled robustness, speed, and agility in an ever-changing landscape. We closed the chapter with some remarks and observations on two streaming architectural paradigms, the Lambda and Kappa architectures.

The Lambda architecture combines batch and streaming data in a common query front-end. It was initially envisioned with Hadoop and Storm in mind. Spark has its own batch and streaming paradigms, and it offers a single environment with a common codebase to effectively bring this architectural paradigm to life.

The Kappa architecture promulgates the concept of the unified log, which creates an event-oriented architecture where all events in the enterprise are channeled into a centralized commit log that is available to all consuming systems in real time.

We are now ready for the visualization of the data collected and processed so far.


Chapter 6. Visualizing Insights and Trends

So far, we have focused on the collection, analysis, and processing of data from Twitter. We have set the stage to use our data for visual rendering and for extracting insights and trends. We will give a quick lay of the land on visualization tools in the Python ecosystem. We will highlight Bokeh as a powerful tool for rendering and viewing large datasets. Bokeh is part of the Python Anaconda Distribution ecosystem.

In this chapter, we will cover the following points:

Gauging the keywords and memes within a social network community using charts and word clouds
Mapping the most active locations where communities are growing around certain themes or topics


Revisiting the data-intensive apps architecture

We have reached the final layer of the data-intensive apps architecture: the engagement layer. This layer focuses on how to synthesize, emphasize, and visualize the key context-relevant information for the data consumers. A bunch of numbers in a console will not suffice to engage with end users. It is critical to present the mass of information in a rapid, digestible, and attractive fashion.

The following diagram sets the context of the chapter's focus, highlighting the engagement layer.

For Python plotting and visualizations, we have quite a few tools and libraries. The most interesting and relevant ones for our purpose are the following:

Matplotlib is the grandfather of the Python plotting libraries. Matplotlib was originally the brainchild of John Hunter, who was an open source software proponent and established Matplotlib as one of the most prevalent plotting libraries both in the academic and the data scientific communities. Matplotlib allows the generation of plots, histograms, power spectra, bar charts, error charts, scatter plots, and so on. Examples can be found on the dedicated Matplotlib website at http://matplotlib.org/examples/index.html.

Seaborn, developed by Michael Waskom, is a great library to quickly visualize statistical information. It is built on top of Matplotlib and integrates seamlessly with Pandas and the Python data stack, including Numpy. A gallery of graphs from Seaborn at http://stanford.edu/~mwaskom/software/seaborn/examples/index.html shows the potential of the library.

ggplot is relatively new and aims to offer the equivalent of the famous ggplot2 from the R ecosystem for Python data wranglers. It has the same look and feel as ggplot2 and uses the same grammar of graphics as expounded by Hadley Wickham. The Python port of ggplot is developed by the team at yhat. More information can be found at http://ggplot.yhathq.com.

D3.js is a very popular JavaScript library developed by Mike Bostock. D3 stands for Data-Driven Documents and brings data to life on any modern browser leveraging HTML, SVG, and CSS. It delivers dynamic, powerful, interactive visualizations by manipulating the DOM, the Document Object Model. The Python community could not wait to integrate D3 with Matplotlib. Under the impulse of Jake Vanderplas, mpld3 was created with the aim of bringing Matplotlib to the browser. Example graphics are hosted at the following address: http://mpld3.github.io/index.html.

Bokeh aims to deliver high-performance interactivity over very large or streaming datasets, whilst leveraging lots of the concepts of D3.js without the burden of writing intimidating JavaScript and CSS code. Bokeh delivers dynamic visualizations on the browser with or without a server. It integrates seamlessly with Matplotlib, Seaborn, and ggplot, and renders beautifully in IPython or Jupyter notebooks. Bokeh is actively developed by the team at Continuum.io and is an integral part of the Anaconda Python data stack.

Bokeh server provides a full-fledged, dynamic plotting engine that materializes a reactive scene graph from JSON. It uses websockets to keep state and updates the HTML5 canvas using Backbone.js and Coffee-script under the hood. As Bokeh is fueled by data in JSON, it creates easy bindings for other languages such as R, Scala, and Julia.

This gives a high-level overview of the main plotting and visualization libraries. It is not exhaustive. Let's move to concrete examples of visualizations.


Preprocessing the data for visualization

Before jumping into the visualizations, we will do some preparatory work on the data harvested:

In [16]:
# Read harvested data stored in csv in a Panda DF
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')

In [20]:
print('tweets pandas dataframe - count:', pddf_in.count())
print('tweets pandas dataframe - shape:', pddf_in.shape)
print('tweets pandas dataframe - colns:', pddf_in.columns)

('tweets pandas dataframe - count:', Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64)
('tweets pandas dataframe - shape:', (7540, 6))
('tweets pandas dataframe - colns:', Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object'))

For the purpose of our visualization activity, we will use a dataset of 7,540 tweets. The key information is stored in the tweet_text column. We preview the data stored in the dataframe by calling the head() function on the dataframe:

In [21]:
pddf_in.head()
Out[21]:
   Unnamed: 0                  id                      created_at     user_id      user_name                                         tweet_text
0           0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1           1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2           2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3           3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4           4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...

We will now create some utility functions to clean up the tweet text and parse the Twitter date. First, we import the Python regular expression library re and the time library to parse dates and times:

In [72]:
import re
import time


We create a dictionary of regexes that will be compiled and then used by the parsing functions:

RT: The first regex with key RT looks for the keyword RT at the beginning of the tweet text:
re.compile(r'^RT'),
ALNUM: The second regex with key ALNUM looks for words including alphanumeric characters and the underscore sign, preceded by the @ symbol, in the tweet text:
re.compile(r'(@[a-zA-Z0-9_]+)'),
HASHTAG: The third regex with key HASHTAG looks for words including alphanumeric characters, preceded by the # symbol, in the tweet text:
re.compile(r'(#[\w\d]+)'),
SPACES: The fourth regex with key SPACES looks for blank or line space characters in the tweet text:
re.compile(r'\s+'),
URL: The fifth regex with key URL looks for URL addresses including alphanumeric characters preceded by the https:// or http:// markers in the tweet text:
re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')

In [24]:
regexp = {"RT": "^RT", "ALNUM": r"(@[a-zA-Z0-9_]+)",
          "HASHTAG": r"(#[\w\d]+)",
          "URL": r"([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)",
          "SPACES": r"\s+"}
regexp = dict((key, re.compile(value)) for key, value in regexp.items())

In [25]:
regexp
Out[25]:
{'ALNUM': re.compile(r'(@[a-zA-Z0-9_]+)'),
 'HASHTAG': re.compile(r'(#[\w\d]+)'),
 'RT': re.compile(r'^RT'),
 'SPACES': re.compile(r'\s+'),
 'URL': re.compile(r'([https://|http://]?[a-zA-Z\d\/]+[\.]+[a-zA-Z\d\/\.]+)')}

We create a utility function to identify whether a tweet is a retweet or an original tweet:

In [77]:
def getAttributeRT(tweet):
    """see if tweet is a RT"""
    return re.search(regexp["RT"], tweet.strip()) != None

Then, we extract all user handles in a tweet:

def getUserHandles(tweet):
    """given a tweet we try and extract all user handles"""
    return re.findall(regexp["ALNUM"], tweet)

We also extract all hashtags in a tweet:


def getHashtags(tweet):
    """return all hashtags"""
    return re.findall(regexp["HASHTAG"], tweet)

Extract all URL links in a tweet as follows:

def getURLs(tweet):
    """URL: [http://]?[\w\.?/]+"""
    return re.findall(regexp["URL"], tweet)

We strip all URL links and all user handles preceded by the @ sign from the tweet text. This function will be the basis of the word cloud we will build soon:

def getTextNoURLsUsers(tweet):
    """return parsed text terms stripped of URLs and UserNames in tweet text
    ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z\t])|(\w+:\/\/\S+)", " ", x).split())"""
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z\t])|(\w+:\/\/\S+)|(RT)", " ", tweet).lower().split())
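For instance, applied to a made-up sample tweet (illustrative input only, not from the harvested dataset), the function keeps only the lower-cased textual terms:

sample = 'RT @ApacheSpark: Spark 1.5 is out! http://spark.apache.org #BigData'
print(getTextNoURLsUsers(sample))
# roughly: 'spark 1 5 is out bigdata'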

We label the data so we can create groups of datasets for the word cloud:

def setTag(tweet):
    """set tags to tweet_text based on search terms from tags_list"""
    tags_list = ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']
    lower_text = tweet.lower()
    return filter(lambda x: x.lower() in lower_text, tags_list)

We parse the Twitter date into the yyyy-mm-dd hh:mm:ss format:

def decode_date(s):
    """parse Twitter date into format yyyy-mm-dd hh:mm:ss"""
    return time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(s, '%a %b %d %H:%M:%S +0000 %Y'))
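For example, applied to the created_at format seen in the dataset above:

print(decode_date('Tue Sep 01 21:46:57 +0000 2015'))
# '2015-09-01 21:46:57'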

We preview the data prior to processing:

In [43]:
pddf_in.columns
Out[43]:
Index([u'Unnamed: 0', u'id', u'created_at', u'user_id', u'user_name', u'tweet_text'], dtype='object')

In [45]:
# df.drop([Column Name or list], inplace=True, axis=1)
pddf_in.drop(['Unnamed: 0'], inplace=True, axis=1)

In [46]:
pddf_in.head()
Out[46]:
                   id                      created_at     user_id      user_name                                         tweet_text
0  638830426971181057  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: dreamint...
1  638830426727911424  Tue Sep 01 21:46:57 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
2  638830425402556417  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: 9_A_6: ernestsg...
3  638830424563716097  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: BeyHiveInFrance: PhuketDailyNews...
4  638830422256816132  Tue Sep 01 21:46:56 +0000 2015  3276255125  True Equality  ernestsgantt: elsahel12: 9_A_6: dreamintention...

We create new dataframe columns by applying the utility functions described above: a new column each for the hashtags, the user handles, the URLs, the text terms stripped of URLs and unwanted characters, and the labels. We finally parse the date:

In [82]:
pddf_in['htag'] = pddf_in.tweet_text.apply(getHashtags)
pddf_in['user_handles'] = pddf_in.tweet_text.apply(getUserHandles)
pddf_in['urls'] = pddf_in.tweet_text.apply(getURLs)
pddf_in['txt_terms'] = pddf_in.tweet_text.apply(getTextNoURLsUsers)
pddf_in['search_grp'] = pddf_in.tweet_text.apply(setTag)
pddf_in['date'] = pddf_in.created_at.apply(decode_date)

The following code gives a quick snapshot of the newly generated dataframe (columns: id, created_at, user_id, user_name, tweet_text, htag, urls, ptxt, tgrp, date, user_handles, txt_terms, search_grp):

In [83]:
pddf_in[2200:2210]
Out[83]:
2200  638242693374681088  Mon Aug 31 06:51:30 +0000 2015  19525954    CENATIC            El impacto de @ApacheSpark en el procesamiento...  [#sparkSpecial]  [://t.co/4PQmJNuEJB]  el impacto de en el procesamiento de datos y e...  [spark]  2015-08-31 06:51:30  [@ApacheSpark]  el impacto de en el procesamiento de datos y e...  [spark]
2201  638238014695575552  Mon Aug 31 06:32:55 +0000 2015  51115854    Nawfal             Real Time Streaming with Apache Spark http://...  [#IoT, #SmartMelboune, #BigData, #Apachespark]  [://t.co/GW5PaqwVab]  real time streaming with apache spark iot smar...  [spark]  2015-08-31 06:32:55  []  real time streaming with apache spark iot smar...  [spark]
2202  638236084124516352  Mon Aug 31 06:25:14 +0000 2015  62885987    Mithun Katti       RT @differentsachin: Spark the flame of digita...  [#IBMHackathon, #SparkHackathon, #ISLconnectIN...  []  spark the flame of digital india ibm hackathon...  [spark]  2015-08-31 06:25:14  [@differentsachin, @ApacheSpark]  spark the flame of digital india ibm hackathon...  [spark]
2203  638234734649176064  Mon Aug 31 06:19:53 +0000 2015  140462395   solaimurugan v     Installing @ApacheMahout with @ApacheSpark 1.4...  []  [1.4.1, ://t.co/3c5dGbfaZe.]  installing with 141 got many more issue whil...  [spark]  2015-08-31 06:19:53  [@ApacheMahout, @ApacheSpark]  installing with 141 got many more issue whil...  [spark]
2204  638233517307072512  Mon Aug 31 06:15:02 +0000 2015  2428473836  Ralf Heineke       RT @RomeoKienzler: Join me @velocityconf on #m...  [#machinelearning, #devOps, #Bl]  [://t.co/U5xL7pYEmF]  join me on machine learning based devops operat...  [spark]  2015-08-31 06:15:02  [@RomeoKienzler, @velocityconf, @ApacheSpark]  join me on machine learning based devops operat...  [spark]
2205  638230184848687106  Mon Aug 31 06:01:48 +0000 2015  289355748   Akim Boyko         RT @databricks: Watch live today at 10am PT is...  []  [1.5, ://t.co/16cix6ASti]  watch live today at 10am pt is 15 presented b...  [spark]  2015-08-31 06:01:48  [@databricks, @ApacheSpark, @databricks, @pwen...  watch live today at 10am pt is 15 presented b...  [spark]
2206  638227830443110400  Mon Aug 31 05:52:27 +0000 2015  145001241   sachin aggarwal    Spark the flame of digital India @ #IBMHackath...  [#IBMHackathon, #SparkHackathon, #ISLconnectIN...  [://t.co/C1AO3uNexe]  spark the flame of digital india ibm hackathon...  [spark]  2015-08-31 05:52:27  [@ApacheSpark]  spark the flame of digital india ibm hackathon...  [spark]
2207  638227031268810752  Mon Aug 31 05:49:16 +0000 2015  145001241   sachin aggarwal    RT @pravin_gadakh: Imagine, innovate and Igni...  [#IBMHackathon, #ISLconnectIN2015]  []  gadakh imagine innovate and ignite digital ind...  [spark]  2015-08-31 05:49:16  [@pravin_gadakh, @ApacheSpark]  gadakh imagine innovate and ignite digital ind...  [spark]
2208  638224591920336896  Mon Aug 31 05:39:35 +0000 2015  494725634   IBM Asia Pacific   RT @sachinparmar: Passionate about Spark?? Hav...  [#IBMHackathon, #ISLconnectIN]  [India..]  passionate about spark have dreams of clean sa...  [spark]  2015-08-31 05:39:35  [@sachinparmar]  passionate about spark have dreams of clean sa...  [spark]
2209  638223327467692032  Mon Aug 31 05:34:33 +0000 2015  3158070968  Open Source India  "Game Changer" #ApacheSpark speeds up #bigdata...  [#ApacheSpark, #bigdata]  [://t.co/ieTQ9ocMim]  game changer apache spark speeds up big data pro...  [spark]  2015-08-31 05:34:33  []  game changer apache spark speeds up big data pro...  [spark]

We save the processed information in CSV format. We have 7,540 records and 13 columns. In your case, the output will vary according to the dataset you chose:

In [84]:
f_name = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweets_processed.csv'
pddf_in.to_csv(f_name, sep=';', encoding='utf-8', index=False)

In [85]:
pddf_in.shape
Out[85]:
(7540, 13)


Gauging words, moods, and memes at a glance

We are now ready to proceed with building the word clouds, which will give us a sense of the important words carried in those tweets. We will create word clouds for the datasets harvested. Word clouds extract the top words in a list of words and create a scatter plot of the words, where the size of a word is correlated with its frequency. The more frequent a word is in the dataset, the bigger the font size in the word cloud rendering. The datasets cover three very different themes and two competing or analogous entities per theme. Our first theme is obviously data processing and analytics, with Apache Spark and Python as our entities. Our second theme is the 2016 presidential election campaign, with the two contenders: Hilary Clinton and Donald Trump. Our last theme is the world of pop music, with Justin Bieber and Lady Gaga as the two exponents.


Setting up wordcloud

We will illustrate the programming steps by analyzing the Spark-related tweets. We load the data and preview the dataframe:

In [21]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets.csv'
tspark_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')

In [3]:
tspark_df.head(3)
Out[3]:
(columns: id, created_at, user_id, user_name, tweet_text, htag, urls, ptxt, tgrp, date, user_handles, txt_terms, search_grp)
0  638818911773856000  Tue Sep 01 21:01:11 +0000 2015  2511247075  Noor Din      RT @kdnuggets: R leads RapidMiner, Python catc...  [#KDN]  [://t.co/3bsaTT7eUs]  r leads rapidminer python catches up big data...  [spark, python]  2015-09-01 21:01:11  [@kdnuggets]  r leads rapidminer python catches up big data...  [spark, python]
1  622142176768737000  Fri Jul 17 20:33:48 +0000 2015  24537879    IBM Cloudant  Be one of the first to sign-up for IBM Analyti...  [#ApacheSpark, #SparkInsight]  [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]  be one of the first to sign up for ibm analyti...  [spark]  2015-07-17 20:33:48  []  be one of the first to sign up for ibm analyti...  [spark]
2  622140453069169000  Fri Jul 17 20:26:57 +0000 2015  515145898   Arno Candel   Nice article on #apachespark, #hadoop and #dat...  [#apachespark, #hadoop, #datascience]  [://t.co/IyF44pV0f3]  nice article on apache spark hadoop and data sci...  [spark]  2015-07-17 20:26:57  [@h2oai]  nice article on apache spark hadoop and data sci...  [spark]

Note

The word cloud library we will use is the one developed by Andreas Mueller and hosted on his GitHub account at https://github.com/amueller/word_cloud.

The library requires PIL (short for Python Imaging Library). PIL is easily installable by invoking conda install pil. PIL is a complex library to install and has not yet been ported to Python 3.4, so we need to run a Python 2.7+ environment to be able to see our word cloud:

#
# Install PIL (does not work with Python 3.4)
#
an@an-VB:~$ conda install pil

Fetching package metadata: ....
Solving package specifications: ..................
Package plan for installation in environment /home/an/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libpng-1.6.17              |                0         214 KB
    freetype-2.5.5             |                0         2.2 MB
    conda-env-2.4.4            |           py27_0          24 KB
    pil-1.1.7                  |           py27_2         650 KB
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following packages will be UPDATED:

    conda-env: 2.4.2-py27_0 --> 2.4.4-py27_0
    freetype:  2.5.2-0      --> 2.5.5-0
    libpng:    1.5.13-1     --> 1.6.17-0
    pil:       1.1.7-py27_1 --> 1.1.7-py27_2

Proceed ([y]/n)? y

Next, we install the word cloud library:

#
# Install wordcloud
# Andreas Mueller
# https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py
#
an@an-VB:~$ pip install wordcloud
Collecting wordcloud
  Downloading wordcloud-1.1.3.tar.gz (163kB)
    100% |████████████████████████████████| 163kB 548kB/s
Building wheels for collected packages: wordcloud
  Running setup.py bdist_wheel for wordcloud
  Stored in directory: /home/an/.cache/pip/wheels/32/a9/74/58e379e5dc614bfd9dd9832d67608faac9b2bc6c194d6f6df5
Successfully built wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.1.3


Creating wordclouds

At this stage, we are ready to invoke the word cloud program with the generated list of terms from the tweet text.

Let's get started with the word cloud program by first calling %matplotlib inline to display the word cloud in our notebook:

In [4]:
%matplotlib inline

We convert the dataframe txt_terms column into a list of words. We make sure it is all converted into the str type to avoid any bad surprises and check the list's first four records:

In [11]:
len(tspark_df['txt_terms'].tolist())
Out[11]:
2024

In [22]:
tspark_ls_str = [str(t) for t in tspark_df['txt_terms'].tolist()]

In [14]:
len(tspark_ls_str)
Out[14]:
2024

In [15]:
tspark_ls_str[:4]
Out[15]:
['r leads rapidminer python catches up big data tools grow spark ignites kdn',
 'be one of the first to sign up for ibm analytics for apache spark today sparkinsight',
 'nice article on apache spark hadoop and data science',
 'spark 101 running spark and mapreduce together in production hadoopsummit2015 apachespark altiscale']

We first call the Matplotlib and the word cloud libraries:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

From the input list of terms, we create a unified string of terms separated by whitespace as the input to the word cloud program. The word cloud program removes stopwords:

# join tweets to a single string
words = ' '.join(tspark_ls_str)

# create wordcloud
wordcloud = WordCloud(
    # remove stopwords
    stopwords=STOPWORDS,
    background_color='black',
    width=1800,
    height=1400
).generate(words)


# render wordcloud image
plt.imshow(wordcloud)
plt.axis('off')

# save wordcloud image on disk
plt.savefig('./spark_tweets_wordcloud_1.png', dpi=300)

# display image in Jupyter notebook
plt.show()

Here, we can visualize the word clouds for Apache Spark and Python. Clearly, in the case of Spark, Hadoop, big data, and analytics are the memes, while Python recalls the root of its name, Monty Python, with a strong focus on developer, apache spark, and programming, with some hints to java and ruby.

We can also get a glimpse, in the following word clouds, of the words preoccupying the North American 2016 presidential election candidates: Hilary Clinton and Donald Trump. Seemingly, Hilary Clinton is overshadowed by the presence of her opponents Donald Trump and Bernie Sanders, while Trump is heavily centered only on himself:

Interestingly, in the case of Justin Bieber and Lady Gaga, the word love appears. In the case of Bieber, follow and belieber are keywords, while diet, weight loss, and fashion are the preoccupations of the Lady Gaga crowd.
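The word clouds for the other themes are produced with exactly the same steps. One possible way to loop over all six labels is sketched below; it assumes the processed CSV with the search_grp column saved earlier in this chapter, and the output file names are made up:

# Sketch only: render one word cloud per theme from the processed dataset.
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweets_processed.csv'
pddf = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')

for theme in ['spark', 'python', 'clinton', 'trump', 'gaga', 'bieber']:
    # keep only the tweets labelled with the current theme
    mask = pddf['search_grp'].astype(str).str.contains(theme)
    words = ' '.join(str(t) for t in pddf.loc[mask, 'txt_terms'])
    wc = WordCloud(stopwords=STOPWORDS, background_color='black',
                   width=1800, height=1400).generate(words)
    plt.figure()
    plt.imshow(wc)
    plt.axis('off')
    plt.savefig('./%s_tweets_wordcloud.png' % theme, dpi=300)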


Geo-locating tweets and mapping meetups

Now, we will dive into the creation of interactive maps with Bokeh. First, we create a world map where we geo-locate sample tweets; on moving the mouse over these locations, we can see the users and their respective tweets in a hover box.

The second map is focused on mapping upcoming meetups in London. It could be an interactive map that would act as a reminder of the date, time, and location of upcoming meetups in a specific city.


Geo-locating tweets

The objective is to create a world map scatter plot of the locations of important tweets on the map; the tweets and their authors are revealed on hovering over these points. We will go through three steps to build this interactive visualization:

1. Create the background world map by first loading a dictionary of all the world country boundaries defined by their respective longitudes and latitudes.
2. Load the important tweets we wish to geo-locate with their respective coordinates and authors.
3. Finally, scatter plot the tweets' coordinates on the world map and activate the hover tool to visualize interactively the tweets and authors on the highlighted dots on the map.

In step one, we create a Python dictionary called data that will contain all the world countries' boundaries with their respective latitudes and longitudes:

In [4]:
#
# This module exposes geometry data for World Country Boundaries.
#
import csv
import codecs
import gzip
import xml.etree.cElementTree as et
import os
from os.path import dirname, join

nan = float('NaN')
__file__ = os.getcwd()

data = {}
with gzip.open(join(dirname(__file__), 'AN_Spark/data/World_Country_Boundaries.csv.gz')) as f:
    decoded = codecs.iterdecode(f, "utf-8")
    next(decoded)
    reader = csv.reader(decoded, delimiter=',', quotechar='"')
    for row in reader:
        geometry, code, name = row
        xml = et.fromstring(geometry)
        lats = []
        lons = []
        for i, poly in enumerate(xml.findall('.//outerBoundaryIs/LinearRing/coordinates')):
            if i > 0:
                lats.append(nan)
                lons.append(nan)
            coords = (c.split(',')[:2] for c in poly.text.split())
            lat, lon = list(zip(*[(float(lat), float(lon)) for lon, lat in coords]))
            lats.extend(lat)
            lons.extend(lon)
        data[code] = {
            'name': name,
            'lats': lats,
            'lons': lons,
        }

In [5]:
len(data)
Out[5]:
235

In step two, we load a sample set of important tweets that we wish to visualize with their respective geo-location information:

In [69]:
# data
#
#
In [8]:
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/spark_tweets_20.csv'
t20_df = pd.read_csv(csv_in, index_col=None, header=0, sep=',', encoding='utf-8')

In [9]:
t20_df.head(3)
Out[9]:
(columns: id, created_at, user_id, user_name, tweet_text, htag, urls, ptxt, tgrp, date, user_handles, txt_terms, search_grp, lat, lon)
0  638818911773856000  Tue Sep 01 21:01:11 +0000 2015  2511247075  Noor Din      RT @kdnuggets: R leads RapidMiner, Python catc...  [#KDN]  [://t.co/3bsaTT7eUs]  r leads rapidminer python catches up big data...  [spark, python]  2015-09-01 21:01:11  [@kdnuggets]  r leads rapidminer python catches up big data...  [spark, python]  37.279518  -121.867905
1  622142176768737000  Fri Jul 17 20:33:48 +0000 2015  24537879    IBM Cloudant  Be one of the first to sign-up for IBM Analyti...  [#ApacheSpark, #SparkInsight]  [://t.co/C5TZpetVA6, ://t.co/R1L29DePaQ]  be one of the first to sign up for ibm analyti...  [spark]  2015-07-17 20:33:48  []  be one of the first to sign up for ibm analyti...  [spark]  37.774930  -122.419420
2  622140453069169000  Fri Jul 17 20:26:57 +0000 2015  515145898   Arno Candel   Nice article on #apachespark, #hadoop and #dat...  [#apachespark, #hadoop, #datascience]  [://t.co/IyF44pV0f3]  nice article on apache spark hadoop and data sci...  [spark]  2015-07-17 20:26:57  [@h2oai]  nice article on apache spark hadoop and data sci...  [spark]  51.500130  -0.126305

In [98]:
len(t20_df.user_id.unique())
Out[98]:
19

In [17]:
t20_geo = t20_df[['date', 'lat', 'lon', 'user_name', 'tweet_text']]

In [24]:
#
t20_geo.rename(columns={'user_name': 'user', 'tweet_text': 'text'}, inplace=True)


In [25]:
t20_geo.head(4)
Out[25]:
                  date        lat         lon                 user                                               text
0  2015-09-01 21:01:11  37.279518 -121.867905             Noor Din  RT @kdnuggets: R leads RapidMiner, Python catc...
1  2015-07-17 20:33:48  37.774930 -122.419420         IBM Cloudant  Be one of the first to sign-up for IBM Analyti...
2  2015-07-17 20:26:57  51.500130   -0.126305          Arno Candel  Nice article on #apachespark, #hadoop and #dat...
3  2015-07-17 19:35:31  51.500130   -0.126305  Ira Michael Blonder  Spark 101: Running Spark and #MapReduce togeth...

In [22]:
df = t20_geo
#

In step three, we first import all the necessary Bokeh libraries. We will instantiate the output in the Jupyter Notebook. We get the world countries' boundary information loaded. We get the geo-located tweet data. We instantiate the Bokeh interactive tools such as the wheel and box zoom as well as the hover tool.

In [29]:
#
# Bokeh Visualization of tweets on world map
#
from bokeh.plotting import *
from bokeh.models import HoverTool, ColumnDataSource
from collections import OrderedDict

# Output in Jupyter Notebook
output_notebook()

# Get the world map
world_countries = data.copy()

# Get the tweet data
tweets_source = ColumnDataSource(df)

# Create world map
countries_source = ColumnDataSource(data=dict(
    countries_xs=[world_countries[code]['lons'] for code in world_countries],
    countries_ys=[world_countries[code]['lats'] for code in world_countries],
    country=[world_countries[code]['name'] for code in world_countries],
))

# Instantiate the bokeh interactive tools
TOOLS = "pan,wheel_zoom,box_zoom,reset,resize,hover,save"

We are now ready to layer the various elements gathered into an object figure called p. Define the title, width, and height of p. Attach the tools. Create the world map background with patches, using a light background color and borders. Scatter plot the tweets according to their respective geo-coordinates. Then, activate the hover tool with the users and their respective tweets. Finally, render the picture on the browser. The code is as follows:


# Instantiate the figure object
p = figure(
    title="%s tweets" % (str(len(df.index))),
    title_text_font_size="20pt",
    plot_width=1000,
    plot_height=600,
    tools=TOOLS)

# Create world patches background
p.patches(xs="countries_xs", ys="countries_ys", source=countries_source,
          fill_color="#F1EEF6", fill_alpha=0.3,
          line_color="#999999", line_width=0.5)

# Scatter plots by longitude and latitude
p.scatter(x="lon", y="lat", source=tweets_source, fill_color="#FF0000",
          line_color="#FF0000")

# Activate hover tool with user and corresponding tweet information
hover = p.select(dict(type=HoverTool))
hover.point_policy = "follow_mouse"
hover.tooltips = OrderedDict([
    ("user", "@user"),
    ("tweet", "@text"),
])

# Render the figure on the browser
show(p)

BokehJS successfully loaded.
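As an optional variation (not part of the book's notebook flow), the same figure can also be written out as a standalone HTML page with Bokeh's file output functions; the file name below is illustrative:

# Optional: persist the interactive map as a standalone HTML page
from bokeh.plotting import output_file, save
output_file('tweets_world_map.html', title='Tweets world map')
save(p)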


The following view gives an overview of the world map, with the red dots representing the locations of the tweets' origins:


We can hover over a specific dot to reveal the tweets at that location:


We can zoom into a specific location:

Finally, we can reveal the tweets in the given zoomed-in location:


Displaying upcoming meetups on Google Maps

Now, our objective is to focus on upcoming meetups in London. We are mapping three meetups: Data Science London, Apache Spark, and Machine Learning. We embed a Google Map within a Bokeh visualization, geo-locate the three meetups according to their coordinates, and get information such as the name of the upcoming event for each meetup with a hover tool.

First, import all the necessary Bokeh libraries:

In []:
#
# Bokeh Google Map Visualization of London with hover on specific points
#
#
from __future__ import print_function
from bokeh.browserlib import view
from bokeh.document import Document
from bokeh.embed import file_html
from bokeh.models.glyphs import Circle
from bokeh.models import (
    GMapPlot, Range1d, ColumnDataSource,
    PanTool, WheelZoomTool, BoxSelectTool,
    HoverTool, ResetTool,
    BoxSelectionOverlay, GMapOptions)
from bokeh.resources import INLINE

x_range = Range1d()
y_range = Range1d()

We will instantiate the Google Map that will act as the substrate upon which our Bokeh visualization will be layered:

# JSON style string taken from: https://snazzymaps.com/style/1/pale-dawn
map_options = GMapOptions(lat=51.50013, lng=-0.126305, map_type="roadmap", zoom=13, styles="""
[{"featureType":"administrative","elementType":"all","stylers":[{"visibility":"on"},{"lightness":33}]},
 {"featureType":"landscape","elementType":"all","stylers":[{"color":"#f2e5d4"}]},
 {"featureType":"poi.park","elementType":"geometry","stylers":[{"color":"#c5dac6"}]},
 {"featureType":"poi.park","elementType":"labels","stylers":[{"visibility":"on"},{"lightness":20}]},
 {"featureType":"road","elementType":"all","stylers":[{"lightness":20}]},
 {"featureType":"road.highway","elementType":"geometry","stylers":[{"color":"#c5c6c6"}]},
 {"featureType":"road.arterial","elementType":"geometry","stylers":[{"color":"#e4d7c6"}]},
 {"featureType":"road.local","elementType":"geometry","stylers":[{"color":"#fbfaf7"}]},
 {"featureType":"water","elementType":"all","stylers":[{"visibility":"on"},{"color":"#acbcc9"}]}]
""")

Instantiate the Bokeh object plot from the class GMapPlot with the dimensions and map options from the previous step:

# Instantiate Google Map Plot
plot = GMapPlot(
    x_range=x_range, y_range=y_range,
    map_options=map_options,
    title="London Meetups"
)

Bring in the information from the three meetups we wish to plot, including the details to be revealed when hovering above the respective coordinates:

source = ColumnDataSource(
    data=dict(
        lat=[51.49013, 51.50013, 51.51013],
        lon=[-0.130305, -0.126305, -0.120305],
        fill=['orange', 'blue', 'green'],
        name=['London Data Science', 'Spark', 'Machine Learning'],
        text=['Graph Data & Algorithms', 'Spark Internals', 'Deep Learning on Spark']
    )
)

Define the dots to be drawn on the Google Map:

circle = Circle(x="lon", y="lat", size=15, fill_color="fill", line_color=None)
plot.add_glyph(source, circle)

Define the strings for the Bokeh tools to be used in this visualization:

# TOOLS = "pan,wheel_zoom,box_zoom,reset,hover,save"
pan = PanTool()
wheel_zoom = WheelZoomTool()
box_select = BoxSelectTool()
reset = ResetTool()
hover = HoverTool()
# save = SaveTool()

plot.add_tools(pan, wheel_zoom, box_select, reset, hover)
overlay = BoxSelectionOverlay(tool=box_select)
plot.add_layout(overlay)

Activate the hover tool with the information that will be carried:

hover = plot.select(dict(type=HoverTool))
hover.point_policy = "follow_mouse"
hover.tooltips = OrderedDict([
    ("Name", "@name"),
    ("Text", "@text"),
    ("(Long, Lat)", "(@lon, @lat)"),
])

show(plot)


Render the plot, which gives a pretty good view of London:

Once we hover over a highlighted dot, we can get the information for the given meetup:


Full smooth zooming capability is preserved, as the following screenshot shows:


Summary

In this chapter, we focused on a few visualization techniques. We saw how to build word clouds and their intuitive power to reveal, at a glance, many of the keywords, moods, and memes carried through thousands of tweets.

We then discussed interactive mapping visualizations using Bokeh. We built a world map from the ground up and created a scatter plot of critical tweets. Once the map was rendered on the browser, we could interactively hover from dot to dot and reveal the tweets originating from different parts of the world.

Our final visualization was focused on mapping upcoming meetups in London on Spark, data science, and machine learning and their respective topics, making a beautiful interactive visualization with an actual Google Map.


IndexA

AmazonWebServices(AWS)apps,deployingwith/DeployingappsinAmazonWebServicesabout/DeployingappsinAmazonWebServices

Anacondadefining/UnderstandingAnaconda

AnacondaInstallerURL/InstallingAnacondawithPython2.7

AnacondastackAnaconda/UnderstandingAnacondaConda/UnderstandingAnacondaNumba/UnderstandingAnacondaBlaze/UnderstandingAnacondaBokeh/UnderstandingAnacondaWakari/UnderstandingAnaconda

analyticslayer/AnalyticslayerApacheKafka

about/SettingupKafkaproperties/SettingupKafka

ApacheSparkabout/DisplayingupcomingmeetupsonGoogleMaps

APIs(ApplicationProgrammingInterface)about/Connectingtosocialnetworks

apppreviewing/Previewingourapp

appsdeploying,withAmazonWebServices(AWS)/DeployingappsinAmazonWebServices

architecture,data-intensiveapplicationsabout/Understandingthearchitectureofdata-intensiveapplicationsinfrastructurelayer/Infrastructurelayerpersistencelayer/Persistencelayerintegrationlayer/Integrationlayeranalyticslayer/Analyticslayerengagementlayer/Engagementlayer

AsynchronousJavaScript(AJAX)about/ProcessinglivedatawithTCPsockets

AWSconsoleURL/DeployingappsinAmazonWebServices

Page 278: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

BBigData,withApacheSpark

references/VirtualizingtheenvironmentwithVagrantBlaze

used,forexploringdata/ExploringdatausingBlazeBSON(BinaryJSON)

about/SettingupMongoDB

Page 279: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

CCatalyst

about/ExploringdatausingSparkSQLChef

about/InfrastructurelayerClustering

K-Means/SupervisedandunsupervisedlearningGaussianMixture/SupervisedandunsupervisedlearningPowerIterationClustering(PIC)/SupervisedandunsupervisedlearningLatentDirichletAllocation(LDA)/Supervisedandunsupervisedlearning

Clustermanagerabout/TheResilientDistributedDataset

comma-separatedvalues(CSV)about/Harvestingandstoringdata

ContinuumURL/UnderstandingAnaconda

Couchbaseabout/Persistencelayer

Page 280: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

DD3.js

about/Revisitingthedata-intensiveappsarchitectureURL/Revisitingthedata-intensiveappsarchitecture

DAG(DirectedAcyclicGraph)about/TheResilientDistributedDataset,Serializinganddeserializingdata

dataserializing/Serializinganddeserializingdatadeserializing/Serializinganddeserializingdataharvesting/Harvestingandstoringdatastoring/Harvestingandstoringdatapersisting,inCSV/PersistingdatainCSVpersisting,inJSON/PersistingdatainJSONMongoDB,settingup/SettingupMongoDB,harvestingfromTwitter/HarvestingdatafromTwitterexploring,Blazeused/ExploringdatausingBlazetransferring,Odoused/TransferringdatausingOdoexploring,SparkSQLused/ExploringdatausingSparkSQLpre-processing,forvisualization/Preprocessingthedataforvisualization

data-intensiveappsarchitecting/Architectingdata-intensiveappslatency/Architectingdata-intensiveappsscalability/Architectingdata-intensiveappsfaulttolerance/Architectingdata-intensiveappsflexibility/Architectingdata-intensiveappsdataatrest,processing/Processingdataatrestdatainmotion,processing/Processingdatainmotiondata,exploring/Exploringdatainteractively

data-intensiveappsarchitectureabout/Revisitingthedata-intensiveappsarchitecture

dataanalysisdefining/AnalyzingthedataTweetsanatomy,discovering/Discoveringtheanatomyoftweets

DataDrivenDocuments(D3)about/Revisitingthedata-intensiveappsarchitecture

dataflowsabout/Machinelearningworkflowsanddataflows

dataintensiveappsarchitecturedefining/Revisitingthedata-intensiveapparchitecture

datalifecycleConnect/IntegrationlayerCorrect/IntegrationlayerCollect/Integrationlayer

Page 281: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

Compose/IntegrationlayerConsume/IntegrationlayerControl/Integrationlayer

DataScienceLondonabout/DisplayingupcomingmeetupsonGoogleMaps

datatypes,SparkMLliblocalvector/SparkMLlibdatatypeslabeledpoint/SparkMLlibdatatypeslocalmatrix/SparkMLlibdatatypesdistributedmatrix/SparkMLlibdatatypes

DecisionTreesabout/Supervisedandunsupervisedlearning

DimensionalityReductionSingularValueDecomposition(SVD)/SupervisedandunsupervisedlearningPrincipalComponentAnalysis(PCA)/Supervisedandunsupervisedlearning

Dockerabout/Infrastructurelayerenvironment,virtualizingwith/VirtualizingtheenvironmentwithDockerreferences/VirtualizingtheenvironmentwithDocker

DStream(DiscretizedStream)defining/GoingunderthehoodofSparkStreaming

Page 282: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

Eelements,Flume

Event/ExploringflumeClient/ExploringflumeSource/ExploringflumeSink/ExploringflumeChannel/Exploringflume

engagementlayer/EngagementlayerEnsemblesoftrees

about/Supervisedandunsupervisedlearningenvironment

virtualizing,withVagrant/VirtualizingtheenvironmentwithVagrantvirtualizing,withDocker/VirtualizingtheenvironmentwithDocker

Page 283: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

FFirstApp

building,withPySpark/BuildingourfirstappwithPySparkFlume

about/Exploringflumeadvantages/Exploringflumeelements/Exploringflume

Page 284: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

Gggplot

about/Revisitingthedata-intensiveappsarchitectureURL/Revisitingthedata-intensiveappsarchitecture

GitHubURL/GettingGitHubdataabout/ExploringtheGitHubworldoperating,withMeetupAPI/UnderstandingthecommunitythroughMeetup

GoogleMapsupcomingmeetups,displayingon/DisplayingupcomingmeetupsonGoogleMaps

Page 285: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

HHadoopMongoDBconnector

URL/QueryingMongoDBfromSparkSQLHbaseandCassandra

about/PersistencelayerHDFS(HadoopDistributedFileSystem)

about/UnderstandingSpark

Page 286: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

Iinfrastructurelayer/InfrastructurelayerIngestMode

BatchDataTransport/BuildingareliableandscalablestreamingappMicroBatch/BuildingareliableandscalablestreamingappPipelining/BuildingareliableandscalablestreamingappMessageQueue/Buildingareliableandscalablestreamingapp

integrationlayer/Integrationlayer

Page 287: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

JJava8

installing/InstallingJava8JRE(JavaRuntimeEnvironment)

about/InstallingJava8JSON(JavaScriptObjectNotation)

about/Connectingtosocialnetworks,Harvestingandstoringdata

Page 288: Spark for Python Developers · technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java,

K

Kafka
setting up / Setting up Kafka
installing / Installing and testing Kafka
testing / Installing and testing Kafka
URL / Installing and testing Kafka
producers, developing / Developing producers
consumers, developing / Developing consumers
Spark Streaming consumer, developing for / Developing a Spark Streaming consumer for Kafka

Kappa architecture
defining / Closing remarks on the Lambda and Kappa architecture, Understanding Kappa architecture


L

Lambda architecture
defining / Closing remarks on the Lambda and Kappa architecture, Understanding Lambda architecture

Linear Regression Models
about / Supervised and unsupervised learning


M

Machine Learning
about / Displaying upcoming meetups on Google Maps

machine learning pipelines
building / Building machine learning pipelines

machine learning workflows
about / Machine learning workflows and data flows

Massive Open Online Courses (MOOCs)
about / Virtualizing the environment with Vagrant

Matplotlib
about / Revisiting the data-intensive apps architecture
URL / Revisiting the data-intensive apps architecture

Meetup API
URL / Getting Meetup data

meetups
mapping / Geo-locating tweets and mapping meetups

MLlib algorithms
Collaborative filtering / Additional learning algorithms
feature extraction and transformation / Additional learning algorithms
optimization / Additional learning algorithms
Limited-memory BFGS (L-BFGS) / Additional learning algorithms

models
defining, for processing streams of data / Laying the foundations of streaming architecture

MongoDB
about / Persistence layer
setting up / Setting up MongoDB
server and client, installing / Installing the MongoDB server and client
server, running / Running the MongoDB server
Mongo client, running / Running the Mongo client
PyMongo driver, installing / Installing the PyMongo driver
Python client, creating for / Creating the Python client for MongoDB
references / Querying MongoDB from Spark SQL

MongoDB, from Spark SQL
URL / Querying MongoDB from Spark SQL

Multi-Dimensional Scaling (MDS) algorithm
about / Applying Scikit-Learn on the Twitter dataset

Mumrah, on GitHub
URL / Installing and testing Kafka

MySQL
about / Persistence layer


N

Naive Bayes
about / Supervised and unsupervised learning

Neo4j
about / Persistence layer

network_wordcount.py
URL / Processing live data


O

Odo
about / Transferring data using Odo
used, for transferring data / Transferring data using Odo

operations, on RDDs
transformations / The Resilient Distributed Dataset
action / The Resilient Distributed Dataset


P

persistence layer / Persistence layer

PIL (Python Imaging Library)
about / Setting up wordcloud

PostgreSQL
about / Persistence layer

Puppet
about / Infrastructure layer

PySpark
First App, building with / Building our first app with PySpark


R

RDD (Resilient Distributed Dataset)
about / The Resilient Distributed Dataset

Resilient Distributed Datasets (RDD)
about / Spark Streaming inner working

REST (Representational State Transfer)
about / Connecting to social networks

RPC (Remote Procedure Call)
about / Laying the foundations of streaming architecture


S

SDK (Software Development Kit)
about / Installing Java 8

Seaborn
about / Revisiting the data-intensive apps architecture
URL / Revisiting the data-intensive apps architecture

social networks
connecting to / Connecting to social networks
Twitter data, obtaining / Getting Twitter data
GitHub data, obtaining / Getting GitHub data
Meetup data, obtaining / Getting Meetup data

Spark
defining / Understanding Spark
Batch / Understanding Spark
Streaming / Understanding Spark
Iterative / Understanding Spark
Interactive / Understanding Spark
libraries / Spark libraries
URL / Installing Spark
Clustering / Supervised and unsupervised learning
Dimensionality Reduction / Supervised and unsupervised learning
Regression and Classification / Supervised and unsupervised learning
Isotonic Regression / Supervised and unsupervised learning
MLlib algorithms / Additional learning algorithms

Spark, on EC2
URL / Deploying apps in Amazon Web Services

SparkContext
about / Spark Streaming inner working

Spark dataframes
defining / Understanding Spark dataframes

Spark libraries
Spark SQL / Spark libraries
Spark MLlib / Spark libraries
Spark Streaming / Spark libraries
Spark GraphX / Spark libraries
PySpark, defining / PySpark in action
RDD (Resilient Distributed Dataset) / The Resilient Distributed Dataset

Spark MLlib
contextualizing, in app architecture / Contextualizing Spark MLlib in the app architecture
data types / Spark MLlib data types

Spark MLlib algorithms
classifying / Classifying Spark MLlib algorithms
supervised learning / Supervised and unsupervised learning
unsupervised learning / Supervised and unsupervised learning
additional learning algorithms / Additional learning algorithms

Spark Powered Environment
setting up / Setting up the Spark powered environment
Oracle VirtualBox, setting up with Ubuntu / Setting up an Oracle VirtualBox with Ubuntu
Anaconda, installing with Python 2.7 / Installing Anaconda with Python 2.7
Java 8, installing / Installing Java 8
Spark, installing / Installing Spark
IPython Notebook, enabling / Enabling IPython Notebook

Spark SQL
used, for exploring data / Exploring data using Spark SQL
about / Exploring data using Spark SQL
CSV files, loading with / Loading and processing CSV files with Spark SQL
CSV files, processing with / Loading and processing CSV files with Spark SQL
MongoDB, querying from / Querying MongoDB from Spark SQL

Spark SQL module
about / Analytics layer

Spark SQL query optimizer
defining / Understanding the Spark SQL query optimizer

Spark Streaming
defining / Spark Streaming inner working, Going under the hood of Spark Streaming
building, in fault tolerance / Building in fault tolerance

Stochastic Gradient Descent
about / Classifying Spark MLlib algorithms

streaming app
building / Building a reliable and scalable streaming app
Kafka, setting up / Setting up Kafka
Flume, exploring / Exploring flume
data pipelines, developing with Flume / Developing data pipelines with Flume, Kafka, and Spark
data pipelines, developing with Kafka / Developing data pipelines with Flume, Kafka, and Spark
data pipelines, developing with Spark / Developing data pipelines with Flume, Kafka, and Spark

streaming architecture
about / Laying the foundations of streaming architecture

StreamingContext
about / Spark Streaming inner working

supervised machine learning workflow
about / Supervised machine learning workflows


T

TCP Sockets
live data, processing with / Processing live data with TCP sockets, Processing live data
setting up / Setting up TCP sockets

TF-IDF (Term Frequency-Inverse Document Frequency)
about / Classifying Spark MLlib algorithms

Trident
about / Laying the foundations of streaming architecture

tweets
geo-locating / Geo-locating tweets and mapping meetups, Geo-locating tweets

Twitter
URL / Getting Twitter data

Twitter API, on dev console
URL / Getting Twitter data

Twitter data
manipulating / Manipulating Twitter data in real time
tweets, processing from Twitter firehose / Processing Tweets in real time from the Twitter firehose

Twitter dataset
clustering / Clustering the Twitter dataset
Scikit-Learn, applying on / Applying Scikit-Learn on the Twitter dataset
dataset, preprocessing / Preprocessing the dataset
clustering algorithm, running / Running the clustering algorithm
model and results, evaluating / Evaluating the model and the results


U

Ubuntu 14.04.1 LTS release
URL / Setting up an Oracle VirtualBox with Ubuntu

unified log
properties / Understanding Kappa architecture

Unified Log
properties / Building a reliable and scalable streaming app

unsupervised machine learning workflow
about / Unsupervised machine learning workflows


V

Vagrant
about / Infrastructure layer
environment, virtualizing with / Virtualizing the environment with Vagrant
reference / Virtualizing the environment with Vagrant

VirtualBox VM
URL / Setting up an Oracle VirtualBox with Ubuntu

visualization
data, pre-processing for / Preprocessing the data for visualization


W

wordclouds
creating / Gauging words, moods, and memes at a glance, Creating wordclouds
setting up / Setting up wordcloud
URL / Setting up wordcloud