David Taieb, STSM, IBM Cloud Data Services Developer Advocate, david_taieb@us.ibm.com
HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 1: Flight Delay Predict with Spark ML
PyCon 2016, Portland
© 2016 IBM Corporation
Agenda
• Prerequisite steps to be completed before the session
• Flight Predict app description and architecture
• Train the models in the Notebook
• Accuracy analysis and model refinement
• Deploy and run the models
Sign up for Bluemix
• Access the IBM Bluemix website at https://console.ng.bluemix.net
• Click Get Started for Free
• Complete the form and click Create account
• Look for the confirmation email and click the link to confirm your account
Sign up for a free trial at Flightstats.com
• Sign up at https://developer.flightstats.com/signup
• Fill out the form and monitor your email for the confirmation link (access to the APIs may take up to 24 hours)
• Once access is granted, go to https://developer.flightstats.com/admin/applications to view your appId and appKey (you will need them in the simple-data-pipe tool to create training sets)
• Optional: get familiar with the various FlightStats APIs:
– https://developer.flightstats.com/api-docs/scheduledFlights/v1
– https://developer.flightstats.com/api-docs/airports/v1
Where to find the FlightStats app id and app key
Screenshot callouts: App ID, App Key
Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix.
Create a Spark Instance
Optional: you can skip this step if you already have a space with a Spark instance that you would like to reuse.
Create New Spark Instance
Flight App Project Description
• Use case
– Flight delays are a common disturbance during business trips
– Being able to predict how likely a flight is to be delayed can remove uncertainty and enable users to plan around it
– Idea: weather data can be a good explanatory variable for building predictive models
• Implementation
– Combine flight statistics from flightstats.com (system of record) with weather data from IBM Insight for Weather (system of operations) to build training, test and blind sets
– Use Spark MLlib to train predictive models and cross-validate them
– Create a custom card for Google Now that will automatically notify the user of an impending flight delay
– Propose alternative flight routes (e.g. Freebird)
Get/Build/Analyze methodology
Flight Predict App Architecture
Diagram labels: Weather; Airports; Flight Schedules; Flight Status; Simple Data Pipes (custom connector, run every 24 hours); Metadata; Training Set; Test Set; Blind Set; Notebook
Flow Diagram
Diagram stages: Data Acquisition, Data Preparation, Data Annotation (Ground Truth), Model Training, Model Testing, Deploy and Run Model
Diagram labels: Cleansing, Shaping, Enrichment; Training Set, Test Set, Blind Set; Iterative: Train Model, Cross-Validation, Evaluate Performance and optimize model
• Iterative in nature: we are never done!
• We will be using this diagram as a roadmap throughout this course
Get the data and build the training/test/blind sets
In this step, we'll use the Simple Data Pipes open source project to acquire data from FlightStats, combine it with weather data from IBM Insight for Weather, and save the data sets into a Cloudant NoSQL database.
Acquiring the data
• In the next section, we show how to acquire the training data by using the simple-data-pipe tool and the flight predict connector.
• The flight predict connector combines historical flight data from flightstats.com with weather data from IBM Insight for Weather
• If you want to skip these steps, you can use the already built data set with the following credentials:
– cloudantHost: dtaieb.cloudant.com
– cloudantUserName: weenesserliffircedinvers
– cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5
Deploy simple-data-pipe with flightstats connector
• Go to https://github.com/ibm-cds-labs/simple-data-pipe
• Click the Deploy to Bluemix button
Clicking the button will take you to Bluemix
Complete simple-data-pipe deployment
Add an instance of IBM Weather Service on Bluemix
• Return to the application dashboard
• The Weather service is required by the flight predict connector and must be installed first
• From the app dashboard, click Add a service or API
Create an instance of IBM Weather Service on Bluemix
Search for Weather
Make sure to select the “premium plan” to have enough authorized API calls
Checkpoint: simple data pipe app dashboard
• Verify that your app is correctly bound to the right services
Weather service: used to enrich flight records with weather observations
Cloudant service: used to store the training, test and blind data sets
You'll need to click on this button for the step on the next page
It is recommended to increase the app memory to 1 GB
Install flight predict connector
• Click the Edit Code button, then edit package.json to add the flight predict module to the dependencies:
– "simple-data-pipe-connector-flightstats": "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"
Don't forget to add a comma at the end of the preceding line to keep the JSON valid
Save your changes
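A minimal sketch of the same package.json edit done programmatically, so the JSON stays valid without placing commas by hand. The dependency name and git URL come from the slide; the sample package.json content and the add_dependency helper are illustrative assumptions, not part of simple-data-pipe.

```python
import json

CONNECTOR_NAME = "simple-data-pipe-connector-flightstats"
CONNECTOR_URL = "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"


def add_dependency(package_json_text, name, source):
    """Return package.json text with `name` added to "dependencies" (hypothetical helper)."""
    package = json.loads(package_json_text)  # fails loudly if the input is not valid JSON
    package.setdefault("dependencies", {})[name] = source
    return json.dumps(package, indent=2)


# Hypothetical, shortened package.json used only for illustration
sample = '{"name": "simple-data-pipe", "dependencies": {"express": "^4.0.0"}}'
updated = add_dependency(sample, CONNECTOR_NAME, CONNECTOR_URL)
```

Because the text is parsed and re-serialized, a missing comma shows up as a parse error before the app is redeployed.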
Install flight predict connector
• Click File/Save to save your changes
Redeploy simple data pipe app
• Use the Live Edit editor to redeploy the app
Verify connector install
• In this step, we verify through the UI that the flight predict connector is correctly installed
Flight connector correctly installed
Create a new FlightStats pipe
• Follow each screen to create and configure a new pipe
Run the pipe
• Skip over the Schedule tab
• In the Activity tab, click Run Now to start the pipe
Click Run Now, then open the log to monitor the activity
Explore the data sets
• In this step, we take a moment to explore the different data sets that have been created by the simple data pipe tool
• From the Bluemix dashboard, click on the Cloudant service tile, then on the Launch button
• From the Cloudant dashboard, open the training database
• Open a document to look at the data structure
Run the pipe again to build the test set
Train the Models
• In the previous section we created the training data, and we are now ready to train the models.
• Steps in this section:
– Create an IPython Notebook
– Load the data sets from the Cloudant database into a Spark cluster
– Explore the data and train the machine learning models
Create a new IPython Notebook
Notebook tour
Notebook tour: Notebook Info
Notebook tour: Environment
Notebook tour: Sharing
Before we start building the app…
• You can optionally follow this tutorial from GitHub by using a fully built notebook:
– https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
Optional: use prebuilt notebook
• Create a notebook from URL
• Use https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
Using Python Packages
• Write code inline within cells
• Encapsulate helper APIs within a Python package
• 2 ways of using helper Python packages
– egg distribution package: pip install from a PyPI server or a file server (e.g. GitHub)
  • Persistent install across sessions
  • Recommended in production
– SparkContext.addPyFile
  • Easy addition of a Python module file
  • Supports multiple module files via zip format
  • Recommended during development, where frequent code changes occur
Flight Predict Python Package on Github
Setup script for installing the Python package
Flight Predict Python library
Method 1: Install Flight Predict Package
• Use pip to install the Flight Predict package
• Recommended alternative: build an egg distribution package and deploy it to PyPI
Manage Python packages
• Check status
• Uninstall package
Method 2: Install py modules via sc.addPyFile
• addPyFile installs individual py modules and makes them available to all executor processes
• Works with modules in zipped files
Module containing APIs for training the models
Module containing APIs for running the models
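A minimal sketch of the zipped-modules variant, assuming the helper modules live in local files. The zip_modules helper and the file names in the usage comment are hypothetical; SparkContext.addPyFile itself is the standard PySpark call and accepts a .py or .zip path.

```python
import os
import zipfile


def zip_modules(module_paths, zip_path):
    """Bundle several .py files at the root of one zip archive and return its path."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in module_paths:
            # Store each module at the archive root so it can be imported by its bare name
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


# In a notebook where `sc` is the SparkContext (file names are hypothetical):
# sc.addPyFile(zip_modules(["training.py", "run.py"], "flightpredict.zip"))
```

During development, re-running the two lines in the usage comment after each code change is usually enough to ship the new version to the executors, although modules already imported in a running session may need a kernel restart.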
Setup credentials and import required Python modules
In this step, we import the Python modules that will be needed throughout the notebook and set up credentials for the various services.
Credentials for the Cloudant NoSQL service
Credentials for the Weather service
Get Credentials for Cloudant
From the app dashboard, click on Environment Variables in the left sidebar
Get Credentials for Weather
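A minimal sketch of pulling credentials out of the JSON shown on the Environment Variables page, assuming it follows the usual Cloud Foundry VCAP_SERVICES layout (service label mapped to a list of instances, each with a "credentials" object). The service label, the field names, the sample values and the find_credentials helper are assumptions for illustration; check them against what your own dashboard shows.

```python
import json


def find_credentials(vcap_services_json, label):
    """Return the credentials of the first service instance bound under `label`."""
    services = json.loads(vcap_services_json)
    instances = services.get(label, [])
    if not instances:
        raise KeyError("no service bound with label %r" % label)
    return instances[0]["credentials"]


# Hypothetical VCAP_SERVICES content, pasted from the dashboard in real use
sample_vcap = json.dumps({
    "cloudantNoSQLDB": [{
        "name": "example-cloudant-instance",
        "credentials": {
            "host": "example.cloudant.com",
            "username": "example-user",
            "password": "example-password",
        },
    }]
})

cloudant = find_credentials(sample_vcap, "cloudantNoSQLDB")
```

The same helper can be pointed at the Weather service entry once you know the label it is listed under.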
Load training set in Spark SQL DataFrame
…
In this step, we use the cloudant-spark connector (https://github.com/cloudant-labs/spark-cloudant) to load data into Spark
Make sure to change the db name to match the one created for your training set by your activity (open the Cloudant dashboard to find the name)
Loading data: Behind the scene
Use the Spark SQL connector to load data into a DataFrame
Code callouts: connector id; options; cache data for optimized reuse; create temp SQL table
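The four callouts above can be sketched roughly as follows. This is not the notebook's actual code: the load_training_dataframe helper and its parameter names are assumptions, while the "com.cloudant.spark" format id and the cloudant.host / cloudant.username / cloudant.password options follow the spark-cloudant README. It targets the Spark 1.x SQLContext API, where registerTempTable is available.

```python
def load_training_dataframe(sqlContext, host, username, password, db_name, table_name):
    """Load a Cloudant database into a cached DataFrame and expose it as a temp SQL table."""
    df = (
        sqlContext.read.format("com.cloudant.spark")  # connector id
        .option("cloudant.host", host)                # options
        .option("cloudant.username", username)
        .option("cloudant.password", password)
        .load(db_name)
    )
    df.cache()                         # cache data for optimized reuse
    df.registerTempTable(table_name)   # create temp SQL table (Spark 1.x API)
    return df


# In a notebook (values are placeholders):
# trainingDF = load_training_dataframe(sqlContext, "<host>", "<user>", "<password>",
#                                      "<your training db name>", "training")
```

Caching matters here because the same DataFrame is reused for visualization and for training each model.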
Scatter plot visualization
Visualization API
Transform into an RDD of LabeledPoint
Use the Spark SQL connector to load data into a DataFrame
loadLabeledDataRDD API
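To illustrate the shape of the transformation (not the loadLabeledDataRDD implementation itself), here is a minimal pure-Python sketch that turns one record into a (label, features) pair. The to_labeled_pair helper and the feature field names are hypothetical; in Spark MLlib the pair would typically be wrapped as LabeledPoint(label, features) from pyspark.mllib.regression inside an rdd.map call.

```python
def to_labeled_pair(record, feature_names, label_field="classification"):
    """Convert a dict-like record into (label, features), both as floats."""
    label = float(record[label_field])
    features = [float(record[name]) for name in feature_names]
    return label, features


# Hypothetical record and feature names, for illustration only
record = {"classification": 2, "departure_temp": 18.5, "departure_wind_speed": 12}
label, features = to_labeled_pair(record, ["departure_temp", "departure_wind_speed"])
```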
Machine Learning Algorithms
Supervised learning (requires ground truth)
• Continuous output
– Regression: Linear, Ridge, Lasso, Isotonic
– Decision Tree
– Random Forest
– Gradient Boosted Tree
• Discrete output
– Classification: Logistic Regression, SVM, Naive Bayes
– Decision Tree
– Random Forest
– Gradient Boosted Tree
– K-NN (available as an add-on Spark package)
Unsupervised learning (no ground truth data required)
• Continuous output
– Clustering: KMeans, Gaussian Mixture
– Dimensionality Reduction: PCA, SVD
• Discrete output
– FP-Growth
Train Logistic Regression Model
Train NaiveBayes Model
Train Decision Tree Model
Train Random Forest Model
Naïve Bayes vs Decision Tree
Naïve Bayes
• Probabilistic: computes the probability of a data instance being in a specific class
• Assumes that each feature (variable) is independent from the others
• Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
• Works well with a low amount of training data; doesn't need all the possibilities
• Doesn't work with categorical features
Decision Tree
• Non-probabilistic: partitions the data into subsets that best describe the variable
• The deeper the tree, the better the model fits the data
• Watch out for overfitting: need to prune the tree
• Can handle categorical or continuous features
• No need for input to be scaled or standardized: set your features and go!
• Requires a lot of data covering all possibilities
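For reference, the independence assumption in the Naïve Bayes column is what lets the class probability factor into per-feature terms. This is the standard textbook form, with C_k denoting a class and x_1, …, x_n the features of one record (notation introduced here, not taken from the slides):

```latex
P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),
\qquad
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

Each factor P(x_i | C_k) is estimated separately, which is one reason the model can work with relatively little training data.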
Accuracy Analysis of the Machine Learning Models
In this section, we will perform accuracy analysis on the test data. We will start by computing the accuracy metrics for each model, including the confusion matrix. We will then use a histogram chart to understand the data distribution and refine how the classes are computed.
Load Test data
Make sure to change the db name to match the one created for your test set by your activity (open the Cloudant dashboard to find the name)
Accuracy Metrics
Confusion Matrix
Accuracy metrics API
Code callouts: output HTML; display the results HTML in a Notebook cell; compute metrics from labeled and prediction data; get the confusion matrix and build an HTML table
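As a rough illustration of what is computed from the labeled and prediction data, here is a pure-Python sketch of a confusion matrix and the metrics derived from it. It is not the session's accuracy API; the function names are hypothetical. On a Spark cluster, pyspark.mllib.evaluation.MulticlassMetrics provides comparable figures from an RDD of (prediction, label) pairs.

```python
def confusion_matrix(label_prediction_pairs, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, prediction in label_prediction_pairs:
        matrix[int(label)][int(prediction)] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / float(total) if total else 0.0


def precision_recall(matrix, cls):
    """Precision and recall for a single class index."""
    true_positives = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)   # column sum
    actually_cls = sum(matrix[cls])                      # row sum
    precision = true_positives / float(predicted_as_cls) if predicted_as_cls else 0.0
    recall = true_positives / float(actually_cls) if actually_cls else 0.0
    return precision, recall


# Made-up (label, prediction) pairs for three classes
pairs = [(0, 0), (0, 1), (1, 1), (1, 1), (2, 0)]
matrix = confusion_matrix(pairs, 3)
```

Reading the matrix row by row shows which actual classes get confused with which predicted ones, something a single accuracy number hides.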
Understand the distribution of your data with Histograms
Training Handler class
• Provides flexibility and extensibility to the application
• Provides a fail-fast, try-something-else mechanism
• Enables the user to easily customize the classes of data based on how the data is distributed
• Enables the user to easily add training features
Default Training Handler class
Code callouts:
• Return the description for each class
• Return the total number of classes: default is 5
• Re-classify a record: the default uses the s.classification field in the JSON record
• Extra feature names to be added: none by default
• Extra features to be added: the array must match the one returned by customTrainingFeaturesNames
Customize Training Handler
Provide a new classification and add the day of departure as a new feature
Inherit from the default TrainingHandler
Add the day of the week using a technique called dummy coding
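A minimal sketch of such a customization, under stated assumptions: only customTrainingFeaturesNames is named on the slides, so the base class, the other method names, the record fields (delay_minutes, departure_time), the two-class split and the 15-minute threshold are all hypothetical stand-ins. The sketch uses one 0/1 indicator per weekday; strict dummy coding would drop one of the seven columns as a reference level.

```python
from datetime import datetime

DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


class DefaultTrainingHandler(object):
    """Hypothetical stand-in for the default handler described on the previous slide."""

    def numClasses(self):
        return 5

    def computeClassification(self, record):
        return record["classification"]

    def customTrainingFeaturesNames(self):
        return []

    def customTrainingFeatures(self, record):
        return []


class DayOfWeekTrainingHandler(DefaultTrainingHandler):
    """Two classes (on time / delayed) plus weekday indicator features."""

    def numClasses(self):
        return 2

    def computeClassification(self, record):
        # Assumed rule: a departure more than 15 minutes late counts as delayed
        return 1 if record["delay_minutes"] > 15 else 0

    def customTrainingFeaturesNames(self):
        return ["departure_" + day for day in DAY_NAMES]

    def customTrainingFeatures(self, record):
        # The array must line up with customTrainingFeaturesNames
        departure = datetime.strptime(record["departure_time"], "%Y-%m-%dT%H:%M")
        weekday = departure.weekday()  # Monday is 0
        return [1 if index == weekday else 0 for index in range(7)]


# Made-up record: 2016-05-30 was a Monday
handler = DayOfWeekTrainingHandler()
sample_record = {"delay_minutes": 42, "departure_time": "2016-05-30T08:15"}
extra_features = handler.customTrainingFeatures(sample_record)
```

Encoding the weekday as separate indicators avoids implying that Sunday (6) is somehow "larger" than Monday (0), which a single integer feature would suggest to a linear model.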
Re-train the models
Re-compute accuracy
Models 1
Models 2
Better accuracy for NaiveBayes and Logistic Regression
Worse for Decision Tree and Random Forest
Deploy and Run the models
In the last section, we will simulate deploying and running the models through the notebook by calling APIs from the run package.
Run the predictive model
runModel API
Get Weather Predictions
Show prediction results
Resources
• https://developer.ibm.com/clouddataservices/
• https://github.com/ibm-cds-labs/simple-data-pipe
• https://github.com/ibm-cds-labs/pipes-connector-flightstats
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://console.ng.bluemix.net/data/analytics/
Thank You