David Taieb, STSM - IBM Cloud Data Services, Developer Advocate, david_taieb@us.ibm.com. HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON. Part 1: Flight Delay Predict with Spark ML. PyCon 2016, Portland

Spark tutorial pycon 2016 part 1

Apr 07, 2017

Data & Analytics

David Taïeb
Transcript
Page 1: Spark tutorial pycon 2016   part 1

David Taieb, STSM - IBM Cloud Data Services, Developer Advocate, david_taieb@us.ibm.com

HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON. Part 1: Flight Delay Predict with Spark ML. PyCon 2016, Portland

Page 2: Spark tutorial pycon 2016   part 1

© 2016 IBM Corporation

Agenda

•  Pre-requisite steps to be completed before the session
•  Flight Predict app description and architecture
•  Train the models in the Notebook
•  Accuracy analysis and models refinement
•  Deploy and run the models

Page 3: Spark tutorial pycon 2016   part 1

Sign up for Bluemix
•  Access the IBM Bluemix website at https://console.ng.bluemix.net
•  Click on Get Started for Free
•  Complete the form and click Create account
•  Look for the confirmation email and click on the confirm your account link

Sign up for FlightStats

Page 4: Spark tutorial pycon 2016   part 1

Sign up for a free trial at Flightstats.com

•  Sign up at https://developer.flightstats.com/signup
•  Fill out the form and monitor your email for the confirmation link (access to the APIs may take up to 24 hours)
•  Once access is granted, go to https://developer.flightstats.com/admin/applications to view your appId and appKey (you will need them in the simple-data-pipe tool to create the training sets).
•  Optional: get familiar with the various FlightStats APIs:
–  https://developer.flightstats.com/api-docs/scheduledFlights/v1
–  https://developer.flightstats.com/api-docs/airports/v1

How to find your app id and key

Page 5: Spark tutorial pycon 2016   part 1

Where to find the FlightStats app id and app key

APP ID

APP Key

Prepare your Bluemix space

Page 6: Spark tutorial pycon 2016   part 1

Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix.

Create a Spark instance

Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse

Page 7: Spark tutorial pycon 2016   part 1

Create a Spark Instance

Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse

Page 8: Spark tutorial pycon 2016   part 1

Create New Spark Instance
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse

Page 9: Spark tutorial pycon 2016   part 1

Agenda

•  Pre-requisite steps to be completed before the session
•  Flight Predict app description and architecture
•  Train the models in the Notebook
•  Accuracy analysis and models refinement
•  Deploy and run the models

Page 10: Spark tutorial pycon 2016   part 1

Flight App Project Description
•  Use case
–  Flight delays are a common disturbance during business trips
–  Being able to predict how likely a flight is to be delayed can remove uncertainty and enable users to plan around it.
–  Idea: Weather data can be a good explanatory variable for building predictive models
•  Implementation
–  Combine flight statistics from flightstats.com (system of records) with weather data from IBM Insight for Weather (system of operations) to build training, test and blind sets
–  Use Spark MLlib to train predictive models and cross-validate them
–  Create a custom card for Google Now that will automatically notify the user of an impending flight delay
–  Propose alternative flight routes (e.g. Freebird)

Get/Build/Analyze

Page 11: Spark tutorial pycon 2016   part 1

Get/Build/Analyze methodology

Page 12: Spark tutorial pycon 2016   part 1

Flight Predict App Architecture

[Architecture diagram labels: Weather, Simple Data Pipes, Airports, Flight Schedules, Flight Status, Metadata, Training Set, Test Set, Blind Set, Custom Connector run every 24 hours, Notebook]

Page 13: Spark tutorial pycon 2016   part 1

Flow Diagram

[Diagram labels: Data Acquisition; Data Preparation (Cleansing, Shaping, Enrichment); Data Annotation (Ground Truth); Model Training; Model Testing; Training Set; Test Set; Blind Set; Iterative; Cross-Validation; Train Model; Evaluate Performance and optimize model; Deploy and Run Model]

•  Iterative in nature: we are never done!
•  We will be using this diagram as a roadmap throughout this course

Page 14: Spark tutorial pycon 2016   part 1

Get the data and build the training/test/blind sets
In this step we’ll use the Simple Data Pipes open source project to acquire data from FlightStats, combine it with weather data from IBM Insight for Weather and save the data sets into a NoSQL Cloudant database.

Page 15: Spark tutorial pycon 2016   part 1

Acquiring the data

•  In the next section, we show how to acquire the training data by using the simple-data-pipe tool and the flight predict connector.
•  The flight predict connector combines historical flight data from flightstats.com with weather data from IBM Insight for Weather
•  If you want to skip these steps, you can use the already built data set with the following credentials:
–  cloudantHost: dtaieb.cloudant.com
–  cloudantUserName: weenesserliffircedinvers
–  cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5

Deploy simple-data-pipe

Page 16: Spark tutorial pycon 2016   part 1

Deploy simple-data-pipe with flightstats connector
•  Go to https://github.com/ibm-cds-labs/simple-data-pipe
•  Click on the Deploy to Bluemix button

Clicking the button will take you to Bluemix

Page 17: Spark tutorial pycon 2016   part 1

Complete simple-data-pipe deployment

Add Weather service

Page 18: Spark tutorial pycon 2016   part 1

Add an instance of IBM Weather Service on Bluemix
•  Return to the application dashboard
•  The Weather service is required by the flight predict connector and must be installed first
•  From the app dashboard, click on Add a service or API

Page 19: Spark tutorial pycon 2016   part 1

Create an instance of IBM Weather Service on Bluemix
Search for Weather

Make sure to select the “premium plan” to have enough authorized API calls

Page 20: Spark tutorial pycon 2016   part 1

Checkpoint: simple data pipe app dashboard
•  Verify that your app is correctly bound to the right services

Weather Service: used to enrich flight records with weather observations

Cloudant Service: used to store the training, test and blind data sets

You’ll need to click on this button for the step on the next page

It is recommended to increase the app memory to 1 GB

Page 21: Spark tutorial pycon 2016   part 1

Install flight predict connector
•  Click the Edit Code button, then edit package.json to add the flight predict module:
–  "simple-data-pipe-connector-flightstats": "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"

Add the flight predict module to dependencies

Save your changes

Don’t forget to add a comma at the end of the preceding line to keep the JSON valid
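The package.json edit above can also be scripted, which sidesteps the missing-comma mistake. The following is a minimal sketch, assuming package.json has (or can be given) a top-level "dependencies" object; the connector name and git URL come from the slide, while the sample file content and the function name are made up for illustration.

```python
import json

# Connector entry taken from the slide; everything else is illustrative.
CONNECTOR_NAME = "simple-data-pipe-connector-flightstats"
CONNECTOR_URL = "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"


def add_connector(package_json_text):
    """Return package.json text with the flightstats connector dependency added."""
    package = json.loads(package_json_text)
    # setdefault creates the "dependencies" object if the file has none yet
    package.setdefault("dependencies", {})[CONNECTOR_NAME] = CONNECTOR_URL
    # json.dumps always emits valid JSON, so a missing comma cannot occur
    return json.dumps(package, indent=2)


# Hypothetical, trimmed-down package.json used only for demonstration
sample = '{"name": "simple-data-pipe", "dependencies": {"express": "^4.0.0"}}'
updated = add_connector(sample)
print(updated)
```

Editing by hand in the Bluemix code editor, as the slide describes, gives the same result; the script only shows what the final dependencies object should contain.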

Page 22: Spark tutorial pycon 2016   part 1

Install flight predict connector
•  Click File/Save to save your changes

Redeploy simple data pipe

Page 23: Spark tutorial pycon 2016   part 1

Redeploy simple data pipe app
•  Use the live edit editor to redeploy the app

Verify your SDP install

Page 24: Spark tutorial pycon 2016   part 1

Verify connector install
•  In this step, we verify through the UI that the flight predict connector is correctly installed

Flight connector correctly installed

Create new FlightStats pipe

Page 25: Spark tutorial pycon 2016   part 1

Create a new FlightStats pipe
•  Follow each screen to create and configure a new pipe

Run the pipe

Page 26: Spark tutorial pycon 2016   part 1

Run the pipe
•  Skip over the schedule tab
•  In the activity tab, click on Run Now to start the pipe

Explore the data set

Click Run Now, then open the log to monitor the activity

Page 27: Spark tutorial pycon 2016   part 1

Explore the data sets
•  In this step, we take a moment to explore the different data sets that have been created by the simple data pipe tool
•  From the Bluemix dashboard, click on the Cloudant service tile, then on the Launch button
•  From the Cloudant dashboard, open the training database
•  Open a document to look at the data structure

Build the test set

Page 28: Spark tutorial pycon 2016   part 1

Run the pipe again to build the test set

Train the models

Page 29: Spark tutorial pycon 2016   part 1

Train the Models
•  In the previous section we created the training data, and we are now ready to train the models.
•  Steps in this section:
–  Create an IPython Notebook
–  Load the data sets from the Cloudant database into a Spark cluster
–  Explore the data and train the machine learning models

Create IPython Notebook

Page 30: Spark tutorial pycon 2016   part 1

Create a new IPython Notebook

Page 31: Spark tutorial pycon 2016   part 1

Notebook tour

Page 32: Spark tutorial pycon 2016   part 1

Notebook tour: Notebook Info

Page 33: Spark tutorial pycon 2016   part 1

Notebook tour: Environment

Page 34: Spark tutorial pycon 2016   part 1

Notebook tour: Sharing

Page 35: Spark tutorial pycon 2016   part 1

Agenda

•  Pre-requisite steps to be completed before the session
•  Flight Predict app description and architecture
•  Train the models in the Notebook
•  Accuracy analysis and models refinement
•  Deploy and run the models

Page 36: Spark tutorial pycon 2016   part 1

Before we start building the app…

•  You can optionally follow this tutorial from GitHub by using a fully built notebook:
–  https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb

Page 37: Spark tutorial pycon 2016   part 1

Optional: use prebuilt notebook

Import required Python packages

•  Create notebook from URL
•  Use https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb

Page 38: Spark tutorial pycon 2016   part 1

Using Python Packages
•  Write code inline within cells
•  Encapsulate helper APIs within a Python package
•  2 ways of using helper Python packages
–  egg distribution package: pip install from a PyPI server or file server (e.g. GitHub)
   •  Persistent install across sessions
   •  Recommended in production
–  SparkContext.addPyFile
   •  Easy addition of a Python module file
   •  Supports multiple module files via zip format
   •  Recommended during development where frequent code changes occur

Manage egg packages

Page 39: Spark tutorial pycon 2016   part 1

Flight Predict Python Package on Github

Setup script for installing the Python package

Flight Predict Python library

Page 40: Spark tutorial pycon 2016   part 1

Method 1: Install Flight Predict Package
•  Use pip to install the Flight Predict package
•  Recommended alternative: build an egg distribution package and deploy it in PyPI

Page 41: Spark tutorial pycon 2016   part 1

Manage Python packages
•  Check status
•  Uninstall package

Install packages via sc.addPyFile method

Page 42: Spark tutorial pycon 2016   part 1

Method 2: Install py modules via sc.addPyFile

•  addPyFile installs individual py modules and makes them available to all executor processes
•  Works with modules in zipped files

Module containing APIs for training the models

Module containing APIs for running the models
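As a rough illustration of the zip route mentioned above, the sketch below bundles two module files into one archive and shows where SparkContext.addPyFile would be called. The file names training.py and run.py are placeholders chosen to mirror the two callouts, not the names used in the actual project, and the addPyFile call is left commented out because it needs a live SparkContext (the notebook normally provides one as sc).

```python
import os
import tempfile
import zipfile


def zip_modules(module_paths, zip_path):
    """Bundle several .py module files into one zip archive."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in module_paths:
            # Store each file at the archive root so `import <module>` works
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


def ship_to_executors(sc, zip_path):
    """Register the archive with a live SparkContext."""
    sc.addPyFile(zip_path)  # PySpark API: SparkContext.addPyFile(path)


# Demonstration with two placeholder modules (names are hypothetical)
workdir = tempfile.mkdtemp()
paths = []
for name in ("training.py", "run.py"):
    path = os.path.join(workdir, name)
    with open(path, "w") as handle:
        handle.write("# placeholder module\n")
    paths.append(path)

bundle = zip_modules(paths, os.path.join(workdir, "flightpredict.zip"))
with zipfile.ZipFile(bundle) as archive:
    print(archive.namelist())

# In a notebook, where `sc` already exists:
# ship_to_executors(sc, bundle)
```

Because addPyFile must be called again in each new session, this route suits the development phase, which matches the recommendation on the earlier "Using Python Packages" slide.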

Configure credentials for various services

Page 43: Spark tutorial pycon 2016   part 1

Setup credentials and Import required python modules

In this step, we import the Python modules that will be needed throughout the notebook and set up credentials for the various services.

How to get credentials for Cloudant and Weather

Credentials for Cloudant NoSQL Service

Credentials for Weather Service

Page 44: Spark tutorial pycon 2016   part 1

Get Credentials for Cloudant
From the app dashboard, click on Environment Variables in the left sidebar
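The Environment Variables page typically shows a VCAP_SERVICES JSON document. The sketch below pulls the Cloudant values out of such a document; it assumes the service is listed under the key "cloudantNoSQLDB" with "host", "username" and "password" inside "credentials", so check these names against what your dashboard actually displays. The output keys reuse the cloudantHost, cloudantUserName and cloudantPassword names from the earlier "Acquiring the data" slide; the sample values are made up.

```python
import json


def cloudant_credentials(vcap_services_text, service_key="cloudantNoSQLDB"):
    """Extract host, username and password from a VCAP_SERVICES JSON string."""
    services = json.loads(vcap_services_text)
    instances = services.get(service_key, [])
    if not instances:
        raise KeyError("no %s service bound to the app" % service_key)
    # Use the first bound instance; an app may bind more than one
    creds = instances[0]["credentials"]
    return {
        "cloudantHost": creds["host"],
        "cloudantUserName": creds["username"],
        "cloudantPassword": creds["password"],
    }


# Made-up values in the shape assumed above
sample_vcap = json.dumps({
    "cloudantNoSQLDB": [{
        "name": "my-cloudant",
        "credentials": {
            "host": "example.cloudant.com",
            "username": "example-user",
            "password": "example-password",
        },
    }]
})
print(cloudant_credentials(sample_vcap))
```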

Page 45: Spark tutorial pycon 2016   part 1

Get Credentials for Weather

Load training set from Cloudant

Page 46: Spark tutorial pycon 2016   part 1

Load training set in Spark SQL DataFrame

In this step, we use the cloudant-spark connector (https://github.com/cloudant-labs/spark-cloudant) to load data into Spark

Make sure to change the dbname to match the one created for your training set by your activity (open the Cloudant dashboard to find the name)

Page 47: Spark tutorial pycon 2016   part 1

Loading data: Behind the scene
Use the Spark SQL connector to load data into a DataFrame

Connector id

Options

Cache data for optimized reuse

Create temp SQL table
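The four callouts above (connector id, options, cache, temp table) could be sketched as follows. This is not the code from the slide's screenshot: it assumes the connector id "com.cloudant.spark" and the option names "cloudant.host", "cloudant.username" and "cloudant.password" as I recall them from the spark-cloudant README, and it uses the Spark 1.x registerTempTable call. The function name and the "training" table name are hypothetical.

```python
def load_training_set(sql_context, host, username, password, db_name):
    """Load a Cloudant database into a cached DataFrame and register a temp table."""
    df = (
        sql_context.read.format("com.cloudant.spark")  # connector id (assumed)
        .option("cloudant.host", host)                 # connection options (assumed names)
        .option("cloudant.username", username)
        .option("cloudant.password", password)
        .load(db_name)                                 # db_name is your training database
    )
    df.cache()                        # keep the data in memory for reuse
    df.registerTempTable("training")  # expose it to Spark SQL queries
    return df


# In a notebook, where `sqlContext` already exists:
# trainingDF = load_training_set(sqlContext, "ACCOUNT.cloudant.com", "USER", "PASSWORD", "training")
```

Passing the context in as a parameter keeps the function importable outside a Spark session; the connector package still has to be available on the cluster for the call to succeed.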

Scatter Plot Visualization

Page 48: Spark tutorial pycon 2016   part 1

Scatter plot visualization

Page 49: Spark tutorial pycon 2016   part 1

Visualization api

Create an RDD of LabeledPoint

Page 50: Spark tutorial pycon 2016   part 1

Transform into an RDD of LabeledPoint
Use the Spark SQL connector to load data into a DataFrame
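To make the transformation concrete, here is a minimal sketch of turning one record into a LabeledPoint. The "classification" field is mentioned later in the deck; the four feature field names are hypothetical and stand in for whatever weather attributes the real records carry. When PySpark is not installed, a namedtuple with the same (label, features) shape is used so the sketch still runs.

```python
from collections import namedtuple

try:
    from pyspark.mllib.regression import LabeledPoint  # MLlib class
except ImportError:
    # Stand-in with the same (label, features) shape, used only without Spark
    LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Hypothetical feature fields; the real records may use different names
FEATURE_FIELDS = ["departureTemp", "departureWindSpeed", "arrivalTemp", "arrivalWindSpeed"]


def to_labeled_point(record):
    """Turn one flight record (a dict) into a LabeledPoint."""
    label = float(record["classification"])
    features = [float(record[field]) for field in FEATURE_FIELDS]
    return LabeledPoint(label, features)


# Made-up record for demonstration
record = {"classification": 2, "departureTemp": 18.0, "departureWindSpeed": 12.0,
          "arrivalTemp": 25.0, "arrivalWindSpeed": 4.0}
point = to_labeled_point(record)
print(point.label, list(point.features))

# With a DataFrame `df`, the same function maps over its rows:
# labeled_rdd = df.rdd.map(lambda row: to_labeled_point(row.asDict()))
```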

Page 51: Spark tutorial pycon 2016   part 1

loadLabeledDataRDD api

Train Machine Learning Models

Page 52: Spark tutorial pycon 2016   part 1

Machine Learning Algorithms

Supervised Learning (requires ground truth)
•  Continuous output:
–  Regression: Linear, Ridge, Lasso, Isotonic
–  Decision Tree
–  Random Forest
–  Gradient Boosted Tree
•  Discrete output:
–  Classification: Logistic Regression, SVM, Naive Bayes
–  Decision Tree
–  Random Forest
–  Gradient Boosted Tree
–  K-NN (available as an add-on Spark package)

Unsupervised Learning (no ground truth data required)
•  Clustering: KMeans, Gaussian Mixture
•  Dimensionality Reduction: PCA, SVD
•  FP-Growth

Train Logistic Regression Model

Page 53: Spark tutorial pycon 2016   part 1

Train Logistic Regression Model

Train Naïve Bayes Models

Page 54: Spark tutorial pycon 2016   part 1

Train NaiveBayes Model

Train Decision Tree Model

Page 55: Spark tutorial pycon 2016   part 1

Train Decision Tree Model

Train Random Forest Model

Page 56: Spark tutorial pycon 2016   part 1

Train Random Forest Model

Accuracy Analysis

Page 57: Spark tutorial pycon 2016   part 1

Naïve Bayes vs Decision Tree

Naïve Bayes
•  Probabilistic: computes the probability of a data instance being in a specific class
•  Assumes that each feature (variable) is independent from the others
•  Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
•  Works well with a low amount of training data. Doesn’t need all the possibilities
•  Doesn’t work with categorical features.

Decision Tree
•  Non-probabilistic: partitions the data into subsets that best describe the variable
•  The deeper the tree, the better the model fits the data
•  Watch out for overfitting: need to prune the tree
•  Can handle categorical or continuous features
•  No need for input to be scaled or standardized: set your features and go!
•  Requires a lot of data covering all possibilities

Page 58: Spark tutorial pycon 2016   part 1

Accuracy Analysis of the Machine Learning Models
In this section, we will perform accuracy analysis on the test data. We will start by computing the accuracy metrics for each model, including the confusion matrix. We will then use a histogram chart to understand the data distribution and refine how the classes are computed.

Page 59: Spark tutorial pycon 2016   part 1

Agenda

•  Pre-requisite steps to be completed before the session
•  Flight Predict app description and architecture
•  Train the models in the Notebook
•  Accuracy analysis and models refinement
•  Deploy and run the models

Page 60: Spark tutorial pycon 2016   part 1

Load Test data
Make sure to change the dbname to match the one created for your test set by your activity (open the Cloudant dashboard to find the name)

Page 61: Spark tutorial pycon 2016   part 1

Accuracy Metrics

Page 62: Spark tutorial pycon 2016   part 1

Confusion Matrix

Page 63: Spark tutorial pycon 2016   part 1

Confusion Matrix

Page 64: Spark tutorial pycon 2016   part 1

Confusion Matrix

Page 65: Spark tutorial pycon 2016   part 1

Confusion Matrix

Page 66: Spark tutorial pycon 2016   part 1

Accuracy metrics API
Output HTML

Display results HTML in Notebook cell

Compute metrics from labeled and prediction data

Get the confusion matrix and build HTML table
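The steps named in these callouts can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the deck's accuracy metrics API: it counts (label, prediction) pairs into a confusion matrix, derives the overall accuracy, and renders an HTML table. The function names and the sample pairs are made up.

```python
def confusion_matrix(pairs, num_classes):
    """Count (label, prediction) pairs; rows are actual classes, columns predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, prediction in pairs:
        matrix[int(label)][int(prediction)] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (predicted class equals actual class)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return float(correct) / total if total else 0.0


def to_html(matrix):
    """Render the matrix as a plain HTML table."""
    rows = "".join(
        "<tr>" + "".join("<td>%d</td>" % cell for cell in row) + "</tr>"
        for row in matrix
    )
    return "<table>" + rows + "</table>"


# Made-up (label, prediction) pairs for three classes
pairs = [(0, 0), (0, 1), (1, 1), (1, 1), (2, 0), (2, 2)]
matrix = confusion_matrix(pairs, 3)
print(matrix)
print(accuracy(matrix))
print(to_html(matrix))
```

In a notebook, the resulting string can be shown with IPython.display.HTML. On an RDD of (prediction, label) pairs, MLlib's MulticlassMetrics class offers a confusionMatrix() method that serves the same purpose at scale.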

Page 67: Spark tutorial pycon 2016   part 1

Understand the distribution of your data with Histograms
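As a small illustration of why a histogram helps when choosing class boundaries, the sketch below bins a list of departure delays and prints a text histogram. The delay values and the 15-minute bin width are made up for the example; the deck's own histogram is drawn from the training data in the notebook.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up departure delays in minutes, for illustration only
delays = [-5, 0, 3, 7, 12, 14, 18, 25, 41, 47, 95]
for lower, count in histogram(delays, 15).items():
    print("%4d to %4d min: %s" % (lower, lower + 15, "#" * count))
```

If most records pile up in one bin, as in this toy data, equal-width classes leave some classes almost empty, which is one reason to redefine the classes based on the observed distribution.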

Page 68: Spark tutorial pycon 2016   part 1

Training Handler class

•  Provides flexibility and extensibility to the application
•  Provides a fail-fast, try-something-else mechanism
•  Enables the user to easily customize classes of data based on how the data is distributed
•  Enables the user to easily add training features

Page 69: Spark tutorial pycon 2016   part 1

Default Training Handler class

Returns a description for each class

Returns the total number of classes: default is 5

Re-classifies a record: default uses the s.classification field in the JSON record

Extra feature names to be added. None by default

Extra features to be added. Array must match the one returned by customTrainingFeaturesNames
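The five callouts describe an interface that might look roughly like the class below. Only customTrainingFeaturesNames is named on the slide; the class name, the other method names and the placeholder class descriptions are assumptions made for this sketch, so treat it as an outline of the contract rather than the project's actual code.

```python
class DefaultTrainingHandlerSketch(object):
    """Illustrative handler mirroring the five callouts on the slide."""

    def getClassLabel(self, value):
        # Hypothetical method name and placeholder description
        return "class %d" % int(value)

    def numClasses(self):
        # The slide states that the default is 5 classes
        return 5

    def computeClassification(self, s):
        # Default behaviour per the slide: reuse the record's classification field
        return s["classification"]

    def customTrainingFeaturesNames(self):
        # No extra features by default
        return []

    def customTrainingFeatures(self, s):
        # Must line up, position by position, with customTrainingFeaturesNames
        return []


handler = DefaultTrainingHandlerSketch()
record = {"classification": 3}  # made-up record
print(handler.numClasses(), handler.computeClassification(record))
print([handler.getClassLabel(i) for i in range(handler.numClasses())])
```

Keeping the classification and feature logic behind overridable methods is what lets a subclass try a different class scheme or add features without touching the training code.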

Page 70: Spark tutorial pycon 2016   part 1

Customize Training Handler
Provide a new classification and add day of departure as a new feature

Inherit from defaultTrainingHandler

Add day of the week using a technique called dummy coding
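Dummy coding turns a categorical value into 0/1 indicator features. The sketch below encodes the weekday of a departure date as six indicators with Monday as the reference category (all zeros); a seven-indicator one-hot variant is equally common, and the slide does not say which one the notebook uses. The function names, the "departure_" prefix and the ISO date format are assumptions.

```python
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_of_week_dummies(date_text):
    """Encode the weekday of an ISO date (YYYY-MM-DD) as six 0/1 indicators.

    Monday is the reference category and maps to all zeros.
    """
    weekday = datetime.strptime(date_text, "%Y-%m-%d").weekday()  # Monday == 0
    return [1.0 if weekday == day else 0.0 for day in range(1, 7)]


def feature_names():
    """Names matching the indicator order above (Tuesday through Sunday)."""
    return ["departure_" + name for name in DAY_NAMES[1:]]


print(feature_names())
print(day_of_week_dummies("2016-05-30"))  # a Monday
print(day_of_week_dummies("2016-06-01"))  # a Wednesday
```

A custom handler could return feature_names() from customTrainingFeaturesNames and the indicator list from its features counterpart, which keeps the two arrays aligned as the default handler slide requires.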

Page 71: Spark tutorial pycon 2016   part 1

Re-train the models

Page 72: Spark tutorial pycon 2016   part 1

Re-compute accuracy
Models 1

Models 2
Better accuracy for Naive Bayes and Logistic Regression
Worse for Decision Tree and Random Forest

Page 73: Spark tutorial pycon 2016   part 1

Agenda

•  Pre-requisite steps to be completed before the session
•  Flight Predict app description and architecture
•  Train the models in the Notebook
•  Accuracy analysis and models refinement
•  Deploy and run the models

Page 74: Spark tutorial pycon 2016   part 1

Deploy and Run the models
In the last section, we will simulate deploying and running the models through the notebook by calling APIs from the run package.

Page 75: Spark tutorial pycon 2016   part 1

Run the predictive model

Page 76: Spark tutorial pycon 2016   part 1

runModel API

Page 77: Spark tutorial pycon 2016   part 1

Get Weather Predictions

Page 78: Spark tutorial pycon 2016   part 1

Show prediction results

Page 79: Spark tutorial pycon 2016   part 1

Resource

•  https://developer.ibm.com/clouddataservices/
•  https://github.com/ibm-cds-labs/simple-data-pipe
•  https://github.com/ibm-cds-labs/pipes-connector-flightstats
•  http://spark.apache.org/docs/latest/mllib-guide.html
•  https://console.ng.bluemix.net/data/analytics/

Page 80: Spark tutorial pycon 2016   part 1

Thank You