Spark tutorial pycon 2016 part 1

Post on 07-Apr-2017

Transcript

David Taieb
STSM - IBM Cloud Data Services Developer Advocate
david_taieb@us.ibm.com

HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 1: Flight Delay Predict with Spark ML
PyCon 2016, Portland

© 2016 IBM Corporation

Agenda

•  Pre-requisite steps to be completed before the session
•  Flight Predict app description and architecture
•  Train the models in the Notebook
•  Accuracy analysis and models refinement
•  Deploy and run the models

Sign up for Bluemix

•  Access the IBM Bluemix website at https://console.ng.bluemix.net
•  Click on Get Started for Free
•  Complete the form and click Create account
•  Look for the confirmation email and click on the "confirm your account" link

Sign up for FlightStats

Sign up for a free trial at Flightstats.com

•  Sign up at https://developer.flightstats.com/signup
•  Fill out the form and monitor your email for the confirmation link (access to the APIs may take up to 24 hours)
•  Once access is granted, go to https://developer.flightstats.com/admin/applications to view your appId and appKey (you will need them in the simple-data-pipe tool to create the training sets)
•  Optional: get familiar with the various FlightStats APIs:
   –  https://developer.flightstats.com/api-docs/scheduledFlights/v1
   –  https://developer.flightstats.com/api-docs/airports/v1

How to find your app ID and key

Where to find the FlightStats app id and app key

APP ID

APP Key

Prepare your Bluemix space

Create a new space on Bluemix

In preparation for running the project, we create a new space on Bluemix.

Create a Spark instance

Optional: you can skip this step if you already have a space with a Spark instance that you would like to reuse.

Create a Spark Instance

Create New Spark Instance

Flight App Project Description

•  Use case
   –  Flight delays are a common disturbance during business trips
   –  Being able to predict how likely a flight is to be delayed can remove uncertainty and enable users to plan around it
   –  Idea: weather data can be a good explanatory variable for building predictive models
•  Implementation
   –  Combine flight statistics from flightstats.com (System of records) with weather data from IBM Insight for Weather (System of operations) to build training, test and blind sets
   –  Use Spark MLlib to train predictive models and cross-validate them
   –  Create a custom card for Google Now that will automatically notify the user of an impending flight delay
   –  Propose alternative flight routes (e.g. Freebird)

Get/Build/Analyze

Get/Build/Analyze methodology

Flight Predict App Architecture

[Architecture diagram. Labels: Weather; Simple Data Pipes; Airports; Flight Schedules; Flight Status; Metadata; Training Set; Test Set; Blind Set; Custom Connector run every 24 hours; Notebook]

Flow Diagram

[Flow diagram. Stages: Data Acquisition; Data Preparation; Data Annotation (Ground Truth); Model Training; Model Testing; Deploy and Run Model. Other labels: Cleansing, Shaping, Enrichment; Training Set, Test Set, Blind Set; iterative loop of Train Model, Cross-Validation, Evaluate Performance and optimize model]

•  Iterative in nature: we are never done!
•  We will be using this diagram as a roadmap throughout this course

Get the data and build the training/test/blind sets

In this step we'll use the Simple Data Pipes open source project to acquire data from FlightStats, combine it with weather data from IBM Insight for Weather, and save the data sets into a NoSQL Cloudant database.

[Flow diagram roadmap, as on the Flow Diagram slide]

Acquiring the data

•  In the next section, we show how to acquire the training data by using the simple-data-pipe tool and the flight predict connector.
•  The flight predict connector combines historical flight data from flightstats.com with weather data from IBM Insight for Weather.
•  If you want to skip these steps, you can use the already built data set with the following credentials:
   –  cloudantHost: dtaieb.cloudant.com
   –  cloudantUserName: weenesserliffircedinvers
   –  cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5

Deploy simple-data-pipe

Deploy simple-data-pipe with flightstats connector

•  Go to https://github.com/ibm-cds-labs/simple-data-pipe
•  Click on the Deploy to Bluemix button

Clicking the button will take you to Bluemix

Complete simple-data-pipe deployment

Add Weather service

Add an instance of IBM Weather Service on Bluemix

•  Return to the application dashboard
•  The Weather service is required by the flight predict connector and must be installed beforehand
•  From the app dashboard, click on Add a service or API

Create an instance of IBM Weather Service on Bluemix

Search for Weather

Make sure to select the "premium plan" to have enough authorized API calls

Checkpoint: simple data pipe app dashboard

•  Verify that your app is correctly bound to the right services

Weather Service: used to enrich flight records with weather observations

Cloudant Service: used to store training, test and blind data sets

You'll need to click on this button for the step on the next page

It is recommended to increase the app memory to 1 GB

Install flight predict connector

•  Click the Edit Code button, then edit package.json to add the flight predict module to the dependencies:
   –  "simple-data-pipe-connector-flightstats": "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"
•  Don't forget to add a comma at the end of the preceding line to keep the JSON valid
•  Save your changes
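The package.json edit can be sanity-checked offline before redeploying. The following is a minimal Python sketch: it adds the connector entry to a package.json document and re-parses the result, which fails loudly if the JSON is invalid (for example because of a missing comma in a hand edit). The sample package.json content and the `add_dependency` helper are illustrative assumptions; only the dependency name and git URL come from the slide.

```python
import json

CONNECTOR_NAME = "simple-data-pipe-connector-flightstats"
CONNECTOR_URL = ("git://github.com/ibm-cds-labs/"
                 "simple-data-pipe-connector-flightstats.git")


def add_dependency(package_json_text, name, url):
    """Return package.json text with one more entry under "dependencies"."""
    package = json.loads(package_json_text)  # raises ValueError on invalid JSON
    package.setdefault("dependencies", {})[name] = url
    return json.dumps(package, indent=2)


# Hypothetical starting file; the real package.json has other entries.
sample = '{"name": "simple-data-pipe", "dependencies": {"express": "^4.0.0"}}'
updated = add_dependency(sample, CONNECTOR_NAME, CONNECTOR_URL)
print(updated)
```

Letting a JSON library write the file avoids the missing-comma mistake entirely, at the cost of reformatting the file.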

Install flight predict connector

•  Click File/Save to save your changes

Redeploy simple data pipe

Redeploy simple data pipe app

•  Use the Live Edit editor to redeploy the app

Verify your SDP install

Verify connector install

•  In this step, we verify through the UI that the flight predict connector is correctly installed

Flight connector correctly installed

Create new FlightStats pipe

Create a new FlightStats pipe

•  Follow each screen to create and configure a new pipe

Run the pipe

Run the pipe

•  Skip over the Schedule tab
•  In the Activity tab, click on Run Now to start the pipe

Click Run Now, then open the log to monitor the activity

Explore the data set

Explore the data sets

•  In this step, we take a moment to explore the different data sets that have been created by the simple data pipe tool
•  From the Bluemix dashboard, click on the Cloudant service tile, then on the Launch button
•  From the Cloudant dashboard, open the training database
•  Open a document to look at the data structure

Build the test set

Run the pipe again to build the test set

Train the models

Train the Models

•  In the previous section we created the training data, and we are now ready to train the models.
•  Steps in this section:
   –  Create an IPython Notebook
   –  Load the data sets from the Cloudant database into a Spark cluster
   –  Explore the data and train the machine learning models

[Flow diagram roadmap, as on the Flow Diagram slide]

Create IPython Notebook

Create a new IPython Notebook

Notebook tour

Notebook tour: Notebook Info

Notebook tour: Environment

Notebook tour: Sharing

Before we start building the app…

•  You can optionally follow this tutorial from GitHub by using a fully built notebook:
   –  https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb

Optional: use prebuilt notebook

•  Create the notebook from URL
•  Use https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb

Import required Python packages

Using Python Packages

•  Write code inline within cells
•  Encapsulate helper APIs within a Python package
•  2 ways of using helper Python packages
   –  Egg distribution package: pip install from a PyPI server or file server (e.g. GitHub)
      •  Persistent install across sessions
      •  Recommended in production
   –  SparkContext.addPyFile
      •  Easy addition of a Python module file
      •  Supports multiple module files via zip format
      •  Recommended during development, where frequent code changes occur

Manage egg packages

Flight Predict Python Package on Github

Setup script for installing the Python package

Flight Predict Python library

Method 1: Install Flight Predict Package

•  Use pip to install the Flight Predict package
•  Recommended alternative: build an egg distribution package and deploy it to PyPI

Manage Python packages

•  Check status
•  Uninstall package

Install packages via the sc.addPyFile method

Method 2: Install py modules via sc.addPyFile

•  addPyFile installs individual py modules and makes them available to all executor processes
•  Works with modules in zipped files

Module containing APIs for training the models

Module containing APIs for running the models

Configure credentials for various services
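To illustrate the zipped-modules option, here is a minimal sketch that bundles a helper module into a zip archive and imports from it. The module name `demo_helpers` and its content are hypothetical stand-ins, not the tutorial's real modules. Outside Spark, the sketch puts the archive on `sys.path` for a single process; in the notebook one would instead pass the archive path to `sc.addPyFile`, assuming `sc` is the SparkContext the notebook provides.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_module_zip(sources, zip_path):
    """Write {module_file_name: source_code} into a zip archive."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for file_name, code in sources.items():
            archive.writestr(file_name, code)
    return zip_path


# Hypothetical helper module, used only for this illustration.
sources = {"demo_helpers.py": "def greet():\n    return 'hello from zip'\n"}
zip_path = build_module_zip(
    sources, os.path.join(tempfile.mkdtemp(), "demo_helpers.zip"))

# In the notebook the archive would be shipped to the executors with:
#     sc.addPyFile(zip_path)
# Locally, adding the zip to sys.path makes its modules importable.
sys.path.insert(0, zip_path)
demo_helpers = importlib.import_module("demo_helpers")
print(demo_helpers.greet())
```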

Setup credentials and Import required python modules

In this step, we import the Python modules that will be needed throughout the notebook and set up credentials for the various services.

Credentials for Cloudant NoSQL Service

Credentials for Weather Service

How to get credentials for Cloudant and Weather

Get Credentials for Cloudant

From the app dashboard, click on Environment Variables in the left sidebar
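On Cloud Foundry based platforms such as Bluemix, the credentials of bound services are typically exposed as a JSON document in the VCAP_SERVICES environment variable. As a rough sketch, the snippet below extracts a service's credentials from such a document. The `find_credentials` helper, the service label and every value in the sample are placeholders assumed for illustration; copy the real values from your own Environment Variables page.

```python
import json


def find_credentials(vcap_services_text, label_prefix):
    """Return the credentials dict of the first service whose label starts
    with label_prefix, or None when no such service is bound."""
    services = json.loads(vcap_services_text)
    for label, instances in services.items():
        if label.startswith(label_prefix) and instances:
            return instances[0].get("credentials")
    return None


# Placeholder document in the VCAP_SERVICES shape; the values are not real.
sample_vcap = json.dumps({
    "cloudantNoSQLDB": [{
        "name": "my-cloudant",
        "credentials": {
            "host": "ACCOUNT.cloudant.com",
            "username": "USERNAME",
            "password": "PASSWORD",
        },
    }]
})

cloudant = find_credentials(sample_vcap, "cloudant")
print(cloudant["host"])
```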

Get Credentials for Weather

Load training set from Cloudant

Load training set in Spark SQL DataFrame

In this step, we use the cloudant-spark connector (https://github.com/cloudant-labs/spark-cloudant) to load data into Spark

Make sure to change the dbname to match the one created for your training set by your activity (open the Cloudant dashboard to find the name)

Loading data: Behind the scene

Use the Spark SQL connector to load data into a DataFrame

Code callouts: connector id; options; cache data for optimized reuse; create temp SQL table

Scatter plot visualization
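The four callouts on this slide (connector id, options, caching, temp SQL table) can be sketched as follows. This is not the tutorial's own helper code: the function names are invented for illustration, the host, user, password and database names are placeholders, and the sketch assumes the Spark 1.x `SQLContext` that the notebook exposes as `sqlContext` together with the cloudant-spark connector's `com.cloudant.spark` data source.

```python
def cloudant_options(host, username, password):
    """Connection options in the form the cloudant-spark connector reads."""
    return {
        "cloudant.host": host,
        "cloudant.username": username,
        "cloudant.password": password,
    }


def load_cloudant_db(sql_context, db_name, options, table_name):
    """Load a Cloudant database into a cached DataFrame and register it
    as a temporary SQL table. `sql_context` is the notebook's SQLContext."""
    reader = sql_context.read.format("com.cloudant.spark")  # connector id
    for key, value in options.items():
        reader = reader.option(key, value)
    df = reader.load(db_name)
    df.cache()                        # keep the data in memory for reuse
    df.registerTempTable(table_name)  # Spark 1.x API for temp SQL tables
    return df


options = cloudant_options("ACCOUNT.cloudant.com", "USERNAME", "PASSWORD")
# In the notebook (the database name is a placeholder):
#     training_df = load_cloudant_db(
#         sqlContext, "YOUR_TRAINING_DB", options, "training")
```

Passing the SQLContext in as an argument keeps the helper importable and testable on a machine without Spark.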

Scatter plot visualization

Visualization API

Create an RDD of LabeledPoint

Transform into an RDD of LabeledPoint

Use the Spark SQL connector to load data into a DataFrame

loadLabeledDataRDD API

Train Machine Learning Models

Machine Learning Algorithms

Supervised Learning (requires ground truth)
•  Continuous output
   –  Regression: Linear, Ridge, Lasso, Isotonic
   –  Decision Tree
   –  Random Forest
   –  Gradient Boosted Tree
•  Discrete output
   –  Classification: Logistic Regression, SVM, Naive Bayes
   –  Decision Tree
   –  Random Forest
   –  Gradient Boosted Tree
   –  K-NN (available as an add-on Spark package)

Unsupervised Learning (no ground truth data required)
•  Continuous output
   –  Clustering: KMeans, Gaussian Mixture
   –  Dimensionality Reduction: PCA, SVD
•  Discrete output
   –  FP-Growth

Train Logistic Regression Model

Train NaiveBayes Model

Train Decision Tree Model

Train Random Forest Model

Accuracy Analysis
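The transcript keeps only the titles of the four training slides. As a hedged sketch of what such training calls can look like with the Spark MLlib (RDD-based) Python API, the function below trains the four classifiers on an RDD of LabeledPoint. The function name, the dictionary layout and all hyperparameters are assumptions, not the tutorial's settings; the default of 5 classes follows the default training handler described later in the deck.

```python
MODEL_NAMES = ("logistic_regression", "naive_bayes",
               "decision_tree", "random_forest")


def train_models(labeled_rdd, num_classes=5, num_trees=100):
    """Train four MLlib classifiers on an RDD of LabeledPoint.

    pyspark is imported lazily so this file can be loaded without Spark.
    """
    from pyspark.mllib.classification import (
        LogisticRegressionWithLBFGS, NaiveBayes)
    from pyspark.mllib.tree import DecisionTree, RandomForest

    no_categorical = {}  # treat every feature as continuous
    return {
        "logistic_regression": LogisticRegressionWithLBFGS.train(
            labeled_rdd, iterations=100, numClasses=num_classes),
        "naive_bayes": NaiveBayes.train(labeled_rdd),
        "decision_tree": DecisionTree.trainClassifier(
            labeled_rdd, numClasses=num_classes,
            categoricalFeaturesInfo=no_categorical),
        "random_forest": RandomForest.trainClassifier(
            labeled_rdd, numClasses=num_classes,
            categoricalFeaturesInfo=no_categorical, numTrees=num_trees),
    }
```

One caveat when trying this: MLlib's NaiveBayes expects non-negative feature values, so features such as temperatures below zero would need shifting or rescaling first.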

Naïve Bayes vs Decision Tree

Naïve Bayes
•  Probabilistic: computes the probability that a data instance belongs to a specific class
•  Assumes that each feature (variable) is independent from the others
•  Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
•  Works well with a small amount of training data; doesn't need all the possibilities
•  Doesn't work with categorical features

Decision Tree
•  Non-probabilistic: partitions the data into subsets that best describe the variable
•  The deeper the tree, the better the model fits the data
•  Watch out for overfitting: need to prune the tree
•  Can handle categorical or continuous features
•  No need for input to be scaled or standardized: set your features and go!
•  Requires a lot of data covering all possibilities

Accuracy Analysis of the Machine Learning Models

In this section, we will perform accuracy analysis on the test data. We will start by computing the accuracy metrics for each model, including the confusion matrix. We will then use a histogram chart to understand the data distribution and refine how the classes are computed.

[Flow diagram roadmap, as on the Flow Diagram slide]

Load Test data

Make sure to change the dbname to match the one created for your test set by your activity (open the Cloudant dashboard to find the name)

Accuracy Metrics

Confusion Matrix

Accuracy metrics API

Code callouts: output HTML; display the results HTML in a notebook cell; compute metrics from labeled and prediction data; get the confusion matrix and build an HTML table
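The steps named in these callouts can be illustrated without Spark. The sketch below is plain Python rather than the tutorial's API: it builds a confusion matrix from (label, prediction) pairs, derives the overall accuracy, and renders the matrix as an HTML table. The function names and the sample labels are made up for illustration.

```python
def confusion_matrix(labels, predictions, num_classes):
    """matrix[actual][predicted] = number of records."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(labels, predictions):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / float(total) if total else 0.0


def to_html_table(matrix):
    """Render the matrix as a bare HTML table, one row per actual class."""
    rows = []
    for row in matrix:
        cells = "".join("<td>%d</td>" % count for count in row)
        rows.append("<tr>%s</tr>" % cells)
    return "<table>%s</table>" % "".join(rows)


# Made-up labels and predictions for three classes, for illustration only.
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(labels, predictions, 3)
print(matrix)
print(accuracy(matrix))
print(to_html_table(matrix))
```

In a notebook cell, the resulting string can be rendered with `IPython.display.HTML`.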

Understand the distribution of your data with Histograms
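As a small illustration of the idea, the sketch below bins a list of delays into fixed-width buckets and prints a text histogram. The delay values and the 15-minute bin width are invented for the example and do not come from the tutorial's data set.

```python
from collections import Counter

BIN_WIDTH = 15  # minutes per bin, chosen arbitrarily for this example


def histogram(values, bin_width):
    """Count values per bin; each key is the lower bound of its bin."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up departure delays in minutes, for illustration only.
delays = [0, 3, 12, 14, 15, 31, 47, 125]
bins = histogram(delays, BIN_WIDTH)
for lower, count in bins.items():
    print("%4d-%-4d %s" % (lower, lower + BIN_WIDTH - 1, "#" * count))
```

A long tail like the single 125-minute delay here is the kind of pattern that suggests class boundaries should not be evenly spaced.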

Training Handler class

•  Provides flexibility and extensibility to the application
•  Provides a "fail fast and try something else" mechanism
•  Enables the user to easily customize classes of data based on how the data is distributed
•  Enables the user to easily add training features

Default Training Handler class

Code callouts:
•  Return a description for each class
•  Return the total number of classes: default is 5
•  Re-classify a record: default uses the s.classification field in the JSON record
•  Extra feature names to be added: none by default
•  Extra features to be added: the array must match the one returned by customTrainingFeaturesNames

Customize Training Handler

Provide a new classification and add the day of departure as a new feature

Code callouts:
•  Inherit from the default training handler
•  Add the day of the week using a technique called dummy coding
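A standalone sketch of this customization pattern follows. Only the method name `customTrainingFeaturesNames` appears in the slides; the class names, the other method names, the record fields `delayMinutes` and `departureWeekday`, and the delay thresholds are all assumptions made for illustration. Records are plain dicts here, whereas the slides describe attribute access (`s.classification`). The weekday is encoded with one indicator per day, the one-hot variant of dummy coding (the strict form drops one reference level).

```python
WEEKDAYS = 7


class DefaultTrainingHandler(object):
    """Stand-in for the default handler described on the slides."""

    def numClasses(self):
        return 5  # default number of classes per the slide

    def computeClassification(self, record):
        return record["classification"]  # class already stored in the record

    def customTrainingFeaturesNames(self):
        return []  # no extra features by default

    def customTrainingFeatures(self, record):
        return []


class WeekdayTrainingHandler(DefaultTrainingHandler):
    """Re-classifies records and adds the departure weekday as features."""

    def numClasses(self):
        return 3

    def computeClassification(self, record):
        delay = record["delayMinutes"]  # hypothetical field name
        if delay < 15:
            return 0  # on time (threshold is an assumption)
        if delay < 60:
            return 1  # moderate delay
        return 2      # long delay

    def customTrainingFeaturesNames(self):
        return ["departure_weekday_%d" % day for day in range(WEEKDAYS)]

    def customTrainingFeatures(self, record):
        features = [0.0] * WEEKDAYS
        features[record["departureWeekday"]] = 1.0  # 0 = Monday (assumed)
        return features


# Made-up record, for illustration only.
record = {"classification": 4, "delayMinutes": 42, "departureWeekday": 4}
handler = WeekdayTrainingHandler()
print(handler.computeClassification(record))
print(handler.customTrainingFeatures(record))
```

Encoding the weekday as indicators rather than as a single number from 0 to 6 avoids implying that Sunday is "greater than" Monday.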

Re-train the models

Re-compute accuracy

Models 1

Models 2

Better accuracy for Naive Bayes and Logistic Regression; worse for Decision Tree and Random Forest

Deploy and Run the models

In the last section, we will simulate deploying and running the models through the notebook by calling APIs from the run package.

[Flow diagram roadmap, as on the Flow Diagram slide]

Run the predictive model

runModel API

Get Weather Predictions

Show prediction results

Resources

•  https://developer.ibm.com/clouddataservices/
•  https://github.com/ibm-cds-labs/simple-data-pipe
•  https://github.com/ibm-cds-labs/pipes-connector-flightstats
•  http://spark.apache.org/docs/latest/mllib-guide.html
•  https://console.ng.bluemix.net/data/analytics/

Thank You
