F# and Data 101: Utilizing FsLab to Quickly Understand your Data
Jonathan Dexter, Technology Manager of .NET, The Nerdery
FSharp and Data 101

Feb 13, 2017

Jonathan Dexter
Transcript
  • F# and Data 101: Utilizing FsLab to Quickly Understand your Data

    Jonathan Dexter, Technology Manager of .NET, The Nerdery

  • Agenda

    What is this talk about
    Get the data!
    Transform!
    Science!
    ???
    Profit!

  • If you came here to hear about

    "Monads"
    "Functors"
    Tail-call optimization
    Immutable design
    Pattern matching

  • What we WILL talk about: FsLab

  • With Paket

    paket init
    paket add nuget fslab

  • With NuGet

    nuget install fslab -OutputDirectory packages

  • The process (not limited to FsLab)

    acquire data
    |> transform
    |> science
    |> visualize
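
    The four stages above can be sketched in plain F# with no libraries at all. This is only an illustration: the stage functions (`acquire`, `transform`, `science`, `visualize`) and the readings are hypothetical and do not come from FsLab or from the talk's code.

```fsharp
open System
open System.Globalization

// Hypothetical stages, each a plain function, chained with the forward pipe.
// The readings below are made-up illustration data.

/// Acquire: pretend these raw strings came from a file or a web service.
let acquire () = [ "3.0"; "4.5"; "oops"; "6.0" ]

/// Transform: keep only the values that parse as floats.
let transform (raw: string list) =
    raw
    |> List.choose (fun s ->
        match Double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture) with
        | true, v -> Some v
        | _ -> None)

/// Science: reduce the cleaned values to a summary statistic.
let science (values: float list) = List.average values

/// Visualize: render the result (here, just as text).
let visualize (mean: float) = sprintf "mean = %.2f" mean

let report = acquire () |> transform |> science |> visualize
printfn "%s" report
```

    Each stage only needs to accept what the previous one returns, which is why the same shape works whether the stages are hand-written or come from a library.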

  • FsLab: Scratching the Surface

    5 Libraries

  • Sorting the libraries above into categories:

    Library           Acquire   Transform   Science    Display
    F# Data           Yep!      -           -          -
    Deedle            -         Yep         -          -
    .NET Numerics     -         Supports    Supports   -
    R Type Provider   Partial   Yep         Yep        Partial
    XPlot             -         -           -          Yep

  • Step one: Acquire

  • Classic scenario: CSV

    Using CSV type provider

    type csv = FSharp.Data.CsvProvider
    let complaints = csv.Load(complaintsCsv)

    Using a data frame

    let data = Deedle.Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/data.csv")
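
    A CSV type provider needs a sample to infer columns from. The sketch below assumes an F# script run with `dotnet fsi` and the FSharp.Data package pulled in through `#r "nuget: ..."`; the inline sample, the column names (`Product`, `Count`) and the data are made up for illustration and are not the talk's complaints file.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Inline sample: column names and types are inferred from it at compile time.
// The columns here are hypothetical.
type Complaints = CsvProvider<"Product,Count\nMortgage,10\nCredit card,4">

// Parse a different document that has the same shape as the sample.
let parsed = Complaints.Parse("Product,Count\nStudent loan,3\nMortgage,12")

// Rows expose typed properties, so Count is an int rather than a string.
let total = parsed.Rows |> Seq.sumBy (fun row -> row.Count)
printfn "%d" total
```

    In a real project the static argument would more likely be a path to a sample file, and `Load` would read the full data set.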

  • Type Provider Scenario: World bank provider

    World bank provider is bundled with F# Data

    let dataContext = FSharp.Data.WorldBankData.GetDataContext()
    let highTechExports =
        dataContext
            .Countries.``United States``
            .Indicators.``High-technology exports (current US$)``

  • Type Provider Scenario: JSON provider

    type JsonContext = FSharp.Data.JsonProvider

  • Type Provider Scenario: JSON Provider (cont.)

    let missedVotes =
        JsonContext.Load(sprintf "%s?apikey=%s" missedVotesUrl apikey)

    let congressmen =
        missedVotes.Results
        |> Seq.collect (fun r -> r.Members)

    let topMissingCongressman =
        congressmen
        |> Seq.sortBy (fun m -> try m.MissedVotesPct with | ex -> 0.0m)
        |> Seq.rev
        |> Seq.head
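
    The "sort with a fallback, then take the top entry" step can be tried without the JSON provider. In this sketch the `Member` record, the `pctOrZero` helper and the sample rows are hypothetical stand-ins for the provided types, and `Seq.sortByDescending` is used in place of `Seq.sortBy` followed by `Seq.rev`.

```fsharp
open System
open System.Globalization

// Hypothetical record standing in for the type the JSON provider would generate.
type Member = { Name: string; MissedVotesPct: string }

// Made-up sample rows; the last one carries a value that does not parse.
let members =
    [ { Name = "Member A"; MissedVotesPct = "12.50" }
      { Name = "Member B"; MissedVotesPct = "57.63" }
      { Name = "Member C"; MissedVotesPct = "n/a" } ]

/// Parse a percentage, falling back to 0.0m when the text is not a number.
let pctOrZero (text: string) =
    match Decimal.TryParse(text, NumberStyles.Number, CultureInfo.InvariantCulture) with
    | true, v -> v
    | _ -> 0.0m

let topMissing =
    members
    |> Seq.sortByDescending (fun m -> pctOrZero m.MissedVotesPct)
    |> Seq.head

printfn "%s" topMissing.Name
```

    Using `TryParse` instead of `try ... with` keeps the fallback explicit and avoids throwing an exception for every malformed row.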

  • Type Provider Scenario: JSON Provider (cont.)

    {"id": "M000309", "name": "Carolyn McCarthy", "party": "D", "state": "NY", "district": "4", "total_votes": "1192", "missed_votes": "687", "missed_votes_pct": "57.63", "rank": "1", "notes": "Will retire at the end of 113th Congress."}

  • Batteries not included: SQL Provider

    *SQL Data Connection: https://msdn.microsoft.com/en-us/library/hh362320.aspx

    *SQL Entity Connection: https://msdn.microsoft.com/en-us/library/hh362320.aspx

    SQL Client: https://github.com/fsprojects/FSharp.Data.SqlClient

    SQL Provider: https://github.com/fsprojects/SQLProvider

  • SQL Data Connection

    Type provider for an entire database, MS SQL focused.

    type dbSchema = SqlDataConnection
    let db = dbSchema.GetDataContext()

  • SQL Entity Connection

    Type provider for an entire database, through the ADO.NET Entity model.

    type dbSchema = SqlEntityConnection
    let db = dbSchema.GetDataContext()

  • SQL Client

    Type provider for commands, sprocs, and queries

    use cmd = new SqlCommandProvider()

    let results = cmd.Execute(region = "USA")

  • SQL Provider

    Type provider for the database as a whole

    MS SQL, Postgres, SQLite, MySQL, Oracle, MS Access

    type sql = SqlDataProvider

    let ctx = sql.GetDataContext()

  • Step two: Transform

  • Deedle: Convert to data frame

    open Deedle
    open FSharp.Data
    open FSharp.Data.Runtime.BaseTypes

    Expander code omitted, but can be found here: https://github.com/fslaborg/FsLab/issues/14

    let dataFrame =
        [ for l in congressmen -> series [ "It" => l ] ]
        |> Frame.ofRowsOrdinal
        |> Frame.expandAllCols 10

  • Deedle: Normal syntax

    let highTechExportData =
        WorldBankData.GetDataContext()
            .Countries.``United States``
            .Indicators.``High-technology exports (current US$)``

    let highTechFrame =
        highTechExportData
        |> Frame.ofRecords
        |> Frame.indexRowsInt "Item1"
        |> Frame.mapColKeys (fun _ -> "HighTechExports")

  • Deedle: Quick manipulations

    Simple statistics

    let stats =
        [ "Min" => Stats.min highTechFrame
          "Max" => Stats.max highTechFrame
          "Average" => Stats.mean highTechFrame
          "StandardDeviation" => Stats.stdDev highTechFrame ]

    let observations =
        highTechFrame?``HighTechExports``
        |> Series.observations
        |> Seq.map (fun (k, v) -> float k, float v)

    let regression =
        observations
        |> MathNet.Numerics.LinearRegression.SimpleRegression.Fit

  • Results

    Stats

    [("Min", series [ HighTechExports => 76767867475 ])
     ("Max", series [ HighTechExports => 220884471208 ])
     ("Average", series [ HighTechExports => 152642394565.462 ])
     ("StandardDeviation", series [ HighTechExports => 39746534154.5001 ])]

    Regression fit (intercept, slope)

    (-5.799301011e+12, 2973741397.0)
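
    To see what an (intercept, slope) pair like the one above represents, a least-squares line can be fitted by hand. This is a sketch of the textbook formulas, not the library's code; `fitLine` and the sample points are hypothetical.

```fsharp
/// Ordinary least squares for y = intercept + slope * x.
/// Returns the pair in (intercept, slope) order, like the result shown above.
let fitLine (points: (float * float) list) =
    let xs = points |> List.map fst
    let ys = points |> List.map snd
    let meanX = List.average xs
    let meanY = List.average ys
    // Sum of cross-deviations and sum of squared x-deviations.
    let sxy = points |> List.sumBy (fun (x, y) -> (x - meanX) * (y - meanY))
    let sxx = xs |> List.sumBy (fun x -> (x - meanX) ** 2.0)
    let slope = sxy / sxx
    let intercept = meanY - slope * meanX
    intercept, slope

// Made-up points that lie exactly on y = 1 + 2x.
let intercept, slope = fitLine [ (0.0, 1.0); (1.0, 3.0); (2.0, 5.0); (3.0, 7.0) ]
printfn "intercept = %.1f, slope = %.1f" intercept slope
```

    When x is a calendar year, the intercept is the extrapolated value at year 0, so a very large magnitude there is expected and says little on its own.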

  • R Squared

    R squared value

    let rsquared =
        GoodnessOfFit.RSquared(regressedValues |> Seq.map snd, observations |> Seq.map snd)

    0.3274641292
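
    For intuition about this number, R squared can be computed by hand from modelled and observed values. The sketch uses the textbook definition 1 - SS_res / SS_tot, which is an assumption here and may not be how the library computes it; `rSquared` and the sample values are hypothetical.

```fsharp
/// R squared as 1 - SS_res / SS_tot (textbook definition).
let rSquared (modelled: float list) (observed: float list) =
    let meanObserved = List.average observed
    // Residual sum of squares: how far the model is from the observations.
    let ssRes = List.map2 (fun m o -> (o - m) ** 2.0) modelled observed |> List.sum
    // Total sum of squares: how far the observations are from their own mean.
    let ssTot = observed |> List.sumBy (fun o -> (o - meanObserved) ** 2.0)
    1.0 - ssRes / ssTot

// Made-up observations and the predictions of some fitted model.
let observed = [ 1.0; 2.0; 3.0; 4.0 ]
let modelled = [ 1.5; 1.5; 3.5; 3.5 ]

let r2 = rSquared modelled observed
printfn "%.2f" r2
```

    A value near 1 means the model explains most of the variation, so a value around 0.33 suggests a straight line captures only part of the trend.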

  • Deedle: Combining data and additional feature creation

    let exportFrame =
        WorldBankData.GetDataContext()
            .Countries.``United States``
            .Indicators.``Exports of goods and services (current US$)``
        |> Frame.ofRecords
        |> Frame.indexRowsInt "Item1"
        |> Frame.mapColKeys (fun _ -> "TotalExports")

    exportFrame?``HighTechExports``
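
    Feature creation in Deedle usually means assigning a new column computed from existing ones. The sketch below assumes an F# script run with `dotnet fsi` and the Deedle package referenced through `#r "nuget: ..."`; the `YearRow` record, the `Share` column and the figures are made up and are not the talk's World Bank data.

```fsharp
#r "nuget: Deedle"
open Deedle

// Made-up yearly figures; the record and column names are hypothetical.
type YearRow = { Year: int; TotalExports: float; HighTechExports: float }

let frame =
    [ { Year = 2000; TotalExports = 100.0; HighTechExports = 20.0 }
      { Year = 2001; TotalExports = 200.0; HighTechExports = 30.0 } ]
    |> Frame.ofRecords
    |> Frame.indexRowsInt "Year"

// Derive a new column from two existing ones; the division is applied row by row.
frame?Share <- frame?HighTechExports / frame?TotalExports

let share2001 = frame?Share |> Series.get 2001
printfn "%.2f" share2001
```

    Because series are aligned on their row keys, the same assignment also works when the two source columns come from different frames that share a year index.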

  • Deedle: Straight to R

    Arrrr

  • Deedle: Straight to R

    open RProvider.``base``
    open RProvider
    let rFrame = R.as_data_frame(exportFrame)
    let rFrameSummary = (R.summary rFrame)

    ["TotalExports Min.   :2.700e+10"
     "TotalExports 1st Qu.:1.110e+11"
     "TotalExports Median :3.639e+11"
     "TotalExports Mean   :6.684e+11"
     "TotalExports 3rd Qu.:1.015e+12"
     "TotalExports Max.   :2.342e+12"
     "HighTechExports Min.   :7.677e+10"
     "HighTechExports 1st Qu.:1.282e+11"
     "HighTechExports Median :1.521e+11"
     "HighTechExports Mean   :1.526e+11"
     "HighTechExports 3rd Qu.:1.763e+11"
     "HighTechExports Max.   :2.209e+11"
     "HighTechExports NA's   :29"
     "PercentageofHighTechExports Min.   :0.06563"
     "PercentageofHighTechExports 1st Qu.:0.12270"
     "PercentageofHighTechExports Median :0.15502"
     "PercentageofHighTechExports Mean   :0.13875"
     "PercentageofHighTechExports 3rd Qu.:0.16385"
     "PercentageofHighTechExports Max.   :0.18290"
     "PercentageofHighTechExports NA's   :29"]

  • Step three: Science

  • "Normal" data analysis languages

  • F# is slowly catching up

    Machine Learning Algorithms (suite)

  • F# is slowly catching up

    Natural Language Processing

  • F# is slowly catching up

    Cloud computing

  • Step four: Visualize

  • Visualizing our previous information

    With Google Charts

    let pieChart =
        congressmen
        |> Seq.filter (fun c -> try c.MissedVotes >= 0 with _ -> false)
        |> Seq.groupBy (fun c -> c.Party)
        |> Seq.map (fun g -> fst g, (snd g) |> Seq.sumBy (fun c -> c.MissedVotes))
        |> XPlot.GoogleCharts.Chart.Pie
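
    The pie chart is fed a sequence of (label, value) pairs. The grouping step that produces those pairs can be checked on its own, without XPlot; the `Member` record and the rows below are hypothetical stand-ins for the provided types.

```fsharp
// Hypothetical record standing in for the provided member type.
type Member = { Party: string; MissedVotes: int }

// Made-up rows for illustration.
let members =
    [ { Party = "D"; MissedVotes = 10 }
      { Party = "R"; MissedVotes = 7 }
      { Party = "D"; MissedVotes = 5 }
      { Party = "I"; MissedVotes = 2 } ]

/// Total missed votes per party, as (label, value) pairs.
let missedByParty =
    members
    |> Seq.groupBy (fun m -> m.Party)
    |> Seq.map (fun (party, group) -> party, group |> Seq.sumBy (fun m -> m.MissedVotes))
    |> Seq.toList

printfn "%A" missedByParty
```

    Destructuring the group as `(party, group)` reads a little more directly than `fst g` and `snd g`, and the result is the same.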

  • Visualizing our previous information

  • Visualizing our previous information

    let scatter =
        exportFrame?``TotalExports``
        |> Series.observations
        |> XPlot.GoogleCharts.Chart.Scatter

  • Visualizing our previous information

  • Summary

    Resources

    Presentation code:

    Presentation share: https://github.com/mandest/FSharpAndDataTalk

    More F# Resources:

    F# Guides on fsharp.org: http://fsharp.org/
    Functional Programming Slack: http://fpchat.com/
    F# Weekly: https://sergeytihon.wordpress.com/category/f-weekly/