Highlights of EARL 2018 - London R

Highlights of EARL 2018

Adnan Fiaz

Julian Ferry

Hannah Frick

Dragoș Moldovan-Grünfeld

Agenda

• Facts

• Highlights

• Next

Facts

EARL London 2018

• 5th EARL London Conference

• 3 Keynote speakers

• 5 Workshops

• 3 Streams

• 56 Presentations – lightning talks for the first time

• 1 Panel Discussion

• 2 Evening Networking Events

The Workshops

• R in 6 Hours

• Shiny – Beyond the Basics

• Deep Learning with Keras in R

• A Crash Course in Python for R Users

• Functional Programming with purrr

Attendees

Speakers

Reception

On the way to the IWM

Data Driven Decision-Making

Adnan Fiaz

Data Driven Decision-making

Keynotes:

• Winning in a data-driven world, Edwina Dunn

• Building a Data Driven Company, Rich Pugh

Talks:

• Decision Lead Data Science, Steven Wilkins

• A brief history of Data at Autotrader, Paul Owens

• R – The tool for Screwfix, Gavin Jackson

Have a (data) strategy

“Focus on the data you need rather than the data you have” – Edwina Dunn

“Know how to build the ‘engine’, now it needs to drive the car” – Rich Pugh

Not madmen but math (wo)men

“A key differentiator for businesses…is a culture of continuous learning” – Edwina Dunn

“The key role of data scientists in the coming years is one of educator” – Rich Pugh

Special mention

Finding out what Parliament thinks, Sam Tazzyman (Ministry of Justice)

• Explaining complex topics simply

• Show your code in action (and link to it)

• Why so serious?

Machine Learning

Julian Ferry

Hannah Frick

Balancing model complexity and interpretability

In defence of complexity:

• The power of machine learning in segmenting CRM databases, Jeremy Horne

• The making of a real-world Moneyball – finding undervalued players with h2o, Jo-Fai Chow

In defence of interpretability:

• Understanding your model, Kasia Kulma

• Measuring Marketing Performance, Wojtek Kostelecki

Complex models in CRM segmentation - Jeremy Horne

• How do we identify the customers on a CRM database who are most likely to make a purchase this month?

• Most databases are dominated by lower value segments

Separating low value segments

• Tools used:

– kernlab package

– Boosting to focus on outliers – outcomes that are not ‘normal’

Key takeaway:

Machine learning models can help us differentiate between customers within the same group, where decision-tree type rules fail.

In defence of interpretability – Kasia Kulma

LIME – Local Interpretable Model-Agnostic Explanations

Predicting baseball player performance with h2o, Jo-Fai Chow

• Problem: Finding undervalued baseball players in Major League Baseball (MLB)

End result – Shiny + LIME

The beauty of linear models, Wojtek Kostelecki

• Modelling contributions to mileage

The beauty of linear models, Wojtek Kostelecki

• Using a linear model we can extract the individual contribution of each variable to sales
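The decomposition described on this slide can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, and the variable names (tv, search, sales) are hypothetical rather than taken from the talk. For a linear model, each term's contribution to a fitted value is its coefficient multiplied by the observed value of that variable, so the contributions add up exactly to the prediction.

```r
# Simulated weekly data (hypothetical; not from the talk)
set.seed(42)
n <- 104
marketing <- data.frame(
  tv     = runif(n, 0, 100),
  search = runif(n, 0, 50)
)
marketing$sales <- 200 + 3 * marketing$tv + 5 * marketing$search +
  rnorm(n, sd = 10)

fit <- lm(sales ~ tv + search, data = marketing)

# Design matrix has one column per term: (Intercept), tv, search
design <- model.matrix(fit)

# Contribution of each term = coefficient * observed value
contributions <- sweep(design, 2, coef(fit), `*`)

# The per-term contributions of each row sum to that row's fitted value
stopifnot(isTRUE(all.equal(unname(rowSums(contributions)),
                           unname(fitted(fit)))))

# Share of total fitted sales attributed to each term
contribution_share <- colSums(contributions) / sum(fitted(fit))
print(round(contribution_share, 3))
```

The additive structure is what makes this readout possible; a model with interactions or non-linear terms would need a different attribution method.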

David Smith – Not Hotdog

• Not Hotdog: Image recognition with R and the Custom Vision API





R Code: https://github.com/revodavid/nothotdog

Lars Kjeldgaard - modelgrid

• A ‘caret’-based Framework for Training Multiple Tax Fraud Detection Models

• Framework for creating, managing and training multiple caret models

• Pipe-friendly


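library(modelgrid)
library(caret)    # trainControl(), twoClassSummary(), resamples(), GermanCredit
library(dplyr)    # %>%, pull(), select()
library(lattice)  # bwplot()

data(GermanCredit)

# resampling settings (not shown on the slide; assumed here: 5-fold CV scored on ROC)
tr_control <- trainControl(method = "cv", number = 5,
                           summaryFunction = twoClassSummary, classProbs = TRUE)

# create model grid object
credit_default_models <- model_grid()

# shared settings
credit_default_models <- credit_default_models %>%
  share_settings(
    y = GermanCredit %>% pull(Class),
    x = GermanCredit %>% select(-Class),
    metric = "ROC",
    trControl = tr_control
  )

# add a random forest model
credit_default_models <- credit_default_models %>%
  add_model(model_name = "Funky Forest",
            method = "rf",
            tuneGrid = data.frame(mtry = c(10, 20)))

# add an eXtreme gradient boosting model
credit_default_models <- credit_default_models %>%
  add_model(model_name = "Big Boost",
            method = "xgbTree",
            nthread = 8)

# train models and evaluate
credit_default_models <- credit_default_models %>%
  train(.)

credit_default_models$model_fits %>%
  resamples(.) %>%
  bwplot(.)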

Reproducibility and R in Production

Dragoș Moldovan-Grünfeld

Reproducibility & R in Production

• Keynote:

– RMarkdown: The Bigger Picture, Garrett Grolemund, RStudio

• Talks:

– Beyond Prototypes. A Journey to The Production Land, Omayma Said, Freelance

– Bridging the gap between Data Scientists and Engineers; using R in production, Leanne Fitzpatrick, HelloSoda

Garrett Grolemund (RStudio)

• Reproducibility crisis:

– “We created a cargo cult by confusing math with science. Now we must undo it.”

– “Create maps, not proofs”

– “Reproducibility is an opportunity”

Leanne Fitzpatrick (HelloSoda)

• “Bridging the gap between Data Scientists and Engineers; using R in production”

• Barriers to entry (R in production)

– Engineering

– Infrastructure

– Data science

– Cultural

Overcoming barriers

• Deployment:

– central to the data science process

– Solution: Docker

• Plumbing/integration

– Solution: code as a service with Plumber

• Package and dependency management

– Solution: pacman
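As a rough illustration of "code as a service", the sketch below shows what a plumber endpoint file might look like. It is an assumption-laden example, not code from the talk: the file name plumber.R, the /predict route, the helper predict_mpg() and the mtcars model are all hypothetical. plumber reads the `#*` comment annotations, so the file itself is plain R.

```r
# plumber.R (hypothetical file name)

# A small model fitted at start-up; mtcars ships with base R
model <- lm(mpg ~ hp + wt, data = mtcars)

# Plain R helper, so the scoring logic can be tested without a web server.
# Query parameters arrive as strings, hence the as.numeric() conversion.
predict_mpg <- function(hp, wt) {
  newdata <- data.frame(hp = as.numeric(hp), wt = as.numeric(wt))
  unname(predict(model, newdata))
}

#* Predict fuel efficiency for a car
#* @param hp Gross horsepower
#* @param wt Weight (1000 lbs)
#* @get /predict
function(hp, wt) {
  list(mpg = predict_mpg(hp, wt))
}
```

Assuming the plumber package is installed, the file could be served with `plumber::plumb("plumber.R")$run(port = 8000)` and queried at `/predict?hp=110&wt=2.62`. Keeping the logic in a named helper means the same function can be unit tested and reused inside a Docker image.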

Overcoming barriers (cont’d)

• Reproducible framework

– Solution: ProjectTemplate http://projecttemplate.net

• Stability & error handling

– Solution: testing & CI

– testthat and usethis

• Scaling

– Solution: Docker

• Culture

– Solution: collaboration

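The "stability & error handling" point can be made concrete with a small testthat sketch. The function safe_score() and its fallback behaviour are hypothetical, chosen only to show the pattern of handling an error explicitly and then pinning that behaviour down with tests that a CI job could run.

```r
library(testthat)

# Hypothetical scoring helper: on bad input it warns and returns a default
# instead of crashing the calling service.
safe_score <- function(x, default = NA_real_) {
  tryCatch(
    {
      stopifnot(is.numeric(x), length(x) == 1)
      log(x + 1)
    },
    error = function(e) {
      warning("scoring failed: ", conditionMessage(e))
      default
    }
  )
}

test_that("valid input is scored", {
  expect_equal(safe_score(0), 0)
})

test_that("invalid input falls back to the default with a warning", {
  expect_warning(result <- safe_score("a"), "scoring failed")
  expect_true(is.na(result))
})
```

In a package set up with usethis, tests like these would typically live under tests/testthat/ and run on every commit.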

Omayma Said (Freelance)

• “Beyond Prototypes. A Journey to The Production Land”

• Challenges: reproducibility, portability, and accessibility

• Docker

• Use/Modify available Dockerfiles

• Use helper packages

Helper packages

• containerit

– Package an R workspace and all dependencies as a Docker container

• liftr

– Containerize R Markdown documents for continuous reproducibility

• rize

– A robust method to automagically dockerize your Shiny application

Special mention

Using R and Shiny to improve hospital operations, Christian Moroy and Jonathan Bruce (Edge Health)

• Predict how long operations take using R

• Recommend free slots that should be filled via Shiny

• Disseminate daily reports via markdown + email (from R)

• Saved a predicted £4m in 2017/18

Next?

EARL US Roadshow

7 November 2018, Seattle, WA

Julia Silge

• Data Scientist @ Stack Overflow

• Co-author of Text Mining with R with David Robinson

• Co-author of the tidytext package

EARL US Roadshow

9 November 2018, Houston, TX

Hadley Wickham

• Chief Scientist @ RStudio

• Author of numerous books on R

• Prolific R package author

Robert Gentleman

• Vice President of Computational Biology @ 23andMe

• One of the designers of the R programming language

EARL US Roadshow

13 November 2018, Boston, MA

Bob Rudis (@hrbrmstr)

• Chief Security Data Scientist @ Rapid7

• Prolific tweeter, package author and blogger

The End
