Highlights of EARL 2018
Adnan Fiaz
Julian Ferry
Hannah Frick
Dragoș Moldovan-Grünfeld
Agenda
Highlights
Next
Facts
EARL London 2018
• 5th EARL London Conference
• 3 Keynote speakers
• 5 Workshops
• 3 Streams
• 56 Presentations, with lightning talks for the first time
• 1 Panel Discussion
• 2 Evening Networking Events
The Workshops
• R in 6 Hours
• Shiny – Beyond the Basics
• Deep Learning with Keras in R
• A Crash Course in Python for R Users
• Functional Programming with purrr
Attendees
Speakers
Reception
On the way to the IWM
Data Driven Decision-Making
Adnan Fiaz
Keynotes:
• Winning in a data-driven world, Edwina Dunn
• Building a Data Driven Company, Rich Pugh
Talks:
• Decision Lead Data Science, Steven Wilkins
• A brief history of Data at Autotrader, Paul Owens
• R – The tool for Screwfix, Gavin Jackson
Have a (data) strategy
“Focus on the data you need rather than the data you have” – Edwina Dunn
“Know how to build the ‘engine’, now it needs to drive the car” – Rich Pugh
Not madmen but math (wo)men
“A key differentiator for businesses…is a culture of continuous learning” – Edwina Dunn
“The key role of data scientists in the coming years is one of educator” – Rich Pugh
Special mention
Finding out what Parliament thinks, Sam Tazzyman (Ministry of Justice)
• Explaining complex topics simply
• Show your code in action (and link to it)
• Why so serious?
Machine Learning
Julian Ferry
Hannah Frick
Balancing model complexity and interpretability
In defence of complexity:
• The power of machine learning in segmenting CRM databases, Jeremy Horne
• The making of a real-world Moneyball – finding undervalued players with h2o, Jo-Fai Chow
In defence of interpretability:
• Understanding your model, Kasia Kulma
• Measuring Marketing Performance, Wojtek Kostelecki
Complex models in CRM segmentation - Jeremy Horne
• How do we identify the customers on a CRM database who are most likely to make a purchase this month?
• Most databases are dominated by lower value segments
Separating low value segments
• Tools used:
– kernlab package
– Boosting to focus on outliers: outcomes that are not ‘normal’
Key takeaway:
Machine learning models can help us differentiate between customers within the same group, where decision-tree type rules fail.
In defence of interpretability - Kasia Kulma
LIME – Local Interpretable Model-Agnostic Explanations
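The idea behind LIME can be sketched in a few lines of base R. This is a simplified illustration, not the lime package and not the code from the talk: the black-box function, the feature names x1 and x2, and the kernel width are all assumptions chosen for the example. The steps are: perturb the instance, score the perturbed points with the black box, weight them by proximity, and fit a weighted linear surrogate.

```r
# Minimal sketch of the idea behind LIME (not the lime package itself).
set.seed(42)

# Hypothetical "black box": any function returning predictions for a data frame.
black_box <- function(df) 1 / (1 + exp(-(2 * df$x1 - df$x2^2)))

explain_locally <- function(instance, predict_fn, n = 2000, sd = 0.5, width = 1) {
  # 1. Perturb the instance by sampling points around it
  perturbed <- data.frame(
    x1 = rnorm(n, instance$x1, sd),
    x2 = rnorm(n, instance$x2, sd)
  )
  # 2. Ask the black box for predictions on the perturbed points
  perturbed$pred <- predict_fn(perturbed)
  # 3. Weight each point by its proximity to the instance (Gaussian kernel)
  dist2 <- (perturbed$x1 - instance$x1)^2 + (perturbed$x2 - instance$x2)^2
  perturbed$w <- exp(-dist2 / width^2)
  # 4. Fit an interpretable (linear) surrogate on the weighted sample
  surrogate <- lm(pred ~ x1 + x2, data = perturbed, weights = w)
  coef(surrogate)
}

instance <- data.frame(x1 = 0.5, x2 = 1)
local_effects <- explain_locally(instance, black_box)
print(round(local_effects, 3))
```

The signs of the surrogate coefficients give the local explanation: around this instance, x1 pushes the prediction up and x2 pushes it down. The full method also handles categorical features and selects a small number of features to report, which this sketch leaves out.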
Predicting baseball player performance with h2o, Jo-Fai Chow
• Problem: Finding undervalued baseball players in Major League Baseball (MLB)
End result – Shiny + LIME
The beauty of linear models, Wojtek Kostelecki
• Modelling contributions to mileage
• Using a linear model we can extract the individual contribution of each variable to sales
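A small worked example of what "individual contribution" means for a linear model. The data below are simulated and the variable names (tv, online, sales) are made up for illustration; they are not from the talk. Each term contributes its coefficient times its value, and the contributions add up to the fitted sales.

```r
set.seed(1)

# Simulated (made-up) weekly marketing data
n <- 100
marketing <- data.frame(
  tv     = runif(n, 0, 100),
  online = runif(n, 0, 50)
)
marketing$sales <- 200 + 3 * marketing$tv + 5 * marketing$online +
  rnorm(n, sd = 10)

fit <- lm(sales ~ tv + online, data = marketing)

# Contribution of each term to each fitted value: coefficient * value
X <- model.matrix(fit)                        # includes the intercept column
contributions <- sweep(X, 2, coef(fit), `*`)  # one column per term

# The contributions add up to the fitted values, row by row
sums_match <- all.equal(unname(rowSums(contributions)), unname(fitted(fit)))
print(sums_match)

# Share of total modelled sales attributed to each term
print(round(colSums(contributions) / sum(fitted(fit)), 3))
```

This additive breakdown is what makes the linear model easy to explain to stakeholders; more complex models need a tool such as LIME to get a comparable per-variable view.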
David Smith – Not Hotdog
• Not Hotdog: Image recognition with R and the Custom Vision API
R Code: https://github.com/revodavid/nothotdog
Lars Kjeldgaard - modelgrid
• A ‘caret’-based Framework for Training Multiple Tax Fraud Detection Models
• Framework for creating, managing and training multiple caret models
• Pipe-friendly
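library(modelgrid)
library(caret)   # GermanCredit data, trainControl()
library(dplyr)   # %>%, pull(), select()

data(GermanCredit)

# resampling settings shared by all models
# (assumed here; the definition of tr_control is not shown on the slide)
tr_control <- trainControl(method = "cv", number = 5,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE)

# create model grid object
credit_default_models <- model_grid()

# shared settings
credit_default_models <-
  credit_default_models %>%
  share_settings(
    y = GermanCredit %>% pull(Class),
    x = GermanCredit %>% select(-Class),
    metric = "ROC",
    trControl = tr_control
  )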
# add a random forest model
credit_default_models <-
  credit_default_models %>%
  add_model(model_name = "Funky Forest",
            method = "rf",
            tuneGrid = data.frame(mtry = c(10, 20)))

# add an eXtreme gradient boosting model
credit_default_models <-
  credit_default_models %>%
  add_model(model_name = "Big Boost",
            method = "xgbTree",
            nthread = 8)
# train models and evaluate
credit_default_models <- credit_default_models %>%
  train(.)

credit_default_models$model_fits %>%
  resamples(.) %>%
  bwplot(.)
Reproducibility and R in Production
Dragoș Moldovan-Grünfeld
• Keynote:
– RMarkdown: The Bigger Picture, Garrett Grolemund, RStudio
• Talks:
– Beyond Prototypes. A Journey to The Production Land, Omayma Said, Freelance
– Bridging the gap between Data Scientists and Engineers; using R in production, Leanne Fitzpatrick, HelloSoda
Garrett Grolemund (RStudio)
• Reproducibility crisis:
– “We created a cargo cult by confusing math with science. Now we must undo it.”
– “Create maps, not proofs”
– “Reproducibility is an opportunity”
Leanne Fitzpatrick (HelloSoda)
• “Bridging the gap between Data Scientists and Engineers; using R in production”
• Barriers to entry (R in production)
– Engineering
– Infrastructure
– Data science
– Cultural
Overcoming barriers
• Deployment:
– central to the data science process
– Solution: Docker
• Plumbing/integration
– Solution: code as a service with Plumber
• Package and dependency management
– Solution: pacman
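A minimal sketch of what "code as a service" can look like with plumber. The model, the /predict endpoint and the function name predict_mpg are illustrative assumptions, not taken from the talk. The `#*` comments are plumber annotations; the file is also plain R, so the scoring function can be called and tested directly.

```r
# plumber.R: a hypothetical scoring service (names are illustrative)

# Stand-in "model", fitted once when the file is loaded
model <- lm(mpg ~ wt, data = mtcars)

# Plain R function that does the scoring
predict_mpg <- function(wt) {
  wt <- as.numeric(wt)  # query parameters arrive as strings
  if (is.na(wt)) stop("wt must be numeric")
  list(wt = wt, mpg = unname(predict(model, newdata = data.frame(wt = wt))))
}

#* Predict miles per gallon from car weight (in 1000 lbs)
#* @param wt Car weight
#* @get /predict
function(wt) predict_mpg(wt)

# To serve it (requires the plumber package):
#   plumber::plumb("plumber.R")$run(port = 8000)
# then request http://localhost:8000/predict?wt=3
```

Keeping the scoring logic in an ordinary function and the endpoint as a thin wrapper means the same code can be unit tested without starting a server. Dependencies for such a service could be loaded with `pacman::p_load()`, which installs a package if it is missing and then attaches it.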
Overcoming barriers (cont’d)
• Reproducible framework
– Solution: ProjectTemplate, http://projecttemplate.net
• Stability & error handling
– Solution: testing & CI
– testthat and usethis
• Scaling
– Solution: Docker
• Culture
– Solution: collaboration
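For the stability and error handling point, a short testthat sketch. The function score_customer and its formula are hypothetical, invented for this example; it assumes the testthat package is installed. The tests check both the normal result and that bad input fails loudly instead of producing a silent wrong score.

```r
library(testthat)

# Hypothetical function from a production scoring pipeline
score_customer <- function(spend, visits) {
  if (!is.numeric(spend) || !is.numeric(visits)) {
    stop("spend and visits must be numeric")
  }
  if (any(spend < 0 | visits < 0)) {
    stop("spend and visits must be non-negative")
  }
  0.7 * log1p(spend) + 0.3 * log1p(visits)
}

test_that("score_customer returns one score per customer", {
  expect_length(score_customer(c(10, 20), c(1, 2)), 2)
  expect_equal(score_customer(0, 0), 0)
})

test_that("score_customer fails loudly on bad input", {
  expect_error(score_customer("10", 1), "must be numeric")
  expect_error(score_customer(-1, 1), "non-negative")
})
```

Inside a package, `usethis::use_testthat()` sets up the test directory and `usethis::use_test()` creates a test file, so tests like these can run on every commit in CI.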
Omayma Said (Freelance)
• “Beyond Prototypes. A Journey to The Production Land”
• Challenges: reproducibility, portability, and accessibility
• Docker
• Use/Modify available Dockerfiles
• Use helper packages
Helper packages
• containerit
– Package an R workspace and all dependencies as a Docker container
• liftr
– Containerize R Markdown documents for continuous reproducibility
• rize
– A robust method to automagically dockerize your Shiny application
Special mention
Using R and Shiny to improve hospital operations, Christian Moroy and Jonathan Bruce (Edge Health)
• Predict how long operations take using R
• Recommend free slots that should be filled via Shiny
• Disseminate daily reports via markdown + email (from R)
• Saved a predicted £4m in 2017/18
Next?
EARL US Roadshow
7 November 2018, Seattle, WA
Julia Silge
Data Scientist @ Stack Overflow
Co-author Text Mining with R with David Robinson
Co-author tidytext package
EARL US Roadshow
9 November 2018, Houston, TX
Hadley Wickham
Chief Scientist @ RStudio
Author of numerous books on R
Prolific R package author
Robert Gentleman
Vice President of Computational Biology @ 23andMe
One of the designers of the R programming language
EARL US Roadshow
13 November 2018, Boston, MA
Bob Rudis (@hrbrmstr)
Chief Security Data Scientist @ Rapid7
Prolific tweeter, package author and blogger
The End