Hadley Wickham @hadleywickham Chief Scientist, RStudio Coding data science March 2019
Coding data science - informatik.univie.ac.at · What is data science? Data science = data analysis + programming. Tidy Import Surprises, but doesn't scale Consistent way of Create

Jul 28, 2020


Hadley Wickham @hadleywickham Chief Scientist, RStudio

Coding data science

March 2019


What is data analysis?

Data analysis is the process by which data becomes understanding, knowledge and insight

Import

Tidy: Consistent way of storing data

Understand (Transform, Visualise, Model)

Transform: Create new variables & new summaries

Visualise: Surprises, but doesn't scale

Model: Scales, but doesn't (fundamentally) surprise

Communicate

Automate


What is data science?

Data science = data analysis + programming

Import

Tidy

Transform

Visualise

Model

Communicate

Automate

Program

[Slide: the word "Program" tiled across the whole background, behind Import, Tidy, Transform, Visualise, Model, Communicate and Automate]

Import: readr, readxl, haven, xml2

Tidy: tibble, tidyr

Transform: dplyr, forcats, hms, lubridate, stringr

Visualise: ggplot2

Model: recipes, parsnip

Communicate: shiny, rmarkdown

Program: purrr, magrittr

tidyverse.org

r4ds.had.co.nz


Why code?


The disadvantages of code are obvious

Why code?

1. Code is text
2. Code is readable
3. Code is reproducible

⌘C Copy

⌘V Paste


What have you done?


.Rmd Prose and code

.md Prose and results

.html Human shareable

.Rmd Prose and code

.md Prose and results

.html .doc .tex .pdf .ppt ...


What about non-programmers?

You don’t need to be a programmer to code!


Your turn: What data do we need to recreate this plot?

Underlying data

# A tibble: 193 x 6
   country             four_regions  year income life_exp      pop
   <chr>               <chr>        <int>  <int>    <dbl>    <int>
 1 Afghanistan         asia          2015   1750     57.9 33700000
 2 Albania             europe        2015  11000     77.6  2920000
 3 Algeria             africa        2015  13700     77.3 39900000
 4 Andorra             europe        2015  46600     82.5    78000
 5 Angola              africa        2015   6230     64   27900000
 6 Antigua and Barbuda americas      2015  20100     77.2    99900
 7 Argentina           americas      2015  19100     76.5 43400000
 8 Armenia             europe        2015   8180     75.4  2920000
 9 Australia           asia          2015  43800     82.6 23800000
10 Austria             europe        2015  44100     81.4  8680000
# … with 183 more rows

Phonics are important!

gapminder %>% filter(year == 2015) -> gapminder15

Take the gapminder data, then filter rows where year equals 2015, creating the gapminder15 variable.
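To try the pipeline above without the full dataset, here is a minimal sketch. It assumes the dplyr package is installed. The first three rows are copied from the table on the earlier slide; the fourth row is a placeholder for a different year, with its measurements left as NA, and exists only so that filter() has something to drop.

```r
library(dplyr)

# Three rows from the slide's 2015 table, plus one placeholder row for 2014.
# The placeholder's values are NA on purpose: they are not real data.
gapminder <- tibble(
  country      = c("Afghanistan", "Albania", "Algeria", "Algeria"),
  four_regions = c("asia", "europe", "africa", "africa"),
  year         = c(2015L, 2015L, 2015L, 2014L),
  income       = c(1750L, 11000L, 13700L, NA),
  life_exp     = c(57.9, 77.6, 77.3, NA),
  pop          = c(33700000L, 2920000L, 39900000L, NA)
)

# Left-to-right form from the slide: take the data, then filter, creating gapminder15.
gapminder %>% filter(year == 2015) -> gapminder15

# The same result with the assignment written on the left.
gapminder15_alt <- gapminder %>% filter(year == 2015)

nrow(gapminder15)  # 3: the placeholder 2014 row is dropped
```

The right-pointing arrow keeps the code in the same order as the spoken sentence, which is the point of the slide; the left-assignment form is shown only for comparison.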

gapminder15 %>% ggplot(aes(income, life_exp))

[Plot: empty panel; x axis income (0 to 125000), y axis life_exp (50 to 80)]

gapminder15 %>% ggplot(aes(income, life_exp)) + geom_point()

[Plot: scatterplot of life_exp against income; x axis income (0 to 125000), y axis life_exp (50 to 80)]


gapminder15 %>% ggplot(aes(income, life_exp)) + geom_point() + scale_x_log10()

[Plot: scatterplot of life_exp against income; x axis income on a log10 scale (1e+03 to 1e+05), y axis life_exp (50 to 80)]

gapminder15 %>% ggplot(aes(income, life_exp)) + geom_point(aes(colour = four_regions)) + scale_x_log10()

[Plot: scatterplot of life_exp against income (log10 x axis); points coloured by four_regions: africa, americas, asia, europe]

gapminder15 %>% ggplot(aes(income, life_exp)) + geom_point(aes(colour = four_regions, size = pop)) + scale_x_log10()

[Plot: scatterplot of life_exp against income (log10 x axis); points coloured by four_regions (africa, americas, asia, europe) and sized by pop (legend: 5e+08, 1e+09)]
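The layered build-up above can be reproduced on a small scale. This is a sketch that assumes the ggplot2 package is installed and uses only the ten rows printed on the "Underlying data" slide, so the resulting picture is much sparser than the one in the talk. It calls ggplot(gapminder15, ...) directly, which is equivalent to piping gapminder15 into ggplot().

```r
library(ggplot2)

# The ten rows shown on the "Underlying data" slide (year omitted; all are 2015).
gapminder15 <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda", "Argentina", "Armenia", "Australia", "Austria"),
  four_regions = c("asia", "europe", "africa", "europe", "africa",
                   "americas", "americas", "europe", "asia", "europe"),
  income = c(1750, 11000, 13700, 46600, 6230, 20100, 19100, 8180, 43800, 44100),
  life_exp = c(57.9, 77.6, 77.3, 82.5, 64, 77.2, 76.5, 75.4, 82.6, 81.4),
  pop = c(33700000, 2920000, 39900000, 78000, 27900000,
          99900, 43400000, 2920000, 23800000, 8680000)
)

# Each "+" adds one piece: a point layer with colour and size mappings,
# then a log10 scale for the x axis.
p <- ggplot(gapminder15, aes(income, life_exp)) +
  geom_point(aes(colour = four_regions, size = pop)) +
  scale_x_log10()

# print(p) draws the plot in an interactive session.
```

Removing the size or colour mapping, or the scale_x_log10() call, steps back through the earlier slides one layer at a time.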


But

[Plot: scatterplot of hwy (20 to 40) against displ (2 to 7)]

df %>% rename(date = `Date Created`, name = Name, plays = `Total Plays`, loads = `Total Loads`, apv = `Average Percent Viewed`)

And this is painful!
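One way to reduce some of this typing, which is not shown in the talk, is to derive the new names from the old ones with a function. The sketch below assumes dplyr 1.0.0 or later (for rename_with()); the data frame and the to_snake() helper are hypothetical, and the column values are placeholders.

```r
library(dplyr)

# Hypothetical one-row data frame carrying the column names from the slide.
df <- tibble(
  `Date Created` = NA,
  Name = NA,
  `Total Plays` = NA,
  `Total Loads` = NA,
  `Average Percent Viewed` = NA
)

# Hypothetical helper: replace spaces with underscores, then lower-case.
to_snake <- function(x) tolower(gsub(" ", "_", x, fixed = TRUE))

renamed <- df %>% rename_with(to_snake)

names(renamed)
# "date_created" "name" "total_plays" "total_loads" "average_percent_viewed"
```

This only automates the mechanical part. Short names such as plays or apv still have to be chosen and typed by hand, so the pain the slide describes does not go away entirely.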

What next?

df %>% filter(n > 1e6) %>% mutate(x = f(y)) %>% ???

# How predictable is the next step from the previous steps?


Can we do more with autocomplete?

Where do dialogs and autocomplete intersect?


Learning from examples

http://vis.stanford.edu/papers/wrangler


https://twitter.com/carroll_jono/status/914254139873361920


Fin


We wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.

— John Chambers, “Stages in the Evolution of S”


This work is licensed as Creative Commons Attribution-ShareAlike 4.0 International

To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/