Jaap Walhout September 12, 2018 tutorial uRos 2018

data.table tutorial uRos2018r-project.ro/conference2018/presentations/data... · 3 Developers: Matt Dowle, ArunSrinivasan, Jan Gorecki, Michael Chirico, Pasha Stetsenko, Tom Short,

Jul 23, 2020


Jaap Walhout, September 12, 2018

tutorial uRos 2018


1. Introduction

2. Fast read & write

3. Syntax

4. Basic operations (filtering rows & selecting columns)

5. Summarizing

6. Adding / updating variables

7. Joining datasets

8. Reshaping data

Special symbols: .N + .SD + .I
Special operator: :=

Overview


Developers: Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, Pasha Stetsenko, Tom Short, Steve Lianoglou, Eduard Antonyan, Markus Bonsch, Hugh Parsonage

On CRAN since 2006, >35 releases so far

678 packages import/depend/suggest data.table (543 CRAN + 135 Bioconductor)

Homepage: http://r-datatable.com

Introduction


Why use data.table?

Pros:
- speed
- memory efficiency
- coding flexibility
- non-equi joins

Cons:
- ‘different’ syntax

Introduction


50 million rows / 10 columns / ± 4 GB

fread("datafile.csv")

expr               time
data.table_fread   15.6
readr_read_csv     92.6
base_read.csv     559.9

fwrite(DT, "datafile.csv")

expr                time
data.table_fwrite   32.6
readr_write_csv    102.2
base_write.csv     201.9

times in seconds

Fast read & write
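A minimal sketch of the fread/fwrite round trip, on a tiny stand-in table rather than the 50-million-row file from the benchmark. The table contents and the temporary file path are assumptions made up for illustration.

```r
library(data.table)

# Small stand-in table; the benchmark on the slide used 50 million rows
DT <- data.table(id = 1:5, value = c(0.5, 1.5, 2.5, 3.5, 4.5))

# Hypothetical temporary path, chosen only for this sketch
path <- tempfile(fileext = ".csv")

fwrite(DT, path)     # write the table to a CSV file
DT2 <- fread(path)   # read it back; column types are detected automatically

str(DT2)             # inspect what came back
unlink(path)         # clean up the temporary file
```

fread returns a data.table, so the result can be used with the DT[i, j, by] syntax right away.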


Three main enhancements:

1. Column names can be used as variables inside [...]

2. Because they are variables, we can use column names to calculate stuff inside [...]

3. An additional grouping argument: by

Syntax: data.table == enhanced data.frame
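The three enhancements can be sketched side by side with the data.frame equivalent. This is only an illustration on the built-in iris data; the object names (df, DT, rows_df and so on) are made up for the example.

```r
library(data.table)

df <- iris                   # plain data.frame
DT <- as.data.table(iris)    # the same data as a data.table

# 1. Column names act as variables inside [ ]
rows_df <- df[df$Species == "setosa", ]   # data.frame: needs df$
rows_dt <- DT[Species == "setosa"]        # data.table: bare column name

# 2. Column names can be used in calculations inside [ ]
mean_dt <- DT[, mean(Petal.Width)]

# 3. The additional grouping argument: by
by_dt <- DT[, .(average = mean(Petal.Width)), by = Species]
```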


Columnar data structure: 2D – rows and columns

- subset rows: df[df$id == "01", ]
- select columns: df[, "val1"]
- subset rows & select columns: df[df$id == "01", "val1"]
- that’s about it ...

Syntax: data frame refresher


DT[i, j, by]

Syntax: general form

i: Which rows?
- vector of row numbers
- logical vector
- another data.table

j: What to do?
- summarizing
- updating variable(s)
- adding variable(s)

by: Grouped by what?
- one or more columns
- on the fly grouping var(s)


DT[i, j, by]

data.table:   i       j                  by
SQL:          where   select | update    group by

Syntax: general form


built-in iris dataset:

irisDT <- as.data.table(iris)

Example data


syntax: DT[i, j, by]

subset rows:
irisDT[Species == "setosa", ]

select columns:
irisDT[, Petal.Width]
irisDT[, .(Petal.Width)]

subset rows & select columns:
irisDT[Species == "setosa", Petal.Width]
irisDT[Species == "setosa", .(Petal.Width)]

Filtering rows & selecting columns
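The paired examples on this slide differ in what they return: a bare column name in j gives a vector, while wrapping it in .() gives a data.table. A small sketch to check this (the names v and dt are made up for the example):

```r
library(data.table)

irisDT <- as.data.table(iris)

# Bare column name in j: the result is a plain vector
v <- irisDT[Species == "setosa", Petal.Width]

# Column wrapped in .(): the result is a one-column data.table
dt <- irisDT[Species == "setosa", .(Petal.Width)]

class(v)    # type of the vector result
class(dt)   # type of the .() result
```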


subset rows:
irisDT[between(Petal.Width, 1, 2)]
irisDT[Petal.Width %between% c(1, 2)]

select columns:
irisDT[, .(Species, Sepal.Length)]

Filtering rows & selecting columns
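between() and %between% include both bounds by default, so they should match an explicit >= / <= filter. A short sketch to check that, using the bounds 1 and 2 (the object names are made up for the example):

```r
library(data.table)

irisDT <- as.data.table(iris)

# Function form and operator form of the same range filter
a <- irisDT[between(Petal.Width, 1, 2)]
b <- irisDT[Petal.Width %between% c(1, 2)]

# Explicit form with both bounds included
explicit <- irisDT[Petal.Width >= 1 & Petal.Width <= 2]

all.equal(a, b)
all.equal(a, explicit)
```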


Open the file ex1.R

- subset rows: get only the rows with a day lower than or equal to 10

- select columns: select only the Month column and make sure you get a data.table back

- subset rows & select columns: get only the Wind & Temp columns for the rows with a day higher than 5 and lower than or equal to 10

Exercise 1


1. Counts

2. Aggregating

3. Group by

Summarizing


syntax: DT[i, j, by]

count:
irisDT[Species == "setosa", .N]

count distinct:
irisDT[, uniqueN(Species)]
irisDT[Petal.Width < 0.9, uniqueN(Species)]
uniqueN(irisDT, by = "Species")

Counts
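A runnable sketch of the counting idioms on the iris data. The variable names are made up for the example; the counts in the comments follow from iris having 50 rows for each of its 3 species, with only setosa having petal widths below 0.9.

```r
library(data.table)

irisDT <- as.data.table(iris)

# .N counts the rows that remain after filtering in i
n_setosa <- irisDT[Species == "setosa", .N]              # 50

# uniqueN() counts distinct values
n_species <- irisDT[, uniqueN(Species)]                  # 3
n_small <- irisDT[Petal.Width < 0.9, uniqueN(Species)]   # 1

# uniqueN() on the whole table, counting distinct rows by one column
n_by <- uniqueN(irisDT, by = "Species")                  # 3
```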


syntax: DT[i, j, by]

Simple aggregation:
irisDT[, .(count = .N, average = mean(Petal.Width))]

Including filtering:
irisDT[Petal.Width < 0.9, .(count = .N, average = mean(Petal.Width))]

Aggregating


syntax: DT[i, j, by]

irisDT[, .N, by = Species]

irisDT[, .(average = mean(Petal.Width)), by = Species]

irisDT[Sepal.Length < 5.3, .(average = mean(Petal.Width)), by = Species]

irisDT[, .(average = mean(Petal.Width)), by = .(Species, logi = Sepal.Length < 5.3)]

Group by
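Grouping on an expression creates the grouping variable on the fly. The sketch below combines a count and an average per group; the column name short_sepal is an assumption chosen for this example.

```r
library(data.table)

irisDT <- as.data.table(iris)

# Group by Species and by a logical variable computed on the fly
res <- irisDT[, .(count = .N, average = mean(Petal.Width)),
              by = .(Species, short_sepal = Sepal.Length < 5.3)]

print(res)
```

Every row of irisDT falls into exactly one group, so the counts add up to the number of rows in the table.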


special symbol: .SD

.SD = Subset of Data

- a data.table by itself

- holds the data of the current group as defined in by

- when there is no by, .SD applies to the whole data.table

- allows for calculations on multiple columns

Group by


special symbol: .SD

irisDT[, lapply(.SD, mean), by = Species]

irisDT[Sepal.Length < 5.3, lapply(.SD, mean), by = Species]

Group by


special symbol: .SD

special symbol: .SDcols

irisDT[, lapply(.SD, mean), by = Species, .SDcols = 1:2]

irisDT[, lapply(.SD, mean), by = Species, .SDcols = grep("Length", names(irisDT))]

Group by
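A sketch of .SD restricted with .SDcols, selecting the columns by name instead of by position. Using grep(..., value = TRUE) to get column names is a choice made for this example; the slide passes column positions.

```r
library(data.table)

irisDT <- as.data.table(iris)

# Names of the columns that contain "Length"
len_cols <- grep("Length", names(irisDT), value = TRUE)

# Per-species mean of just those columns
means <- irisDT[, lapply(.SD, mean), by = Species, .SDcols = len_cols]

print(means)
```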


DT[i, j, by]

DT[1, 3, 2]   (i is evaluated first, then by, then j)

Order of execution
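A minimal sketch of that order: the filter in i is applied first, the remaining rows are grouped by by, and j is then computed per group. Splitting the call into two chained steps should therefore give the same result. The names res and manual are made up for the example.

```r
library(data.table)

irisDT <- as.data.table(iris)

# One call: filter (i), group (by), then count (j)
res <- irisDT[Sepal.Length < 5.3, .N, by = Species]

# The same steps spelled out: subset first, then count per group
manual <- irisDT[Sepal.Length < 5.3][, .N, by = Species]

all.equal(res, manual)
```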


Open the file ex2.R

- Count the number of days per month

- Calculate the average Wind speed by month for only those days that have an ozone value

- Calculate the mean temperature for the odd and even days for each month

Exercise 2


special operator: :=

- updates a data.table in place (by reference)

- can be used to:

o update existing column(s)

o add new column(s)

o delete column(s)

Updating, adding & deleting variables
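"By reference" means that := changes the existing object instead of returning a modified copy. One way to see this is that a second name for the same data.table also shows the change, while an explicit copy() does not. The names alias and snapshot are made up for this sketch.

```r
library(data.table)

DT <- as.data.table(iris)

alias <- DT            # a second name for the same object, not a copy
snapshot <- copy(DT)   # an independent copy

# Add a new column and update an existing one through the alias
alias[, Sepal.Area := Sepal.Length * Sepal.Width]
alias[, Sepal.Length := Sepal.Length * 2]

names(DT)         # DT shows the new column as well
names(snapshot)   # the copy is unchanged
```

When the original table must stay untouched, take a copy() before using :=.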


special operator: :=

irisDT[, Sepal.Length := Sepal.Length * 2]

irisDT[, `:=`(Sepal.Length = Sepal.Length * 2, Petal.Width = Petal.Width / 2)]

Updating variables


special operator: :=

irisDT[, Sepal.Length := Sepal.Length * uniqueN(Sepal.Width) / .N, by = Species]

irisDT[, `:=`(Sepal.Length = Sepal.Length * uniqueN(Sepal.Width), Petal.Width = Petal.Width / .N), by = Species]

Updating variables by group


special operator: :=
special symbol: .I

irisDT[, rownumber := .I]

irisDT[, Sepal.Area := Sepal.Length * Sepal.Width]

irisDT[, `:=`(Sepal.Area = Sepal.Length * Sepal.Width, Petal.Area = Petal.Length * Petal.Width)]

Adding variables


special operator: :=

irisDT[, Total.Sepal.Area := sum(Sepal.Area), by = Species]

irisDT[, `:=`(Total.Sepal.Area = sum(Sepal.Area), Total.Petal.Area = sum(Petal.Area)), by = Species]

Adding variables by group


special operator: :=

irisDT[, Sepal.Length := NULL]

irisDT[, (1:4) := NULL]

irisDT[, grep("Length", names(irisDT)) := NULL]

Deleting variables
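A short sketch of deleting columns with := NULL, first by name and then through a character vector of names. Wrapping the vector in parentheses, as in (len_cols), makes := use the names stored in it rather than a column literally called len_cols. The name len_cols is an assumption made for this example.

```r
library(data.table)

DT <- as.data.table(iris)

# Delete a single column by name
DT[, Sepal.Length := NULL]

# Delete every remaining column whose name contains "Length"
len_cols <- grep("Length", names(DT), value = TRUE)
DT[, (len_cols) := NULL]

names(DT)   # "Sepal.Width" "Petal.Width" "Species"
```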


Open the file ex3.R

- Change the Wind column from miles per hour to kilometers per hour (1 mph = 1.6 km/h)

- Calculate a new chill variable (Wind * Temperature)

- Calculate the average chill by month and add that as a new variable

- Remove the Ozone and Solar.R columns

Exercise 3


- subset rows: DT[id == "01", ]

- select columns: DT[, val1]

- subset rows & select columns: DT[id == "01", val1]

Joining datasets


DT[i, j, by]

Joining datasets

i: Which rows?
- vector of row numbers
- logical vector
- another data.table

j: What to do?
- summarizing
- updating variable(s)
- adding variable(s)

by: Grouped by what?
- one or more columns
- on the fly grouping var(s)


Example data

irisDT <- copy(iris)
setDT(irisDT)

irisH <- data.table(Species = c("setosa", "versicolor", "virginica"),
                    Species.full = c("Iris setosa", "Iris versicolor", "Iris virginica"),
                    height = 1:3,
                    soil = c("mud", "rock", "sand"))

Joining datasets


syntax: DT[i, on, j, by]

irisDT[irisH, on = .(Species)]

irisDT[irisH, on = "Species"]

irisDT[irisH, on = .(Species = Spec, other_col)]

Joining datasets


syntax: DT[i, on, j, by]

irisDT[irisH, on = .(Species), Species.full := Species.full]

irisDT[irisH, on = .(Species), `:=`(Species.full = Species.full, height = height, soil = soil)]

irisDT[irisH, on = .(Species), `:=`(Species.full = i.Species.full, height = i.height, soil = i.soil)]

Joining datasets
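A self-contained sketch of the update join, using the i. prefix to state explicitly that the values come from the table in i (here irisH). The lookup table repeats part of the example data from the earlier slide.

```r
library(data.table)

irisDT <- as.data.table(iris)

# Lookup table with one row per species (subset of the example data)
irisH <- data.table(Species = c("setosa", "versicolor", "virginica"),
                    height = 1:3,
                    soil = c("mud", "rock", "sand"))

# Update join: add height and soil to irisDT by reference.
# The i. prefix refers to columns of irisH, the table passed as i.
irisDT[irisH, on = .(Species), `:=`(height = i.height, soil = i.soil)]

head(irisDT)
```

Because := updates by reference, irisDT keeps all of its rows and gains the two new columns; no new table is created.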


syntax: DT[i, on, j, by]

Like %>% from the tidyverse, you can also chain data.table operations together:

irisDT[...][...][...]

irisDT[irisH, on = .(Species), Species.full := Species.full][, median(Sepal.Length), by = Species.full]

Joining & chaining
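A runnable sketch of the chained call: the first [ ] adds Species.full through an update join, the second [ ] summarizes by that new column. The result column name median_sepal is an assumption added for readability.

```r
library(data.table)

irisDT <- as.data.table(iris)
irisH <- data.table(Species = c("setosa", "versicolor", "virginica"),
                    Species.full = c("Iris setosa", "Iris versicolor", "Iris virginica"))

# Step 1: update join adds Species.full to irisDT
# Step 2: chained [ ] computes the median sepal length per full name
res <- irisDT[irisH, on = .(Species), Species.full := i.Species.full][
  , .(median_sepal = median(Sepal.Length)), by = Species.full]

print(res)
```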


Open the file ex4.R

- Use a join to add the month name from 'airmonths' to 'air'

- Use a join to add both the month name and the month abbreviation from 'airmonths' to 'air'

- Use a join to add the month name from 'airmonths' to 'air'; then use chaining to calculate the median Wind speed for each month name

Exercise 4


From wide to long:

irisMelted <- melt(irisDT, id.vars = "Species")

melt(data, id.vars, measure.vars, variable.name = "variable", value.name = "value", na.rm = FALSE, variable.factor = TRUE, value.factor = FALSE)

See also: ?melt

Reshaping data


From long to wide:

dcast(irisMelted, Species ~ variable)

dcast(data, formula, fun.aggregate = NULL, sep = "_", ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess(data))

See also: ?dcast

Reshaping data
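A sketch of the wide-to-long-to-wide round trip on the iris data. Each Species/variable combination holds 50 values in the melted table, so this example passes an explicit fun.aggregate (mean, chosen for illustration); without one, dcast falls back to counting the values per cell.

```r
library(data.table)

irisDT <- as.data.table(iris)

# Wide to long: one row per Species / measurement / value
irisMelted <- melt(irisDT, id.vars = "Species")

# Long to wide: one row per Species, one column per measurement,
# aggregating the 50 values in each cell with mean()
irisWide <- dcast(irisMelted, Species ~ variable,
                  fun.aggregate = mean, value.var = "value")

print(irisWide)
```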


more joins: non-equi joins + rolling joins

more special symbols: .BY + .GRP

special grouping functions: rowid + rleid

set* functions: setkey + setorder + setcolorder + setnames + ...

and even more: frank + shift + CJ + tstrsplit + ...

What else is there to discover?


Overview of getting started vignettes

DataCamp’s data.table course (paid)

Stack Overflow [data.table] tag (>7700 questions)

Want to learn more?


Thank you for your attention!

The End