• Chapter 10 of the book “R for Data Science” by Wickham and Grolemund
• R libraries: tidyverse, dplyr, nycflights13
NYC flight data: nycflights13
nycflights13 contains 5 tibbles
• airlines: full carrier name
• airports: information about each airport
• planes: information about each plane, identified by tailnum
• weather: hourly weather at each NYC airport
• flights: airplane scheduled departure and arrival times, tailnum, etc.
NYC flight data: flights
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
• flights connects to planes via a single variable, tailnum
• flights connects to airlines through the carrier variable
• flights connects to airports in two ways: via the origin and dest variables
• flights connects to weather via origin (the location), and year, month, day, and hour (the time)
Mutating Joins
Key
• A key is a variable (or set of variables) that uniquely identifies an observation
• A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table
• A foreign key uniquely identifies an observation in another table. For example, flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane
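One way to check that a variable really is a primary key is to count its values and look for any that occur more than once. The sketch below uses two small made-up tables (planes_small and flights_small are hypothetical stand-ins for planes and flights), so the values are illustrative only.

```r
library(dplyr)

# Hypothetical miniature versions of the planes and flights tables.
planes_small <- tibble(
  tailnum = c("N101", "N102", "N103"),
  seats   = c(150L, 180L, 200L)
)
flights_small <- tibble(
  flight  = c(1L, 2L, 3L, 4L),
  tailnum = c("N101", "N101", "N103", "N999")
)

# A primary key should never occur more than once in its own table,
# so this result should have zero rows.
dup_planes <- planes_small %>% count(tailnum) %>% filter(n > 1)
nrow(dup_planes)

# As a foreign key in flights_small, tailnum is allowed to repeat:
# several flights can use the same plane.
dup_flights <- flights_small %>% count(tailnum) %>% filter(n > 1)
dup_flights
```

The same count-then-filter pattern should work on the real planes table with tailnum as the counted variable.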
Data: flights
> library(dplyr)
> library(nycflights13)
> flights2 <- flights %>%
+   select(year:day, hour, origin, dest, tailnum, carrier)
> flights2
# A tibble: 336,776 x 8
   year month   day  hour origin dest  tailnum carrier
  <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>
The colored column represents the “key” variable: these are used to match the rows between the tables. The gray column represents the “value” column that is carried along for the ride. In these examples I’ll show a single key variable and single value variable, but the idea generalizes in a straightforward way to multiple keys and multiple values.
A join is a way of connecting each row in x to zero, one, or more rows in y. The following diagram shows each potential match as an intersection of a pair of lines:
(If you look closely, you might notice that we’ve switched the order of the key and value columns in x. This is to emphasize that joins match based on the key; the value is just carried along for the ride.)
In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.
Inner Join
The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:
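A minimal sketch of an inner join, assuming two small made-up tables x and y with a single key column (the names key, val_x and val_y are chosen here for illustration):

```r
library(dplyr)

# Two hypothetical tables sharing the key column `key`.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Only keys present in both tables (1 and 2) survive an inner join;
# key 3 (only in x) and key 4 (only in y) are dropped.
inner <- inner_join(x, y, by = "key")
inner
```

Because unmatched rows disappear silently, an inner join can lose observations without any warning.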
180 | Chapter 10: Relational Data with dplyr
Outer Joins
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
• A left join keeps all observations in x
• A right join keeps all observations in y
• A full join keeps all observations in x and y
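The three outer joins can be compared side by side on small made-up tables. This is a sketch with hypothetical data (x, y, key, val_x and val_y are illustrative names); unmatched rows are filled with NA.

```r
library(dplyr)

# Hypothetical tables: key 3 exists only in x, key 4 only in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

left_res  <- left_join(x, y, by = "key")   # all rows of x; val_y is NA for key 3
right_res <- right_join(x, y, by = "key")  # all rows of y; val_x is NA for key 4
full_res  <- full_join(x, y, by = "key")   # every key from either table

c(left = nrow(left_res), right = nrow(right_res), full = nrow(full_res))
```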
Left outer join
Graphically, that looks like:
The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
Another way to depict the different types of joins is with a Venn diagram:
However, this is not a great representation. It might jog your memory about which join preserves the observations in which table, but it suffers from a major limitation: a Venn diagram can’t show what happens when keys don’t uniquely identify an observation.
Right outer join
Graphically, that looks like:
Full outer join
Graphically, that looks like:
Join flights and planes
Join flights and planes by tailnum:
> flights2 %>% left_join(planes, by = "tailnum")
# A tibble: 336,776 x 16
   year.x month   day  hour origin dest  tailnum carrier year.y
    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int>
10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA
# ... with 336,766 more rows, and 7 more variables:
#   type <chr>, manufacturer <chr>, model <chr>,
#   engines <int>, seats <int>, speed <int>, engine <chr>
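The year.x and year.y columns in the output above appear because both tables have a non-key column called year; dplyr disambiguates them with suffixes. A small sketch with made-up tables (flights_small and planes_small are hypothetical) reproduces the effect, including the NA that results when a tailnum has no match:

```r
library(dplyr)

# Hypothetical tables that both contain a non-key column named `year`.
flights_small <- tibble(year = c(2013L, 2013L), tailnum = c("N101", "N999"))
planes_small  <- tibble(tailnum = "N101", year = 1999L, seats = 150L)

# Joining only on tailnum leaves two `year` columns, which dplyr
# renames to year.x (from x) and year.y (from y).
joined <- left_join(flights_small, planes_small, by = "tailnum")
names(joined)

# "N999" has no row in planes_small, so its year.y and seats are NA.
joined
```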
Join flights and airports
Join flights and airports by matching dest in flights to faa in airports; the output keeps the key name from x (dest):
> flights2 %>%
+   left_join(airports, c("dest" = "faa"))
# A tibble: 336,776 x 15
    year month   day  hour origin dest  tailnum carrier name
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Geor~
 2  2013     1     1     5 LGA    IAH   N24211  UA      Geor~
 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam~
 4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart~
 6  2013     1     1     5 EWR    ORD   N39463  UA      Chic~
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chic~
# ... with 336,766 more rows, and 6 more variables: lat <dbl>,
#   lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>
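The named-vector form of by can be tried on a pair of small made-up tables. In this sketch flights_small and airports_small are hypothetical, and the airport names are placeholders rather than real data:

```r
library(dplyr)

# Hypothetical tables whose key columns have different names.
flights_small  <- tibble(dest = c("IAH", "BQN"), carrier = c("UA", "B6"))
airports_small <- tibble(faa  = c("IAH", "ATL"),
                         name = c("Airport one", "Airport two"))

# c("dest" = "faa") matches x$dest against y$faa.
# The result keeps the name from x (dest) and drops faa.
joined <- left_join(flights_small, airports_small, by = c("dest" = "faa"))
names(joined)

# "BQN" has no match in airports_small, so its name is NA.
joined
```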
Filtering Joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
• semi_join(x, y) keeps all observations in x that have a match in y
• anti_join(x, y) drops all observations in x that have a match in y
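Both filtering joins can be sketched on small made-up tables (x, y, key, val_x and val_y are illustrative names). Note that neither adds columns from y:

```r
library(dplyr)

# Hypothetical tables: keys 1 and 2 match, key 3 exists only in x.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# semi_join keeps the rows of x that have a match in y.
kept <- semi_join(x, y, by = "key")

# anti_join keeps the rows of x that have no match in y,
# which is handy for diagnosing join mismatches.
dropped <- anti_join(x, y, by = "key")

kept
dropped
```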
Missing values
Course contents
• Chapter 9 of the book “R for Data Science” by Wickham and Grolemund
• R packages: tidyr, dplyr and tidyverse
Check object types
After R imports data, make sure the variables have the correct object types:
• str gives the object type of each variable in a data frame or tibble
• summary gives the minimum, quartiles, median, mean and maximum of each numeric variable, and the class and mode of each character variable
• head shows the first rows; for a tibble, the printout also lists the object type of each variable
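The three checks can be tried on a small made-up data frame. This is a sketch with hypothetical data (df, id and delay are illustrative names); sapply with class is added here as one more way to collect the types in a single vector.

```r
# Hypothetical data frame: `id` was read in as character, `delay` as numeric.
df <- data.frame(
  id    = c("1", "2", "3"),
  delay = c(2.5, NA, 10),
  stringsAsFactors = FALSE
)

str(df)      # one line per variable, with its type and first values
summary(df)  # numeric summary for delay, class and mode for id
head(df)     # first rows of the data

# Collect the class of every column in a named character vector.
types <- sapply(df, class)
types
```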
Check object types
> summary(flights2)
      year          month             day
 Min.   :2013   Min.   : 1.000   Min.   : 1.00
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00
 Median :2013   Median : 7.000   Median :16.00
 Mean   :2013   Mean   : 6.549   Mean   :15.71
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00
 Max.   :2013   Max.   :12.000   Max.   :31.00
Force a change of an object's type with as.<type>(), for example as.numeric():
> x = c("1","2","3")
> x
[1] "1" "2" "3"
> as.numeric(x)
[1] 1 2 3
> as.factor(x)
[1] 1 2 3
Levels: 1 2 3
> as.matrix(x,3,1)
     [,1]
[1,] "1"
[2,] "2"
[3,] "3"
> y = c("a","b")
> as.numeric(y)
[1] NA NA
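Because as.numeric() turns anything it cannot parse into NA, it can be useful to find out which entries failed. A minimal sketch, assuming a made-up character vector raw:

```r
# Hypothetical raw input with one entry that is not a number.
raw <- c("12", "7", "n/a", "3.5")

# as.numeric() warns about entries it cannot parse and returns NA for them;
# the warning is suppressed here because the NAs are inspected explicitly.
converted <- suppressWarnings(as.numeric(raw))

# Entries that were not missing before conversion but are NA afterwards.
failed <- raw[is.na(converted) & !is.na(raw)]
failed
```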
Missing observations/values
• Missing explicitly, i.e., flagged with NA (the presence of an absence)
• Missing implicitly, i.e., simply not present in the data (i.e., absence of a presence)
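The difference can be sketched with a small made-up table (stocks, year, qtr and ret are illustrative names). One value is explicitly NA, and one year/quarter combination is simply absent; complete() from tidyr is one way to turn the implicit gap into an explicit NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical quarterly data: 2015 Q4 is explicitly missing (NA),
# while 2016 Q1 is implicitly missing (there is no row for it).
stocks <- tibble(
  year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr  = c(1, 2, 3, 4, 2, 3, 4),
  ret  = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# complete() adds a row for every combination of year and qtr,
# filling the new row's other columns with NA.
completed <- stocks %>% complete(year, qtr)
completed
```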