Data Manipulation in R with dplyr - Big Data Analytics @UC ...

Davood Astaraky

Introduction to dplyr and tbls
    Load the dplyr and hflights package
    Convert data.frame to table
    Changing labels of hflights
    The five verbs and their meaning

Select and mutate
    Choosing is not losing! The select verb
    Helper functions for variable selection
    Comparison to basic R
    Mutating is creating
    Add multiple variables using mutate

Filter and arrange
    Logical operators
    Combining tests using boolean operators
    Blend together what you’ve learned!
    Arranging your data
    Reverse the order of arranging

Summarise and the pipe operator
    The syntax of summarise
    Aggregate functions
    dplyr aggregate functions
    Overview of pipe operator syntax
    Drive or fly?
    Advanced piping

Group_by and working with databases
    Unite and conquer using group_by
    Combine group_by with mutate
    Advanced group_by
    dplyr deals with different types
    dplyr and mySQL databases

Reference

Loading Libraries

library(dplyr)

library(tidyr)

library(knitr)

library(printr)

Introduction to dplyr and tbls


Load the dplyr and hflights package

dplyr is an R package, a collection of functions and data sets that enhance the R language. Here we will use dplyr to analyze a data set of airline flight data, containing flights that departed from Houston. This data is stored in a package called hflights. Below we load the hflights package. Now a variable called hflights is available, a data.frame representing the data set.

library(hflights)

str(hflights)

## 'data.frame': 227496 obs. of 21 variables:

## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...

## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...

## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...

## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...

## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...

## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...

## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...

## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...

## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...

## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...

## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...

## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...

## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...

## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ...

## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ...

## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ...

## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...

## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...

## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...

## $ CancellationCode : chr "" "" "" "" ...

## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...

Convert data.frame to table

A tbl is just a special kind of data.frame. Tbls make your data easier to look at, but also easier to work with. On top of this, a tbl is straightforwardly derived from a data.frame structure using tbl_df() (deprecated in dplyr 1.0.0 and later in favor of as_tibble()).

The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame; you can do anything with the hflights tbl that you could do with the hflights data.frame.

# Convert the hflights data.frame into a hflights tbl
hflights <- tbl_df(hflights)

glimpse(hflights)

## Observations: 227496

## Variables:

## $ Year (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...

## $ Month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

## $ DayofMonth (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...

## $ DayOfWeek (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...

## $ DepTime (int) 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...

## $ ArrTime (int) 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...

## $ UniqueCarrier (chr) "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...

## $ FlightNum (int) 428, 428, 428, 428, 428, 428, 428, 428, 428,...

## $ TailNum (chr) "N576AA", "N557AA", "N541AA", "N403AA", "N49...

## $ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...

## $ AirTime (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...

## $ ArrDelay (int) -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...

## $ DepDelay (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...

## $ Origin (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...

## $ Dest (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...

## $ Distance (int) 224, 224, 224, 224, 224, 224, 224, 224, 224,...

## $ TaxiIn (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...

## $ TaxiOut (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...

## $ Cancelled (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

## $ CancellationCode (chr) "", "", "", "", "", "", "", "", "", "", "", ...

## $ Diverted (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Changing labels of hflights

You can “clean” hflights the same way you would clean a data.frame. A bit of cleaning would be a good idea since the UniqueCarrier variable of hflights uses a confusing code system.

You can create a lookup table with a named vector. When you subset the lookup table with a character string (like the character strings in UniqueCarrier), R will return the values of the lookup table that correspond to the names in the character string.

# Build the lookup table
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental"

# Use lut to translate the UniqueCarrier column of hflights
hflights$UniqueCarrier <- lut[hflights$UniqueCarrier]

# Inspect the resulting raw values of your variables
glimpse(hflights)


## Variables:

## $ Year (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...

## $ Month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

## $ DayofMonth (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...

## $ DayOfWeek (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...

## $ DepTime (int) 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...

## $ ArrTime (int) 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...

## $ UniqueCarrier (chr) "American", "American", "American", "America...

## $ FlightNum (int) 428, 428, 428, 428, 428, 428, 428, 428, 428,...

## $ TailNum (chr) "N576AA", "N557AA", "N541AA", "N403AA", "N49...

## $ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...

## $ AirTime (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...

## $ ArrDelay (int) -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...

## $ DepDelay (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...

## $ Origin (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...

## $ Dest (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...

## $ Distance (int) 224, 224, 224, 224, 224, 224, 224, 224, 224,...

## $ TaxiIn (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...

## $ TaxiOut (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...

## $ Cancelled (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

## $ CancellationCode (chr) "", "", "", "", "", "", "", "", "", "", "", ...

## $ Diverted (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
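The lookup-table idiom can be tried on a small vector without loading hflights. The sketch below is only illustrative: the vector `codes` is made up (including the unknown code "XX"), and `lut` holds just the four carrier codes visible in the example above.

```r
# A named vector acts as the lookup table: names are codes, values are labels.
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental")

# Hypothetical carrier codes, standing in for hflights$UniqueCarrier.
codes <- c("AA", "CO", "AA", "B6", "XX")

# Subsetting by name returns one label per code; a code missing from lut gives NA.
labels <- unname(lut[codes])
print(labels)
```

One consequence worth noting: any code that is absent from the lookup table is silently turned into NA, so the table should cover every code that occurs in the column being translated.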

The five verbs and their meaning

The dplyr package contains five key data manipulation functions, also called verbs:

select(), which returns a subset of the columns,
filter(), which returns a subset of the rows,
arrange(), which reorders the rows according to one or more variables,
mutate(), which adds columns computed from existing data,
summarise(), which reduces each group to a single row by calculating aggregate measures.

Below we explore each one in detail.

Select and mutate

Choosing is not losing! The select verb

To answer the simple question of whether flight delays tend to shrink or grow during a flight, we can safely discard many of the variables recorded for each flight. To select only the ones that matter, we can use select(). Its syntax is plain and simple:

select(data, Var1, Var2, ...)

The first argument is the tbl you want to select variables from, and the VarX arguments are the variables you want to retain. You can also use the : and - operators inside select(), similar to indexing a data.frame with square brackets. select() lets you apply them to names as well as integer indexes. The - operator allows you to select everything except a column or a range of columns.

Below we select only four columns of the hflights dataset.

hflights_subset <- select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)

kable (head(hflights_subset),align = 'c')

ActualElapsedTime AirTime ArrDelay DepDelay

60 40 -10 0

60 45 -9 1

70 48 -8 -8

70 39 3 3

62 44 -3 5

64 45 -7 -1

Below we demonstrate the most concise way to select columns Year up to and including DayOfWeek, and columns ArrDelay up to and including Diverted.

hflights_subset2<- select (hflights, Year:DayOfWeek, ArrDelay:Diverted)
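The : and - operators, and selection by position, can be sketched on a toy data frame. The data frame `flights` below is hypothetical: it borrows a few hflights column names, but its values are made up.

```r
library(dplyr)

# Toy stand-in for hflights with a handful of its column names (values are made up).
flights <- data.frame(
  Year = 2011L, Month = 1L, DayofMonth = 1:3, DayOfWeek = c(6L, 7L, 1L),
  DepDelay = c(0L, 1L, -8L), ArrDelay = c(-10L, -9L, -8L)
)

# Keep a contiguous range of columns by name.
range_cols <- select(flights, Year:DayOfWeek)

# Drop a range of columns with the - operator; everything else is kept.
without_range <- select(flights, -(Month:DayOfWeek))

# Integer positions work as well as names.
by_position <- select(flights, 1:2)

print(names(range_cols))
print(names(without_range))
print(names(by_position))
```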

Helper functions for variable selection

dplyr comes with a set of helper functions that can help you select variables. These functions find groups of variables to select, based on their names.

dplyr provides 6 helper functions, each of which only works when used inside select():

starts_with("X"): every name that starts with "X",
ends_with("X"): every name that ends with "X",
contains("X"): every name that contains "X",
matches("X"): every name that matches "X", which can be a regular expression,
num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5,
one_of(x): every name that appears in x, which should be a character vector.

Watch out: Surround character strings with quotes when you pass them to a helper function, but do not surround variable names with quotes if you are not passing them to a helper function.

Below are some examples of helper functions:

select (hflights, matches("ArrDelay"), matches("DepDelay"))

select(hflights, one_of(c("UniqueCarrier", "FlightNum", "TailNum", "Cancelled", "CancellationCode"

select(hflights, ends_with("Time"), ends_with("Delay"))
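To see what each helper picks up, it can help to run them on a tiny data frame. The one-row data frame `df` below is hypothetical; its column names were chosen only so that every helper has something to match.

```r
library(dplyr)

# Hypothetical one-row data frame; the column names exist only to exercise the helpers.
df <- data.frame(x1 = 1, x2 = 2, x3 = 3, DepTime = 1400, ArrTime = 1500, ArrDelay = -10)

arr_cols   <- names(select(df, starts_with("Arr")))    # names beginning with "Arr"
time_cols  <- names(select(df, ends_with("Time")))     # names ending in "Time"
del_cols   <- names(select(df, contains("Del")))       # names containing "Del"
regex_cols <- names(select(df, matches("^x[0-9]$")))   # names matching a regular expression
num_cols   <- names(select(df, num_range("x", 1:2)))   # x1 and x2

print(list(arr_cols, time_cols, del_cols, regex_cols, num_cols))
```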

Comparison to basic R

To see the added value of the dplyr package, it is useful to compare its syntax with base R. Up to now, you have only considered functionality that is also available without the use of dplyr. However, the elegance and ease of use of dplyr should be clear from the following short set of comparisons.

ex1r <- hflights[c("TaxiIn","TaxiOut","Distance")]

ex1d <- select(hflights, TaxiIn, TaxiOut, Distance)

ex2r <- hflights[c("Year","Month","DayOfWeek","DepTime","ArrTime")]

ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

ex3r <- hflights[c("TailNum","TaxiIn","TaxiOut")]

ex3d <- select(hflights, starts_with("ta"))

Mutating is creating

mutate() is the second of the five data manipulation functions covered here. In contrast to select(), which retains a subset of all variables, mutate() creates new columns which are added to a copy of the dataset.

Let’s briefly recap the syntax:

mutate(data, Mutant1 = expr(Var0,Var1,...))

Here, data is the tbl you want to use to create new columns. The second argument is an expression that assigns the result of any R function using already existing variables Var0, Var1, ... to a new variable Mutant1.

Below are a couple of examples demonstrating the use of the mutate() function:

# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

# Add the new variable GroundTime to a copy of g1 and save the result as g2.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

# Add the new variable AverageSpeed to a copy of g2 and save the result as g3.
g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)

Add multiple variables using mutate

So far we’ve added variables to hflights one at a time, but you can also use mutate() to add multiple variables at once. To create more than one variable, place a comma between each variable that you define inside mutate(). Below we demonstrate how it can be done:

m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = (ArrDelay - DepDelay) /

m2 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = loss / DepDelay *

m3 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut,

ActualGroundTime = ActualElapsedTime - AirTime,

Diff = TotalTaxi - ActualGroundTime)
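A small self-contained sketch of defining several variables in one mutate() call, where a later variable reuses an earlier one. The data frame `delays` and its values are made up, and the percentage formula is an assumption chosen for illustration.

```r
library(dplyr)

# Made-up delays in minutes, using hflights-style column names.
delays <- data.frame(DepDelay = c(10, 20, 40), ArrDelay = c(5, 30, 40))

# Variables are created left to right, so loss_percent can refer to loss.
result <- mutate(delays,
                 loss = ArrDelay - DepDelay,
                 loss_percent = loss / DepDelay * 100)
print(result)
```

Because the definitions are evaluated in order, swapping the two would fail: loss_percent would refer to a column that does not exist yet.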

Filter and arrange

Logical operators

R comes with a set of logical operators that you can use to extract rows with filter(). These operators are:

x < y, TRUE if x is less than y
x <= y, TRUE if x is less than or equal to y
x == y, TRUE if x equals y
x != y, TRUE if x does not equal y
x >= y, TRUE if x is greater than or equal to y
x > y, TRUE if x is greater than y
x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

Examples:

# All flights that traveled 3000 miles or more.
f1 <- filter(hflights, Distance >= 3000)

# All flights flown by one of JetBlue, Southwest, or Delta airlines
f2 <- filter(hflights, UniqueCarrier %in% c("JetBlue", "Southwest", "Delta"))

# All flights where taxiing took longer than flying
f3 <- filter(hflights, TaxiIn + TaxiOut > AirTime)

Combining tests using boolean operators

R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include &, |, and !, respectively the and, or and not operators.

You can thus use R’s & operator to combine logical tests in filter(), but that is not necessary. If you supply filter() with multiple tests separated by commas, it will return just the rows that satisfy each test (as if the tests were joined by an & operator).

Finally, filter() makes it very easy to screen out rows that contain NAs, R’s symbol for missing information. You can identify an NA with the is.na() function.

Examples:

# all flights that departed before 5am or arrived after 10pm.
f1 <- filter(hflights, DepTime < 500 | ArrTime > 2200)

# all flights that departed late but arrived ahead of schedule
f2 <- filter(hflights, DepDelay > 0, ArrDelay < 0)

# all cancelled weekend flights
f3 <- filter(hflights, DayOfWeek %in% c(6,7), Cancelled == 1)

# all flights that were cancelled after being delayed
f4 <- filter(hflights, Cancelled == 1, DepDelay > 0)
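The equivalence between comma-separated tests and &, and the handling of NA, can be checked on a few made-up rows. The data frame `flights` below is hypothetical and only reuses hflights column names.

```r
library(dplyr)

# Made-up flights; NA marks a missing departure delay.
flights <- data.frame(
  DepDelay  = c(5, NA, -2, 12),
  ArrDelay  = c(-3, 4, -6, 20),
  Cancelled = c(0, 1, 0, 0)
)

# Comma-separated tests and an explicit & select the same rows.
with_comma <- filter(flights, DepDelay > 0, ArrDelay < 0)
with_and   <- filter(flights, DepDelay > 0 & ArrDelay < 0)
print(isTRUE(all.equal(with_comma, with_and)))

# | keeps rows that pass either test.
either <- filter(flights, Cancelled == 1 | ArrDelay > 10)

# ! negates a test; here it keeps only rows whose DepDelay is known.
known <- filter(flights, !is.na(DepDelay))

print(nrow(with_comma))
print(nrow(either))
print(nrow(known))
```

Note that a test evaluating to NA (such as NA > 0 in the second row) drops the row, just as a FALSE would.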

Blend together what you’ve learned!

Let’s generate a new data set from the hflights data set that contains some useful information on flights that had JFK airport as their destination. We will need select(), mutate(), as well as filter().

# Select the flights that had JFK as their destination
c1 <- filter(hflights, Dest == "JFK")

# Combine the Year, Month and DayofMonth variables to create a Date column
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))

# Retain only a subset of columns to provide an overview
c3 <- select(c2, Date, DepTime, ArrTime, TailNum)

kable (head(c3),align = 'c')

Date DepTime ArrTime TailNum

2011-1-1 654 1124 N324JB

2011-1-1 1639 2110 N324JB

2011-1-2 703 1113 N324JB

2011-1-2 1604 2040 N324JB

2011-1-3 659 1100 N229JB

2011-1-3 1801 2200 N206JB

Another example: how many weekend flights flew a distance of more than 1000 miles but had a total taxiing time below 15 minutes?

filter(hflights, DayOfWeek %in% c(6,7), Distance > 1000, TaxiIn + TaxiOut < 15)

Arranging your data

The syntax of arrange() is the following:

arrange(data, Var0, Var1, ... )

Here, data is again the tbl you’re working with and Var0, Var1, ... are the variables according to which you arrange. When Var0 does not fully determine the order, Var1 and possibly additional variables will serve as tie breakers to decide the arrangement.

arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, for example, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

Examples:

dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

# Arrange dtc by departure delays
a1 <- arrange(dtc, DepDelay)

# Arrange dtc so that cancellation reasons are grouped
a2 <- arrange(dtc, CancellationCode)

# Arrange according to carrier and departure delays
a3 <- arrange(hflights, UniqueCarrier, DepDelay)
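The difference between arranging by a character variable and by a factor can be sketched as follows. The data frame `shirts` is entirely made up; its factor levels are deliberately not in alphabetical order.

```r
library(dplyr)

# Made-up data: size is a factor whose levels are deliberately not alphabetical.
shirts <- data.frame(
  id   = 1:4,
  name = c("bravo", "delta", "alpha", "charlie"),
  size = factor(c("large", "small", "medium", "small"),
                levels = c("small", "medium", "large"))
)

# The level order is the order arrange() follows for this factor.
print(levels(shirts$size))

# Character variable: alphabetical order.
by_name <- arrange(shirts, name)

# Factor variable: level order, with id acting as the tie breaker.
by_size <- arrange(shirts, size, id)

print(by_name$id)
print(by_size$id)
```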

Reverse the order of arranging

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

Examples:

# Arrange according to carrier and decreasing departure delays
a1 <- arrange(hflights, UniqueCarrier, desc(DepDelay))

# Arrange flights by total delay (normal order).
a2 <- arrange(hflights, DepDelay + ArrDelay)

# Keep flights leaving to DFW before 8am and arrange according to decreasing AirTime
a3 <- arrange(filter(hflights, Dest == "DFW" & DepTime < 800), desc(AirTime))

Summarise and the pipe operator

The syntax of summarise

summarise(), the last of the five verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row rather than an entire new column, as is the case with mutate(). Below, a typical summarise() call is shown to illustrate the syntax, without going into detail on all arguments:

summarise(data, sumvar = sum(A),
                avgvar = mean(B))

In contrast to the four other data manipulation functions, summarise() does not return a copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

Examples:

# Determine the shortest and longest distance flown and save statistics to min_dist and max_dist resp.
s1 <- summarise(hflights,
                min_dist = min(Distance),
                max_dist = max(Distance))

# Determine the longest distance for diverted flights, save statistic to max_div. Use a one-liner!
s2 <- summarise(filter(hflights, Diverted == 1),
                max_div = max(Distance))

Aggregate functions

You can use any function you like in summarise(), so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them. Here are some of the most useful:

min(x) - minimum value of vector x.
max(x) - maximum value of vector x.
mean(x) - mean value of vector x.
median(x) - median value of vector x.
quantile(x, p) - pth quantile of vector x.
sd(x) - standard deviation of vector x.
var(x) - variance of vector x.
IQR(x) - interquartile range (IQR) of vector x.
diff(range(x)) - total range of vector x.

Examples:

# Calculate summarizing statistics for flights that have an ArrDelay that is not NA
temp1 <- filter(hflights, !is.na(ArrDelay))

s1 <- summarise(temp1,

earliest = min(ArrDelay),

average = mean(ArrDelay),

latest = max(ArrDelay),

sd = sd(ArrDelay))

kable (head(s1),align = 'c')

earliest average latest sd

-70 7.094334 978 30.70852

# Calculate the maximum taxiing difference for flights that have taxi data available
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

s2 <- summarise(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))

print(s2)

## Source: local data frame [1 x 1]

##

## max_taxi_diff

## 1 160

dplyr aggregate functions

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) - The first element of vector x.
last(x) - The last element of vector x.
nth(x, n) - The nth element of vector x.
n() - The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) - The number of unique values in vector x.

Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUEs and FALSEs. When you apply sum() or mean() to such a vector, R coerces each TRUE to a 1 and each FALSE to a 0. This allows you to find the total number or proportion of observations that passed the test, respectively.

Examples:

# Calculate the summarizing statistics of hflights

s1 <- summarise(hflights, n_obs = n(),

n_carrier = n_distinct(UniqueCarrier),

n_dest = n_distinct(Dest),

dest100 = nth(Dest, 100))


n_obs n_carrier n_dest dest100

227496 15 116 DFW

# Calculate the summarizing statistics for flights flown by American Airlines (carrier code "American")
aa <- filter(hflights, UniqueCarrier == "American")

s2 <- summarise(aa, n_flights = n(),

n_canc = sum(Cancelled == 1),

p_canc = mean(Cancelled == 1) * 100,

avg_delay = mean(ArrDelay, na.rm = TRUE))


n_flights n_canc p_canc avg_delay

3244 60 1.849568 0.8917558
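The coercion of TRUE and FALSE to 1 and 0 can be verified in base R alone. The vector `cancelled` below is made up and merely mimics the 0/1 coding of the Cancelled column.

```r
# Made-up cancellation flags: 1 means the flight was cancelled.
cancelled <- c(0, 1, 0, 0, 1, 0, 0, 0)

# The test returns a logical vector.
is_cancelled <- cancelled == 1

# sum() counts the TRUEs; mean() gives the proportion of TRUEs.
n_canc <- sum(is_cancelled)
p_canc <- mean(is_cancelled) * 100

print(n_canc)
print(p_canc)
```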

Overview of pipe operator syntax

Using the pipe operator %>%, the following two statements are equivalent:

mean(c(1, 2, 3, NA), na.rm = TRUE)

c(1, 2, 3, NA) %>% mean(na.rm = TRUE)

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of the function call, thus solving the Dagwood sandwich problem of deeply nested function calls.

Example:

# Write the 'piped' version of the English sentences.
p <- hflights %>%

mutate(diff = TaxiOut - TaxiIn) %>%

filter(!is.na(diff)) %>%

summarise(avg = mean(diff))

print(p)

## Source: local data frame [1 x 1]

##

## avg

## 1 8.992064
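To make the benefit of piping concrete, the sketch below writes the same computation twice, once as nested calls and once as a pipeline, on a made-up data frame `taxi` that reuses the TaxiIn and TaxiOut column names.

```r
library(dplyr)

# Made-up taxi times in minutes; the NA produces a missing difference.
taxi <- data.frame(TaxiIn = c(5, 7, NA, 6), TaxiOut = c(15, 12, 20, 10))

# Nested calls have to be read from the inside out.
nested <- summarise(filter(mutate(taxi, diff = TaxiOut - TaxiIn), !is.na(diff)),
                    avg = mean(diff))

# The piped version reads top to bottom and gives the same result.
piped <- taxi %>%
  mutate(diff = TaxiOut - TaxiIn) %>%
  filter(!is.na(diff)) %>%
  summarise(avg = mean(diff))

print(nested)
print(piped)
```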

Drive or fly?

You can answer sophisticated questions by combining the verbs of dplyr. Over the next few examples we will examine whether it sometimes makes sense to drive instead of fly. We will begin by making a data set that contains relevant variables. Then, we find flights whose equivalent average velocity is lower than the velocity when traveling by car.

Example :

# Part 1, concerning the selection and creation of columns
d <- hflights %>%

select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%

mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60)

kable (head(d),align = 'c')

Dest UniqueCarrier Distance ActualElapsedTime RealTime mph

DFW American 224 60 160 84.00000

DFW American 224 60 160 84.00000

DFW American 224 70 170 79.05882

DFW American 224 70 170 79.05882

DFW American 224 62 162 82.96296

DFW American 224 64 164 81.95122

# Part 2, concerning flights that had an actual average speed of < 70 mph.
d %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise(n_less = n(),
            n_dest = n_distinct(Dest),
            min_dist = min(Distance),
            max_dist = max(Distance))

n_less n_dest min_dist max_dist

6726 13 79 305

The previous example suggested that some flights might be less efficient than driving in terms of speed. But is speed all that matters? Flying imposes burdens on a traveler that driving does not. For example, airplane tickets are very expensive. Air travelers also need to limit what they bring on their trip and arrange for a pick up or a drop off. Given these burdens we might demand that a flight provide a large speed advantage over driving.

Example:

# Solve the exercise using a combination of dplyr verbs and %>%
hflights %>%
  select(Dest, Cancelled, Distance, ActualElapsedTime, Diverted) %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarise(n_non = n(),
            p_non = n_non / nrow(hflights) * 100,
            n_dest = n_distinct(Dest),
            min_dist = min(Distance),
            max_dist = max(Distance))

n_non p_non n_dest min_dist max_dist

42400 18.63769 113 79 3904

Advanced piping

One more example of using the pipe operator:

# Count the number of overnight flights
hflights %>%

filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%

summarise(n = n())

Group_by and working with databases

Unite and conquer using group_by

group_by() lets you define groups within your data set. Its influence becomes clear when calling summarise() on a grouped dataset: summarizing statistics are calculated for the different groups separately.

The syntax for this function is again straightforward:

group_by(data, Var0, Var1, ...)

Here, data is the tbl dataset you work with, and Var0, Var1, ... are the variables you want to group by. If you pass several variables as arguments, the number of separate sets of grouped observations will increase, but their size will decrease.
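The effect of adding a second grouping variable can be seen on a handful of made-up rows. The data frame `flights` below is hypothetical and only borrows hflights column names.

```r
library(dplyr)

# Made-up flights with hflights-style column names.
flights <- data.frame(
  UniqueCarrier = c("American", "American", "American", "Delta", "Delta", "Delta"),
  Dest          = c("DFW", "DFW", "JFK", "ATL", "JFK", "JFK"),
  ArrDelay      = c(-10, 4, 12, 0, 7, -3)
)

# One grouping variable: one summary row per carrier.
by_carrier <- flights %>%
  group_by(UniqueCarrier) %>%
  summarise(n_flights = n(), avg_delay = mean(ArrDelay))

# Two grouping variables: more groups, each containing fewer rows.
by_carrier_dest <- flights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(n_flights = n(), avg_delay = mean(ArrDelay))

print(by_carrier)
print(by_carrier_dest)
```

Here the two carriers split into four carrier and destination pairs, so the summary grows from two rows to four while the groups shrink from three flights each to one or two.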

Example :

# Make the calculations to end up with ordered statistics per carrier
hflights %>%

group_by(UniqueCarrier) %>%

summarise(n_flights = n(),

n_canc = sum(Cancelled == 1),

p_canc = mean(Cancelled == 1) * 100,

avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%

arrange(avg_delay, p_canc)

UniqueCarrier n_flights n_canc p_canc avg_delay

US_Airways 4082 46 1.1268986 -0.6307692

American 3244 60 1.8495684 0.8917558

AirTran 2139 21 0.9817672 1.8536239

Alaska 365 0 0.0000000 3.1923077

Mesa 79 1 1.2658228 4.0128205

Delta 2641 42 1.5903067 6.0841374

Continental 70032 475 0.6782614 6.0986983

American_Eagle 4648 135 2.9044750 7.1529751

Atlantic_Southeast 2204 76 3.4482759 7.2569543

Southwest 45343 703 1.5504047 7.5871430

Frontier 838 6 0.7159905 7.6682692

ExpressJet 73053 1132 1.5495599 8.1865242

SkyWest 16061 224 1.3946828 8.6934922

JetBlue 695 18 2.5899281 9.8588410

United 2072 34 1.6409266 10.4628628

# Answer the question: Which day of the week is average total taxiing time highest?

hflights %>%

group_by(DayOfWeek) %>%

summarise(avg_taxi = mean(TaxiIn + TaxiOut, na.rm=TRUE)) %>%

arrange(desc(avg_taxi))

DayOfWeek avg_taxi

1 21.77027

2 21.43505

4 21.26076

3 21.19055

5 21.15805

7 20.93726

6 20.43061

Combine group_by with mutate

You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, which calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() orders values from the smallest to the largest by default (the smallest value gets rank 1), and this behavior can be reversed with the desc() function.
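The ranking direction, and the way a grouped mutate() restarts the ranking in every group, can be checked with a short sketch. The data frame `scores` and its values are made up for illustration.

```r
library(dplyr)

x <- c(21, 22, 24, 23)

# By default the smallest value gets rank 1.
asc_rank <- rank(x)

# Wrapping the values in desc() gives rank 1 to the largest value instead.
desc_rank <- rank(desc(x))

# Made-up grouped data: ranks are computed separately within each group.
scores <- data.frame(group = c("a", "a", "b", "b"), value = c(3, 1, 10, 20))

ranked <- scores %>%
  group_by(group) %>%
  mutate(rank = rank(desc(value))) %>%
  ungroup()

print(asc_rank)
print(desc_rank)
print(ranked)
```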

Example:

# Part 1
hflights %>%
  group_by(UniqueCarrier) %>%
  filter(!is.na(ArrDelay)) %>%
  summarise(p_delay = mean(ArrDelay > 0)) %>%
  mutate(rank = rank(p_delay)) %>%
  arrange(rank)

UniqueCarrier p_delay rank

American 0.3030208 1

AirTran 0.3112269 2

US_Airways 0.3267990 3

Atlantic_Southeast 0.3677511 4

American_Eagle 0.3696714 5

Delta 0.3871092 6

JetBlue 0.3952452 7

Alaska 0.4368132 8

Southwest 0.4644557 9

Mesa 0.4743590 10

Continental 0.4907385 11

ExpressJet 0.4943420 12

United 0.4963109 13

SkyWest 0.5350105 14

Frontier 0.5564904 15

# Part 2
hflights %>%
  group_by(UniqueCarrier) %>%
  filter(!is.na(ArrDelay), ArrDelay > 0) %>%
  summarise(avg = mean(ArrDelay)) %>%
  mutate(rank = rank(avg)) %>%
  arrange(rank)

UniqueCarrier avg rank

Mesa 18.67568 1

Frontier 18.68683 2

US_Airways 20.70235 3

Continental 22.13374 4

Alaska 22.91195 5

SkyWest 24.14663 6

ExpressJet 24.19337 7

Southwest 25.27750 8

AirTran 27.85693 9

American 28.49740 10

Delta 32.12463 11

United 32.48067 12

American_Eagle 38.75135 13

Atlantic_Southeast 40.24231 14

JetBlue 45.47744 15

Advanced group_by

This section is an all-encompassing review of the concepts in dplyr.

# Which plane (by tail number) flew out of Houston the most times? How many times? adv1
adv1 <- hflights %>%

group_by(TailNum) %>%

summarise(n = n()) %>%

filter(n == max(n))

# How many airplanes only flew to one destination from Houston? adv2
adv2 <- hflights %>%

group_by(TailNum) %>%

summarise(ndest = n_distinct(Dest)) %>%

filter(ndest == 1) %>%

summarise(nplanes = n())

# Find the most visited destination for each carrier: adv3
adv3 <- hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(n = n()) %>%
  mutate(rank = rank(desc(n))) %>%
  filter(rank == 1)

# Find the carrier that travels to each destination the most: adv4
adv4 <- hflights %>%
  group_by(Dest, UniqueCarrier) %>%
  summarise(n = n()) %>%
  mutate(rank = rank(desc(n))) %>%
  filter(rank == 1)

dplyr deals with different types

For this section, hflights2 is a copy of hflights that is saved as a data table using the following code:

library(data.table)

hflights2 <- as.data.table(hflights)

hflights2 contains all of the same information as hflights, but the information is stored in a different data structure.

Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.

Example:

# Use summarise to calculate n_carrier
s2 <- hflights2 %>%

summarise(n_carrier = n_distinct(UniqueCarrier))
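As a rough illustration of the same idea without the hflights data, the sketch below runs one summarise() call on two different table classes. It swaps in a tibble for the data.table so that only dplyr is needed; that substitution, the names df and tb, and the toy rows are all assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy data (not hflights)
df <- data.frame(UniqueCarrier = c("AA", "AA", "WN", "UA", "WN"))
tb <- as_tibble(df)  # same contents, different class

# The identical dplyr call works on both structures
s_df <- df %>% summarise(n_carrier = n_distinct(UniqueCarrier))
s_tb <- tb %>% summarise(n_carrier = n_distinct(UniqueCarrier))

print(s_df)
print(s_tb)
```

Both calls count the same number of distinct carriers; only the class of the returned object follows the input.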

dplyr and mySQL databases

nycflights is a MySQL database hosted on the DataCamp server. It contains information about flights that departed from New York City in 2013. This data is similar to the data in hflights, but it does not contain information about cancellations or diversions (you can access the same data in the nycflights13 R package).

nycflights, an R object that stores a connection to the nycflights tbl that lives outside of R on the DataCamp server, is created in the example below. You can use such connection objects to pull data from databases into R. This lets you work with datasets that are too large to fit in R.

You can learn a query language such as SQL to make sophisticated queries against such a database, or you can simply use dplyr. When you run a dplyr command on a database connection, dplyr converts the command to the database's native language and runs the query for you. As a result, only the data that you need is retrieved from the database. This is usually a fraction of the total data, and it fits in R without memory issues.

For example, we can easily retrieve a summary of how many carriers and how many flights flew in and out of New York City in 2013 with the following code (note that in nycflights, the UniqueCarrier variable is named carrier):

summarise(nycflights,
          n_carriers = n_distinct(carrier),
          n_flights = n())

Example:

# Set up a src that connects to the MySQL database (src_mysql is provided by dplyr)
my_db <- src_mysql(dbname = "dplyr",
                   host = "dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com",
                   port = 3306, user = "dplyr", password = "dplyr")

# Reference a table within that src
nycflights <- tbl(my_db, "dplyr")

# nycflights is now available as an R object that references the remote nycflights table.

# Glimpse at nycflights
glimpse(nycflights)


## Variables:

## $ id (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...

## $ year (int) 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...

## $ month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

## $ day (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

## $ dep_time (int) 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...

## $ dep_delay (int) 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...

## $ arr_time (int) 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...

## $ arr_delay (int) 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...

## $ carrier (chr) "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...

## $ tailnum (chr) "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...

## $ flight (int) 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...

## $ origin (chr) "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...

## $ dest (chr) "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...

## $ air_time (int) 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...

## $ distance (int) 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...

## $ hour (int) 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...

## $ minute (int) 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...

# Calculate the grouped summaries detailed in the instructions.
dbsumm <- nycflights %>%
  group_by(carrier) %>%
  summarise(n_flights = n(), avg_delay = mean(arr_delay)) %>%
  arrange(avg_delay)

dbsumm

## Source: mysql 5.6.21-log [dplyr@dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]

## From: <derived table> [?? x 3]

## Arrange: avg_delay

## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as

## numeric

## carrier n_flights avg_delay

## 1 AS 714 -9.8613

## 2 HA 342 -6.9152

## 3 AA 32729 0.3556

## 4 DL 48110 1.6289

## 5 VX 5162 1.7487

## 6 US 20536 2.0565

## 7 UA 58665 3.5045

## 8 9E 18460 6.9135

## 9 B6 54635 9.3565

## 10 WN 12275 9.4675

## .. ... ... ...
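The translation from dplyr verbs to SQL described in this section can be inspected locally. The sketch below is not the DataCamp setup: it assumes the DBI, RSQLite and dbplyr packages are installed, uses an in-memory SQLite database in place of the remote MySQL server, and fills it with a few made-up rows (toy_flights is a hypothetical name). show_query() prints the SQL that dplyr generated, and collect() runs it and brings the result into R.

```r
library(dplyr)
library(DBI)

# Assumption: an in-memory SQLite database stands in for the remote MySQL server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy rows, not the real nycflights data
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  arr_delay = c(10, -4, 20, NA, 4)
)
dbWriteTable(con, "flights", toy_flights)

# A reference to the remote table; no rows are pulled into R yet
flights_db <- tbl(con, "flights")

query <- flights_db %>%
  group_by(carrier) %>%
  summarise(n_flights = n(), avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(avg_delay)

show_query(query)         # print the generated SQL
result <- collect(query)  # run the query and return a local tibble

dbDisconnect(con)
print(result)
```

Until collect() is called (or the object is printed), the pipeline is only a description of a query, which is why the printed dbsumm above reports its row count as ??. Note also that SQL's AVG skips NULL values, so the missing arr_delay in the toy data does not turn the average into NA.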

Reference

R & Data Science courses offered through DataCamp.
