Ann Arbor ASA ‘Up and Running’ Series R

Mar 19, 2016

Page 1: Ann Arbor ASA  ‘Up and Running’ Series R

ANN ARBOR ASA ‘UP AND RUNNING’ SERIES

R

Prepared by volunteers of the Ann Arbor Chapter of the American Statistical Association, in cooperation with the Department of Statistics and the Center for Statistical Consultation and Research of the University of Michigan

Page 2: Ann Arbor ASA  ‘Up and Running’ Series R

Ann Arbor ASA Up and Running Series: R 2

Content

Introduction
Using R Help
Functions Available in R
Working with Data
Importing/Exporting Data
Graphs + Statistics
Practice Problems
Further Resources

Page 3: Ann Arbor ASA  ‘Up and Running’ Series R

http://sites.google.com/site/annarborasa/

Presentation Materials R Class Materials

Select files: furniture.csv, furniture.txt, R Workshop_PwrPt.pptx, Short-refcard.pdf

Upload files to Desktop

Page 4: Ann Arbor ASA  ‘Up and Running’ Series R

Introduction - What is R?

R is an object-oriented programming language and environment based on the S language

R is open source, with its code available to users

R is commonly used for statistical analysis

R is free software: http://www.r-project.org

Page 5: Ann Arbor ASA  ‘Up and Running’ Series R

Introduction - More About R

Statistical analysis is done using pre-defined functions in R.

These functions are available in many different packages.

Upon download of the ‘base’ package, you have access to many functions.

More advanced functions require downloading other packages.
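
As a rough illustration of how packages work, the sketch below lists the packages attached in a default session and checks for one that usually ships with R. The install.packages() line is commented out because it needs an Internet connection, and lme4 is only an example package name, not one this workshop requires.

```r
# Packages attached at start-up (a default session includes base, stats, graphics, ...)
attached <- search()
print(attached)

# lm() comes from the stats package, which is attached by default
print(is.function(lm))

# Check whether an add-on package is installed, without attaching it
has_foreign <- requireNamespace("foreign", quietly = TRUE)
print(has_foreign)

# Installing and loading another package (example name only; needs Internet access)
# install.packages("lme4")
# library(lme4)
```

library() attaches a package for the current session only, so it has to be submitted again in each new session.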

Page 6: Ann Arbor ASA  ‘Up and Running’ Series R

Introduction – What can you do with R?

Many statistical methods are readily available: linear modeling, linear mixed modeling, multivariate analysis, clustering, non-parametric methods, and classification, among others.

R is well known for producing high-quality graphics. Simple plots are easy, and with a little more practice users can produce publication-quality graphics!

Page 7: Ann Arbor ASA  ‘Up and Running’ Series R

Introduction - Launch R

Start > All Programs > Math & Statistics > R

Page 8: Ann Arbor ASA  ‘Up and Running’ Series R

Introduction – Editor Window

File > New script

This opens an editor window in which to type your script

Page 9: Ann Arbor ASA  ‘Up and Running’ Series R

Introduction - Data Objects in R

Users create different data objects in R

Data objects include variables, arrays of numbers, character strings, functions, and other more complicated data structures

<- assigns a value to a data object

Type in your editor window: a <- 7

Submit this command by highlighting it and pressing Ctrl+R

Practice creating different data objects and submit them to the workspace
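
For practice, here is a minimal sketch of a few kinds of data objects. Apart from a, the object names (greeting, v, square) are made up for illustration.

```r
a <- 7                        # a single number
greeting <- "hello"           # a character string
v <- c(1.5, 2.5, 4)           # a numeric vector
square <- function(z) z^2     # a function

# class() reports what kind of object each one is
print(class(a))
print(class(greeting))
print(class(v))
print(class(square))

# Objects can be used in later commands
print(square(a))
```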

Page 10: Ann Arbor ASA  ‘Up and Running’ Series R

Introduction - Data Objects in R

Type objects()

This lists the objects you have created during this R session, including a

You can recall previously submitted commands in the console by using the up/down arrow keys

You can remove this object by typing rm(a)

Try removing some objects you created, and then type objects() to check that they are no longer listed
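
A short sketch of listing and removing objects. The names b, before and after are only for illustration.

```r
a <- 7
b <- "text"

before <- objects()     # names of the objects in the workspace
print("a" %in% before)  # a is listed

rm(a)                   # remove a

after <- objects()
print("a" %in% after)   # a is no longer listed
print("b" %in% after)   # b is still there
```

ls() is another name for objects() and gives the same listing.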

Page 11: Ann Arbor ASA  ‘Up and Running’ Series R

Example

To set up a vector named x, use the R command: x <- c(5,4,3,6)

This is an assignment statement using the function c(), which creates a vector by concatenating its arguments

Perform vector/matrix arithmetic: v <- 3*x - 5
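
The arithmetic is applied to each element of the vector in turn, as this small sketch shows. The extra vector w is introduced only for illustration.

```r
x <- c(5, 4, 3, 6)
v <- 3 * x - 5        # each element is multiplied by 3, then 5 is subtracted
print(v)              # 10 7 4 13

w <- c(1, 2, 3, 4)    # illustrative second vector of the same length
print(x + w)          # element-by-element sum: 6 6 6 10
print(length(x))      # number of elements: 4
```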

Page 13: Ann Arbor ASA  ‘Up and Running’ Series R

Using R Help

To get help on any specific function:

help(function.name)
?function.name

To see a list of all of the functions that come with the base R package: library(help = "base")

Two popular R resource websites: Rseek.org and nabble.com

For browser-based help, submit help.start()

This opens an HTML help system that is easy to navigate
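
A small sketch of the help tools in use. The choice of seq and the read. pattern is only an example; the other calls are shown as comments because they open a help page or a browser.

```r
# Look up the help page for a function; printing h (or typing ?seq) displays it
h <- help("seq")

# apropos() finds functions whose names match a pattern,
# which is handy when you only remember part of a name
found <- apropos("^read\\.")
print(found)

# args() shows the arguments a function accepts
print(args(rep))

# Other options, left as comments here:
# help.search("regression")   # search help pages by keyword
# help.start()                # open the HTML help system
```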

Page 15: Ann Arbor ASA  ‘Up and Running’ Series R

R Reference Card (created by Tom Short)

There are thousands of available functions in R, but this Reference Card provides a strong working knowledge

Let’s take a minute to look at the organization of the Reference Card and try out a few of the functions available!

Page 16: Ann Arbor ASA  ‘Up and Running’ Series R

Generating Sequences / Replicating Objects

Sequences: submit the following commands

seq(-5, 5, by=.2)
seq(length=51, from=-5, by=.2)

Both produce a sequence from -5 to 5 with a step of .2 between elements

Replications: submit the following commands

rep("x", times=5)
rep("x", each=5)

Both produce x replicated 5 times
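
The two rep() forms agree for a single value but differ once the input has more than one element. A brief sketch, with the vector c("a", "b") chosen only for illustration:

```r
# The two seq() calls describe the same sequence in different ways
s1 <- seq(-5, 5, by = 0.2)
s2 <- seq(from = -5, by = 0.2, length.out = 51)
print(length(s1))                  # 51
print(isTRUE(all.equal(s1, s2)))   # TRUE

# times= repeats the whole vector; each= repeats every element in turn
r_times <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
r_each  <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
print(r_times)
print(r_each)
```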

Page 19: Ann Arbor ASA  ‘Up and Running’ Series R

Working with Data

There are many data sets available for use in R

data() to see what’s available

We will work with the trees data set: data(trees)

This data set is now ready to use in R

The following are useful commands:

summary(trees) – summary of variables
dim(trees) – dimension of data set
names(trees) – see variable names
attach(trees) – makes the variables accessible by name
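
A quick sketch of these commands on the trees data. The with() call at the end is an optional alternative to attach() that is not part of the workshop slides; it evaluates one expression using the data set's variables without attaching anything.

```r
data(trees)

print(dim(trees))       # 31 rows, 3 columns
print(names(trees))     # "Girth" "Height" "Volume"
print(summary(trees))   # summary statistics for each variable

# Optional alternative to attach(): refer to Height by name inside one call
mean_height <- with(trees, mean(Height))
print(mean_height)
```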

Page 20: Ann Arbor ASA  ‘Up and Running’ Series R

Extracting Data

R has saved the data set trees as a data frame object

Check this by typing: class(trees)

R stores this data in matrix row/column format: data.frame[rows,columns]

trees[c(1:2),2] – first 2 rows and 2nd column
trees[3,c("Height", "Girth")] – row 3, referencing columns by name
trees[-c(10:20), "Height"] – skips rows 10-20 for variable Height
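
The indexing forms above can be checked with a short sketch. The result names (first_two, row_three, without) are made up for illustration.

```r
data(trees)

first_two <- trees[1:2, 2]                    # Height (2nd column) for rows 1-2
row_three <- trees[3, c("Height", "Girth")]   # a one-row data frame
without   <- trees[-(10:20), "Height"]        # Height with rows 10-20 dropped

print(first_two)
print(row_three)
print(length(without))   # 31 rows minus the 11 dropped rows leaves 20

# A single column can also be pulled out with $
print(identical(first_two, trees$Height[1:2]))
```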

Page 21: Ann Arbor ASA  ‘Up and Running’ Series R

Extracting Data

The subset() command is very useful for extracting data by a logical condition: the 1st argument is the data, and the 2nd argument is the logical condition that rows must satisfy

subset(trees, Height>80) – rows where tree height >80
subset(trees, Height<70 & Girth>10) – rows where tree height <70 AND tree girth >10
subset(trees, Height<60 | Girth>11) – rows where tree height <60 OR tree girth >11
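
A minimal sketch showing that subset() keeps exactly the rows that satisfy the condition. The result names (tall, both, either) are only illustrative.

```r
data(trees)

tall   <- subset(trees, Height > 80)
both   <- subset(trees, Height < 70 & Girth > 10)
either <- subset(trees, Height < 60 | Girth > 11)

print(nrow(tall))
print(nrow(both))
print(nrow(either))

# The same rows can be selected with a logical index in square brackets
tall_by_index <- trees[trees$Height > 80, ]
print(nrow(tall_by_index) == nrow(tall))
```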

Page 24: Ann Arbor ASA  ‘Up and Running’ Series R

Importing Data

The most common (and easiest) file to import is a text file, read with the read.table() command

R needs to be told where the file is located:

Set the working directory: setwd("C:\\Users\\akazanis\\Desktop") tells R where all your files are located

OR point to the working directory: File > Change dir… and choose the location of the files

OR include the full path to your file in the read.table() command

Page 25: Ann Arbor ASA  ‘Up and Running’ Series R

Using the read.table() command

read.table("C:\\Users\\akazanis\\Desktop\\furniture.txt",header=TRUE,sep="")

On Windows, it is important to use double backslashes \\ rather than a single backslash \ in the path (forward slashes / also work)

header=TRUE or header=FALSE tells R whether the first line of the file holds the column names
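
To try read.table() without the workshop files, the sketch below writes a tiny whitespace-delimited file to a temporary location and reads it back. The file contents and the column names id and value are invented for illustration; they are not the furniture data.

```r
# Write a small whitespace-delimited text file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("id value",
             "1 2.5",
             "2 4.0",
             "3 5.5"), path)

# header = TRUE: the first line holds the column names
# sep = "": fields are separated by any amount of white space
dat <- read.table(path, header = TRUE, sep = "")
print(dat)
str(dat)

unlink(path)   # delete the temporary file
```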

Page 26: Ann Arbor ASA  ‘Up and Running’ Series R

Using the read.table() command

Another way of specifying the file’s location is to set the working directory first and then read in the file

setwd("C:\\Users\\akazanis\\Desktop")
read.table("furniture.txt",header=TRUE,sep="")

OR point to the location: File > Change dir… and choose the file’s location

Then, read in the data file

read.table("furniture.txt",header=TRUE,sep="")

Page 27: Ann Arbor ASA  ‘Up and Running’ Series R

read.table(), read.csv(), Missing Values

It is also popular to import csv files, since Excel files are easily converted to csv files

read.csv() and read.table() are very similar, although they handle missing values differently

read.csv() automatically assigns NA to missing values

read.table() with sep="" stops with an error when a row has a missing value, so you must enter NA for missing values before reading the file into R

Page 28: Ann Arbor ASA  ‘Up and Running’ Series R

read.table(), read.csv(), Missing Values

Let’s remove a data entry from both furniture.txt and furniture.csv

From the first row, erase 100 from the Area column

Now read in the data from these two files using read.table() and read.csv()

You should see that you cannot read the data in using the read.table() command unless you input an entry for the missing value
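
The same behaviour can be reproduced without editing the workshop files. This sketch uses small temporary files with invented columns (id, size, price) and assumes the default settings of both functions.

```r
txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")

# In both files the size entry of the first data row is missing
writeLines(c("id size price", "1 20", "2 5 30"), txt_path)
writeLines(c("id,size,price", "1,,20", "2,5,30"), csv_path)

# read.csv() reads the empty numeric field as NA
from_csv <- read.csv(csv_path)
print(from_csv)

# read.table() with sep = "" sees a row with too few fields and stops;
# tryCatch() captures the error message instead of halting the script
result <- tryCatch(read.table(txt_path, header = TRUE, sep = ""),
                   error = function(e) conditionMessage(e))
print(result)

# Entering NA for the missing value lets read.table() load the file
writeLines(c("id size price", "1 NA 20", "2 5 30"), txt_path)
fixed <- read.table(txt_path, header = TRUE, sep = "")
print(fixed)

unlink(c(txt_path, csv_path))
```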

Page 29: Ann Arbor ASA  ‘Up and Running’ Series R

Other Options for Importing Data

When you download R, you should have automatically obtained the foreign package

Submit library(foreign) for many more options for importing data: read.xport(), read.spss(), read.dta(), read.mtp()

For more information on these options, submit help(read.xxxx)

Page 30: Ann Arbor ASA  ‘Up and Running’ Series R

Exporting Data

You can export data by using the write.table() command

write.table(trees,"treesDATA.txt",row.names=FALSE,sep=",")

trees specifies that we want the trees data set exported

"treesDATA.txt" is the name of the file to be exported. By default R will write the file to the working directory already specified, unless you give a location.

row.names=FALSE tells R that we do not wish to preserve the row names

sep="," makes the exported file comma delimited
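
A sketch of a round trip: export trees as a comma-delimited file, then read it back to confirm nothing was lost. It writes to a temporary file rather than the working directory so that it leaves nothing behind; the names out_path and back are illustrative.

```r
data(trees)

out_path <- tempfile(fileext = ".txt")
write.table(trees, out_path, row.names = FALSE, sep = ",")

# Peek at the first lines of the exported file
print(readLines(out_path, n = 3))

# Because the file is comma delimited, read.csv() can read it back
back <- read.csv(out_path)
print(dim(back))

unlink(out_path)   # delete the temporary file
```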

Page 33: Ann Arbor ASA  ‘Up and Running’ Series R

Example - Furniture Data Set

Assign a name to the furniture data set as we read it in, so that we can do some analysis

furn<-read.table("furniture.txt",sep="",header=TRUE)

To examine the data set:

dim(furn)
summary(furn)
names(furn)
attach(furn)

It is important to attach the data set first, because the commands that follow refer to its variables (Area, Cost, Type) by name

Page 34: Ann Arbor ASA  ‘Up and Running’ Series R

Graphs

R can produce both very simple and very complex graphs

Make a simple scatter plot of the Area and Cost variables from our furniture data set

plot(Area,Cost,main="Area vs Cost",xlab="Area",ylab="Cost")

This puts Area on the x-axis and Cost on the y-axis, and provides a title and axis labels

Page 35: Ann Arbor ASA  ‘Up and Running’ Series R

Graphs

Let’s look at the distribution of our variables using some different graphs in R

hist(Area) – histogram of Area
hist(Cost) – histogram of Cost
boxplot(Cost ~ Type) – boxplot of Cost by Type

We can make the boxplot much prettier

boxplot(Cost~Type,main="Boxplot of Cost by Type",col=c("orange","green","blue"),xlab="Type",ylab="Cost")

Page 36: Ann Arbor ASA  ‘Up and Running’ Series R

Graphs

We can also look at a scatter plot matrix of all variables in a data set by using the pairs() function

pairs(furn)

Or we can look at a correlation/covariance matrix of the numeric variables

cor(furn[,c(2:3)])
cov(furn[,c(2:3)])
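
The same commands can be tried on the built-in trees data, used here as a stand-in because the furniture file may not be at hand. The pdf(NULL) call is an assumption made so the sketch runs without a display; in an interactive session it can be left out to see the plot.

```r
data(trees)

pdf(NULL)              # draw to a null device (no file is created)
pairs(trees)           # scatter plot matrix of Girth, Height and Volume
invisible(dev.off())

r <- cor(trees)        # correlation matrix
v <- cov(trees)        # covariance matrix
print(round(r, 3))
print(round(v, 3))

# Rescaling the covariance matrix gives back the correlation matrix
print(isTRUE(all.equal(cov2cor(v), r)))
```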

Page 37: Ann Arbor ASA  ‘Up and Running’ Series R

Graphs + Statistics

Simple linear regression using the furniture data

m1<-lm(Cost ~ Area)
summary(m1)
coef(m1)
fitted.values(m1)
residuals(m1)

Plot the residuals against the fitted values

plot(fitted.values(m1), residuals(m1))
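
For a version that runs without the furniture file, this sketch fits the same kind of model to the built-in trees data (Volume on Girth, chosen only for illustration). It passes data=trees to lm() instead of relying on attach(), and draws to a null device so it also works without a display.

```r
data(trees)

fit <- lm(Volume ~ Girth, data = trees)
print(coef(fit))                  # intercept and slope
print(summary(fit)$r.squared)     # proportion of variance explained

# Fitted values plus residuals reproduce the observed response
check <- all.equal(unname(fitted.values(fit) + residuals(fit)), trees$Volume)
print(check)

pdf(NULL)                         # null device; omit in an interactive session
plot(fitted.values(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)            # reference line at zero
invisible(dev.off())
```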

Page 38: Ann Arbor ASA  ‘Up and Running’ Series R

Graphs + Statistics

Scatter plot of Area and Cost

plot(Area,Cost,main="Cost Regression Example",xlab="Area",ylab="Cost")
abline(lm(Cost~Area), col=3, lty=1)
lines(lowess(Cost~Area), col=3, lty=2)

Interactively add a legend

legend(locator(1),c("Linear","Lowess"),lty=c(1,2),col=3)

You can point to your graph and place the legend where you wish!
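
A non-interactive sketch of the same plot, using the built-in trees data as a stand-in. Because locator() needs a mouse click, the legend is placed with the keyword "topleft" instead; pdf(NULL) is an assumption so the code runs without a display.

```r
data(trees)

pdf(NULL)    # null device; omit in an interactive session to see the plot
plot(trees$Girth, trees$Volume, main = "Volume Regression Example",
     xlab = "Girth", ylab = "Volume")

fit <- lm(Volume ~ Girth, data = trees)
abline(fit, col = 3, lty = 1)                 # straight regression line

smooth <- lowess(trees$Girth, trees$Volume)   # list with components x and y
lines(smooth, col = 3, lty = 2)               # smoothed curve

legend("topleft", legend = c("Linear", "Lowess"), lty = c(1, 2), col = 3)
invisible(dev.off())

print(length(smooth$x))   # one smoothed point per observation
```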

Page 39: Ann Arbor ASA  ‘Up and Running’ Series R

Graphs + Statistics

Identify different points on the graph

identify(Area, Cost, row.names(furn))

This makes it easy to identify outliers

Use the locator() command to quantify differences between the regression fit and the lowess line

locator(2)

Example - Compare predicted values of Cost when Area is equal to 50

Page 40: Ann Arbor ASA  ‘Up and Running’ Series R

Multivariate Analysis

Multiple regression using both Area and Type as predictors and Cost as the response variable in the model

m2<-lm(Cost ~ Area + Type)

summary(m2) – summary of regression results, including coefficients

Page 41: Ann Arbor ASA  ‘Up and Running’ Series R

Multivariate Analysis

Now let’s see if our multiple regression model is significantly better than the simple model by using ANOVA

anova(m1, m2)

The ANOVA table compares the two nested regression models by testing the null hypothesis that the Type predictor did not need to be in the model.

Result - Since the p-value<.05, we have evidence to conclude that Type is an important predictor
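
A sketch of the same nested-model comparison on the built-in trees data, used as a stand-in for the furniture data. The model names (small, big) and the choice of Volume as the response are illustrative only.

```r
data(trees)

small <- lm(Volume ~ Girth, data = trees)            # simpler model
big   <- lm(Volume ~ Girth + Height, data = trees)   # adds one predictor

tab <- anova(small, big)
print(tab)

# The p-value for the added predictor sits in the second row of the table
p_value <- tab[["Pr(>F)"]][2]
print(p_value)

# With a single added predictor, the F statistic is the square of that
# predictor's t statistic in the larger model
t_height <- summary(big)$coefficients["Height", "t value"]
print(isTRUE(all.equal(tab$F[2], t_height^2)))
```

A small p-value suggests the larger model fits better than the simpler one; both models must be fitted to the same rows for the comparison to be valid.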

Page 43: Ann Arbor ASA  ‘Up and Running’ Series R

Practice Problem #1

a) Create a sequence that starts at 0 and goes to 5 with a step of 0.5

b) Replicate ‘a b c’ 3 times

c) Replicate ‘a’ 3 times, ‘b’ 3 times, ‘c’ 3 times in one command

Page 44: Ann Arbor ASA  ‘Up and Running’ Series R

Practice Problem #2

a) Make a histogram of the Girth variable from the trees data set. Include a title.

b) Make a boxplot of the Height variable from the trees data set. Color it blue and label your axes.

c) Make a scatter plot of Girth and Height.

Page 45: Ann Arbor ASA  ‘Up and Running’ Series R

Practice Problem #3

a) Create a simple linear model with Girth as the predictor and Height as the response. Extract the coefficients.

b) Now add Volume to the model. How can we tell if this model is preferred to the simpler model?

Page 46: Ann Arbor ASA  ‘Up and Running’ Series R

Solutions

##Problem 1

seq(0, 5, by=.5)
seq(length=11, from=0, by=.5)

rep("a b c", times=3)
rep(c("a","b","c"), each=3)

##Problem 2

data(trees)
attach(trees)
names(trees)
dim(trees)
hist(Girth,main="Histogram of Trees Girth")

boxplot(Height,main="Boxplot of Height of Trees",col=c("blue"),xlab="Trees",ylab="Height")

plot(Girth,Height,main="Girth vs Height of Trees",xlab="Girth",ylab="Height")

##Problem 3

m1<-lm(Height~Girth)
summary(m1)
coef(m1)
m2<-lm(Height~Girth+Volume)
summary(m2)

anova(m1,m2)

Page 50: Ann Arbor ASA  ‘Up and Running’ Series R

How to Obtain R

R Project Web Page - http://www.r-project.org

On the left-hand side of the screen, click on the CRAN link (under Download, Packages); CRAN is the Comprehensive R Archive Network

Choose one of the U.S. mirrors (http://cran.stat.ucla.edu/ is recommended)

Page 51: Ann Arbor ASA  ‘Up and Running’ Series R

CRAN web page: Download and Install R

Click on the folder that best describes your operating system. When using Windows, click on the “base” subdirectory. This will allow you to download the “base R” package.

Download R 2.12.1 for Windows. R is updated quite frequently, and the version number is always changing. Save the *.exe file on your computer.

Double-click on the *.exe file. A wizard will appear to guide you through the setup of the R software on your machine.

An R icon on your desktop/taskbar gives a shortcut to R. Double-click on this icon, and you are ready to go!

Page 52: Ann Arbor ASA  ‘Up and Running’ Series R

Additional Resources

Project home: http://www.r-project.org
Documentation: http://www.r-project.org/other-docs.html
Help forum: http://www.nabble.com/R-help-f13820.html
Journal: http://journal.r-project.org/
Graphical Gallery: http://addictedtor.free.fr/graphiques/
Graphical Manual: http://bm2.genes.nig.ac.jp/RGM2/
Seek: http://www.rseek.org/

Page 53: Ann Arbor ASA  ‘Up and Running’ Series R

Further Resources - Websites

SPSS Students Support: http://support.spss.com/newSupport/Student/Studentdefault.asp

UCLA: http://www.ats.ucla.edu/stat/spss/

Texas A&M: http://www.stat.tamu.edu/spss.php

Harvard/MIT: http://data.fas.harvard.edu/projects/SPSS_Tutorial/spsstut.shtml

Page 54: Ann Arbor ASA  ‘Up and Running’ Series R

Further Resources - Workshops

Ann Arbor Chapter of the American Statistical Association (Ann Arbor ASA)

http://sites.google.com/site/annarborasa/

MS Access
R
SAS
SAS’ JMP
SPSS
Stata
Statistics with Excel

Page 55: Ann Arbor ASA  ‘Up and Running’ Series R

Further Resources - Training

Center for Statistical Consultation and Research (CSCAR): http://www.umich.edu/~cscar/

Statistical Analysis with R
Intermediate SAS
Using ArcGIS
Applied Structural Equation Modeling
Introduction to NVivo
Applications of Hierarchical Linear Models
Introduction to Programming in Stata
Regression Analysis
Classification and Regression Trees
Using JMP
Introduction to SPSS
