An Introduction to R

Xiaozhi Zhou

WHAT IS R?

The R system for statistical computing is an environment for data manipulation, calculation and graphical display. R can be regarded as an implementation of the S language, which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks and also forms the basis of the S-Plus system.

Strengths
- free and open source, supported by a strong user community
- highly extensible and flexible
- implementation of modern statistical methods
- moderately flexible graphics with intelligent defaults

Weaknesses
- slow or impossible with large data sets
- non-standard programming paradigms

INSTALLING R

Main source: http://www.R-project.org

The base system is available for Windows platforms and Mac OS X.

Current version: 2.8.0 (2008-10-20)

DOCUMENTATION

- An Introduction to R: gives an introduction to the language and how to use R for doing statistical analysis and graphics.
- R Data Import/Export: describes the import and export facilities available either in R itself or via packages which are available from CRAN.
- R Installation and Administration: hints for installing R on special platforms.
- Writing R Extensions: covers how to create your own packages, write R help files, and the foreign language (C, C++, Fortran, ...) interfaces.
- The R Reference Index: contains all help files of the R standard and recommended packages in printable form.

INSTALLING PACKAGES

A huge amount of additional functionality is implemented in add-on packages, which can be downloaded from the official home page.

Packages can be installed directly
- from the R prompt:
R> install.packages("package name")
- from the menu: 'Packages' - 'Install package(s)'

The package functionality is available after attaching the package by:
R> library("package name")
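The install-then-attach steps above can be combined into one small helper. This is only a sketch: `use_package` is a hypothetical name, not an R function, and the call uses the `stats` package simply because it ships with R, so the example runs without network access.

```r
# Hypothetical helper (not part of R): install a package only if it is
# missing, then attach it so its functions can be called directly.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs network access and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(pkg)
}

use_package("stats")            # ships with R, so nothing is downloaded
"package:stats" %in% search()   # TRUE once the package is attached
```

Replace "stats" with the name of the add-on package you actually need.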

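HELP

Online help is available via the official home page.

Help systems in R:
R> help(help)                # or use ?help for short
R> help.start()              # HTML format
R> help("mean")              # or use ?mean for short
R> help(package="survey")    # overview of an installed package
R> help.search("keyword")    # search the help pages for a keyword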

R COMMANDS, CASE SENSITIVITY, RECALL COMMANDS

- Case sensitive: A and a are different symbols and would refer to different variables
- Elementary commands consist of either expressions or assignments
  - Expressions: after being evaluated and printed, the value is lost
  - Assignments: evaluate an expression and pass the value to a variable, but the result is not automatically printed
- Commands are separated either by a newline or by a ‘;’
- Elementary commands can be grouped together into one compound expression by ‘{’ and ‘}’
- Comments can be put anywhere starting with a #
- Recall previous commands: vertical arrow keys ↑ ↓
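A short illustration of the rules above; the variable names are arbitrary and chosen only for this sketch.

```r
A <- 1          # assignment: the value is stored, nothing is printed
a <- 2          # a different variable, because R is case sensitive
A + a           # expression: evaluated and printed, then the value is lost
x <- 3; y <- 4  # two commands on one line, separated by ';'
z <- {          # compound expression: its value is that of the last command
  s <- x^2 + y^2
  sqrt(s)
}
z               # 5
```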

DATA IMPORT AND EXPORT

Reading data from files

read.table() function: reads an entire data frame
R> survey <- read.table("survey.csv", header=TRUE, sep=",", row.names=1)
- header=TRUE: the first line of the file is a line of headings
- sep=",": columns are separated by a comma
- row.names=1: the first column should be interpreted as row names, not as a variable

scan() function: reads in vectors as a list
R> inp <- scan("input.dat", list("",0,0))
The second argument is a dummy list that sets the mode of each field: here one character field followed by two numeric fields.
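To try both functions without an existing data file, the sketch below first writes two small temporary files. The file contents and the column names (id, age, score) are made up for illustration.

```r
# Write a small comma-separated file to a temporary location.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,age,score",
             "a,23,4.5",
             "b,31,3.8",
             "c,27,4.1"), csv_file)

survey <- read.table(csv_file, header = TRUE, sep = ",", row.names = 1)
str(survey)        # a data frame with the two variables age and score
rownames(survey)   # "a" "b" "c": the first column became the row names

# scan() with a list template: one character field, two numeric fields
dat_file <- tempfile(fileext = ".dat")
writeLines(c("a 23 4.5", "b 31 3.8", "c 27 4.1"), dat_file)
inp <- scan(dat_file, list("", 0, 0))
inp[[2]]           # 23 31 27
```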

READING AN EXCEL FILE

read.csv() reads comma-separated (CSV) text files, so an Excel worksheet must first be saved in CSV format.

R> read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="", ...)
- file: the name of the file which the data are to be read from
- header: a logical value indicating whether the file contains the names of the variables as its first line
- sep: the field separator character. Values on each line of the file are separated by this character
- quote: the set of quoting characters
- dec: the character used in the file for decimal points
Check help(read.csv) for more information.

R> read.csv("F:/houses.data.csv", header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="")
  X Price Floor Area Rooms Age Cent.heat
1 1 52.00   111  830     5 6.2        No
2 2 54.75   128  710     5 7.5        No
3 3 57.50   101 1000     5 4.2        No
4 4 57.50   131  690     5 8.8        No
5 5 59.75    93  900     5 1.9       Yes
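The export direction is handled by write.csv(). The following sketch writes a small, made-up data frame to a temporary file and reads it back; the values only mimic the shape of the houses data and are not taken from the real file.

```r
# Made-up data frame for illustration
houses <- data.frame(Price = c(52.00, 54.75),
                     Rooms = c(5L, 5L),
                     Cent.heat = c("No", "Yes"))

out_file <- tempfile(fileext = ".csv")
write.csv(houses, out_file, row.names = FALSE)   # do not write row names

back <- read.csv(out_file, header = TRUE)
names(back)                             # "Price" "Rooms" "Cent.heat"
all.equal(houses$Price, back$Price)     # TRUE
```

If row.names = FALSE is left out, write.csv() stores the row names as an unnamed first column, which read.csv() then labels X. That may be how the X column in the houses output above arose, although the file itself is not available to confirm it.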

SIMPLE MANIPULATIONS; NUMBERS AND VECTORS

Vectors and assignment
R> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
R> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
R> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
R> y <- 2*x-3.5

Vector arithmetic
R> sum((x-mean(x))^2)/(length(x)-1)    # sample variance, equal to var(x)

Generating regular sequences
R> seq(-5, 5, by=.2) -> s3
R> s4 <- seq(length=51, from=-5, by=.2)
R> s5 <- rep(x, times=5)
R> s6 <- rep(x, each=5)

Logical vectors (values TRUE, FALSE and NA)
R> temp <- x>13
R> str(temp)
 logi [1:5] FALSE FALSE FALSE FALSE TRUE

Character vectors
R> labs <- paste(c("X", "Y"), 1:10, sep="")
R> str(labs)
 chr [1:10] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"

Missing values
R> z <- c(1:3, NA); is.na(z)
is.na(xx) is TRUE both for NA and NaN values
is.nan(xx) is only TRUE for NaNs
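The commands on this slide can be checked against each other. The sketch below confirms that the hand-written variance matches var(), that the two seq() calls give the same sequence, and how NA and NaN differ; the vector zz is made up for the illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# The vector-arithmetic formula is the sample variance
v <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(v, var(x))              # TRUE

# Two ways of writing the same regular sequence
s3 <- seq(-5, 5, by = 0.2)
s4 <- seq(length.out = 51, from = -5, by = 0.2)
all.equal(s3, s4)                 # TRUE

# times repeats the whole vector, each repeats every element
rep(1:2, times = 2)               # 1 2 1 2
rep(1:2, each = 2)                # 1 1 2 2

# NA versus NaN
zz <- c(1, NA, 0/0)
is.na(zz)                         # FALSE TRUE TRUE
is.nan(zz)                        # FALSE FALSE TRUE
mean(zz, na.rm = TRUE)            # 1: both NA and NaN are dropped
```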

SIMPLE SUMMARY STATISTICS

R> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
R> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.10    5.60    6.40    9.44   10.40   21.70
R> str(x)
 num [1:5] 10.4 5.6 3.1 6.4 21.7
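Each entry of the summary() output can also be computed on its own. A brief sketch using the same vector:

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

mean(x)                       # 9.44
median(x)                     # 6.4
range(x)                      # 3.1 21.7 (minimum and maximum)
quantile(x, c(0.25, 0.75))    # 5.6 and 10.4 (1st and 3rd quartile)
sd(x)                         # sample standard deviation
```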

LISTS AND DATA FRAMES

R> Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4, 7, 9))
R> str(Lst)
List of 4
 $ name       : chr "Fred"
 $ wife       : chr "Mary"
 $ no.children: num 3
 $ child.ages : num [1:3] 4 7 9

Components are accessed as name$component_name:
R> str(Lst$child.ages)
 num [1:3] 4 7 9

attach() and detach(): attach() makes the components visible by name, detach() undoes this
R> attach(Lst)
R> str(child.ages)
 num [1:3] 4 7 9
R> detach(Lst)
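The slide title also mentions data frames, which behave like lists whose components are columns of equal length. The sketch below shows the other ways of accessing list components and builds a small data frame; the data frame `kids` and its `child` labels are made up for illustration. with() is shown as an alternative to attach() that does not change the search path.

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))

Lst$name          # "Fred": access by component name
Lst[["wife"]]     # "Mary": the same, with the name as a string
Lst[[4]][2]       # 7: second element of the fourth component

# A data frame: made-up labels combined with the ages from the list
kids <- data.frame(child = c("A", "B", "C"), age = Lst$child.ages)
mean(kids$age)            # columns are accessed like list components
with(kids, mean(age))     # same result, without attach()/detach()
```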

CALCULATE SIMPLE CORRELATION

Arrange the data in a .CSV file and read the data file into memory:
R> weight <- read.csv("F:/height-weight.csv", header=TRUE, row.names=1)
R> height <- read.csv("F:/height-weight.csv", header=TRUE, row.names=2)
Each call uses one of the two columns as row names, so weight and height are one-column data frames holding the other column.

R> cor(height, weight, method = "pearson")
          Weight
Height 0.9936593
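Reading the file twice with different row.names settings works only while the column used as row names has no repeated values, because row names must be unique. A more common pattern is to read the file once and refer to the columns by name. The sketch below uses made-up height and weight values rather than the original file, and assumes the columns are called Height and Weight as in the output above.

```r
# Made-up heights (cm) and weights (kg), for illustration only
hw <- data.frame(Height = c(152, 160, 168, 175, 183),
                 Weight = c(50, 57, 63, 70, 79))

r <- cor(hw$Height, hw$Weight, method = "pearson")

# Pearson's r from its definition: covariance over the product of the SDs
r_manual <- cov(hw$Height, hw$Weight) / (sd(hw$Height) * sd(hw$Weight))
all.equal(r, r_manual)    # TRUE

cor(hw)                   # full correlation matrix of all numeric columns
```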

CONDUCT A LINEAR REGRESSION ANALYSIS

The basic function is: lm(formula, data)
# Annette Dobson (1990) "An Introduction to Generalized Linear Models"
R> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
R> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
R> group <- gl(2,10,20, labels=c("Ctl","Trt"))
gl(): generates factors by specifying the pattern of their levels (here 2 levels, each repeated 10 times, total length 20).
R> weight <- c(ctl, trt)
R> anova(lm.D9 <- lm(weight ~ group))
Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value Pr(>F)
group      1 0.6882  0.6882  1.4191  0.249
Residuals 18 8.7293  0.4850

CONDUCT A LINEAR REGRESSION ANALYSIS (continued)

R> summary(lm.D90 <- lm(weight ~ group - 1))

Call:
lm(formula = weight ~ group - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4938  0.0685  0.2462  1.3690

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
groupCtl   5.0320     0.2202   22.85 9.55e-15 ***
groupTrt   4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared: 0.9818, Adjusted R-squared: 0.9798
F-statistic: 485.1 on 2 and 18 DF,  p-value: < 2.2e-16
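The two model formulas differ only in how the same fit is parameterised. A small sketch with the data from the slides shows that `weight ~ group - 1` estimates one mean per group, while `weight ~ group` estimates the control mean and the treatment difference. The object names fit and fit0 are chosen for this sketch.

```r
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

fit  <- lm(weight ~ group)       # intercept plus treatment contrast
fit0 <- lm(weight ~ group - 1)   # no intercept: one coefficient per group

coef(fit0)                       # groupCtl 5.032, groupTrt 4.661
tapply(weight, group, mean)      # the same two group means
coef(fit)                        # (Intercept) 5.032, groupTrt -0.371
```

Because the no-intercept model measures R-squared relative to zero rather than to the overall mean, the value 0.9818 reported above is not comparable with the R-squared of the model that includes an intercept.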

CONDUCT A LOGISTIC REGRESSION

Logistic regression is a special case of a Generalized Linear Model (GLM), fitted with the function glm() and family=binomial(). The example here, taken from the glm() help page, fits a different GLM, a Poisson log-linear model for counts (family=poisson()); the calls have the same form.

R> counts <- c(18,17,15,20,10,20,25,13,12)
R> outcome <- gl(3,1,9)
 Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3
R> treatment <- gl(3,3)
 Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3
R> print(d.AD <- data.frame(treatment, outcome, counts))
R> glm.D93 <- glm(counts ~ outcome + treatment, family=poisson())
R> anova(glm.D93)
Analysis of Deviance Table

Model: poisson, link: log

Response: counts

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                          8    10.5814
outcome    2   5.4523         6     5.1291
treatment  2   0.0000         4     5.1291

CONDUCT A LOGISTIC REGRESSION (continued)

R> summary(glm.D93)

Call:
glm(formula = counts ~ outcome + treatment, family = poisson())

Deviance Residuals:
       1         2         3         4         5         6         7         8         9
-0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  -0.96656

Coefficients:
              Estimate Std. Error   z value Pr(>|z|)
(Intercept)  3.045e+00  1.709e-01    17.815   <2e-16 ***
outcome2    -4.543e-01  2.022e-01    -2.247   0.0246 *
outcome3    -2.930e-01  1.927e-01    -1.520   0.1285
treatment2   8.717e-16  2.000e-01  4.36e-15   1.0000
treatment3   4.557e-16  2.000e-01  2.28e-15   1.0000
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 10.5814  on 8  degrees of freedom
Residual deviance:  5.1291  on 4  degrees of freedom
AIC: 56.761

Number of Fisher Scoring iterations: 4
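The model fitted on these slides uses family=poisson(), so it is a Poisson regression for counts. For comparison, here is a minimal sketch of an actual logistic regression. It is not from the original slides: it uses the mtcars data set that ships with R and models the 0/1 variable am (transmission type) from wt (car weight in 1000 lbs), a choice made purely for illustration.

```r
# Logistic regression: binary response, binomial family, logit link
fit <- glm(am ~ wt, data = mtcars, family = binomial())
coef(fit)   # the wt coefficient is negative: heavier cars are less
            # likely to have a manual transmission (am = 1)

# Predicted probability of am = 1 for a hypothetical car with wt = 3
p <- predict(fit, newdata = data.frame(wt = 3), type = "response")

# The same probability by hand: inverse logit of the linear predictor
eta <- coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * 3
all.equal(unname(p), plogis(eta))   # TRUE
```

summary(fit) and anova(fit) work as for the Poisson model above; only the family argument and the type of response differ.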

SIMPLE GRAPHS - HISTOGRAM

The basic function is hist(variable)
R> test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)
R> hist(test.data)
R> hist(log(test.data))
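hist() also returns the bins it computed, which helps when checking what a histogram shows. The sketch below uses plot = FALSE so that nothing is drawn; the explicit breaks at 1, 2, ..., 6 are an arbitrary choice for this illustration.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8,
               1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

h <- hist(test.data, plot = FALSE)   # compute the bins without drawing
h$breaks                             # bin boundaries chosen by R
h$counts                             # number of observations per bin
sum(h$counts) == length(test.data)   # TRUE: every value falls in a bin

# Bins of width 1: (1,2], (2,3], (3,4], (4,5], (5,6]
h1 <- hist(test.data, breaks = seq(1, 6, by = 1), plot = FALSE)
h1$counts                            # 2 7 2 3 2
```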

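SIMPLE GRAPHS - SCATTER PLOT

The basic function is plot(y ~ x): the variable on the left of ~ is drawn on the vertical axis, the variable on the right on the horizontal axis.
R> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
R> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
R> plot(ctl ~ trt)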

SIMPLE GRAPHS - BOXPLOT

The basic function is boxplot(variables)
R> counts <- c(18,17,15,20,10,20,25,13,12)
R> outcome <- gl(3,1,9)
R> boxplot(counts~outcome)
R> test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)
R> boxplot(test.data, xlab="Single sample", ylab="Value axis", col="lightblue")
R> title(main="Plot with outlier", font.main= 4)
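Like hist(), boxplot() returns the numbers behind the picture. A short sketch with the grouped data from this slide, again with plot = FALSE so that no graphics device is needed:

```r
counts  <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome <- gl(3, 1, 9)

b <- boxplot(counts ~ outcome, plot = FALSE)
b$names                            # "1" "2" "3": one box per factor level
b$stats                            # 5 rows per box: lower whisker, lower
                                   # hinge, median, upper hinge, upper whisker
b$stats[3, ]                       # the medians: 20 13 15
tapply(counts, outcome, median)    # the same medians, computed directly
```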

REFERENCES

B. S. Everitt and T. Hothorn (2006). A handbook of statistical analyses using R. Chapman & Hall/CRC: FL.

W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. http://www.R-project.org
